directory-indexer

MCP server for semantic search across codebases. Indexes directories by chunking files, generating vector embeddings, and storing them in Qdrant for similarity search. Supports Ollama (local) and OpenAI embedding backends with workspace-scoped queries.

Overview

Language: TypeScript
Repo: peteretelej/directory-indexer
Install: npm install -g directory-indexer
Status: active (v0.3.0)

Architecture

The server uses the @modelcontextprotocol/sdk directly (not FastMCP) and exposes 6 MCP tools through handler functions in src/mcp-handlers.ts. Storage is split between SQLite (file metadata, chunk boundaries) and Qdrant (vector embeddings for search).

Modules:

mcp.ts - MCP server setup, tool registration, and stdio transport. Defines tool schemas with Zod-style JSON Schema.
mcp-handlers.ts - Tool handler functions (handleIndexTool, handleSearchTool, handleSimilarFilesTool, handleGetContentTool, handleGetChunkTool, handleDeleteIndexTool). Includes per-directory mutex to serialize concurrent indexing of the same path, plus a cached set of indexed directories for path validation.
indexing.ts - Directory scanning and file processing pipeline. scanDirectory() walks the filesystem respecting .gitignore rules, file size limits, and ignore patterns. indexDirectories() orchestrates the full pipeline: scan, diff against existing records (modtime fast-path, then content hash), chunk changed files, generate embeddings, upsert into both stores, and clean up deleted files.
embedding.ts - EmbeddingProvider interface with three implementations: OllamaEmbeddingProvider (local, 768 dimensions, nomic-embed-text default), OpenAIEmbeddingProvider (1536 dimensions), and MockEmbeddingProvider (deterministic sin-hash vectors for testing). Provider selection via EMBEDDING_PROVIDER env var.
storage.ts - Dual storage layer. SQLiteStorage uses better-sqlite3 for file metadata (path, size, hash, modification time, chunks JSON, parent directories) with WAL mode and busy timeout. QdrantClient wraps the Qdrant HTTP REST API for vector operations (upsert, search, delete by filter, collection management). Also provides getIndexStatus() with workspace health analysis and Qdrant consistency checks.
search.ts - Search operations. searchContent() generates a query embedding, searches Qdrant with optional workspace filter (via parentDirectories payload field), groups results by file path with average scores, and enriches with file sizes from SQLite. findSimilarFiles() uses the first chunk of an existing file as the query vector. getFileContent() and getChunkContent() serve indexed content.
config.ts - Zod-validated configuration from environment variables. Defines storage paths, Qdrant endpoint, embedding provider/model/endpoint, indexing parameters (chunk size 512, overlap 50, max file size 10MB), and workspace definitions parsed from WORKSPACE_* env vars (comma-separated or JSON array paths).
gitignore.ts - Loads and applies .gitignore rules during scanning.
path-validation.ts - Validates that requested file paths fall within indexed directories (security boundary).
prerequisites.ts - Pre-flight checks (Qdrant health, Ollama availability) with actionable error messages.
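
As an illustration of the workspace parsing that config.ts is described as doing, here is a minimal sketch. The function name parseWorkspaces, the lower-casing of workspace names, and the decision to skip malformed JSON are assumptions for this example, not the project's actual code.

```typescript
// Hypothetical helper: collects WORKSPACE_* variables into name -> paths.
// Accepts either a comma-separated list or a JSON array of strings.
function parseWorkspaces(
  env: Record<string, string | undefined>
): Record<string, string[]> {
  const prefix = "WORKSPACE_";
  const workspaces: Record<string, string[]> = {};

  for (const name of Object.keys(env)) {
    const raw = env[name];
    if (!name.startsWith(prefix) || !raw) continue;

    // Assumption: workspace names are the lower-cased suffix of the variable.
    const key = name.slice(prefix.length).toLowerCase();
    const value = raw.trim();
    let paths: string[] = [];

    if (value.startsWith("[")) {
      try {
        const parsed: unknown = JSON.parse(value);
        if (Array.isArray(parsed)) {
          paths = parsed.filter((p): p is string => typeof p === "string");
        }
      } catch {
        continue; // assumption: malformed JSON is skipped rather than fatal
      }
    } else {
      paths = value.split(",");
    }

    paths = paths.map((p) => p.trim()).filter((p) => p.length > 0);
    if (key.length > 0 && paths.length > 0) workspaces[key] = paths;
  }
  return workspaces;
}
```

Calling parseWorkspaces(process.env) would then give a lookup table that a workspace-scoped search could consult.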

Data flow for indexing:

handleIndexTool acquires per-directory mutex, calls indexDirectories()
scanDirectory() walks the filesystem, respects gitignore, filters by size/type
For each file: check SQLite for existing record, compare modtime then hash
Changed files: read content, normalize line endings, split into overlapping chunks (512 chars, 50 overlap)
For each chunk: generate embedding via configured provider, upsert point into Qdrant with file path and parent directories in payload
Upsert file metadata into SQLite
Clean up: find files in SQLite that no longer exist on disk, remove from both stores
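
The chunking step in the flow above can be sketched as a sliding window. This is a minimal illustration assuming character-based chunks with the documented defaults (512 and 50). The Chunk shape and the chunkText name are hypothetical and may not match indexing.ts.

```typescript
// Hypothetical chunk shape; the project keeps chunk boundaries in SQLite.
interface Chunk {
  id: number;
  start: number; // inclusive character offset
  end: number; // exclusive character offset
  text: string;
}

// Split text into fixed-size windows. Each window starts
// (chunkSize - overlap) characters after the previous one, so
// consecutive chunks share `overlap` characters.
function chunkText(content: string, chunkSize = 512, overlap = 50): Chunk[] {
  if (overlap >= chunkSize) {
    throw new Error("overlap must be smaller than chunkSize");
  }
  const text = content.replace(/\r\n?/g, "\n"); // normalize line endings
  const step = chunkSize - overlap;
  const chunks: Chunk[] = [];

  for (let start = 0; start < text.length; start += step) {
    const end = Math.min(start + chunkSize, text.length);
    chunks.push({ id: chunks.length, start, end, text: text.slice(start, end) });
    if (end === text.length) break; // the last window reached the end
  }
  return chunks;
}
```

Each chunk's text would then be sent to the configured embedding provider, and the resulting vector upserted into Qdrant.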

Tools:

index - Index one or more directories. Incremental by default (skips unchanged files).
search - Semantic search across indexed content. Supports workspace filter and configurable result limit.
similar_files - Find files semantically similar to a given file path.
get_content - Read file content, optionally by chunk range.
get_chunk - Read a specific chunk by file path and chunk ID.
delete_index - Remove all indexed data for a directory (both SQLite and Qdrant).
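
The search tool returns results grouped per file rather than per chunk. A rough sketch of that grouping, assuming each Qdrant hit carries a file path, chunk ID, and score, might look like the following. The ChunkHit and FileResult shapes and the groupByFile name are made up for this example.

```typescript
// Hypothetical shapes for a raw vector hit and a grouped result.
interface ChunkHit {
  filePath: string;
  chunkId: number;
  score: number;
}

interface FileResult {
  filePath: string;
  averageScore: number;
  chunkIds: number[];
}

// Group chunk-level hits by file, average their scores, and
// return the best files first.
function groupByFile(hits: ChunkHit[], limit = 10): FileResult[] {
  const byFile = new Map<string, ChunkHit[]>();
  for (const hit of hits) {
    const list = byFile.get(hit.filePath) ?? [];
    list.push(hit);
    byFile.set(hit.filePath, list);
  }

  const results: FileResult[] = [];
  byFile.forEach((fileHits, filePath) => {
    const total = fileHits.reduce((sum, h) => sum + h.score, 0);
    results.push({
      filePath,
      averageScore: total / fileHits.length,
      chunkIds: fileHits.map((h) => h.chunkId),
    });
  });

  return results
    .sort((a, b) => b.averageScore - a.averageScore)
    .slice(0, limit);
}
```

Averaging keeps a file with one strong chunk and several weak ones from outranking a file that matches consistently. Other aggregations, such as taking the maximum score, would rank differently.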

Key Design Decisions

Dual storage (SQLite + Qdrant). SQLite stores structured metadata (file paths, hashes, chunk boundaries, modification times) while Qdrant handles vector similarity search. This avoids embedding a vector database into SQLite and lets each store do what it does best. The trade-off is requiring a running Qdrant instance.

Incremental indexing with modtime fast-path. On re-index, the system first checks file modification time against the stored record. Only if the modtime is newer does it compute a content hash, and only files whose hash has changed are re-chunked and re-embedded. Unchanged files are skipped without being read, so re-indexing a large directory stays fast when few files have changed.
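
A minimal sketch of that decision, assuming a stored record with a modification time and a content hash. The StoredRecord shape, the classifyFile name, and the choice of SHA-256 are illustrative assumptions, since this page does not name the hash algorithm.

```typescript
import { createHash } from "node:crypto";

// Hypothetical subset of the SQLite file record.
interface StoredRecord {
  modifiedTime: number; // milliseconds since epoch
  hash: string; // hex digest of the content at last index
}

type Decision = "unchanged" | "touched" | "changed";

// Decide whether a file needs re-chunking. The content is read
// (via readContent) only when the modtime check cannot rule it out.
function classifyFile(
  stored: StoredRecord | undefined,
  mtimeMs: number,
  readContent: () => string
): Decision {
  if (!stored) return "changed"; // never indexed before
  if (mtimeMs <= stored.modifiedTime) return "unchanged"; // fast path

  // Slow path: modtime is newer, so compare content hashes.
  const hash = createHash("sha256").update(readContent()).digest("hex");
  return hash === stored.hash ? "touched" : "changed";
}
```

A "touched" file (newer modtime, same content) only needs its stored modtime refreshed. Only "changed" files go through chunking and embedding.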

Workspace filtering via parent directories. Each Qdrant point stores the file’s parent directory chain in its payload. Workspace-scoped searches use Qdrant’s native must filter on parentDirectories, so filtering happens at the vector DB level rather than post-search. Workspaces are defined via WORKSPACE_* environment variables.
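
To make the filter shape concrete, here is a sketch of how a parent directory chain and a workspace-scoped search body could be built. The parentDirectories key comes from the description above, and the filter follows Qdrant's documented must / match syntax. The helper names, and the use of match.any to cover workspaces with several paths, are assumptions about how the request might be built rather than a copy of search.ts.

```typescript
import { posix } from "node:path";

// Every ancestor directory of a file, nearest first, ending at the root.
function parentChain(filePath: string): string[] {
  const chain: string[] = [];
  let dir = posix.dirname(filePath);
  while (true) {
    chain.push(dir);
    const parent = posix.dirname(dir);
    if (parent === dir) break; // reached "/" or "."
    dir = parent;
  }
  return chain;
}

interface SearchRequest {
  vector: number[];
  limit: number;
  with_payload: boolean;
  filter?: { must: Array<{ key: string; match: { any: string[] } }> };
}

// Body for Qdrant's points search endpoint. When workspace paths are
// given, a point matches if any of its stored parentDirectories equals
// one of them, so the filtering happens inside Qdrant.
function buildSearchRequest(
  vector: number[],
  limit: number,
  workspacePaths: string[] = []
): SearchRequest {
  const request: SearchRequest = { vector, limit, with_payload: true };
  if (workspacePaths.length > 0) {
    request.filter = {
      must: [{ key: "parentDirectories", match: { any: workspacePaths } }],
    };
  }
  return request;
}
```

Storing the whole ancestor chain means a workspace rooted at any level of the tree matches with a plain equality check, with no prefix matching needed at query time.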

Per-directory mutex. Concurrent index calls targeting the same directory are serialized via a promise-based mutex keyed by normalized path. This prevents duplicate work and data corruption from parallel indexing of the same directory.
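
One common way to build such a mutex is to chain promises per key. The sketch below is a generic version of that pattern, not the code in mcp-handlers.ts. The withDirectoryLock name and the use of path.resolve for normalization are assumptions.

```typescript
import { resolve } from "node:path";

// Tail of the pending work queue for each normalized directory.
const locks = new Map<string, Promise<void>>();

// Run `task` only after every earlier task for the same directory has
// finished. Tasks for different directories run concurrently.
async function withDirectoryLock<T>(
  directory: string,
  task: () => Promise<T>
): Promise<T> {
  const key = resolve(directory); // assumption: resolve() as normalization
  const previous = locks.get(key) ?? Promise.resolve();

  let release!: () => void;
  const current = new Promise<void>((done) => {
    release = done;
  });
  const tail = previous.then(() => current);
  locks.set(key, tail);

  await previous; // wait for earlier holders of this key
  try {
    return await task();
  } finally {
    release(); // let the next waiter proceed
    if (locks.get(key) === tail) locks.delete(key); // no one else queued
  }
}
```

Because the key is the resolved path, "/data/repo" and "/data/repo/" share a lock, while a parent and a child directory do not.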

Path validation security boundary. get_content and get_chunk validate that requested file paths fall within indexed directories. This prevents the tool from being used to read arbitrary files outside the indexed scope.
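
The containment check can be sketched as follows. This is a simplified illustration, not the contents of path-validation.ts. The isWithinIndexed name is hypothetical, and the sketch does not resolve symlinks, which a hardened version would likely need to handle.

```typescript
import { resolve, sep } from "node:path";

// True when requestedPath, after resolving "." and ".." segments,
// is one of the indexed directories or lies somewhere beneath one.
function isWithinIndexed(requestedPath: string, indexedDirs: string[]): boolean {
  const target = resolve(requestedPath);
  for (const dir of indexedDirs) {
    const root = resolve(dir);
    // Append a separator so "/data/project" does not match
    // "/data/project-other".
    const prefix = root.endsWith(sep) ? root : root + sep;
    if (target === root || target.startsWith(prefix)) return true;
  }
  return false;
}
```

Resolving before comparing is what defeats "../" traversal: "/data/project/../secret.txt" resolves to "/data/secret.txt", which is outside the indexed root.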

Development

# Install dependencies
npm install

# Build
npm run build

# Run tests
npm test

# Watch mode
npm run dev

# Run the CLI
npm run cli -- --help

Requires Qdrant running locally (docker run -p 6333:6333 qdrant/qdrant) and optionally Ollama for local embeddings.