directory-indexer
MCP server for semantic search across codebases. Indexes directories by chunking files, generating vector embeddings, and storing them in Qdrant for similarity search. Supports Ollama (local) and OpenAI embedding backends with workspace-scoped queries.
Overview
- Language: TypeScript
- Repo: peteretelej/directory-indexer
- Install: npm install -g directory-indexer
- Status: active (v0.3.0)
Architecture
The server uses the @modelcontextprotocol/sdk directly (not FastMCP) and exposes 6 MCP tools through handler functions in src/mcp-handlers.ts. Storage is split between SQLite (file metadata, chunk boundaries) and Qdrant (vector embeddings for search).
Modules:
- mcp.ts - MCP server setup, tool registration, and stdio transport. Defines tool schemas with Zod-style JSON Schema.
- mcp-handlers.ts - Tool handler functions (handleIndexTool, handleSearchTool, handleSimilarFilesTool, handleGetContentTool, handleGetChunkTool, handleDeleteIndexTool). Includes a per-directory mutex to serialize concurrent indexing of the same path, plus a cached set of indexed directories for path validation.
- indexing.ts - Directory scanning and file processing pipeline. scanDirectory() walks the filesystem respecting .gitignore rules, file size limits, and ignore patterns. indexDirectories() orchestrates the full pipeline: scan, diff against existing records (modtime fast-path, then content hash), chunk changed files, generate embeddings, upsert into both stores, and clean up deleted files.
- embedding.ts - EmbeddingProvider interface with three implementations: OllamaEmbeddingProvider (local, 768 dimensions, nomic-embed-text default), OpenAIEmbeddingProvider (1536 dimensions), and MockEmbeddingProvider (deterministic sin-hash vectors for testing). Provider selection via the EMBEDDING_PROVIDER env var.
- storage.ts - Dual storage layer. SQLiteStorage uses better-sqlite3 for file metadata (path, size, hash, modification time, chunks JSON, parent directories) with WAL mode and busy timeout. QdrantClient wraps the Qdrant HTTP REST API for vector operations (upsert, search, delete by filter, collection management). Also provides getIndexStatus() with workspace health analysis and Qdrant consistency checks.
- search.ts - Search operations. searchContent() generates a query embedding, searches Qdrant with an optional workspace filter (via the parentDirectories payload field), groups results by file path with average scores, and enriches them with file sizes from SQLite. findSimilarFiles() uses the first chunk of an existing file as the query vector. getFileContent() and getChunkContent() serve indexed content.
- config.ts - Zod-validated configuration from environment variables. Defines storage paths, Qdrant endpoint, embedding provider/model/endpoint, indexing parameters (chunk size 512, overlap 50, max file size 10MB), and workspace definitions parsed from WORKSPACE_* env vars (comma-separated or JSON array paths).
- gitignore.ts - Loads and applies .gitignore rules during scanning.
- path-validation.ts - Validates that requested file paths fall within indexed directories (security boundary).
- prerequisites.ts - Pre-flight checks (Qdrant health, Ollama availability) with actionable error messages.
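The workspace parsing described for config.ts (WORKSPACE_* env vars holding either comma-separated or JSON array paths) could look roughly like the sketch below. The function names parseWorkspaces and parsePaths are hypothetical, and lowercasing the variable suffix to get the workspace name is an assumption; the actual code in config.ts may differ.

```typescript
// Hypothetical sketch; the real parsing lives in config.ts and may differ.
type Env = Record<string, string | undefined>;

// Accepts either a JSON array ('["/a","/b"]') or a comma-separated list ("/a, /b").
function parsePaths(raw: string): string[] {
  const trimmed = raw.trim();
  if (trimmed.startsWith("[")) {
    try {
      const parsed: unknown = JSON.parse(trimmed);
      if (Array.isArray(parsed)) {
        return parsed.filter(
          (p): p is string => typeof p === "string" && p.length > 0,
        );
      }
    } catch {
      // Not valid JSON: fall through to comma-separated parsing.
    }
  }
  return trimmed
    .split(",")
    .map((p) => p.trim())
    .filter((p) => p.length > 0);
}

// Collects every WORKSPACE_<NAME> variable into a name -> paths map.
function parseWorkspaces(env: Env): Record<string, string[]> {
  const workspaces: Record<string, string[]> = {};
  for (const [key, raw] of Object.entries(env)) {
    if (!key.startsWith("WORKSPACE_") || !raw) continue;
    // Assumption: the suffix, lowercased, is the workspace name.
    const name = key.slice("WORKSPACE_".length).toLowerCase();
    if (!name) continue;
    workspaces[name] = parsePaths(raw);
  }
  return workspaces;
}
```

In practice the function would be called with process.env; taking the environment as a parameter keeps the sketch easy to test.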
Data flow for indexing:
- handleIndexTool acquires the per-directory mutex, then calls indexDirectories()
- scanDirectory() walks the filesystem, respects gitignore, filters by size/type
- For each file: check SQLite for existing record, compare modtime then hash
- Changed files: read content, normalize line endings, split into overlapping chunks (512 chars, 50 overlap)
- For each chunk: generate embedding via configured provider, upsert point into Qdrant with file path and parent directories in payload
- Upsert file metadata into SQLite
- Clean up: find files in SQLite that no longer exist on disk, remove from both stores
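The chunking step in the flow above (normalize line endings, then split into 512-character chunks with 50 characters of overlap) can be sketched as follows. This is a minimal illustration, not the project's code: the Chunk shape, the sequential chunk IDs, and the chunkText name are assumptions.

```typescript
// Hypothetical chunk shape; the real record stored in SQLite may differ.
interface Chunk {
  id: number;    // assumed: sequential index within the file
  start: number; // inclusive character offset
  end: number;   // exclusive character offset
  text: string;
}

function chunkText(content: string, chunkSize = 512, overlap = 50): Chunk[] {
  if (overlap >= chunkSize) {
    throw new Error("overlap must be smaller than chunkSize");
  }
  // Normalize CRLF and lone CR to LF so offsets are stable across platforms.
  const text = content.replace(/\r\n?/g, "\n");
  const step = chunkSize - overlap; // each chunk starts 462 chars after the previous one
  const chunks: Chunk[] = [];
  for (let start = 0; start < text.length; start += step) {
    const end = Math.min(start + chunkSize, text.length);
    chunks.push({ id: chunks.length, start, end, text: text.slice(start, end) });
    if (end === text.length) break; // the last chunk reached the end of the file
  }
  return chunks;
}
```

With the defaults, consecutive chunks share their boundary region: the last 50 characters of one chunk are the first 50 of the next, so text that straddles a boundary still appears whole in at least one chunk.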
Tools:
- index - Index one or more directories. Incremental by default (skips unchanged files).
- search - Semantic search across indexed content. Supports a workspace filter and a configurable result limit.
- similar_files - Find files semantically similar to a given file path.
- get_content - Read file content, optionally by chunk range.
- get_chunk - Read a specific chunk by file path and chunk ID.
- delete_index - Remove all indexed data for a directory (both SQLite and Qdrant).
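The search tool returns file-level results: per the search.ts description, chunk hits are grouped by file path and their scores averaged. A rough sketch of that grouping, with hypothetical type and function names (ChunkHit, FileResult, groupByFile) and an assumed sort by descending average score:

```typescript
// Hypothetical shapes; the project's actual result types are not shown here.
interface ChunkHit {
  filePath: string;
  chunkId: number;
  score: number;
}

interface FileResult {
  filePath: string;
  score: number;      // average score over the file's matching chunks
  chunkIds: number[];
}

function groupByFile(hits: ChunkHit[], limit = 10): FileResult[] {
  const byFile = new Map<string, ChunkHit[]>();
  for (const hit of hits) {
    const group = byFile.get(hit.filePath) ?? [];
    group.push(hit);
    byFile.set(hit.filePath, group);
  }
  return [...byFile.entries()]
    .map(([filePath, group]) => ({
      filePath,
      score: group.reduce((sum, h) => sum + h.score, 0) / group.length,
      chunkIds: group.map((h) => h.chunkId),
    }))
    .sort((a, b) => b.score - a.score) // assumption: best average first
    .slice(0, limit);
}
```

Averaging means a file with one strong chunk can outrank a file with one strong and one weak chunk; whether that matches the project's ranking beyond "average scores" is not stated here.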
Key Design Decisions
Dual storage (SQLite + Qdrant). SQLite stores structured metadata (file paths, hashes, chunk boundaries, modification times) while Qdrant handles vector similarity search. This avoids embedding a vector database into SQLite and lets each store do what it does best. The trade-off is requiring a running Qdrant instance.
Incremental indexing with modtime fast-path. On re-index, the system first checks file modification time against the stored record. Only if the modtime is newer does it compute a content hash. This makes re-indexing large directories nearly instant when few files have changed.
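A minimal sketch of this decision, assuming a stored record with a modification time and a hash. The names (StoredRecord, classifyFile) and the four-way result are illustrative, and the hash function is injected because the algorithm the project uses is not stated here.

```typescript
// Hypothetical record shape, loosely based on the metadata listed for SQLite.
interface StoredRecord {
  modifiedTime: number; // e.g. milliseconds since epoch
  hash: string;
}

type Decision = "new" | "unchanged" | "touched" | "changed";

function classifyFile(
  stored: StoredRecord | undefined,
  currentModifiedTime: number,
  computeHash: () => string, // called lazily, only when the fast path fails
): Decision {
  if (!stored) return "new";
  // Fast path: modtime is not newer, so skip reading and hashing the file.
  if (currentModifiedTime <= stored.modifiedTime) return "unchanged";
  // Modtime is newer: the content may or may not differ, so compare hashes.
  return computeHash() === stored.hash ? "touched" : "changed";
}
```

Only "new" and "changed" files would need re-chunking and re-embedding; a "touched" file (newer modtime, same hash) would only need its stored modtime refreshed.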
Workspace filtering via parent directories. Each Qdrant point stores the file’s parent directory chain in its payload. Workspace-scoped searches use Qdrant’s native must filter on parentDirectories, so filtering happens at the vector DB level rather than post-search. Workspaces are defined via WORKSPACE_* environment variables.
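As an illustration, a search request body with such a filter might be built as below. The filter shape (must, key, match.any) follows Qdrant's REST filtering syntax, but the function name, the SearchRequest type, and the choice of match.any over one condition per path are assumptions; the project's exact request is not shown in this document.

```typescript
// Sketch of a Qdrant points-search request body with an optional workspace filter.
interface SearchRequest {
  vector: number[];
  limit: number;
  with_payload: boolean;
  filter?: {
    must: Array<{ key: string; match: { any: string[] } }>;
  };
}

function buildSearchRequest(
  vector: number[],
  limit: number,
  workspacePaths?: string[], // hypothetical: paths resolved from a WORKSPACE_* definition
): SearchRequest {
  const request: SearchRequest = { vector, limit, with_payload: true };
  if (workspacePaths && workspacePaths.length > 0) {
    // parentDirectories is an array payload field; a point matches when any
    // of its parent directories equals any of the workspace paths.
    request.filter = {
      must: [{ key: "parentDirectories", match: { any: workspacePaths } }],
    };
  }
  return request;
}
```

Because every ancestor directory of a file is stored on the point, a file at any depth under a workspace path matches with a plain equality check rather than a prefix comparison.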
Per-directory mutex. Concurrent index calls targeting the same directory are serialized via a promise-based mutex keyed by normalized path. This prevents duplicate work and data corruption from parallel indexing of overlapping directories.
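One common way to build a promise-based keyed mutex is to chain each caller onto the previous caller's promise for the same key. The sketch below shows that pattern; withDirectoryLock and normalizeKey are hypothetical names, and the normalization here (stripping trailing separators) is a simplification of whatever the project does.

```typescript
// Tail of the promise chain for each directory currently being indexed.
const locks = new Map<string, Promise<void>>();

// Simplified normalization: strip trailing slashes so "/a" and "/a/" share a lock.
function normalizeKey(dir: string): string {
  return dir.replace(/[\\/]+$/, "") || "/";
}

async function withDirectoryLock<T>(dir: string, task: () => Promise<T>): Promise<T> {
  const key = normalizeKey(dir);
  const previous = locks.get(key) ?? Promise.resolve();

  let release!: () => void;
  const current = new Promise<void>((resolve) => {
    release = resolve;
  });
  // The new tail resolves only after every earlier holder and this one finish.
  const tail = previous.then(() => current);
  locks.set(key, tail);

  await previous; // wait for earlier callers on the same directory
  try {
    return await task();
  } finally {
    release();
    // Drop the entry if nobody queued behind us, so the map does not grow.
    if (locks.get(key) === tail) locks.delete(key);
  }
}
```

Calls for different directories proceed in parallel because they use different keys. Note that this sketch only serializes identical keys; it does nothing special for a directory nested inside another one.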
Path validation security boundary. get_content and get_chunk validate that requested file paths fall within indexed directories. This prevents the tool from being used to read arbitrary files outside the indexed scope.
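A containment check of this kind is usually done on resolved paths rather than raw strings, so that ".." segments and sibling directories with a shared prefix are rejected. A minimal sketch, assuming a hypothetical isWithinIndexedDirectories helper (the real logic is in path-validation.ts and may differ, for example in how it treats symlinks, which this sketch does not resolve):

```typescript
import * as path from "node:path";

// Returns true only if `requested` resolves to an indexed directory or a path inside one.
function isWithinIndexedDirectories(requested: string, indexedDirs: string[]): boolean {
  const target = path.resolve(requested); // collapses "." and ".." segments
  return indexedDirs.some((dir) => {
    const relative = path.relative(path.resolve(dir), target);
    if (relative === "") return true; // the indexed directory itself
    // A path that escapes the root starts with ".." (or is absolute on another drive).
    const escapes = relative === ".." || relative.startsWith(".." + path.sep);
    return !escapes && !path.isAbsolute(relative);
  });
}
```

Using path.relative instead of a startsWith check on the raw strings avoids treating /srv/project-other as being inside /srv/project.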
Development
# Install dependencies
npm install
# Build
npm run build
# Run tests
npm test
# Watch mode
npm run dev
# Run the CLI
npm run cli -- --help
Requires Qdrant running locally (docker run -p 6333:6333 qdrant/qdrant) and optionally Ollama for local embeddings.