largefile
MCP server that lets AI assistants navigate, search, and edit large codebases, logs, and data files without loading them entirely into context. Uses Tree-sitter for semantic code analysis and tiered memory strategies to handle files of any size.
Overview
- Language: Python
- Repo: peteretelej/largefile
- Install: uvx largefile-mcp or pip install largefile
- Status: stable (Production/Stable)
Architecture
The server exposes 7 MCP tools through FastMCP (src/server.py), with tool logic separated into src/tools.py and domain modules below it.
Modules:
- server.py - FastMCP tool definitions with Pydantic-annotated parameters. Thin wrappers that delegate to tools.py.
- tools.py - Tool orchestration layer. Coordinates file access, search, editing, and directory operations. Converts domain errors to MCP ToolError responses with actionable suggestions.
- file_access.py - Tiered file I/O. Picks a strategy based on file size: direct memory read (<50MB), memory-mapped (mmap) for medium files (<500MB), or streaming with configurable chunk size for anything larger. Also handles encoding detection via chardet, binary file detection (magic bytes + BOM awareness), backup creation/rotation, and atomic writes using temp-file-then-rename.
- search_engine.py - Three search modes: exact string matching, fuzzy matching via rapidfuzz (Levenshtein distance with configurable threshold), and Python regex. Results are combined with exact matches taking priority. Also provides find_similar_patterns() for actionable error messages when edits fail.
- editor.py - Batch search/replace engine. Finds matches (exact or fuzzy), detects overlaps between edits, and applies changes bottom-to-top to preserve positions. Preview mode generates unified diffs without writing. All writes create timestamped backups with automatic rotation.
- tree_parser.py - Tree-sitter integration for Python, JavaScript, TypeScript, Rust, Go, and Java. Provides three features: semantic outline generation (functions, classes, structs), semantic context extraction (walk-up-the-tree breadcrumbs like “class Foo > method bar()”), and semantic chunk reading (extract complete AST nodes around a line number).
- config.py - Environment-variable-based configuration with sensible defaults. All thresholds, limits, and feature flags are tunable without code changes.
- data_models.py - Dataclass definitions for FileOverview, SearchResult, EditResult, Change, BackupInfo, etc.
- exceptions.py - Domain exception hierarchy (FileAccessError, SearchError, EditError, TreeSitterError) that tools.py catches and converts to user-friendly ToolError messages.
Data flow for a typical edit:
- server.py receives edit_content call with search/replace pairs
- tools.py validates changes, delegates to editor.batch_edit_content()
- Editor reads file via file_access.read_file_content() (strategy auto-selected)
- For each change: find match position (exact first, then fuzzy via rapidfuzz)
- Overlap detection across all matched positions
- If preview mode: return unified diff. If apply mode: backup, then atomic write.
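The last step of the flow above (preview versus apply) can be sketched with the standard library alone. This is an illustrative sketch, not the project's code: the function names preview_diff and atomic_write are hypothetical, and the backup step is omitted for brevity.

```python
import difflib
import os
import tempfile


def preview_diff(original: str, updated: str, path: str) -> str:
    """Return a unified diff of the proposed change without touching disk."""
    return "".join(
        difflib.unified_diff(
            original.splitlines(keepends=True),
            updated.splitlines(keepends=True),
            fromfile=f"a/{path}",
            tofile=f"b/{path}",
        )
    )


def atomic_write(path: str, content: str, encoding: str = "utf-8") -> None:
    """Write to a temp file in the target directory, then rename over the target."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must live on the same filesystem for the rename to be atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding=encoding) as handle:
            handle.write(content)
            handle.flush()
            os.fsync(handle.fileno())
        # os.replace swaps the new file in as a single step.
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

Because the rename happens only after the new content is fully flushed, a reader of the target path sees either the old file or the new one, never a half-written mix.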
Key Design Decisions
Tiered memory strategy. Files under 50MB load into RAM, 50-500MB use mmap, and anything larger streams in 8KB chunks. The thresholds are environment-configurable. This avoids OOM on multi-GB log files while keeping small file access fast.
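A minimal sketch of how size-based strategy selection might look, using the thresholds stated above as defaults. The names choose_strategy and count_newlines are hypothetical and exist only for illustration; the real module reads its thresholds from environment variables whose names are not given here, so this sketch takes them as plain parameters.

```python
import mmap
import os

# Defaults taken from the thresholds described above.
MEMORY_THRESHOLD = 50 * 1024 * 1024   # below this: read into RAM
MMAP_THRESHOLD = 500 * 1024 * 1024    # below this: memory-map
CHUNK_SIZE = 8 * 1024                 # streaming chunk size


def choose_strategy(
    size_bytes: int,
    memory_threshold: int = MEMORY_THRESHOLD,
    mmap_threshold: int = MMAP_THRESHOLD,
) -> str:
    """Pick an access strategy from the file size alone."""
    if size_bytes < memory_threshold:
        return "memory"
    if size_bytes < mmap_threshold:
        return "mmap"
    return "streaming"


def count_newlines(
    path: str,
    memory_threshold: int = MEMORY_THRESHOLD,
    mmap_threshold: int = MMAP_THRESHOLD,
    chunk_size: int = CHUNK_SIZE,
) -> int:
    """Count newline bytes using whichever strategy the file size selects."""
    strategy = choose_strategy(os.path.getsize(path), memory_threshold, mmap_threshold)
    with open(path, "rb") as handle:
        if strategy == "memory":
            return handle.read().count(b"\n")
        if strategy == "mmap":
            # The OS pages data in on demand; nothing is copied up front.
            with mmap.mmap(handle.fileno(), 0, access=mmap.ACCESS_READ) as view:
                count = 0
                pos = view.find(b"\n")
                while pos != -1:
                    count += 1
                    pos = view.find(b"\n", pos + 1)
                return count
        # Streaming: only one chunk is held in memory at a time.
        return sum(chunk.count(b"\n") for chunk in iter(lambda: handle.read(chunk_size), b""))
```

All three branches return the same answer for the same file; only the peak memory use differs.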
Fuzzy-first editing. The editor defaults to fuzzy matching because LLMs frequently produce whitespace or formatting variations when quoting code. Exact match is tried first because it is cheap, and rapidfuzz runs only if no exact match is found. This eliminates the most common class of LLM editing failures.
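The exact-then-fuzzy fallback can be illustrated as follows. To keep the sketch runnable with the standard library only, difflib.SequenceMatcher stands in for rapidfuzz, so the scores differ from what the project computes. The function name find_match, the line-by-line comparison, and the 0.8 threshold are all assumptions made for this example.

```python
import difflib
from typing import List, Optional, Tuple


def find_match(
    lines: List[str], search: str, threshold: float = 0.8
) -> Optional[Tuple[int, float]]:
    """Return (line_index, score) for the best match, or None if nothing qualifies."""
    # Pass 1: exact substring match, which needs no scoring at all.
    for index, line in enumerate(lines):
        if search in line:
            return index, 1.0

    # Pass 2: fuzzy match on whitespace-normalized text.
    target = " ".join(search.split())
    best_index: Optional[int] = None
    best_score = 0.0
    for index, line in enumerate(lines):
        candidate = " ".join(line.split())
        score = difflib.SequenceMatcher(None, target, candidate).ratio()
        if score > best_score:
            best_index, best_score = index, score

    if best_index is not None and best_score >= threshold:
        return best_index, best_score
    return None


source = ["def add(a, b):", "    return a + b"]
print(find_match(source, "return a + b"))  # exact hit on the second line
print(find_match(source, "return a+b"))    # spacing differs, fuzzy pass finds it
```

A search string that differs from the file only in spacing fails the exact pass but still clears the threshold in the fuzzy pass, which is the failure mode described above.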
Batch edits with overlap detection. Multiple search/replace changes in one call go through a three-phase pipeline: locate all matches, detect overlapping spans, then apply bottom-to-top. If any change fails, the entire batch is rejected (all-or-nothing semantics), preventing partial corruption.
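The three phases can be sketched in a few lines. This is a simplified illustration using exact matching only; SimpleChange and apply_batch are hypothetical names, not the project's Change dataclass or editor API.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class SimpleChange:
    """Hypothetical stand-in for one search/replace pair."""
    search: str
    replace: str


def apply_batch(text: str, changes: List[SimpleChange]) -> str:
    """Apply every change or raise without modifying anything."""
    # Phase 1: locate every match before touching the text.
    spans: List[Tuple[int, int, str]] = []
    for change in changes:
        start = text.find(change.search)
        if start == -1:
            raise ValueError(f"no match for {change.search!r}; batch rejected")
        spans.append((start, start + len(change.search), change.replace))

    # Phase 2: sort by start offset and reject overlapping spans.
    spans.sort(key=lambda span: span[0])
    for (_, prev_end, _), (next_start, _, _) in zip(spans, spans[1:]):
        if next_start < prev_end:
            raise ValueError("overlapping edits; batch rejected")

    # Phase 3: apply from the end of the text backwards, so earlier
    # offsets stay valid while later regions change length.
    for start, end, replacement in reversed(spans):
        text = text[:start] + replacement + text[end:]
    return text
```

Since both validation phases finish before the first replacement, a failing batch leaves the input untouched, which is what gives the all-or-nothing behavior.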
Tree-sitter for navigation, not editing. AST parsing is used for read-only operations (outlines, semantic context, chunk extraction) but the editor uses text-based fuzzy matching. This keeps editing robust against syntax errors and unsupported languages while still providing rich navigation for supported ones.
Backups with path-hashed names. Backup filenames include an 8-char MD5 of the absolute path, preventing collisions when editing files with the same name in different directories. Automatic rotation keeps the last 10 backups per file.
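A sketch of path-hashed backup naming and rotation under stated assumptions. The filename layout (name, 8-character hash, timestamp, .bak) and the function names backup_name and rotate_backups are illustrative guesses; only the 8-character MD5 of the absolute path and the keep-last-10 policy come from the description above.

```python
import hashlib
import os
from datetime import datetime
from typing import List, Optional


def _path_digest(path: str) -> str:
    """First 8 hex characters of the MD5 of the absolute path."""
    absolute = os.path.abspath(path)
    return hashlib.md5(absolute.encode("utf-8")).hexdigest()[:8]


def backup_name(path: str, now: Optional[datetime] = None) -> str:
    """Build a backup filename; the exact layout here is an assumption."""
    stamp = (now or datetime.now()).strftime("%Y%m%d_%H%M%S")
    return f"{os.path.basename(path)}.{_path_digest(path)}.{stamp}.bak"


def rotate_backups(backup_dir: str, path: str, keep: int = 10) -> List[str]:
    """Delete all but the newest `keep` backups of `path`; return removed names."""
    prefix = f"{os.path.basename(path)}.{_path_digest(path)}."
    # The timestamp format sorts lexicographically in chronological order.
    backups = sorted(name for name in os.listdir(backup_dir) if name.startswith(prefix))
    stale = backups[:-keep] if keep > 0 else backups
    for name in stale:
        os.remove(os.path.join(backup_dir, name))
    return stale
```

Two files named config.py in different directories hash to different prefixes, so rotating one never deletes backups of the other.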
Development
# Install dependencies
uv sync
# Run tests
uv run pytest
# Type checking
uv run mypy src/
# Lint
uv run ruff check src/
# Run the server directly
uv run largefile-mcp