largefile

MCP server that lets AI assistants navigate, search, and edit large codebases, logs, and data files without loading them entirely into context. Uses Tree-sitter for semantic code analysis and tiered memory strategies to handle files of any size.

Overview

Language: Python
Repo: peteretelej/largefile
Install: uvx largefile-mcp or pip install largefile
Status: stable (Production/Stable)

Architecture

The server exposes 7 MCP tools through FastMCP (src/server.py), with tool logic separated into src/tools.py and domain modules below it.

Modules:

server.py - FastMCP tool definitions with Pydantic-annotated parameters. Thin wrappers that delegate to tools.py.
tools.py - Tool orchestration layer. Coordinates file access, search, editing, and directory operations. Converts domain errors to MCP ToolError responses with actionable suggestions.
file_access.py - Tiered file I/O. Picks a strategy based on file size: direct in-memory read for files under 50MB, memory-mapped access (mmap) for files from 50MB up to 500MB, and streaming with a configurable chunk size for anything larger. Also handles encoding detection via chardet, binary file detection (magic bytes + BOM awareness), backup creation/rotation, and atomic writes using temp-file-then-rename.
search_engine.py - Three search modes: exact string matching, fuzzy matching via rapidfuzz (Levenshtein distance with configurable threshold), and Python regex. Results are combined with exact matches taking priority. Also provides find_similar_patterns() for actionable error messages when edits fail.
editor.py - Batch search/replace engine. Finds matches (exact or fuzzy), detects overlaps between edits, and applies changes bottom-to-top to preserve positions. Preview mode generates unified diffs without writing. All writes create timestamped backups with automatic rotation.
tree_parser.py - Tree-sitter integration for Python, JavaScript, TypeScript, Rust, Go, and Java. Provides three features: semantic outline generation (functions, classes, structs), semantic context extraction (walk-up-the-tree breadcrumbs like “class Foo > method bar()”), and semantic chunk reading (extract complete AST nodes around a line number).
config.py - Environment-variable-based configuration with sensible defaults. All thresholds, limits, and feature flags are tunable without code changes.
data_models.py - Dataclass definitions for FileOverview, SearchResult, EditResult, Change, BackupInfo, etc.
exceptions.py - Domain exception hierarchy (FileAccessError, SearchError, EditError, TreeSitterError) that tools.py catches and converts to user-friendly ToolError messages.
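
As an illustration of the config.py approach, here is a minimal sketch of environment-variable configuration with defaults. The variable names (EXAMPLE_*), the Settings fields, and the fall-back-on-invalid behavior are all hypothetical and chosen for this example; only the default figures (50MB, 500MB, 8KB chunks, 10 backups) come from this document.

```python
import os
from dataclasses import dataclass
from typing import Mapping

MB = 1024 * 1024


def env_int(env: Mapping[str, str], name: str, default: int) -> int:
    """Return env[name] as an int, or default if it is unset or not a number."""
    raw = env.get(name)
    if raw is None:
        return default
    try:
        return int(raw)
    except ValueError:
        return default


@dataclass(frozen=True)
class Settings:
    memory_threshold: int      # bytes; below this, read the whole file into RAM
    mmap_threshold: int        # bytes; below this, use mmap
    streaming_chunk_size: int  # bytes per chunk when streaming
    max_backups: int           # backups kept per file


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    # Hypothetical variable names; the real server may use different ones.
    return Settings(
        memory_threshold=env_int(env, "EXAMPLE_MEMORY_THRESHOLD_MB", 50) * MB,
        mmap_threshold=env_int(env, "EXAMPLE_MMAP_THRESHOLD_MB", 500) * MB,
        streaming_chunk_size=env_int(env, "EXAMPLE_CHUNK_SIZE", 8 * 1024),
        max_backups=env_int(env, "EXAMPLE_MAX_BACKUPS", 10),
    )
```

Passing the environment in as a mapping keeps the loader easy to test: load_settings({}) yields the defaults without touching the real process environment.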

Data flow for a typical edit:

server.py receives an edit_content call with search/replace pairs.
tools.py validates the changes and delegates to editor.batch_edit_content().
The editor reads the file via file_access.read_file_content() (strategy auto-selected).
For each change, the editor finds the match position (exact first, then fuzzy via rapidfuzz).
The editor checks for overlaps across all matched positions.
In preview mode, it returns a unified diff. In apply mode, it creates a backup and then performs an atomic write.
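
The last step of the flow can be sketched with the standard library alone. This is an illustrative approximation, not the project's code: preview_diff and atomic_write are hypothetical names, and the real editor may differ in how it handles encodings and backups.

```python
import difflib
import os
import tempfile


def preview_diff(original: str, updated: str, path: str) -> str:
    """Return a unified diff of the pending change without touching the file."""
    diff = difflib.unified_diff(
        original.splitlines(keepends=True),
        updated.splitlines(keepends=True),
        fromfile=path,
        tofile=path,
    )
    return "".join(diff)


def atomic_write(path: str, content: str, encoding: str = "utf-8") -> None:
    """Write to a temp file in the target's directory, then rename it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        # newline="" preserves the content's own line endings on every platform.
        with os.fdopen(fd, "w", encoding=encoding, newline="") as tmp:
            tmp.write(content)
            tmp.flush()
            os.fsync(tmp.fileno())
        # Same-directory rename; readers see either the old or the new file.
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

Creating the temp file in the same directory as the target matters: a rename is only atomic when source and destination live on the same filesystem.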

Key Design Decisions

Tiered memory strategy. Files under 50MB load into RAM, 50-500MB use mmap, and anything larger streams in 8KB chunks. The thresholds are environment-configurable. This avoids OOM on multi-GB log files while keeping small file access fast.
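
A minimal sketch of the tier selection, assuming the default thresholds quoted above. The names (Strategy, choose_strategy, read_chunks) are hypothetical, and the handling of files that sit exactly on a boundary is an assumption made for this example.

```python
import io
from enum import Enum
from typing import BinaryIO, Iterator

MB = 1024 * 1024
MEMORY_THRESHOLD = 50 * MB     # below this: read fully into RAM
MMAP_THRESHOLD = 500 * MB      # below this: memory-map the file
STREAM_CHUNK_SIZE = 8 * 1024   # above the mmap tier: stream in 8KB chunks


class Strategy(Enum):
    MEMORY = "memory"
    MMAP = "mmap"
    STREAMING = "streaming"


def choose_strategy(
    size_bytes: int,
    memory_threshold: int = MEMORY_THRESHOLD,
    mmap_threshold: int = MMAP_THRESHOLD,
) -> Strategy:
    """Pick an access strategy from the file size alone."""
    if size_bytes < memory_threshold:
        return Strategy.MEMORY
    if size_bytes < mmap_threshold:
        return Strategy.MMAP
    return Strategy.STREAMING


def read_chunks(stream: BinaryIO, chunk_size: int = STREAM_CHUNK_SIZE) -> Iterator[bytes]:
    """Yield fixed-size chunks so memory use stays bounded by chunk_size."""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            return
        yield chunk


# Usage: a 20,000-byte stream is read as two full chunks and a remainder.
demo = io.BytesIO(b"x" * 20000)
print([len(c) for c in read_chunks(demo)])  # [8192, 8192, 3616]
```

In practice the size would come from os.path.getsize(path) before the file is opened, so the choice costs a single stat call.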

Fuzzy-first editing. The editor enables fuzzy matching by default because LLMs frequently introduce whitespace or formatting variations when quoting code. An exact match is tried first because it is cheap; rapidfuzz runs only if the exact match fails. This addresses one of the most common causes of failed LLM edits.
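
The exact-then-fuzzy order can be illustrated as follows. This sketch substitutes the standard library's difflib.SequenceMatcher for rapidfuzz so that it runs without extra dependencies; the function name find_match, the line-by-line comparison, and the 0.8 threshold are assumptions for illustration, not the project's actual matching logic.

```python
import difflib
from typing import List, Optional, Tuple


def find_match(
    lines: List[str], search: str, threshold: float = 0.8
) -> Optional[Tuple[int, float, str]]:
    """Return (line_index, similarity, kind) or None if nothing is close enough."""
    # Pass 1: exact substring match, which needs no similarity scoring.
    for index, line in enumerate(lines):
        if search in line:
            return index, 1.0, "exact"

    # Pass 2: score every line and keep the most similar one.
    needle = search.strip()
    best_index, best_score = -1, 0.0
    for index, line in enumerate(lines):
        score = difflib.SequenceMatcher(None, needle, line.strip()).ratio()
        if score > best_score:
            best_index, best_score = index, score

    if best_score >= threshold:
        return best_index, best_score, "fuzzy"
    return None


source = ["def add(a, b):", "    return a+b"]
print(find_match(source, "return a+b"))    # exact hit on line index 1
print(find_match(source, "return a + b"))  # fuzzy hit on line index 1
```

The second call shows the case the design targets: the model quoted the line with extra spaces around the operator, so the exact pass fails and the similarity pass recovers the intended line.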

Batch edits with overlap detection. Multiple search/replace changes in one call go through a three-phase pipeline: locate all matches, detect overlapping spans, then apply bottom-to-top. If any change fails, the entire batch is rejected (all-or-nothing semantics), preventing partial corruption.
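
The three phases can be sketched in a few lines. For brevity this example locates matches with exact str.find only (no fuzzy fallback), and the names apply_batch and BatchEditError are hypothetical.

```python
from typing import List, Tuple


class BatchEditError(ValueError):
    """Raised when a batch cannot be applied as a whole."""


def apply_batch(text: str, edits: List[Tuple[str, str]]) -> str:
    """Apply (search, replace) pairs all-or-nothing and return the new text."""
    # Phase 1: locate every match before changing anything.
    spans = []
    for search, replace in edits:
        if not search:
            raise BatchEditError("empty search string")
        start = text.find(search)
        if start == -1:
            raise BatchEditError(f"no match for {search!r}")
        spans.append((start, start + len(search), replace))

    # Phase 2: sort by start offset and reject overlapping spans.
    spans.sort(key=lambda span: span[0])
    for (_, prev_end, _), (next_start, _, _) in zip(spans, spans[1:]):
        if next_start < prev_end:
            raise BatchEditError("overlapping edits")

    # Phase 3: apply from the end of the text backwards, so earlier
    # offsets are still valid after each replacement.
    for start, end, replace in reversed(spans):
        text = text[:start] + replace + text[end:]
    return text


print(apply_batch("alpha beta gamma", [("alpha", "A"), ("gamma", "GGGGGG")]))
# A beta GGGGGG
```

Because every failure is raised before phase 3 begins, a rejected batch leaves the original text untouched.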

Tree-sitter for navigation, not editing. AST parsing is used for read-only operations (outlines, semantic context, chunk extraction) but the editor uses text-based fuzzy matching. This keeps editing robust against syntax errors and unsupported languages while still providing rich navigation for supported ones.

Backups with path-hashed names. Backup filenames include an 8-character MD5 hash of the file's absolute path, which prevents collisions when editing files with the same name in different directories. Automatic rotation keeps the last 10 backups per file.
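
A sketch of how such names and rotation might be computed. The filename layout (name.hash.timestamp.bak), the choice of the first 8 hex digits of the digest, and the function names are assumptions for illustration; only the 8-character MD5 of the absolute path and the keep-10 rule come from the description above.

```python
import hashlib
from datetime import datetime
from pathlib import Path
from typing import List


def backup_name(file_path: str, when: datetime) -> str:
    """Build a backup filename that is unique per absolute path and timestamp."""
    abs_path = str(Path(file_path).resolve())
    # MD5 is used only as a short fingerprint here, not for security.
    digest = hashlib.md5(abs_path.encode("utf-8"), usedforsecurity=False).hexdigest()[:8]
    stamp = when.strftime("%Y%m%d_%H%M%S")
    return f"{Path(file_path).name}.{digest}.{stamp}.bak"


def backups_to_prune(names: List[str], keep: int = 10) -> List[str]:
    """Given all backup names for one file, return the oldest ones beyond `keep`."""
    ordered = sorted(names)  # timestamps in this layout sort chronologically
    excess = len(ordered) - keep
    return ordered[:excess] if excess > 0 else []


now = datetime(2024, 1, 1, 12, 0, 0)
print(backup_name("/project_a/config.py", now))
print(backup_name("/project_b/config.py", now))  # same basename, different hash
```

The two printed names share the basename and timestamp but differ in the hash segment, which is what keeps backups from different directories apart in a shared backup folder.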

Development

# Install dependencies
uv sync

# Run tests
uv run pytest

# Type checking
uv run mypy src/

# Lint
uv run ruff check src/

# Run the server directly
uv run largefile-mcp