diffchunk
MCP server for navigating large unified diff files. Parses diffs into manageable chunks that fit within LLM context windows, with filtering for trivial changes and generated files.
Overview
- Language: Python
- Repo: peteretelej/diffchunk
- Install: uvx diffchunk-mcp or pip install diffchunk
- Status: beta
Architecture
The server exposes 5 MCP tools through FastMCP (src/server.py) plus one MCP resource for session overview. All tools auto-load diffs on first access, so explicit loading is only needed for non-default settings.
Modules:
- server.py - FastMCP tool and resource definitions. Each tool delegates to DiffChunkTools methods and serializes results as JSON strings.
- tools.py - DiffChunkTools class that manages diff sessions. Sessions are keyed by canonical path + content SHA-256 hash, so re-loading a modified file creates a fresh session while unchanged files reuse cached chunks. Provides load_diff, list_chunks, get_chunk, find_chunks_for_files, and get_file_diff.
- parser.py - DiffParser class for unified diff format. Regex-based parsing that splits on diff --git headers and yields (files, content) tuples. Includes trivial change detection (whitespace-only diffs), generated file detection (lock files, minified JS, build artifacts, node_modules/), and glob-based include/exclude filtering via fnmatch.
- chunker.py - DiffChunker class that groups parsed file diffs into chunks. Small files accumulate into a chunk until reaching the line limit. Large files (exceeding max_chunk_lines) get split at hunk boundaries (@@ headers), targeting 80% of the limit to avoid edge-case overflows. A strict enforcement fallback splits mid-hunk if no boundary is found.
- models.py - Data models: DiffSession (manages chunks and file-to-chunk index), DiffChunk (content + metadata), DiffStats, ChunkInfo. DiffSession maintains a file_to_chunks reverse index for fast glob-based lookups.
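To make the parser's role concrete, here is a minimal sketch of splitting a unified diff on diff --git headers and yielding (files, content) tuples. The function name split_file_diffs and the DIFF_HEADER pattern are assumptions for illustration; the actual DiffParser regexes and handling of edge cases (quoted paths, binary files) may differ.

```python
import re
from typing import Iterator, List, Tuple

# Hypothetical header pattern; the real DiffParser regex may differ.
DIFF_HEADER = re.compile(r"^diff --git a/(.+?) b/(.+)$")


def split_file_diffs(diff_text: str) -> Iterator[Tuple[List[str], str]]:
    """Yield (files, content) for each per-file section of a unified diff."""
    files: List[str] = []
    buffer: List[str] = []
    for line in diff_text.splitlines():
        match = DIFF_HEADER.match(line)
        if match:
            # A new header closes the previous file section, if any.
            if buffer:
                yield files, "\n".join(buffer)
            old, new = match.group(1), match.group(2)
            # Renames carry two distinct paths; plain edits carry one.
            files = [new] if old == new else [old, new]
            buffer = [line]
        elif buffer:
            # Lines before the first header are ignored in this sketch.
            buffer.append(line)
    if buffer:
        yield files, "\n".join(buffer)
```

Because it is a generator, a caller can filter or chunk each file section as it arrives instead of holding the whole parsed diff in memory.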
Data flow:
- First tool call triggers auto-load: _ensure_loaded() checks session cache by path+hash key
- DiffParser.parse_diff_file() yields per-file diffs with encoding detection via chardet
- DiffChunker.chunk_diff() filters (trivial, generated, include/exclude patterns), then groups into chunks
- DiffSession stores chunks and builds the file-to-chunk reverse index
- Subsequent tool calls (get_chunk, find_chunks_for_files) read from the cached session
Tools:
- load_diff - Parse with custom settings (chunk size, filtering). Only needed when defaults are wrong.
- list_chunks - Overview of all chunks: file names, line counts, summaries. Entry point for systematic review.
- get_chunk - Retrieve actual diff content for a numbered chunk, with optional context header.
- find_chunks_for_files - Glob-based search (e.g., *.py, src/*) returns matching chunk numbers.
- get_file_diff - Extract complete diff for a single file across all chunks. Supports exact path, case-insensitive match, and glob patterns.
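As an illustration of the glob-based lookup behind find_chunks_for_files, the sketch below matches a pattern against a file-to-chunk reverse index. The function name find_chunks and the plain-dict index are assumptions; the real DiffSession may structure this differently. It uses fnmatchcase from the standard fnmatch module so matching is case-sensitive on every platform.

```python
from fnmatch import fnmatchcase
from typing import Dict, List


def find_chunks(file_to_chunks: Dict[str, List[int]], pattern: str) -> List[int]:
    """Return sorted chunk numbers whose files match a glob pattern."""
    matches = set()
    for path, chunk_numbers in file_to_chunks.items():
        # In fnmatch-style globs, "*" also matches "/" characters,
        # so "src/*" covers nested paths as well.
        if fnmatchcase(path, pattern):
            matches.update(chunk_numbers)
    return sorted(matches)
```

With an index such as {"src/a.py": [1], "src/util/b.py": [1, 2]}, the pattern src/* matches both paths, since fnmatch does not treat path separators specially.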
Key Design Decisions
Hunk-boundary chunking. Splitting large file diffs at @@ hunk headers keeps each chunk semantically coherent. The 80% target leaves room for file headers that get prepended to every sub-chunk, preventing silent overflows.
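The idea can be sketched as follows, assuming a file diff is already available as a list of lines. The function name split_at_hunks and the target_ratio parameter are hypothetical, and this sketch omits the mid-hunk fallback the chunker applies when a single hunk exceeds the limit.

```python
from typing import List


def split_at_hunks(
    lines: List[str], max_chunk_lines: int, target_ratio: float = 0.8
) -> List[List[str]]:
    """Split one file's diff at @@ hunk headers, repeating the file header."""
    target = max(1, int(max_chunk_lines * target_ratio))

    # Everything before the first hunk header is the file header.
    first_hunk = next(
        (i for i, line in enumerate(lines) if line.startswith("@@ ")), len(lines)
    )
    header, body = lines[:first_hunk], lines[first_hunk:]

    # Group the body into hunks, each starting at an @@ line.
    hunks: List[List[str]] = []
    for line in body:
        if line.startswith("@@ "):
            hunks.append([line])
        else:
            hunks[-1].append(line)

    chunks: List[List[str]] = []
    current = list(header)
    for hunk in hunks:
        has_hunks = len(current) > len(header)
        # Start a new sub-chunk when adding this hunk would pass the target.
        if has_hunks and len(current) + len(hunk) > target:
            chunks.append(current)
            current = list(header)  # header is prepended to every sub-chunk
        current.extend(hunk)
    if len(current) > len(header) or not chunks:
        chunks.append(current)
    return chunks
```

Counting the header lines against the 80% target, as done here, is one way to keep the repeated header from pushing a sub-chunk past the hard limit.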
Content-hashed session keys. Combining the file path with a SHA-256 hash of the content means that if the diff file changes on disk, the next tool call automatically re-parses it. No stale cache issues, no manual invalidation.
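A minimal sketch of such a key, assuming the canonical path comes from os.path.realpath and the two parts are joined with a "#" separator (both are illustrative choices, not confirmed details of the implementation):

```python
import hashlib
import os


def session_key(path: str, content: bytes) -> str:
    """Build a cache key from the canonical path and a SHA-256 content hash."""
    canonical = os.path.realpath(path)  # resolves symlinks and relative parts
    digest = hashlib.sha256(content).hexdigest()
    return f"{canonical}#{digest}"
```

Two calls with the same path and bytes produce the same key, so the cached session is reused; any change to the bytes produces a different key and therefore a fresh parse.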
Auto-load on every tool. Every tool except load_diff calls _ensure_loaded(), which creates a session with default settings if none exists. This means the AI can skip the load step entirely for typical workflows, reducing tool call overhead.
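The auto-load pattern might look roughly like this. LazyDiffTools is a hypothetical stand-in for DiffChunkTools, and the injected parse callable replaces the real parser and chunker; only the lazy-loading and caching behavior is being illustrated.

```python
import hashlib
import os
from typing import Callable, Dict, List


class LazyDiffTools:
    """Hypothetical stand-in for DiffChunkTools showing only the auto-load step."""

    def __init__(self, parse: Callable[[bytes], List[str]]) -> None:
        self._parse = parse  # stands in for parsing + chunking with defaults
        self._sessions: Dict[str, List[str]] = {}

    def _ensure_loaded(self, path: str) -> List[str]:
        with open(path, "rb") as handle:
            content = handle.read()
        # Key on canonical path + content hash so edits trigger a re-parse.
        key = os.path.realpath(path) + "#" + hashlib.sha256(content).hexdigest()
        if key not in self._sessions:
            self._sessions[key] = self._parse(content)
        return self._sessions[key]

    def list_chunks(self, path: str) -> List[str]:
        # No explicit load call is needed; the session is created on demand.
        return self._ensure_loaded(path)
```

In this sketch the file is re-read and re-hashed on every call, which trades a little I/O for never serving stale chunks.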
Trivial + generated file filtering. By default, whitespace-only changes and lock files/build artifacts are excluded. This focuses the AI’s attention on meaningful code changes. Both filters can be disabled via load_diff parameters.
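One simple way to detect a whitespace-only change is to compare the removed and added lines after stripping all whitespace. The sketch below does that; is_whitespace_only_change is a hypothetical name, and the actual detection logic in parser.py may be more careful (for example, this version would misread a removed line whose content itself begins with "--" as a file header).

```python
from typing import List


def is_whitespace_only_change(diff_lines: List[str]) -> bool:
    """Return True if added and removed lines differ only in whitespace."""
    removed: List[str] = []
    added: List[str] = []
    for line in diff_lines:
        if line.startswith(("---", "+++")):
            continue  # file header lines, not content changes
        if line.startswith("-"):
            removed.append("".join(line[1:].split()))
        elif line.startswith("+"):
            added.append("".join(line[1:].split()))
    # Blank lines vanish after normalization, so pure blank-line edits
    # also count as trivial.
    removed = [text for text in removed if text]
    added = [text for text in added if text]
    return removed == added
```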
Per-file line counts in chunk metadata. Each chunk tracks how many diff lines belong to each file (file_line_counts). This lets the AI decide whether to use get_chunk (multiple small files) or get_file_diff (one specific file) without reading the actual content.
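As a rough illustration, per-file counts could be computed while a chunk is assembled from (path, content) pairs. The helper name count_file_lines and the input shape are assumptions; only the file_line_counts field name comes from the description above.

```python
from typing import Dict, Iterable, Tuple


def count_file_lines(file_diffs: Iterable[Tuple[str, str]]) -> Dict[str, int]:
    """Map each file path to the number of diff lines it contributes."""
    counts: Dict[str, int] = {}
    for path, content in file_diffs:
        # A file split across sub-chunks may appear more than once.
        counts[path] = counts.get(path, 0) + len(content.splitlines())
    return counts
```

A chunk whose counts are dominated by one file is a natural candidate for get_file_diff, while a chunk with many small entries is better read whole with get_chunk.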
Development
# Install dependencies
uv sync
# Run tests
uv run pytest
# Type checking
uv run mypy src/
# Lint
uv run ruff check src/
# Run the server directly
uv run diffchunk-mcp