diffchunk

MCP server for navigating large unified diff files. Parses diffs into manageable chunks that fit within LLM context windows, with filtering for trivial changes and generated files.

Overview

  • Language: Python
  • Repo: peteretelej/diffchunk
  • Install: uvx diffchunk-mcp or pip install diffchunk
  • Status: beta

Architecture

The server exposes five MCP tools through FastMCP (src/server.py) plus one MCP resource for a session overview. Every tool other than load_diff auto-loads the diff on first access, so an explicit load_diff call is only needed for non-default settings.

Modules:

  • server.py - FastMCP tool and resource definitions. Each tool delegates to DiffChunkTools methods and serializes results as JSON strings.
  • tools.py - DiffChunkTools class that manages diff sessions. Sessions are keyed by canonical path + content SHA-256 hash, so re-loading a modified file creates a fresh session while unchanged files reuse cached chunks. Provides load_diff, list_chunks, get_chunk, find_chunks_for_files, and get_file_diff.
  • parser.py - DiffParser class for unified diff format. Regex-based parsing that splits on diff --git headers and yields (files, content) tuples. Includes trivial change detection (whitespace-only diffs), generated file detection (lock files, minified JS, build artifacts, node_modules/), and glob-based include/exclude filtering via fnmatch.
  • chunker.py - DiffChunker class that groups parsed file diffs into chunks. Small files accumulate into a chunk until reaching the line limit. Large files (exceeding max_chunk_lines) get split at hunk boundaries (@@ headers), targeting 80% of the limit to avoid edge-case overflows. A strict enforcement fallback splits mid-hunk if no boundary is found.
  • models.py - Data models: DiffSession (manages chunks and file-to-chunk index), DiffChunk (content + metadata), DiffStats, ChunkInfo. DiffSession maintains a file_to_chunks reverse index for fast glob-based lookups.
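
The file-to-chunk reverse index described for models.py can be illustrated with a minimal sketch. The function names build_file_index and find_chunks are hypothetical and not taken from the project; only the idea (a path-to-chunk-numbers map queried with fnmatch globs) comes from the description above.

```python
from collections import defaultdict
from fnmatch import fnmatch


def build_file_index(chunks: dict[int, list[str]]) -> dict[str, list[int]]:
    """Map each file path to the chunk numbers that contain it (ascending)."""
    index: dict[str, list[int]] = defaultdict(list)
    for number in sorted(chunks):
        for path in chunks[number]:
            if number not in index[path]:
                index[path].append(number)
    return dict(index)


def find_chunks(index: dict[str, list[int]], pattern: str) -> list[int]:
    """Return the chunk numbers of every indexed path matching a glob."""
    matches = {
        number
        for path, numbers in index.items()
        if fnmatch(path, pattern)
        for number in numbers
    }
    return sorted(matches)


# A large file (src/b.py) spans chunks 2 and 3 in this made-up session.
index = build_file_index(
    {1: ["src/a.py", "README.md"], 2: ["src/b.py"], 3: ["src/b.py"]}
)
py_chunks = find_chunks(index, "*.py")
```

Because the lookup only walks the index keys, a glob query never has to rescan chunk contents.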

Data flow:

  1. First tool call triggers auto-load: _ensure_loaded() checks session cache by path+hash key
  2. DiffParser.parse_diff_file() yields per-file diffs with encoding detection via chardet
  3. DiffChunker.chunk_diff() filters (trivial, generated, include/exclude patterns), then groups into chunks
  4. DiffSession stores chunks and builds the file-to-chunk reverse index
  5. Subsequent tool calls (get_chunk, find_chunks_for_files) read from the cached session
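
Step 2 of the flow above, splitting on diff --git headers and yielding (files, content) tuples, might look roughly like the following. This is a simplified sketch under stated assumptions: the regex and the name split_file_diffs are illustrative, encoding detection is left out, and paths that themselves contain " b/" are not handled.

```python
import re
from typing import Iterator

# Assumed header shape: "diff --git a/<old> b/<new>" at the start of a line.
DIFF_HEADER = re.compile(r"^diff --git a/(.+?) b/(.+)$", re.MULTILINE)


def split_file_diffs(text: str) -> Iterator[tuple[list[str], str]]:
    """Yield (files, content) for each `diff --git` section of a unified diff."""
    headers = list(DIFF_HEADER.finditer(text))
    for i, match in enumerate(headers):
        # A section runs from its header to the next header (or end of text).
        end = headers[i + 1].start() if i + 1 < len(headers) else len(text)
        old_path, new_path = match.group(1), match.group(2)
        files = [new_path] if old_path == new_path else [old_path, new_path]
        yield files, text[match.start():end]


sample = (
    "diff --git a/a.py b/a.py\n"
    "--- a/a.py\n+++ b/a.py\n@@ -1 +1 @@\n-x\n+y\n"
    "diff --git a/old.txt b/new.txt\n"
    "rename from old.txt\nrename to new.txt\n"
)
sections = list(split_file_diffs(sample))
```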

Tools:

  • load_diff - Parse with custom settings (chunk size, filtering). Only needed when the defaults do not fit.
  • list_chunks - Overview of all chunks: file names, line counts, summaries. Entry point for systematic review.
  • get_chunk - Retrieve actual diff content for a numbered chunk, with optional context header.
  • find_chunks_for_files - Glob-based search (e.g., *.py, src/*) that returns matching chunk numbers.
  • get_file_diff - Extract complete diff for a single file across all chunks. Supports exact path, case-insensitive match, and glob patterns.
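
The three lookup modes listed for get_file_diff can be sketched as a fallback chain. The order shown here (exact, then case-insensitive, then glob) and the name resolve_file are assumptions made for illustration; the project may resolve matches differently.

```python
from fnmatch import fnmatchcase


def resolve_file(query: str, known_files: list[str]) -> list[str]:
    """Resolve a user-supplied path against the files present in a diff."""
    # 1. Exact path match wins outright.
    if query in known_files:
        return [query]
    # 2. Otherwise fall back to a case-insensitive comparison.
    folded = [f for f in known_files if f.lower() == query.lower()]
    if folded:
        return folded
    # 3. Finally, treat the query as a glob pattern.
    return [f for f in known_files if fnmatchcase(f, query)]


files = ["src/Server.py", "src/tools.py", "README.md"]
```

A caller would typically treat more than one result as ambiguous and ask for a narrower pattern.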

Key Design Decisions

Hunk-boundary chunking. Splitting large file diffs at @@ hunk headers keeps each chunk semantically coherent. The 80% target leaves room for file headers that get prepended to every sub-chunk, preventing silent overflows.
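
A minimal sketch of hunk-boundary splitting, assuming the file header is passed separately and prepended to every sub-chunk. The function name split_at_hunks is hypothetical, and the mid-hunk fallback for a single oversized hunk is deliberately left out.

```python
def split_at_hunks(
    header: list[str], body: list[str], max_lines: int
) -> list[list[str]]:
    """Split one file's diff into sub-chunks, breaking only at @@ headers."""
    target = int(max_lines * 0.8)  # the 80% target described in the text

    # Group body lines into hunks, each starting at an @@ header.
    hunks: list[list[str]] = []
    for line in body:
        if line.startswith("@@") or not hunks:
            hunks.append([])
        hunks[-1].append(line)

    chunks: list[list[str]] = []
    current = list(header)
    for hunk in hunks:
        # Close the current sub-chunk if this hunk would push it past the
        # target, unless it holds nothing but the file header so far.
        if len(current) > len(header) and len(current) + len(hunk) > target:
            chunks.append(current)
            current = list(header)
        current.extend(hunk)
    if len(current) > len(header):
        chunks.append(current)
    return chunks


file_header = ["diff --git a/x b/x", "--- a/x", "+++ b/x"]
file_body = ["@@ -1 +1,3 @@", "+a", "+b", "+c"] * 3  # three 4-line hunks
parts = split_at_hunks(file_header, file_body, max_lines=10)
```

With max_lines=10 the target is 8 lines, so each 4-line hunk plus the 3-line header lands in its own 7-line sub-chunk, under the limit.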

Content-hashed session keys. Combining the file path with a SHA-256 hash of the content means that if the diff file changes on disk, the next tool call automatically re-parses it. Stale cache entries are never served, and no manual invalidation is needed.
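
The keying scheme can be sketched as follows. This assumes os.path.realpath as the canonicalization step and a "path#digest" string layout; both are illustrative choices, not confirmed details of the project.

```python
import hashlib
import os
import tempfile


def session_key(path: str) -> str:
    """Build a cache key from the canonical path plus a SHA-256 of the bytes."""
    canonical = os.path.realpath(path)  # assumed canonicalization
    with open(canonical, "rb") as fh:
        digest = hashlib.sha256(fh.read()).hexdigest()
    return f"{canonical}#{digest}"


# Demonstrate that editing the file on disk changes the key.
with tempfile.TemporaryDirectory() as tmp:
    diff_path = os.path.join(tmp, "change.diff")
    with open(diff_path, "w") as fh:
        fh.write("diff --git a/x b/x\n")
    key_before = session_key(diff_path)
    key_same = session_key(diff_path)
    with open(diff_path, "a") as fh:
        fh.write("+new line\n")
    key_after = session_key(diff_path)
```

An unchanged file produces the same key and hits the cache, while any edit produces a new key and therefore a fresh parse.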

Auto-load on first use. Every tool except load_diff calls _ensure_loaded(), which creates a session with default settings if none exists. The AI can therefore skip the load step entirely in typical workflows, which saves a tool call.
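
The lazy-loading pattern might be sketched like this. LazyTools is a hypothetical stand-in, not the project's DiffChunkTools class; the default of 1000 lines and the dict-based session are placeholders for illustration only.

```python
class LazyTools:
    """Hypothetical stand-in for a tool class with lazy session loading."""

    DEFAULT_MAX_LINES = 1000  # assumed default, for illustration only

    def __init__(self) -> None:
        self.sessions: dict[str, dict] = {}
        self.parse_count = 0

    def load_diff(self, key: str, max_chunk_lines: int = DEFAULT_MAX_LINES) -> dict:
        """Explicit load: always (re)parses with the given settings."""
        self.parse_count += 1
        self.sessions[key] = {"key": key, "max_chunk_lines": max_chunk_lines}
        return self.sessions[key]

    def _ensure_loaded(self, key: str) -> dict:
        """Return the cached session, parsing with defaults on first access."""
        if key not in self.sessions:
            self.load_diff(key)
        return self.sessions[key]

    def list_chunks(self, key: str) -> dict:
        """A read-only tool: never requires a prior load_diff call."""
        return self._ensure_loaded(key)


tools = LazyTools()
first = tools.list_chunks("change.diff")   # triggers the one-time parse
second = tools.list_chunks("change.diff")  # served from the cache
```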

Trivial + generated file filtering. By default, whitespace-only changes and lock files/build artifacts are excluded. This focuses the AI’s attention on meaningful code changes. Both filters can be disabled via load_diff parameters.
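
Both filters can be approximated in a few lines. The pattern list and the whitespace rule below are assumptions for illustration; the project's actual patterns and its definition of a trivial change may differ. is_trivial expects hunk body lines only, without the ---/+++ file headers.

```python
from fnmatch import fnmatch

# Assumed example patterns; not the project's real list.
GENERATED_PATTERNS = ["*.lock", "*.min.js", "*node_modules/*"]


def is_generated(path: str) -> bool:
    """True when the path matches any generated-file glob."""
    return any(fnmatch(path, pattern) for pattern in GENERATED_PATTERNS)


def is_trivial(hunk_lines: list[str]) -> bool:
    """True when removed and added lines are equal once whitespace is ignored."""

    def squash(prefix: str) -> list[str]:
        # Drop the +/- marker, remove all whitespace, skip now-empty lines.
        stripped = (
            "".join(line[1:].split())
            for line in hunk_lines
            if line.startswith(prefix)
        )
        return [s for s in stripped if s]

    return squash("-") == squash("+")


whitespace_only = ["-foo(a,b)", "+foo(a, b)", "+"]
real_change = ["-x = 1", "+x = 2"]
```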

Per-file line counts in chunk metadata. Each chunk tracks how many diff lines belong to each file (file_line_counts). This lets the AI decide whether to use get_chunk (multiple small files) or get_file_diff (one specific file) without reading the actual content.
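
As an illustration of how a client could use file_line_counts, here is one possible heuristic. The function pick_tool and its 50% threshold are hypothetical and describe client-side behavior, not anything the server does.

```python
def pick_tool(
    file_line_counts: dict[str, int], wanted: str, threshold: float = 0.5
) -> str:
    """Pick a tool for reading `wanted`, using only the chunk's metadata."""
    total = sum(file_line_counts.values())
    if wanted not in file_line_counts or total == 0:
        return "get_file_diff"
    share = file_line_counts[wanted] / total
    # If the file dominates the chunk, fetching the whole chunk wastes little;
    # otherwise ask for that file's diff alone.
    return "get_chunk" if share >= threshold else "get_file_diff"


counts = {"src/parser.py": 180, "src/models.py": 12, "README.md": 8}
```

Here src/parser.py makes up 90% of the chunk, so reading the chunk is cheap, while README.md at 4% is better fetched on its own.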

Development

# Install dependencies
uv sync

# Run tests
uv run pytest

# Type checking
uv run mypy src/

# Lint
uv run ruff check src/

# Run the server directly
uv run diffchunk-mcp