Skip to content

Latest commit

 

History

History
716 lines (478 loc) · 26.7 KB

File metadata and controls

716 lines (478 loc) · 26.7 KB

Architecture

Version note: This document describes the current high-level architecture of jCodeMunch-MCP. Exact tool names, optional integrations, and ranking details may evolve over time. For user-facing setup and workflows, see USER_GUIDE.md. For protocol semantics, see SPEC.md.


Table of Contents


System Overview

jCodeMunch-MCP is a local-first structured code retrieval system for AI agents.

Its purpose is to let an MCP-compatible client explore a repository by symbol, outline, structure, and tightly scoped context instead of repeatedly opening large files and scanning them linearly. It indexes source code once, extracts symbols with tree-sitter, stores stable symbol metadata plus byte offsets into cached raw files, and serves precise retrieval and search operations through an MCP server.

The central architectural principle is:

Retrieval precision is more efficient than brute-force context expansion.

The system is therefore designed around symbol IDs, file outlines, bounded searches, context bundles, and direct byte-offset retrieval rather than repeated full-file reads.


Design Goals

jCodeMunch is built around the following priorities:

  1. Minimize tokens sent to the model by retrieving only relevant code.
  2. Preserve source fidelity by caching original raw files and retrieving exact spans by byte offset.
  3. Stay fast on repeat access through local storage, incremental indexing, sidecars, and caching.
  4. Remain explainable through stable IDs, structured metadata, and explicit tool boundaries.
  5. Support real engineering workflows such as onboarding, code navigation, impact analysis, class hierarchy inspection, and repository exploration.
  6. Favor deterministic structure-aware retrieval over opaque end-to-end semantic systems when exact code access is required.

Directory Structure

The repository includes core documentation, benchmark material, a lightweight CLI, and the MCP server package. A representative architecture-facing layout is shown below:

jcodemunch-mcp/
├── pyproject.toml
├── uv.lock
├── README.md
├── QUICKSTART.md
├── USER_GUIDE.md
├── ARCHITECTURE.md
├── SPEC.md
├── SECURITY.md
├── LANGUAGE_SUPPORT.md
├── CONTEXT_PROVIDERS.md
├── TOKEN_SAVINGS.md
├── TROUBLESHOOTING.md
├── benchmarks/
│   ├── tasks.json
│   ├── README.md
│   ├── METHODOLOGY.md
│   ├── results.md
│   └── run_benchmarks.py
├── cli/
│   ├── cli.py
│   └── README.md
├── src/jcodemunch_mcp/
│   ├── server.py
│   ├── security.py
│   ├── hook_event.py
│   ├── parser/
│   ├── storage/
│   │   ├── index_store.py
│   │   └── sqlite_store.py
│   ├── summarizer/
│   ├── tools/
│   └── ...
├── tests/
└── .github/workflows/

This layout is schematic rather than exhaustive. Its purpose is to reflect the primary architectural domains: server, parser, storage, summarization, tools, CLI, watch mode, documentation, and benchmark harnesses.


High-Level Data Flow

GitHub repo or local folder
    │
    ▼
Discovery / file enumeration
    │
    ▼
Security filtering
(path traversal, symlinks, secrets, binary/unsafe handling, size / path controls)
    │
    ▼
Language detection + tree-sitter parsing
    │
    ▼
Symbol extraction
(functions, classes, methods, constants, types, language-specific constructs)
    │
    ▼
Post-processing
(overload disambiguation, content hashing, import/reference metadata)
    │
    ▼
Context enrichment
(provider-derived metadata such as dbt or Git history)
    │
    ▼
Summarization
(docstring → AI batch → signature fallback)
    │
    ▼
Persistence
(index JSON + sidecars + cached raw files + savings telemetry)
    │
    ▼
MCP / CLI / watch consumers
(search, retrieval, outlines, bundles, hierarchy, blast radius, diff, suggestions)

Core Architectural Concepts

1. Index once, retrieve many times

The main performance strategy is to pay the parsing and indexing cost once, then make subsequent searches and retrievals cheap and precise. Cached raw files plus symbol offsets make this possible.

2. Structured retrieval over full-file reading

Most operations are designed to answer questions such as:

  • what symbols exist here?
  • where is the implementation?
  • what is related to this symbol?
  • what code would this change affect?
  • what is the surrounding context bundle?

This approach avoids loading entire files unless necessary.

3. Local-first storage

Indexes and raw-file caches live under ~/.code-index/ by default, optionally overridden via CODE_INDEX_PATH. This keeps repeat access fast and private.

4. Stable identities

The system relies on stable symbol IDs based on file path, qualified name, and kind, allowing later retrieval and cross-call workflows without fuzzy lookup.

5. Rich metadata envelopes

Operations return metadata describing timing, repository identity, truncation, token-savings estimates, and related execution details. Recent versions also label savings as estimates and include the estimation method where applicable.


Repository Identity Model

jCodeMunch indexes both:

  • GitHub repositories via index_repo
  • local folders via index_folder

For GitHub repositories, the user-facing identity is typically owner/repo. For local folders, the system uses stable internal repository IDs while still exposing friendly names where possible through repo-listing operations.

For GitHub indexing, the system can store the Git tree SHA so unchanged incremental reindex runs can return early without re-downloading repository contents unnecessarily.


Parsing and Symbol Extraction

Language registry pattern

The parser uses a registry-oriented design where each supported language contributes extraction behavior through a LanguageSpec-style definition.

Representative shape:

@dataclass
class LanguageSpec:
    ts_language: str
    symbol_node_types: dict[str, str]
    name_fields: dict[str, str]
    param_fields: dict[str, str]
    return_type_fields: dict[str, str]
    docstring_strategy: str
    decorator_node_type: str | None
    container_node_types: list[str]
    constant_patterns: list[str]
    type_patterns: list[str]

Extraction model

The generic extractor walks language-specific ASTs and emits normalized symbol records such as functions, classes, methods, constants, and types, plus language-specific constructs where useful.

Post-processing passes

Important post-extraction passes include:

  1. Overload disambiguation via numeric suffixes such as ~1, ~2
  2. Content hashing using SHA-256 for change detection and diff-oriented features

These post-processing steps support stable IDs, change detection, and later comparison across snapshots.


Context Provider Framework

A significant extension point in the system is the context provider framework. Providers enrich indexes with ecosystem-specific metadata during local folder indexing.

Current provider model

Providers are auto-detected, load metadata from project files or repository state, and inject descriptions, tags, or related properties into symbol summaries, file summaries, and search keywords.

Known provider examples

  • dbt provider: detects dbt_project.yml, parses docs blocks and schema files, and enriches models and symbols with descriptions, tags, and column metadata
  • Git blame provider: auto-activates when a .git directory is present and enriches files with author and modification metadata derived from repository history

Architectural role

Context providers sit between raw structural parsing and final summarization and persistence. They improve retrieval quality by supplying domain and repository metadata that tree-sitter alone cannot infer.

This design allows jCodeMunch to remain structure-aware while supporting ecosystem-specific enrichment without tightly coupling every domain into the parser itself.


Summarization Pipeline

The summarization layer generates concise symbol and file descriptions used for retrieval ergonomics and search relevance.

The high-level fallback chain is:

docstring → AI batch → signature fallback

Available summary backends

The system supports multiple summary backends, including:

  • Anthropic-based summaries
  • Gemini-based summaries
  • local OpenAI-compatible servers via OPENAI_API_BASE and OPENAI_MODEL

Architectural role

Summaries assist navigation and ranking, but they are not the architectural source of truth. The retrieval backbone remains symbol extraction, byte offsets, and cached source. Summaries enhance retrieval quality rather than replace deterministic access to code.


Storage Model

Indexes are stored under ~/.code-index/ by default and can be redirected with CODE_INDEX_PATH.

Core persisted artifacts

The primary storage backend is SQLite in WAL (Write-Ahead Logging) mode. Each indexed repository produces:

  • {repo_slug}.db — SQLite database containing metadata, symbols, files, imports, and content blobs
  • {repo_slug}.db-wal — WAL file for concurrent read/write access
  • {repo_slug}.db-shm — shared-memory file for WAL coordination
  • {repo_slug}.meta — lightweight metadata sidecar so repo listing does not require opening the database
  • {repo_slug}.checksum — SHA-256 checksum sidecar for integrity verification
  • {repo_slug}/... — cached raw source files for exact retrieval

The SQLite schema includes tables for meta, symbols, files, imports, raw_cache, and content_blob with appropriate indexes. Legacy JSON indexes ({repo_slug}.json) are auto-migrated to SQLite on first load.

WAL mode benefits

WAL mode enables concurrent readers alongside a single writer, which is important for watch-mode scenarios where incremental reindexing runs while MCP tool calls serve retrieval queries. On graceful shutdown, the server checkpoints and compacts WAL files.

Additional storage features

  • LRU index cache with mtime invalidation
  • metadata sidecars so repo listing does not require loading full indexes
  • schema validation on index load
  • SHA-256 checksum sidecars for integrity verification

Retrieval-critical property

Each symbol stores byte offsets into the cached raw file, enabling exact retrieval via direct byte seeking rather than reparsing or rescanning the file.


Indexing Strategies

Full indexing

A full index pass performs discovery, security filtering, parsing, enrichment, summarization, and persistence across the entire repository or folder.

Incremental indexing

Incremental indexing is based on stored file hashes and related repository metadata. Only changed files are reprocessed.

Enhancements to incremental indexing include:

  • GitHub tree SHA short-circuiting for no-change reindexes
  • mtime fast-path with fallback to hash comparison on mtime mismatch
  • watch-triggered incremental reindexing
  • SQLite WAL mode for concurrent read/write during incremental updates

File discovery optimization

File discovery uses pre-resolved path strings and inlined security checks to minimize per-file overhead. Gitignore matching uses string prefix comparison instead of Path operations. These optimizations reduced full-index discovery time by approximately 2x on large repositories.

Watch mode

The watch subcommand continuously monitors one or more directories using watchfiles and triggers incremental reindexing on change. This is useful for large repositories and multi-worktree development flows where manual reindexing would be inefficient.

watch-claude mode

The watch-claude subcommand extends watch mode for Claude Code's worktree workflow. Claude Code creates git worktrees when opening parallel sessions, and removes them when sessions end. Rather than requiring users to manually pass paths, watch-claude discovers worktrees automatically via two complementary mechanisms:

  • Hook-driven discovery — Claude Code's WorktreeCreate and WorktreeRemove hooks call jcodemunch-mcp hook-event create|remove, which appends events to a JSONL manifest at ~/.claude/jcodemunch-worktrees.jsonl. The watch-claude process watches this file with watchfiles and reacts to new lines instantly — no polling.

  • Git-based discovery — When --repos paths are specified, watch-claude periodically runs git worktree list --porcelain on each repo and filters for Claude-created worktrees (branches matching claude/* or worktree-*). This works with any worktree layout without requiring hooks.

Both modes can run concurrently, sharing a single active task map to avoid double-watching. On worktree removal, the watcher task is cancelled and invalidate_cache cleans up the index. Crashed watcher tasks are automatically restarted by the repos poller.


Retrieval Model

The retrieval layer is centered on bounded, structured access to code.

Basic retrieval flow

The most direct retrieval path is:

  1. search_symbols
  2. select a symbol_id
  3. call get_symbol_source

File-oriented discovery

When symbol names are unknown, file and repository outlines support navigation without loading full source:

  • get_repo_outline
  • get_file_tree
  • get_file_outline

Context bundles

get_context_bundle packages a symbol together with closely related material such as imports, neighboring items, or related symbols. Later versions extend this to support multi-symbol bundles, deduplicated imports, optional callers, and Markdown-formatted output.

Drift verification

get_symbol_source supports verification so a client can determine whether retrieved content still matches the indexed version. This helps mitigate stale-index problems in fast-changing repositories.


Search and Ranking

Symbol search

Symbol search began as a weighted scoring system using signals such as:

  • exact name match
  • substring match
  • word overlap
  • signature terms
  • summary terms
  • docstring and keyword matches

This remains a useful conceptual baseline.

Ranking improvements

The current ranking model incorporates additional mechanisms, including:

  • bounded heap search for efficient top-k result handling
  • BM25-based symbol ranking
  • centrality-aware ranking using a log-scaled bonus for symbols in frequently imported files as a tiebreaker

Practical interpretation

Search is best understood as a hybrid structured ranking pipeline that combines lexical and summary relevance with structure-derived signals such as file centrality.

Text search

search_text remains the fallback for non-symbol content such as strings, comments, TODOs, or other literal text that does not map naturally to a symbol record.

Suggestion mode

suggest_queries is designed to help users and agents enter unfamiliar repositories by surfacing useful keywords, frequently imported files, distributions, and ready-to-run query patterns.


Import Graph and Relationship Features

A major architectural expansion in recent versions is support for lightweight relationship and impact analysis.

find_importers and find_references

find_importers answers "what files import this file?" by resolving import specifiers against indexed source files. find_references answers "where is this identifier used?" by matching named imports and specifier stems. Both support batch array parameters (file_paths and identifiers respectively) for querying multiple targets in a single call with shared index loading.

get_dependency_graph

This operation traverses the file-level dependency graph up to 3 hops in either direction (imports, importers, or both), providing a structural overview of file relationships.

check_references

A composite tool that combines import-level reference checking (same logic as find_references) with content-level substring search (same approach as search_text) in one call. Returns an is_referenced boolean for quick dead-code detection. Skips defining files in content search to avoid false positives. Supports both singular and batch modes.

get_related_symbols

This operation uses multiple heuristics, including:

  • same-file co-location
  • shared importers
  • name-token overlap

get_class_hierarchy

This operation traverses inheritance chains upward and downward, including external bases not present in the index when they can be identified structurally.

get_blast_radius

This operation traverses the reverse import graph and inspects importers to estimate impacted files, separating confirmed from potential effects where possible.

get_symbol_diff

This operation compares snapshots by (name, kind) and uses content_hash to report added, removed, and changed symbols.

Architectural significance

These features mean the system is no longer limited to static symbol lookup. It increasingly supports lightweight code intelligence over the indexed repository while preserving the core retrieval-first design.


Tool Surface

The tool surface is best described by capability domain rather than by a fixed count.

Indexing and repository management

  • index_repo
  • index_folder
  • index_file
  • list_repos
  • resolve_repo
  • invalidate_cache

Discovery and outlines

  • get_repo_outline
  • get_file_tree
  • get_file_outline — supports batch via file_paths array parameter
  • suggest_queries

Retrieval

  • get_file_content
  • get_symbol_source
  • get_context_bundle

Search and reference checking

  • search_symbols
  • search_text
  • search_columns
  • check_references — composite tool combining import + content search for dead-code detection

Relationship and impact analysis

  • find_importers — supports batch via file_paths array parameter
  • find_references — supports batch via identifiers array parameter
  • get_dependency_graph
  • get_related_symbols
  • get_class_hierarchy
  • get_blast_radius
  • get_symbol_diff

Batch query support

Several tools accept array parameters alongside their singular equivalents, enabling multiple queries in a single call. This reduces LLM round-trips and cache re-reads. Singular parameters continue to return the original flat response shape for backward compatibility. Batch mode returns a grouped results array. Response-level tip fields in _meta guide models toward batch usage without requiring external configuration.

Domain-specific retrieval

The system also supports provider-aware or ecosystem-aware retrieval operations where appropriate, such as dbt-related column search in enriched contexts.


CLI and Watch Mode

jCodeMunch is MCP-first, but not MCP-only.

CLI

A minimal CLI is provided over the shared index store, covering operations such as:

  • list
  • index
  • outline
  • search
  • get
  • text
  • file
  • invalidate

Watch mode

The watch subcommand monitors one or more directories and performs incremental reindexing with debounce. The watch-claude subcommand builds on this for Claude Code's worktree workflow, using hook-driven events and/or git worktree list polling to discover worktrees automatically. Both use the same underlying store as the MCP server, so updates become immediately visible to MCP consumers.

Architecturally, CLI, watch, and watch-claude are all thin interfaces over the same parser, storage, indexing, and retrieval core.


Response Envelope and Telemetry

All tool responses include an _meta envelope.

Representative fields include:

  • timing_ms
  • repo
  • symbol_count
  • truncated
  • tokens_saved
  • total_tokens_saved
  • estimate_method
  • cost_avoided — estimated cost savings per model (dict with model names as keys)
  • total_cost_avoided — cumulative cost savings across session
  • powered_by — attribution string
  • tip — guidance for models toward more efficient tool usage (e.g., batch params, composite tools)

Representative shape:

{
  "result": "...",
  "_meta": {
    "timing_ms": 42,
    "repo": "owner/repo",
    "truncated": false,
    "tokens_saved": 2450,
    "total_tokens_saved": 184320,
    "estimate_method": "...",
    "cost_avoided": {"claude_opus_4_6": 0.012, "claude_sonnet_4_6": 0.007},
    "total_cost_avoided": {"claude_opus_4_6": 968.32, "claude_sonnet_4_6": 580.99},
    "powered_by": "jcodemunch-mcp by jgravelle",
    "tip": "Tip: use file_paths=[...] to query multiple files in one call."
  }
}

Not all fields are present in every response. tip fields appear only in single-item responses and are stripped from batch results to keep them clean.

Savings telemetry

Running totals are persisted in ~/.code-index/_savings.json. Recent versions improved cross-process accounting so concurrent processes do not overwrite cumulative totals.

Community meter

If enabled, tool calls can report a token-savings delta together with an anonymous install identifier to a global community counter. This reporting is intended to exclude repository names, paths, and source content.


Security Model

Security is a first-class subsystem.

File and path protections

The system includes protections for:

  • path traversal
  • symlink validation
  • secret-file exclusion
  • binary file detection
  • safe encoding reads using replacement handling

Additional hardening

Recent security-oriented additions include:

  • SSRF prevention for configurable API base URLs
  • ReDoS protection in search_text
  • symlink-safe temporary file handling
  • HTTP bearer token authentication for HTTP transport
  • absolute path redaction via JCODEMUNCH_REDACT_SOURCE_ROOT

The security model therefore extends beyond file safety to include network and response-surface hardening.


Performance and Scalability Notes

Recent versions added or documented several performance-oriented mechanisms:

  • streaming file indexing to reduce peak memory usage
  • bounded heap search for better top-k scaling
  • LRU index cache with mtime invalidation
  • sidecars for cheaper repo listing
  • incremental indexing with mtime fast-path
  • watch-based updates
  • centrality-aware ranking
  • no-change short-circuiting via Git tree SHA
  • SQLite WAL mode for concurrent read/write without locking
  • optimized file discovery using pre-resolved path strings and inlined security checks (approximately 2x speedup)
  • batch array parameters on high-call-count tools to reduce LLM round-trips and cache re-reads

Practical implications

The system is architected to stay responsive in larger repositories by:

  1. minimizing reparsing
  2. minimizing reloading
  3. minimizing retrieval payload size
  4. avoiding full result sorts when only top-k results are needed
  5. minimizing LLM round-trips through batch query support

Benchmarking and Proof

The benchmark surface has expanded to include:

  • a public benchmark corpus in benchmarks/tasks.json
  • methodology documentation
  • reproducible benchmark materials
  • canonical tiktoken-measured results

This is architecturally important because the system makes concrete efficiency claims. Reproducible benchmarking is therefore part of the product’s proof surface, not just a marketing artifact.


Failure Modes and Tradeoffs

No retrieval system is without tradeoffs.

1. The system only helps when the client uses it

If the agent continues to open whole files instead of using jCodeMunch tools, the retrieval architecture is present but the efficiency benefits do not materialize.

2. Structural retrieval is not identical to deep semantic reasoning

The system is strong at locating, packaging, and relating code. It is not intended to replace all forms of semantic reasoning. Summaries and ranking improve retrieval quality, but the backbone remains structure-aware access.

3. Index freshness matters

Because the system serves from a local index and cached raw files, stale indexes can mislead unless verification or reindexing is used.

4. Language support is broad but not uniform

Different languages and ecosystem providers expose different kinds of metadata and symbols. The architecture is extensible, but capability depth varies by language and provider maturity.

5. Broader capability surfaces increase maintenance demands

As the tool surface expands, protocol clarity, result consistency, and documentation accuracy become more important and more difficult to maintain.


Appendix: Symbol IDs

Symbol IDs follow this stable format:

{file_path}::{qualified_name}#{kind}

Examples:

src/main.py::UserService#class
src/main.py::UserService.login#method
src/utils.py::authenticate#function
config.py::MAX_RETRIES#constant

IDs remain stable across re-indexing as long as file path, qualified name, and kind remain unchanged. Duplicate overloads can receive numeric suffixes such as ~1, ~2.


Appendix: Key Dependencies

Representative architectural dependencies include:

Package Role
mcp MCP server framework
httpx Async HTTP and GitHub access
tree-sitter-language-pack Precompiled language grammars
pathspec Gitignore pattern matching
anthropic Claude-based summarization
google-generativeai Gemini-based summarization
pyyaml dbt schema and provider parsing
watchfiles optional watch mode
sqlite3 (stdlib) WAL-mode index storage backend

These dependencies support the core architectural layers: protocol transport, repository acquisition, parsing, storage, summarization, file watching, and safe concurrent access.