Version note: This document describes the current high-level architecture of jCodeMunch-MCP. Exact tool names, optional integrations, and ranking details may evolve over time. For user-facing setup and workflows, see
USER_GUIDE.md. For protocol semantics, seeSPEC.md.
- System Overview
- Design Goals
- Directory Structure
- High-Level Data Flow
- Core Architectural Concepts
- Repository Identity Model
- Parsing and Symbol Extraction
- Context Provider Framework
- Summarization Pipeline
- Storage Model
- Indexing Strategies
- Retrieval Model
- Search and Ranking
- Import Graph and Relationship Features
- Tool Surface
- CLI and Watch Mode
- Response Envelope and Telemetry
- Security Model
- Performance and Scalability Notes
- Benchmarking and Proof
- Failure Modes and Tradeoffs
- Appendix: Symbol IDs
- Appendix: Key Dependencies
jCodeMunch-MCP is a local-first structured code retrieval system for AI agents.
Its purpose is to let an MCP-compatible client explore a repository by symbol, outline, structure, and tightly scoped context instead of repeatedly opening large files and scanning them linearly. It indexes source code once, extracts symbols with tree-sitter, stores stable symbol metadata plus byte offsets into cached raw files, and serves precise retrieval and search operations through an MCP server.
The central architectural principle is:
Retrieval precision is more efficient than brute-force context expansion.
The system is therefore designed around symbol IDs, file outlines, bounded searches, context bundles, and direct byte-offset retrieval rather than repeated full-file reads.
jCodeMunch is built around the following priorities:
- Minimize tokens sent to the model by retrieving only relevant code.
- Preserve source fidelity by caching original raw files and retrieving exact spans by byte offset.
- Stay fast on repeat access through local storage, incremental indexing, sidecars, and caching.
- Remain explainable through stable IDs, structured metadata, and explicit tool boundaries.
- Support real engineering workflows such as onboarding, code navigation, impact analysis, class hierarchy inspection, and repository exploration.
- Favor deterministic structure-aware retrieval over opaque end-to-end semantic systems when exact code access is required.
The repository includes core documentation, benchmark material, a lightweight CLI, and the MCP server package. A representative architecture-facing layout is shown below:
jcodemunch-mcp/
├── pyproject.toml
├── uv.lock
├── README.md
├── QUICKSTART.md
├── USER_GUIDE.md
├── ARCHITECTURE.md
├── SPEC.md
├── SECURITY.md
├── LANGUAGE_SUPPORT.md
├── CONTEXT_PROVIDERS.md
├── TOKEN_SAVINGS.md
├── TROUBLESHOOTING.md
├── benchmarks/
│ ├── tasks.json
│ ├── README.md
│ ├── METHODOLOGY.md
│ ├── results.md
│ └── run_benchmarks.py
├── cli/
│ ├── cli.py
│ └── README.md
├── src/jcodemunch_mcp/
│ ├── server.py
│ ├── security.py
│ ├── hook_event.py
│ ├── parser/
│ ├── storage/
│ │ ├── index_store.py
│ │ └── sqlite_store.py
│ ├── summarizer/
│ ├── tools/
│ └── ...
├── tests/
└── .github/workflows/
This layout is schematic rather than exhaustive. Its purpose is to reflect the primary architectural domains: server, parser, storage, summarization, tools, CLI, watch mode, documentation, and benchmark harnesses.
GitHub repo or local folder
│
▼
Discovery / file enumeration
│
▼
Security filtering
(path traversal, symlinks, secrets, binary/unsafe handling, size / path controls)
│
▼
Language detection + tree-sitter parsing
│
▼
Symbol extraction
(functions, classes, methods, constants, types, language-specific constructs)
│
▼
Post-processing
(overload disambiguation, content hashing, import/reference metadata)
│
▼
Context enrichment
(provider-derived metadata such as dbt or Git history)
│
▼
Summarization
(docstring → AI batch → signature fallback)
│
▼
Persistence
(index JSON + sidecars + cached raw files + savings telemetry)
│
▼
MCP / CLI / watch consumers
(search, retrieval, outlines, bundles, hierarchy, blast radius, diff, suggestions)
The main performance strategy is to pay the parsing and indexing cost once, then make subsequent searches and retrievals cheap and precise. Cached raw files plus symbol offsets make this possible.
Most operations are designed to answer questions such as:
- what symbols exist here?
- where is the implementation?
- what is related to this symbol?
- what code would this change affect?
- what is the surrounding context bundle?
This approach avoids loading entire files unless necessary.
Indexes and raw-file caches live under ~/.code-index/ by default, optionally overridden via CODE_INDEX_PATH. This keeps repeat access fast and private.
The system relies on stable symbol IDs based on file path, qualified name, and kind, allowing later retrieval and cross-call workflows without fuzzy lookup.
Operations return metadata describing timing, repository identity, truncation, token-savings estimates, and related execution details. Recent versions also label savings as estimates and include the estimation method where applicable.
jCodeMunch indexes both:
- GitHub repositories via
index_repo - local folders via
index_folder
For GitHub repositories, the user-facing identity is typically owner/repo. For local folders, the system uses stable internal repository IDs while still exposing friendly names where possible through repo-listing operations.
For GitHub indexing, the system can store the Git tree SHA so unchanged incremental reindex runs can return early without re-downloading repository contents unnecessarily.
The parser uses a registry-oriented design where each supported language contributes extraction behavior through a LanguageSpec-style definition.
Representative shape:
@dataclass
class LanguageSpec:
ts_language: str
symbol_node_types: dict[str, str]
name_fields: dict[str, str]
param_fields: dict[str, str]
return_type_fields: dict[str, str]
docstring_strategy: str
decorator_node_type: str | None
container_node_types: list[str]
constant_patterns: list[str]
type_patterns: list[str]The generic extractor walks language-specific ASTs and emits normalized symbol records such as functions, classes, methods, constants, and types, plus language-specific constructs where useful.
Important post-extraction passes include:
- Overload disambiguation via numeric suffixes such as
~1,~2 - Content hashing using SHA-256 for change detection and diff-oriented features
These post-processing steps support stable IDs, change detection, and later comparison across snapshots.
A significant extension point in the system is the context provider framework. Providers enrich indexes with ecosystem-specific metadata during local folder indexing.
Providers are auto-detected, load metadata from project files or repository state, and inject descriptions, tags, or related properties into symbol summaries, file summaries, and search keywords.
- dbt provider: detects
dbt_project.yml, parses docs blocks and schema files, and enriches models and symbols with descriptions, tags, and column metadata - Git blame provider: auto-activates when a
.gitdirectory is present and enriches files with author and modification metadata derived from repository history
Context providers sit between raw structural parsing and final summarization and persistence. They improve retrieval quality by supplying domain and repository metadata that tree-sitter alone cannot infer.
This design allows jCodeMunch to remain structure-aware while supporting ecosystem-specific enrichment without tightly coupling every domain into the parser itself.
The summarization layer generates concise symbol and file descriptions used for retrieval ergonomics and search relevance.
The high-level fallback chain is:
docstring → AI batch → signature fallback
The system supports multiple summary backends, including:
- Anthropic-based summaries
- Gemini-based summaries
- local OpenAI-compatible servers via
OPENAI_API_BASEandOPENAI_MODEL
Summaries assist navigation and ranking, but they are not the architectural source of truth. The retrieval backbone remains symbol extraction, byte offsets, and cached source. Summaries enhance retrieval quality rather than replace deterministic access to code.
Indexes are stored under ~/.code-index/ by default and can be redirected with CODE_INDEX_PATH.
The primary storage backend is SQLite in WAL (Write-Ahead Logging) mode. Each indexed repository produces:
{repo_slug}.db— SQLite database containing metadata, symbols, files, imports, and content blobs{repo_slug}.db-wal— WAL file for concurrent read/write access{repo_slug}.db-shm— shared-memory file for WAL coordination{repo_slug}.meta— lightweight metadata sidecar so repo listing does not require opening the database{repo_slug}.checksum— SHA-256 checksum sidecar for integrity verification{repo_slug}/...— cached raw source files for exact retrieval
The SQLite schema includes tables for meta, symbols, files, imports, raw_cache, and content_blob with appropriate indexes. Legacy JSON indexes ({repo_slug}.json) are auto-migrated to SQLite on first load.
WAL mode enables concurrent readers alongside a single writer, which is important for watch-mode scenarios where incremental reindexing runs while MCP tool calls serve retrieval queries. On graceful shutdown, the server checkpoints and compacts WAL files.
- LRU index cache with mtime invalidation
- metadata sidecars so repo listing does not require loading full indexes
- schema validation on index load
- SHA-256 checksum sidecars for integrity verification
Each symbol stores byte offsets into the cached raw file, enabling exact retrieval via direct byte seeking rather than reparsing or rescanning the file.
A full index pass performs discovery, security filtering, parsing, enrichment, summarization, and persistence across the entire repository or folder.
Incremental indexing is based on stored file hashes and related repository metadata. Only changed files are reprocessed.
Enhancements to incremental indexing include:
- GitHub tree SHA short-circuiting for no-change reindexes
- mtime fast-path with fallback to hash comparison on mtime mismatch
- watch-triggered incremental reindexing
- SQLite WAL mode for concurrent read/write during incremental updates
File discovery uses pre-resolved path strings and inlined security checks to minimize per-file overhead. Gitignore matching uses string prefix comparison instead of Path operations. These optimizations reduced full-index discovery time by approximately 2x on large repositories.
The watch subcommand continuously monitors one or more directories using watchfiles and triggers incremental reindexing on change. This is useful for large repositories and multi-worktree development flows where manual reindexing would be inefficient.
The watch-claude subcommand extends watch mode for Claude Code's worktree workflow. Claude Code creates git worktrees when opening parallel sessions, and removes them when sessions end. Rather than requiring users to manually pass paths, watch-claude discovers worktrees automatically via two complementary mechanisms:
-
Hook-driven discovery — Claude Code's
WorktreeCreateandWorktreeRemovehooks calljcodemunch-mcp hook-event create|remove, which appends events to a JSONL manifest at~/.claude/jcodemunch-worktrees.jsonl. Thewatch-claudeprocess watches this file withwatchfilesand reacts to new lines instantly — no polling. -
Git-based discovery — When
--repospaths are specified,watch-claudeperiodically runsgit worktree list --porcelainon each repo and filters for Claude-created worktrees (branches matchingclaude/*orworktree-*). This works with any worktree layout without requiring hooks.
Both modes can run concurrently, sharing a single active task map to avoid double-watching. On worktree removal, the watcher task is cancelled and invalidate_cache cleans up the index. Crashed watcher tasks are automatically restarted by the repos poller.
The retrieval layer is centered on bounded, structured access to code.
The most direct retrieval path is:
search_symbols- select a
symbol_id - call
get_symbol_source
When symbol names are unknown, file and repository outlines support navigation without loading full source:
get_repo_outlineget_file_treeget_file_outline
get_context_bundle packages a symbol together with closely related material such as imports, neighboring items, or related symbols. Later versions extend this to support multi-symbol bundles, deduplicated imports, optional callers, and Markdown-formatted output.
get_symbol_source supports verification so a client can determine whether retrieved content still matches the indexed version. This helps mitigate stale-index problems in fast-changing repositories.
Symbol search began as a weighted scoring system using signals such as:
- exact name match
- substring match
- word overlap
- signature terms
- summary terms
- docstring and keyword matches
This remains a useful conceptual baseline.
The current ranking model incorporates additional mechanisms, including:
- bounded heap search for efficient top-k result handling
- BM25-based symbol ranking
- centrality-aware ranking using a log-scaled bonus for symbols in frequently imported files as a tiebreaker
Search is best understood as a hybrid structured ranking pipeline that combines lexical and summary relevance with structure-derived signals such as file centrality.
search_text remains the fallback for non-symbol content such as strings, comments, TODOs, or other literal text that does not map naturally to a symbol record.
suggest_queries is designed to help users and agents enter unfamiliar repositories by surfacing useful keywords, frequently imported files, distributions, and ready-to-run query patterns.
A major architectural expansion in recent versions is support for lightweight relationship and impact analysis.
find_importers answers "what files import this file?" by resolving import specifiers against indexed source files. find_references answers "where is this identifier used?" by matching named imports and specifier stems. Both support batch array parameters (file_paths and identifiers respectively) for querying multiple targets in a single call with shared index loading.
This operation traverses the file-level dependency graph up to 3 hops in either direction (imports, importers, or both), providing a structural overview of file relationships.
A composite tool that combines import-level reference checking (same logic as find_references) with content-level substring search (same approach as search_text) in one call. Returns an is_referenced boolean for quick dead-code detection. Skips defining files in content search to avoid false positives. Supports both singular and batch modes.
This operation uses multiple heuristics, including:
- same-file co-location
- shared importers
- name-token overlap
This operation traverses inheritance chains upward and downward, including external bases not present in the index when they can be identified structurally.
This operation traverses the reverse import graph and inspects importers to estimate impacted files, separating confirmed from potential effects where possible.
This operation compares snapshots by (name, kind) and uses content_hash to report added, removed, and changed symbols.
These features mean the system is no longer limited to static symbol lookup. It increasingly supports lightweight code intelligence over the indexed repository while preserving the core retrieval-first design.
The tool surface is best described by capability domain rather than by a fixed count.
index_repoindex_folderindex_filelist_reposresolve_repoinvalidate_cache
get_repo_outlineget_file_treeget_file_outline— supports batch viafile_pathsarray parametersuggest_queries
get_file_contentget_symbol_sourceget_context_bundle
search_symbolssearch_textsearch_columnscheck_references— composite tool combining import + content search for dead-code detection
find_importers— supports batch viafile_pathsarray parameterfind_references— supports batch viaidentifiersarray parameterget_dependency_graphget_related_symbolsget_class_hierarchyget_blast_radiusget_symbol_diff
Several tools accept array parameters alongside their singular equivalents, enabling multiple queries in a single call. This reduces LLM round-trips and cache re-reads. Singular parameters continue to return the original flat response shape for backward compatibility. Batch mode returns a grouped results array. Response-level tip fields in _meta guide models toward batch usage without requiring external configuration.
The system also supports provider-aware or ecosystem-aware retrieval operations where appropriate, such as dbt-related column search in enriched contexts.
jCodeMunch is MCP-first, but not MCP-only.
A minimal CLI is provided over the shared index store, covering operations such as:
listindexoutlinesearchgettextfileinvalidate
The watch subcommand monitors one or more directories and performs incremental reindexing with debounce. The watch-claude subcommand builds on this for Claude Code's worktree workflow, using hook-driven events and/or git worktree list polling to discover worktrees automatically. Both use the same underlying store as the MCP server, so updates become immediately visible to MCP consumers.
Architecturally, CLI, watch, and watch-claude are all thin interfaces over the same parser, storage, indexing, and retrieval core.
All tool responses include an _meta envelope.
Representative fields include:
timing_msreposymbol_counttruncatedtokens_savedtotal_tokens_savedestimate_methodcost_avoided— estimated cost savings per model (dict with model names as keys)total_cost_avoided— cumulative cost savings across sessionpowered_by— attribution stringtip— guidance for models toward more efficient tool usage (e.g., batch params, composite tools)
Representative shape:
{
"result": "...",
"_meta": {
"timing_ms": 42,
"repo": "owner/repo",
"truncated": false,
"tokens_saved": 2450,
"total_tokens_saved": 184320,
"estimate_method": "...",
"cost_avoided": {"claude_opus_4_6": 0.012, "claude_sonnet_4_6": 0.007},
"total_cost_avoided": {"claude_opus_4_6": 968.32, "claude_sonnet_4_6": 580.99},
"powered_by": "jcodemunch-mcp by jgravelle",
"tip": "Tip: use file_paths=[...] to query multiple files in one call."
}
}Not all fields are present in every response. tip fields appear only in single-item responses and are stripped from batch results to keep them clean.
Running totals are persisted in ~/.code-index/_savings.json. Recent versions improved cross-process accounting so concurrent processes do not overwrite cumulative totals.
If enabled, tool calls can report a token-savings delta together with an anonymous install identifier to a global community counter. This reporting is intended to exclude repository names, paths, and source content.
Security is a first-class subsystem.
The system includes protections for:
- path traversal
- symlink validation
- secret-file exclusion
- binary file detection
- safe encoding reads using replacement handling
Recent security-oriented additions include:
- SSRF prevention for configurable API base URLs
- ReDoS protection in
search_text - symlink-safe temporary file handling
- HTTP bearer token authentication for HTTP transport
- absolute path redaction via
JCODEMUNCH_REDACT_SOURCE_ROOT
The security model therefore extends beyond file safety to include network and response-surface hardening.
Recent versions added or documented several performance-oriented mechanisms:
- streaming file indexing to reduce peak memory usage
- bounded heap search for better top-k scaling
- LRU index cache with mtime invalidation
- sidecars for cheaper repo listing
- incremental indexing with mtime fast-path
- watch-based updates
- centrality-aware ranking
- no-change short-circuiting via Git tree SHA
- SQLite WAL mode for concurrent read/write without locking
- optimized file discovery using pre-resolved path strings and inlined security checks (approximately 2x speedup)
- batch array parameters on high-call-count tools to reduce LLM round-trips and cache re-reads
The system is architected to stay responsive in larger repositories by:
- minimizing reparsing
- minimizing reloading
- minimizing retrieval payload size
- avoiding full result sorts when only top-k results are needed
- minimizing LLM round-trips through batch query support
The benchmark surface has expanded to include:
- a public benchmark corpus in
benchmarks/tasks.json - methodology documentation
- reproducible benchmark materials
- canonical
tiktoken-measured results
This is architecturally important because the system makes concrete efficiency claims. Reproducible benchmarking is therefore part of the product’s proof surface, not just a marketing artifact.
No retrieval system is without tradeoffs.
If the agent continues to open whole files instead of using jCodeMunch tools, the retrieval architecture is present but the efficiency benefits do not materialize.
The system is strong at locating, packaging, and relating code. It is not intended to replace all forms of semantic reasoning. Summaries and ranking improve retrieval quality, but the backbone remains structure-aware access.
Because the system serves from a local index and cached raw files, stale indexes can mislead unless verification or reindexing is used.
Different languages and ecosystem providers expose different kinds of metadata and symbols. The architecture is extensible, but capability depth varies by language and provider maturity.
As the tool surface expands, protocol clarity, result consistency, and documentation accuracy become more important and more difficult to maintain.
Symbol IDs follow this stable format:
{file_path}::{qualified_name}#{kind}
Examples:
src/main.py::UserService#class
src/main.py::UserService.login#method
src/utils.py::authenticate#function
config.py::MAX_RETRIES#constant
IDs remain stable across re-indexing as long as file path, qualified name, and kind remain unchanged. Duplicate overloads can receive numeric suffixes such as ~1, ~2.
Representative architectural dependencies include:
| Package | Role |
|---|---|
mcp |
MCP server framework |
httpx |
Async HTTP and GitHub access |
tree-sitter-language-pack |
Precompiled language grammars |
pathspec |
Gitignore pattern matching |
anthropic |
Claude-based summarization |
google-generativeai |
Gemini-based summarization |
pyyaml |
dbt schema and provider parsing |
watchfiles |
optional watch mode |
sqlite3 (stdlib) |
WAL-mode index storage backend |
These dependencies support the core architectural layers: protocol transport, repository acquisition, parsing, storage, summarization, file watching, and safe concurrent access.