---
tags: [codegraph, pre-indexing, tool-calls, agent-efficiency, sqlite, tree-sitter, knowledge-graph]
related:
  - 019_three_token_debts.md
  - 011_nlp_first_codebase.md
  - 023_living_codemap.md
supersedes: ~
status: current
---

032 — Pre-Index Your Codebase Before the Agent Needs It

The Problem

Every time an AI agent explores an unfamiliar part of your codebase, it burns tool calls. Not one or two, but dozens. A single “where is the auth middleware wired?” question triggers a chain: list directory, read file, grep for import, read another file, grep for the function, read the call site. Repeat that across four files, and you’ve spent 20–30 tool calls on a question you could answer from memory in 10 seconds.

The agent isn’t being inefficient. It has no memory of your codebase. Every session starts cold. Every exploration is from scratch.

The fix isn’t a smarter agent. It’s giving the agent a pre-built map instead of a raw filesystem.

What CodeGraph Does

CodeGraph is a local MCP server that indexes your codebase into a SQLite knowledge graph before the agent ever opens a file. When you run codegraph init, it:

Uses git ls-files to find all tracked source files
Parses each file through tree-sitter — a deterministic AST parser, no LLM involved
Extracts nodes (functions, classes, routes, components) and edges (calls, imports, extends, implements)
Stores everything in SQLite with FTS5 full-text search
Registers OS file watchers for incremental re-indexing on change
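
The steps above can be sketched in a few lines. This is a minimal illustration, not CodeGraph's implementation: it swaps in Python's built-in `ast` module where CodeGraph uses tree-sitter, indexes Python files only, and the `nodes`/`edges` schema and the function names `tracked_files` and `build_index` are assumptions made for the example.

```python
import ast
import sqlite3
import subprocess


def tracked_files(repo_dir: str) -> list[str]:
    """Return the paths git tracks in repo_dir (the file-discovery step)."""
    result = subprocess.run(
        ["git", "ls-files"], cwd=repo_dir, capture_output=True, text=True, check=True
    )
    return result.stdout.splitlines()


def build_index(sources: dict[str, str]) -> sqlite3.Connection:
    """Index {path: source text} into nodes and call edges (hypothetical schema)."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE nodes (name TEXT, kind TEXT, file TEXT, line INTEGER)")
    db.execute("CREATE TABLE edges (src TEXT, dst TEXT, kind TEXT)")
    for path, source in sources.items():
        for node in ast.walk(ast.parse(source, filename=path)):
            if isinstance(node, ast.ClassDef):
                db.execute("INSERT INTO nodes VALUES (?, 'class', ?, ?)",
                           (node.name, path, node.lineno))
            elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
                db.execute("INSERT INTO nodes VALUES (?, 'function', ?, ?)",
                           (node.name, path, node.lineno))
                # Record only direct calls by bare name; method calls and
                # dynamic dispatch are out of scope for this sketch.
                for inner in ast.walk(node):
                    if isinstance(inner, ast.Call) and isinstance(inner.func, ast.Name):
                        db.execute("INSERT INTO edges VALUES (?, ?, 'calls')",
                                   (node.name, inner.func.id))
    db.commit()
    return db
```

In a real run, the `sources` mapping would be filled by reading each path returned by `tracked_files`. The full-text search and file-watcher steps are left out here to keep the sketch short.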

Then it exposes 8 MCP tools to the agent:

| Tool | What it answers |
| --- | --- |
| `codegraph_search` | Where is symbol X defined? |
| `codegraph_callers` | What calls this function? |
| `codegraph_callees` | What does this function call? |
| `codegraph_impact` | What breaks if I change this? |
| `codegraph_node` | Show me the source + signature |
| `codegraph_context` | Give me everything relevant to this task |
| `codegraph_explore` | Investigate this unfamiliar pattern |
| `codegraph_files` | List a directory |

A single codegraph_context("auth middleware") call returns the middleware function, what it calls, what calls it, and the source for each, assembled as one markdown block. The agent reads that block once instead of reaching the same picture through 20 file reads.

Zero Tokens to Produce the Intelligence

The critical point: CodeGraph produces all of this with zero LLM calls.

Every operation is static analysis:

tree-sitter parses each file into an AST deterministically from grammar rules
SQLite FTS5 ranks matches using BM25 (a term-frequency ranking function, no model)
Graph traversal is BFS/DFS over stored edges
Ranking uses heuristics: co-location boosting, path relevance scoring, brevity preference
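
To make the "no model" point concrete, here is a small sketch of BM25-ranked symbol search using SQLite's FTS5 extension from Python. The table layout, the sample rows, and the function names are assumptions for illustration, not CodeGraph's actual schema, and the sketch assumes the local SQLite build includes FTS5 (most do).

```python
import sqlite3


def build_search_index(rows):
    """Create an in-memory FTS5 table of symbols (hypothetical layout)."""
    db = sqlite3.connect(":memory:")
    # `file` is stored but not tokenized, so matches come from name/signature only.
    db.execute("CREATE VIRTUAL TABLE symbols USING fts5(name, signature, file UNINDEXED)")
    db.executemany(
        "INSERT INTO symbols (name, signature, file) VALUES (?, ?, ?)", rows
    )
    return db


def search(db, query, limit=5):
    """Return (name, file) pairs, best match first.

    bm25() yields lower values for better matches, so ascending order
    puts the most relevant row at the top.
    """
    return db.execute(
        "SELECT name, file FROM symbols WHERE symbols MATCH ? "
        "ORDER BY bm25(symbols) LIMIT ?",
        (query, limit),
    ).fetchall()


# Sample rows; all names and paths are made up for the example.
ROWS = [
    ("auth_middleware", "def auth_middleware(request, next_handler)", "app/middleware/auth.py"),
    ("register_routes", "def register_routes(app)", "app/routes.py"),
    ("login_handler", "def login_handler(request)", "app/auth/login.py"),
]

db = build_search_index(ROWS)
print(search(db, "auth middleware"))
```

The default FTS5 tokenizer splits on underscores, which is why a two-word query finds `auth_middleware`. A camelCase name would be indexed as a single token, so a real indexer would likely split identifiers before inserting them.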

The intelligence comes from the structure of the code itself, not from any model reasoning about it. CodeGraph knows that saveOrder calls normalizeOrder because it read the AST and stored that edge — not because it understood what either function does.
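
Once edges like "saveOrder calls normalizeOrder" are stored, an impact question reduces to a graph walk. The sketch below runs a breadth-first search over reversed call edges. The edge list, the extra function names (`checkoutHandler`, `importOrders`, `trimFields`), and `impacted_by` itself are hypothetical; a real index would read the edges from its database.

```python
from collections import deque


def impacted_by(edges, changed):
    """Return every direct and transitive caller of `changed`, nearest first."""
    # Reverse the edges: for each callee, which functions call it?
    callers_of = {}
    for caller, callee in edges:
        callers_of.setdefault(callee, set()).add(caller)

    seen = {changed}
    queue = deque([changed])
    order = []
    while queue:
        current = queue.popleft()
        # Sort so the output is stable from run to run.
        for caller in sorted(callers_of.get(current, ())):
            if caller not in seen:
                seen.add(caller)
                order.append(caller)
                queue.append(caller)
    return order


# (caller, callee) pairs, as an indexer might have extracted them.
EDGES = [
    ("saveOrder", "normalizeOrder"),
    ("checkoutHandler", "saveOrder"),
    ("importOrders", "saveOrder"),
    ("normalizeOrder", "trimFields"),
]

print(impacted_by(EDGES, "normalizeOrder"))
```

The `seen` set keeps the walk from looping on recursive or mutually recursive functions, which are common in real call graphs.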

This is why the reported benchmark numbers (94% fewer tool calls, 77% faster task completion) are credible: the agent swaps 26–52 Read/Grep calls for 1–6 MCP tool calls that return pre-assembled context. Token count drops because less content flows into the context, not because the model is faster.

The Pre-Index Pattern

CodeGraph is one implementation. The underlying pattern is general:

Build a queryable index of your codebase once. Let the agent query it instead of the filesystem.

The index can be anything — SQLite, a JSON file, a vector database — as long as the agent can get a structured answer without reading multiple files.

What makes this work:

The index is built offline. No token cost during the session. The session only pays for queries, and queries return compact structured data.
Queries compose. One call can return a symbol, its callers, and its dependencies. That’s three separate lookups collapsed into one MCP call.
The index stays current. Git hooks or file watchers re-index only changed files — typically 1–5 files per commit, not the whole codebase.
The agent stops guessing paths. Without an index, the agent has to explore to find things. With an index, it asks directly.
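
The "stays current" property can be approximated without hooks or watchers. The sketch below compares content hashes against the last indexed state and reports only the files that need re-parsing. This is a swapped-in technique for illustration, not how CodeGraph detects changes, and `files_to_reindex` is a hypothetical name.

```python
import hashlib


def files_to_reindex(files, known_hashes):
    """Return paths whose content changed since the last call.

    `files` maps path -> current bytes; `known_hashes` maps path -> the
    SHA-256 digest recorded at the previous indexing run and is updated
    in place.
    """
    stale = []
    for path, content in files.items():
        digest = hashlib.sha256(content).hexdigest()
        if known_hashes.get(path) != digest:
            known_hashes[path] = digest
            stale.append(path)
    # Forget files that no longer exist so their rows can be dropped too.
    for path in set(known_hashes) - set(files):
        del known_hashes[path]
    return sorted(stale)


hashes = {}
repo = {"a.py": b"x = 1\n", "b.py": b"y = 2\n"}
print(files_to_reindex(repo, hashes))  # first run: every file is new
repo["b.py"] = b"y = 3\n"
print(files_to_reindex(repo, hashes))  # later run: only the edited file
```

Hashing every file still costs a read per file, so a hook or watcher that names the changed paths directly is cheaper on large repositories. The hash check is a reasonable fallback when neither is available.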

Where to Apply This in Your Own Tooling

CodeGraph is tree-sitter-based — it knows structure but not intent. If you already have semantic metadata in your codebase (docstrings, annotations, contract blocks), your index can know both.

For contract-annotated codebases: If every public function has a @contract block with @does, @tags, and @reuse-when fields, those fields are indexable with TF-IDF today. A query like dar find "normalize order payload" searches intent rather than symbol names. CodeGraph and a contract index answer different questions, and they are more useful combined than either is alone.
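
As a rough sketch of what intent search over contract fields could look like, the example below scores contract text with a simple smoothed TF-IDF. The contract contents, the way fields are flattened into one string, and the `find` function are all assumptions for illustration; they do not describe how the dar tool works internally.

```python
import math
import re
from collections import Counter

# Hypothetical contract text per function, flattened to one string each.
CONTRACTS = {
    "normalize_order": "@does normalize an incoming order payload "
                       "@tags order payload validation "
                       "@reuse-when accepting orders from a new source",
    "save_order": "@does persist a normalized order to the database "
                  "@tags order storage @reuse-when writing orders",
    "send_receipt": "@does email a receipt to the customer "
                    "@tags email notification @reuse-when confirming a purchase",
}


def tokenize(text):
    """Lowercase, drop the @field markers, and split into words."""
    text = re.sub(r"@[\w-]+", " ", text.lower())
    return re.findall(r"[a-z]+", text)


def find(query, contracts, limit=3):
    """Rank contracts by the summed TF-IDF weight of the query terms."""
    docs = {name: Counter(tokenize(body)) for name, body in contracts.items()}
    total = len(docs)

    def idf(term):
        # Smoothed inverse document frequency: rarer terms weigh more.
        df = sum(1 for counts in docs.values() if term in counts)
        return math.log((1 + total) / (1 + df)) + 1

    scored = []
    for name, counts in docs.items():
        score = sum(counts[term] * idf(term) for term in tokenize(query))
        if score > 0:
            scored.append((name, round(score, 3)))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:limit]


print(find("normalize order payload", CONTRACTS))
```

Here the query ranks `normalize_order` first because it matches all three terms, while `save_order` matches only "order". No stemming is applied, so "normalized" and "orders" do not match their base forms.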

For any codebase: Even a simple symbol table — function name, file, line number, docstring — is enough to cut exploration calls significantly. You don’t need a full graph to see improvement. Start with search; add graph traversal when you feel the gap.
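
A symbol table of that kind fits in a few dozen lines. The sketch below builds one for Python sources with the standard `ast` module and answers lookups by name or docstring text. The `Symbol` record and the function names are hypothetical, and other languages would need their own parser.

```python
import ast
from dataclasses import dataclass


@dataclass(frozen=True)
class Symbol:
    name: str
    file: str
    line: int
    doc: str


def symbol_table(sources):
    """Build a flat list of Symbol records from {path: source text}."""
    table = []
    for path, source in sources.items():
        for node in ast.walk(ast.parse(source, filename=path)):
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
                table.append(
                    Symbol(node.name, path, node.lineno, ast.get_docstring(node) or "")
                )
    return table


def lookup(table, term):
    """Case-insensitive substring match on symbol name or docstring."""
    term = term.lower()
    return [s for s in table if term in s.name.lower() or term in s.doc.lower()]


SOURCE = '''def normalize_order(order):
    """Clean an incoming order payload."""
    return order
'''

table = symbol_table({"orders.py": SOURCE})
for hit in lookup(table, "payload"):
    print(f"{hit.file}:{hit.line} {hit.name} - {hit.doc}")
```

Serialized to JSON and exposed through one search tool, a table like this already replaces the list-read-grep chain for "where is X defined?" questions.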

The practical entry point: Run codegraph init -i in the most complex package you own — the one where “where is X wired?” costs you the most tool calls. Measure the difference over one week of sessions. The case for expanding to other packages writes itself.

Practical Takeaway

If your agent spends more than 5 tool calls answering “where is X defined?”, you’re paying a tax that a pre-built index eliminates. The index doesn’t need to be sophisticated — it needs to exist.

CodeGraph is the fastest path to a working structural index, and it supports 19+ languages. For intent-level search, pair it with semantic contracts. Use the structural layer for “where and what,” the semantic layer for “why and when to reuse.”

The session that starts with a query against a pre-built index is fundamentally cheaper than the session that starts with an empty context. Build the index once; every session after that benefits.