The Basics
IR stands for Intermediate Representation. In compiler design it's a data structure that represents source code in a way that's easier for machines to analyze and optimize. CodeIR is an IR specifically designed for representing Python codebases in a compact, structured format that LLMs can reason over without reading raw source.
Indexing is pure code — AST parsing, classification, and IR generation. No LLM summarization, no API calls. A mid-sized repo like Flask (83 files) indexes in seconds. Larger codebases like Django or SQLAlchemy take a bit longer but still finish in under a minute on typical hardware.
Re-indexing is incremental: only changed files get reprocessed.
CodeIR uses Python's AST module to extract every function, async function, class, method, and module-level constant. Each gets a stable ID, boundary markers (start/end line), and import analysis.
Nested functions, decorated wrappers, and class attributes are all captured. Lambda assignments are captured as module-level constants. Each entity gets its own IR at all three levels:
```
Index:    FN RDTKN #HTTP #CORE
Behavior: FN RDTKN C=session_get F=IR A=2 #HTTP #CORE
Source:   [FN RDTKN @auth/tokens.py:47]
          async def read_token(token: str, session: Session) -> TokenData:
              ...
```
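The extraction step described above can be sketched with Python's built-in `ast` module. This is an illustrative sketch, not CodeIR's actual implementation; the field names are hypothetical.

```python
import ast

SOURCE = '''
async def read_token(token, session):
    data = await session.get(token)
    return data
'''

def extract_entities(source: str):
    """Walk the AST and record each function/class with its boundary markers."""
    entities = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            entities.append({
                "name": node.name,
                "kind": type(node).__name__,
                "start": node.lineno,      # start boundary
                "end": node.end_lineno,    # end boundary (Python 3.8+)
            })
    return entities

print(extract_entities(SOURCE))
# One entry for read_token, kind AsyncFunctionDef, with its line span
```

`ast.walk` also visits nested definitions, which is how nested functions and methods get captured with no extra machinery.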
No. Summarization is lossy and non-deterministic — you get a different summary each time, and you can't verify it against source. CodeIR is compilation: deterministic, lossless at each level, and invertible (you can always expand back to source).
Every field in the IR is mechanically extracted from the AST: `C=` lists actual call sites, `F=` flags are derived from control-flow structures, and `A=` counts real assignments. Nothing is inferred or paraphrased.
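Mechanical extraction of this kind can be sketched with an `ast.NodeVisitor`. The field mapping here is illustrative, assuming `C=` comes from call nodes and `A=` from assignment nodes:

```python
import ast

class BehaviorExtractor(ast.NodeVisitor):
    """Derive behavior-level fields from a function body (illustrative sketch)."""
    def __init__(self):
        self.calls = []       # would feed C= (actual call sites)
        self.assigns = 0      # would feed A= (real assignment count)
        self.returns = False  # would feed an F= flag

    def visit_Call(self, node):
        if isinstance(node.func, ast.Name):
            self.calls.append(node.func.id)
        elif isinstance(node.func, ast.Attribute):
            self.calls.append(node.func.attr)
        self.generic_visit(node)

    def visit_Assign(self, node):
        self.assigns += 1
        self.generic_visit(node)

    def visit_Return(self, node):
        self.returns = True
        self.generic_visit(node)

fn = ast.parse("def f(x):\n    y = g(x)\n    return y.strip()").body[0]
ex = BehaviorExtractor()
ex.visit(fn)
print(ex.calls, ex.assigns, ex.returns)  # ['g', 'strip'] 1 True
```

Because every field is computed this way, two runs over the same source always produce the same IR.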
It's both. Yes, it's a CLI tool you install and use today. But underneath, it's a representation layer — a way to encode programs for machine reasoning instead of human editing.
The real long-term implication is bigger: models may eventually reason about code primarily through IR, not source, exactly like compilers do. Source code is the human-readable format. IR is the machine-reasoning format. CodeIR is an early implementation of that idea.
Token Economics
Compression depends on repo size. Flask (1,629 entities) compresses about 8:1 — roughly 148k tokens of raw Python down to 19k in the Index. Larger codebases compress better: Tryton hits 13:1, SQLAlchemy 11:1, Django 10:1.
But the point isn't "make the repo smaller." It's that Claude can hold the entire codebase's structure and behavior in context, then retrieve source only for the specific entities it needs to edit.
In our SQLAlchemy test, Claude found a bug across a 663-file codebase using 82% fewer input tokens than the baseline — 368k vs 2.06M. Instead of reading source files to orient, it queried the IR, inspected one behavior summary, and expanded the 20 lines of source that mattered.
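The headline figure checks out arithmetically from the two token counts reported above:

```python
baseline_tokens = 2_060_000   # baseline run: orienting by reading raw source
codeir_tokens = 368_000       # IR-first run: query Index, expand only what matters

savings = 1 - codeir_tokens / baseline_tokens
print(f"{savings:.0%}")  # 82%
```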
Large codebases don't exhaust LLM context simply because they contain a lot of code. They exhaust it because source code is a poor representation for architectural reasoning. Even with a 200k-token window, raw code doesn't scale well.
Consider what the model actually sees:
```python
def get_user_permissions(user):
    if user.is_superuser:
        return Permission.objects.all()
```
The LLM processes 25–40 tokens here. But the real architectural information is:
```
get_user_permissions → checks user.is_superuser → queries Permission model
```
Three facts. Everything else is syntax noise. In a typical Python file, 80% or more of tokens carry zero architectural signal.
Even if context windows became infinite, there's still attention dilution. Transformers distribute attention across tokens. If a prompt contains 80% boilerplate and 20% structure, the model spends most of its compute on the wrong thing. CodeIR flips that ratio to roughly 80% structure and 20% metadata.
At Index level, CodeIR fits roughly 18,000 entities per 200k context window. For reference, Django is ~41,000 entities and Tryton is ~20,000.
For codebases that exceed this, per-category bearings files break the architectural map into slices. The agent loads only the categories relevant to its current task — it never needs the whole Index at once.
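A back-of-envelope check using the figures above (roughly 18,000 Index entities per 200k window, so on the order of 11 tokens per entity) shows which of the tested repos need slicing:

```python
import math

ENTITIES_PER_WINDOW = 18_000  # approximate Index-level capacity per 200k tokens

def windows_needed(entity_count: int) -> int:
    """How many context windows the full Index would span."""
    return math.ceil(entity_count / ENTITIES_PER_WINDOW)

print(windows_needed(1_629))    # Flask: fits in one window
print(windows_needed(20_457))   # Tryton: needs slicing
print(windows_needed(41_819))   # Django: needs slicing
```

Flask fits whole; Tryton and Django are exactly the cases where per-category bearings files come in.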
Architecture & Accuracy
Behavior-level IR captures call relationships (`C=`), inheritance (`B=`), and flags for behavioral patterns (`F=EIR`). Domain classification tags (`#AUTH`, `#CORE`) group entities by architectural role.
The `callers` command provides three-tier caller resolution:

```
⏺ codeir callers OPNSSSN
⎿ APPCNTXT.02  AppContext               src/flask/ctx.py      [import]
  GTSSSN       AppContext._get_session  src/flask/ctx.py      [local]
  ~FLSKCLNT    FlaskClient              src/flask/testing.py  [fuzzy]
```

Import-level callers are resolved via import statements. Local callers are same-file references. Fuzzy callers (marked `~`) are repo-wide name matches when resolution is ambiguous.
CodeIR resolves call relationships using three tiers of confidence:
- Local: Same-file calls with explicit references — highest confidence
- Import: Calls resolved via import statements — high confidence
- Fuzzy: Repo-wide name matches when resolution is ambiguous — marked with `~` so the model knows to verify
The tiered approach means the model always knows how much to trust a relationship. Fuzzy matches are explicitly flagged rather than silently presented as certain.
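The tiers form a simple resolution cascade, highest confidence first. The data structures below are illustrative, not CodeIR's actual schema:

```python
def resolve_caller(name, same_file_defs, imported_names, repo_names):
    """Resolve a referenced name to a caller tier, trying tiers in confidence order."""
    if name in same_file_defs:
        return (name, "local")        # same-file, explicit reference
    if name in imported_names:
        return (name, "import")       # resolved via an import statement
    if name in repo_names:
        return ("~" + name, "fuzzy")  # repo-wide name match, flagged with ~
    return (name, "unresolved")

print(resolve_caller("open_session", {"open_session"}, set(), set()))
# ('open_session', 'local')
print(resolve_caller("FlaskClient", set(), set(), {"FlaskClient"}))
# ('~FlaskClient', 'fuzzy')
```

Because the `~` prefix travels with the ID, the uncertainty survives into whatever context the model later reads.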
CodeIR is based on static AST analysis and does not capture runtime behaviors such as monkey-patching, dynamically generated methods, `setattr`-based attribute creation, or metaclass magic that generates methods at class creation time.

Heavy use of `__getattr__`, dynamic dispatch, or code generation (e.g., `exec`/`eval`) will produce gaps in the call graph. The fuzzy caller tier partially compensates by catching name-based matches, but it's not a substitute for runtime analysis.
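A concrete example of the gap: both calls below reach `handler`, but only the first appears as an attribute call in the AST.

```python
class Service:
    def handler(self):
        return "handled"

svc = Service()
svc.handler()                   # visible to static analysis: an Attribute call node

method_name = "han" + "dler"
getattr(svc, method_name)()     # invisible: the target name only exists at runtime
```

Static analysis sees only a call to `getattr` in the second case; the edge to `handler` never enters the call graph.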
Run `codeir index` again. Re-indexing is incremental — only modified files are reprocessed. Entity IDs remain stable as long as the entity's name and location don't change, so the model's "mental map" persists across re-indexes.
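Incremental re-indexing of this kind is typically implemented by hashing file contents and reprocessing only files whose hash changed. This sketch shows the idea, not CodeIR's actual storage schema:

```python
import hashlib

def file_hash(content: str) -> str:
    return hashlib.sha256(content.encode()).hexdigest()

def files_to_reindex(current: dict, stored_hashes: dict) -> list:
    """Return paths whose content hash differs from the stored value."""
    return [path for path, content in current.items()
            if stored_hashes.get(path) != file_hash(content)]

stored = {"app.py": file_hash("def f(): pass\n")}
current = {
    "app.py": "def f(): pass\n",     # unchanged: hash matches, skipped
    "views.py": "def g(): pass\n",   # new file: no stored hash, reprocessed
}
print(files_to_reindex(current, stored))  # ['views.py']
```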
If an entity is renamed or moved, it gets a new ID. The old ID simply stops resolving. There's no migration step.
Comparisons
Most code indexers (ctags, LSP, Sourcegraph) are designed for navigation: "jump to definition," "find references." They answer where is this thing?
CodeIR is designed for architectural reasoning: "what does this entity do, what does it depend on, and what breaks if I change it?" It answers how does this system work?
The output format reflects this. Indexers produce symbol tables. CodeIR produces multi-level IR with behavioral flags, call graphs, and domain classification — a representation an LLM can reason over without reading source.
Retrieval assumes you already know what to search for. But architectural understanding comes first — you need to know the shape of the system before you can ask the right questions.
In our benchmarks, embedding-based retrieval accuracy actually decreased as models got stronger (Opus scored 33% on MPNet retrieval vs. Haiku's 71%). Stronger models are better at recognizing when retrieved fragments don't give them enough context. They hedge instead of guessing.
CodeIR's orient-first approach gives the model the architectural map before it retrieves anything. It knows what to look for because it's already seen the whole system at Index level.
We have early data on this. The original benchmarks tested Haiku, GPT 4.1, and DeepSeek alongside Sonnet and Opus. CodeIR improved accuracy across all models, though the delta was largest on stronger models (where baselines degrade most due to the model's increased ability to recognize insufficient context).
The data needs re-running with corrected tags, but the question is already in our research plan.
Design Decisions
Because it's v1. The current scheme uses vowel-stripped abbreviations (`FNLZFLSHCHNGS` = `finalize_flush_changes`) to maximize the number of entities that fit in a context window.
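Judging from the examples in this document, the scheme keeps the first letter and strips subsequent vowels and underscores. A sketch of that rule (the exact algorithm may differ):

```python
def vowel_strip_id(name: str) -> str:
    """Keep the first letter, drop later vowels and non-letters (illustrative rule)."""
    letters = [c for c in name if c.isalpha()]
    head, rest = letters[0], letters[1:]
    return (head + "".join(c for c in rest if c.lower() not in "aeiou")).upper()

print(vowel_strip_id("finalize_flush_changes"))  # FNLZFLSHCHNGS
print(vowel_strip_id("read_token"))              # RDTKN
print(vowel_strip_id("AppContext"))              # APPCNTXT
```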
We're running interactive tests with Claude to determine the minimum LLM-understandable ID length — which does not necessarily correspond to human readability. The goal is the shortest ID that a model can reliably map back to the entity it represents. Early results suggest there's room to shorten further without hurting model accuracy.
Claude is the first. We are actively working on Codex support. The underlying IR format and CLI are model-agnostic; the only Claude-specific parts are the prompt templates and evaluation scripts, which will be adapted for other models.
Yes, and it gets better with scale. Compression improves with repository size because larger codebases have more structural redundancy: hundreds of model classes following the same patterns, view functions with identical structures, test files with standardized layouts.
Codebases we've tested:
| Repository | Files | Entities |
|---|---|---|
| Flask | 83 | 1,629 |
| SQLAlchemy | 663 | 38,672 |
| Django | 2,894 | 41,819 |
| Tryton | 2,375 | 20,457 |
Storage is SQLite with WAL mode. Indexing is incremental. There's no practical upper bound we've hit yet.
The current implementation targets Python via the built-in `ast` module. But the architecture is language-agnostic — the IR format, the multi-level hierarchy, the entity ID scheme, and the CLI all work independently of the parser.
Implementing support for another language means writing a new AST extractor on top of an existing parser (Tree-sitter, TypeScript AST, Roslyn for C#, etc.). The rest of the pipeline stays the same.
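The language boundary can be pictured as a single extractor interface that each front end implements. This `Protocol` is a hypothetical illustration, not CodeIR's actual API:

```python
from dataclasses import dataclass
from typing import Protocol

@dataclass
class Entity:
    id: str            # stable entity ID
    kind: str          # e.g. FN, CLS
    path: str
    start: int         # boundary markers
    end: int
    calls: list        # names of call sites, feeding C=

class Extractor(Protocol):
    """Language-specific front end; everything downstream is shared."""
    def extract(self, path: str, source: str) -> list: ...

# A TypeScript front end would wrap Tree-sitter here; a C# one, Roslyn.
# The IR generator, ID scheme, and CLI consume the Entity list unchanged.
e = Entity("RDTKN", "FN", "auth/tokens.py", 47, 52, ["session_get"])
print(e.kind, e.path)
```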
Speculative, but intriguing. CodeIR provides a higher signal-to-noise representation of programs: behavioral patterns, call graphs, domain classification, and inheritance hierarchies — all extracted mechanically and presented in a compact, consistent format.
A model trained on Behavior-level IR alongside source might develop stronger architectural reasoning than one trained on source alone. The IR encodes exactly the kind of structural relationships that current models struggle to extract from raw syntax. Whether this translates to better downstream performance is an open research question.
Let's talk.