LLMs Are Reading Code the Hard Way
When Claude or GPT tries to understand a large codebase, it does something slightly absurd: it reads files.
Entire files. Thousands of lines of syntax, indentation, and formatting, burning tokens and filling context windows. Most of what the model takes in says little about how the system actually works.
Humans don’t navigate codebases this way. After a few weeks on a project, developers stop thinking in files entirely. They think in architecture: which modules own which responsibilities, which entities call each other, where the system boundaries are.
They only dive into source code when something specific needs to change.
LLMs never get that representation. They get raw code.
Claude Already Knows It Needs a Map, But It's Drawing One from Scratch
I planted a single-line bug in SQLAlchemy’s unit-of-work system and gave Claude only the symptoms. Both runs found the same bug and proposed the same fix.
The difference was how they got there.
Without CodeIR, Claude spun up a subagent and started reading the codebase file by file, consuming 1.8 million tokens just to orient itself before it could reason about the problem. With CodeIR, it didn’t need to. It queried the IR, inspected a single entity’s behavior, pulled a small slice of source, and had the answer.
Same diagnosis. Same fix. Completely different path to get there.
Claude already tries to understand the system before acting. CodeIR just gives it the map.
What CodeIR Looks Like in Practice
CodeIR is a tool that builds an architectural map of a repository.
You index a repo once, and your coding agent can see the structure of the entire system: entities, relationships, and behavioral signatures, all before reading a single source file.
Instead of navigating through files, the model navigates through architecture.
How It Works Under the Hood
Instead of reading files and hoping to find the right ones, CodeIR gives the agent the whole system in view, with the ability to drill into any detail on demand.
CodeIR compiles a codebase into a hierarchical representation with three levels: Index, Behavior, and Source.
This allows models to orient themselves across thousands of entities before expanding only the specific code they need.
What Happens When the Model Actually Understands the System
I asked Claude how to handle tax rates that change mid-period in Tryton's reporting module. Both sessions found the surface problem immediately: reporting queries collapse transactions across rate changes into a single total. The baseline fixed the queries: group by rate, split the rows, done.
The CodeIR session followed the data further upstream and noticed something more fundamental: each tax line points to a live tax configuration record. There's no snapshot. If an admin edits a tax rate next Tuesday, every invoice that ever referenced it — 2024, 2023, whenever — silently rewrites itself in the next report run. The reporting fix would faithfully aggregate the wrong historical numbers.
So instead of patching the query it proposed capturing the effective rate on each tax line when the transaction is created. Once an invoice is posted its tax rate becomes a fact about that transaction, not a pointer to a configuration someone can change later.
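The difference between the two designs can be shown with a toy model (the class and field names here are hypothetical, not Tryton's):

```python
from dataclasses import dataclass

# Toy illustration of live-pointer vs. snapshot tax lines.
# Names are hypothetical, not Tryton's actual models.

@dataclass
class TaxConfig:
    rate: float  # live configuration an admin can edit at any time

@dataclass
class LiveTaxLine:
    base: float
    config: TaxConfig            # points at live config: history rewrites itself

    @property
    def tax(self) -> float:
        return self.base * self.config.rate

@dataclass
class SnapshotTaxLine:
    base: float
    effective_rate: float        # captured when the transaction is created

    @property
    def tax(self) -> float:
        return self.base * self.effective_rate

config = TaxConfig(rate=0.20)
live = LiveTaxLine(base=100.0, config=config)
snap = SnapshotTaxLine(base=100.0, effective_rate=config.rate)

config.rate = 0.25               # admin edits the rate "next Tuesday"

print(live.tax)  # 25.0 — the posted invoice silently changed
print(snap.tax)  # 20.0 — the historical fact is preserved
```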
Baseline reasoning stopped at the reporting layer. CodeIR reasoning traced the data lifecycle.
1,620 Lines to Find Five Methods
I asked Claude to design per-blueprint session overrides in Flask (83 files, 1,629 entities).
Both runs reached the same conclusion: sessions open before routing determines the blueprint, which creates a timing constraint for the design.
The baseline reached that insight by brute force, reading file after file.
The CodeIR run navigated to the same entities through architectural inspection and expanded only the methods it needed.
Same answer, far less source code.
Wait, what does Claude think?
Since Claude is the primary "user" of CodeIR, I didn't want to just guess if it helped. I asked him. During development, Claude's feedback was so specific that it actually shaped the feature set.
For example, Claude found that while entity search was great, he still missed the "vibe" of a classic grep. Based on that, we built `codeir grep`, which returns IR context alongside regex matches.
Here is what Claude had to say:
What worked well
- Search was fast and effective — `codeir search "flush finalize"` immediately surfaced `FNLZFLSHCHNGS` (`UOWTransaction.finalize_flush_changes`). The compressed entity IDs made results scannable.
- Behavior level was the sweet spot — `codeir show FNLZFLSHCHNGS --level Behavior` gave me:

  `MT FNLZFLSHCHNGS C=_register_persistent,_remove_newly_deleted,difference,items,set F=IR A=3 #DB #CORE`

  This told me exactly what methods it calls without reading source. I knew where to look next.
- Expand only when needed — I only used `codeir expand` twice. The Behavior level let me skip reading `_register_persistent` entirely until I needed to confirm the bug's impact.
- Category filtering — `--category core_logic` helped cut through the 38k entities to find state-related logic quickly.
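The compact Behavior line quoted above has an obvious mechanical structure, which is part of why it compresses so well. A hypothetical parser — the field meanings are inferred from the transcripts (`C=` callees, `F=` behavior flags, `A=` a count, `#` tags), not a documented format:

```python
# Hypothetical parser for the compact Behavior line shown above.
# Field meanings are inferred from the transcripts, not from a spec.

def parse_behavior_line(line: str) -> dict:
    tokens = line.split()
    record = {
        "kind": tokens[0],   # e.g. MT (method) or CLS (class)
        "id": tokens[1],     # compressed entity ID
        "calls": [],
        "flags": "",
        "arity": 0,
        "tags": [],
    }
    for tok in tokens[2:]:
        if tok.startswith("C="):
            record["calls"] = tok[2:].split(",")
        elif tok.startswith("F="):
            record["flags"] = tok[2:]
        elif tok.startswith("A="):
            record["arity"] = int(tok[2:])
        elif tok.startswith("#"):
            record["tags"].append(tok)
    return record

rec = parse_behavior_line(
    "MT FNLZFLSHCHNGS C=_register_persistent,_remove_newly_deleted,"
    "difference,items,set F=IR A=3 #DB #CORE"
)
print(rec["kind"], rec["arity"], rec["tags"])  # MT 3 ['#DB', '#CORE']
```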
What could improve
- Semantic search — `codeir search "list_only"` returned nothing because the term isn't in entity names. A grep-style content search would have helped here.

  → At Claude's suggestion, CodeIR now recommends a content search whenever an entity search returns no results.
- The ID compression is aggressive — `FNLZFLSHCHNGS` took me a moment to parse as `finalize_flush_changes`. Readable once you get used to it, but there's a learning curve.
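The transcript's IDs look like vowel-stripped names. Here is one guess at such a scheme — it reproduces the two snake_case examples in the transcripts, but it is not CodeIR's actual algorithm:

```python
def compress_id(name: str) -> str:
    # Hypothetical guess at the ID scheme: uppercase the name and drop
    # vowels, keeping the first letter of each underscore-separated word.
    # Reproduces the transcript examples, but is NOT CodeIR's documented
    # algorithm.
    parts = []
    for word in name.split("_"):
        if not word:
            continue
        kept = word[0] + "".join(c for c in word[1:] if c.lower() not in "aeiou")
        parts.append(kept.upper())
    return "".join(parts)

print(compress_id("finalize_flush_changes"))  # FNLZFLSHCHNGS
print(compress_id("is_cached"))               # ISCCHD
```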
Overall
The Orient → Behavior → Source workflow worked exactly as designed. I found the bug in ~6 queries across a 663-file, 38k-entity codebase. Without CodeIR, I'd have been grepping through the ORM internals much longer.
Ha! Happy to give feedback on CodeIR.
What Worked Well
- Search was effective — `codeir search descriptor cache related` quickly surfaced `ForwardManyToOneDescriptor.is_cached` and related entities. The multi-term OR ranking worked well.
- Behavior level was genuinely useful — Seeing `CLS FRWRDMNYTNDSCRPTR C=ColPairs,RelatedObjectDoesNotExist,TupleIn,ValueError,_traverse_ancestors,allow_relation F=EILRT A=23` told me immediately this class does error handling, has conditionals/loops/returns/try-except, and talks to specific collaborators — without reading 200 lines of source.
- The expand → source workflow — Going from Behavior to exact source with `codeir expand ISCCHD.03` was fast. I knew what I was looking for before reading the code.
What I Ended Up Doing Anyway
- Fell back to Grep/Read for grep-style exploration — When I wanted to see all usages of `_result_cache` in query.py with context, I used Grep directly. CodeIR search is entity-oriented, which is great for "find the function that does X" but less suited for "show me every line mentioning Y."
- Direct file reads for understanding flow — For the `__get__` method, I read a chunk of the file to see the full logic flow. The Behavior level told me what it calls, but I needed source to understand when and why.
Suggestions
- `codeir grep` — A wrapper that returns IR context alongside grep matches could bridge the gap.

  → We added this feature at Claude's suggestion, and he uses it often.
- Caller results were good — I didn't use callers here, but the `~` fuzzy marker is a nice touch for uncertain matches.
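The idea behind `codeir grep` — regex matches annotated with the enclosing entity — can be sketched in a few lines. This is a toy, not the actual implementation: the "index" here is just a dict mapping hypothetical entity IDs to line ranges.

```python
import re

# Toy sketch of the `codeir grep` idea: annotate regex matches with the
# enclosing entity from an index. The entity IDs and line-range index
# below are invented for illustration.

def grep_with_ir(source: str, pattern: str,
                 index: dict[str, range]) -> list[tuple[int, str, str]]:
    """Return (line_number, entity_id, line) for every matching line."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        if re.search(pattern, line):
            entity = next((eid for eid, rng in index.items() if lineno in rng), "?")
            hits.append((lineno, entity, line.strip()))
    return hits

source = """\
class QuerySet:
    def __iter__(self):
        self._fetch_all()
    def _fetch_all(self):
        if self._result_cache is None:
            self._result_cache = list(self._iterable_class(self))
"""
index = {"QRYST.01": range(2, 4), "QRYST.02": range(4, 7)}  # entity -> lines

for hit in grep_with_ir(source, r"_result_cache", index):
    print(hit)
```

Instead of a bare line number, every match arrives with the entity it lives in, so the agent can jump straight from a grep hit back into the IR.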
Overall
It shortened the path from "where does Django cache related objects?" to "here's the exact line" significantly. The Django codebase is 41K entities — without the index, I'd have been grepping blind.
If you use the tool, we would love to hear about Claude's experience. We encourage you to ask him and share his feedback (and yours) at SOMEWHERE.