CodeIR

Source code is for humans. IR is for machines.

LLMs Are Reading Code the Hard Way

When Claude or GPT tries to understand a large codebase, it does something slightly absurd: it reads files.

Entire files. Thousands of lines of syntax, indentation, and formatting, burning tokens and filling context windows. Most of what the model takes in says little about how the system actually works.

Humans don’t navigate codebases this way. After a few weeks on a project, developers stop thinking in files entirely. They think in architecture: which modules own which responsibilities, which entities call each other, where the system boundaries are.

They only dive into source code when something specific needs to change.

LLMs never get that representation. They get raw code.

Why Source Code Is the Wrong Format

The problem isn't just volume; it's what kind of information fills the context window. Consider what a transformer actually processes when it sees a Python function:

def get_user_permissions(user):
    if user.is_superuser:
        return Permission.objects.all()

To the model, this becomes roughly 30 tokens: def, get, _, user, _, permissions... and so on. But the actual architectural information is:

get_user_permissions → checks user.is_superuser → queries Permission model

Three facts. Everything else is syntax noise. In a typical Python file, 80% or more of tokens carry no architectural signal. Source code is a representation optimized for human editors and machine compilers. It was never designed for machine reasoning.

The Numbers Behind the Problem

Scale of the problem across real codebases:

| Repository | Entities | Raw Python (est.) | CodeIR Index | Compression |
|---|---|---|---|---|
| Flask | 1,629 | ~148k | 19k | 8:1 |
| Tryton | 20,457 | ~2.8M | 214k | 13:1 |
| SQLAlchemy | 38,672 | ~5.0M | 467k | 11:1 |
| Django | 41,819 | ~4.7M | 475k | 10:1 |

Token estimates use 1 token ≈ 4 characters. Actual counts may be slightly lower.

At Index level, CodeIR fits roughly 18,000 entities in a 200k context window. The same window holds fewer than 2,000 entities as raw source, and only if you could perfectly select which files to load, which in practice you can't.
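These figures follow from simple arithmetic over the table above. A quick sketch, using the same rough token estimates, so treat the outputs as approximate:

```python
# Rough entities-per-context-window arithmetic, based on the
# estimated token counts from the table above (1 token ~ 4 chars).
CONTEXT_WINDOW = 200_000  # tokens

repos = {
    # name: (entities, raw_tokens_est, index_tokens_est)
    "Flask": (1_629, 148_000, 19_000),
    "Django": (41_819, 4_700_000, 475_000),
}

for name, (entities, raw, index) in repos.items():
    raw_per_entity = raw / entities      # tokens of raw source per entity
    index_per_entity = index / entities  # tokens of Index-level IR per entity
    print(f"{name}: ~{CONTEXT_WINDOW / raw_per_entity:,.0f} entities as raw source, "
          f"~{CONTEXT_WINDOW / index_per_entity:,.0f} at Index level")
```

For Django this works out to under 2,000 entities as raw source versus roughly 17,000 at Index level, which is where the numbers in the paragraph above come from.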

Claude Already Knows It Needs a Map, But It's Drawing One from Scratch

I planted a single-line bug in SQLAlchemy’s unit-of-work system and gave Claude only the symptoms, once without CodeIR and once with it. Both runs found the same bug and proposed the same fix.

The difference was how they got there.

Without CodeIR, Claude spun up a subagent and started reading the codebase file by file, consuming 1.8 million tokens just to orient itself before it could reason about the problem. With CodeIR, it didn’t need to. It queried the IR, inspected a single entity’s behavior, pulled a small slice of source, and had the answer.

Same diagnosis. Same fix. Completely different path to get there.

Claude already tries to understand the system before acting. CodeIR just gives it the map.

How They Got There

The bug was a subtle filter gap in finalize_flush_changes, the method that marks objects clean after a flush. The delete set correctly filters out list-only states (isdelete and not listonly). But the leftover set (other = states.difference(isdel)) doesn't apply the same filter, so cascade-pulled objects that were never modified flow into _register_persistent() and silently pollute the identity map. Each file reads fine on its own. The problem only surfaces when you trace the data flow across the set arithmetic and the downstream call.

Without CodeIR, Claude recognized it needed to orient before reasoning about the problem. It spawned a subagent that made 45 API calls, reading source files across the ORM layer to build a working understanding of the flush machinery. The main agent then used that context to read targeted sections and identify the bug. Total: 54 API calls, 2.06 million input tokens.

With CodeIR, that orientation step was already done. The agent searched the IR for flush-related entities, inspected one behavior summary to see the call graph, and expanded the 20 lines of source that mattered. Total: 14 API calls, 368k input tokens.

82% fewer input tokens. 74% fewer API calls. Same diagnosis, same fix.

SQLAlchemy Bug Diagnosis: Full Breakdown

I introduced a single-line bug into SQLAlchemy's unit-of-work system and asked Claude Opus 4.5 to find it twice: once without CodeIR and once with it.

SQLAlchemy is 663 files and 38,672 entities. The bug was in finalize_flush_changes, the method that marks objects as clean or deleted after a flush. The delete set correctly filters list-only states: if isdelete and not listonly. But the non-delete set is computed as other = states.difference(isdel), which captures everything not being deleted, including list-only states that were pulled in by cascade but never modified. These flow into _register_persistent(), which replaces their identity map entries and resets their committed-state snapshots even though no SQL was emitted. Downstream queries then return stale in-memory data instead of re-loading from the database.
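The failure mode boils down to a few lines of set arithmetic. The sketch below is a distilled illustration, not SQLAlchemy's actual code; the tuples stand in for ORM states as described above:

```python
# Distilled sketch of the filter gap (not SQLAlchemy's real code).
# Each state is (name, isdelete, listonly).
states = {
    ("dirty_obj", False, False),    # modified, belongs on the persistent path
    ("deleted_obj", True, False),   # deleted, belongs on the delete path
    ("cascade_obj", False, True),   # pulled in by cascade, never modified
}

# The delete set correctly filters out list-only states:
isdel = {s for s in states if s[1] and not s[2]}

# Buggy: the leftover set is simply everything NOT being deleted,
# so list-only cascade states leak into the "persistent" path.
other_buggy = states - isdel

# Fixed: apply the same listonly filter on the non-delete side.
other_fixed = {s for s in states if not s[1] and not s[2]}

assert ("cascade_obj", False, True) in other_buggy      # pollutes the identity map
assert ("cascade_obj", False, True) not in other_fixed  # correctly excluded
```

The buggy and fixed sets differ by exactly one element, which is why the gap is invisible until a flush mixes dirty objects with unmodified cascade-pulled ones.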

The prompt described symptoms, not locations:

"After a flush that involves relationship cascades, objects that were only pulled in as part of the cascade end up polluting the identity map as if they'd been persisted. Downstream queries return stale data. It only breaks with a mix of dirty objects and their unmodified related objects."

Results

Both runs found and correctly diagnosed the bug.

| | API calls | Tool calls | Input tokens | Output tokens |
|---|---|---|---|---|
| Baseline (no CodeIR) | 54 | 53 | 2,061,000 | 9,770 |
| CodeIR | 14 | 14 | 368,000 | 2,251 |
| Reduction | 74% | 74% | 82% | 77% |

What Happened

Without CodeIR, Claude spawned a subagent to orient itself. That subagent made 45 API calls and consumed 1.8 million tokens reading source files across the ORM layer, building the context needed to reason about the flush machinery. The main agent then read targeted sections of unitofwork.py, session.py, and identity.py to confirm the diagnosis. The orientation work was necessary but expensive.

With CodeIR, the orientation was already compiled. The agent ran codeir bearings to see the project structure, searched for flush-related entities, inspected finalize_flush_changes at the Behavior level to see its call graph without reading source, then expanded only the 20 lines of the method itself. It spotted the states.difference(isdel) gap, confirmed that other parts of the codebase correctly filter listonly, and proposed the fix.

Takeaway

Claude already knows it needs to understand a system before reasoning about it. Without CodeIR, it builds that understanding by reading source files, which is expensive and scales with repo size. CodeIR front-loads that work at index time so the agent starts with the architectural picture and drops to source only where it matters.

What CodeIR Looks Like in Practice

CodeIR is a tool that builds an architectural map of a repository.

You index a repo once, and your coding agent can see the structure of the entire system: entities, relationships, and behavioral signatures before reading a single source file.

Instead of navigating through files, the model navigates through architecture.

Setup & Workflow

Setup

pip install codeir
codeir index ./your-repo

This creates a .codeir/ directory containing the index. CodeIR also generates files your agent reads automatically:

  • .claude/rules/codeir.md - tool usage instructions (tells the agent how to use the CodeIR commands)
  • .claude/bearings.md - the full architectural map of your codebase
  • .claude/bearings/{category}.md - per-category detail for large codebases

The workflow changes from "read files and hope" to a structured discovery process:

01 Orient

The agent reads bearings.md at session start. In 200–400 tokens, it knows every module in the codebase, what each one does, how they depend on each other. It knows where to look before it looks anywhere. For large codebases, per-category bearings files let the agent drill into only the relevant slice.

02 Search

Instead of grepping through files, the agent searches the semantic index. Multi-term queries with OR logic and ranking: codeir search "auth token validate" returns the specific entities relevant to the task, not files that happen to contain those words. Use --category to filter (e.g., --category core_logic to skip tests).
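As a toy illustration of multi-term OR search with ranking (this is not CodeIR's actual scoring algorithm; the entities and keywords are invented):

```python
# Toy multi-term OR search with ranking. Illustrative only:
# CodeIR's real index and scoring are not documented here.
entities = {
    "validate_auth_token": "auth token validate jwt",
    "parse_config": "config parse yaml",
    "refresh_token": "auth token refresh",
}

def search(query: str) -> list[str]:
    terms = query.lower().split()
    scored = []
    for name, keywords in entities.items():
        # OR logic: any term may match; more matches rank higher.
        score = sum(t in keywords.split() for t in terms)
        if score:
            scored.append((score, name))
    return [name for score, name in sorted(scored, reverse=True)]

print(search("auth token validate"))
# validate_auth_token matches all three terms and ranks first
```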

03 Grep

When search doesn't find what you need, codeir grep does regex search across source files, grouped by entity. Use --path to scope to a directory and -v for full IR context per match. This bridges the gap between entity-oriented search and line-oriented grep.

04 Inspect

The agent retrieves entity-level IR at the detail level it needs. Index level for quick orientation ("what kind of thing is this?"), Behavior level for understanding ("what does this actually do and what does it call?").

05 Trace

codeir callers finds everything that references a given entity: import-level, local, and fuzzy matching (results marked ~ are probable but not certain). codeir impact does reverse dependency analysis via BFS — showing affected entities grouped by distance with the full dependency chain.
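The reverse dependency analysis described here is a breadth-first walk over an inverted call graph. A generic sketch with a made-up call graph and entity IDs; CodeIR's internals may differ:

```python
from collections import deque

# callers[X] = entities that reference X (the inverted call graph).
# Entity IDs here are invented for illustration.
callers = {
    "FLSH.04": ["SESS.01", "UOW.02"],
    "SESS.01": ["APP.01"],
    "UOW.02": ["APP.01", "CLI.03"],
}

def impact(entity: str, depth: int = 2) -> dict[int, list[str]]:
    """Group affected entities by BFS distance from `entity`."""
    seen, frontier = {entity}, deque([(entity, 0)])
    by_distance: dict[int, list[str]] = {}
    while frontier:
        node, d = frontier.popleft()
        if d == depth:
            continue  # stop expanding past the requested depth
        for caller in callers.get(node, []):
            if caller not in seen:
                seen.add(caller)
                by_distance.setdefault(d + 1, []).append(caller)
                frontier.append((caller, d + 1))
    return by_distance

print(impact("FLSH.04"))
# direct callers at distance 1, their callers at distance 2
```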

06 Scope

codeir scope returns the minimal context needed to safely modify an entity: its callers, callees, and sibling methods (same class). Use before editing to understand what you might break and what the entity depends on.

07 Expand

Only when the agent needs to verify exact implementation details or make a change does it expand to raw source. This is the last step, not the first.

The net effect: Claude Code stops reading files it doesn't need. It spends its context window on understanding the system instead of re-reading syntax.

CLI Reference & IR Format

Available CLI Commands

codeir index <path>                              # Index or re-index a repository
codeir search <query> [--category <cat>]         # Multi-term OR search with ranking
codeir grep <pattern> [--path <dir>] [-i] [-C N] [-v]  # Regex search, grouped by entity
codeir show <entity_id> [--level Index|Behavior] # Show entity IR at specified level
codeir expand <entity_id>                        # Expand to full source
codeir callers <entity_id>                       # Three-tier caller resolution
codeir impact <entity_id> [--depth N]            # Reverse dependency analysis (BFS)
codeir scope <entity_id>                         # Minimal context for safe modification
codeir bearings                                  # Summary + menu with token estimates
codeir bearings [category]                       # Specific category (e.g., core_logic)
codeir bearings --full                           # Full module map

CodeIR is globally installed via symlink at /usr/local/bin/codeir. The .claude/rules/ directory is read automatically by Claude Code, so the integration requires zero manual configuration after indexing.


What bearings.md Looks Like

Each line describes one module: its ID, filename, category, entity count, dependencies, and churn. Grouped by category, ordered by entity count.

MD INVC invoice.py | cat:core_logic | entities:221 | deps:- | churn:-
MD SESS sessions.py | cat:core_logic | entities:18  | deps:auth,hooks,utils | churn:-
MD AUTH auth.py     | cat:core_logic | entities:12  | deps:sessions,models | churn:-
MD HOOK hooks.py    | cat:extension  | entities:7   | deps:sessions | churn:-
...

The entire architectural map of a codebase like Flask (83 files, 1,629 entities) fits in a few hundred tokens. Django (2,894 files, 41,819 entities) fits in a few thousand. This is the orientation layer - the agent reads it once and immediately knows the shape of the system.

For large codebases like Tryton, per-category bearings files break the map into manageable pieces. The agent loads .claude/bearings/core_logic.md when working on business logic, .claude/bearings/tests.md when writing tests - never the whole thing at once if it doesn't need to.


Entity IR at Each Level

Index - orientation (selection-level)
FN RDTKN #HTTP #CORE
Behavior - what it does and calls (reasoning-level)
MT WBHK.02 C=webhook_charge_captured,webhook_charge_dispute_closed,webhook_charge_dispute_created,webhook_charge_expired,webhook_charge_failed,webhook_charge_pending F=IR A=2 #CORE

Behavior fields: C= calls made, F= flags (R=returns, E=raises, I=conditionals, L=loops, T=try/except, W=with), A= assignment count, B= base class.

| Flag/Field | Meaning | Why the LLM cares |
|---|---|---|
| C= (Calls) | Outgoing dependencies | Traces the "blast radius." If MT.A calls MT.B, the model knows a change in B might require an update in A. |
| F=IR (Flags) | Internal logic (e.g., If, Return) | Indicates complexity. F=R is a simple pipe; F=ITLW (If, Try, Loop, With) is a high-logic function that needs careful handling. |
| A=2 (Assignments) | State changes | Tells the model if the function is "pure" or if it's juggling internal state. High assignment counts signal a "heavy lifter" function. |
| B= (Bases) | Class inheritance | Immediately establishes the "Rules of the Road" (e.g., knowing a class inherits from Model or Singleton without reading the imports). |
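Based only on the fields documented above, a reader for Behavior lines is short enough to sketch (the real grammar may include fields not shown here):

```python
# Minimal parser for the Behavior-level line format documented above.
# Assumes only the fields described in this section exist.
FLAG_NAMES = {
    "R": "returns", "E": "raises", "I": "conditionals",
    "L": "loops", "T": "try/except", "W": "with",
}

def parse_behavior(line: str) -> dict:
    tokens = line.split()
    entry = {"type": tokens[0], "id": tokens[1], "calls": [],
             "flags": [], "assignments": 0, "tags": []}
    for tok in tokens[2:]:
        if tok.startswith("C="):                      # outgoing calls
            entry["calls"] = tok[2:].split(",")
        elif tok.startswith("F="):                    # single-letter flags
            entry["flags"] = [FLAG_NAMES[c] for c in tok[2:]]
        elif tok.startswith("A="):                    # assignment count
            entry["assignments"] = int(tok[2:])
        elif tok.startswith("B="):                    # base class
            entry["base"] = tok[2:]
        elif tok.startswith("#"):                     # domain/category tags
            entry["tags"].append(tok[1:])
    return entry

print(parse_behavior("MT WBHK.02 C=a,b F=IR A=2 #CORE"))
```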
Source expansion (verification-level)
[FN RDTKN @auth/tokens.py:47]
async def read_token(token: str, session: Session) -> TokenData:
    response = session.get(f"/tokens/{token}")
    ...

The agent navigates this hierarchy the same way a developer navigates a system: broad orientation first, then targeted deep dives only where needed.


Example Workflow (Tryton codebase)

1. codeir search "flush"               → no relevant results
2. codeir grep "def flush" --path orm/ → finds entity FLSH.04 in orm/session.py
3. codeir show FLSH.04                → see Behavior IR: what it calls, flags, assignments
4. codeir callers FLSH.04             → see what depends on it
5. codeir impact FLSH.04 --depth 2    → understand blast radius before changing
6. codeir scope FLSH.04               → get callers, callees, siblings for safe modification
7. codeir expand FLSH.04              → read source only for the entity you need to modify

How It Works Under the Hood

Instead of reading files and hoping to find the right ones, CodeIR gives the agent the whole system in view, with the ability to drill into any detail on demand.

CodeIR compiles a codebase into a hierarchical representation with three levels: Index, Behavior, and Source.

This allows models to orient themselves across thousands of entities before expanding only the specific code they need.

Index, Behavior, Source
Level 1 Index Architectural Map

Every entity in the codebase gets a one-line representation: entity type, stable ID, domain tag, category tag. An agent can scan thousands of entities in a few hundred tokens and identify which ones are relevant to a task.

Answers: what is this codebase made of?

Level 2 Behavior Reasoning Layer

Selected entities get expanded to Behavior: type, ID, calls made, flags (returns, raises, conditionals, loops, try/except, with-blocks), assignment count, and base class. The agent can understand what an entity does and what it depends on — without reading source.

Answers: what does this entity actually do?

Level 3 Source Verification Layer

When the agent needs to verify exact implementation or make a modification, it expands to full source with entity boundary markers.

Answers: exactly how is this implemented?

The compilation process is one-time per repository: AST parsing extracts every entity (functions, classes, methods, constants), a classifier assigns each module to a category, and a multi-pass indexer generates all IR levels with stable, deterministic entity IDs. Re-indexing is incremental — only changed files get reprocessed.

The Compilation Pipeline

The Compilation Pipeline in Detail

Step 1: AST Extraction

CodeIR walks every Python file with an AST visitor, extracting functions, async functions, classes, methods, and module-level constants. Each entity gets boundary markers (start line, end line) and import analysis (what this entity imports and from where).

Step 2: Module Classification

Each file is classified into categories: core_logic, data_model, api_endpoint, auth, config, utility, extension, test, migration, cli, docs, constants, exceptions, init, router. Classification uses structural signals — import patterns, decorator presence, naming conventions — not LLM inference.

Step 3: Stable ID Generation

Every entity gets a deterministic ID that remains stable across re-indexing as long as the entity's name and location don't change. Format: TYPE STEM.SUFFIX (e.g., FN RDTKN.03, MT WBHK.02). The ID scheme uses vowel-stripped abbreviations with numeric suffixes for disambiguation.

| Type | Meaning |
|---|---|
| FN | Function |
| CLS | Class |
| MT | Method |
| AMT | Async method |
Stable IDs act as pointers for the LLM's memory, allowing it to maintain a constant "mental map" even as the underlying code changes.
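The vowel-stripping scheme can be approximated in a few lines. This is an approximation inferred from the IDs shown in this document; collision handling via numeric suffixes and other edge cases are omitted:

```python
def stem(name: str) -> str:
    """Approximate CodeIR's vowel-stripped stem for an entity name.

    Drops underscores and vowels, then uppercases. Numeric suffixes
    for disambiguation (e.g. RDTKN.03) are not modeled here.
    """
    compact = name.replace("_", "")
    return "".join(c for c in compact if c.lower() not in "aeiou").upper()

# Stems that appear elsewhere in this document:
assert stem("read_token") == "RDTKN"
assert stem("webhook") == "WBHK"
assert stem("finalize_flush_changes") == "FNLZFLSHCHNGS"
```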

Step 4: Multi-level IR Generation

Each entity is compressed at all levels simultaneously:

  • Index strips everything except type, ID, and domain/category tags
  • Behavior adds: calls made (C=), flags (F= with single-letter codes: R returns, E raises, I conditionals, L loops, T try/except, W with-blocks), assignment count (A=), base class (B=)
  • Source wraps the original code with entity metadata headers ([TYPE ID @filepath:line])

Empty fields are omitted at Behavior level (no C=-, F=-, A=0). Trivial entities below a token threshold skip compression and store as source at all levels.

Step 5: Bearings Generation

Once classification and entity counting are complete, the index generates the module-level map. This contains no entity-level IR — it's a directory of modules grouped by category with dependency and churn information. For large codebases, per-category bearings files are generated in .claude/bearings/{category}.md.

Storage

SQLite with WAL mode, two-database design (entities + mappings). Created automatically in .codeir/ on first indexing. Incremental re-indexing uses change detection — only modified files trigger reprocessing.
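One plausible change-detection scheme for incremental re-indexing looks like this; it is a sketch of the general technique (content hashing), not necessarily what CodeIR stores:

```python
import hashlib
from pathlib import Path

def content_hash(path: Path) -> str:
    """Hash a file's bytes so edits are detected regardless of mtime."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

def changed_files(repo: Path, previous: dict[str, str]) -> list[Path]:
    """Return files whose content differs from the stored hashes.

    `previous` plays the role of the persisted index state and is
    updated in place so the next run sees the new hashes.
    """
    changed = []
    for py in sorted(repo.rglob("*.py")):
        h = content_hash(py)
        if previous.get(str(py)) != h:
            changed.append(py)
        previous[str(py)] = h
    return changed
```

Only the files this returns would need re-parsing; everything else keeps its existing IR and entity IDs.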


A Note on What Was Evaluated and Removed

An intermediate level between Index and Behavior (type signatures only) was tested and cut. It was too compressed to be safe — agents would confidently reason from type information without seeing behavioral flags like exception handling, leading to plans that missed critical error conditions. The three-level stack is cleaner and each level has a clear, non-overlapping purpose.

What Happens When the Model Actually Understands the System

I asked Claude how to handle tax rates that change mid-period in Tryton's reporting module. Both sessions found the surface problem immediately: reporting queries collapse transactions across rate changes into a single total. The baseline fixed the queries: group by rate, split the rows, done.

The CodeIR session followed the data further upstream and noticed something more fundamental: each tax line points to a live tax configuration record. There's no snapshot. If an admin edits a tax rate next Tuesday, every invoice that ever referenced it — 2024, 2023, whenever — silently rewrites itself in the next report run. The reporting fix would faithfully aggregate the wrong historical numbers.

So instead of patching the query it proposed capturing the effective rate on each tax line when the transaction is created. Once an invoice is posted its tax rate becomes a fact about that transaction, not a pointer to a configuration someone can change later.

Baseline reasoning stopped at the reporting layer. CodeIR reasoning traced the data lifecycle.

What Each Run Found

What the baseline missed

The baseline read the reporting queries, found the aggregation problem, and fixed it: group by tax.rate, split the rows. Clean, minimal, correct… as long as nobody edits a Tax record after invoices have been posted against it.

But TaxLine doesn't store the rate that was used. It stores a foreign key to the Tax record: a live configuration object. The reporting fix joins back to that table at query time, so the report is only as reliable as the assumption that the Tax record still reflects what was true when the transaction happened.

The baseline didn't catch this because the vulnerability isn't visible in any single file. The reporting code looks fine. The Tax model looks fine. The TaxLine model looks fine. The problem lives in the relationship between them: a mutable record standing in for an immutable fact.

What CodeIR found

The CodeIR session traced the data lifecycle: how tax lines are created, what they store, and what they reference. That upstream exploration exposed the dependency on a mutable record and led to a different fix: capture the effective rate directly on TaxLine at write time.

It's a small schema change: one new field populated when the invoice is created. But it moves the guarantee from the report logic into the data model. Once a transaction is posted its tax rate becomes a recorded fact, not a live lookup. The reporting fix still applies: you still group by rate. But now you're grouping by a value that can't be silently rewritten.

CodeIR didn't stop at the reporting layer. It followed the data upstream and found a missing boundary between what the system configures and what it records.
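The snapshot-at-write pattern is small enough to sketch. The names below mirror the description above but are illustrative, not Tryton's actual API:

```python
from dataclasses import dataclass

@dataclass
class Tax:
    """Mutable configuration record: an admin can edit this later."""
    rate: float

@dataclass
class TaxLine:
    """Recorded fact about a posted transaction."""
    tax: Tax
    effective_rate: float  # captured at creation, never re-read from Tax

def create_tax_line(tax: Tax) -> TaxLine:
    # Snapshot the live configuration value at write time.
    return TaxLine(tax=tax, effective_rate=tax.rate)

vat = Tax(rate=0.21)
line = create_tax_line(vat)
vat.rate = 0.19                       # configuration changes next Tuesday
assert line.effective_rate == 0.21    # the posted line keeps the historical rate
assert line.tax.rate == 0.19          # the live lookup would report the new one
```

Reports that group by `effective_rate` instead of joining back to `Tax` can no longer be silently rewritten by a configuration edit.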

Case Study: Tryton Tax Reporting Refactor

Same prompt, same repo (2,375 files, 20,457 entities). Once without CodeIR, once with it.

| Run | Turns | Total tokens |
|---|---|---|
| Baseline | 20 | 696,106 |
| CodeIR | 33 | 981,790 |

Baseline Plan

Core strategy: Modify aggregation queries to GROUP BY the effective tax rate, then report separate subtotals per rate within each period.

  1. Phase 1: Add get_rate_periods() helper to detect rate boundaries within a period
  2. Phase 2: Refactor AEAT303.get_context() to compute separate subtotals per rate-effective sub-period
  3. Phase 3: Update ESVATList.table_query() and ESVATBook.table_query() GROUP BY to include tax.rate

CodeIR Plan

Core strategy: Capture the effective rate on TaxLine at write time. Report against the stored value.

Phase 1 — Schema: Add effective_rate field to TaxLine. Migration adds the column with NULL for historical records.

Phase 2 — Capture: Populate effective_rate at every TaxLine creation point:

  • InvoiceTax.get_move_lines
  • InvoiceLine._compute_taxes
  • POSSale.get_tax_move_lines

Phase 3 — Reporting: Add tax_line.effective_rate to SELECT and GROUP BY in ESVATList.table_query().

Key files

| File | Change |
|---|---|
| modules/account/tax.py:1337 | Add effective_rate field to TaxLine |
| modules/account_invoice/invoice.py:3326 | Store rate in InvoiceTax.get_move_lines |
| modules/account_invoice/invoice.py:3023 | Store rate in InvoiceLine._compute_taxes |
| modules/account_es/reporting_tax.py:428 | Group by rate in table_query |
| modules/account_es/reporting_tax.py:392 | Add rate field to ESVATList model |

1,620 Lines to Find Five Methods

I asked Claude to design per-blueprint session overrides in Flask (83 files, 1,629 entities).

Both runs reached the same conclusion: sessions open before routing determines the blueprint, which creates a timing constraint for the design.

The baseline reached that insight by brute force reading files.

The CodeIR run navigated to the same entities through architectural inspection and expanded only the methods it needed.

Same answer, far less source code.

Navigation vs. Brute Force

The task: add per-blueprint session interfaces to Flask. The core challenge is timing: AppContext._get_session() runs before match_request(), so the session opens before the framework knows which blueprint will handle the request. Both runs identified this, proposed the same five entities that need changes, and surfaced the same design tradeoff: deferred session loading vs per-blueprint save only.

The difference was how they got there.

The baseline built its understanding by reading source files directly: sessions.py (386 lines), ctx.py (541 lines), and blueprints.py (693 lines). About 1,600 lines in total, most of which wasn't relevant to the final answer.

CodeIR took a different path. It navigated the system structurally, searching for session and blueprint entities, inspecting behavior summaries to understand call relationships, tracing callers of open_session and save_session, and expanding only the specific methods involved. It never read a full file.

Both runs used a similar number of tokens but the baseline spent them reading broadly and filtering down, while CodeIR navigated directly to the relevant parts of the system and read only what it needed.

Flask Session Override: Full Breakdown

I asked Claude Opus 4.5 to design per-blueprint session overrides for Flask (83 files, 1,629 entities). Each blueprint should be able to specify its own session interface instead of using the app-wide one. I ran it twice: once without CodeIR and once with it.

Both runs arrived at the same diagnosis and the same design. The core constraint is timing: AppContext._get_session() fires before match_request() resolves which blueprint will handle the request. You can't override the session interface per-blueprint if the session is already open before the blueprint is known. Both runs identified the same five entities that need modification and surfaced the same tradeoff between deferred session loading and per-blueprint save-only.

The prompt:

"We're extending Flask's Blueprint system to support per-blueprint session overrides. Each blueprint should be able to specify its own session interface instead of using the app-wide one. Where would this need to be wired in, and what entities would need to change?"

Results

| | API calls | Tool calls | Input tokens | Output tokens |
|---|---|---|---|---|
| Baseline (no CodeIR) | 8 | 7 | 192,354 | 2,065 |
| CodeIR | 9 | 16 | 265,238 | 3,747 |

On a codebase this size, both approaches cost about the same in tokens. Flask is only 83 files, and the baseline's brute-force approach of reading three files cover to cover is cheap when each file is only a few hundred lines.

The difference is in what Claude read and how it got there.

What Happened

Without CodeIR, Claude globbed the directory, grepped for class Blueprint and session_interface, then read three files in full: sessions.py (386 lines), ctx.py (541 lines), and sansio/blueprints.py (693 lines). That's 1,620 lines of source, most of which was irrelevant to the task. The relevant methods are maybe 80 lines across all three files.

With CodeIR, Claude searched for session and blueprint entities in the IR, inspected their Behavior summaries to understand call relationships without reading source, used callers to trace what calls open_session and save_session, then expanded only the specific methods involved. It made more tool calls (16 vs 7) but each one was targeted: a behavior inspection or a 20-line expansion rather than a 693-line file read. It never opened a full file.

Takeaway

Flask is the case where CodeIR's advantage is smallest. On a codebase this size, reading whole files is fast and cheap. The navigation advantage matters more as repos grow, which is what the SQLAlchemy and Tryton case studies show. But even here, the difference in approach is visible: one run reads everything and filters afterward, the other navigates to what matters and reads only that.

CodeIR is open source. GitHub

github.com/tnenglert/CodeIR

Works with Claude Code out of the box.

Wait, what does Claude think?

Since Claude is the primary "user" of CodeIR, I didn't want to just guess if it helped. I asked him. During development, Claude's feedback was so specific that it actually shaped the feature set.

For example, Claude found that while entity search was great, he still missed the "vibe" of a classic grep. Based on that, we built codeir grep, which returns IR context alongside regex matches.

Here is what Claude had to say:

Claude on SQLAlchemy Bug Search:
"I found the bug in ~6 queries across a 663-file, 38k-entity codebase. Without CodeIR, I'd have been grepping through the ORM internals much longer."

What worked well

  1. Search was fast and effective — codeir search "flush finalize" immediately surfaced FNLZFLSHCHNGS (UOWTransaction.finalize_flush_changes). The compressed entity IDs made results scannable.
  2. Behavior level was the sweet spot — codeir show FNLZFLSHCHNGS --level Behavior gave me:

    MT FNLZFLSHCHNGS C=_register_persistent,_remove_newly_deleted,difference,items,set F=IR A=3 #DB #CORE

    This told me exactly what methods it calls without reading source. I knew where to look next.
  3. Expand only when needed — I only used codeir expand twice. The Behavior level let me skip reading _register_persistent entirely until I needed to confirm the bug's impact.
  4. Category filtering — --category core_logic helped cut through the 38k entities to find state-related logic quickly.

What could improve

  1. Semantic search — codeir search "list_only" returned nothing because the term isn't in entity names. A grep-style content search would have helped here.
    → At Claude's suggestion, search now recommends a grep fallback when it returns no results
  2. The ID compression is aggressive — FNLZFLSHCHNGS took me a moment to parse as "finalize_flush_changes". Readable once you get used to it, but there's a learning curve.

Overall

The Orient → Behavior → Source workflow worked exactly as designed. I found the bug in ~6 queries across a 663-file, 38k-entity codebase. Without CodeIR, I'd have been grepping through the ORM internals much longer.

Claude on Django Refactoring Plan:
"Going from Behavior to exact source with CodeIR expand was fast. I knew what I was looking for before reading the code."

Ha! Happy to give feedback on CodeIR.

What Worked Well

  1. Search was effective — codeir search descriptor cache related quickly surfaced ForwardManyToOneDescriptor.is_cached and related entities. The multi-term OR ranking worked well.
  2. Behavior level was genuinely useful — Seeing CLS FRWRDMNYTNDSCRPTR C=ColPairs,RelatedObjectDoesNotExist,TupleIn,ValueError,_traverse_ancestors,allow_relation F=EILRT A=23 told me immediately this class does error handling, has conditionals/loops/returns/try-except, and talks to specific collaborators — without reading 200 lines of source.
  3. The expand → source workflow — Going from Behavior to exact source with codeir expand ISCCHD.03 was fast. I knew what I was looking for before reading the code.

What I Ended Up Doing Anyway

  1. Fell back to Grep/Read for grep-style exploration — When I wanted to see all usages of _result_cache in query.py with context, I used Grep directly. CodeIR search is entity-oriented, which is great for "find the function that does X" but less suited for "show me every line mentioning Y."
  2. Direct file reads for understanding flow — For the __get__ method, I read a chunk of the file to see the full logic flow. The Behavior level told me what it calls, but I needed source to understand when and why.

Suggestions

  1. codeir grep — A wrapper that returns IR context alongside grep matches could bridge the gap.
    → We added this feature at Claude's suggestion and he uses it often
  2. Caller results were good — I didn't use callers here but the ~ fuzzy marker is a nice touch for uncertain matches.

Overall

It shortened the path from "where does Django cache related objects?" to "here's the exact line" significantly. The Django codebase is 41K entities — without the index, I'd have been grepping blind.

If you use the tool, we would love to hear about Claude's experience. We encourage you to ask him and share his feedback (and yours) at SOMEWHERE.