Claude Already Knows It Needs a Map, But It's Drawing One from Scratch
I planted a single-line bug in SQLAlchemy’s unit-of-work system and gave Claude only the symptoms, once without CodeIR and once with it. Both runs found the same bug and proposed the same fix.
The difference was how they got there.
Without CodeIR, Claude spun up a subagent and started reading the codebase file by file, consuming 1.8 million tokens just to orient itself before it could reason about the problem. With CodeIR, it didn’t need to. It queried the IR, inspected a single entity’s behavior, pulled a small slice of source, and had the answer.
Same diagnosis. Same fix. Completely different path to get there.
Claude already tries to understand the system before acting. CodeIR just gives it the map.
How They Got There
The bug was a subtle filter gap in finalize_flush_changes, the method that marks objects clean after a flush. The delete set correctly filters out list-only states with and not listonly. But the leftover set (other = states.difference(isdel)) doesn't apply the same filter, so cascade-pulled objects that were never modified flow into _register_persistent() and silently pollute the identity map. Each file reads fine on its own. The problem only surfaces when you trace the data flow across the set arithmetic and the downstream call.
Without CodeIR, Claude recognized it needed to orient before reasoning about the problem. It spawned a subagent that made 45 API calls, reading source files across the ORM layer to build a working understanding of the flush machinery. The main agent then used that context to read targeted sections and identify the bug. Total: 54 API calls, 2.06 million input tokens.
With CodeIR, that orientation step was already done. The agent searched the IR for flush-related entities, inspected one behavior summary to see the call graph, and expanded the 20 lines of source that mattered. Total: 14 API calls, 368k input tokens.
82% fewer input tokens. 74% fewer API calls. Same diagnosis, same fix.
SQLAlchemy Bug Diagnosis: Full Breakdown
I introduced a single-line bug into SQLAlchemy's unit-of-work system and asked Claude Opus 4.5 to find it twice: once without CodeIR and once with it.
SQLAlchemy is 663 files and 38,672 entities. The bug was in finalize_flush_changes, the method that marks objects as clean or deleted after a flush. The delete set correctly filters list-only states: if isdelete and not listonly. But the non-delete set is computed as other = states.difference(isdel), which captures everything not being deleted, including list-only states that were pulled in by cascade but never modified. These flow into _register_persistent(), which replaces their identity map entries and resets their committed-state snapshots even though no SQL was emitted. Downstream queries then return stale in-memory data instead of re-loading from the database.
The prompt described symptoms, not locations:
"After a flush that involves relationship cascades, objects that were only pulled in as part of the cascade end up polluting the identity map as if they'd been persisted. Downstream queries return stale data. It only breaks with a mix of dirty objects and their unmodified related objects."
Results
Both runs found and correctly diagnosed the bug.
| | API calls | Tool calls | Input tokens | Output tokens |
| --- | --- | --- | --- | --- |
| Baseline (no CodeIR) | 54 | 53 | 2,061,000 | 9,770 |
| CodeIR | 14 | 14 | 368,000 | 2,251 |
| Reduction | 74% | 74% | 82% | 77% |
What Happened
Without CodeIR, Claude spawned a subagent to orient itself. That subagent made 45 API calls and consumed 1.8 million tokens reading source files across the ORM layer, building the context needed to reason about the flush machinery. The main agent then read targeted sections of unitofwork.py, session.py, and identity.py to confirm the diagnosis. The orientation work was necessary but expensive.
With CodeIR, the orientation was already compiled. The agent ran codeir bearings to see the project structure, searched for flush-related entities, inspected finalize_flush_changes at the Behavior level to see its call graph without reading source, then expanded only the 20 lines of the method itself. It spotted the states.difference(isdel) gap, confirmed that other parts of the codebase correctly filter listonly, and proposed the fix.
Takeaway
Claude already knows it needs to understand a system before reasoning about it. Without CodeIR, it builds that understanding by reading source files, which is expensive and scales with repo size. CodeIR front-loads that work at index time so the agent starts with the architectural picture and drops to source only where it matters.
What Happens When the Model Actually Understands the System
I asked Claude how to handle tax rates that change mid-period in Tryton's reporting module. Both sessions found the surface problem immediately: reporting queries collapse transactions across rate changes into a single total. The baseline fixed the queries: group by rate, split the rows, done.
The CodeIR session followed the data further upstream and noticed something more fundamental: each tax line points to a live tax configuration record. There's no snapshot. If an admin edits a tax rate next Tuesday, every invoice that ever referenced it — 2024, 2023, whenever — silently rewrites itself in the next report run. The reporting fix would faithfully aggregate the wrong historical numbers.
So instead of patching the query it proposed capturing the effective rate on each tax line when the transaction is created. Once an invoice is posted its tax rate becomes a fact about that transaction, not a pointer to a configuration someone can change later.
Baseline reasoning stopped at the reporting layer. CodeIR reasoning traced the data lifecycle.
What Each Run Found
What the baseline missed
The baseline read the reporting queries, found the aggregation problem, and fixed it: group by tax.rate, split the rows. Clean, minimal, correct… as long as nobody edits a Tax record after invoices have been posted against it.
But TaxLine doesn't store the rate that was used. It stores a foreign key to the Tax record: a live configuration object. The reporting fix joins back to that table at query time, so the report is only as reliable as the assumption that the Tax record still reflects what was true when the transaction happened.
The baseline didn't catch this because the vulnerability isn't visible in any single file. The reporting code looks fine. The Tax model looks fine. The TaxLine model looks fine. The problem lives in the relationship between them: a mutable record standing in for an immutable fact.
What CodeIR found
The CodeIR session traced the data lifecycle: how tax lines are created, what they store, and what they reference. That upstream exploration exposed the dependency on a mutable record and led to a different fix: capture the effective rate directly on TaxLine at write time.
It's a small schema change: one new field populated when the invoice is created. But it moves the guarantee from the report logic into the data model. Once a transaction is posted its tax rate becomes a recorded fact, not a live lookup. The reporting fix still applies: you still group by rate. But now you're grouping by a value that can't be silently rewritten.
CodeIR didn't stop at the reporting layer. It followed the data upstream and found a missing boundary between what the system configures and what it records.
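The difference between the two designs fits in a few lines of plain Python. The Tax and TaxLine classes below are hypothetical stand-ins, not Tryton's actual models; they only illustrate the mutable-reference-vs-snapshot distinction:

```python
class Tax:
    """Live configuration record: an admin can edit rate at any time."""
    def __init__(self, rate):
        self.rate = rate

class TaxLine:
    """One tax line on a posted transaction."""
    def __init__(self, base, tax):
        self.base = base
        self.tax = tax                   # existing design: reference to live config
        self.effective_rate = tax.rate   # proposed fix: snapshot at write time

vat = Tax(rate=0.25)
line = TaxLine(base=100.0, tax=vat)      # invoice posted against a 25% rate

vat.rate = 0.5                           # admin edits the config later

assert line.tax.rate * line.base == 50.0        # live lookup: history rewritten
assert line.effective_rate * line.base == 25.0  # snapshot: posted fact preserved
```

Grouping reports by the stored effective_rate instead of the joined-in configuration value is what moves the guarantee from the query into the data model.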
Case Study: Tryton Tax Reporting Refactor
Same prompt, same repo (2,375 files, 20,457 entities). Once without CodeIR, once with it.
| Run | Turns | Total tokens |
| --- | --- | --- |
| Baseline | 20 | 696,106 |
| CodeIR | 33 | 981,790 |
Baseline Plan
Core strategy: Modify aggregation queries to GROUP BY the effective tax rate, then report separate subtotals per rate within each period.
Phase 1: Add get_rate_periods() helper to detect rate boundaries within a period
Phase 2: Refactor AEAT303.get_context() to compute separate subtotals per rate-effective sub-period
Phase 3: Update ESVATList.table_query() and ESVATBook.table_query() GROUP BY to include tax.rate
CodeIR Plan
Core strategy: Capture the effective rate on TaxLine at write time. Report against the stored value.
Phase 1 — Schema: Add effective_rate field to TaxLine. Migration adds the column with NULL for historical records.
Phase 2 — Capture: Populate effective_rate at every TaxLine creation point:
- InvoiceTax.get_move_lines
- InvoiceLine._compute_taxes
- POSSale.get_tax_move_lines
Phase 3 — Reporting: Add tax_line.effective_rate to SELECT and GROUP BY in ESVATList.table_query().
Key files
| File | Change |
| --- | --- |
| modules/account/tax.py:1337 | Add effective_rate field to TaxLine |
| modules/account_invoice/invoice.py:3326 | Store rate in InvoiceTax.get_move_lines |
| modules/account_invoice/invoice.py:3023 | Store rate in InvoiceLine._compute_taxes |
| modules/account_es/reporting_tax.py:428 | Group by rate in table_query |
| modules/account_es/reporting_tax.py:392 | Add rate field to ESVATList model |
1,620 Lines to Find Five Methods
I asked Claude to design per-blueprint session overrides in Flask (83 files, 1,629 entities).
Both runs reached the same conclusion: sessions open before routing determines the blueprint, which creates a timing constraint for the design.
The baseline reached that insight by brute force reading files.
The CodeIR run navigated to the same entities through architectural inspection and expanded only the methods it needed.
Same answer, far less source code.
Navigation vs. Brute Force
The task: add per-blueprint session interfaces to Flask. The core challenge is timing: AppContext._get_session() runs before match_request(), so the session opens before the framework knows which blueprint will handle the request. Both runs identified this, proposed the same five entities that need changes, and surfaced the same design tradeoff: deferred session loading vs per-blueprint save only.
The difference was how they got there.
The baseline built its understanding by reading source files directly: sessions.py (386 lines), ctx.py (541 lines), and blueprints.py (693 lines). About 1,600 lines in total, most of which wasn't relevant to the final answer.
CodeIR took a different path. It navigated the system structurally, searching for session and blueprint entities, inspecting behavior summaries to understand call relationships, tracing callers of open_session and save_session, and expanding only the specific methods involved. It never read a full file.
Both runs used a similar number of tokens, but the baseline spent them reading broadly and filtering down, while CodeIR navigated directly to the relevant parts of the system and read only what it needed.
Flask Session Override: Full Breakdown
I asked Claude Opus 4.5 to design per-blueprint session overrides for Flask (83 files, 1,629 entities). Each blueprint should be able to specify its own session interface instead of using the app-wide one. I ran it twice: once without CodeIR and once with it.
Both runs arrived at the same diagnosis and the same design. The core constraint is timing: AppContext._get_session() fires before match_request() resolves which blueprint will handle the request. You can't override the session interface per-blueprint if the session is already open before the blueprint is known. Both runs identified the same five entities that need modification and surfaced the same tradeoff between deferred session loading and per-blueprint save-only.
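One way the "deferred session loading" option could look is a lazy proxy that postpones opening the session until the first access, by which point routing has resolved the blueprint. This is a minimal sketch with made-up names (LazySession, blueprint_open_session), not Flask's actual internals:

```python
class LazySession:
    """Defers opening the real session until first access, so the
    per-blueprint interface can be chosen after match_request() runs."""
    def __init__(self, open_session):
        self._open = open_session   # callable resolved per-blueprint
        self._real = None

    def _load(self):
        # Open lazily, exactly once.
        if self._real is None:
            self._real = self._open()
        return self._real

    def __getitem__(self, key):
        return self._load()[key]

    def __setitem__(self, key, value):
        self._load()[key] = value

# Hypothetical wiring: the app context installs the proxy at request
# start; the blueprint's interface is only consulted when a view
# actually touches the session.
opened = []
def blueprint_open_session():
    opened.append(True)
    return {"user": "alice"}

session = LazySession(blueprint_open_session)
assert not opened                   # nothing opened before routing
assert session["user"] == "alice"   # first access opens the session
assert opened == [True]             # and it opens exactly once
```

The per-blueprint save-only alternative avoids the proxy but can only customize how sessions are written back, since the open still happens against the app-wide interface.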
The prompt:
"We're extending Flask's Blueprint system to support per-blueprint session overrides. Each blueprint should be able to specify its own session interface instead of using the app-wide one. Where would this need to be wired in, and what entities would need to change?"
Results
| | API calls | Tool calls | Input tokens | Output tokens |
| --- | --- | --- | --- | --- |
| Baseline (no CodeIR) | 8 | 7 | 192,354 | 2,065 |
| CodeIR | 9 | 16 | 265,238 | 3,747 |
On a codebase this size, both approaches cost about the same in tokens. Flask is only 83 files, and the baseline's brute-force approach of reading three files cover to cover is cheap when each file is only a few hundred lines.
The difference is in what Claude read and how it got there.
What Happened
Without CodeIR, Claude globbed the directory, grepped for class Blueprint and session_interface, then read three files in full: sessions.py (386 lines), ctx.py (541 lines), and sansio/blueprints.py (693 lines). That's 1,620 lines of source, most of which was irrelevant to the task. The relevant methods are maybe 80 lines across all three files.
With CodeIR, Claude searched for session and blueprint entities in the IR, inspected their Behavior summaries to understand call relationships without reading source, used callers to trace what calls open_session and save_session, then expanded only the specific methods involved. It made more tool calls (16 vs 7) but each one was targeted: a behavior inspection or a 20-line expansion rather than a 693-line file read. It never opened a full file.
Takeaway
Flask is the case where CodeIR's advantage is smallest. On a codebase this size, reading whole files is fast and cheap. The navigation advantage matters more as repos grow, which is what the SQLAlchemy and Tryton case studies show. But even here, the difference in approach is visible: one run reads everything and filters afterward, the other navigates to what matters and reads only that.