LLMs Are Reading Code the Hard Way
When Claude or GPT tries to understand a large codebase, it does something slightly absurd: it reads files.
Entire files. Thousands of lines of syntax, indentation, and formatting, burning tokens and filling context windows. Most of what the model takes in says little about how the system actually works.
Humans don’t navigate codebases this way. After a few weeks on a project, developers stop thinking in files entirely. They think in architecture: which modules own which responsibilities, which entities call each other, where the system boundaries are.
They only dive into source code when something specific needs to change.
LLMs never get that representation. They get raw code.
Claude Already Knows It Needs a Map, But It's Drawing One from Scratch
I planted a single-line bug in SQLAlchemy’s unit-of-work system and gave Claude only the symptoms. Both runs found the same bug and proposed the same fix.
The difference was how they got there.
Without CodeIR, Claude spun up a subagent and started reading the codebase file by file, consuming 1.8 million tokens just to orient itself before it could reason about the problem. With CodeIR, it didn’t need to. It queried the IR, inspected a single entity’s behavior, pulled a small slice of source, and had the answer.
Same diagnosis. Same fix. Completely different path to get there.
Claude already tries to understand the system before acting. CodeIR just gives it the map.
What CodeIR Looks Like in Practice
CodeIR is a tool that builds an architectural map of a repository.
You index a repo once, and your coding agent can see the structure of the entire system: entities, relationships, and behavioral signatures, all before reading a single source file.
Instead of navigating through files, the model navigates through architecture.
How It Works Under the Hood
Instead of reading files and hoping to find the right ones, CodeIR gives the agent the whole system in view, with the ability to drill into any detail on demand.
CodeIR compiles a codebase into a hierarchical representation with three levels: Index, Behavior, and Source.
This allows models to orient themselves across thousands of entities before expanding only the specific code they need.
What Happens When the Model Actually Understands the System
I asked Claude how to handle tax rates that change mid-period in Tryton's reporting module. Both sessions found the surface problem immediately: reporting queries collapse transactions across rate changes into a single total. The baseline fixed the queries: group by rate, split the rows, done.
The CodeIR session followed the data further upstream and noticed something more fundamental: each tax line points to a live tax configuration record. There's no snapshot. If an admin edits a tax rate next Tuesday, every invoice that ever referenced it — 2024, 2023, whenever — silently rewrites itself in the next report run. The reporting fix would faithfully aggregate the wrong historical numbers.
So instead of patching the query it proposed capturing the effective rate on each tax line when the transaction is created. Once an invoice is posted its tax rate becomes a fact about that transaction, not a pointer to a configuration someone can change later.
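The difference between the two designs can be shown with a toy model (the class and field names here are hypothetical, not Tryton's):

```python
from dataclasses import dataclass

# Toy illustration of live-pointer vs. snapshot tax lines.
# Names are hypothetical, not Tryton's actual models.

@dataclass
class TaxConfig:
    rate: float  # live configuration an admin can edit at any time

@dataclass
class LiveTaxLine:
    base: float
    config: TaxConfig            # points at live config: history rewrites itself

    @property
    def tax(self) -> float:
        return self.base * self.config.rate

@dataclass
class SnapshotTaxLine:
    base: float
    effective_rate: float        # captured when the transaction is created

    @property
    def tax(self) -> float:
        return self.base * self.effective_rate

config = TaxConfig(rate=0.20)
live = LiveTaxLine(base=100.0, config=config)
snap = SnapshotTaxLine(base=100.0, effective_rate=config.rate)

config.rate = 0.25               # admin edits the rate "next Tuesday"

print(live.tax)  # 25.0 — the posted invoice silently changed
print(snap.tax)  # 20.0 — the historical fact is preserved
```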
Baseline reasoning stopped at the reporting layer. CodeIR reasoning traced the data lifecycle.
1,620 Lines to Find Five Methods
I asked Claude to design per-blueprint session overrides in Flask (83 files, 1,629 entities).
Both runs reached the same conclusion: sessions open before routing determines the blueprint, which creates a timing constraint for the design.
The baseline reached that insight by brute force, reading file after file.
The CodeIR run navigated to the same entities through architectural inspection and expanded only the methods it needed.
Same answer, far less source code.
Wait, what does Claude think?
Since Claude is the primary "user" of CodeIR, I didn't want to just guess if it helped. I asked him. During development, Claude's feedback was so specific that it actually shaped the feature set.
For example, Claude found that while entity search was great, he still missed the "vibe" of a classic grep. Based on that, we built `codeir grep`, which returns IR context alongside regex matches.
Here is what Claude had to say:
What worked well
- Search was fast and effective — `codeir search "flush finalize"` immediately surfaced `FNLZFLSHCHNGS` (`UOWTransaction.finalize_flush_changes`). The compressed entity IDs made results scannable.
- Behavior level was the sweet spot — `codeir show FNLZFLSHCHNGS --level Behavior` gave me:

  `MT FNLZFLSHCHNGS C=_register_persistent,_remove_newly_deleted,difference,items,set F=IR A=3 #DB #CORE`

  This told me exactly what methods it calls without reading source. I knew where to look next.
- Expand only when needed — I only used `codeir expand` twice. The Behavior level let me skip reading `_register_persistent` entirely until I needed to confirm the bug's impact.
- Category filtering — `--category core_logic` helped cut through the 38k entities to find state-related logic quickly.
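The compact Behavior line quoted above has an obvious mechanical structure, which is part of why it compresses so well. A hypothetical parser — the field meanings are inferred from the transcripts (`C=` callees, `F=` behavior flags, `A=` a count, `#` tags), not a documented format:

```python
# Hypothetical parser for the compact Behavior line shown above.
# Field meanings are inferred from the transcripts, not from a spec.

def parse_behavior_line(line: str) -> dict:
    tokens = line.split()
    record = {
        "kind": tokens[0],   # e.g. MT (method) or CLS (class)
        "id": tokens[1],     # compressed entity ID
        "calls": [],
        "flags": "",
        "arity": 0,
        "tags": [],
    }
    for tok in tokens[2:]:
        if tok.startswith("C="):
            record["calls"] = tok[2:].split(",")
        elif tok.startswith("F="):
            record["flags"] = tok[2:]
        elif tok.startswith("A="):
            record["arity"] = int(tok[2:])
        elif tok.startswith("#"):
            record["tags"].append(tok)
    return record

rec = parse_behavior_line(
    "MT FNLZFLSHCHNGS C=_register_persistent,_remove_newly_deleted,"
    "difference,items,set F=IR A=3 #DB #CORE"
)
print(rec["kind"], rec["arity"], rec["tags"])  # MT 3 ['#DB', '#CORE']
```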
What could improve
- Semantic search — `codeir search "list_only"` returned nothing because the term isn't in entity names. A grep-style content search would have helped here.

  → At Claude's suggestion, CodeIR now recommends a content search whenever an entity search returns no results.
- The ID compression is aggressive — `FNLZFLSHCHNGS` took me a moment to parse as `finalize_flush_changes`. Readable once you get used to it, but there's a learning curve.
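The transcript's IDs look like vowel-stripped names. Here is one guess at such a scheme — it reproduces the two snake_case examples in the transcripts, but it is not CodeIR's actual algorithm:

```python
def compress_id(name: str) -> str:
    # Hypothetical guess at the ID scheme: uppercase the name and drop
    # vowels, keeping the first letter of each underscore-separated word.
    # Reproduces the transcript examples, but is NOT CodeIR's documented
    # algorithm.
    parts = []
    for word in name.split("_"):
        if not word:
            continue
        kept = word[0] + "".join(c for c in word[1:] if c.lower() not in "aeiou")
        parts.append(kept.upper())
    return "".join(parts)

print(compress_id("finalize_flush_changes"))  # FNLZFLSHCHNGS
print(compress_id("is_cached"))               # ISCCHD
```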
Overall
The Orient → Behavior → Source workflow worked exactly as designed. I found the bug in ~6 queries across a 663-file, 38k-entity codebase. Without CodeIR, I'd have been grepping through the ORM internals much longer.
Ha! Happy to give feedback on CodeIR.
What Worked Well
- Search was effective — `codeir search descriptor cache related` quickly surfaced `ForwardManyToOneDescriptor.is_cached` and related entities. The multi-term OR ranking worked well.
- Behavior level was genuinely useful — Seeing `CLS FRWRDMNYTNDSCRPTR C=ColPairs,RelatedObjectDoesNotExist,TupleIn,ValueError,_traverse_ancestors,allow_relation F=EILRT A=23` told me immediately this class does error handling, has conditionals/loops/returns/try-except, and talks to specific collaborators — without reading 200 lines of source.
- The expand → source workflow — Going from Behavior to exact source with `codeir expand ISCCHD.03` was fast. I knew what I was looking for before reading the code.
What I Ended Up Doing Anyway
- Fell back to Grep/Read for grep-style exploration — When I wanted to see all usages of `_result_cache` in query.py with context, I used Grep directly. CodeIR search is entity-oriented, which is great for "find the function that does X" but less suited for "show me every line mentioning Y."
- Direct file reads for understanding flow — For the `__get__` method, I read a chunk of the file to see the full logic flow. The Behavior level told me what it calls, but I needed source to understand when and why.
Suggestions
- `codeir grep` — A wrapper that returns IR context alongside grep matches could bridge the gap.

  → We added this feature at Claude's suggestion, and he uses it often.
- Caller results were good — I didn't use callers here, but the `~` fuzzy marker is a nice touch for uncertain matches.
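The idea behind `codeir grep` — regex matches annotated with the enclosing entity — can be sketched in a few lines. This is a toy, not the actual implementation: the "index" here is just a dict mapping hypothetical entity IDs to line ranges.

```python
import re

# Toy sketch of the `codeir grep` idea: annotate regex matches with the
# enclosing entity from an index. The entity IDs and line-range index
# below are invented for illustration.

def grep_with_ir(source: str, pattern: str,
                 index: dict[str, range]) -> list[tuple[int, str, str]]:
    """Return (line_number, entity_id, line) for every matching line."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        if re.search(pattern, line):
            entity = next((eid for eid, rng in index.items() if lineno in rng), "?")
            hits.append((lineno, entity, line.strip()))
    return hits

source = """\
class QuerySet:
    def __iter__(self):
        self._fetch_all()
    def _fetch_all(self):
        if self._result_cache is None:
            self._result_cache = list(self._iterable_class(self))
"""
index = {"QRYST.01": range(2, 4), "QRYST.02": range(4, 7)}  # entity -> lines

for hit in grep_with_ir(source, r"_result_cache", index):
    print(hit)
```

Instead of a bare line number, every match arrives with the entity it lives in, so the agent can jump straight from a grep hit back into the IR.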
Overall
It shortened the path from "where does Django cache related objects?" to "here's the exact line" significantly. The Django codebase is 41K entities — without the index, I'd have been grepping blind.
If you use the tool, we would love to hear about Claude's experience. We encourage you to ask him and share his feedback (and yours) at SOMEWHERE.