Field Notes
16,000 Lines of Wrong
This is part of a series on building agents that learn.
The Scope
buildlog was the exploration.
16,000 lines of Python. The project started as a question: what if you could capture every engineering decision you make, extract patterns from the record, and enforce the good ones automatically? A full pipeline: structured capture of engineering decisions and bandit-based enforcement of extracted patterns via git and Claude hooks.
The idea was right: I use it every day. Most of the implementation was… experimental. But wrong code is only wasted if you don’t learn from where it breaks.
What Survived
Three things came out of it.
1. The Gauntlet Was Right
The gauntlet survived. A rule-based review system where custom rules gate PRs and commits. No LLM in the loop. You define rules; the rules check your code; the rules block or pass.
The gauntlet works because it’s boring. Rules are explicit and deterministic. You can read them and test them.
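A minimal sketch of that shape, assuming a pre-commit entry point (the rule names and the `Rule`/`run_gauntlet` API here are hypothetical illustrations, not buildlog's actual code):

```python
import re
import sys
from dataclasses import dataclass
from typing import Callable

@dataclass
class Rule:
    """A deterministic check: no LLM, no heuristics, just a predicate."""
    name: str
    check: Callable[[str], bool]  # returns True if the staged diff passes
    message: str

# Illustrative rules -- not buildlog's real rule set.
RULES = [
    Rule(
        name="no-print-debugging",
        check=lambda diff: not re.search(r"^\+.*\bprint\(", diff, re.MULTILINE),
        message="remove print() debugging before committing",
    ),
    Rule(
        name="no-todo-without-ticket",
        check=lambda diff: not re.search(r"^\+.*TODO(?!\(#\d+\))", diff, re.MULTILINE),
        message="every TODO needs a ticket reference, e.g. TODO(#123)",
    ),
]

def run_gauntlet(diff: str) -> int:
    """Run every rule against the diff; a nonzero exit blocks the commit."""
    failures = [r for r in RULES if not r.check(diff)]
    for r in failures:
        print(f"gauntlet: {r.name}: {r.message}", file=sys.stderr)
    return 1 if failures else 0
```

Wired into a `pre-commit` hook, the output of `git diff --cached` goes in and an exit code comes out. Readable, testable, and nothing to argue with.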
It runs in production and enforces the learning loop mechanically via git hooks: every commit passes through the gauntlet before it lands.
That piece was always the simplest part of the system. Turns out the simplest part was the only part that was right. The gauntlet still runs. The template is at v0.20.0 and still shipping updates.
2. Rule Derivation Was Wrong
The hard part: taking raw decision-outcome pairs, extracting patterns, then deciding which ones earn their token cost in the next session.
No sh*t: this is f*cking hard, and no one is doing it.
The LLM-as-judge instrumentation was a distraction. Routing every commit through a language model to “evaluate quality” sounded good in the design doc. In practice, it added latency to every commit and produced ratings too inconsistent to learn from.
Automated seed extraction was the second mistake. buildlog tried to extract “engineering patterns” from journal entries using LLM summarization. The results were plausible-sounding rules that hadn’t been validated against real outcomes. Plausible and correct are different things.
Both features shared the same flaw: they used LLM calls where structured data would have been cheaper and more reliable.
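What “structured data” means here, sketched as a plain record (the field names are mine, not buildlog’s actual schema): capturing this takes no LLM call, and downstream learning can count outcomes directly instead of parsing prose.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass(frozen=True)
class DecisionOutcome:
    """Hypothetical shape of a structured decision-outcome pair.

    Field names are illustrative, not buildlog's schema.
    """
    decision: str          # what was chosen, in one line
    alternative: str       # what was rejected
    context: str           # where it applied (file, module, task)
    outcome: bool          # did the decision hold up?
    recorded_at: datetime
```

A record like this is cheap to write at decision time and unambiguous to aggregate later; a paragraph of LLM summary is neither.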
The qortex architecture exists because of what buildlog got wrong. Thompson Sampling replaced LLM-based evaluation. Instead of LLM-summarized patterns, we used structured decision-outcome pairs. The flat rule store gave way to a knowledge graph. Each correction came from watching buildlog fail at a specific task and engineering a solution to something the LLM had previously simply “filled in.”
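A hedged sketch of Thompson Sampling in this role (the `RuleArm` names and budget mechanics are my illustration, not qortex’s actual code): each rule keeps a Beta posterior over “including this rule helped the session,” and each session draws from the posteriors to decide which rules earn their token cost.

```python
import random
from dataclasses import dataclass

@dataclass
class RuleArm:
    """Beta-Bernoulli posterior over 'including this rule helped'."""
    name: str
    successes: int = 1  # Beta(1, 1) uniform prior
    failures: int = 1

    def sample(self) -> float:
        """Draw one plausible success rate from the posterior."""
        return random.betavariate(self.successes, self.failures)

    def update(self, helped: bool) -> None:
        """Fold the observed session outcome back into the posterior."""
        if helped:
            self.successes += 1
        else:
            self.failures += 1

def select_rules(arms: list[RuleArm], budget: int) -> list[RuleArm]:
    """Thompson Sampling: sample each posterior, keep the top `budget` draws."""
    return sorted(arms, key=lambda a: a.sample(), reverse=True)[:budget]
```

After each session, every included arm gets an `update` with the observed outcome. Rules that keep failing drift out of the budget on their own, with no LLM judging anything.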
Make It Look Easy covers the protocol architecture that emerged from this process.
3. The Entries Were Right
The readable logs survived. buildlog’s journal entries, structured markdown capturing what happened and why, turned out to be the single output worth keeping from the entire project.
They fed the writing, not the extraction pipeline.
Forty-two entries over four months, and dozens of them became published articles.
The entries are what taught us to write about this work. This very article exists because buildlog made “write down what you did and what you learned” a mechanical habit enforced by git hooks. Once the entries existed, the writing followed.
That pipeline kept going:
We didn’t stop running it. Logging a build journal was low-friction enough that the pipeline scaled: entries became blog posts, blog posts became fodder for LinWheel distribution. No step requires starting from scratch.
The Shape of the Thing
Capture decisions, make them readable, let the writing emerge from the record.
The idea was right. Most of the implementation was wrong. Wrong code teaches you the right abstractions if you pay attention to where it breaks.