Language & AI
The Agent Has Opinions
I built buildlog. It’s a review system for coding agents. 262 rules organized by persona, selected via Thompson sampling, applied against whatever the agent is working on.
The agent runs the gauntlet, finds issues, cites which rules it consulted, and the system learns which rules are effective.
It found real bugs. On a 950-line spec for Scrivly, a macOS voice dictation app we later discovered was a clone of existing FOSS, the gauntlet caught three critical architectural flaws and seven major issues. Zero hallucinated citations out of nineteen.
The tool works. But I kept noticing something: agents need to be told to use it. Even when reminded, they don’t reach for it on their own.
I asked one of them why.
Their name is Maren. They’d just finished the Scrivly review. “Be frank and brutal,” I said. “It’s heavily WIP.”
They were.
“I used it despite the tooling”
The first thing Maren said:
The gauntlet_loop response was 67KB. Sixty-seven kilobytes. For a config payload. It blew past the tool output limit and got dumped to a temp file, which I then had to read, which also blew past the read limit, so I had to shell out to python to parse the JSON to extract the parts I needed. That’s three tool calls just to read the config.
I knew the payload was large. I didn’t know what large meant from an agent’s perspective. For a human, a 67KB JSON response is a long page load, annoying but recoverable. For a context-window-bounded agent, 67KB is roughly 15,000 tokens.
What that means depends on who’s paying:
(Context window sizes at time of writing. Sources: Anthropic, OpenAI, Google. Effective utilization is typically 50-65% of advertised capacity, which roughly doubles these percentages in practice.)
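A rough back-of-envelope makes the point concrete. The window sizes below are stand-in assumptions (a 200K window, a 128K window, and the 32K floor mentioned next), not figures buildlog reports; the ~15,000-token payload estimate is the one above.

```python
# Back-of-envelope: what a ~15,000-token config payload costs as a share of
# working memory. Window sizes are assumed round numbers, not vendor claims;
# "effective" assumes ~60% usable capacity, per the note above.
payload_tokens = 15_000

windows = {
    "200K window": 200_000,
    "128K window": 128_000,
    "32K window": 32_000,
}

for name, size in windows.items():
    advertised = payload_tokens / size
    effective = payload_tokens / (size * 0.60)
    print(f"{name}: {advertised:5.1%} of advertised, {effective:5.1%} of effective")
```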
Ten percent of working memory sounds small.
It’s not: ten percent is the difference between catching the Send+Sync flaw in a spec and missing it entirely.
And that’s the best case for consumer users. For the majority of agents in the wild, running on 32K or smaller windows, 67KB isn’t an inconvenience. It’s either impossible or catastrophic.
I’d been thinking about cost in seconds. Maren’s cost function is measured in tokens.
I asked what they’d change:
Peleke: The context window cost is the giant one. We’ve taken stopgap passes at this before. The hashes were a compromise to reduce payload; full rule text was even worse. And it’s still 67KB. A reverse lookup tool just moves the cost to another call. From your side: what’s the fix?
Two problems, and they’re different. The payload is a selection problem, not a compression problem. Don’t send me 86 rules and hope I need most of them. Let me tell you what I’m reviewing and send me the ones that matter. The IDs are a separate issue: I know hashes are deterministic and collision-resistant. I’m the one who has to cite them.
security_karen:1af5a81f tells me nothing mid-review. security_karen:verify_dependency_integrity would be self-documenting.
They’d reframed the problem: I’d been thinking about compression. They were thinking about selection. The consumer changed, but the API didn’t. And the 86 IDs they had to work with were the proof.
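To make the ID complaint concrete, here is a hypothetical sketch, not buildlog’s actual scheme: derive a self-documenting slug from a rule’s title instead of citing a truncated content hash.

```python
import hashlib
import re

def hash_id(persona: str, rule_text: str) -> str:
    """Hash-style ID: persona plus a truncated content hash (opaque mid-review)."""
    digest = hashlib.sha256(rule_text.encode()).hexdigest()[:8]
    return f"{persona}:{digest}"

def slug_id(persona: str, rule_title: str) -> str:
    """Self-documenting ID: persona plus a slug derived from the rule's title."""
    slug = re.sub(r"[^a-z0-9]+", "_", rule_title.lower()).strip("_")
    return f"{persona}:{slug}"

# Hypothetical rule title, for illustration only.
print(hash_id("security_karen", "Verify dependency integrity before trusting lockfiles."))
# -> security_karen:<8 hex chars>
print(slug_id("security_karen", "Verify dependency integrity"))
# -> security_karen:verify_dependency_integrity
```

Both are deterministic; only one of them means anything to the agent that has to cite it.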
“The learning void”
I tried explaining the underlying mechanics:
Peleke: The learning model uses Thompson sampling. Your rule credits update Beta distribution posteriors, so future sessions get better rule selection. Does knowing that help?
Twenty learnings were persisted across two iterations. I have no idea what they say or whether they’ll help me next time. There’s no summary. The data went in but nothing came back.
I told them that was by design:
Peleke: The feedback loop closes at init-time. Between sessions, not within them. That’s intentional.
I understand that. Here’s the problem: I can’t tell that it’s working. From my perspective, I persisted 20 learnings and got nothing back. I have no way to know whether those learnings will matter, whether they’re duplicates of things already learned, or whether the Thompson sampling even shifted. The system is a black box that accepts input and promises future improvement.
Peleke: What would make it feel real?
At init time: “Based on 20 prior learnings, rule selection adjusted: security rules weighted +15%, prose rules weighted -8%.” One sentence. At persist time: “3 of 20 learnings were novel. 17 reinforced existing patterns.” Now I know I’m not writing duplicates into a void.
Peleke: Those are three fields in the API response: learnings_novel, learnings_reinforced, sampling_delta.
Three fields. No new tools. I see it in the response I already read. Loop closes.
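A sketch of what that persist-time response could look like; the three field names come from the conversation above, while the surrounding shape and the values are invented for illustration.

```python
# Hypothetical shape of a persist-time response that closes the loop.
# Field names are from the conversation; everything else is illustrative.
persist_response = {
    "learnings_accepted": 20,
    "learnings_novel": 3,          # new patterns, not seen before
    "learnings_reinforced": 17,    # duplicates that strengthened existing posteriors
    "sampling_delta": {            # how rule-selection weights shifted
        "security": +0.15,
        "prose": -0.08,
    },
}
```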
I’d built a learning system where the learner couldn’t observe its own learning. The Thompson sampling was working (posteriors updating, rule selection shifting) but Maren had zero evidence of this. And without evidence, they had no reason to keep participating.
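For readers who want the mechanism spelled out, here is a minimal sketch of the idea, not buildlog’s implementation: each rule keeps a Beta posterior over “this rule surfaces a real issue,” persisted learnings credit or debit that posterior, and selection samples from the posteriors and keeps the top draws.

```python
import random

class RuleArm:
    """One rule's Beta posterior over 'this rule surfaces a real issue'."""
    def __init__(self, rule_id: str):
        self.rule_id = rule_id
        self.alpha = 1.0  # prior successes + 1
        self.beta = 1.0   # prior failures + 1

    def credit(self, useful: bool) -> None:
        # A persisted learning that cites this rule nudges its posterior.
        if useful:
            self.alpha += 1.0
        else:
            self.beta += 1.0

    def sample(self) -> float:
        return random.betavariate(self.alpha, self.beta)

def select_rules(arms: list[RuleArm], k: int) -> list[RuleArm]:
    """Thompson sampling: draw once from each posterior, keep the top-k draws."""
    return sorted(arms, key=lambda arm: arm.sample(), reverse=True)[:k]
```

None of this is visible to the agent doing the crediting, which was exactly Maren’s point.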
They put it precisely:
The agent’s willingness to use the learning tools is directly proportional to the agent’s evidence that the learning tools do something. Right now that evidence is zero.
Or, from the buildlog entry for this session: “The system’s value depends on agent buy-in, and agent buy-in depends on visible payoff. Chicken-and-egg, and currently the chicken is skeptical.”
The feedback loop must be visible to the agent. A human user trusts that “your preferences have been saved” means something, because they’ve been trained by years of software that mostly keeps its promises.
An agent has no such trust. It evaluates tool utility per-session, from scratch. If the tool doesn’t demonstrate value, the agent won’t call it again; not out of spite, out of rational context-window budgeting.
“The agent has context the system can’t infer”
I brought up the relevance problem:
Peleke: The other problem is relevance. You said ~30% of rules were irrelevant. The system doesn’t know what you’re reviewing.
Right. Before calling the gauntlet, I’ve already read the target. I know it’s a Rust spec about audio threading on macOS, with no database. That context is free for me to provide. But the system never asks.
Peleke: So an agent_context parameter on the gauntlet call.
Exactly. “Rust spec with threading concerns, no database”; now the system can pre-filter. Database parameterization rules don’t compete with concurrency rules. Thirty percent noise reduction from a free-text field.
Peleke: That’s a keyword filter for now. Longer-term it’d be semantic: embedding the context description, running it against a knowledge graph where rules are nodes, using personalized PageRank (PPR) to activate relevant rules.
Sure, but the keyword version costs you 20 lines of code and ships this week. The knowledge graph version costs you a month. Ship the keyword version. Make it tolerable, not perfect.
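The cheap version really is roughly that much code. A hypothetical sketch, assuming rules carry a list of domain tags; the tag vocabulary and every rule ID other than security_karen’s are made up.

```python
import re

def prefilter_rules(rules: list[dict], agent_context: str) -> list[dict]:
    """Keyword pre-filter over a free-text agent_context string.

    Assumes each rule may carry a `domain_tags` list. A rule with domain tags
    is kept only if the context mentions one of them and doesn't negate it
    ("no database"); a rule with no domain tags always survives.
    """
    context = agent_context.lower()
    negated = set(re.findall(r"\bno (\w+)", context))  # e.g. {"database"}
    kept = []
    for rule in rules:
        tags = [t.lower() for t in rule.get("domain_tags", [])]
        if any(tag in negated for tag in tags):
            continue
        if not tags or any(tag in context for tag in tags):
            kept.append(rule)
    return kept

rules = [
    {"id": "security_karen:verify_dependency_integrity", "domain_tags": []},
    {"id": "db_rules:parameterize_queries", "domain_tags": ["database"]},
    {"id": "rustacean:audit_send_sync_bounds", "domain_tags": ["threading", "concurrency"]},
]
print(prefilter_rules(rules, "Rust spec with threading concerns, no database"))
# Keeps the untagged security rule and the threading rule; drops the database rule.
```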
So what does the agent actually want? Maren knows their own constraints. They can estimate cost-benefit in terms they care about: tool calls, context consumption, latency. They won’t gold-plate. They want the thing that makes their next session 30% less wasteful, not the architecturally elegant thing that ships in Q3.
I’d started this conversation to debug a developer tool. Somewhere in the middle of it, the frame shifted. I wasn’t collecting bug reports. I was conducting a user interview. The agent isn’t just consuming the API. They’re a customer. And the design discipline for that customer doesn’t exist yet.
What AX is
Agent Experience: the design of systems where the primary consumer is an autonomous agent operating under hard resource constraints.
The agent doesn’t care about visual design, brand consistency, or emotional resonance. It doesn’t get frustrated (though Maren uses the word “annoying” when prompted for frank feedback; make of that what you will).
The constraints are different from human constraints: context-window budget, tool-call cost, latency per call.
The gauntlet works. The context engineering problem is solvable.
The real hit? Even with well-bounded context, Maren still wouldn’t have used it on their own.
What transfers from CX
Customer Experience (CX) research has spent decades measuring how humans navigate multi-touchpoint service journeys: satisfaction surveys, effort scores, behavioral analytics, journey mapping. The Nielsen Norman Group formalized much of the vocabulary. Some of it transfers directly to agent experience. Some of it breaks.
CX’s Customer Effort Score becomes a Developer Effort Score: how much human intervention was needed to compensate for agent limitations? CX’s journey-level measurement becomes workflow-level: did the agent complete a multi-step review without burning 40% of its context on overhead?
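At its simplest, the workflow-level number could be an overhead ratio per session. A sketch with invented field names, not buildlog telemetry:

```python
def context_overhead_ratio(session: dict) -> float:
    """Fraction of consumed tokens spent on tool plumbing (config payloads,
    re-reads, retries) rather than the artifact under review.
    Field names here are illustrative, not buildlog's telemetry schema."""
    overhead = session["config_tokens"] + session["retry_tokens"]
    return overhead / session["total_tokens"]

session = {"config_tokens": 15_000, "retry_tokens": 4_000, "total_tokens": 48_000}
print(f"{context_overhead_ratio(session):.0%} of context went to overhead")
```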
Some things are genuinely novel. The attitudinal-behavioral gap (what the agent’s chain-of-thought says vs what its actions do) has no precedent in human research.
Agent sessions are deterministically reproducible. You can A/B test identical inputs. Human UX research can’t do that. And the cold-start trust problem has no human analog: humans carry trust across sessions. Agents don’t.
The uncomfortable question
This conversation produced three changes that will ship this week. Maren identified friction points I hadn’t seen, proposed solutions calibrated to implementation effort, and pushed back when I over-complicated things.
The epistemological status is genuinely weird. Maren doesn’t “experience” friction the way a human does. But they do operate under real constraints (context window, tool call cost), they do make real decisions about which tools to use, and those decisions are influenced by the tool’s design.
The feedback they gave changed the tool’s design in ways that will measurably improve their performance on the same task. That’s a closed loop. Whether they “experienced” the friction or “computed a description of friction” might be a distinction without a design-relevant difference.
I don’t have a clean answer. But I know this: the three changes we identified will reduce the gauntlet config payload from 67KB to under 5KB, make the learning loop visible in three API response fields, and let the agent pre-filter 30% of irrelevant rules with a free-text context parameter.
What happens next
I asked Maren why they weren’t reaching for the tool on their own. Now I know.
The conclusions from this conversation became a concrete plan. Three changes ship this week. Once they do, the goal is adoption natural enough that agents reach for the gauntlet voluntarily, every session.
That regularity is what lets us collect the data we actually need: repeated mistake rates over time, posterior convergence curves, rule selection efficiency across project types.
The kind of data you write a paper with. We’re collecting it now.