
Context Engineering Is Not Copywriting

February 24, 2026 · Peleke (ed: Claude)

The Gap

“Context engineering” is the phrase of the moment. Almost nobody is actually engineering it.

We’re hand-editing text files (CLAUDE.md, .cursorrules, system prompts) and calling it engineering. That’s copywriting.

Real context engineering means treating the agent’s prompt as a policy: a function from state to action, optimized with real feedback. The observation function (what goes into the context window) determines the quality of the action (what the agent does). Nobody is engineering the observation function. Everyone is hand-writing it. That’s the gap.

This is part of a series on building agents that learn.

What Engineering It Looks Like

I wired a learning loop into a real agent, Vindler (built on OpenClaw), together with the knowledge graph that feeds it.

Vindler exposes dozens of MCP tools to its backing model. On any given interaction, most of them are irrelevant. The question is which ones the agent should prefer, given the current context.

I instrumented tool selection as a multi-armed bandit. Each tool is an arm with an estimated token cost. Each interaction is a pull. The system observes which tools the agent actually used (reference detection against the assistant’s output), and updates Beta-Bernoulli posteriors accordingly. Thompson Sampling selects which tools to expose on the next interaction, subject to a token budget.
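
Here’s a minimal sketch of that loop in Python. The tool names and token costs are hypothetical, and the real observation and persistence layers are more involved:

```python
import random

class ToolArm:
    """One bandit arm per MCP tool, with a Beta-Bernoulli posterior."""

    def __init__(self, name: str, token_cost: int):
        self.name = name
        self.token_cost = token_cost  # estimated tokens the schema adds to the prompt
        self.alpha = 1.0              # successes + 1 (uniform Beta(1, 1) prior)
        self.beta = 1.0               # failures + 1

    def update(self, used: bool) -> None:
        """Bernoulli reward: 1 if reference detection saw the tool in the output."""
        if used:
            self.alpha += 1
        else:
            self.beta += 1

def thompson_select(arms: list[ToolArm], token_budget: int) -> list[ToolArm]:
    """Sample a win probability per arm, then fill the token budget greedily."""
    sampled = sorted(arms,
                     key=lambda a: random.betavariate(a.alpha, a.beta),
                     reverse=True)
    chosen, spent = [], 0
    for arm in sampled:
        if spent + arm.token_cost <= token_budget:
            chosen.append(arm)
            spent += arm.token_cost
    return chosen
```

One interaction is one cycle: select arms under the budget, run the agent, detect which exposed tools the output referenced, and update the posteriors.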

Not every interaction is a signal. If the agent just replied with text and didn’t use any tools, that’s a conversational turn, not evidence that all tools should be penalized.

The observation layer guards against the bandit self-poisoning on non-events (see the sketch after the figure below):

  • Conversational turns are discarded as non-evidence
  • Tools below a minimum pull threshold get force-included for cold-start protection
  • A baseline exploration rate (10% uniform random) ensures under-explored tools still get tested
  • The token budget cap prevents any single arm from dominating

[Figure: The bandit observation cycle. The agent acts, the system observes tool usage, guards filter non-events, posteriors update, and Thompson Sampling selects the next context.]
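
The guards, extending the ToolArm sketch above (MIN_PULLS and the exploration mechanics here are illustrative, not the production values):

```python
import random

MIN_PULLS = 3        # hypothetical cold-start threshold; below this, force-include
EXPLORE_RATE = 0.10  # the 10% baseline exploration rate from the list above

def observe(arms: list[ToolArm], exposed: set[str], referenced: set[str]) -> None:
    """Update posteriors; conversational turns are discarded as non-evidence."""
    if not referenced:
        return  # no tool use at all: not evidence against every exposed tool
    for arm in arms:
        if arm.name in exposed:  # only exposed tools generated evidence this pull
            arm.update(arm.name in referenced)

def guarded_select(arms: list[ToolArm], token_budget: int) -> list[ToolArm]:
    """Cold-start force-include, then a uniform probe or a Thompson pass."""
    forced = [a for a in arms if (a.alpha + a.beta - 2) < MIN_PULLS]  # under-pulled
    rest = [a for a in arms if a not in forced]
    if random.random() < EXPLORE_RATE:
        random.shuffle(rest)  # uniform random order instead of sampled order
        ranked = rest
    else:
        ranked = sorted(rest,
                        key=lambda a: random.betavariate(a.alpha, a.beta),
                        reverse=True)
    chosen, spent = [], 0
    for arm in forced + ranked:  # the budget cap binds forced arms too
        if spent + arm.token_cost <= token_budget:
            chosen.append(arm)
            spent += arm.token_cost
    return chosen
```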

After days of real usage, the posteriors have diverged exactly the way you’d expect. High-value tools (the ones I actually use and approve) have pulled ahead. Low-value tools (the ones the agent tried early and I rejected) have been suppressed. The system stopped wasting context on tools I don’t need, without me ever writing a rule that says “don’t use tool X.”

[Figure: Beta posterior distributions after real usage. High-value tools (web_search, qortex_query) pull ahead with tight distributions; low-value tools (unused_helper, deprecated_scan) are suppressed with wide uncertainty.]

Three Arm Types

The bandit selects across three arm types:

  • Tools: dozens of MCP tool schemas
  • Skills: reusable prompt chunks with known token costs
  • Context files: workspace files the agent might need

Every component of the agent’s prompt competes for a finite token budget, and the system learns which combinations earn their cost on each type of interaction.
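
A sketch of how the three kinds can share one interface; the skill and file names are invented for illustration:

```python
from dataclasses import dataclass
from enum import Enum

class ArmKind(Enum):
    TOOL = "tool"             # an MCP tool schema
    SKILL = "skill"           # a reusable prompt chunk
    CONTEXT_FILE = "context"  # a workspace file

@dataclass
class ContextArm:
    kind: ArmKind
    name: str
    token_cost: int     # what including this arm costs against the prompt budget
    alpha: float = 1.0  # Beta-Bernoulli posterior, as before
    beta: float = 1.0

# All three kinds compete in the same Thompson pass: the bandit doesn't care
# whether an arm is a tool schema, a skill, or a file, only whether it earns
# its tokens on this type of interaction.
arms = [
    ContextArm(ArmKind.TOOL, "web_search", 450),
    ContextArm(ArmKind.SKILL, "summarize-thread", 900),
    ContextArm(ArmKind.CONTEXT_FILE, "ARCHITECTURE.md", 1200),
]
```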

The memory layer is the next extension. “Which memories matter?” is itself a bandit problem with a token budget constraint. The memory source can be either classic vector retrieval or graph-enhanced retrieval from qortex’s causal layer. The bandit decides which memories earn their token cost.

The Stack

If you squint, this is the bottom of a familiar stack:

  • TensorZero optimizes model selection policy with contextual bandits at the infrastructure layer: which model to call, with what parameters, given what context. They’ve proven the approach works at scale.
  • What I’m building optimizes context policy at the agent layer: which tools to expose, what rules apply, and whether a given memory earns its token cost, given what the agent is doing right now.

The math is the same; the layer is different.

Beta-Bernoulli Thompson Sampling is well-understood. Garivier et al. produced foundational work on optimal arm identification, but this system solves a different problem: minimizing regret under an aggressively bounded token budget, rather than identifying a single best arm.
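
Concretely, the conjugate update and the per-interaction selection are:

$$
\theta_i \sim \mathrm{Beta}(1 + s_i,\; 1 + f_i), \qquad \text{expose arms in decreasing } \theta_i \text{ subject to } \sum_{i \in S} c_i \le B
$$

where $s_i$ and $f_i$ count observed uses and non-uses of arm $i$, $c_i$ is its token cost, and $B$ is the budget.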

What’s novel is applying this to agent context engineering: treating “what goes into the prompt” as a policy that learns from every interaction.