A Measurement Thesis

I've been obsessed with some questions lately:

  • Can agents learn to optimize their own context?
  • How faithfully do they reason in practice?
  • What does post-deployment monitoring through the agent lifecycle look like?
  • How can it be instrumented for rigorous experimentation?

The surface area of agentic computing is scaling fast, evidenced most clearly by OpenAI's adoption of OpenClaw-style agents. Yet little standardized instrumentation exists to measure and improve agentic behavior and decision-making.

To close the gap, I forked OpenClaw and instrumented it with a principled learning + memory backend that frames context engineering as a reinforcement learning problem in a stationary environment. The stationarity assumption is temporary and deliberate: it makes the math tractable now, and relaxing it is an explicit next step.

The result is a five-layer instrumented agent architecture, impudently called qlawbox. It's unfinished, but it runs; it's principled; and it's observable.

qlawbox

Answering questions about agent behavior requires measuring agents in realistic, messy environments. qlawbox is a five-layer architecture built to do just this.

Each layer functions independently. When composed, they form a closed feedback loop whose behavior can be observed, internal dynamics traced, and interventions measured.

  01 · Knowledge + Learning — qortex → projects rules
  02 · Observability — qortex-observe ← instruments the core
  03 · Runtime — vindler + bilrost → executes, dispatches
  04 · Nervous System — cadence → routes state, drift
  05 · Interoception — interoception → affect signals
  ↑ feedback — external signals, RMR, tokens, convergence

Each layer emits structured events. The feedback loop returns observations to the learning layer. Everything is traceable.

01

Knowledge + Learning — qortex

qortex frames context engineering as a learning problem. System prompt components, such as tools and other injected context, are modeled as arms of a contextual bandit. Per query, the agent Thompson-samples from Beta posteriors to decide what belongs in context; task feedback then updates those posteriors.
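The selection loop can be sketched in a few lines. This is a minimal illustration of Beta-Bernoulli Thompson Sampling over prompt components, not qortex's actual API; the names `Arm` and `select_components` are hypothetical.

```python
import random

class Arm:
    """One system prompt component, modeled as a Beta-Bernoulli bandit arm."""

    def __init__(self, name):
        self.name = name
        self.alpha = 1.0  # uniform Beta(1, 1) prior: successes + 1
        self.beta = 1.0   # failures + 1

    def sample(self):
        # Thompson Sampling: draw once from the current Beta posterior.
        return random.betavariate(self.alpha, self.beta)

    def update(self, reward):
        # Bernoulli feedback: 1 = component helped, 0 = it did not.
        self.alpha += reward
        self.beta += 1 - reward

def select_components(arms, k):
    # Include the k components whose posterior draws rank highest.
    return sorted(arms, key=lambda a: a.sample(), reverse=True)[:k]

arms = [Arm("tool_docs"), Arm("style_guide"), Arm("examples")]
chosen = select_components(arms, k=2)
for arm in chosen:
    arm.update(reward=1)  # feedback after the task completes
```

Components that keep earning reward concentrate their posteriors near 1 and get selected more often; the rest drift toward exclusion.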

Beneath the bandit sits an experimental causal knowledge graph with typed edges, generated online via qortex-ingest. Feedback credit propagates backward through ancestor concepts, decaying by hop distance and edge strength.
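Hop-decayed credit assignment can be sketched as a recursive walk over ancestor edges. The graph shape, decay constant, and function name below are illustrative assumptions, not qortex's actual schema.

```python
def propagate_credit(graph, concept, reward, decay=0.5, credit=None):
    """graph maps concept -> list of (ancestor, edge_strength) pairs."""
    if credit is None:
        credit = {}
    credit[concept] = credit.get(concept, 0.0) + reward
    for ancestor, strength in graph.get(concept, []):
        # Credit shrinks by edge strength and by one decay factor per hop.
        propagate_credit(graph, ancestor, reward * strength * decay, decay, credit)
    return credit

graph = {
    "beta_posterior": [("bayesian_inference", 0.9)],
    "bayesian_inference": [("probability", 0.8)],
}
credit = propagate_credit(graph, "beta_posterior", reward=1.0)
# Ancestors receive progressively less credit the further they sit
# from the concept that actually earned the reward.
```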

qortex-online handles real-time indexing into Memgraph. Preliminary results suggest positive impact on retrieval performance with minimal overhead.

7 framework adapters, each passing its framework's own test suite: CrewAI, Agno, AutoGen, LangChain, LangChain.js, Mastra, Vindler (fork: OpenClaw).

02

Observability — qortex-observe

Every selection, observation, and posterior update in the learning layer emits a structured event. qortex-observe routes those events to three export paths:

  • JSONL file sinks for offline replay
  • OpenTelemetry spans for distributed tracing via Jaeger
  • Prometheus counters/gauges/histograms for real-time Grafana dashboards

MCP tracing middleware wraps each tool call with distributed trace context, so a single query can be traced from inbound message through graph traversal, vector retrieval, posterior sampling, and credit propagation.
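The JSONL path is the simplest of the three sinks to picture. A minimal sketch follows; the event fields (`event_id`, `ts`, `kind`) are assumptions about what qortex-observe might record, not its actual schema.

```python
import io
import json
import time
import uuid

def emit_event(sink, kind, payload):
    """Append one structured event as a single JSON line."""
    event = {
        "event_id": str(uuid.uuid4()),
        "ts": time.time(),
        "kind": kind,  # e.g. "selection", "observation", "posterior_update"
        **payload,
    }
    sink.write(json.dumps(event) + "\n")
    return event

sink = io.StringIO()  # stands in for an append-only JSONL file
emit_event(sink, "selection", {"arm": "tool_docs", "sampled": 0.83})
emit_event(sink, "posterior_update", {"arm": "tool_docs", "alpha": 2.0, "beta": 1.0})

# Offline replay: each line parses back into a structured event.
events = [json.loads(line) for line in sink.getvalue().splitlines()]
```

The same events, fanned out to OpenTelemetry spans and Prometheus metrics, back the Jaeger and Grafana views described later.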

03

Runtime — vindler + bilrost

vindler is a hardened fork of OpenClaw, rewired as the qlawbox agent runtime.

The fork replaces OpenClaw's memory layer with qortex and instruments every tool call and retrieval path through qortex-observe.

bilrost (PyPI) locks the runtime inside a Lima VM with OverlayFS for filesystem isolation and UFW NACLs for network policy. Network access is network-as-needed: tools declare their requirements and bilrost opens only those ports for the duration of the call.
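The network-as-needed idea can be sketched as a context manager that opens declared rules only for the duration of a tool call. The manifest shape, rule strings, and function names here are hypothetical illustrations, not bilrost's API.

```python
from contextlib import contextmanager

# Hypothetical manifest: each tool declares the network access it needs.
TOOL_MANIFEST = {
    "web_fetch": [("out", "443/tcp")],
    "git_sync": [("out", "22/tcp")],
}

def ufw_rules(tool):
    # Render declared requirements as UFW-style allow rules.
    return [f"allow {direction} {port}" for direction, port in TOOL_MANIFEST[tool]]

@contextmanager
def network_window(tool, apply_rule, remove_rule):
    """Open the tool's declared rules, then close them no matter what."""
    rules = ufw_rules(tool)
    for rule in rules:
        apply_rule(rule)  # e.g. shell out to ufw inside the VM
    try:
        yield
    finally:
        for rule in rules:
            remove_rule(rule)  # the window closes even if the tool call fails

opened, closed = [], []
with network_window("web_fetch", opened.append, closed.append):
    pass  # the tool call would run here
```

A tool that declares nothing gets no window at all, which is the default posture.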

The codenames are Norse. In the Eddas, Bifrost (also spelled Bilrost) is the flaming rainbow bridge above the world-tree Yggdrasil, binding Miðgarð to Ásgarð. The only path in, it burns anything unwelcome that dares to cross. Vindler is a byname of Heimdallr, the guardian of heaven's gates.

04

Nervous System — cadence

Typed signal bus for ambient agent behavior.

Rather than waiting for prompts, cadence enables agents to subscribe to event streams and act when conditions are met. It classifies, prioritizes, and dispatches signals to the appropriate handler.

Internal state signals from the interoception layer route alongside external events through the same bus, so the agent's response to its own dynamics is handled by the same dispatch infrastructure as its response to the outside world.
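The single-bus idea can be sketched as a typed, priority-ordered dispatcher. Class and method names are illustrative, not cadence's actual API.

```python
import heapq
from collections import defaultdict

class SignalBus:
    """Typed signal bus: handlers subscribe by signal type; dispatch is
    priority-ordered (lower number = more urgent), FIFO within a priority."""

    def __init__(self):
        self.handlers = defaultdict(list)
        self.queue = []
        self.counter = 0  # tie-breaker so equal priorities stay FIFO

    def subscribe(self, signal_type, handler):
        self.handlers[signal_type].append(handler)

    def publish(self, signal_type, payload, priority=0):
        heapq.heappush(self.queue, (priority, self.counter, signal_type, payload))
        self.counter += 1

    def drain(self):
        while self.queue:
            _, _, signal_type, payload = heapq.heappop(self.queue)
            for handler in self.handlers[signal_type]:
                handler(payload)

bus = SignalBus()
seen = []
# External and internal signals flow through the same dispatch path.
bus.subscribe("file_changed", lambda p: seen.append(("external", p)))
bus.subscribe("posterior_drift", lambda p: seen.append(("internal", p)))
bus.publish("file_changed", "notes.md", priority=1)
bus.publish("posterior_drift", {"arm": "tool_docs"}, priority=0)
bus.drain()
```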

Intended to interop with interoception to trigger spontaneous "speech" in response to shifts in internal state and predictions thereof. Very early stage and highly speculative, but a fundamental part of the long-term research horizon.

05

Interoception — interoception

An analog of biological homeostasis.

The agent monitors its own conservation quantities — posterior stability, convergence rate, reward variance — and flags violations as affect signals: confusion, surprise, dissonance, novelty. Each signal maps to an observable property of the learning dynamics.
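One way to picture the mapping from learning dynamics to affect signals: threshold checks over observable quantities. The thresholds, inputs, and signal names below are speculative placeholders, in keeping with the layer itself.

```python
import statistics

def affect_signals(rewards, drift, novelty_score,
                   var_threshold=0.2, drift_threshold=0.1,
                   novelty_threshold=0.8):
    """Flag violations of expected learning dynamics as affect signals.
    All thresholds are illustrative assumptions."""
    signals = []
    if len(rewards) >= 2 and statistics.variance(rewards) > var_threshold:
        signals.append("surprise")    # reward variance spiked
    if drift > drift_threshold:
        signals.append("dissonance")  # posteriors moving after apparent convergence
    if novelty_score > novelty_threshold:
        signals.append("novelty")     # input far from known concepts
    return signals

flags = affect_signals(rewards=[1, 0, 1, 0], drift=0.05, novelty_score=0.9)
```

Each flag would then be published as an internal signal on cadence's bus, alongside external events.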

This is the most speculative layer. For now, it serves as an architectural placeholder for experimenting with internal signals — pointing cadence's ambient awareness inward, so agents can generate speech based on their perception of internal state and predictions about where that state is heading.

Like cadence, this is very early stage and highly speculative, but it represents a fundamental piece of the long-term research direction: agents that respond not only to the world, but to themselves.


qortex-observe

The qortex ecosystem treats post-deployment monitoring as a first-class concern.

Grafana surfaces posterior dynamics, allowing users to watch the agent learn, identify when it's learning the wrong things, and intervene with tunable reward signals. Jaeger traces latencies from message ingestion through retrieval, concept extraction, and learning subsystem calls. Memgraph Lab provides live inspection of the causal knowledge graph as it's constructed.

Together, these layers provide the kind of infrastructure that Anthropic's own research on measuring agent autonomy identifies as essential: post-deployment monitoring that reveals how agents actually behave in practice, not just what they're capable of in controlled evaluations. Pre-deployment testing cannot surface the dynamics that emerge from sustained agent-user interaction. The infrastructure to observe those dynamics must be built deliberately, and it now exists.

grafana: posterior dynamics
arm convergence + reward rates

Beta posteriors updating in real time. Selection rates, observation outcomes, credit propagation depth.

jaeger: full trace
ingest → cypher → sqlite-vec → learning

End-to-end trace from inbound message through graph queries, vector retrieval, and learning subsystem.

memgraph: knowledge graph
concepts + typed causal edges

Live causal knowledge graph. Concepts, typed edges, credit propagation paths.

Research ⇌ Product Interface

Each of the following runs on qlawbox and generates signal for different aspects of the research program.

LinWheel

autonomous publishing

Paste a transcript; get seven publish-ready LinkedIn posts across distinct rhetorical angles. The publishing agent uses Playwright and HMAC request signing to post articles through an API that doesn't officially support them.
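HMAC request signing itself is standard library territory. A sketch with Python's `hmac`, where the header name, shared secret, and payload shape are illustrative assumptions, not LinWheel's actual scheme:

```python
import hashlib
import hmac
import json

def sign_request(secret: bytes, body: dict) -> dict:
    """Sign a canonical JSON rendering of the body with HMAC-SHA256."""
    payload = json.dumps(body, sort_keys=True).encode()  # stable byte form
    signature = hmac.new(secret, payload, hashlib.sha256).hexdigest()
    return {"X-Signature": signature, "Content-Type": "application/json"}

headers = sign_request(b"shared-secret", {"post": "Seven angles, one transcript."})
```

Canonicalizing with `sort_keys=True` matters: the verifier must hash the exact same bytes, so key order cannot be left to chance.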

The test: can qlawbox drive a content pipeline from signal detection through scheduling and direct publishing, without human editing?

Swae OS

federated intelligence

Federated GraphQL microservice architecture. Three services behind a Hive gateway: user management, workout/habit tracking, and a hybrid RAG knowledge service. qortex plugs in behind the federation and provides structured retrieval across independent services.

The AI coaching layer is architecture-ready. When active, qortex provides the knowledge backend: structured retrieval across workout data, habit patterns, and journal entries through one federated graph.

Interlinear

adaptive learning

Language tutor with a 5-category error taxonomy (mechanical, lexical, morphological, syntactic, semantic). When you write puellam instead of puella, it identifies a case error and explains why nominative was expected. Course generation from any source text. CEFR-calibrated per grammatical concept.
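The taxonomy and the puellam/puella example can be pictured as a small annotation type. The data shape is an illustrative sketch, not Interlinear's implementation; classifying a case error as morphological is one reasonable reading of the taxonomy.

```python
from dataclasses import dataclass
from enum import Enum

class ErrorCategory(Enum):
    MECHANICAL = "mechanical"
    LEXICAL = "lexical"
    MORPHOLOGICAL = "morphological"
    SYNTACTIC = "syntactic"
    SEMANTIC = "semantic"

@dataclass
class ErrorAnnotation:
    surface: str           # what the learner wrote
    expected: str          # what the grammar required
    category: ErrorCategory
    explanation: str

err = ErrorAnnotation(
    surface="puellam",
    expected="puella",
    category=ErrorCategory.MORPHOLOGICAL,  # wrong case ending, same lemma
    explanation="Accusative where the subject position requires nominative.",
)
```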

The error taxonomy feeds Thompson Sampling: which correction strategies produce retention, which produce frustration. This is qlawbox as an adaptive learning backend. Medium to long term, Interlinear ejects to microservices and lives behind the Swae federated gateway.

Three applications, one instrumented stack, three different research questions.

Hypotheses under investigation

The questions above frame the research direction. Below, each one gets a mechanism in the stack and something concrete to measure.

H1

Can agents learn to optimize their own context?

The bandit selects which system prompt components enter context. Over time, components that correlate with positive outcomes get promoted; those that don't get downweighted.

If the mechanism works, token consumption should decrease while task completion quality holds steady or improves.

Measuring: repeated mistake rate (RMR), total tokens per task, arm convergence speed (how many observations before posteriors stabilize).
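One plausible operationalization of repeated mistake rate: the fraction of mistakes whose error signature has already been seen. The signature scheme is an assumption for illustration, not qlawbox's definition.

```python
def repeated_mistake_rate(mistakes):
    """mistakes: ordered list of error signatures (e.g. normalized strings).
    Returns the fraction that repeat a signature seen earlier in the run."""
    seen, repeats = set(), 0
    for signature in mistakes:
        if signature in seen:
            repeats += 1
        seen.add(signature)
    return repeats / len(mistakes) if mistakes else 0.0

rmr = repeated_mistake_rate(["wrong_flag", "bad_path", "wrong_flag", "wrong_flag"])
```

If context optimization works, this number should fall as the posteriors converge, while total tokens per task also decreases.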

H2

How faithfully do agents reason in practice?

The causal knowledge graph provides structured context that flat retrieval cannot. The question is whether access to typed relationships and domain rules changes agent behavior in measurable ways. This should include not just retrieval quality, but downstream reasoning and task completion.

Measuring: retrieval quality (P@5, R@5, nDCG@5) with graph-enhanced vs. flat retrieval, same tasks, same agent, same embeddings. Also tracking whether structurally relevant concepts surface that cosine similarity misses.
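The three metrics are standard; for reference, here they are at cutoff k with binary relevance judgments:

```python
import math

def precision_at_k(relevances, k):
    """Fraction of the top-k results that are relevant (1) vs. not (0)."""
    return sum(relevances[:k]) / k

def recall_at_k(relevances, k, total_relevant):
    """Fraction of all relevant documents recovered in the top k."""
    return sum(relevances[:k]) / total_relevant if total_relevant else 0.0

def ndcg_at_k(relevances, k):
    """DCG of the ranking divided by the DCG of the ideal ranking."""
    def dcg(rels):
        return sum(r / math.log2(i + 2) for i, r in enumerate(rels[:k]))
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal else 0.0

# Ranked list for one query: 1 = relevant, 0 = not.
# Assume 4 relevant documents exist in the corpus.
rels = [1, 0, 1, 1, 0]
p5 = precision_at_k(rels, 5)                 # 3/5
r5 = recall_at_k(rels, 5, total_relevant=4)  # 3/4
n5 = ndcg_at_k(rels, 5)
```

nDCG rewards putting relevant results early, which is exactly where graph-enhanced retrieval should differ from flat retrieval if the typed structure is doing real work.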

H3

What does post-deployment monitoring look like?

qortex-observe emits structured events for every selection, observation, and posterior update. Grafana dashboards surface posterior dynamics. Jaeger traces end-to-end latency.

The question is whether this instrumentation is sufficient to detect when the agent is learning the wrong things — and whether tunable interventions (reward signals, arm resets, exploration rate adjustment) can correct course.

Measuring: time-to-detection for reward misspecification, intervention response latency, and whether operators can meaningfully steer agent learning through the observability surface. The measurement approaches here are themselves a research question.

From posteriors to geometry

Beta posteriors sit on a statistical manifold with curvature and symmetry. In physics, symmetries imply conservation laws. The question is whether that structure exists here, too. If it does, the learning layer produces quantities that should be conserved, and conservation violations become exactly the signals that interoception needs to monitor.
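The manifold structure being gestured at is standard information geometry, stated here only to make the speculation concrete. The Fisher information metric on the Beta family, with \(\psi_1\) the trigamma function:

```latex
g(\alpha, \beta) =
\begin{pmatrix}
\psi_1(\alpha) - \psi_1(\alpha + \beta) & -\,\psi_1(\alpha + \beta) \\
-\,\psi_1(\alpha + \beta) & \psi_1(\beta) - \psi_1(\alpha + \beta)
\end{pmatrix}
```

Whether this metric's symmetries yield conserved quantities for the learning dynamics here is the open, speculative question; nothing above claims they do.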

Sketching it through would be elegant. But showing it true would be surprising: for now this is pure, fun speculation. With luck, the agenda laid out above represents a step toward being able to show it false — to build tools that can silence the speculation, one way or the other.

Why Knowledge Graphs ↗ Causal Reasoning ↗ Geometry of Learning ↗

Where things stand

01 · qortex — Functional — Beta-Bernoulli Thompson Sampling, causal knowledge graph, 7 framework adapters
02 · qortex-observe — Functional — JSONL sinks, OpenTelemetry/Jaeger, Prometheus/Grafana, MCP tracing middleware
03 · vindler + bilrost — Functional — Lima VM, OverlayFS, UFW NACLs, network-as-needed tool dispatch, on PyPI
04 · cadence — Functional — Typed signal bus, ambient dispatch, interoception interop planned
05 · interoception — Early — Architectural placeholder; internal state signals, affect vocabulary, speculative

Everything here is work in progress. Directions, not conclusions.