A Measurement Thesis

I've been obsessed with some questions lately:

  • Can agents learn to optimize their own context?
  • How faithfully do they reason in practice?
  • What does post-deployment monitoring through the agent lifecycle look like?
  • How can it be instrumented for rigorous experimentation?

The surface area of agentic computing is scaling fast, evidenced most clearly by OpenAI's adoption of OpenClaw-style agents. Yet little standardized instrumentation exists to measure and improve agentic behavior and decision-making.

To close the gap, I forked OpenClaw and instrumented it with a principled learning + memory backend that frames context engineering as a reinforcement learning problem in a stationary environment. The stationarity assumption is temporary and deliberate: it makes the math tractable now, and relaxing it is an explicit next step.

The result is a five-layer instrumented agent architecture, impudently called qlawbox. It's unfinished, but it runs; it's principled; and it's observable.

qlawbox

Answering questions about agent behavior requires measuring agents in realistic, messy environments. qlawbox is a five-layer architecture built to do just this.

Each layer functions independently. When composed, they form a closed feedback loop whose behavior can be observed, internal dynamics traced, and interventions measured.

  01 · Knowledge + Learning — qortex → projects rules
  02 · Observability — qortex-observe ← instruments the core
  03 · Runtime — vindler + bilrost → executes, dispatches
  04 · Nervous System — cadence → routes state, drift
  05 · Interoception — interoception → affect signals
  ↑ feedback — external signals, RMR, tokens, convergence

Each layer emits structured events. The feedback loop returns observations to the learning layer. Everything is traceable.

01

Knowledge + Learning — qortex

qortex frames context engineering as a learning problem. System prompt components, such as tools and other injected context, are modeled as arms of a contextual bandit. Per query, the agent Thompson-samples from Beta posteriors to decide what belongs in context; task feedback then updates those posteriors.
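The selection loop can be sketched in a few lines. This is a minimal illustration of Beta-Bernoulli Thompson Sampling over prompt components, not qortex's actual API; the names `Arm` and `select_components` are hypothetical.

```python
import random

class Arm:
    """One system prompt component, modeled as a Beta-Bernoulli bandit arm."""

    def __init__(self, name):
        self.name = name
        self.alpha = 1.0  # uniform Beta(1, 1) prior: successes + 1
        self.beta = 1.0   # failures + 1

    def sample(self):
        # Thompson Sampling: draw once from the current Beta posterior.
        return random.betavariate(self.alpha, self.beta)

    def update(self, reward):
        # Bernoulli feedback: 1 = component helped, 0 = it did not.
        self.alpha += reward
        self.beta += 1 - reward

def select_components(arms, k):
    # Include the k components whose posterior draws rank highest.
    return sorted(arms, key=lambda a: a.sample(), reverse=True)[:k]

arms = [Arm("tool_docs"), Arm("style_guide"), Arm("examples")]
chosen = select_components(arms, k=2)
for arm in chosen:
    arm.update(reward=1)  # feedback after the task completes
```

Components that keep earning reward concentrate their posteriors near 1 and get selected more often; the rest drift toward exclusion.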

Beneath the bandit sits an experimental causal knowledge graph with typed edges, generated online via qortex-ingest. Feedback credit propagates backward through ancestor concepts, decaying by hop distance and edge strength.
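Hop-decayed credit assignment can be sketched as a recursive walk over ancestor edges. The graph shape, decay constant, and function name below are illustrative assumptions, not qortex's actual schema.

```python
def propagate_credit(graph, concept, reward, decay=0.5, credit=None):
    """graph maps concept -> list of (ancestor, edge_strength) pairs."""
    if credit is None:
        credit = {}
    credit[concept] = credit.get(concept, 0.0) + reward
    for ancestor, strength in graph.get(concept, []):
        # Credit shrinks by edge strength and by one decay factor per hop.
        propagate_credit(graph, ancestor, reward * strength * decay, decay, credit)
    return credit

graph = {
    "beta_posterior": [("bayesian_inference", 0.9)],
    "bayesian_inference": [("probability", 0.8)],
}
credit = propagate_credit(graph, "beta_posterior", reward=1.0)
# Ancestors receive progressively less credit the further they sit
# from the concept that actually earned the reward.
```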

qortex-online handles real-time indexing into Memgraph. Preliminary results suggest positive impact on retrieval performance with minimal overhead.

7 framework adapters, each passing its framework's own test suite: CrewAI, Agno, AutoGen, LangChain, LangChain.js, Mastra, Vindler (fork: OpenClaw).

02

Observability — qortex-observe

Every selection, observation, and posterior update in the learning layer emits a structured event. qortex-observe routes those events to three export paths:

  • JSONL file sinks for offline replay
  • OpenTelemetry spans for distributed tracing via Jaeger
  • Prometheus counters/gauges/histograms for real-time Grafana dashboards

MCP tracing middleware wraps each tool call with distributed trace context, so a single query can be traced from inbound message through graph traversal, vector retrieval, posterior sampling, and credit propagation.
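The JSONL path is the simplest of the three sinks to picture. A minimal sketch follows; the event fields (`event_id`, `ts`, `kind`) are assumptions about what qortex-observe might record, not its actual schema.

```python
import io
import json
import time
import uuid

def emit_event(sink, kind, payload):
    """Append one structured event as a single JSON line."""
    event = {
        "event_id": str(uuid.uuid4()),
        "ts": time.time(),
        "kind": kind,  # e.g. "selection", "observation", "posterior_update"
        **payload,
    }
    sink.write(json.dumps(event) + "\n")
    return event

sink = io.StringIO()  # stands in for an append-only JSONL file
emit_event(sink, "selection", {"arm": "tool_docs", "sampled": 0.83})
emit_event(sink, "posterior_update", {"arm": "tool_docs", "alpha": 2.0, "beta": 1.0})

# Offline replay: each line parses back into a structured event.
events = [json.loads(line) for line in sink.getvalue().splitlines()]
```

The same events, fanned out to OpenTelemetry spans and Prometheus metrics, back the Jaeger and Grafana views described later.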

03

Runtime — vindler + bilrost

vindler is a hardened fork of OpenClaw, rewired as the qlawbox agent runtime.

The fork replaces OpenClaw's memory layer with qortex and instruments every tool call and retrieval path through qortex-observe.

bilrost (PyPI) locks the runtime inside a Lima VM with OverlayFS for filesystem isolation and UFW NACLs for network policy. Network access is network-as-needed: tools declare their requirements and bilrost opens only those ports for the duration of the call.
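The network-as-needed idea can be sketched as a context manager that opens declared rules only for the duration of a tool call. The manifest shape, rule strings, and function names here are hypothetical illustrations, not bilrost's API.

```python
from contextlib import contextmanager

# Hypothetical manifest: each tool declares the network access it needs.
TOOL_MANIFEST = {
    "web_fetch": [("out", "443/tcp")],
    "git_sync": [("out", "22/tcp")],
}

def ufw_rules(tool):
    # Render declared requirements as UFW-style allow rules.
    return [f"allow {direction} {port}" for direction, port in TOOL_MANIFEST[tool]]

@contextmanager
def network_window(tool, apply_rule, remove_rule):
    """Open the tool's declared rules, then close them no matter what."""
    rules = ufw_rules(tool)
    for rule in rules:
        apply_rule(rule)  # e.g. shell out to ufw inside the VM
    try:
        yield
    finally:
        for rule in rules:
            remove_rule(rule)  # the window closes even if the tool call fails

opened, closed = [], []
with network_window("web_fetch", opened.append, closed.append):
    pass  # the tool call would run here
```

A tool that declares nothing gets no window at all, which is the default posture.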

The codenames are Norse. In the Eddas, Bifrost (also spelled Bilrost) is the flaming rainbow bridge above the world-tree Yggdrasil, binding Miðgarð to Ásgarð. The only path in, it burns anything unwelcome that dares to cross. Vindler is a byname of Heimdallr, the guardian of heaven's gates.

04

Nervous System — cadence

Typed signal bus for ambient agent behavior.

Rather than waiting for prompts, cadence enables agents to subscribe to event streams and act when conditions are met. It classifies, prioritizes, and dispatches signals to the appropriate handler.

Internal state signals from the interoception layer route alongside external events through the same bus, so the agent's response to its own dynamics is handled by the same dispatch infrastructure as its response to the outside world.
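The single-bus idea can be sketched as a typed, priority-ordered dispatcher. Class and method names are illustrative, not cadence's actual API.

```python
import heapq
from collections import defaultdict

class SignalBus:
    """Typed signal bus: handlers subscribe by signal type; dispatch is
    priority-ordered (lower number = more urgent), FIFO within a priority."""

    def __init__(self):
        self.handlers = defaultdict(list)
        self.queue = []
        self.counter = 0  # tie-breaker so equal priorities stay FIFO

    def subscribe(self, signal_type, handler):
        self.handlers[signal_type].append(handler)

    def publish(self, signal_type, payload, priority=0):
        heapq.heappush(self.queue, (priority, self.counter, signal_type, payload))
        self.counter += 1

    def drain(self):
        while self.queue:
            _, _, signal_type, payload = heapq.heappop(self.queue)
            for handler in self.handlers[signal_type]:
                handler(payload)

bus = SignalBus()
seen = []
# External and internal signals flow through the same dispatch path.
bus.subscribe("file_changed", lambda p: seen.append(("external", p)))
bus.subscribe("posterior_drift", lambda p: seen.append(("internal", p)))
bus.publish("file_changed", "notes.md", priority=1)
bus.publish("posterior_drift", {"arm": "tool_docs"}, priority=0)
bus.drain()
```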

Intended to interop with interoception to trigger spontaneous "speech" in response to shifts in internal state and predictions thereof. Very early stage and highly speculative, but a fundamental part of the long-term research horizon.

05

Interoception — interoception

An analog of biological homeostasis.

The agent monitors its own conservation quantities — posterior stability, convergence rate, reward variance — and flags violations as affect signals: confusion, surprise, dissonance, novelty. Each signal maps to an observable property of the learning dynamics.
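One way to picture the mapping from learning dynamics to affect signals: threshold checks over observable quantities. The thresholds, inputs, and signal names below are speculative placeholders, in keeping with the layer itself.

```python
import statistics

def affect_signals(rewards, drift, novelty_score,
                   var_threshold=0.2, drift_threshold=0.1,
                   novelty_threshold=0.8):
    """Flag violations of expected learning dynamics as affect signals.
    All thresholds are illustrative assumptions."""
    signals = []
    if len(rewards) >= 2 and statistics.variance(rewards) > var_threshold:
        signals.append("surprise")    # reward variance spiked
    if drift > drift_threshold:
        signals.append("dissonance")  # posteriors moving after apparent convergence
    if novelty_score > novelty_threshold:
        signals.append("novelty")     # input far from known concepts
    return signals

flags = affect_signals(rewards=[1, 0, 1, 0], drift=0.05, novelty_score=0.9)
```

Each flag would then be published as an internal signal on cadence's bus, alongside external events.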

This is the most speculative layer. For now, it serves as an architectural placeholder for experimenting with internal signals — pointing cadence's ambient awareness inward, so agents can generate speech based on their perception of internal state and predictions about where that state is heading.

Like cadence, this is very early stage and highly speculative, but it represents a fundamental piece of the long-term research direction: agents that respond not only to the world, but to themselves.


qortex-observe

The qortex ecosystem treats post-deployment monitoring as a first-class concern.

Grafana surfaces posterior dynamics, allowing users to watch the agent learn, identify when it's learning the wrong things, and intervene with tunable reward signals. Jaeger traces latencies from message ingestion through retrieval, concept extraction, and learning subsystem calls. Memgraph Lab provides live inspection of the causal knowledge graph as it's constructed.

Together, these layers provide the kind of infrastructure that Anthropic's own research on measuring agent autonomy identifies as essential: post-deployment monitoring that reveals how agents actually behave in practice, not just what they're capable of in controlled evaluations. Pre-deployment testing cannot surface the dynamics that emerge from sustained agent-user interaction. The infrastructure to observe those dynamics must be built deliberately, and it now exists.

grafana: posterior dynamics
arm convergence + reward rates

Beta posteriors updating in real time. Selection rates, observation outcomes, credit propagation depth.

jaeger: full trace
ingest → cypher → sqlite-vec → learning

End-to-end trace from inbound message through graph queries, vector retrieval, and learning subsystem.

memgraph: knowledge graph
concepts + typed causal edges

Live causal knowledge graph. Concepts, typed edges, credit propagation paths.

Research ⇌ Product Interface

Each of the following runs on qlawbox and generates signal for different aspects of the research program.

LinWheel

autonomous publishing

Paste a transcript; get seven publish-ready LinkedIn posts across distinct rhetorical angles. The publishing agent uses Playwright and HMAC request signing to post articles through an API that doesn't officially support them.
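HMAC request signing itself is standard library territory. A sketch with Python's `hmac`, where the header name, shared secret, and payload shape are illustrative assumptions, not LinWheel's actual scheme:

```python
import hashlib
import hmac
import json

def sign_request(secret: bytes, body: dict) -> dict:
    """Sign a canonical JSON rendering of the body with HMAC-SHA256."""
    payload = json.dumps(body, sort_keys=True).encode()  # stable byte form
    signature = hmac.new(secret, payload, hashlib.sha256).hexdigest()
    return {"X-Signature": signature, "Content-Type": "application/json"}

headers = sign_request(b"shared-secret", {"post": "Seven angles, one transcript."})
```

Canonicalizing with `sort_keys=True` matters: the verifier must hash the exact same bytes, so key order cannot be left to chance.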

The test: can qlawbox drive a content pipeline from signal detection through scheduling and direct publishing, without human editing?

Swae OS

federated intelligence

Federated GraphQL microservice architecture. Three services behind a Hive gateway: user management, workout/habit tracking, and a hybrid RAG knowledge service. qortex plugs in behind the federation and provides structured retrieval across independent services.

The AI coaching layer is architecture-ready. When active, qortex provides the knowledge backend: structured retrieval across workout data, habit patterns, and journal entries through one federated graph.

Interlinear

adaptive learning

Language tutor with a 5-category error taxonomy (mechanical, lexical, morphological, syntactic, semantic). When you write puellam instead of puella, it identifies a case error and explains why nominative was expected. Course generation from any source text. CEFR-calibrated per grammatical concept.
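The taxonomy and the puellam/puella example can be pictured as a small annotation type. The data shape is an illustrative sketch, not Interlinear's implementation; classifying a case error as morphological is one reasonable reading of the taxonomy.

```python
from dataclasses import dataclass
from enum import Enum

class ErrorCategory(Enum):
    MECHANICAL = "mechanical"
    LEXICAL = "lexical"
    MORPHOLOGICAL = "morphological"
    SYNTACTIC = "syntactic"
    SEMANTIC = "semantic"

@dataclass
class ErrorAnnotation:
    surface: str           # what the learner wrote
    expected: str          # what the grammar required
    category: ErrorCategory
    explanation: str

err = ErrorAnnotation(
    surface="puellam",
    expected="puella",
    category=ErrorCategory.MORPHOLOGICAL,  # wrong case ending, same lemma
    explanation="Accusative where the subject position requires nominative.",
)
```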

The error taxonomy feeds Thompson Sampling: which correction strategies produce retention, which produce frustration. This is qlawbox as an adaptive learning backend. Medium to long term, Interlinear ejects to microservices and lives behind the Swae federated gateway.

Three applications, one instrumented stack, three different research questions.

Hypotheses under investigation

The questions above frame the research direction. Below, each one gets a mechanism in the stack and something concrete to measure.

H1

Can agents learn to optimize their own context?

The bandit selects which system prompt components enter context. Over time, components that correlate with positive outcomes get promoted; those that don't get downweighted.

If the mechanism works, token consumption should decrease while task completion quality holds steady or improves.

Measuring: repeated mistake rate (RMR), total tokens per task, arm convergence speed (how many observations before posteriors stabilize).
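One plausible operationalization of repeated mistake rate: the fraction of mistakes whose error signature has already been seen. The signature scheme is an assumption for illustration, not qlawbox's definition.

```python
def repeated_mistake_rate(mistakes):
    """mistakes: ordered list of error signatures (e.g. normalized strings).
    Returns the fraction that repeat a signature seen earlier in the run."""
    seen, repeats = set(), 0
    for signature in mistakes:
        if signature in seen:
            repeats += 1
        seen.add(signature)
    return repeats / len(mistakes) if mistakes else 0.0

rmr = repeated_mistake_rate(["wrong_flag", "bad_path", "wrong_flag", "wrong_flag"])
```

If context optimization works, this number should fall as the posteriors converge, while total tokens per task also decreases.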

H2

How faithfully do agents reason in practice?

The causal knowledge graph provides structured context that flat retrieval cannot. The question is whether access to typed relationships and domain rules changes agent behavior in measurable ways. This should include not just retrieval quality, but downstream reasoning and task completion.

Measuring: retrieval quality (P@5, R@5, nDCG@5) with graph-enhanced vs. flat retrieval, same tasks, same agent, same embeddings. Also tracking whether structurally relevant concepts surface that cosine similarity misses.
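The three metrics are standard; for reference, here they are at cutoff k with binary relevance judgments:

```python
import math

def precision_at_k(relevances, k):
    """Fraction of the top-k results that are relevant (1) vs. not (0)."""
    return sum(relevances[:k]) / k

def recall_at_k(relevances, k, total_relevant):
    """Fraction of all relevant documents recovered in the top k."""
    return sum(relevances[:k]) / total_relevant if total_relevant else 0.0

def ndcg_at_k(relevances, k):
    """DCG of the ranking divided by the DCG of the ideal ranking."""
    def dcg(rels):
        return sum(r / math.log2(i + 2) for i, r in enumerate(rels[:k]))
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal else 0.0

# Ranked list for one query: 1 = relevant, 0 = not.
# Assume 4 relevant documents exist in the corpus.
rels = [1, 0, 1, 1, 0]
p5 = precision_at_k(rels, 5)                 # 3/5
r5 = recall_at_k(rels, 5, total_relevant=4)  # 3/4
n5 = ndcg_at_k(rels, 5)
```

nDCG rewards putting relevant results early, which is exactly where graph-enhanced retrieval should differ from flat retrieval if the typed structure is doing real work.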

H3

What does post-deployment monitoring look like?

qortex-observe emits structured events for every selection, observation, and posterior update. Grafana dashboards surface posterior dynamics. Jaeger traces end-to-end latency.

The question is whether this instrumentation is sufficient to detect when the agent is learning the wrong things — and whether tunable interventions (reward signals, arm resets, exploration rate adjustment) can correct course.

Measuring: time-to-detection for reward misspecification, intervention response latency, and whether operators can meaningfully steer agent learning through the observability surface. The measurement approaches here are themselves a research question.

From posteriors to geometry

Beta posteriors sit on a statistical manifold with curvature and symmetry. In physics, symmetries imply conservation laws. The question is whether that structure exists here, too. If it does, the learning layer produces quantities that should be conserved, and conservation violations become exactly the signals that interoception needs to monitor.
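The manifold structure being gestured at is standard information geometry, stated here only to make the speculation concrete. The Fisher information metric on the Beta family, with \(\psi_1\) the trigamma function:

```latex
g(\alpha, \beta) =
\begin{pmatrix}
\psi_1(\alpha) - \psi_1(\alpha + \beta) & -\,\psi_1(\alpha + \beta) \\
-\,\psi_1(\alpha + \beta) & \psi_1(\beta) - \psi_1(\alpha + \beta)
\end{pmatrix}
```

Whether this metric's symmetries yield conserved quantities for the learning dynamics here is the open, speculative question; nothing above claims they do.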

Sketching it through would be elegant. But showing it true would be surprising: for now this is pure, fun speculation. With luck, the agenda laid out above represents a step toward being able to show it false — to build tools that can silence the speculation, one way or the other.

Why Knowledge Graphs ↗ Causal Reasoning ↗ Geometry of Learning ↗

Where things stand

01 · qortex — Functional — Beta-Bernoulli Thompson Sampling, causal knowledge graph, 7 framework adapters
02 · qortex-observe — Functional — JSONL sinks, OpenTelemetry/Jaeger, Prometheus/Grafana, MCP tracing middleware
03 · vindler + bilrost — Functional — Lima VM, OverlayFS, UFW NACLs, network-as-needed tool dispatch, on PyPI
04 · cadence — Functional — Typed signal bus, ambient dispatch, interoception interop planned
05 · interoception — Early — Architectural placeholder; internal state signals, affect vocabulary, speculative

Everything here is work in progress. Directions, not conclusions.