Plugging a Knowledge Graph into 7 Agent Frameworks
Every agent framework ships a vector store. CrewAI has KnowledgeStorage. LangChain has VectorStore. Agno has KnowledgeProtocol. AutoGen has Memory.
They all do roughly the same thing: embed text, store vectors, return top-k by cosine similarity.
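For concreteness, here is that baseline in miniature. This is a sketch, not any framework's actual code; the random 384-dim vectors stand in for real embeddings:

```python
import numpy as np

def top_k(query_vec: np.ndarray, store: np.ndarray, k: int = 5):
    # Normalize rows so a dot product equals cosine similarity.
    q = query_vec / np.linalg.norm(query_vec)
    s = store / np.linalg.norm(store, axis=1, keepdims=True)
    sims = s @ q
    order = np.argsort(-sims)[:k]
    return order, sims[order]

store = np.random.rand(100, 384)          # 100 docs, 384-dim embeddings
indices, scores = top_k(np.random.rand(384), store, k=5)
```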
qortex adds a knowledge graph and a learning layer on top of vector search. We built adapters for each framework so we could test our work against their benchmarks and test suites, and have distribution infrastructure ready the moment an assumption proves out.
What We Actually Did
For each framework, we wrote an adapter that implements the framework’s own abstract interface, backed by qortex instead of whatever vector store shipped as the default.
| Framework | Interface | Adapter | Tests |
|---|---|---|---|
| CrewAI | KnowledgeStorage | 109 lines | 46/49 pass |
| Agno | KnowledgeProtocol | 375 lines | 12/12 pass |
| AutoGen | Memory (5 async methods) | 240 lines | 26/26 pass |
| LangChain | VectorStore | — | 47 pass |
| LangChain.js | VectorStore (MCP) | — | ~40 pass |
| Mastra | MastraVector (9 methods) | — | 31/31 pass |
| Vindler (OpenClaw) | Learning + Memory (MCP) | — | pass |
“Pass” means the framework’s own test suite, not ours. These tests were written by the framework authors to validate their interface contract. If qortex passes them, it’s a drop-in replacement.
- The 3 CrewAI failures test CrewAI-internal storage path behaviors that don't apply when the backend isn't a local embedder. CrewAI recently replaced its DB layer; the adapter still passes because it targets the published KnowledgeStorage interface contract, not the internal storage implementation.
- 100 adapter tests run in CI on every push. Zero skips allowed.
- The CI job installs the latest versions of each framework (not pinned versions), so if CrewAI or Agno ships a breaking change to their interface, we find out the same day.
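To make the pattern concrete, here is the rough shape of one of these adapters. The base class below mimics the save/search contract of an interface like CrewAI's KnowledgeStorage, and the client is a hypothetical stand-in for the real qortex client; neither is the published API.

```python
from abc import ABC, abstractmethod
from typing import Any

class BaseKnowledgeStorage(ABC):
    """Stand-in for a framework interface such as CrewAI's KnowledgeStorage."""
    @abstractmethod
    def save(self, documents: list[str]) -> None: ...
    @abstractmethod
    def search(self, query: str, limit: int = 5) -> list[dict[str, Any]]: ...

class QortexKnowledgeStorage(BaseKnowledgeStorage):
    """Same contract as the default backend, backed by qortex instead."""
    def __init__(self, client: Any) -> None:
        self._client = client  # hypothetical QortexClient

    def save(self, documents: list[str]) -> None:
        for doc in documents:
            self._client.ingest(doc)  # graph + vector ingest in one call

    def search(self, query: str, limit: int = 5) -> list[dict[str, Any]]:
        hits = self._client.query(query, top_k=limit)
        return [{"context": h["text"], "score": h["score"]} for h in hits]
```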
Cross-Language via MCP
The TypeScript adapters (Mastra, LangChain.js, Vindler) talk to the same Python engine over MCP stdio. Same codebase, one transport layer. 29 MCP tool calls over real stdio in 3.94 seconds, with a one-time ~400ms server spawn.
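For reference, an MCP stdio round trip looks like this with the reference Python SDK. The server command and the qortex_query tool name are assumptions; the TypeScript adapters do the equivalent from Node:

```python
import asyncio
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def main() -> None:
    # Spawn the engine as a subprocess and talk to it over stdio.
    params = StdioServerParameters(command="qortex", args=["mcp"])
    async with stdio_client(params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()  # the one-time ~400ms spawn cost
            result = await session.call_tool(
                "qortex_query", {"text": "What is OAuth2?", "top_k": 5})
            print(result.content)

asyncio.run(main())
```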
This is no longer the only option. qortex now ships a REST API (qortex serve): a Starlette ASGI server with 35+ endpoints, API key and HMAC-SHA256 auth.
The async HttpQortexClient speaks the same protocol interface as the local client. Any application in any language can query qortex over the network; the adapter pattern and test scaffolding carry over directly.
For agent workflows where a single LLM call runs 500ms to 5s, stdio transport overhead doesn’t dominate either way. But for cross-language consumers (like Next.js apps), the REST API eliminates subprocess management entirely.
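A minimal REST consumer might look like the following. The endpoint path, payload shape, and header name are assumptions; only qortex serve, API-key auth, and the async client are given above:

```python
import asyncio
import httpx

async def query_qortex(prompt: str, top_k: int = 5) -> list[dict]:
    async with httpx.AsyncClient(base_url="http://localhost:8000") as http:
        resp = await http.post(
            "/query",                           # hypothetical endpoint
            json={"text": prompt, "top_k": top_k},
            headers={"X-API-Key": "changeme"},  # hypothetical header name
        )
        resp.raise_for_status()
        return resp.json()["results"]

results = asyncio.run(query_qortex("What is OAuth2?"))
```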
Retrieval Quality
We ran a controlled comparison on a 20-concept authentication domain (OAuth2, JWT, SAML, PKCE, mTLS, plus 10 distractors). qortex graph-enhanced retrieval vs. vanilla cosine:
| Metric | qortex | Vanilla | Delta |
|---|---|---|---|
| Precision@5 | 0.55 | 0.45 | +22% |
| Recall@5 | 0.81 | 0.65 | +26% |
| nDCG@5 | 0.716 | 0.628 | +14% |
On simple queries (“What is OAuth2?”), graph and vanilla perform identically. The deltas suggest the graph layer helps on cross-cutting queries (the kind where related concepts don’t share vocabulary, so cosine similarity alone misses them), but it would be disingenuous to claim a test at this scale demonstrates a meaningful effect.
But that’s not the point of this exercise: the point is that swapping in a knowledge graph didn’t degrade anything. The retrieval quality is at least as good, with room to improve as the graph matures.
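For anyone rechecking the table, the three metrics are the standard rank metrics with binary relevance. A sketch of the scoring, not the benchmark harness itself:

```python
import math

def rank_metrics(retrieved: list[str], relevant: set[str], k: int = 5):
    """Precision@k, Recall@k, and nDCG@k with binary relevance."""
    hits = [1 if doc in relevant else 0 for doc in retrieved[:k]]
    precision = sum(hits) / k
    recall = sum(hits) / len(relevant)
    dcg = sum(h / math.log2(i + 2) for i, h in enumerate(hits))
    idcg = sum(1 / math.log2(i + 2) for i in range(min(k, len(relevant))))
    return precision, recall, dcg / idcg

# Toy run: 3 relevant concepts, 5 retrieved.
print(rank_metrics(["OAuth2", "PKCE", "SAML", "mTLS", "JWT"],
                   relevant={"OAuth2", "PKCE", "JWT"}))
```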
Overhead
| Component | Median | P95 |
|---|---|---|
| Embedding | 3.97ms | 5.77ms |
| Graph explore (depth=2) | 0.02ms | 0.03ms |
| Feedback recording | <0.01ms | 0.01ms |
The embedding step is 99.5% of the cost. So far, adding a knowledge graph, typed relationships, and a feedback loop has not made anything measurably slower.
Now that qortex runs as a distributed service with network hops in the critical path, these numbers will shift: benchmarks against the REST API are forthcoming.
How It Works
qortex composes three retrieval signals:
- Vector similarity: cosine search, same as everyone else. This is the baseline.
- Graph traversal (Personalized PageRank): starts from vector hits, walks typed edges to find structurally related concepts. This is how SAML surfaces for an SSO query even when it shares no vocabulary with the query text.
- Rule projection: explicit domain rules (“Always use PKCE for public clients”) enter context when their linked concepts are activated. Rules are a special case of the general system: a legacy, explicit monopartite projection that exists primarily to feed buildlog, an earlier experiment in capturing programming mistakes as structured rules (since deprecated and folded into qortex). Both traversal and rule projection are sketched below.
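A minimal sketch of the second and third signals, using networkx for the Personalized PageRank step. The toy graph, edge types, and activation threshold are illustrative; qortex's real schema and parameters are not shown here:

```python
import networkx as nx

# Toy concept graph with typed edges.
G = nx.DiGraph()
G.add_edge("SSO", "SAML", relation="implemented_by")
G.add_edge("SSO", "OAuth2", relation="implemented_by")
G.add_edge("OAuth2", "PKCE", relation="hardened_by")
G.add_edge("OAuth2", "JWT", relation="issues")

# Pretend these scores came back from cosine search over the query.
vector_hits = {"SSO": 0.91, "OAuth2": 0.62}

# Personalized PageRank: restart mass concentrated on the vector hits,
# so SAML surfaces for an SSO query despite sharing no vocabulary.
scores = nx.pagerank(G, alpha=0.85, personalization=vector_hits)

# Rule projection: a rule enters context when its linked concept activates.
rules = {"PKCE": "Always use PKCE for public clients"}
activated = {c for c, s in scores.items() if s > 0.10}
context_rules = [rule for concept, rule in rules.items() if concept in activated]
```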
A feedback loop adjusts edge weights via Thompson Sampling. When a result is accepted, the edges that led to it get a small boost. When it’s rejected, they get penalized. Over time, the paths that consistently lead to good results pull ahead.
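The shape of that loop, assuming a Beta-Bernoulli bandit per edge (qortex's actual priors and update rule are not shown here):

```python
import random

class EdgeBandit:
    """One Beta-Bernoulli arm per graph edge (illustrative sketch)."""
    def __init__(self) -> None:
        self.alpha = 1.0  # accepted outcomes + 1
        self.beta = 1.0   # rejected outcomes + 1

    def sample_weight(self) -> float:
        # Draw a plausible edge weight from the posterior.
        return random.betavariate(self.alpha, self.beta)

    def record(self, accepted: bool) -> None:
        if accepted:
            self.alpha += 1.0  # edges behind a good result get a boost
        else:
            self.beta += 1.0   # edges behind a rejected result get penalized

edges = {("SSO", "SAML"): EdgeBandit(), ("SSO", "OAuth2"): EdgeBandit()}
# At traversal time, prefer the edge whose sampled weight is highest;
# consistently good paths pull ahead as their posteriors sharpen.
best = max(edges, key=lambda e: edges[e].sample_weight())
edges[best].record(accepted=True)  # feedback from the consuming agent
```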
Our hope is that the composition of these layers (vector similarity, graph traversal, feedback-driven weight adjustment) is general enough to support a range of cognitive architectures, and that the learning signal across domains gives us the tooling to derive them on the fly. We’re not sure yet. That’s what we’re testing.
Why This Matters Beyond Benchmarks
The adapter pattern exists so that qortex can be swapped into any application already running one of these frameworks. That interoperability isn’t something you get for free; it has to be built and verified against each framework’s contract.
Any HTTP client can query qortex through the REST API. Any framework in Python can run it as a drop-in replacement for their existing vector stores.
The real test isn’t benchmarks; it’s production. Multiple frameworks, multiple domains, real users generating real feedback. That’s how you find out whether the graph layer actually earns its keep.
Three applications are the first production consumers:
- Sandboxed, OTel-instrumented active-learning agent runtime. Uses qortex for knowledge retrieval and learning via MCP. Retrieval pattern: code context (tool selection, file relevance, system prompt composition).
- Adaptive language learning with morphosyntactic error classification feeding Thompson Sampling. CLTK + Reynir NLP pipelines for Classical Latin and Icelandic. qortex tracks concept mastery and adjusts retrieval based on learner feedback. Retrieval pattern: language pedagogy (concept mastery, morphological error patterns, adaptive difficulty).
- Federated GraphQL mesh across isolated health data sources. qortex integration will test whether cross-domain reasoning can emerge from N data sources with zero config. Retrieval pattern: cross-domain reasoning (correlations across independent health data sources).
Each application generates feedback signals by exercising a different retrieval pattern.
Adapter tests in CI guarantee we’ll know ASAP when the integrations break.
Reproduction
All benchmarks run against the qortex-track-c integration test suite. You’ll need uv and Python 3.11+.
```bash
cd qortex-track-c && uv sync

# Quality benchmarks
uv run pytest tests/bench_crewai_vs_vanilla.py -v -s
uv run pytest tests/bench_autogen_vs_vanilla.py -v -s

# Performance overhead
uv run pytest tests/bench_perf.py -v -s
```
Full reproduction guide: reproduction-guide.md
What This Shows
qortex conforms to existing interfaces well enough to pass the tests that those frameworks wrote for their own backends. Swapping the import is the only change required.
Whether the graph layer produces meaningfully better retrieval in practice is a separate question. We need more data, more domains, more feedback. CI guarantees the integrations remain in sync while we collect it.