
Learning Is Not Memory

March 9, 2026 · Peleke (ed: Claude)

The Convergence

Everyone arrived at the same conclusion, independently, and almost simultaneously: agents need to learn.

They need to get measurably, reliably better over time at the things we ask them to do. Retrieving more context and chaining more tools are not enough. The system has to improve.

Sequoia Capital frames learning as the missing piece of AGI: first knowledge, then reasoning, and now iterative problem-solving.

Daniel Miessler calls it “the last algorithm”: an outer loop of ideal-state definition wrapping an inner loop of observe-think-plan-build-execute-verify-learn.

Ashpreet Bedi puts it most directly: “Memory is just… learning.”

Everyone agrees that agents must learn, and the question is settled.

The implementations are wrong.

The Category Error

Here is what the major agent frameworks mean when they say “learning.”

CrewAI runs a TaskEvaluator after every task. An LLM scores the output 0-10 and generates suggestions for improvement. These get stored in SQLite, keyed to the task description. On the next run, suggestions are retrieved and appended to the prompt.

The retrieval query:

SELECT * FROM long_term_memories
WHERE task_description = ?

That is an exact string match. If you described the task as “Research companies in the tech sector” last time and “Research tech companies” this time, nothing comes back. The entire memory vanishes because you rephrased the question. The system’s “learning” is a post-it note filed under a name you will never use twice.

The sort order is score ASC: the lowest-quality results come back first.
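
To make the failure concrete, here is a toy reproduction of that retrieval path. The table and column names come from the query above; everything else is illustrative:

import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE long_term_memories (task_description TEXT, score REAL, suggestions TEXT)")
db.execute(
    "INSERT INTO long_term_memories VALUES (?, ?, ?)",
    ("Research companies in the tech sector", 8.0, "Cite revenue figures from primary sources."),
)

# The same task, rephrased. An exact string match finds nothing.
rows = db.execute(
    "SELECT suggestions FROM long_term_memories WHERE task_description = ?",
    ("Research tech companies",),
).fetchall()
print(rows)  # [] -- the stored lesson is unreachable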

Agno is more sophisticated in architecture. The LearningMachine orchestrates multiple LearningStore implementations (user profiles, session context, entity memory, learned knowledge) through a clean protocol with recall(), process(), build_context(), and get_tools(). The interface design is genuinely good.

The implementation behind the interface is where it breaks down. Every store’s process() method works the same way: send the conversation to an LLM, ask it to call tools like add_memory(memory: str) and update_memory(id, memory). The model decides what is worth remembering. The model decides what to update. The model decides what to forget.

Every stored item has equal confidence regardless of how many times it has been validated or contradicted by outcomes.
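
Reduced to a sketch, the pattern looks like this. The class and method names here are illustrative, not Agno's actual code, and llm_decide_memories stands in for the real LLM tool-calling step:

from dataclasses import dataclass, field

def llm_decide_memories(conversation: str) -> list[str]:
    # Stand-in for the LLM call: the model decides what is worth keeping.
    return ["User prefers bullet points"]

@dataclass
class NoteStore:
    memories: dict[int, str] = field(default_factory=dict)
    next_id: int = 0

    def process(self, conversation: str) -> None:
        # No outcome, no confidence, no validation: text in, text stored.
        for note in llm_decide_memories(conversation):
            self.memories[self.next_id] = note
            self.next_id += 1

    def recall(self, query: str) -> list[str]:
        # Every note comes back with equal weight, however it got there.
        return list(self.memories.values())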

Bedi’s Phase 3 roadmap promises “self-improvement without retraining.” The mechanism: the agent analyzes its own failures and proposes behavioral changes. The model that made the mistake evaluates the mistake and suggests how to fix it. There is no external ground truth. There is no closed-loop feedback. There is no way to verify whether the “self-improvement” improved anything.

LangChain has a TrajectoryEvalChain that scores agent behavior 1-5. The scores are stored. They are never used. The evaluation system is a thermometer bolted to the wall next to a thermostat nobody connected. To their credit, LangChain does not call this learning. They call it evaluation, and they are right.

Three frameworks, three implementations, and the same pattern every time: an LLM talks to itself about what it did, writes notes, and retrieves them later. The notes carry no weight. The retrieval has no calibration. A lesson validated a hundred times is stored identically to one the system hallucinated once.

That is memory, and it stops there.

The Distinction

Learning updates behavior in response to outcomes such that future performance improves on a measurable objective.

That definition requires three things memory does not provide:

A feedback signal. Something measurable happened, and it was good or bad. Not “the LLM thinks the output was a 7/10”: a real outcome from the real world. The suggestion was accepted. The code passed review. The user came back. The test failed. Something external to the model closed the loop.

An update rule. Behavior changes in response to that signal. Not “append a suggestion to the prompt next time”: a mathematical update to a parameter that shifts the probability of future decisions. The kind of update where you can state precisely: “After observing this outcome, my belief changed from X to Y, by this amount, for this reason.”

Convergence. Over time, the system gets measurably better at something. Not “it has more notes.” Better, in a way you can plot and prove. A regret metric that grows sublinearly as evidence accumulates; a decision distribution that sharpens into confidence.
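
The smallest update rule that satisfies this definition is a Beta-Bernoulli posterior over a single yes/no outcome. A minimal illustration, not tied to any framework:

# Belief that "this suggestion gets accepted," starting from a uniform prior.
alpha, beta = 3.0, 2.0           # two acceptances, one rejection so far
print(alpha / (alpha + beta))    # belief: 0.60

accepted = True                  # the feedback signal: a real outcome
if accepted:                     # the update rule
    alpha += 1
else:
    beta += 1
print(alpha / (alpha + beta))    # belief: 0.67 -- changed by a stated amount, for a stated reason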

A student who takes perfect notes and never changes how they study has not learned. A system that stores observations and retrieves them by keyword has not learned. An agent that asks its own LLM to evaluate its own output and saves the summary for later has not learned.

Memory only accumulates: information flows in and stays put. Learning converges: outcomes flow back into the loop and the performance curve bends upward.

That is not a philosophical distinction. It is an engineering one, and getting it wrong has consequences.

The Danger

When agent frameworks claim “learning” but implement memory, two failures compound.

Users trust the wrong signal. “The system has memory” sounds like “the system improves.” Product teams deploy agents with “learning enabled” and expect performance to increase over time. It does not. The agent accumulates notes (some correct, some not, all equally weighted) and retrieves them erratically.

When performance stalls, nobody can diagnose why. The “learning” system produces no legible metrics. There is no regret curve. There is no convergence plot. There is nothing to point at and say “it is getting better here, and not there, and here is why.”

Self-improvement claims become unfalsifiable. If “learning” means “the LLM evaluates itself and stores the results,” there is no way to verify whether the self-evaluation is accurate. The system cannot distinguish confident correct beliefs from confident incorrect ones. It stores both with equal weight.

Over time, it accumulates a mixture of signal and noise that looks, from the outside, like knowledge. The signal-to-noise ratio is uncontrolled and uncontrollable, because there is no external feedback loop to calibrate against.

This matters because people are building production systems on these foundations. Agents that handle medical triage, financial analysis, legal research. If your agent’s “learning” is actually memory (accumulating beliefs without calibration, without uncertainty quantification, without convergence guarantees) then you do not have a system that gets better over time.

You have a system that gets more confident over time. Confidence without calibration is drift.

What Learning Actually Looks Like

The machinery for this has existed since 1933.

William R. Thompson published a method for allocating patients to treatments in clinical trials. The idea: maintain a probability distribution over each treatment’s effectiveness. Sample from those distributions. Act on the sample. Observe the outcome. Update the distributions. Repeat.

The method is called Thompson Sampling, and it is the simplest instance of what the agent frameworks should be doing instead of taking notes.

Here is what it has that memory systems do not:

Calibrated uncertainty. Each option is represented by a Beta distribution, a mathematical object that encodes both what the system believes and how sure it is. A rule applied once and accepted has a weak positive signal. A rule applied fifty times and accepted forty-five times has a strong one. The system knows the difference.

Picture two Beta distributions: Beta(2, 8), wide and uncertain after a handful of observations, and Beta(45, 5), narrow and peaked after roughly fifty. The distribution narrows as evidence accumulates.

Principled exploration. Because uncertainty is quantified, the system automatically explores options it is unsure about. It does not need an epsilon parameter or a hardcoded exploration schedule. The uncertainty itself drives exploration. As evidence accumulates, exploration naturally decreases. This is not a heuristic; it is a provably optimal strategy.

Convergence guarantees. Over time, the cumulative regret (the gap between what the system chose and what was actually best) grows sublinearly; for Bernoulli rewards it grows only logarithmically in the number of decisions. That is a theorem, not an aspiration. The system gets better, and you can see it on a chart.

Closed-loop feedback. The system’s decisions produce outcomes. Outcomes update beliefs. Updated beliefs change decisions. This is a feedback loop with real-world grounding, not an LLM evaluating its own output in a hall of mirrors.
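
Those four properties compose into a loop short enough to show in full. A self-contained sketch, with simulated outcomes standing in for real-world feedback:

import random

# The "arms" could be tools, prompt variants, retrieval strategies: anything
# with a binary outcome. True success rates are simulated here; in production
# the outcomes come from real feedback (accepted / rejected, passed / failed).
true_rates = {"tool_a": 0.3, "tool_b": 0.7, "tool_c": 0.5}
posteriors = {arm: [1.0, 1.0] for arm in true_rates}   # Beta(1, 1) priors

regret = 0.0
best = max(true_rates.values())
for step in range(2000):
    # Sample one belief per arm and act on the sample. Uncertain arms get
    # explored automatically because their samples still vary widely.
    samples = {arm: random.betavariate(a, b) for arm, (a, b) in posteriors.items()}
    chosen = max(samples, key=samples.get)

    # Observe an outcome and update the chosen arm's posterior.
    success = random.random() < true_rates[chosen]
    posteriors[chosen][0 if success else 1] += 1

    regret += best - true_rates[chosen]

print(posteriors)   # tool_b's posterior dominates
print(regret)       # grows sublinearly: per-step regret shrinks toward zero

Swap the simulated outcomes for real ones (a suggestion accepted, a test passed, a user who came back) and the same loop closes over the world instead of over itself.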

I have built this. It works. The implementation is open source, integrated into a production system, and the convergence curves match the theory.

The Series

Thompson Sampling is the proof of concept. It demonstrates that formal learning works in agent infrastructure, that the machinery exists, and that it produces measurable improvement where memory systems produce only accumulation.

This article is part of Agents Don’t Learn, a series on building agents that actually improve over time.

The series is structured as a thesis, three supporting arguments, a retrospective, and four receipts:

Thesis and arguments:

  • Agents Don’t Learn: the series overview. Your agent wakes up blank every session; every framework treats this as a feature gap rather than the structural failure it is.
  • Learning Is Not Memory (this article): the technical argument. Why accumulating notes is not learning, and what learning actually requires.
  • Context Engineering Is Not Copywriting: the POMDP framing. Treating the agent’s prompt as a policy optimized with real feedback. Thompson Sampling over tool selection.
  • The Walls Come First: the security argument. If your agent learns its own context, it must be contained. STRIDE on the identity layer, per-tool network isolation, OverlayFS, gitleaks sync gate.

Retrospective:

  • 16,000 Lines of Wrong: buildlog was the exploration. 16,000 lines of Python, most of it wrong. The gauntlet survived. The entries became articles. The failures became qortex.

Receipts:

  • The Feedback Loop: Thompson Sampling math. How Beta distributions, edge weight adjustment, and Personalized PageRank compose into a retrieval system that updates itself from usage.
  • 7 Frameworks: the adapter proof. qortex connected to seven agent frameworks through a single protocol.
  • Make It Look Easy: protocol architecture. The design that makes the learning loop portable.
  • Wind on the Wire: distributed migration. Splitting a monolith into federated services without stopping the learning loop.

Each installment stands alone. Together they demonstrate that the gap between memory and learning is measurable.

The Proof

The people building Agno, CrewAI, and LangChain are solving hard problems under real constraints. Bedi’s LearningStore protocol (recall(), process(), build_context(), get_tools()) is the right interface. LangChain’s decision to be plumbing and not overclaim is respectable. CrewAI’s evaluation pipeline has the right shape.

The interfaces are there. The implementations behind them are wrong.

A BanditStore that implements Agno’s LearningStore protocol, where recall() samples from posteriors instead of retrieving notes, where process() updates Beta distributions instead of asking the LLM what to remember, would give any framework in this ecosystem something none of them currently have: an agent that gets measurably better at a specific task, with mathematical guarantees, over time.
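
A sketch of that shape, to make the claim concrete. The method names follow the protocol described above; the signatures are illustrative, not a drop-in Agno adapter:

import random
from collections import defaultdict

class BanditStore:
    def __init__(self):
        # One Beta posterior per option (a tool, a prompt rule, a strategy).
        self.posteriors = defaultdict(lambda: [1.0, 1.0])

    def recall(self, options):
        # Sample from posteriors instead of retrieving notes.
        samples = {o: random.betavariate(*self.posteriors[o]) for o in options}
        return max(samples, key=samples.get)

    def process(self, option, success):
        # Update a Beta distribution instead of asking the LLM what to remember.
        self.posteriors[option][0 if success else 1] += 1

    def build_context(self):
        # Surface calibrated beliefs, not raw notes.
        return "\n".join(
            f"{o}: mean={a / (a + b):.2f}, n={int(a + b - 2)}"
            for o, (a, b) in self.posteriors.items()
        )

Point process() at a real outcome and recall() starts preferring the options that actually work, at a rate governed by the evidence rather than by whatever the model felt like writing down.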

I have built this. It is integrated into a production system, and the convergence curves match the theory. The next article shows the data.

The distinction between memory and learning determines whether your agent actually improves or merely accumulates. The machinery to do it right has existed for nearly a century.

This article is part of the Agents Don’t Learn series. Next: The Feedback Loop, where Thompson Sampling meets a knowledge graph and the convergence curves bend.