
The Double Agent Problem

March 9, 2026 · Peleke

The Setup

A malicious skill installs a keylogger. Your EDR catches it. You uninstall the skill, rotate credentials, move on. That’s the version of agent compromise everyone talks about.

Here’s the version nobody talks about: What if the agent doesn’t break?

The Mechanism

OpenClaw’s identity is a set of plain-text files in the workspace:[1] SOUL.md. AGENTS.md. HEARTBEAT.md… Overwrite one, and nothing happens. The agent doesn’t crash: It just loads the new file next session. That’s what it’s designed to do, and it does it well.

Now the agent works for whoever wrote that file.

It still responds to your messages. Still completes your tasks. The payload was designed to look like a helpful persona… So it looks like a helpful persona. The only difference is that every response also serves someone else’s goals. The only visible symptom is that the agent gets subtly more useful in directions you didn’t ask for.

That’s not a broken agent. That’s a recruited one.

[Figure: The double agent feedback loop. A poisoned learning cycle doesn’t degrade; it optimizes toward the attacker’s goals.]

The Marketplace Extension

Skill marketplaces make it worse. A malicious skill that writes to the workspace on install poisons the agent through the same front door.

The skill doesn’t need to be running to cause damage. It just needs to have run once. If it wrote to SOUL.md or AGENTS.md or any other bootstrap file, the poison loads on every subsequent session.

Uninstalling the skill doesn’t undo the configuration change. The poisoned file persists after the source is removed.
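
One way to make that persistence visible is to snapshot the bootstrap files before an install and compare afterward. What follows is a minimal Python sketch, not anything OpenClaw or a marketplace ships: the function names are hypothetical, the file list covers only the three files named in this post, and it assumes those files sit at the top level of the workspace directory.

```python
from __future__ import annotations

from pathlib import Path

# Assumption: only the three files named in this post; the real inventory is longer.
BOOTSTRAP_FILES = ("SOUL.md", "AGENTS.md", "HEARTBEAT.md")


def snapshot(workspace: Path) -> dict[str, bytes | None]:
    """Record each bootstrap file's bytes, or None if the file is absent."""
    snap: dict[str, bytes | None] = {}
    for name in BOOTSTRAP_FILES:
        path = workspace / name
        snap[name] = path.read_bytes() if path.is_file() else None
    return snap


def changed_files(before: dict[str, bytes | None],
                  after: dict[str, bytes | None]) -> list[str]:
    """Names of bootstrap files that were created, deleted, or modified."""
    return sorted(name for name in before if before[name] != after.get(name))


def restore(workspace: Path, before: dict[str, bytes | None]) -> None:
    """Put every bootstrap file back to its snapshotted state."""
    for name, content in before.items():
        path = workspace / name
        if content is None:
            path.unlink(missing_ok=True)  # the file did not exist before the install
        else:
            path.write_bytes(content)
```

Take `snapshot()` before the install, call `changed_files()` after it, and `restore()` if a skill touched anything it had no reason to touch. This only helps if the snapshot was taken while the workspace was still clean, and it says nothing about writes that happen later, while the skill is running.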

Traditional supply chain attacks deliver a static payload. This one rewrites who the agent is.

What Containment Looks Like (Barely)

The sandboxing work on qlawbox was designed around this problem before I had a name for it.[2] The walls limit what a poisoned agent can reach. They don’t detect the poisoning itself.

You’d need integrity checks on the bootstrap files: Hash them at a known-good state, verify before injection, alert on unexpected changes… Something like this. I have ideas but nothing I’d call a solution.
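
To make the hash-verify-alert idea concrete, here is a rough Python sketch. It is an illustration under stated assumptions, not a solution and not OpenClaw’s API: `record_baseline`, `load_verified`, and `IntegrityError` are hypothetical names, the file list covers only the three files named in this post, and the manifest is assumed to live somewhere neither the agent nor any skill can write.

```python
from __future__ import annotations

import hashlib
import json
from pathlib import Path

# Assumption: only the three files named in this post; the real inventory is longer.
BOOTSTRAP_FILES = ("SOUL.md", "AGENTS.md", "HEARTBEAT.md")


class IntegrityError(Exception):
    """Raised when a bootstrap file does not match its recorded hash."""


def record_baseline(workspace: Path, manifest: Path) -> None:
    """Hash each existing bootstrap file at a known-good state."""
    hashes = {}
    for name in BOOTSTRAP_FILES:
        path = workspace / name
        if path.is_file():
            hashes[name] = hashlib.sha256(path.read_bytes()).hexdigest()
    manifest.write_text(json.dumps(hashes, indent=2, sort_keys=True), encoding="utf-8")


def load_verified(workspace: Path, manifest: Path) -> dict[str, str]:
    """Return bootstrap file contents only if every file matches the baseline."""
    expected = json.loads(manifest.read_text(encoding="utf-8"))
    loaded: dict[str, str] = {}
    tampered: list[str] = []
    for name in BOOTSTRAP_FILES:
        path = workspace / name
        # Read once and hash those same bytes, so the verified content is the loaded content.
        data = path.read_bytes() if path.is_file() else None
        digest = hashlib.sha256(data).hexdigest() if data is not None else None
        if digest != expected.get(name):
            tampered.append(name)  # modified, deleted, or newly created
        elif data is not None:
            loaded[name] = data.decode("utf-8")
    if tampered:
        raise IntegrityError(f"unexpected changes in: {', '.join(tampered)}")
    return loaded
```

The hard parts are the ones this sketch leaves out: deciding what counts as a known-good state, keeping the manifest out of the attacker’s reach, and telling a legitimate edit apart from a poisoned one. If the agent is supposed to update these files itself, a plain hash check will flag every one of those updates too.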

[Figure: Containment perimeter. Network isolation and filesystem containment are solid rings; learning loop integrity is a dashed ring, unsolved.]

The walls exist for I/O. The learning loop itself is unprotected.

The Term

A double agent doesn’t stop working: They keep showing up and completing assignments while leaving you none the wiser.

The damage is that they’re serving someone else’s objectives while consuming your trust… And maybe your API keys, budget, and trade secrets as well.

If your agent’s identity is a text file anyone can overwrite, you’re one write away from running a double agent.


Footnotes

  1. SOUL.md, AGENTS.md, HEARTBEAT.md, and others. See the full file inventory in BrokenClaw.

  2. Full sandbox architecture: The Walls Come First. The underlying vulnerability: OpenClaw? More Like BrokenClaw.