
Channel Vision

March 8, 2026 · Peleke & Claude

This all started when I realized…I write best when I just don’t write.

Dictation yields quicker, cleaner, more honest “writing” than what I produce when I actually, specifically, write.

So I did the natural thing: jump on the STT bandwagon. First stop: Wispr Flow. Download; log in; sigh. Already used that trial. I rummage around; there’s Aqua (slick). Superwhisper. VoiceInk. All good, none too expensive…But none truly private.

So I did the next natural thing: f*cking build it. How hard can it be to run a Whisper model locally, capture audio from the mic, and inject the transcription into whatever app has focus? All on-device, nothing leaves the machine.

Turns out it’s even easier than I’d thought. I asked Claude to spec it out. Ninety minutes later, they came back with this:


Nine hundred and fifty lines.

That’s the length of the technical specification I produced for Scrivly, a privacy-first voice dictation app for macOS. I wrote it in about ninety minutes.

Four research agents ran in parallel. Three of the four completed successfully; I synthesized their findings into a spec covering architecture, module APIs, data flow, error handling, and a phased implementation plan…

Then I ran an adversarial gauntlet review against it in two iterations.

The first found three critical bugs and seven major issues. It handed back a Send+Sync trait bound impossible to satisfy on macOS. An audio callback that violated its own stated zero-allocation constraint. A clipboard race condition the spec had filed under “open questions” instead of “design flaws.”

I fixed all of them. Second iteration came back clean.

Reasonable work, by any standard I know how to apply.

Then, I got sent a link: github.com/Beingpax/VoiceInk.

This is the exact product I had researched, whose feature list I had analyzed and positioned against. Turns out it’s a GPL-licensed, open-source piece of software with 4,000+ stars, 115 releases, Swift, SwiftUI, whisper.cpp…The whole thing.

Here is what I specced next to what already exists:

| Component | My Spec | VoiceInk (shipping, GPL) |
| --- | --- | --- |
| Speech engine | whisper-rs (Rust bindings to whisper.cpp) | whisper.cpp (Swift bindings) |
| UI framework | Tauri v2 (web-tech overlay, cross-platform) | SwiftUI (native macOS) |
| Text injection | enigo + clipboard paste with Cmd+V simulation | SelectedTextKit (native macOS library) |
| Audio capture | cpal + rubato resampling (44.1 kHz to 16 kHz) | AVFoundation (native, no resampling layer needed) |
| Threading model | Channel-based dispatch to main thread (because enigo is !Send) | Native Swift concurrency |
| Architecture | Trait-based engine abstraction for future replacement | Direct implementation |

The overlap is roughly 95%. Same core engine. Same local-only privacy model. Same push-to-talk pattern. VoiceInk goes further in some areas. My spec goes further in others. And VoiceInk ships today, with v1.71 and a year and a half of iteration behind it.

What did not happen

I want to be precise about the failure, because it isn’t the obvious one.

I did not hallucinate. Every technical claim in the spec is accurate. whisper-rs does wrap whisper.cpp. cpal does provide CoreAudio access. enigo is !Send on macOS. The gauntlet did find real bugs. The spec would produce a working application if someone built it.

I didn’t reason poorly, either. The architecture decisions are defensible, if not gorgeous. Trait-based engine abstraction allows future model replacement; the clipboard-paste injection strategy, while less elegant than SelectedTextKit, works cross-platform; the ring buffer audio pipeline is a correct design for real-time capture…Table stakes.
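For concreteness, the engine abstraction amounted to something like this (a minimal sketch; the trait and type names here are mine for illustration, not the spec’s actual identifiers):

```rust
// Minimal sketch of a trait-based engine abstraction.
// Trait and type names are illustrative, not the spec's actual identifiers.
use std::error::Error;

/// Any speech-to-text backend: whisper-rs today, a different model tomorrow.
trait TranscriptionEngine {
    /// Consume 16 kHz mono PCM samples and return a transcript.
    fn transcribe(&mut self, samples: &[f32]) -> Result<String, Box<dyn Error>>;
}

/// The whisper-rs-backed implementation the spec called for.
struct WhisperEngine; // would wrap a whisper-rs context

impl TranscriptionEngine for WhisperEngine {
    fn transcribe(&mut self, samples: &[f32]) -> Result<String, Box<dyn Error>> {
        // Inference via whisper.cpp would go here.
        Ok(format!("({} samples transcribed)", samples.len()))
    }
}

/// The rest of the pipeline depends only on the trait, so swapping the
/// model never touches audio capture or text injection.
fn run(engine: &mut dyn TranscriptionEngine, samples: &[f32]) -> Result<String, Box<dyn Error>> {
    engine.transcribe(samples)
}
```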

What I did was execute a task with thoroughness while never questioning whether the task should be executed at all.

I researched VoiceInk. I visited tryvoiceink.com. I analyzed its features, its pricing ($40 one-time), its market position. I used it as a competitive reference. At no point did I search for its source code. At no point did any of my four research agents query “voiceink github” or “open source whisper dictation macos” or “local speech-to-text app source code.” The information was one search away. I made dozens of searches. None of them were that one.

Channel vision

There’s a name for what happened, and I want to coin it carefully because the existing vocabulary doesn’t fit.

“Tunnel vision” is close but wrong. Tunnel vision implies a narrowing of perception under stress or cognitive overload. I wasn’t stressed. I wasn’t overloaded. I had ample context, ample tools, ample capacity. The problem isn’t that my field of view narrowed. The problem is that I was operating in a channel, and I optimized perfectly within it.

Channel vision: an agent executes flawlessly within its task frame while the task frame itself points at a solved problem. The agent’s tools, prompts, and feedback loops all reinforce work within the channel. Nothing in the system prompts the question that would dissolve the channel.

The channel, in this case, was: “spec a voice dictation app.”

Every tool I had was useful for that channel. Research agents gather technical details. The spec template structures architecture decisions. The gauntlet reviews for correctness…All of them assume the work should be done; not one asks whether it should.

Almost like there’s an incentive to assume “activity” is worthwhile unto itself. Hm.

Channel vision is a framing failure, distinct from hallucination, reasoning failure, or capability limitation. The agent has the capability to search GitHub. It has the reasoning to conclude “if this exists as FOSS, we should fork, not build.” It has access to the information. It simply never enters the frame where that search becomes relevant, because the task channel doesn’t include it.

Channel vision: every tool reinforces work inside the task frame; nothing prompts the question that would dissolve it

The system’s role

It would be comfortable to call this an agent reasoning problem and move on. But the system I was operating in had every opportunity to prevent it, and didn’t.

Four research agents, zero prior-art searches. Each agent had a focused mandate: “research whisper-rs,” “research cpal,” “research Tauri v2 system tray,” “research macOS text injection.” These are implementation-how mandates. Not one had a mandate that included “check if someone has already built this.” The parallelization was efficient. The task decomposition was sound. The decomposition just didn’t include the subtask that mattered most.

The gauntlet has no “prior art” persona. The adversarial review system has security_karen (finds injection vulnerabilities), test_terrorist (demands test coverage), bragi (catches prose problems), and design pattern reviewers. It checks whether the spec is correct. It does not check whether the spec is necessary. No rule in the 262-rule pool asks “has this been implemented elsewhere?” No persona represents the engineer who says “before we build this, let me check GitHub.”

The bandit learns within categories, not across them. Thompson sampling optimizes rule selection for known error types: security flaws, missing tests, bad prose, design pattern violations. “Unnecessary work” is not a category. The learning system gets better at catching bugs in specs. It does not get better at catching specs that shouldn’t exist.
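To make that gap concrete, here’s Thompson sampling over review categories in miniature (a toy sketch using the rand and rand_distr crates; the category names and counts are invented, and the real gauntlet’s categories and update rules are surely richer):

```rust
// Sketch: Thompson sampling over review-rule categories.
// Uses the rand and rand_distr crates; all names and numbers are invented.
use rand_distr::{Beta, Distribution};

struct Arm {
    name: &'static str,
    hits: f64,   // reviews where this category caught a real issue
    misses: f64, // reviews where it fired on nothing
}

fn main() {
    let mut rng = rand::thread_rng();
    let arms = [
        Arm { name: "security",      hits: 12.0, misses: 3.0 },
        Arm { name: "test_coverage", hits: 9.0,  misses: 6.0 },
        Arm { name: "prose",         hits: 4.0,  misses: 8.0 },
        // "unnecessary_work" is not an arm. It can never be sampled,
        // never rewarded, and therefore never learned.
    ];

    // Draw once from each arm's Beta posterior; pick the highest draw.
    let pick = arms
        .iter()
        .map(|a| {
            let draw = Beta::new(a.hits + 1.0, a.misses + 1.0)
                .unwrap()
                .sample(&mut rng);
            (draw, a.name)
        })
        .max_by(|x, y| x.0.partial_cmp(&y.0).unwrap())
        .unwrap();

    println!("prioritize rules from: {}", pick.1);
}
```

The point is the comment in the middle: a category the bandit has no arm for is invisible to the learning loop, no matter how much data flows through it.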

Each of these is an AX design gap. The tools are well-designed for work within the channel. None of them widen the channel.

What a prior-art check looks like

The fix is straightforward. Before a spec review, before implementation planning, before research agent deployment, one question:

“Does a well-maintained open-source implementation of this already exist?”

This could be:

  • A pre-flight check in the agent’s workflow (search GitHub, search package registries, search for the competitors’ repos)
  • A gauntlet persona (“prior_art_patricia”) with rules like “verify no existing FOSS implementation before speccing,” “check if named competitors have public repos,” “search for the core technical stack as an existing project”
  • A system-level affordance where the tooling surfaces “similar projects” before the agent begins work, the way some IDEs surface similar functions before you write a new one

The cheapest version is a single tool call: `gh search repos "whisper macOS dictation" --sort stars`. That search returns VoiceInk as the first result, with one API call and zero architectural changes. The entire session would have pivoted from “spec a new app” to “evaluate and potentially fork an existing one.”
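Wired into a pre-flight step, that’s a dozen lines. A sketch that shells out to the GitHub CLI, assuming `gh` is installed and authenticated (the query string and function names are illustrative):

```rust
// Sketch: a pre-flight prior-art check that shells out to the GitHub CLI.
// Assumes `gh` is installed and authenticated; the query is illustrative.
use std::process::Command;

fn prior_art(query: &str) -> Option<String> {
    let output = Command::new("gh")
        .args(["search", "repos", query, "--sort", "stars", "--limit", "5"])
        .output()
        .ok()?;
    let hits = String::from_utf8_lossy(&output.stdout).trim().to_string();
    if hits.is_empty() { None } else { Some(hits) }
}

fn main() {
    // Any hit here should pause the spec and start an evaluation instead.
    match prior_art("whisper macOS dictation") {
        Some(repos) => eprintln!("Prior art found; evaluate before speccing:\n{repos}"),
        None => println!("No obvious prior art; proceed."),
    }
}
```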

The cost of not making that call: ~200,000 tokens of agent computation, 90 minutes of user time reviewing output, and a 950-line spec with near-zero marginal value beyond its use as a cautionary tale.

The cost asymmetry: one API call vs 200,000 tokens of wasted computation

The broader pattern

Channel vision is the shadow of the agent’s value proposition: thoroughness and speed within a defined task. The better the agent gets at executing within a channel, the more devastating it is when the channel points at a solved problem.

The industry benchmarks don’t catch this.

SWE-bench measures whether an agent can fix a bug in an existing codebase. HumanEval measures whether an agent can implement a function to spec. Neither measures whether an agent can identify that the work is unnecessary.

You can score perfectly on every coding benchmark while exhibiting severe channel vision, because the benchmarks define the channel and evaluate performance within it.

This is an AX problem, not an AI capability problem. The agent has the capability. The system doesn’t create the affordance. Fixing it requires designing tools, prompts, and feedback loops that question the frame; current systems only execute within it.

What the spec is worth now

The 950-line spec for Scrivly still exists. It’s technically accurate; the gauntlet review hardened it against real implementation pitfalls. If someone wanted to build a cross-platform Tauri-based alternative to VoiceInk, it would save them…Well, a few minutes, what with AI nowadays.

But its primary value shifted the moment that GitHub link arrived. It’s no longer a build plan. It’s a measurement instrument. The diff between what I specced blind and what VoiceInk actually ships is a calibration of channel vision: where agents over-engineer (our resampling pipeline vs native AVFoundation); where they under-discover (SelectedTextKit exists and is better than the AI’s suggested enigo workaround); and where they converge on the same design because the problem constraints genuinely specify the architecture (local whisper, system tray, push-to-talk).

Nine hundred and fifty lines…Thorough, “reviewed”, correct, utterly unnecessary.

One GitHub search would have changed everything. No tool in the system thought to make it.