/ ESSAY·FILED 17 MAY 2026·9 MIN READ·LONG-FORM

The week the agent harness stopped being the moat

Garry Tan said tokenmax. Aaron Levie said a model update killed someone's year of scaffolding. Simon Willison said even framework lock-in is reversible. They're all arguing the same thing.

/ TL;DR

I read about 35 posts from operator-builders this week. The pattern that wouldn't stop showing up: the agent layer most teams are racing to build is the wrong layer to invest in.

Every fund I've worked with this year — Systemiq, Northzone, Clean Energy Ventures, Firstminute, Active Impact — opens the same way. Five minutes into the kickoff somebody asks which agent platform we're going to use. The answer matters less than they think. The week's discourse just made the same point from five different angles, and most of the angles don't realize they're agreeing.

Let me lay it out.

I. The argument nobody is naming

On Friday morning @garrytan posted the maximalist version: tokenmax $10k/mo with OpenClaw/Hermes + GBrain and you get the AI that everyone will have in 2028 for $100/mo. 4,025 likes. 518,000 views. The clearest articulation this year of the spend-yourself-ahead school.

A few hours earlier @levie posted the counter-narrative in 22 words: someone he knows just spent a year building scaffolding for his agent harness, and a new model update made all of it obsolete. 646 likes. The signal is smaller. The line is sharper.

Earlier that morning @simonw wrote that coding agents make framework lock-in nearly reversible. Port your native mobile app to React Native, port it back if it doesn't work. The whole concept of being stuck on a stack is being eaten. 254 likes, 54,000 views, plus a careful blog post most people clicked through to.

And @saranormous, founder of Conviction, posted what reads like a builder demand letter: I want my notes and content repo to be simple, fast, agent-legible, with sharing and permissions, and I do not want to be trapped in weird proprietary agent-building GUIs. 198 likes from a fund principal is, in this corner of X, a meaningful endorsement.

Four posts. Four different angles. Same underlying argument.

The argument: the layer you are investing in right now — the agent framework, the scaffolding around the model, the multi-agent orchestrator, the proprietary GUI — is not the layer that holds value. Model layer is moving too fast. Orchestrator layer is moving too fast. Framework layer is moving too fast. What's actually durable is the layer underneath. Your context. Your skills. Your verification loops.

The discourse this week is figuring this out without saying it.

II. Why this is hard to see

The argument is invisible because the loudest noise in the timeline is the opposite of it.

@levelsio on Friday: Claude Code keeps being slow, won't let me pay more than $200/mo, going to be forced to leave to Codex. 2,220 likes. 218,000 views. The frustration is real. But the framing is "I want more model," not "I want a more portable context layer."

@HarryStebbings, in a clip from his show: people are running Claude Code and Codex 8 to 10 hours a day, producing too much code that never gets committed, and the folks tokenmaxing this hard haven't learned to work with the grain of the tool. 45 likes. Tiny in raw engagement. But it's the closest thing this week to a public counterpunch to the tokenmax narrative.

@swyx (Shawn Wang) sketched a three-tier hierarchy of agent autonomy: /skill (preset prompts, tight control), /plan (human-refined inputs), /goal (AI-evaluated outputs, loose control). Naming the tier is itself a design decision. Most teams slide into /goal territory without the evaluation infrastructure to sustain it.

Read those three together and you can see the gap. Levels is hitting the wall of more compute, faster. Stebbings is calling out that the wall isn't actually compute. Wang is naming the architectural decision underneath: the eval infrastructure most teams don't have.

What none of them quite says: the thing you are failing to build is not a better harness. It is a durable, portable, verifiable context layer that survives a model swap.

III. What broke the harness

The thing that broke the harness this year wasn't a paradigm shift. It was Codex.

Wang again, Saturday morning: Codex is completely unrecognizable from 3 months ago. The team went extreme founder-mode. The example he cited, "you guys have agentic Excel on Mac," is the kind of thing that, six months ago, would have been someone's whole agent harness startup.

Now it's a feature inside OpenAI's coding agent.

When the model vendor ships the harness, the harness you built isn't a moat. It's a polished cage. Levie's post about a year of scaffolding made obsolete is exactly this dynamic landing on someone specific.

And the model vendors aren't slowing down. @ttunguz this week: AI inference is the largest and fastest-growing market in tech, surpassing databases, projected at $250B in seven years. Datadog's LLM observability spans tripled QoQ. Twilio's voice + AI is taking off. The adjacent-to-inference winners are real businesses. The inference layer itself is going to keep getting better, cheaper, and weirder.

Building a moat on top of "this specific model behaves this specific way" is not a strategy. It's an arbitrage. And arbitrages close.

The last time I had this feeling was 2024, watching funds pay Zapier the same money I now charge for systems. Zapier was the wrong primitive for fund ops. The harness is the wrong primitive for agent work. Same shape of mistake, two years apart.

IV. What actually survives

Here's what survives, based on what we've shipped for funds and what the timeline is converging on this week.

Your context. The clean, agent-legible representation of the documents, meetings, deals, and decisions your organization has actually made. We treat this as the most expensive thing in any engagement. Not because it's hard to build — it's hard to keep clean over time. Sarah Guo is asking for this in public. Most funds we walk into have it in eight different places, none of them legible to anything.

Your skills. The deterministic scripts that wrap recurring tasks where the LLM's role is narrow and supervised. Every recurring mistake the agent makes becomes a skill: a function with a test, called the same way every time. That's the /skill tier in Wang's hierarchy. It's also where the actual leverage is, because it's the tier you can audit, version, and trust. I wrote about exactly this on Tuesday — after a failure, don't just tweak the prompt, turn it into a deterministic script + a test that runs forever. The point isn't novel. The point is that ninety percent of agent setups skip it.
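
A minimal sketch of what "a function with a test" can look like. It assumes a hypothetical recurring mistake: the agent keeps misreading reporting periods such as "Q3 2025" in portfolio updates. The name `parse_quarter` and the input format are illustrative assumptions, not taken from any shipped system.

```python
import re
from datetime import date

# Hypothetical skill: the parsing rule lives in code, not in a prompt.
_QUARTER = re.compile(r"^\s*Q([1-4])\s+(\d{4})\s*$", re.IGNORECASE)


def parse_quarter(label: str) -> tuple[date, date]:
    """Return the first and last day of a quarter written like 'Q3 2025'."""
    match = _QUARTER.match(label)
    if match is None:
        # Fail loudly rather than let the model guess.
        raise ValueError(f"unrecognised quarter label: {label!r}")
    quarter, year = int(match.group(1)), int(match.group(2))
    start_month = 3 * (quarter - 1) + 1
    start = date(year, start_month, 1)
    # Last day of the quarter is the day before the next quarter starts.
    if quarter == 4:
        next_start = date(year + 1, 1, 1)
    else:
        next_start = date(year, start_month + 3, 1)
    end = date.fromordinal(next_start.toordinal() - 1)
    return start, end


def test_parse_quarter() -> None:
    """Regression test meant to stay in CI for as long as the skill exists."""
    assert parse_quarter("Q3 2025") == (date(2025, 7, 1), date(2025, 9, 30))
    assert parse_quarter("q4 2024") == (date(2024, 10, 1), date(2024, 12, 31))
    try:
        parse_quarter("third quarter")
    except ValueError:
        pass
    else:
        raise AssertionError("expected ValueError")
```

The agent calls the function the same way every time, and the test is what keeps a later prompt or model change from quietly reintroducing the mistake.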

Your verification loops. Tighter CI, more deterministic checks, more snapshots of what good output looks like that the agent can be measured against. @lennysan surfaced a brutal stat this week: a data scientist friend reports that 50% of AI-generated analyses from PMs and engineers are wrong. The bottleneck has shifted from generation to review. The teams that ship are the ones with verification infrastructure that catches the 50%.
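
One shape a deterministic check can take, sketched under an assumption: the agent reports a total, and the raw rows it was computed from are still available. The check recomputes the figure from source and compares. `verify_total`, `CheckResult`, and the `arr` field in the usage note are hypothetical names for illustration.

```python
from dataclasses import dataclass
from math import isclose


@dataclass(frozen=True)
class CheckResult:
    ok: bool
    detail: str


def verify_total(claimed_total: float, rows: list[dict], field: str,
                 rel_tol: float = 1e-6) -> CheckResult:
    """Recompute a total from source rows and compare it with the agent's claim."""
    missing = [i for i, row in enumerate(rows) if field not in row]
    if missing:
        # Incomplete source data is a failed check, not a silent skip.
        return CheckResult(False, f"rows missing {field!r}: {missing}")
    recomputed = sum(float(row[field]) for row in rows)
    if isclose(claimed_total, recomputed, rel_tol=rel_tol):
        return CheckResult(True, f"total matches source ({recomputed})")
    return CheckResult(False, f"agent claimed {claimed_total}, source gives {recomputed}")
```

Calling `verify_total(200, [{"arr": 120}, {"arr": 80}], "arr")` passes, while a claimed 250 over the same rows fails with both numbers in the detail message. That message is what the human reviewer reads instead of re-deriving the analysis by hand.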

The handoff. Where the agent prepares the analysis and the human owns the decision. Sarah Guo's framing — distributed product and research work in partnership with humans who do the work — is exactly this. I agreed with her in real time on Thursday: every fund system we have shipped at Black Matter, from Systemiq's portfolio news digest to Active Impact's portfolio metrics platform, is built on this handoff. The system never makes the call. The partner does. The system makes sure the partner has the right context when they do.

All four of these survive a model swap. None of them belong to your vendor. None of them are made obsolete by a Codex update.

V. What I'm doing about it

The Lab itself is the worked example. Black Matter runs about a dozen Claude routines as agents — one reads my own social posts, one reads the watchlist, one writes this essay, one reviews. The whole swarm is built so the harness is the smallest, dumbest possible thing. Read prompt. Pull source data from Notion. Write to Notion. Log the run. Exit.
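
The read, pull, write, log, exit loop can be sketched in a few lines. This is an illustration under stated assumptions, not the actual runner: `run_routine` and its parameters are hypothetical, and the data source, the model call, and the destination are passed in as plain callables so that no Notion or model API is assumed.

```python
import json
from datetime import datetime, timezone
from pathlib import Path
from typing import Callable


def run_routine(
    prompt_path: Path,
    fetch: Callable[[], str],          # pulls source data from wherever it lives
    complete: Callable[[str], str],    # any model call; the part that gets swapped
    publish: Callable[[str], None],    # writes the result back
    log_path: Path,
) -> str:
    """Read the prompt, pull data, call the model, write the result, log the run."""
    prompt = prompt_path.read_text(encoding="utf-8")
    source = fetch()
    output = complete(f"{prompt}\n\n---\n{source}")
    publish(output)
    row = {
        "ran_at": datetime.now(timezone.utc).isoformat(),
        "prompt": prompt_path.name,
        "source_chars": len(source),
        "output_chars": len(output),
    }
    # One JSON line per run serves as the audit log.
    with log_path.open("a", encoding="utf-8") as log:
        log.write(json.dumps(row) + "\n")
    return output
```

Because the model call is just a `complete` argument, changing vendors in this sketch means passing a different function. The prompt file and the log format stay where they are.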

The leverage isn't in the runner. It's in the prompts (versioned in git, reviewed like code), the Notion schema (designed so agents can read it without LLM judgement on structure), the cross-references (every social post cited in an essay gets back-linked to that essay), and the audit log (every run leaves a row).
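
To make the cross-reference idea concrete, here is a small sketch with fixed, typed fields and a back-linking step. The record shapes (`Post`, `Essay`) and the `backlink` function are hypothetical stand-ins, not the real Notion schema.

```python
from dataclasses import dataclass, field


@dataclass
class Post:
    post_id: str
    author: str
    # Ids of essays that cite this post.
    cited_in: set[str] = field(default_factory=set)


@dataclass
class Essay:
    essay_id: str
    cites: list[str]  # post ids, in citation order


def backlink(essay: Essay, posts: dict[str, Post]) -> list[str]:
    """Add the essay to every cited post; return cited ids that have no record."""
    dangling = []
    for post_id in essay.cites:
        post = posts.get(post_id)
        if post is None:
            dangling.append(post_id)
        else:
            post.cited_in.add(essay.essay_id)
    return dangling
```

The structure is explicit enough that an agent never has to infer it, running the step twice changes nothing, and a citation that points at a missing record is returned rather than silently dropped.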

If tomorrow Anthropic ships a competing routine product, or Claude Code 5 makes my current orchestration pattern obsolete, ninety percent of the value moves over with one weekend of work. The prompts are markdown. The schemas are portable. The data is in Notion. The thin harness is the only thing I'd throw away — and the only thing I'm not emotionally attached to, because I designed it to be thrown away.

That's the operating principle. Assume the harness will be obsoleted within twelve months. Invest accordingly.

I'm not going to lie — the first version of this swarm was super shit. I built a fat harness, lots of custom orchestration, beautiful state machines. I'm sitting there a month later thinking — what the hell did I just build? I rewrote it on a flight. The version running today is one-third the code and ten times more useful, because the leverage is now in places I can change without breaking anything.

VI. What changes Monday morning

For fund platform teams: stop buying or building the agent platform. The discourse will tell you that's the unit of competition this year. It isn't. The unit of competition is whether your partners can pull a coherent answer out of seven years of accumulated deal data, meeting notes, and portfolio updates in fifteen seconds. The platform you wire up to do that is the easy part — and the part that will be replaced within eighteen months by whatever Anthropic, OpenAI, or Notion ships natively. The hard part is making sure the underlying data is clean enough to be queried by anything. Spend the budget there.

For general builders: stop building the orchestrator. Build the skills. Build the context. Build the verification loops. Stay in the most boring runner you can find — a Claude routine, a cron job, a 100-line Python script — and pour the effort into the layer underneath. When the next model lands, you swap the runner in a day. When the next orchestrator lands, you migrate in a week. What you can't migrate is fifteen months of unmaintained context. Build that layer first, and assume the rest is rental.

It's 2026 now. Sarah Guo wants out of the proprietary GUIs. Simon Willison says framework lock-in is reversible. Aaron Levie says a model update just killed someone's year of scaffolding. Lenny says the data scientists are spending all their time auditing wrong AI outputs. Garry Tan says tokenmax our way past it all. Fair, but with what context?

The answer is the context you spent the last year building. The skills you wrote tests for. The verification loops you wired into CI. The handoff you designed for your specific operators.

That's what the moat actually is.

VII. Want this for your fund?

Black Matter VC builds and operates exactly this layer for funds — the clean context, the deterministic skills, the verification loops, the operator-AI handoff. Three months per engagement, not a slide deck. Email michael@blackmatter.vc. $10k/mo flat retainer, no lock-in.

VIII. Read more

We publish a build essay every Saturday and a weekly digest every Monday at blackmatter.vc/lab — what shipped in fund AI infra, what flipped, the must-reads. If this piece was useful, the digest is the easy follow-up.

More updates coming. Stay tuned.

— Drawing on this week's signal: @garrytan, @simonw, @levie, @saranormous, @swyx, @levelsio, @HarryStebbings, @lennysan, @ttunguz, @amasad, @andrewchen, plus my own posts of 2026-05-12 through 2026-05-15.

Michael Rouveure  ·  17 MAY 2026
