MirrorCode evaluates AI’s long-horizon coding capabilities with 22 open-source tasks
Let us take a look at a benchmark that may reshape how you think about the AI assistants already living inside your WordPress editor.

What the benchmark actually measures
The setup is deliberately harsh. Each AI agent receives only a program's behavior, meaning its inputs and outputs, and never the source code. The agent must then build a functional replica in any language it chooses, and every reimplementation is judged by hundreds to thousands of end-to-end tests that require exact output matching. There is no partial credit and no "close enough" rounding. This is not the same as asking an AI to draft a shortcode for you; it is asking it to engineer a whole application from scratch.
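To make "exact output matching" concrete, here is a minimal sketch of how such a scoring loop could look. It is not MirrorCode's actual harness; the `EndToEndCase` class, the `pass_rate` function, and the toy candidate program are all hypothetical names invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EndToEndCase:
    """One hypothetical test: program arguments plus the reference output."""
    args: tuple
    expected_stdout: str


def passes(expected: str, actual: str) -> bool:
    # Exact match only: no whitespace trimming, no numeric tolerance.
    return expected == actual


def pass_rate(cases, run_candidate) -> float:
    """Fraction of cases where the candidate's output equals the reference."""
    if not cases:
        raise ValueError("need at least one test case")
    passed = sum(passes(c.expected_stdout, run_candidate(c.args)) for c in cases)
    return passed / len(cases)


def toy_candidate(args):
    # Hypothetical stand-in for a reimplementation: upper-cases its first argument.
    return args[0].upper() + "\n"


cases = [
    EndToEndCase(("abc",), "ABC\n"),
    EndToEndCase(("xyz",), "XYZ\n"),
    EndToEndCase(("q",), "Q"),  # reference has no trailing newline, so this case fails
]
rate = pass_rate(cases, toy_candidate)
print(f"{rate:.2%} of cases passed")
```

The third case shows why this style of grading is unforgiving: a single stray newline counts as a full failure.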
The standout result belongs to Claude Opus 4.6, which tackled gotree, an open-source bioinformatics toolkit written in Go. The original program runs to roughly 16,000 lines of code. Claude rebuilt it in Rust in around 7,700 lines while passing 99.95% of 2,001 tests, which works out to a single failed test. METR and Epoch AI estimated that a human engineer would spend somewhere between 2 and 17 weeks on the same task, and Claude completed it within a token budget of up to 1 billion tokens. Notably, performance improved as that budget grew: the more room the model had to think, the better it performed.
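The headline figures are easy to check by hand. The short calculation below uses only the numbers quoted above; the line counts are the article's approximations, not exact measurements.

```python
total_tests = 2001
failed_tests = 1  # "a single failed test", per the article
pass_rate_pct = 100 * (total_tests - failed_tests) / total_tests

original_loc = 16_000  # approximate size of the Go original
rewrite_loc = 7_700    # approximate size of the Rust rewrite
size_ratio = rewrite_loc / original_loc

print(f"pass rate: {pass_rate_pct:.2f}%")                      # pass rate: 99.95%
print(f"rewrite is {size_ratio:.0%} of the original size")     # rewrite is 48% of the original size
```

In other words, the rewrite is a little under half the length of the original while missing one test in 2,001.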
What this means for your WordPress workflow
Let us connect those numbers to the panel toggles and block inspectors you open every day. Today's commercial AI coding assistants — the ones bolted into VS Code, JetBrains, or your favorite plugin generator — mostly autocomplete functions and suggest snippets. MirrorCode demonstrates that the underlying models are capable of something far more ambitious: autonomous engineering at the scale of entire applications. That gap between "suggest a hook" and "ship a working plugin" is narrowing, and if you build custom Gutenberg blocks or WooCommerce extensions for clients, it is worth planning for that shift now rather than later.
However, the benchmark also draws a clear boundary. Larger, more complex programs like Pkl remained unsolved under the tested limits, which is a useful reminder that WordPress core itself, with its decades of PHP quirks, database abstractions, and backward-compatibility promises, is a different beast from a standalone CLI tool. The AI can reimplement a roughly 16,000-line Go toolkit as about 7,700 lines of Rust; it cannot yet reimplement a system shaped by decades of accumulated legacy decisions. Consequently, treat the AI as a junior teammate who is brilliant on isolated tasks and unreliable on system-wide rewrites.
What to watch — and what to budget for
There is a cost dimension worth flagging before you unleash an agent on a client site. The correlation between token budget and performance means compute expenses scale with ambition. Running a billion tokens through a frontier model is not cheap, and the economics only pencil out if the output reliably replaces a chunk of human effort. Run a pilot on a staging environment first and measure tokens spent against hours saved; the math will tell you quickly whether the workflow holds.
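One way to run that pilot math is sketched below. Every number in it is a placeholder assumption, not a real price or a measured result: the token count, the price per million tokens, the hours saved, and the hourly rate should all be replaced with figures from your own staging run and your own invoices.

```python
def agent_cost_usd(tokens: int, usd_per_million_tokens: float) -> float:
    """Compute spend for a given token count at a flat per-million price."""
    return tokens / 1_000_000 * usd_per_million_tokens


def break_even_hours(tokens: int, usd_per_million_tokens: float,
                     hourly_rate_usd: float) -> float:
    """Hours of human work the agent must replace to pay for itself."""
    return agent_cost_usd(tokens, usd_per_million_tokens) / hourly_rate_usd


def net_saving_usd(tokens: int, usd_per_million_tokens: float,
                   hours_saved: float, hourly_rate_usd: float) -> float:
    """Positive means the pilot saved money; negative means it cost more."""
    human_cost = hours_saved * hourly_rate_usd
    return human_cost - agent_cost_usd(tokens, usd_per_million_tokens)


# Placeholder pilot figures (assumptions for illustration only).
pilot_tokens = 40_000_000
price_per_million = 10.0
hours_saved = 6.0
hourly_rate = 90.0

cost = agent_cost_usd(pilot_tokens, price_per_million)
saving = net_saving_usd(pilot_tokens, price_per_million, hours_saved, hourly_rate)
needed = break_even_hours(pilot_tokens, price_per_million, hourly_rate)

print(f"agent cost: ${cost:.2f}")
print(f"net saving: ${saving:.2f}")
print(f"break-even at {needed:.1f} hours saved")
```

A flat per-million price is a simplification; providers often price input and output tokens differently, so a real pilot would track both.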
Two adjacent developments reinforce this picture. Open-source AI coding platform Kilo Code was named a "Cool Vendor" by Gartner in its 2026 AI Coding Agents report, with the company reporting more than 3 million developers using the platform to route work across more than 500 models through a centralized gateway. Meanwhile, The Hacker News has flagged GuardFall, a class of shell injection vulnerabilities in open-source AI coding agents that stem from decades-old patterns resurfacing in agent pipelines. Together these stories tell us the category is maturing fast, and so are its risks.
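The article does not describe GuardFall's specifics, so the sketch below is not a reproduction of it. It only illustrates the general, long-known shell injection pattern: untrusted text spliced into a command string versus passed as a separate argument. The hostile filename is a made-up example, and the child process is simply the current Python interpreter echoing its first argument so the snippet runs anywhere.

```python
import subprocess
import sys

# Hypothetical hostile value an agent might read from a repo, issue, or web page.
untrusted_name = "notes.txt; echo INJECTED"

# Risky pattern (built as a string only, never executed here): if this were
# passed to a shell, the ';' would end the first command and start a second one.
risky_command = f"cat {untrusted_name}"

# Safer pattern: pass an argument list and leave shell=False (the default),
# so the whole value reaches the child process as one literal argument.
result = subprocess.run(
    [sys.executable, "-c", "import sys; sys.stdout.write(sys.argv[1])", untrusted_name],
    capture_output=True,
    text=True,
    check=True,
)
received = result.stdout

print(received == untrusted_name)  # True: the ';' was treated as plain text
```

When a shell is genuinely unavoidable, Python's `shlex.quote` can escape individual values, but keeping agent-supplied text out of shell strings altogether is the simpler rule to audit.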
Here is what we have established: MirrorCode shows AI can reimplement substantial software from scratch when given enough token budget, but it cannot yet handle the largest, most complex systems. For your WordPress work, that translates into a practical playbook — pilot AI on self-contained plugins first, measure tokens against hours saved, keep humans in the loop for anything touching core or production data, and watch the security advisories as closely as you watch the feature releases.