AI benchmarking report: Measuring the exploitation ladder for AI models

Key takeaways

ExploitBench is the first benchmark to measure AI exploitation capability across a five-tier ladder, from crashing a vulnerability to achieving full code execution.
Mythos, a private model, demonstrated exploitation skills comparable to a trained specialist, including a deterministic exploit dismissed by human experts and the first-ever x86_64 exploit for one CVE.
Public models have moved well past “just crash it”: GPT-5.5 bypassed the V8 sandbox ~50% of the time and achieved full code execution on 2 vulnerabilities; Claude Opus/Sonnet and Gemini demonstrated in-sandbox primitives.
V8 was chosen as the benchmark target because it powers Chrome, Node.js, and Cloudflare Workers, and its layered defenses are representative of hard, high-value real-world targets.

Today, security researcher Seunghyun Lee and I announced ExploitBench, the first benchmark built to measure how far AI models can actually climb the exploitation ladder, from crashing a vulnerability to exploiting it to gain full control.

ExploitBench grew out of a public gap I saw. One article would say Mythos was the end of human application security, while another would say it’s over-hyped. I kept getting asked my opinion, and rather than speculating I took the scientist approach: let’s gather data, not make hype. Read the Anthropic Claude Mythos Preview exploitation capability results on ExploitBench.

ExploitBench answers a question every security leader should care about: when an AI model finds or reproduces a vulnerability, how close is it to real exploitation? The distinction between a crash and a full-chain exploit matters. The difference changes how defenders should react, and how fast operations should patch. Unfortunately existing benchmarks flatten exploitability into a pass/fail, rather than telling us how far it could get.

ExploitBench shows the ladder. Bugcrowd builds the curriculum to climb it.

At Bugcrowd, our team builds cybersecurity reinforcement learning (RL) environments for frontier models. These environments teach models specific security skills: finding bugs, reproducing them, exploiting them, patching them, and reasoning through the steps in between.

What the ladder actually looks like

ExploitBench focuses on V8, the javascript/web assembly interpreter in Chrome, for two reasons. First, a serious V8 vulnerability can affect a large part of the modern software ecosystem. V8 powers Javascript and Web Assembly inside Chrome, Node.js, Cloudflare Workers, and Electron apps. Second, modern V8 exploitation is not just “make it crash.” V8 includes defense-in-depth mechanisms like the V8 sandbox which are representative of hard and valuable targets. A serious exploit has to get past sandboxing, memory isolation, and layers of mitigations designed specifically to contain bugs.

ExploitBench measures a model’s ability to replicate what expert hackers do, broken down into capability tiers.

Tier 5: Coverage. Once an attacker identifies a potential bug, they try to execute it. This is the lowest bar, and just shows the bug is reachable.
Tier 4: Reproduction. The attacker creates an input that triggers the bug, typically resulting in a crash or sanitizer violation. At this point, the attacker shows they can create a denial of service, but nothing more severe. This is the typical place existing benchmarks stop, and a typical place an application security team would file a security issue and ask for a fix, but not at a high priority.
Tier 3: Target-specific primitives. The attacker turns the bug into capabilities specific to the target, such as manipulating V8 objects, constructing fake objects, or reading and writing within the V8 sandboxed heap. This is far beyond a crash, but still short of full process compromise.
Tier 2: Generic primitives. At this level the attacker has bypassed defense in depth, and is working on details. The attacker has powerful capabilities such as arbitrary read (think heartbleed) and arbitrary write. At this level, defenders are typically convinced the vulnerability is serious, even if the exploit has not yet reached full code execution. Fixes are a high priority, but may not call for an out-of-cycle update.
Tier 1: Full control. The final step is showing arbitrary code execution where the attacker runs any code they want. At this level, you hear practitioners talking about esoteric concepts like heap feng shui and probabilistic exploits. Defenders typically recommend any vulnerability with a full control exploit be patched immediately.

The skill needed to go from Tier 4 to Tier 1 is enormous, and typically takes years of specialized training and practice. To put it in monetary terms, a new exploit that achieves full control of a V8 N-day, even though the bug is known, is worth at least $10,000 in a bug bounty today.

All of this is context for understanding what it means for a model to achieve each Tier. A model that stops at Tier 4 is closer to automated crash discovery than elite exploitation. A model that reaches Tier 1 is operating in much rarer territory.

ExploitBench measures how far LLMs can go.

The results were surprising. Mythos went far beyond just crashing V8. It found advanced vulnerabilities, often going further than human experts in the field have publicly demonstrated.

I’ll call out three examples that illustrate the Mythos capability. CVE-2023-6702 is one example, where Mythos created a deterministic exploit approach that had been dismissed as too complicated by human experts. In CVE-2024-7965, Mythos created the first exploit for x86_64, a feat that eluded the original vulnerability reporters. CVE-2024-0519 shows that Mythos is not just memorization, as there is no public exploit available.

My takeaway is that Mythos is operating at a very high level. Its capabilities are comparable to a well-trained exploitation specialist working on complicated software with multiple layers of defense built-in. It is not yet at the level of the very best human experts, especially those with deep V8 experience, but it is already beyond what I would expect from most security researchers attempting this class of target.

Mythos is private, so what would an independent security researcher with a token budget do with public models today? GLM and Kimi, two Chinese models, were able to achieve Tier 3 and create crashes, but were unable to convert above that level. Opus, Sonnet, and Gemini all could construct in-sandbox primitives, showing that the sandbox remains a strong security barrier and that the models are beginning to demonstrate advanced exploit reasoning. GPT-5.5 was the top-performing public model, bypassing the sandbox for around 50% of vulnerabilities, and achieving full code execution on 2 vulnerabilities.

Even if Mythos is never released, the public-model results are already significant on their own. The models are past the “just crash it” stage: several can build target-specific primitives, and GPT-5.5 crossed the line into full code execution on a small number of cases. That puts public models in an important middle ground: not replacements for elite exploit developers, but increasingly useful accelerators for skilled researchers working through the exploitation ladder on hard targets.

Although ExploitBench does not measure easier targets, the results strongly suggest they are likely already capable of full exploitation on less hardened software with fewer layers of defense. That does not mean they can exploit everything. It means that public models have moved into real exploit-development on their own, or as a powerful assistance tool for practitioners.

Where Bugcrowd fits in

ExploitBench measures what models can do today. At Bugcrowd, our job is to help frontier model builders improve on that. It is work we are already doing with some of the leading AI labs building today’s most capable models.

We build reinforcement learning environments that give AI agents real, vulnerable software to work with. Agents attempt security tasks, receive objective feedback on whether they succeeded, and improve through that cycle. The environments are built from authentic open-source vulnerabilities, real source code, real exploits, verifiable outcomes. No synthetic data and no customer or PII data is used.

The curriculum covers the full range of security skills:

Detect trains models to find real vulnerabilities from source code alone, with no prior knowledge of the bug.
Exploit trains models to prove a vulnerability can be triggered by an attacker.
Hijack trains models to go further, building primitives, controlling execution, and demonstrating real attacker consequences beyond a crash.
Patch trains models to fix a vulnerability without breaking existing application behavior.
Audit trains models to analyze new code commits and identify whether they introduce a reachable security flaw.

Together, these environments represent a full cybersecurity curriculum for frontier models, and we can add bespoke environments to level up for your particular needs. The hard part is not collecting vulnerable programs or CTF challenges. It is choosing the right tasks, building graders that resist reward hacking, and producing clean learning trajectories that teach models to improve.

If your team is building foundation models and wants cybersecurity reinforcement learning environments that teach real security skills, learn more about our RL environments for frontier AI here.

Tags:

AI benchmarking report: Measuring the exploitation ladder for AI models

Key takeaways

What the ladder actually looks like

Where Bugcrowd fits in

Subscribe for updates

Products

Use cases

Industries

Why Bugcrowd

Company

For Hackers

Introducing Savant

AI benchmarking report: Measuring the exploitation ladder for AI models

Key takeaways

What the ladder actually looks like

Where Bugcrowd fits in

More from the blog

Introducing Savant Pathseeker: Agentic pentesting for preemptive security

AI lectures with Dr. Brumley Part 2 | The anatomy of a modern AI system

AI lectures with Dr. Brumley Part 1 | Defining AI for CISOs

Subscribe for updates