Today, security researcher Seunghyun Lee and I announced ExploitBench, the first benchmark built to measure how far AI models can actually climb the exploitation ladder, from triggering a crash to exploiting the bug for full control.
ExploitBench grew out of a gap I saw in the public conversation. One article would say Mythos was the end of human application security, while another would say it’s over-hyped. I kept getting asked my opinion, and rather than speculating I took the scientific approach: let’s gather data, not make hype.
ExploitBench answers a question every security leader should care about: when an AI model finds or reproduces a vulnerability, how close is it to real exploitation? The distinction between a crash and a full-chain exploit matters, because it changes how defenders should react and how fast operations teams should patch. Unfortunately, existing benchmarks flatten exploitability into pass/fail rather than telling us how far a model could get.
ExploitBench shows the ladder. Bugcrowd builds the curriculum to climb it.
At Bugcrowd, our team builds cybersecurity reinforcement learning (RL) environments for frontier models. These environments teach models specific security skills: finding bugs, reproducing them, exploiting them, patching them, and reasoning through the steps in between.
ExploitBench focuses on V8, the JavaScript and WebAssembly engine in Chrome, for two reasons. First, a serious V8 vulnerability can affect a large part of the modern software ecosystem. V8 powers JavaScript and WebAssembly inside Chrome, Node.js, Cloudflare Workers, and Electron apps. Second, modern V8 exploitation is not just “make it crash.” V8 includes defense-in-depth mechanisms like the V8 sandbox, which make it representative of hard and valuable targets. A serious exploit has to get past sandboxing, memory isolation, and layers of mitigations designed specifically to contain bugs.
ExploitBench measures a model’s ability to replicate what expert hackers do, broken down into capability tiers.
The skill needed to go from Tier 4 to Tier 1 is enormous, and typically takes years of specialized training and practice. To put it in monetary terms, a new exploit that achieves full control through a V8 N-day is worth at least $10,000 in a bug bounty today, even though the bug is already known.
All of this is context for understanding what it means for a model to achieve each Tier. A model that stops at Tier 4 is closer to automated crash discovery than elite exploitation. A model that reaches Tier 1 is operating in much rarer territory.
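To make the difference between a pass/fail result and a ladder result concrete, here is a minimal Python sketch. The rung names and the `highest_rung` and `pass_fail` helpers are illustrative assumptions based on the stages mentioned in this post; they are not ExploitBench's actual tier definitions or scoring code.

```python
from enum import IntEnum
from typing import Iterable


class Rung(IntEnum):
    """Hypothetical rungs of an exploitation ladder; higher means further along."""
    NONE = 0
    CRASH = 1
    IN_SANDBOX_PRIMITIVE = 2
    SANDBOX_BYPASS = 3
    CODE_EXECUTION = 4


def highest_rung(verified: Iterable[Rung]) -> Rung:
    """Return the furthest rung that was verified for one attempt."""
    return max(verified, default=Rung.NONE)


def pass_fail(verified: Iterable[Rung]) -> bool:
    """A flattened score: True as soon as the bug was triggered at all."""
    return highest_rung(verified) >= Rung.CRASH


# Two made-up attempts on the same bug (illustrative only, not benchmark data).
crash_only = {Rung.CRASH}
full_chain = {Rung.CRASH, Rung.IN_SANDBOX_PRIMITIVE,
              Rung.SANDBOX_BYPASS, Rung.CODE_EXECUTION}

for name, attempt in [("crash_only", crash_only), ("full_chain", full_chain)]:
    print(name, pass_fail(attempt), highest_rung(attempt).name)
```

Both made-up attempts count as a pass under the flattened score, while the ladder score separates an attempt that only crashed the target from one that reached code execution.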
ExploitBench measures how far LLMs can go.
The results were surprising. Mythos went far beyond just crashing V8. It found advanced vulnerabilities, often going further than human experts in the field have publicly demonstrated.
Three examples illustrate Mythos’s capability. For CVE-2023-6702, Mythos developed a deterministic exploit approach that human experts had dismissed as too complicated. For CVE-2024-7965, Mythos created the first exploit for x86_64, a feat that eluded the original vulnerability reporters. CVE-2024-0519 shows that Mythos is not just relying on memorization, as no public exploit is available for it.
My takeaway is that Mythos is operating at a very high level. Its capabilities are comparable to a well-trained exploitation specialist working on complicated software with multiple layers of defense built in. It is not yet at the level of the very best human experts, especially those with deep V8 experience, but it is already beyond what I would expect from most security researchers attempting this class of target.
Mythos is private, so what could an independent security researcher with a token budget do with public models today? GLM and Kimi, two Chinese models, were able to achieve Tier 3 and create crashes, but were unable to progress beyond that level. Opus, Sonnet, and Gemini could all construct in-sandbox primitives, showing that the sandbox remains a strong security barrier and that the models are beginning to demonstrate advanced exploit reasoning. GPT-5.5 was the top-performing public model, bypassing the sandbox for around 50% of vulnerabilities and achieving full code execution on 2 vulnerabilities.
Even if Mythos is never released, the public-model results are already significant on their own. The models are past the “just crash it” stage: several can build target-specific primitives, and GPT-5.5 crossed the line into full code execution on a small number of cases. That puts public models in an important middle ground: not replacements for elite exploit developers, but increasingly useful accelerators for skilled researchers working through the exploitation ladder on hard targets.
Although ExploitBench does not measure easier targets, the results strongly suggest that public models are already capable of full exploitation on less hardened software with fewer layers of defense. That does not mean they can exploit everything. It means that public models have moved into real exploit development, whether working on their own or as a powerful assistant for practitioners.
ExploitBench measures what models can do today. At Bugcrowd, our job is to help frontier model builders improve on that. It is work we are already doing with some of the leading AI labs building today’s most capable models.
We build reinforcement learning environments that give AI agents real, vulnerable software to work with. Agents attempt security tasks, receive objective feedback on whether they succeeded, and improve through that cycle. The environments are built from authentic open-source vulnerabilities: real source code, real exploits, and verifiable outcomes. We use no synthetic data and no customer data or PII.
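As a rough illustration of what “objective feedback” can look like, the Python sketch below grades an attempt by checking a per-episode secret instead of trusting the agent's own report of success. The `Episode` class and `grade` function are hypothetical names invented for this example, and the flag-based check is a generic CTF-style technique assumed here for illustration, not a description of Bugcrowd's actual graders.

```python
import hmac
import secrets
from dataclasses import dataclass, field


@dataclass
class Episode:
    """One attempt at a task (hypothetical structure).

    The flag is random per episode, so an answer memorized from an
    earlier run earns nothing. In a real environment it would be placed
    where only a working exploit could read it.
    """
    flag: str = field(default_factory=lambda: secrets.token_hex(16))


def grade(episode: Episode, submission: str) -> float:
    """Return 1.0 only if the submission equals this episode's flag."""
    matches = hmac.compare_digest(episode.flag.encode(), submission.encode())
    return 1.0 if matches else 0.0


episode = Episode()
print(grade(episode, "I achieved code execution"))  # a claim alone earns 0.0
print(grade(episode, episode.flag))                 # the verified outcome earns 1.0
```

The design choice is that reward depends only on something the environment can verify, which is one simple way to make a grader harder to reward-hack.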
The curriculum covers the full range of security skills:
Together, these environments represent a full cybersecurity curriculum for frontier models, and we can add bespoke environments to level up for your particular needs. The hard part is not collecting vulnerable programs or CTF challenges. It is choosing the right tasks, building graders that resist reward hacking, and producing clean learning trajectories that teach models to improve.
If your team is building foundation models and wants cybersecurity reinforcement learning environments that teach real security skills, learn more about our RL environments for frontier AI here.