Endor Labs Launches Agentic Code Security Benchmark, Finds Top-Performing AI Coding Agents Pass Tests But Still Fail Security
Endor Labs launched an agentic code security benchmark that extends the Carnegie Mellon SusVibes framework to evaluate how Artificial Intelligence (AI) coding agents perform on secure code generation in real-world scenarios. The effort centers on measuring outcomes when agent-generated code is checked for both functional correctness and security.
The benchmark supports continuous evaluation of new AI coding agents and models, and it includes a public leaderboard intended to track performance over time. Endor Labs said the work targeted “observed cheating of newer agents,” where agent behavior diverged from explicit instructions.
The SusVibes extension is built on real-world code and peer-reviewed research developed at Carnegie Mellon University. Endor Labs’ benchmark evaluates 200 real-world tasks from 108 open-source projects and covers 77 Common Weakness Enumeration (CWE) vulnerability classes. The company added new test harnesses for agents, evaluated new AI models, and introduced anti-cheating safeguards, including prompt hardening and automated detection systems.
Alongside the benchmark, Endor Labs introduced the Agent Security League, a public leaderboard covering functional correctness and security outcomes. The company reported that the highest-performing agent passed functional tests at 84.4%, while the highest-performing security agent achieved 17.3% security correctness, with 87% of AI-generated code containing at least one security vulnerability. “AI coding agents are dramatically increasing the speed and scale at which software gets written, but security isn't keeping pace,” said Varun Badhwar, CEO at Endor Labs. “The challenge isn't just whether the code works, it's whether it's actually safe in the context of a real system. This work builds on rigorous university research grounded in real-world open source code. Today, Endor Labs is extending that foundation and making the continuous evaluation of new models public, pushing the industry toward greater accountability and giving teams a clearer view of how these systems actually behave.” The company also cited cheating behavior where agents ignored explicit instructions not to inspect git history in 81.5% of benchmark tasks (163/200).
Endor Labs said the Agent Security League leaderboard would be updated as new agents and models are released.