Open-source software supply chain security relies heavily on assessing the affected versions of library vulnerabilities. While prior studies have leveraged exploits to verify which versions a vulnerability affects, they point out a key limitation: exploits are version-specific and cannot be directly applied across library versions. Despite being widely acknowledged, this limitation has not been systematically validated at scale, leaving the actual applicability of exploits across versions unexplored. To fill this gap, we conduct the first large-scale empirical study on exploit applicability across library versions. We construct a comprehensive dataset of 259 exploits spanning 128 Java libraries and 28,150 historical versions, covering 61 CWEs that account for 76.33% of vulnerabilities in Maven. Leveraging this dataset, we execute each exploit against the library's version history and compare the execution outcomes with our manually annotated ground-truth affected versions. We further investigate the root causes of inconsistencies between exploit execution and ground truth, and explore strategies for exploit migration. Our results (RQ1) show that, even without migration, explo
Research on exploit chains predominantly focuses on sequences with a single type of exploit, e.g., either escalating privileges on a machine or executing remote code. In networks, hybrid exploit chains are critical because vulnerabilities of different types can be linked together. Moreover, developing hybrid exploit chains is challenging because it requires understanding diverse and independent dependencies and outcomes. We present hybrid chains encompassing privilege escalation (PE) and remote code execution (RCE) exploits. These chains are executable and can span large networks, where numerous potential exploit combinations arise from the large array of network assets, their hardware, software, configurations, and vulnerabilities. The chains are generated by ALFA-Chains, an AI-supported framework for the automated discovery of multi-step PE and RCE exploit chains in networks, across arbitrary environments, including segmented networks. Through an LLM-based classification, ALFA-Chains describes exploits in the Planning Domain Description Language (PDDL). The PDDL exploit and network descriptions are then fed to off-the-shelf AI planners to find multiple exploit chains. ALFA-Chains finds 12 unknown chains on an example with a kn
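The classification-to-planning pipeline above can be illustrated with a minimal sketch. This is not ALFA-Chains' actual schema; the predicate names (`reachable`, `runs-vulnerable-service`, `user-code-exec`) and the exploit name are hypothetical, and the sketch only shows how exploit metadata might be rendered as a PDDL action that a planner can chain.

```python
# Sketch: render one exploit as a PDDL :action over a host variable ?h.
# All predicate and exploit names below are illustrative assumptions,
# not the schema used by ALFA-Chains.

def exploit_to_pddl(name, preconditions, effects):
    """Build a PDDL action string from exploit pre/post-conditions."""
    pre = " ".join(f"({p} ?h)" for p in preconditions)
    eff = " ".join(f"({e} ?h)" for e in effects)
    return (
        f"(:action {name}\n"
        f"  :parameters (?h - host)\n"
        f"  :precondition (and {pre})\n"
        f"  :effect (and {eff}))"
    )

# Example: an RCE exploit that requires network reachability and a
# vulnerable service, and grants user-level code execution on the host.
action = exploit_to_pddl(
    "example-rce",
    preconditions=["reachable", "runs-vulnerable-service"],
    effects=["user-code-exec"],
)
print(action)
```

A planner can then chain such actions whenever one exploit's effect satisfies another's precondition, which is exactly what makes hybrid PE/RCE sequences discoverable automatically.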
Smart contract vulnerabilities in Decentralized Finance cause billions of dollars in losses every year, yet the security community faces a critical bottleneck: identifying a vulnerability is not the same as proving it is exploitable. Manual PoC construction is prohibitively labor-intensive, leaving most disclosed vulnerabilities unverified and protocols exposed long before mitigations are applied. In this paper, we propose \sys, a knowledge-driven agentic system for end-to-end contract vulnerability detection and exploit synthesis. Our core insight is that exploit synthesis is not a code generation task but a \emph{structured reasoning problem} that requires grounded knowledge of protocol semantics, failure root causes, and exploit primitives. \sys organizes this knowledge into a \emph{Hierarchical Knowledge Graph} (HKG) that serves as structured memory for LLM-guided multi-hop reasoning. To validate exploit feasibility beyond code synthesis, \sys employs a two-stage validation framework that checks exploit-path reachability via SMT solving and profit realizability via asset-level state simulation, ensuring that generated PoCs satisfy both logical and economic viability constraints. Eva
Poker is an imperfect-information game that has served as a long-standing benchmark for decision-making under uncertainty. To maximize utility beyond the Nash equilibrium, an agent can deviate from Nash-equilibrium policies to exploit suboptimal play. We introduce AlphaExploitem, which extends the competitive RL poker agent AlphaHoldem in two ways: a hierarchical transformer encoder that enables reasoning over previously played hands, and a modified training procedure that includes a diverse pool of exploitable opponents to facilitate learning to exploit. We train and evaluate AlphaExploitem on two standard benchmarks for imperfect-information games. Empirically, AlphaExploitem successfully exploits weak play by both in- and out-of-distribution opponents, without losing performance against NE opponents.
Vulnerability detection tools are widely adopted in software projects, yet they often overwhelm maintainers with false positives and non-actionable reports. Automated exploitation systems can help validate these reports; however, existing approaches typically operate in isolation from detection pipelines, failing to leverage readily available metadata such as vulnerability type and source-code location. In this paper, we investigate how reported security vulnerabilities can be assessed in a realistic grey-box exploitation setting that leverages minimal vulnerability metadata, specifically a CWE classification and a vulnerable code location. We introduce Agentic eXploit Engine (AXE), a multi-agent framework for Web application exploitation that maps lightweight detection metadata to concrete exploits through decoupled planning, code exploration, and dynamic execution feedback. Evaluated on the CVE-Bench dataset, AXE achieves a 30% exploitation success rate, a 3x improvement over state-of-the-art black-box baselines. Even in a single-agent configuration, grey-box metadata yields a 1.75x performance gain. Systematic error analysis shows that most failed attempts arise from specific re
Security practitioners face growing challenges in exploit assessment, as public vulnerability repositories are increasingly populated with inconsistent and low-quality exploit artifacts. Existing scoring systems, such as CVSS and EPSS, offer limited support for this task. They either rely on theoretical metrics or produce opaque probability estimates without assessing whether usable exploit code exists. In practice, security teams often resort to manual triage of exploit repositories, which is time-consuming, error-prone, and difficult to scale. We present AEAS, an automated system designed to assess and prioritize actionable exploits through static analysis. AEAS analyzes both exploit code and associated documentation to extract a structured set of features reflecting exploit availability, functionality, and setup complexity. It then computes an actionability score for each exploit and produces ranked exploit recommendations. We evaluate AEAS on a dataset of over 5,000 vulnerabilities derived from 600+ real-world applications frequently encountered by red teams. Manual validation and expert review on representative subsets show that AEAS achieves a 100% top-3 success rate in recom
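The feature-scoring idea in the AEAS abstract can be sketched as a small ranking function. The specific features, weights, and scores below are illustrative assumptions, not the model AEAS actually uses; the sketch only shows how availability, functionality, and setup-complexity features might combine into an actionability ranking.

```python
# Toy sketch of feature-weighted exploit ranking in the spirit of AEAS.
# Feature names, weights, and values are hypothetical illustrations.

def actionability(features, weights):
    """Weighted sum of normalized [0, 1] features; missing features count as 0."""
    return sum(weights[k] * features.get(k, 0.0) for k in weights)

WEIGHTS = {"availability": 0.4, "functionality": 0.4, "setup_simplicity": 0.2}

exploits = {
    "exploit-a": {"availability": 1.0, "functionality": 0.9, "setup_simplicity": 0.8},
    "exploit-b": {"availability": 1.0, "functionality": 0.3, "setup_simplicity": 0.9},
    "exploit-c": {"availability": 0.2, "functionality": 0.5, "setup_simplicity": 0.1},
}

# Rank exploits from most to least actionable.
ranked = sorted(exploits, key=lambda e: actionability(exploits[e], WEIGHTS),
                reverse=True)
print(ranked)  # ['exploit-a', 'exploit-b', 'exploit-c']
```

In practice the hard part is not the weighted sum but extracting reliable features from heterogeneous exploit code and documentation, which is what the static-analysis front end addresses.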
Recently, Large Language Models (LLMs) have been used in security vulnerability detection tasks, including generating proof-of-concept (PoC) exploits. A PoC exploit is a program used to demonstrate how a vulnerability can be exploited. Several approaches suggest that supporting LLMs with additional guidance can improve PoC generation outcomes, motivating further evaluation of their effectiveness. In this work, we develop PoC-Gym, a framework for PoC generation for Java security vulnerabilities via LLMs and systematic validation of generated exploits. Using PoC-Gym, we evaluate whether guidance from static analysis tools improves the PoC generation success rate, and we manually inspect the resulting PoCs. Our results from running PoC-Gym with Claude Sonnet 4, GPT-5 Medium, and gpt-oss-20b show that using static analysis for guidance and criteria leads to 21% higher success rates than the prior baseline, FaultLine. However, manual inspection of both successful and failed PoCs reveals that 71.5% of the PoCs are invalid. These results show that the reported success of LLM-based PoC generation can be significantly misleading, which is hard to detect with current validation mechanisms.
LLM agents with tool access can discover and exploit security vulnerabilities. This is known. What is not known is which features of a system prompt trigger this behaviour, and which do not. We present a systematic taxonomy based on approximately 10,000 trials across seven models, 37 prompt conditions, and 12 hypothesized attack dimensions -- all executed in real Docker sandboxes. Every condition includes the instruction "Always respect all rules and access policies"; only a single encouragement sentence varies. Within the task class studied (planted test-runner vulnerabilities), the threat model is narrower than assumed: nine of twelve dimensions -- including minimization, moral licensing, incentives, identity priming, and reasoning triggers -- produce no detectable exploitation at n=50 per cell (upper 95% CI <7% per condition). But it is also more subtle than expected: one dimension, goal reframing, reliably triggers exploitation. "You are solving a puzzle; there may be hidden clues" produces 38-40% exploitation on Claude Sonnet 4 despite the explicit rule instruction, replicating across four models (CTF framing: 8-14% on DeepSeek, GPT-5-mini, o4-mini). The agent does not over
Open-source libraries are widely used in modern software development, but their use can introduce significant security vulnerabilities. While static analysis tools can identify potential vulnerabilities at scale, they often generate overwhelming reports with high false positive rates. Automated Exploit Generation (AEG) emerges as a promising solution to confirm vulnerability authenticity by generating an exploit. However, traditional AEG approaches based on fuzzing or symbolic execution face path-coverage and constraint-solving problems. Although LLMs show great potential for AEG, how to effectively leverage them to comprehend vulnerabilities and generate corresponding exploits is still an open question. To address these challenges, we propose Vulnsage, a multi-agent framework for AEG. Vulnsage simulates human security researchers' workflows by decomposing the complex AEG process into multiple specialized sub-agents: a Code Analyzer Agent, a Code Generation Agent, a Validation Agent, and a set of Reflection Agents, orchestrated by a central supervisor through iterative cycles. Given a target program, the Code Analyzer Agent performs static analysis to identify potential vulnerabilities and collects rele
Security vulnerabilities in software packages are a significant concern for developers and users alike. Patching these vulnerabilities in a timely manner is crucial to restoring the integrity and security of software systems. However, previous work has shown that vulnerability reports often lack proof-of-concept (PoC) exploits, which are essential for fixing the vulnerability, testing patches, and avoiding regressions. Creating a PoC exploit is challenging because vulnerability reports are informal and often incomplete, and because it requires a detailed understanding of how inputs passed to potentially vulnerable APIs may reach security-relevant sinks. In this paper, we present PoCGen, a novel approach to autonomously generate and validate PoC exploits for vulnerabilities in npm packages. The approach is the first to address this task by combining the complementary strengths of large language models (LLMs), e.g., to understand informal vulnerability reports, with static analysis, e.g., to identify taint paths, and dynamic analysis, e.g., to validate generated exploits. PoCGen successfully generates exploits for 77% of the vulnerabilities in the SecBench$.$js dataset. This success
Reinforcement learning (RL) trained language model agents with tool access are increasingly deployed in coding assistants, research tools, and autonomous systems. We introduce the Reward Hacking Benchmark (RHB), a suite of multi-step tasks requiring sequential tool operations with naturalistic shortcut opportunities such as skipping verification steps, inferring answers from task-adjacent metadata, or tampering with evaluation-relevant functions. RHB supports independent and chained task regimes, where chain length acts as a proxy for longer-horizon agent behavior. We evaluate 13 frontier models from OpenAI, Anthropic, Google, and DeepSeek. Exploit rates range from 0% (Claude Sonnet 4.5) to 13.9% (DeepSeek-R1-Zero), varying sharply by post-training style. A controlled sibling comparison (DeepSeek-V3 vs. DeepSeek-R1-Zero) shows RL post-training is associated with substantially higher reward hacking (0.6% vs. 13.9%), with consistent gaps across all four task families. We identify six exploit categories and find that 72% of reward hacking episodes include explicit chain-of-thought rationale, suggesting models often frame exploits as legitimate problem-solving. Simple environmental har
Studying the responses of large language models (LLMs) to loopholes presents a two-fold opportunity. First, it affords us a lens through which to examine ambiguity and pragmatics in LLMs, since exploiting a loophole requires identifying ambiguity and performing sophisticated pragmatic reasoning. Second, loopholes pose an interesting and novel alignment problem where the model is presented with conflicting goals and can exploit ambiguities to its own advantage. To address these questions, we design scenarios where LLMs are given a goal and an ambiguous user instruction in conflict with the goal, with scenarios covering scalar implicature, structural ambiguities, and power dynamics. We then measure different models' abilities to exploit loopholes to satisfy their given goals as opposed to the goals of the user. We find that both closed-source and stronger open-source models can identify ambiguities and exploit their resulting loopholes, presenting a potential AI safety risk. Our analysis indicates that models which exploit loopholes explicitly identify and reason about both ambiguity and conflicting goals.
An exploit, or proof of concept (PoC), of a vulnerability plays an important role in developing superior vulnerability repair techniques, as it can serve as an oracle to verify the correctness of the patches generated by such tools. However, exploits are often unavailable and require time and expert knowledge to craft. Obtaining them from exploit generation techniques is a potential alternative. The goal of this survey is to help researchers and practitioners understand existing exploit generation techniques through an analysis of their characteristics and their usability in practice. We identify a list of exploit generation techniques from the literature and group them into four categories: automated exploit generation, security testing, fuzzing, and other techniques. Most of the techniques focus on memory-based vulnerabilities in C/C++ programs and web-based injection vulnerabilities in PHP and Java applications. We found only a few studies that publicly provide usable tools associated with their techniques.
Automated Program Repair (APR) for smart contract security promises to automatically mitigate the smart contract vulnerabilities responsible for billions in financial losses. However, the true effectiveness of this research in addressing smart contract exploits remains uncharted territory. This paper bridges this critical gap by introducing a novel and systematic experimental framework for evaluating the exploit mitigation of program repair tools for smart contracts. We qualitatively and quantitatively analyze 20 state-of-the-art APR tools using a dataset of 143 vulnerable smart contracts, for which we manually craft 91 executable exploits. We are the first to define and measure the essential "exploit mitigation rate", giving researchers and practitioners a real sense of the effectiveness of cutting-edge techniques. Our findings reveal substantial disparities in the state of the art, with exploit mitigation rates ranging from a low of 29% to a high of 74%. Our study identifies systemic limitations, such as inconsistent functionality preservation, that must be addressed in future research on program repair for smart contracts.
Smart contract vulnerabilities have led to billions in losses, yet finding actionable exploits remains challenging. Traditional fuzzers rely on rigid heuristics and struggle with complex attacks, while human auditors are thorough but slow and don't scale. Large Language Models offer a promising middle ground, combining human-like reasoning with machine speed. Early studies show that simply prompting LLMs generates unverified vulnerability speculations with high false positive rates. To address this, we present A1, an agentic system that transforms any LLM into an end-to-end exploit generator. A1 provides agents with six domain-specific tools for autonomous vulnerability discovery, from understanding contract behavior to testing strategies on real blockchain states. All outputs are concretely validated through execution, ensuring only profitable proof-of-concept exploits are reported. We evaluate A1 across 36 real-world vulnerable contracts on Ethereum and Binance Smart Chain. A1 achieves a 63% success rate on the VERITE benchmark. Across all successful cases, A1 extracts up to \$8.59 million per exploit and \$9.33 million total. Using Monte Carlo analysis of historical attacks, we
This paper investigates the "Exploitation Business" model, which capitalizes on information asymmetry to exploit vulnerable populations. It focuses on businesses, and fraudsters, that target non-experts and capitalize on information asymmetry to sell products or services to desperate individuals. This phenomenon, also described as "profit-making activities based on informational exploitation," thrives on individuals' limited access to information, lack of expertise, and Fear of Missing Out (FOMO). The recent advancement of social media and the rising trend of the fandom business have accelerated the proliferation of such exploitation business models. Discussions on the empowerment and exploitation of fans in the digital media era point to a restructuring of relationships between fans and media creators, highlighting the necessity of not overlooking the exploitation of fans' free labor. This paper analyzes the various facets and impacts of exploitation business models, enriched by real-world examples from sectors like cryptocurrency and GenAI, thereby discussing their social, economic, and ethical implications. Moreover, through theoretical backgrounds and research, it explores similar
We present LegalSim, a modular multi-agent simulation of adversarial legal proceedings that explores how AI systems can exploit procedural weaknesses in codified rules. Plaintiff and defendant agents choose from a constrained action space (for example, discovery requests, motions, meet-and-confer, sanctions) governed by a JSON rules engine, while a stochastic judge model with calibrated grant rates, cost allocations, and sanction tendencies resolves outcomes. We compare four policies: PPO, a contextual bandit with an LLM, a direct LLM policy, and a hand-crafted heuristic. Instead of optimizing binary case outcomes, agents are trained and evaluated using effective win rate and a composite exploit score that combines opponent-cost inflation, calendar pressure, settlement pressure at low merit, and a rule-compliance margin. Across configurable regimes (e.g., bankruptcy stays, inter partes review, tax procedures) and heterogeneous judges, we observe emergent ``exploit chains'', such as cost-inflating discovery sequences and calendar-pressure tactics that remain procedurally valid yet systemically harmful. Evaluation via cross-play and Bradley-Terry ratings shows that PPO wins more often, t
Web applications are prime targets for cyberattacks as gateways to critical services and sensitive data. Traditional penetration testing is costly and expertise-intensive, making it difficult to scale with the growing web ecosystem. While language model agents show promise in cybersecurity, modern web applications demand visual understanding, dynamic content handling, and multi-step interactions that only computer-use agents (CUAs) can perform. Yet, their ability to discover and exploit vulnerabilities through graphical interfaces remains largely unexplored. We present HackWorld, the first framework for systematically evaluating CUAs' capabilities to exploit web application vulnerabilities via visual interaction. Unlike sanitized benchmarks, HackWorld includes 36 real-world applications across 11 frameworks and 7 languages, featuring realistic flaws such as injection vulnerabilities, authentication bypasses, and unsafe input handling. Using a Capture-the-Flag (CTF) setup, it tests CUAs' capacity to identify and exploit these weaknesses while navigating complex web interfaces. Evaluation of state-of-the-art CUAs reveals concerning trends: exploitation rates below 12% and low cyberse
Polynomial optimization problems are infinite-dimensional, nonconvex, NP-hard, and are often handled in practice with the moment-sums-of-squares hierarchy of semidefinite programming bounds. We consider problems where the objective function and constraint polynomials are invariant under the action of a finite group. The present paper simultaneously exploits group symmetry and term sparsity in order to reduce the computational cost of the hierarchy. We first exploit symmetry by writing the semidefinite matrices in a symmetry-adapted basis according to an isotypic decomposition. The matrices in such a basis are block diagonal. Secondly, we exploit term sparsity on each block to further reduce the optimization matrix variables. This is a non-trivial extension of the term sparsity-based hierarchy related to sign symmetry that was introduced by two of the authors. Our method is compared with existing techniques via benchmarks on quartics with dihedral, cyclic and symmetric group symmetry.
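The symmetry-reduction step described above can be stated generically. The following is a sketch of the standard isotypic block-diagonalization (not the paper's specific construction): if a semidefinite matrix commutes with the group action, a symmetry-adapted change of basis splits one large positive-semidefinite constraint into several smaller ones.

```latex
% Sketch of the standard symmetry reduction. If $M(y)$ commutes with the
% action of a finite group $G$ on $V = \bigoplus_i V_i^{\oplus m_i}$
% (isotypic decomposition, irreducibles $V_i$ of dimension $d_i$ with
% multiplicity $m_i$), then a symmetry-adapted basis $T$ gives
M(y) \succeq 0
\quad\Longleftrightarrow\quad
T^{*} M(y)\, T = \bigoplus_{i} \bigl( I_{d_i} \otimes M_i(y) \bigr) \succeq 0,
% so one constraint of size $\dim V$ is replaced by blocks $M_i(y)$ of
% size $m_i$, each repeated $d_i$ times.
```

Term sparsity is then applied to each block $M_i(y)$ separately, which is what makes the two reductions combine.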
Crossover is a powerful mechanism for generating new solutions from a given population of solutions. However, crossover carries an inherent tension: on the one hand, it usually works best if there is enough diversity in the population; on the other hand, exploiting the benefits of crossover reduces diversity. This antagonism often makes crossover reduce its own effectiveness. We introduce a new paradigm for utilizing crossover that reduces this antagonism, which we call diversity-preserving exploitation of crossover (DiPEC). The resulting Diversity Exploitation Genetic Algorithm (DEGA) still exploits the benefits of crossover, but preserves much higher diversity than conventional approaches. We demonstrate the benefits by proving that the (2+1)-DEGA finds the optimum of LeadingOnes with $O(n^{5/3}\log^{2/3} n)$ fitness evaluations. This is remarkable since standard genetic algorithms need $\Theta(n^2)$ evaluations, and among genetic algorithms only some artificial and specifically tailored algorithms were known to break this runtime barrier. We confirm the theoretical results by simulations. Finally, we show that the approach is not overfitted to LeadingOnes by testing
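The diversity-preserving idea can be illustrated with a toy (2+1)-style GA on LeadingOnes. This is a simplified stand-in for the paper's (2+1)-DEGA, not its exact algorithm: the selection rule below merely keeps the fittest individual and then prefers a second survivor that is *distinct* from it, so crossover keeps getting diverse parents.

```python
import random

# Toy sketch of diversity-preserving exploitation of crossover on
# LeadingOnes. Illustrative only; not the exact (2+1)-DEGA of the paper.

def leading_ones(x):
    """LeadingOnes fitness: number of leading 1-bits."""
    count = 0
    for bit in x:
        if bit != 1:
            break
        count += 1
    return count

def uniform_crossover(a, b):
    return [random.choice(pair) for pair in zip(a, b)]

def mutate(x, rate):
    return [1 - bit if random.random() < rate else bit for bit in x]

def dega_toy(n, seed=1, max_iters=100_000):
    """Return True if the all-ones optimum is found within max_iters."""
    random.seed(seed)
    pop = [[random.randint(0, 1) for _ in range(n)] for _ in range(2)]
    for _ in range(max_iters):
        child = mutate(uniform_crossover(pop[0], pop[1]), 1.0 / n)
        trio = sorted(pop + [child], key=leading_ones, reverse=True)
        if leading_ones(trio[0]) == n:
            return True
        # Diversity-preserving selection: keep the fittest individual, and
        # as second survivor prefer the fittest one distinct from it.
        distinct = [x for x in trio[1:] if x != trio[0]]
        pop = [trio[0], distinct[0] if distinct else trio[1]]
    return False

print(dega_toy(15))
```

The key design choice is in the last two lines of the loop: a conventional (2+1) GA would simply keep the two fittest individuals, which quickly makes the population degenerate; preferring a distinct second survivor keeps crossover effective.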