Autonomous Exploit Chaining via Reinforcement-Learning Agents
How agentic-AI systems compose multi-step attack paths from disparate findings - and what the security industry can learn from them.
Abstract
Modern enterprise stacks accumulate findings faster than humans can validate them. We present an agentic exploitation framework that composes multi-step attack chains from the previously isolated outputs of static application security testing (SAST), software composition analysis (SCA), dynamic application security testing (DAST), and vulnerability assessment (VA) tools. Trained on 4,712 anonymized engagement records and rewarded for end-to-end goal completion (rather than per-finding accuracy), the system achieves a measured chain-completion rate of 47.2% against representative web-application targets - a 3.6× improvement over a non-agentic baseline.
Why this matters
Detection has been a solved problem for the better part of a decade. Validation has not. The industry response has typically been to add more tools; ours is to add more intelligence to the layer that connects them.
Methodology
The agent receives a normalized finding graph and a goal embedding (e.g., "reach the user account database from an unauthenticated session"). At each step it can (a) emit a candidate exploit primitive, (b) request a deeper probe of an unverified finding, or (c) declare the chain complete. A symbolic verifier rejects unsupported transitions, and each rejection provides the negative reward signal needed to suppress hallucinated steps.
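The step loop described above can be sketched abstractly. This is a minimal illustration under stated assumptions, not the framework's actual implementation: the names FindingGraph, Action, is_supported, and step are hypothetical, the graph is reduced to sets of verified and unverified edges between abstract node labels, the goal is a node label rather than an embedding, and the reward values are placeholders chosen only to reflect "reward end-to-end completion, penalize rejected transitions".

```python
from dataclasses import dataclass, field
from enum import Enum, auto
from typing import Callable, Optional, Set, Tuple

Edge = Tuple[str, str]  # (source node, target node) in the finding graph

# Assumed reward values; the source does not state the real ones.
REJECT_PENALTY = -1.0   # verifier rejected the proposed transition
GOAL_REWARD = 1.0       # chain declared complete and verified at the goal


class ActionKind(Enum):
    EMIT_PRIMITIVE = auto()    # (a) propose the next step of the chain
    REQUEST_PROBE = auto()     # (b) ask for a deeper probe of a finding
    DECLARE_COMPLETE = auto()  # (c) claim the goal has been reached


@dataclass(frozen=True)
class Action:
    kind: ActionKind
    edge: Optional[Edge] = None  # unused for DECLARE_COMPLETE


@dataclass
class FindingGraph:
    verified: Set[Edge] = field(default_factory=set)
    unverified: Set[Edge] = field(default_factory=set)


def is_supported(graph: FindingGraph, position: str, goal: str,
                 action: Action) -> bool:
    """Symbolic check: does the graph support this transition?"""
    if action.kind is ActionKind.DECLARE_COMPLETE:
        return position == goal
    if action.edge is None:
        return False
    if action.kind is ActionKind.REQUEST_PROBE:
        return action.edge in graph.unverified
    # EMIT_PRIMITIVE: the edge must be verified and start where we are.
    return action.edge in graph.verified and action.edge[0] == position


def step(graph: FindingGraph, position: str, goal: str, action: Action,
         probe: Callable[[Edge], bool]) -> Tuple[str, float, bool]:
    """Apply one action and return (new_position, reward, done)."""
    if not is_supported(graph, position, goal, action):
        return position, REJECT_PENALTY, False
    if action.kind is ActionKind.DECLARE_COMPLETE:
        return position, GOAL_REWARD, True
    if action.kind is ActionKind.REQUEST_PROBE:
        graph.unverified.discard(action.edge)
        if probe(action.edge):  # probe outcome is supplied by the caller
            graph.verified.add(action.edge)
        return position, 0.0, False
    return action.edge[1], 0.0, False  # advance along the verified edge


if __name__ == "__main__":
    graph = FindingGraph(
        verified={("session", "app_server")},
        unverified={("app_server", "user_db")},
    )
    position, goal = "session", "user_db"
    plan = [
        Action(ActionKind.EMIT_PRIMITIVE, ("app_server", "user_db")),  # rejected
        Action(ActionKind.EMIT_PRIMITIVE, ("session", "app_server")),
        Action(ActionKind.REQUEST_PROBE, ("app_server", "user_db")),
        Action(ActionKind.EMIT_PRIMITIVE, ("app_server", "user_db")),
        Action(ActionKind.DECLARE_COMPLETE),
    ]
    for action in plan:
        position, reward, done = step(graph, position, goal, action,
                                      probe=lambda edge: True)
        print(action.kind.name, position, reward, done)
```

In this sketch intermediate steps earn nothing, so the only way to accumulate positive reward is to reach the goal through transitions the verifier accepts. That mirrors the choice to reward goal completion rather than per-finding accuracy, although the real reward shaping may differ.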
Results
Across the 312 evaluation targets we observed:
- Chain length: median of 4 primitives, maximum of 11.
- False-positive rate on declared exploits: 1.8% (versus 22.4% for the rule-based baseline).
- Time-to-validate: median 11 minutes, p95 47 minutes.
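The headline figures imply two derived quantities that are not stated directly. The short calculation below works them out, assuming that the "3.6× improvement" is a simple ratio of chain-completion rates and that the two false-positive rates are directly comparable. The function names are illustrative only.

```python
def implied_baseline_rate(agent_rate: float, improvement: float) -> float:
    """Baseline completion rate implied by a multiplicative improvement."""
    return agent_rate / improvement


def reduction_factor(baseline_rate: float, agent_rate: float) -> float:
    """How many times smaller the agent's rate is than the baseline's."""
    return baseline_rate / agent_rate


# Figures quoted in the abstract and results.
baseline_completion = implied_baseline_rate(0.472, 3.6)
fp_reduction = reduction_factor(0.224, 0.018)

print(f"implied baseline completion rate: {baseline_completion:.1%}")  # about 13.1%
print(f"false-positive reduction factor: {fp_reduction:.1f}x")         # about 12.4x
```

Under those assumptions, the non-agentic baseline completes roughly 13% of chains, and the agent's declared exploits are false positives roughly twelve times less often than the rule-based baseline's.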
Open questions
The same architecture that compresses red-team work into hours also compresses an attacker's work. Defensive deployment alongside intrusion telemetry is, in our view, no longer optional.
