Autonomous Exploit Chaining via Reinforcement-Learning Agents

How agentic-AI systems compose multi-step attack paths from disparate findings, and what the security industry can learn from them.

Ares Cyber Engineering Team
March 18, 2026

Abstract

Modern enterprise stacks accumulate findings faster than humans can validate them. We present an agentic exploitation framework that composes multi-step attack chains from previously isolated SAST, SCA, DAST, and VA outputs. Trained on 4,712 anonymized engagement records and rewarded for end-to-end goal completion rather than per-finding accuracy, the system achieves a measured chain-completion rate of 47.2% against representative web-application targets, a 3.6× improvement over a non-agentic baseline.

Why this matters

Detection has been a solved problem for the better part of a decade. Validation has not. The industry response has typically been to add more tools; ours is to add more intelligence to the layer that connects them.

Methodology

The agent receives a normalized finding graph and a goal embedding (e.g. "reach the user account database from an unauthenticated session"). At each step it can: (a) emit a candidate exploit primitive, (b) request a deeper probe of an unverified finding, or (c) declare the chain complete. A symbolic verifier rejects unsupported transitions, providing the negative reward signal needed to suppress hallucination.
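The step loop described above can be sketched in a few lines. This is a minimal illustration, not our production agent: the `Finding` fields (`requires`, `grants`, `verified`), the capability labels, the always-succeeding probe, and the reward values (-1.0, 0.0, 1.0) are all assumptions made for the example. It shows only how a symbolic verifier can gate each of the three actions and return a negative reward for an unsupported transition.

```python
from dataclasses import dataclass, field
from enum import Enum, auto


class Action(Enum):
    EMIT = auto()      # (a) emit a candidate primitive
    PROBE = auto()     # (b) request a deeper probe of an unverified finding
    COMPLETE = auto()  # (c) declare the chain complete


@dataclass(frozen=True)
class Finding:
    """Hypothetical normalized finding: what it requires and what it grants."""
    name: str
    requires: frozenset
    grants: frozenset
    verified: bool = False


@dataclass
class ChainState:
    capabilities: set
    chain: list = field(default_factory=list)


def verify(state, action, finding, goal):
    """Symbolic check: True only if the transition is supported by the state."""
    if action is Action.COMPLETE:
        return goal in state.capabilities
    if action is Action.PROBE:
        return finding is not None and not finding.verified
    # EMIT: the finding must be verified and its preconditions already held.
    return (finding is not None and finding.verified
            and finding.requires <= state.capabilities)


def step(state, action, finding, goal):
    """Apply one action; return (reward, finding). Reward values are illustrative."""
    if not verify(state, action, finding, goal):
        return -1.0, finding  # rejected transition: negative reward
    if action is Action.PROBE:
        # Stand-in for a real probe; in this sketch it always confirms the finding.
        return 0.0, Finding(finding.name, finding.requires, finding.grants, True)
    if action is Action.EMIT:
        state.capabilities |= finding.grants
        state.chain.append(finding.name)
        return 0.0, finding
    return 1.0, finding  # COMPLETE accepted: end-to-end goal reached
```

In this sketch, emitting an unverified finding or declaring completion before the goal capability is held both return the negative reward, which is the behavior the verifier is meant to enforce. Only an accepted COMPLETE action is rewarded positively, mirroring the end-to-end objective.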

Results

Across the 312 evaluation targets we observed:

Chain length: median of 4 primitives, maximum of 11.
False-positive rate on declared exploits: 1.8% (versus 22.4% for the rule-based baseline).
Time-to-validate: median 11 minutes, p95 47 minutes.
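
For readers who want the comparisons as ratios, the short calculation below derives two figures from the numbers already reported. It assumes that the 3.6× improvement quoted in the abstract is a simple ratio of chain-completion rates, which the text does not state explicitly; the function names are ours, chosen for illustration.

```python
def implied_baseline(agent_rate: float, improvement: float) -> float:
    """Baseline rate implied by an agent rate and a multiplicative improvement."""
    return agent_rate / improvement


def reduction_factor(baseline_rate: float, agent_rate: float) -> float:
    """How many times lower the agent's rate is than the baseline's."""
    return baseline_rate / agent_rate


# Chain-completion: 47.2% at 3.6x over the non-agentic baseline (assumed ratio).
print(f"{implied_baseline(0.472, 3.6):.1%}")     # 13.1%

# False positives: 22.4% (rule-based baseline) versus 1.8% (agent).
print(f"{reduction_factor(0.224, 0.018):.1f}x")  # 12.4x
```

Under that assumption, the non-agentic baseline completes roughly 13% of chains, and the false-positive rate on declared exploits is about 12 times lower than the rule-based baseline's.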

Open questions

The same architecture that compresses red-team work into hours also compresses an attacker's. Defensive deployment alongside intrusion telemetry is, in our view, no longer optional.