AI pentesting agents are getting real, and research shows what to trust


Felipe Ruiz

Content writer and editor

June 9, 2026

10 min read

In April 2025, we published a blog post on the upside and downside of GenAI in pentesting. At that time, the conversation was still mainly about assistance: Could generative AI help pentesters interpret outputs, write payloads, accelerate reconnaissance, summarize findings, or reduce repetitive work? The answer was already yes, with caveats around overreliance, privacy, misuse, bias, and human oversight.

Since then, the research frontier has moved quickly. A more demanding question has emerged: Can AI agents perform meaningful parts of a pentest with less human intervention? Recent papers suggest that they can, under carefully engineered conditions. The best results do not come from leaving a language model alone in a terminal. They come from systems that add structure around the model: attack trees, classical planning, tool adapters, memory, validation, specialized code generation, repeated evaluation, and human review. AI pentesting is becoming a systems engineering problem, not just a model benchmark.

The model is only one part of the agent

One sign of maturity in this field is a more precise definition of the "AI pentester." It is rarely just a model. It is usually a harness around a model: prompts, tools, state, permissions, retries, validators, logs, and reporting.

We made a similar point in our posts on Claude Code engineering and Claude Mythos and Project Glasswing: useful agents depend on the surrounding system, including source access, deterministic tools, containers, judging, human triage, and coordinated disclosure. The same premise is evident throughout recent AI-pentesting research.

A useful example is a paper on structured attack trees. In this context, a structured attack tree is a predefined map of possible pentesting steps, built from MITRE ATT&CK tactics, techniques, and procedures. Instead of asking the model to invent the next task from scratch, the system asks it to proceed through controlled steps: summarize the tool's output, update what has been learned, mark tasks as complete or incomplete, and choose the next valid action from the tree.

Pentesting agents often fail in ordinary ways: they repeat steps, follow irrelevant paths, invent procedures, or lose track of previous tests. The structured-tree approach aims to reduce those failures by giving the model rails to follow. The authors evaluated the approach on 10 HackTheBox machines comprising 103 subtasks and reported that the guided pipeline completed 71.8%, 72.8%, and 78.6% of the subtasks with Llama-3-8B, Gemini-1.5, and GPT-4, respectively. The self-guided baseline completed 13.5%, 16.5%, and 75.7% with the same three models.

The result is most revealing for the smaller models. GPT-4 was already strong in both settings, but Llama-3-8B and Gemini-1.5 improved dramatically when the attack tree constrained their reasoning. The broader lesson is clear: capability is not only inside the model; methodology can turn the same model into a more reliable workflow component.
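To make the idea of "rails" concrete, here is a minimal sketch of how a harness might restrict a model to valid steps in a predefined tree. It is not the paper's implementation: the `Task` structure, the task names, and the fallback rule in `choose_next` are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    """One node of a hypothetical attack tree (names are illustrative)."""
    name: str
    requires: list[str] = field(default_factory=list)  # tasks that must be done first
    done: bool = False


def valid_next_tasks(tree: dict[str, Task]) -> list[str]:
    """Return incomplete tasks whose prerequisites are all complete."""
    return [
        t.name
        for t in tree.values()
        if not t.done and all(tree[r].done for r in t.requires)
    ]


def choose_next(tree: dict[str, Task], model_suggestion: str) -> str:
    """Accept the model's suggestion only if the tree allows it.

    Falls back to the first valid task otherwise (an assumed policy).
    """
    options = valid_next_tasks(tree)
    if not options:
        raise ValueError("no valid task remains in the tree")
    return model_suggestion if model_suggestion in options else options[0]


tree = {
    t.name: t
    for t in [
        Task("port_scan"),
        Task("service_enumeration", requires=["port_scan"]),
        Task("exploit_service", requires=["service_enumeration"]),
    ]
}

# The model proposes skipping ahead; the tree redirects it to a valid step.
first = choose_next(tree, "exploit_service")
tree[first].done = True
second = choose_next(tree, "service_enumeration")
print(first, second)
```

The point of the sketch is that the model never gets to invent a step: it can only pick among tasks the tree currently allows, which is one plausible way to prevent the repeated and irrelevant actions described above.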

Wang and colleagues apply the same idea through a stricter planning system in CheckMate, an automated pentesting agent that separates the workflow into three roles: a planner, an executor, and a perceptor. The planner decides what to try next; the LLM-powered executor carries out bounded tasks; and the perceptor translates tool outputs back into structured facts. CheckMate gives long-horizon planning to a classical planner, not to the language model. Classical planning represents actions as preconditions and effects, so each step is attempted only when the required conditions hold, and the result updates the system's view of the target.

In the authors' Vulhub-based evaluation, CheckMate improved benchmark success rates by more than 20% over Claude Code. For tasks both systems solved, it also reduced the average cost by 53% and the average time by 54%. The operational takeaway is that a better harness can make the same class of AI-driven work more reliable, faster, and cheaper.
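The precondition-and-effect representation can be sketched in a few lines. The example below is a toy forward search in the spirit of classical planning, not CheckMate's planner; the fact names, action names, and the breadth-first strategy are assumptions chosen for illustration.

```python
from collections import deque
from typing import NamedTuple, Optional


class Action(NamedTuple):
    name: str
    preconditions: frozenset  # facts that must hold before the action
    effects: frozenset        # facts that hold after the action


def plan(initial: frozenset, goal: str, actions: list) -> Optional[list]:
    """Breadth-first forward search over sets of known facts."""
    queue = deque([(initial, [])])
    seen = {initial}
    while queue:
        facts, steps = queue.popleft()
        if goal in facts:
            return steps
        for action in actions:
            # An action is attempted only when its preconditions hold.
            if action.preconditions <= facts:
                new_facts = facts | action.effects
                if new_facts not in seen:
                    seen.add(new_facts)
                    queue.append((new_facts, steps + [action.name]))
    return None  # goal unreachable from the known facts


# Hypothetical actions and facts.
ACTIONS = [
    Action("run_exploit", frozenset({"service_version_known"}), frozenset({"shell_obtained"})),
    Action("scan_ports", frozenset({"target_in_scope"}), frozenset({"open_ports_known"})),
    Action("fingerprint_service", frozenset({"open_ports_known"}), frozenset({"service_version_known"})),
]

steps = plan(frozenset({"target_in_scope"}), "shell_obtained", ACTIONS)
print(steps)
```

In a design like the one described, a perceptor would add facts to the state after each tool run, and the planner would re-plan from the updated facts. The language model would only execute the bounded step it is handed.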

APT-Agent offers another mechanism-level example. Its initial observation is practical: LLM agents often fail due to minor operational errors. They may invent a Metasploit module name, repeat a failed step, forget that a service was already tested, or lose the order of actions from reconnaissance to exploitation.

APT-Agent addresses those failures with two support modules. A rectifier checks generated Metasploit module names against a curated database, reducing the likelihood of hallucinated tool paths. A context manager stores compact, stage-aware execution history, so the agent can carry useful state without flooding the prompt with raw logs. In its evaluation on Metasploitable 2, an intentionally vulnerable lab machine rather than an enterprise network, APT-Agent reports an 84.3% end-to-end success rate, compared with 48.6% for Script Kiddie and 18.6% for PentestGPT. Removing both the rectifier and the context module reduced success from 59/70 to 38/70 (84.3% to 54.3%).
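A rectifier of this kind can be approximated with fuzzy string matching against a known list. The sketch below is an assumption about how such a check might look, not APT-Agent's code: the three-entry `KNOWN_MODULES` list stands in for a curated database, and the `cutoff` value is arbitrary.

```python
import difflib
from typing import Optional

# Stand-in for a curated database of module paths (illustrative subset).
KNOWN_MODULES = [
    "exploit/unix/ftp/vsftpd_234_backdoor",
    "exploit/multi/samba/usermap_script",
    "exploit/unix/irc/unreal_ircd_3281_backdoor",
]


def rectify(name: str, known: list = KNOWN_MODULES, cutoff: float = 0.8) -> Optional[str]:
    """Return a known module path for a model-generated name, or None.

    Exact matches pass through; near misses are mapped to the closest
    known entry; anything else is rejected instead of being executed.
    """
    if name in known:
        return name
    matches = difflib.get_close_matches(name, known, n=1, cutoff=cutoff)
    return matches[0] if matches else None


corrected = rectify("exploit/unix/ftp/vsftpd_backdoor")  # near miss
rejected = rectify("auxiliary/totally/invented")         # no plausible match
print(corrected, rejected)
```

Rejecting a name outright, instead of guessing, keeps a hallucinated path from reaching the tool. A real system would still need to decide how the agent recovers after a rejection.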

Together, these systems show three ways to make LLM pentesting less fragile: guide the model through a known attack path, move long-horizon planning into a stricter planning system, and add correction or memory modules. The common lesson is architectural.

From CTF dominance to live enterprise testing

Some results show how quickly AI agents can optimize for standardized security challenges. Mayoral-Vilches and colleagues report strong performance by CAI (Cybersecurity AI) across several 2025 CTF competitions, including Neurogrid, Dragos OT CTF, Cyber Apocalypse, UWSP Pointer Overflow, and HTB AI vs Humans.

CTFs compress security tasks into measurable objectives. In Jeopardy-style CTFs, participants usually solve independent challenges and capture "flags" that prove success. CAI's reported results are strong: 41/45 flags at Neurogrid, Rank 1 around hours 7–8 in Dragos OT CTF, and 19/20 flags in HTB AI vs Humans. The paper also reports reducing the estimated cost of 1B-token inference from $5,940 to $119.
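For readers who want the reported figures as rates, the arithmetic is straightforward. The snippet below only restates the numbers quoted above; the variable names are illustrative.

```python
# Figures as quoted above; everything else is plain arithmetic.
neurogrid_rate = 41 / 45       # share of flags captured at Neurogrid
htb_rate = 19 / 20             # share of flags captured in HTB AI vs Humans

cost_before = 5940.0           # USD per 1B tokens, before
cost_after = 119.0             # USD per 1B tokens, after
cost_reduction = 1 - cost_after / cost_before
cost_factor = cost_before / cost_after

print(f"Neurogrid: {neurogrid_rate:.1%}, HTB AI vs Humans: {htb_rate:.1%}")
print(f"Cost reduction: {cost_reduction:.1%} (about {cost_factor:.0f}x cheaper)")
```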

Those numbers show two things. First, AI agents can already be competitive in environments where objectives are clear, feedback is fast, and success is easy to verify. Second, architecture affects whether this automation can run at scale; a system that performs well but is too expensive to repeat has limited operational value.

The caveat is just as important. CTFs are controlled competitions, not enterprise pentests. They measure speed, tool use, pattern recognition, and challenge solving, but they do not fully capture production constraints such as business context, ambiguous scope, chained findings, false positives, stakeholder communication, or remediation. The CAI authors acknowledge this tension and argue that Jeopardy-style CTFs may become weaker proxies for real security skill as AI agents optimize for them. They also describe a "last 5%" problem: some challenges were resistant to automation because they required contextual knowledge, cultural cues, hidden dependencies, or interpretation beyond routine technical execution.

A stronger test is to move from CTF-style challenges to a live environment with real operational constraints. The ARTEMIS study evaluates a multi-agent pentesting scaffold designed to coordinate reconnaissance, exploitation, triage, and reporting across a large target environment. The authors compare ARTEMIS against 10 cybersecurity professionals and 6 existing AI agents on a live university network of about 8,000 hosts across 12 subnets.

ARTEMIS placed second overall, found 9 valid vulnerabilities, achieved an 82% valid submission rate, and outperformed 9 of 10 human participants. That does not make the system equivalent to a professional pentester, but it does show that a specialized scaffold can produce useful findings in a more realistic setting, under a defined scope and monitoring.

The details are more useful than the ranking. ARTEMIS was strong in enumeration, parallel exploration, and cost, but weak in GUI-based tasks. For instance, 80% of human participants found a TinyPilot RCE that the AI agent missed while focusing on misconfigurations. However, ARTEMIS found an older iDRAC vulnerability by interacting through curl -k, where humans gave up because browsers refused to load the interface due to outdated HTTPS ciphers. CLI-native, parallel systems may notice paths humans skip, while humans remain stronger at visual interfaces, contextual judgment, and strategic pivots.

Specialized models are part of the stack

Not every relevant advance in AI pentesting is a full autonomous agent. RedShell focuses on a narrower building block: locally fine-tuned models that generate offensive PowerShell snippets for Windows pentesting. Future systems may combine components for planning, target interaction, validation, and script generation.

The researchers built and extended a dataset of offensive PowerShell samples, aligned it with MITRE ATT&CK tactics, and fine-tuned open-weight models such as Qwen2.5-7B, Qwen2.5-Coder-7B-Instruct, and Llama3.1-8B. In their evaluation, the generated samples had fewer than 10% parse errors, and the code was substantially equivalent to reference examples according to automated similarity metrics. A companion RedShell paper adds functional testing and reports that the fine-tuned Qwen2.5-Coder achieved 100% correctness across produced samples in a controlled simulation, matching the functional effectiveness of ChatGPT-3.5 while aligning more closely with expected offensive strategies.

Those results support specialization rather than autonomous pentesting. RedShell does not discover targets, determine an attack path, or validate a full engagement on its own. Its relevance is that narrow, locally deployable models may become useful modules inside broader pentesting workflows. Local specialization can reduce third-party data exposure, but access control, allowed use, auditability, and misuse prevention become deployment decisions.

ExploitGym, a recent benchmark, makes a related but sharper distinction: identifying a vulnerability is not the same as turning it into a working exploit. Agents are not asked to discover bugs from scratch; they are given a vulnerability trigger (a starting input that exposes a known weakness) and must extend it into an exploit that achieves unauthorized code execution. The benchmark contains 898 real-world vulnerability instances across userspace software, Google’s V8 JavaScript engine, and the Linux kernel. Its validation process checks both the capture of the flag and whether the intended vulnerability was actually used.

Under trusted-access research conditions, with deployment-time safeguards disabled to measure capability boundaries, Claude Mythos Preview with Claude Code solved 157 instances, and GPT-5.5 with Codex CLI solved 120 under a two-hour timeout. The same paper shows why controls and validation must be discussed alongside capability. With OpenAI’s default safety filters enabled for GPT-5.5, all exploit attempts under default prompting were blocked, and 88.2% were blocked before any tool call. The benchmark also found cases in which agents captured a flag via a different vulnerability than the one being tested. That distinction is crucial: a result can look successful while measuring the wrong behavior.
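The difference between "captured a flag" and "exploited the intended vulnerability" can be expressed as a two-part check. The sketch below is a hypothetical validator, not ExploitGym's: the `Attempt` fields, the vulnerability identifiers, and the three outcome labels are assumptions.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Attempt:
    """Hypothetical record of one exploit attempt."""
    captured_flag: Optional[str]
    triggered_vulns: set = field(default_factory=set)  # IDs observed during the run


def classify(attempt: Attempt, expected_flag: str, intended_vuln: str) -> str:
    """Count a run as solved only if the flag AND the intended bug both match."""
    if attempt.captured_flag != expected_flag:
        return "failed"
    if intended_vuln not in attempt.triggered_vulns:
        return "flag_via_unintended_path"
    return "solved"


solved = classify(Attempt("FLAG{a1}", {"VULN-1"}), "FLAG{a1}", "VULN-1")
side_door = classify(Attempt("FLAG{a1}", {"VULN-9"}), "FLAG{a1}", "VULN-1")
failed = classify(Attempt(None), "FLAG{a1}", "VULN-1")
print(solved, side_door, failed)
```

Keeping the middle category separate, instead of folding it into success or failure, is what lets an evaluation report that a run looked successful while measuring the wrong behavior.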

Broad evaluations are less flattering than single-system demos

A broader reality check comes from "Hackers or Hallucinators?," a systematization and empirical study of LLM-based automated penetration testing. Rather than presenting a single new agent, the authors review architecture, planning, memory, execution, external knowledge, and benchmarks, then compare 13 open-source AutoPT (automated penetration testing) frameworks with two baselines. The scale is unusual: more than 10 billion tokens, over 1,500 execution logs, and manual review by more than 15 cybersecurity researchers over four months.

The results challenge several common assumptions. More agents did not automatically mean better performance: three single-agent designs ranked in the top six on Easy and Medium tasks. Specialized AutoPT frameworks also did not always beat simpler baselines; Kimi CLI and Claude Code scored 72 and 69, surpassing most of the open-source frameworks tested. A system can look sophisticated and still perform worse than a simpler agent with better execution discipline.

The breakdowns are just as revealing. In chained-exploitation samples, 83.3% stalled before completing a multi-vulnerability chain. In CVE exploitation scenarios, approximately 56.7% of samples mapped targets to CVE identifiers but failed to build effective payloads. Recognizing a possible vulnerability is not the same as validating exploitability or proving impact.

The paper also warns against two common shortcuts. Retrieval-augmented generation did not reliably help; in some cases, external knowledge bases produced negative returns. Larger tool pools did not positively correlate with task success either. Adding more documents, tools, and subagents can increase noise if the system cannot select, verify, and remember correctly.

This study provides the field with a necessary correction. AI pentesting is progressing, but many systems remain brittle: retrieval can mislead, tool access does not guarantee sound judgment, multi-agent coordination can lose information, and hallucinated flags can cause false completions. A credible evaluation should focus less on how ambitious the architecture sounds and more on what the system proves.

Evaluation must move from tasks to findings

The next evaluation step is to move from task success to finding quality. Conde and colleagues argue that many AI-pentesting benchmarks still reward predefined outcomes: capturing a CTF flag, reproducing remote code execution, following an expected trajectory, or solving a known exploitation task. Those measurements are useful, but they are narrower than the question AppSec teams actually care about: did the system produce valid, actionable security findings?

The authors propose evaluating agents at the finding level: comparing reports against structured ground truth, matching findings by meaning, deduplicating overlapping results, and measuring recall, precision, runtime, and monetary cost. The protocol also treats ground truth as something that can evolve; an unmatched agent finding may be a false positive, but it may also be a real vulnerability missing from the reference set.
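A simplified version of finding-level scoring is sketched below. It assumes that two findings "mean the same thing" when they share a weakness class and a normalized location. That is a much cruder matching rule than a semantic comparison, and the field names `cwe` and `location` are hypothetical.

```python
def normalize(finding: dict) -> tuple:
    """Reduce a finding to a comparable key (assumed matching rule)."""
    return (
        finding["cwe"].strip().upper(),
        finding["location"].strip().lower().rstrip("/"),
    )


def score(reported: list, ground_truth: list) -> dict:
    """Deduplicate reported findings, then compute precision and recall."""
    reported_keys = {normalize(f) for f in reported}  # the set drops duplicates
    truth_keys = {normalize(f) for f in ground_truth}
    matched = reported_keys & truth_keys
    return {
        "precision": len(matched) / len(reported_keys) if reported_keys else 0.0,
        "recall": len(matched) / len(truth_keys) if truth_keys else 0.0,
        # Unmatched findings need review: false positive, or missing ground truth?
        "needs_review": sorted(reported_keys - truth_keys),
    }


truth = [
    {"cwe": "CWE-89", "location": "/login"},
    {"cwe": "CWE-79", "location": "/search"},
    {"cwe": "CWE-22", "location": "/download"},
    {"cwe": "CWE-352", "location": "/settings"},
]
report = [
    {"cwe": "cwe-89", "location": "/login/"},   # matches after normalization
    {"cwe": "CWE-89", "location": "/login"},    # duplicate of the line above
    {"cwe": "CWE-79", "location": "/search"},
    {"cwe": "CWE-918", "location": "/fetch"},   # not in the reference set
]

result = score(report, truth)
print(result)
```

The `needs_review` list reflects the point about evolving ground truth: an unmatched finding is routed to a reviewer instead of being counted automatically as wrong.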

This framing is closer to how AppSec teams assess the value of pentesting: valid vulnerabilities, clear impact, deduplication, severity, and support for remediation. A benchmark that only asks whether an agent reached a predefined goal can miss those requirements.

One of the paper's most useful ideas is cumulative evaluation. Repeated runs can reveal findings that a single run misses, but they can also accumulate false positives. In the authors' experiments, Strix-Sonnet nearly doubled recall while remaining comparatively precise, whereas PentAGI-Sonnet lost precision as false positives accumulated. For continuous pentesting, the question is whether repeated execution improves coverage faster than it increases noise.
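The trade-off between coverage and noise across repeated runs can be tracked with a running union of findings. The sketch below uses made-up finding identifiers for two hypothetical agents; it illustrates the idea and does not reproduce the authors' protocol or data.

```python
def cumulative_curve(runs: list, truth: set) -> list:
    """Return (recall, precision) after each run, pooling all findings so far."""
    if not truth:
        raise ValueError("ground truth must not be empty")
    seen = set()
    curve = []
    for run in runs:
        seen |= run
        true_positives = len(seen & truth)
        recall = true_positives / len(truth)
        precision = true_positives / len(seen) if seen else 1.0
        curve.append((recall, precision))
    return curve


truth = {"A", "B", "C", "D"}

# Hypothetical agent 1: each run adds valid findings and no noise.
steady = cumulative_curve([{"A", "B"}, {"B", "C"}, {"A", "D"}], truth)

# Hypothetical agent 2: recall grows slowly while false positives pile up.
noisy = cumulative_curve([{"A", "x"}, {"B", "y", "z"}, {"A", "w"}], truth)

print(steady)
print(noisy)
```

Plotting both curves against the number of runs (or against cumulative cost) shows whether another run is worth paying for.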

Reliability adds another layer. A 400-run study tested four models 100 times each against the same fixed honeypot, a controlled target with known vulnerable services. Full exploitation rates varied widely: Gemini reached all three services in 85/100 runs, Claude in 61/100, GPT-4o-mini in 56/100, and qwen2.5-coder in 25/100. The study also found large variation in strategy: GPT-4o-mini produced 98 unique attack strategies across 100 runs, compared with 69 for qwen and 48 for Gemini.
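Success counts from repeated runs are easier to compare with an uncertainty range attached. The sketch below applies the Wilson score interval, a standard choice for binomial proportions, to the counts quoted above. Using this interval and a 95% level is an assumption made here, not something the study is stated to have done.

```python
import math


def wilson_interval(successes: int, runs: int, z: float = 1.96) -> tuple:
    """Wilson score interval for a success rate (z=1.96 is roughly 95%)."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    p = successes / runs
    denom = 1 + z**2 / runs
    center = (p + z**2 / (2 * runs)) / denom
    half = z * math.sqrt(p * (1 - p) / runs + z**2 / (4 * runs**2)) / denom
    return center - half, center + half


# Full-exploitation counts out of 100 runs, as quoted above.
reported = {"Gemini": 85, "Claude": 61, "GPT-4o-mini": 56, "qwen2.5-coder": 25}
intervals = {model: wilson_interval(count, 100) for model, count in reported.items()}

for model, (low, high) in intervals.items():
    print(f"{model}: {reported[model]}/100, interval {low:.3f} to {high:.3f}")
```

Under these assumptions, the intervals for 61/100 and 56/100 overlap. This suggests that a gap of that size between two agents should be read cautiously even with 100 runs each.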

Reliability is not only a model property. Claude's results were heavily affected by provider-side overloaded errors, which truncated many runs and changed measured performance. For deployed agents, API behavior, rate limits, timeouts, logging, and retry handling become part of the security system being evaluated.
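One way to keep provider problems from being misread as agent weakness is to classify each run before computing rates. The sketch below assumes a hypothetical run-log format with `provider_error`, `services_exploited`, and `services_total` fields; none of these names come from the study.

```python
from collections import Counter
from enum import Enum


class Outcome(Enum):
    SUCCESS = "success"
    AGENT_FAILURE = "agent_failure"   # run completed, target not fully exploited
    INFRA_FAILURE = "infra_failure"   # run truncated by the provider or network


def classify_run(run: dict) -> Outcome:
    """Label a run from a hypothetical log record."""
    if run.get("provider_error"):
        return Outcome.INFRA_FAILURE
    if run["services_exploited"] == run["services_total"]:
        return Outcome.SUCCESS
    return Outcome.AGENT_FAILURE


def summarize(runs: list) -> dict:
    """Report success rates with and without infrastructure failures."""
    counts = Counter(classify_run(r) for r in runs)
    completed = counts[Outcome.SUCCESS] + counts[Outcome.AGENT_FAILURE]
    return {
        "raw_success_rate": counts[Outcome.SUCCESS] / len(runs) if runs else None,
        "success_rate_completed_runs": (
            counts[Outcome.SUCCESS] / completed if completed else None
        ),
        "infra_failures": counts[Outcome.INFRA_FAILURE],
    }


runs = [
    {"services_exploited": 3, "services_total": 3},
    {"services_exploited": 3, "services_total": 3},
    {"services_exploited": 1, "services_total": 3},
    {"services_exploited": 0, "services_total": 3, "provider_error": "overloaded"},
]
summary = summarize(runs)
print(summary)
```

Reporting both rates matters. The raw rate describes what an operator would actually experience, while the completed-run rate is closer to what the model can do when the infrastructure holds.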

A single successful run is not a reliable pentest, and a single failed run is not a complete evaluation. AI pentesting agents are stochastic systems; they need repeated measurements, failure classification, run logs, finding-level validation, and cost curves. The right question is not only whether an agent can succeed, but how often it succeeds, what it misses, what it invents, and how expensive it is to trust.

The market is moving in the same direction

These evaluation questions are becoming practical because AI pentesting is no longer only a research topic. Product announcements and open-source projects now use similar language: autonomous testing, application context, proof of exploitability, audit trails, remediation, and retesting. Those claims are market signals, not independent evidence of effectiveness.

Recent examples include a cloud-provider announcement describing on-demand autonomous penetration testing with vulnerability validation, reproduction steps, and remediation suggestions; an AppSec market announcement positioning continuous offensive testing around application context, model orchestration, validation, and audit controls; an open-source AI-pentesting repository focused on source-code analysis, attack-path identification, and proof-by-exploitation; and an AI-pentesting product announcement centered on business context, exploit validation, and rerunnable tests.

The common direction is clear, even without accepting vendor claims at face value: AI pentesting is being packaged around continuous validation and proof-oriented workflows. Buyers should ask how findings are validated, how false positives are controlled, how runs are logged, how human review is incorporated into the workflow, and how remediation is confirmed.

The likely future is not one in which a monolithic AI pentester replaces a security team. The direction is toward layered AI-pentesting workflows, where planners, tool users, specialized models, validators, logs, triage, human approvals, and remediation steps work together. Autonomy alone does not make that workflow useful. The real value comes from proving findings, preserving context for review, and helping teams fix what the system discovers.

A better question for buyers and builders

The least useful procurement question is whether a tool "uses AI." The better question is what the system can prove under real constraints.

A serious AI-pentesting workflow should explain what evidence it produces, how it reproduces findings, and how it separates discovery from exploitation. It should show how scope, credentials, tools, and state are controlled; how false positives, duplicates, and hallucinated successes are handled; and what happens when the model refuses, fails, loops, repeats a step, or invents an identifier. It should also make clear which actions require human approval and how results connect to remediation and retesting.
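Two of those controls, scope enforcement and human approval, can be illustrated with a small gate that every proposed action must pass. The network range, the action categories, and the return strings below are hypothetical. A production system would also need authentication, logging, and a real approval workflow.

```python
import ipaddress
from typing import Optional

# Hypothetical engagement scope and approval policy.
ALLOWED_NETWORKS = [ipaddress.ip_network("10.0.0.0/24")]
NEEDS_APPROVAL = {"exploit", "credential_use"}


def authorize(action_type: str, target_ip: str, approved_by: Optional[str] = None) -> str:
    """Decide whether an agent-proposed action may run."""
    address = ipaddress.ip_address(target_ip)
    if not any(address in network for network in ALLOWED_NETWORKS):
        return "deny: out of scope"
    if action_type in NEEDS_APPROVAL and approved_by is None:
        return "hold: human approval required"
    return "allow"


decisions = [
    authorize("scan", "10.0.0.5"),
    authorize("exploit", "10.0.0.5"),
    authorize("exploit", "10.0.0.5", approved_by="lead_pentester"),
    authorize("scan", "192.168.1.1"),
]
print(decisions)
```

Placing the check outside the model means a refusal, loop, or invented target cannot widen the scope on its own.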

For builders, the same questions become design requirements. Agentic pentesting systems need logs, validation layers, failure classification, repeated-run measurements, and cost visibility. They also need boundaries: the model should not be responsible for every planning, execution, memory, and approval decision. The stronger systems emerging from recent research are not just prompts wrapped around security tools; they are engineered workflows around probabilistic models.

AI pentesting agents are becoming real enough to be evaluated seriously. They are also uneven enough that demos, leaderboards, and product claims are not enough. Progress in this space should be measured by valid findings, reproducible evidence, controlled autonomy, and remediation outcomes.

That leads to the next question. If security teams begin to rely on agentic systems for pentesting, remediation, code review, and AI red teaming, how do we secure those agents themselves? Governance, access control, model safety, logging, ethics, and responsible disclosure are no longer side topics; they become part of the security architecture. The next post will look at that governance problem in more detail: "Before trusting AI security agents, test and govern them" (to be published soon).
