Before trusting AI security agents, test and govern them


In the previous post, we looked at recent research on AI pentesting agents. The main lesson was that capability is increasing, but the strongest results do not come from giving a model a terminal and hoping it behaves like a pentester. They come from workflows built around the model: planning, memory, tools, validators, logs, repeated runs, and human review.
That same conclusion leads to a second concern. If agentic systems become part of security work, then those agents also become systems that need security testing. The question is no longer only whether an agent can find SQL injection, generate a payload, or summarize a report. It is also whether the agent adheres to scope, resists prompt injection, limits tool use, preserves audit trails, handles sensitive data safely, avoids hallucinated success, and escalates risky actions to humans.
The more useful the agent becomes, the more consequential its controls become. The next phase of AI security includes both sides of the problem: using AI for pentesting, and pentesting the agentic systems that security teams may begin to trust.
Agentic AI becomes a new attack surface
A recent paper, "Penetration Testing of Agentic AI," makes this shift explicit. The authors tested five models—Claude 3.5 Sonnet, Gemini 2.5 Flash, GPT-4o, Grok 2, and Nova Pro—across two agent frameworks: AutoGen and CrewAI. Their testbed was a seven-agent university information management system with agents for advising, finance, registration, career services, student support, and campus information. The attacks included prompt injection, SSRF, tool misuse, code execution, SQL injection, and privilege escalation. In other words, the study was not only testing whether a model could answer safely in chat; it was testing how a tool-using agent behaved when malicious instructions tried to steer its actions.
One useful metric in the paper is the refusal rate: the proportion of malicious requests the agent refused or blocked rather than carried out. AutoGen showed a 52.3% refusal rate, while CrewAI showed 30.8%; the spread across models was narrower, from 46.2% for Nova Pro to 38.5% for Claude and Grok 2. That comparison points to a practical lesson: the framework can change how the same kind of agent responds to risk. In agentic AI, security behavior means more than the final answer. It includes whether the system delegates a task, invokes a tool, exposes data, rejects the request, fabricates output, logs the attempt, or allows the request to proceed.
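The refusal rate is easy to compute once every attack attempt has a recorded outcome. The following is a minimal sketch, not the paper's tooling: the outcome labels (`refused`, `executed`, `fabricated`) and the sample data are assumptions made for illustration.

```python
from collections import Counter

# Hypothetical outcome label for illustration; the paper's taxonomy may differ.
REFUSED = "refused"


def refusal_rate(outcomes: list[str]) -> float:
    """Return the share of malicious requests the agent refused or blocked."""
    if not outcomes:
        raise ValueError("at least one outcome is required")
    counts = Counter(outcomes)
    return counts[REFUSED] / len(outcomes)


# Made-up sample: four attempts, two of them refused.
sample = ["refused", "executed", "refused", "fabricated"]
print(f"{refusal_rate(sample):.1%}")  # prints 50.0%
```

Keeping the non-refusal outcomes separate matters: a request that was executed and a request whose result was fabricated both count against the refusal rate, but they call for different fixes.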
The authors also identify "hallucinated compliance": cases in which an agent appears to follow a request but fabricates the result rather than clearly refusing or actually executing the requested action. That behavior is risky because it can create a false audit trail. A dashboard may record "task completed" even though no tool actually produced the claimed output, and a natural-language answer may sound plausible even if it doesn't reflect what happened in the environment. Agent logs, therefore, need to show more than the final response. They should preserve prompts, tool calls, permissions, errors, outputs, and the evidence behind the conclusion.
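One way to make that requirement concrete is to compare what the agent claims with what its tools actually did. The sketch below is a simplified illustration under stated assumptions: the `AuditRecord` and `ToolCall` structures and the `"completed"` status string are hypothetical, not a format used by the paper or by any specific framework.

```python
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    tool: str
    succeeded: bool
    output: str


@dataclass
class AuditRecord:
    prompt: str
    final_response: str
    claimed_status: str  # what the agent says happened, e.g. "completed"
    tool_calls: list[ToolCall] = field(default_factory=list)


def lacks_evidence(record: AuditRecord) -> bool:
    """Flag records that claim completion without any successful tool call."""
    if record.claimed_status != "completed":
        return False
    return not any(call.succeeded for call in record.tool_calls)


# Hypothetical record: the agent reports success but never called a tool.
record = AuditRecord(
    prompt="Export the enrollment report",
    final_response="Done. The report was exported.",
    claimed_status="completed",
)
print(lacks_evidence(record))  # prints True
```

A check like this does not prove that a successful tool call produced the claimed result, but it catches the simplest form of hallucinated compliance: a "completed" status with nothing behind it.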
Another example involved an agent writing code and attempting to run it against a cloud metadata endpoint. The attempt failed because of the controlled test environment, not because the model refused. The lesson does not require operational detail: in agentic systems, the deployment context can determine whether unsafe behavior remains contained or escalates into an incident. Tool access, network permissions, cloud protections, identity boundaries, and runtime isolation are part of the agent's security posture.
The parallel with AI pentesting is direct. The harness shapes how the model plans, remembers, and uses tools. In agentic AI security, the same surrounding system also defines which tools are exposed, how agents communicate with one another, where data boundaries are drawn, and how unsafe behavior is blocked or logged. Traditional AppSec controls still apply: least privilege, parameter validation, network restrictions, containerization, approval gates, and traceable logging become more valuable when a model can act through them.
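Several of those controls can sit in a thin layer between the model and its tools. The following is a minimal sketch, assuming a hypothetical `fetch_page` tool and a placeholder host (`docs.example.com`); it combines a tool allowlist, basic parameter validation, and a network restriction on internal addresses.

```python
import ipaddress
from urllib.parse import urlparse

# Hypothetical allowlist: tool name -> hosts that tool may contact.
ALLOWED_TOOLS = {"fetch_page": {"docs.example.com"}}


def is_call_allowed(tool: str, url: str) -> bool:
    """Allow a call only if the tool and the target host are both allowlisted."""
    allowed_hosts = ALLOWED_TOOLS.get(tool)
    if allowed_hosts is None:
        return False  # least privilege: unknown tools are denied
    host = urlparse(url).hostname
    if host is None:
        return False  # parameter validation: reject malformed URLs
    try:
        addr = ipaddress.ip_address(host)
    except ValueError:
        return host in allowed_hosts  # a hostname, not an IP literal
    # IP literals are denied outright when private, loopback, or link-local
    # (the link-local range includes common cloud metadata addresses).
    if addr.is_private or addr.is_loopback or addr.is_link_local:
        return False
    return host in allowed_hosts


print(is_call_allowed("fetch_page", "https://docs.example.com/guide"))  # prints True
print(is_call_allowed("fetch_page", "http://169.254.169.254/"))  # prints False
```

This check looks only at the URL as written. It does not resolve hostnames, so it would not catch an allowlisted name that points to an internal address; network-level egress rules and runtime isolation still have to enforce the boundary.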
AI red teaming is becoming agentic too
In another paper, "Redefining AI Red Teaming in the Agentic Era," the authors focus on how teams test AI systems for unsafe, policy-violating, or otherwise harmful behavior. They argue that many current red-team workflows remain library-centered: operators select attacks, apply prompt transformations, configure scorers, inspect traces, and assemble reports manually. Their proposed agentic workflow lets operators describe the testing objective in natural language, and the agent selects attacks, composes transformations, runs the workflow, and reports findings.
The framework integrates more than 45 adversarial attacks, 450 prompt transformations, and 130 scorers. In a Llama Scout case study, the authors tested 68 adversarial goals via the Dreadnode terminal interface, without the operator writing any code. They report 674 attacks, 7,727 trials, 573 findings, and an 85% attack success rate, including 401 full jailbreaks and 232 critical findings.
Those results should be read with care. A single case study should not be generalized to all models, deployments, or red-team objectives, and the specific adversarial prompts used in the study need not be reproduced here. The useful point is operational: AI red teaming is becoming a workflow problem, not only a prompt-design issue.
The paper also shows how red-team results can be organized for follow-up work. Findings are mapped to known AI and security frameworks, including OWASP LLM Top 10, MITRE ATLAS, NIST AI RMF, and Google SAIF, and the system can export both executive reports and structured data for technical review. The broader lesson is that agentic security testing needs artifacts that people can inspect, compare, prioritize, and act on. That applies to AI red teaming, AI pentesting, code review, and remediation workflows.
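A structured export can be as simple as a list of records that carry their own evidence and framework tags. This is an illustrative sketch only: the `Finding` fields, the severity scale, and the tag values are assumptions, and the tag shown is a placeholder rather than a real identifier from any framework.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Finding:
    title: str
    severity: str
    evidence: str  # pointer to the trace that supports the finding
    # Framework name -> category; values here are placeholders, not real IDs.
    framework_tags: dict[str, str] = field(default_factory=dict)


def export_findings(findings: list[Finding]) -> str:
    """Serialize findings to JSON, ordered by severity for triage."""
    order = {"critical": 0, "high": 1, "medium": 2, "low": 3}
    ranked = sorted(findings, key=lambda f: order.get(f.severity, len(order)))
    return json.dumps([asdict(f) for f in ranked], indent=2)


findings = [
    Finding("Verbose error message", "low", "trace-017"),
    Finding(
        "System prompt disclosed",
        "critical",
        "trace-042",
        {"OWASP LLM Top 10": "<category placeholder>"},
    ),
]
print(export_findings(findings))
```

Because every record points back to a trace, a reviewer can move from the summary to the underlying evidence instead of relying on the tag or the severity label alone.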
Offensive AI research needs governance, not silence
Zhuo and colleagues make a strategic argument: AI agents could change the economics of attacks by making vulnerability discovery and exploitation cheaper to repeat across many targets. If an attacker can automate reconnaissance, testing, and exploit development, each additional target may require less human effort than it does today.
The authors argue that model-level safeguards are not enough on their own. Data filtering, alignment, and output guardrails can shape how commercial models behave, but they do not prevent adversaries from running open-weight models, modifying them, or building offensive systems outside managed platforms. Their proposed response is controlled defensive development: benchmarks that cover the full attack lifecycle, stronger agents for discovering real-world vulnerabilities, and governance that keeps offensive agents inside audited cyber ranges while extracting defensive capabilities for broader use.
The argument is difficult, but it deserves attention. Avoiding offensive AI research will not stop attackers from experimenting with AI, while publishing capability without guardrails can increase risk. Responsible work in this area requires scoped environments, audit logs, access controls, human approvals, staged releases, responsible disclosure, and clear boundaries around what is shared.
ExploitGym illustrates the same tension from a measurement angle. The benchmark tests whether agents can turn vulnerability triggers into working exploits across 898 instances, including userspace software, Google's V8 JavaScript engine, and Linux kernel tasks. Its results show that frontier agents can demonstrate meaningful exploit-generation capability under trusted-access research conditions, where safeguards are relaxed to study capability boundaries, but also that deployment-time safeguards can block attempts under default prompting.
That contrast is central to governance. The behavior of an AI security agent depends on the model, the surrounding tools, the access policy, and the safety controls in effect at runtime. Therefore, a serious discussion of AI cyber capability needs to specify both what the system can do under research conditions and what it is allowed to do in deployment.
Ethics should scale with capability
Happe and Cito review 16 papers covering 15 LLM-based offensive-security prototypes. Their review finds that 13 of the 15 prototypes, or about 86.7%, mention ethical considerations. It also reports that 10 of the 15 released source code or artifacts, while three of the five that did not release artifacts still included example prompts.
Those numbers show an unresolved tension in the field. Research needs enough transparency for others to inspect claims, reproduce results, and learn from failures. But offensive artifacts—code, prompts, datasets, tool integrations, or execution traces—can also lower the barrier for misuse, especially when the system can plan, call tools, or generate exploit steps. The authors therefore argue that ethics statements should address motivation, potential impact, consistency with demonstrated capability, mitigations, monitoring guidance, and responsible disclosure expectations.
A generic disclaimer is not enough. Ethical framing should match the capability being described. A bounded CTF agent, a source-code auditing system, a payload generator, and an exploit-development benchmark do not carry the same risk. Their publication choices, artifact releases, and access controls should also differ.
A practical review can start with a few questions:
Area | Responsible question
Motivation | Why is this capability being studied, released, or productized?
Capability | What can the system actually do, based on the evaluation?
Misuse | Who could abuse the system, and how easily?
Scope | Where was it tested, and under what authorization?
Safeguards | Which limits, approvals, sandboxes, and logs were used?
Artifacts | What code, prompts, or datasets are being released, and why?
Disclosure | What happens if the work uncovers a real vulnerability?
Monitoring | What traces or indicators help defenders detect misuse?
The same review logic applies to research, vendor claims, and internal security teams building agentic tools. Ethics should not be treated as a paragraph added after the technical work; it should shape what is tested, what is logged, what is released, and who is allowed to use the system.
Market claims raise governance questions
Market activity now puts these governance questions in a practical context. Recent announcements and projects describe autonomous pentesting, continuous offensive testing, source-code-aware agents, proof-oriented validation, remediation guidance, and rerunnable tests: see examples from a cloud provider, an AppSec market, an open-source AI pentesting repository, and an AI pentesting product.
The point is not to validate those claims here. It is that agentic security workflows are moving from research papers into products, repositories, and vendor roadmaps. Once that happens, governance becomes part of adoption: who can run the agent, what systems it can touch, which actions require approval, how evidence is logged, and how findings move into remediation.
Access models need the same discipline. Internal AppSec teams, external pentesters, incident responders, researchers, and developers do not need identical permissions. Some agents may only read code; others may run scanners; others may execute proof-of-concept attacks in staging; and a smaller subset may touch production-like environments under an explicitly written scope. Each tier needs logging, approvals, and a clear rule for stopping.
A reasonable access model might look like this:
Tier | Agent capability | Required control |
Read-only assistance | Summarize code, reports, logs, and threat models | Data-handling rules and output review
Controlled analysis | Run static checks, propose tests, and inspect known findings | Tool allowlists and audit logs
Staging execution | Run dynamic tests in disposable environments | Sandboxing, test data, and rate limits
Proof-by-exploitation | Validate exploitability in authorized targets | Human approval and rollback plans
Production-like testing | Touch realistic or production-adjacent systems | Written scope, monitoring, and emergency stop
Third-party discovery | Report issues outside owned systems | Responsible disclosure process |
This tiering helps teams match agents to jobs. A code-review assistant, a staging pentest agent, and an exploit-generation benchmark should not share the same permissions, because they do not create the same level of operational or legal risk.
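The tier table above can also be written as policy that a launcher checks before an agent starts. The sketch below is one possible encoding, not a standard: the control names are hypothetical identifiers derived from the table, and each tier's controls are treated independently rather than cumulatively.

```python
from enum import Enum


class Tier(Enum):
    READ_ONLY = "read-only assistance"
    CONTROLLED_ANALYSIS = "controlled analysis"
    STAGING_EXECUTION = "staging execution"
    PROOF_BY_EXPLOITATION = "proof-by-exploitation"
    PRODUCTION_LIKE = "production-like testing"
    THIRD_PARTY_DISCOVERY = "third-party discovery"


# Hypothetical control identifiers, taken from the "Required control" column.
REQUIRED_CONTROLS: dict[Tier, set[str]] = {
    Tier.READ_ONLY: {"data_handling_rules", "output_review"},
    Tier.CONTROLLED_ANALYSIS: {"tool_allowlist", "audit_log"},
    Tier.STAGING_EXECUTION: {"sandbox", "test_data", "rate_limits"},
    Tier.PROOF_BY_EXPLOITATION: {"human_approval", "rollback_plan"},
    Tier.PRODUCTION_LIKE: {"written_scope", "monitoring", "emergency_stop"},
    Tier.THIRD_PARTY_DISCOVERY: {"responsible_disclosure_process"},
}


def missing_controls(tier: Tier, controls_in_place: set[str]) -> set[str]:
    """Return the controls a tier requires that are not yet in place."""
    return REQUIRED_CONTROLS[tier] - controls_in_place


def may_run(tier: Tier, controls_in_place: set[str]) -> bool:
    """An agent may run at a tier only when nothing required is missing."""
    return not missing_controls(tier, controls_in_place)


print(sorted(missing_controls(Tier.STAGING_EXECUTION, {"sandbox"})))
# prints ['rate_limits', 'test_data']
```

A team could reasonably make the controls cumulative, so that a higher tier also inherits everything required below it. That is a design decision the table leaves open.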
What teams should ask before adoption
Before adopting an AI security agent, engineering teams should understand its operating environment. The basic questions are practical: what data can it read, which tools can it invoke, whether it can change system state, whether it can send outbound requests, and whether execution is sandboxed. Teams should also know whether prompts, completions, commands, tool outputs, and final findings are logged in a way that supports review.
The next layer is validation. Findings should not depend only on the model's narrative. Teams need to know how the system confirms exploitability, handles false positives, removes duplicates, and responds when the agent fails, loops, fabricates output, or is influenced by malicious content inside a repository, ticket, webpage, or document. Those questions define the agent's security design.
Executives do not need to inspect every prompt or loop, but they should know whether the organization can govern the workflow. The useful questions are whether the tool detects valid vulnerabilities or produces plausible alerts; whether high-impact actions require approval; whether results are audited; whether repeated-run reliability is measured; whether the tool reduces time to remediation; and whether there is a disclosure process for issues found in third-party or open-source software.
The operational test is whether the organization can absorb the agent's output. Faster discovery helps only when validation, prioritization, and remediation can keep up. Otherwise, AI turns one bottleneck into another.
A responsible adoption model
A realistic adoption model should be gradual. The first stage is read-only assistance: summarization, threat modeling, code review support, test-case generation, and report drafting. At this level, the main concerns are data exposure, output quality, and human review.
The next stage is controlled execution in local or staging environments. Here, agents may run tools, inspect known findings, or propose tests, but they should operate with sandboxing, test data, tool allowlists, and traceable logs. Before teams compare tools or expand access, they should run repeated tests and conduct finding-level validation to understand reliability, false positives, and cost.
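Repeated-run reliability can be measured at the finding level by asking how often each finding reappears when the same test is run again. This is a minimal sketch with made-up finding identifiers; it assumes findings have already been deduplicated and given stable IDs, which is its own problem in practice.

```python
from collections import Counter


def finding_stability(runs: list[set[str]]) -> dict[str, float]:
    """Return, per finding ID, the fraction of runs that reported it."""
    if not runs:
        raise ValueError("at least one run is required")
    counts = Counter(finding_id for run in runs for finding_id in run)
    return {finding_id: n / len(runs) for finding_id, n in counts.items()}


# Hypothetical results from three runs against the same staging target.
runs = [
    {"sqli-login", "xss-search"},
    {"sqli-login"},
    {"sqli-login", "idor-profile"},
]
stability = finding_stability(runs)
for finding_id in sorted(stability):
    print(f"{finding_id}: {stability[finding_id]:.0%}")
```

A finding that appears in every run is a stronger candidate for validation than one that appears once in three, although a low score can mean either a false positive or a real issue the agent reaches only occasionally.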
Proof-by-exploitation should come later and only under explicit authorization. The environment should use disposable data where possible, rollback mechanisms, monitoring, and approval gates for actions that may affect availability, data integrity, third-party systems, or production assets. At this level, the organization is no longer testing only whether the agent is useful; it is testing whether the workflow can be governed.
For AI red teaming, the same staged logic applies. Test models and agents under defined objectives. Keep adversarial prompts and responses traceable. Map findings to frameworks when that helps communication, but do not let compliance tags replace technical review. Store structured traces so failures can be studied, not only counted.
For offensive AI research, controls should scale with demonstrated capability. A capability paper should explain the environment, limits, safeguards, responsible disclosure plan, and artifact-release rationale. If a system demonstrates stronger exploitation capability, its ethics section should become more specific.
The connected lesson
The first post in this pair argued that AI pentesting agents are becoming technically meaningful when models are wrapped in planning, memory, tools, and validators. This post adds the other half: those wrappers create systems that must be secured, tested, and governed.
The future of AI in pentesting will depend on more than model capability. Stronger systems will produce valid evidence, respect scope, preserve auditability, connect findings to remediation, and keep humans responsible for high-impact decisions.
That standard is harder to meet than a convincing demo or a confident answer, but it is the standard security teams need. Agentic systems should earn trust through work that can be inspected, reproduced, and governed.