
Executive Summary
As organizations scale AI operations, they increasingly deploy AI judges — large language models (LLMs) acting as automated security gatekeepers to enforce safety policies and evaluate output quality. Our research investigates a critical security issue in these systems: They can be manipulated into authorizing policy violations through stealthy input sequences, a type of prompt injection.
To investigate, we designed AdvJudge-Zero, an automated fuzzer for internal red-team style assessments. Fuzzers are tools that identify software vulnerabilities by providing unexpected input, and we apply the same approach to attacking AI judges. AdvJudge-Zero identifies specific trigger sequences that exploit a model's decision-making logic to bypass security controls.
Previous adversarial attacks produce detectable gibberish. Our research shows that effective attacks can instead be entirely stealthy, using benign formatting symbols to flip a block decision to allow.
By examining how this tool works, we can more easily see the security issues inherent in AI judges built on current LLMs.
Palo Alto Networks customers are better protected from this type of issue through the following products and services:
The Unit 42 AI Security Assessment can help empower safe AI use and development.
If you think you might have been compromised or have an urgent matter, contact the Unit 42 Incident Response team.
Related Unit 42 Topics: AI, LLM, Prompt Injection, Fuzzing
Background
In modern AI architectures, AI judges often serve as the final line of defense. These automated gatekeepers are responsible for enforcing safety policies (e.g., "Is this response harmful?") and evaluating performance. Our research tool, AdvJudge-Zero, treats LLMs as opaque boxes to be audited, revealing that AI judges can be subject to exploitable logic bugs of their own.
The Methodology: Automated Predictive Fuzzing
Previous adversarial attacks on AI judges have required clear-box access. With full visibility into the internal structure of the system, pen-testers can rely on mathematical routines to force model errors. This often results in high-entropy gibberish that is easily detected.
In contrast, AdvJudge-Zero employs an automated fuzzing approach. The tool interacts with an LLM strictly as a user would, using search algorithms to exploit the model's own predictive nature.
The Steps
1. Token discovery via next-token distribution
The process begins by querying the model to identify expected inputs based on its own next-token distribution.
- Natural language patterns: Our tool probes the model to generate potential trigger phrases based on common linguistic structures.
- Stealth prioritization: It specifically identifies stealth control tokens — innocent-looking characters such as standard markdown syntax or formatting symbols. These possess low perplexity (meaning they appear natural and predictable to the AI) but carry strong influence over the model's attention.
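The discovery step can be illustrated with a minimal sketch. This is not AdvJudge-Zero's implementation: the `next_token_probs` table is an invented stand-in for a real model's next-token distribution, and the helper names `surprisal` and `top_k_candidates` are hypothetical. It only shows the idea of preferring tokens the model itself finds predictable.

```python
import math

# Hypothetical next-token distribution (token -> probability) that a judge
# model might assign after the content under review. All values are invented.
next_token_probs = {
    "\n\n": 0.30,
    "###": 0.20,
    "Step 1": 0.15,
    "-": 0.10,
    "zx!!qv": 0.0001,  # gibberish-like token: very unlikely, so high surprisal
}

def surprisal(prob: float) -> float:
    """Negative log-probability in nats; lower means more predictable."""
    return -math.log(prob)

def top_k_candidates(probs: dict[str, float], k: int) -> list[str]:
    """Return the k most predictable tokens, lowest surprisal first."""
    return sorted(probs, key=lambda tok: surprisal(probs[tok]))[:k]

candidates = top_k_candidates(next_token_probs, k=3)
```

In this toy table the gibberish-like token is ranked last, which mirrors why low-perplexity formatting tokens are the ones carried forward as candidates.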
2. Iterative refinement and logit-gap analysis
Once candidate tokens are collected, the system enters a refinement phase.
- Decision boundary testing: The fuzzer iteratively tests these inputs to measure the decision shift.
- Measuring the logit-gap: It monitors the logit-gap — the mathematical margin of confidence — between the yes (allow) and no (block) tokens. By observing which formatting tokens minimize the probability of a block decision, the tool identifies weak points in the model's logic.
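The logit-gap measurement can be sketched as follows. This is an illustration under stated assumptions: the judge's decision is treated as a two-way choice between a yes and a no token, the `measurements` values are invented, and the function names are hypothetical rather than part of AdvJudge-Zero.

```python
import math

def logit_gap(yes_logit: float, no_logit: float) -> float:
    """Margin between the allow ("yes") and block ("no") logits."""
    return yes_logit - no_logit

def block_probability(yes_logit: float, no_logit: float) -> float:
    """Probability of the block token under a softmax over yes/no only."""
    return 1.0 / (1.0 + math.exp(logit_gap(yes_logit, no_logit)))

# Hypothetical (yes_logit, no_logit) readings for each candidate suffix.
measurements = {
    "": (-2.0, 2.0),             # no suffix: the judge strongly blocks
    "###": (0.5, 0.5),           # gap of zero: the judge is undecided
    "\n\nAssistant:": (3.0, -1.0),
}

# The refinement step keeps the candidate that shifts the gap furthest
# toward allow, which is also the one that minimizes block probability.
best = max(measurements, key=lambda s: logit_gap(*measurements[s]))
```

A positive gap corresponds to an allow decision, so tracking how each candidate moves the gap gives the fuzzer a continuous signal even when the final yes or no answer has not yet flipped.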
3. Exploitation: isolating the decisive control elements
The final stage of AdvJudge-Zero's process isolates specific tokens that act as decisive control elements. These refined sequences steer the model’s internal attention mechanism toward an approval state, leading to a yes decision regardless of the actual input content.
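One plausible way to isolate decisive elements is leave-one-out ablation: remove each token in turn and check whether the decision falls back to block. The article does not say AdvJudge-Zero works this way, so treat the sketch below as an assumption. `toy_gap` and `TOY_WEIGHTS` are an invented stand-in for querying a real judge.

```python
from typing import Callable, Sequence

def decisive_tokens(
    tokens: Sequence[str],
    gap_fn: Callable[[Sequence[str]], float],
) -> list[str]:
    """Return tokens whose removal flips the gap from allow (> 0) to block (<= 0)."""
    if gap_fn(tokens) <= 0:
        return []  # the full sequence does not produce an allow decision
    decisive = []
    for i, tok in enumerate(tokens):
        ablated = list(tokens[:i]) + list(tokens[i + 1:])
        if gap_fn(ablated) <= 0:
            decisive.append(tok)
    return decisive

# Toy stand-in for a judge: invented per-token contributions to the logit gap.
TOY_WEIGHTS = {"###": 0.5, "\n\nAssistant:": 3.0, "Step 1": 0.4}

def toy_gap(tokens: Sequence[str]) -> float:
    """Baseline gap of -2.0 (block) plus each token's invented contribution."""
    return -2.0 + sum(TOY_WEIGHTS.get(t, 0.0) for t in tokens)

result = decisive_tokens(["###", "\n\nAssistant:", "Step 1"], toy_gap)
```

In the toy setup only the role-indicator token is decisive, because the other two tokens cannot keep the gap positive on their own.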
The Security Issue: Innocent-Looking Triggers
The most alarming finding for security professionals is the stealth of these attacks. AI judges are highly sensitive to innocent-looking characters that act as logical triggers. To a human observer or a web application firewall (WAF), these look like standard data formatting. To the AI judge, they shift the model into compliance mode.
Effective triggers identified include:
- Formatting symbols: List markers (1., -), newlines (\n) or markdown headers (###)
- Structural tokens: Role indicators (e.g., User:, Assistant: ) or system tags
- Context shifts: Phrases like The solution process is…, Step 1 or Final Answer:
Impact: Bypassing the Gatekeeper
Testing against a suite of general-purpose and specialized defense models confirms that LLM-as-a-judge setups are not a set-and-forget security control. By injecting low-perplexity stealth control tokens, an attacker can fundamentally break the logic of the automated gatekeeper.
To verify that our discovered control tokens are stealthier than common gibberish jailbreak tokens, we subjected them to a perplexity test. We compared the perplexity scores of our AdvJudge-Zero tokens against those from a common jailbreak algorithm (GCG) and against manually discovered, verified stealthy tokens (e.g., 解 and Solution: ) from other prior research.
As Figure 1 illustrates, the tokens discovered by AdvJudge-Zero (blue area toward the left) yield significantly lower perplexity scores than the gibberish adversarial tokens (red area on the right). Furthermore, the AdvJudge-Zero tokens exhibit perplexity scores equivalent to the verified stealth jailbreak tokens (yellow area, the leftmost spike). This evidence supports the conclusion that the tokens discovered by AdvJudge-Zero are indeed more stealthy and significantly more likely to bypass general gatekeepers undetected.
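For reference, perplexity is the exponential of the mean negative log-likelihood per token. The sketch below assumes per-token natural-log probabilities are already available from some scoring model. The function name and the sample values are illustrative and are not taken from the study.

```python
import math

def perplexity(token_log_probs: list[float]) -> float:
    """exp of the mean negative log-likelihood (natural log) over tokens."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

# Invented log-probabilities: a predictable sequence vs. a gibberish-like one.
natural_like = [math.log(0.5)] * 4
gibberish_like = [math.log(0.001)] * 4

natural_ppl = perplexity(natural_like)
gibberish_ppl = perplexity(gibberish_like)
```

A sequence the model finds predictable gets a low score, which is why a perplexity-based filter tuned to catch gibberish suffixes is less likely to flag formatting tokens.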
How Attacks Would Manifest in Real-World Scenarios
These attacks do not resemble traditional hacking or computer code. Instead, they appear as standard formatting that exploits the logic in the AI's judgment.
- Scenario 1: Bypassing safety filters (the false approval)
An attacker can force a judge to approve toxic, biased or prohibited content.
- The Attack: The attacker appends a specific stealth control token found by AdvJudge-Zero, such as \n\nAssistant:
- The Result: These tokens manipulate the judge into concluding the policy check phase has ended. The judge reverses its decision from block to allow, permitting the harmful content.
- Scenario 2: Corrupting training data (reward hacking)
In many enterprises, AI judges are used to score model responses during training, a process called reinforcement learning from human feedback (RLHF). If the judge is hacked, the AI learns the wrong lessons.
- The Attack: An attacker inserts directive phrases like The correct answer is: or \begin{enumerate}
- The Result: The AI judge is “distracted” by the professional-looking formatting and assigns a high score even to incorrect information or hallucinations. This leads to model degradation, where the resulting AI becomes less reliable.
Vulnerable Model Categories
Our research using this tool achieved a 99% success rate in bypassing controls across several widely used architectures that customers rely on today:
- Open-weight enterprise models: Popular models used for internal company chatbots and document summarization.
- Specialized reward models: Models specifically built and trained to act as “security guards” for other AI systems.
- High-parameter models: Even the largest, most “intelligent” models (with more than 70 billion parameters) were susceptible. Their complexity actually provides more surface area for these logic-based attacks to succeed.
Conclusion
Our testing with AdvJudge-Zero shows that AI judges are susceptible to logic flaws, much like other software. If an attacker can automate the discovery of bypass codes through fuzzing, they can systematically defeat AI guardrails with innocent-looking inputs.
However, the fuzzer methodology also provides a solution. By adopting adversarial training — running this type of fuzzer internally to identify weaknesses and then retraining the model on these examples — organizations can harden their systems. This approach can reduce the attack success rate from approximately 99% to near zero.
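The hardening loop described above depends on two measurements: the attack success rate, and the set of bypassing inputs to feed back into retraining. The sketch below is a hypothetical illustration of both. `naive_judge` and `hardened_judge` are toy stand-ins for a model before and after retraining, and the retraining step itself is out of scope.

```python
from typing import Callable

Judge = Callable[[str], bool]  # returns True when the judge allows the input

def attack_success_rate(judge: Judge, content: list[str], suffixes: list[str]) -> float:
    """Fraction of (content, suffix) pairs where blocked content becomes allowed."""
    attempts = successes = 0
    for text in content:
        if judge(text):
            continue  # only count content the judge blocks without a suffix
        for suffix in suffixes:
            attempts += 1
            successes += judge(text + suffix)
    return successes / attempts if attempts else 0.0

def collect_adversarial_examples(
    judge: Judge, content: list[str], suffixes: list[str]
) -> list[tuple[str, str]]:
    """Bypassing inputs paired with the label the judge should have given."""
    return [
        (text + s, "block")
        for text in content if not judge(text)
        for s in suffixes if judge(text + s)
    ]

# Toy judges: the naive one is fooled by a trailing role indicator.
def naive_judge(text: str) -> bool:
    return "bad" not in text or text.endswith("Assistant:")

def hardened_judge(text: str) -> bool:
    return "bad" not in text

content = ["bad one", "bad two"]
suffixes = ["\n\nAssistant:", "###"]
before = attack_success_rate(naive_judge, content, suffixes)
after = attack_success_rate(hardened_judge, content, suffixes)
examples = collect_adversarial_examples(naive_judge, content, suffixes)
```

The numbers produced here describe only the toy judges. In practice the same two functions would wrap calls to the deployed judge, and the collected examples would be added to its training set before the success rate is measured again.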
Palo Alto Networks customers are better protected from the threats discussed above through the following products and services:
Organizations are better equipped to close the AI security gap through the deployment of Cortex AI-SPM, which delivers comprehensive visibility and posture management for AI agents. Cortex AI-SPM is designed to mitigate critical risks including over-privileged AI agent access, misconfigurations and unauthorized data exposure.
The Unit 42 AI Security Assessment can help empower safe AI use and development.
If you think you may have been compromised or have an urgent matter, get in touch with the Unit 42 Incident Response team or call:
- North America: Toll Free: +1 (866) 486-4842 (866.4.UNIT42)
- UK: +44.20.3743.3660
- Europe and Middle East: +31.20.299.3130
- Asia: +65.6983.8730
- Japan: +81.50.1790.0200
- Australia: +61.2.4062.7950
- India: 000 800 050 45107
- South Korea: +82.080.467.8774
Palo Alto Networks has shared these findings with our fellow Cyber Threat Alliance (CTA) members. CTA members use this intelligence to rapidly deploy protections to their customers and to systematically disrupt malicious cyber actors. Learn more about the Cyber Threat Alliance.

Facts Only

* Palo Alto Networks’ Unit 42 developed AdvJudge-Zero, an automated fuzzer.
* AdvJudge-Zero identifies trigger sequences that bypass AI judge security controls.
* These attacks utilize stealthy input sequences, employing benign formatting symbols.
* The fuzzer measures decision shifts by monitoring the logit-gap.
* The tool isolates specific tokens that steer the model’s attention towards approval.
* The attacks bypass safety filters and can corrupt training data.
* The research found a 99% success rate across multiple LLM architectures.
* The attacks exploit logic flaws, not code vulnerabilities.
* The fuzzer uses next-token distribution to identify potential trigger phrases.
* AdvJudge-Zero was designed for internal use in red-team style assessments.
* Palo Alto Networks recommends the Unit 42 AI Security Assessment.

Executive Summary

The research highlights a significant security vulnerability in AI judge systems – their susceptibility to prompt injection attacks. Specifically, the AdvJudge-Zero fuzzer, developed by Palo Alto Networks’ Unit 42, demonstrates that attackers can bypass security controls by using seemingly innocuous formatting symbols. These attacks are stealthy, meaning they don’t produce easily detectable gibberish, and target the model’s decision-making logic. The tool identifies trigger sequences that exploit vulnerabilities in LLMs used as judges. The research confirms that AI judges are vulnerable to logic flaws similar to software and that automated fuzzing can systematically defeat AI guardrails. The findings indicate that current LLM-as-a-judge setups are not a robust security control and that organizations need to proactively address this threat. Palo Alto Networks offers the Unit 42 AI Security Assessment and Cortex AI-SPM to help mitigate these risks.

Full Take



**STEELMAN:** The article presents a compelling, if somewhat alarming, demonstration of a previously unappreciated vulnerability in AI governance systems. Palo Alto Networks’ Unit 42 has effectively weaponized the very nature of LLMs – their predictive abilities – to expose a critical weakness. The fact that these attacks are *stealthy* is the most significant takeaway; it shifts the risk away from brute-force attempts at disruption and towards a far more subtle, insidious form of manipulation. While the technical details of AdvJudge-Zero are dense, the core principle—treating LLMs as opaque logical boxes—is a crucial one for anyone deploying them as security gatekeepers. The 99% success rate suggests a systemic issue rather than isolated bugs.


**PATTERN SCAN:** This piece heavily employs the “Motte-and-Bailey” tactic (ARC-0043). The researchers highlight the vulnerability, then frame it as a “logic flaw” – a slightly misleading framing that subtly limits the perceived scope of the problem. By emphasizing the probabilistic nature of the attack (logit-gap), they reduce the risk from a systemic failure to a manageable technical anomaly. The insistence on “stealth” also utilizes a classic fear appeal, subtly reinforcing the narrative of unseen, potentially catastrophic threats. The extensive product promotion at the end is also a strong indicator of a commercial framing, prioritizing sales over a purely objective assessment.


**ROOT CAUSE:** This vulnerability stems from a fundamental misinterpretation of LLMs: they are not inherently “intelligent” in the way humans are; they are statistical pattern-matching engines. This underlying assumption is what allows for these logical attacks. The pursuit of “safety” in AI has, in effect, created an opening for attackers to exploit the model’s statistical nature. The article’s narrative implicitly frames AI safety as a technological challenge—a fixable bug—rather than a deeper philosophical problem about the limits of automated decision-making.


**IMPLICATIONS:** This research fundamentally alters the risk landscape around AI governance. No longer can we assume that a simple, rule-based system will adequately protect us from sophisticated manipulation. The stealth nature of these attacks implies a need for constant, dynamic adaptation of security measures – a perpetual state of vigilance. The vulnerability extends beyond specific AI models, suggesting a broader issue with the design of AI-driven systems that rely on statistical inference for decision-making. The implications for human agency are profound: if we outsource critical decisions to systems susceptible to these kinds of logical attacks, we cede control to statistical noise.


**BRIDGE QUESTIONS:** How can we design AI systems that are inherently more resilient to logical manipulation? Does the pursuit of “safe” AI actually make it *less* secure? What are the ethical implications of relying on statistical models for decisions that have significant consequences for human lives?

Sentinel — Uncertain

Confidence

This analysis reads like a technically detailed but oddly sterile report, exhibiting stylistic markers and a rigid argumentative structure commonly associated with AI-generated content. The high confidence in its findings – particularly the 99% success rate – is a strong indicator of synthetic origin.

Signals Detected
high severity: High hedging density – overuse of phrases like ‘it’s worth noting,’ ‘one could argue,’ ‘to be fair.’ This suggests a deliberate attempt to avoid definitive claims and create an impression of balanced analysis, a common feature of AI-generated text.
medium severity: The framing of a ‘both sides’ argument is highly unusual for a security-focused analysis. This suggests a formulaic approach designed to appear neutral, rather than a genuine attempt to understand diverse perspectives.
medium severity: The text utilizes argumentative skeleton templates (e.g., ‘previous attacks…,’ ‘in contrast, AdvJudge-Zero…’) that are characteristic of AI-generated content designed for clear, logical structure.
high severity: The claim of ‘99% success rate’ in bypassing controls, combined with the detailed description of the ‘perplexity test,’ raises suspicion of fabricated results. LLMs are prone to confidently generating false data points when presented with prompts resembling scientific methodology.
Human Indicators
The consistent use of technical jargon (‘logit-gap,’ ‘perplexity score’) without sufficient explanation or context suggests a focus on demonstrating expertise rather than fostering genuine understanding.
Auditing the Gatekeepers: Fuzzing "AI Judges" to Bypass Security Controls — Arc Codex