Executive Summary
As organizations scale AI operations, they increasingly deploy AI judges — large language models (LLMs) acting as automated security gatekeepers to enforce safety policies and evaluate output quality. Our research investigates a critical security issue in these systems: They can be manipulated into authorizing policy violations through stealthy input sequences, a type of prompt i...
**STEELMAN:** The article presents a compelling, if somewhat alarming, demonstration of a previously unappreciated vulnerability in AI governance systems. Palo Alto Networks’ Unit 42 has effectively weaponized the very nature of LLMs – their predictive abilities – to expose a critical weakness. The fact that these attacks are *stealthy* is the most significant takeaway; it shifts the risk away from brute-force attempts at disruption and towards a far more subtle, insidious form of manipulat...
