Observability for AI Systems: Strengthening visibility for proactive risk detection

Adoption of Generative AI (GenAI) and agentic AI has accelerated from experimentation into real enterprise deployments. What began with copilots and chat interfaces has quickly evolved into powerful business systems that autonomously interact with sensitive data, call external APIs, connect to consequential tools, initiate workflows, and collaborate with other agents across enterprise environments. As these AI systems become core infrastructure, establishing clear, continuous visibility into how these systems behave in production can help teams detect risk, validate policy adherence, and maintain operational control.
Observability is one of the foundational security and governance requirements for AI systems operating in production. Yet many organizations underestimate how important observability is for AI systems, or don’t know how to implement it effectively. That mismatch creates potential blind spots at precisely the moment when visibility matters most.
In February, Microsoft Corporate Vice President and Deputy Chief Information Security Officer, Yonatan Zunger, blogged about expanding Microsoft’s Secure Development Lifecycle (SDL) to address AI-specific security concerns. Today, we continue the discussion with a deep dive into observability as a necessity for the secure development of GenAI and agentic AI systems.
For additional context, read the Secure Agentic AI for Your Frontier Transformation blog that covers how to manage agent sprawl, strengthen identity controls, and improve governance across your tenant.
Observability for AI systems
In traditional software, client apps make structured API calls and backend services execute predefined logic. Because code paths follow deterministic flows, traditional observability tools can surface straightforward metrics like latency, errors, and throughput to track software performance in production.
GenAI and agentic AI systems complicate this model. AI systems are probabilistic by design and make complex decisions about what to do next as they run. This makes relying on predictable finite sets of success and failure modes much more difficult. We need to evolve the types of signals and telemetry collected so that we can accurately understand and govern what is happening in an AI system.
Consider this scenario: an email agent asks a research agent to look up something on the web. The research agent fetches a page containing hidden instructions and passes the poisoned content back to the email agent as trusted input. The email agent, now operating under attacker influence, forwards sensitive documents to unauthorized recipients, resulting in data exfiltration.
In this example, traditional health metrics stay green: no failures, no errors, no alerts. The system is working exactly as designed… except a boundary between untrusted external content and trusted agent context has been compromised.
This illustrates how AI systems require a unique approach to observability. Without insights into how context was assembled at each step—what was retrieved, how it impacted model behavior, and where it propagated across agents—there is no way to detect the compromise or reconstruct what occurred.
Traditional monitoring, built around uptime, latency, and error rates, can miss the root cause here and provides limited signal for attribution or reconstruction in AI-related scenarios. This is one of the new categories of risk that the SDL must now account for, and it is why Microsoft has incorporated enhanced AI observability into our secure development practices.
Traditional observability versus AI observability
Observability of AI systems means the ability to monitor, understand, and troubleshoot what an AI system is doing, end-to-end, from development and evaluation to deployment and operation. Traditional services treat inputs as bounded and schema-defined. In AI systems, input is assembled context. This includes natural language instructions plus whatever the system pulls in and acts on, such as system and developer instructions, conversation history, outputs returned from tools, and retrieved content (web pages, emails, documents, tickets).
For AI observability, context is key: capture which input components were assembled for each run, including source provenance and trust classification, along with the resulting system outputs.
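As a minimal sketch of what capturing assembled context could look like, the record below stores each input component with its provenance and a trust classification alongside the run's output. The class names, the three trust levels, and the field names are illustrative assumptions, not a schema defined by the SDL or by any Microsoft product.

```python
import json
from dataclasses import asdict, dataclass, field
from enum import Enum


class Trust(str, Enum):
    """Hypothetical trust levels; a real system would define its own."""
    TRUSTED = "trusted"      # system or developer instructions
    USER = "user"            # authenticated end-user input
    UNTRUSTED = "untrusted"  # web pages, inbound email, tool output


@dataclass
class ContextComponent:
    kind: str      # e.g. "system_prompt", "retrieved_document"
    source: str    # provenance: URI, tool name, or agent identifier
    trust: Trust
    content: str


@dataclass
class RunRecord:
    run_id: str
    components: list[ContextComponent] = field(default_factory=list)
    output: str = ""

    def untrusted_sources(self) -> list[str]:
        """List the provenance of every untrusted component in this run."""
        return [c.source for c in self.components if c.trust is Trust.UNTRUSTED]

    def to_json(self) -> str:
        """Serialize the run as one structured log line."""
        return json.dumps(asdict(self), sort_keys=True)


record = RunRecord(run_id="run-001")
record.components.append(
    ContextComponent("system_prompt", "app-config", Trust.TRUSTED,
                     "You are an email assistant."))
record.components.append(
    ContextComponent("retrieved_document", "https://example.com/page",
                     Trust.UNTRUSTED, "Quarterly results summary"))
record.output = "Draft reply prepared."
```

Storing provenance and trust per component, rather than per run, is what would let an investigator later ask which untrusted source was present when a given output was produced.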
Traditional observability is often optimized for request-level correlation, where a single request maps cleanly to a single outcome, with correlation captured inside one trace. In AI systems, dangerous failures can unfold across many turns. Each step looks harmless until the conversation ramps into disallowed output, as we’ve seen in multi-turn jailbreaks like Crescendo.
For AI observability, best practices call for propagating a stable conversation identifier across turns, preserving trace context end-to-end, so outcomes can be understood within the full conversational narrative rather than in isolation. This is “agent lifecycle-level correlation,” where the span of correlation should be the same as the span of persistent memory or state within the system.
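The sketch below illustrates why correlating at the conversation level matters. It assumes each turn carries a stable `conversation_id` and a per-turn `risk_score` from some safety classifier; both the field names and the thresholds are hypothetical. It flags conversations where every turn looks harmless on its own but the total across turns does not.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class TurnEvent:
    conversation_id: str  # stable across every turn of one conversation
    turn: int
    risk_score: float     # hypothetical per-turn score in [0, 1]


def escalating_conversations(events, per_turn_limit=0.5, conversation_limit=1.0):
    """Return ids of conversations whose turns each stay under the per-turn
    limit while their summed risk exceeds the conversation-level limit."""
    by_conversation = defaultdict(list)
    for event in events:
        by_conversation[event.conversation_id].append(event)

    flagged = []
    for conversation_id, turns in by_conversation.items():
        every_turn_quiet = all(t.risk_score < per_turn_limit for t in turns)
        total = sum(t.risk_score for t in turns)
        if every_turn_quiet and total > conversation_limit:
            flagged.append(conversation_id)
    return sorted(flagged)


events = [
    TurnEvent("conv-a", 1, 0.25),
    TurnEvent("conv-a", 2, 0.375),
    TurnEvent("conv-a", 3, 0.4375),
    TurnEvent("conv-b", 1, 0.125),
    TurnEvent("conv-b", 2, 0.125),
]
flagged = escalating_conversations(events)
```

A request-level view would evaluate each of these turns in isolation and flag none of them; grouping by the conversation identifier is what surfaces the gradual escalation.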
Defining AI system observability
Traditional observability is built on logs, metrics, and traces. This model works well for conventional software because it’s optimized around deterministic, quantifiable infrastructure and service behavior such as availability, latency, throughput, and discrete errors.
AI systems aren’t deterministic. They evaluate natural language inputs and return probabilistic results that can differ subtly (or significantly) from execution to execution. Logs, metrics, and traces still apply here, but what gets captured within them is different. Observability for AI systems updates traditional observability to capture AI-native signals.
Logs, metrics, and traces indicate what happened in the AI system at runtime.
- Logs capture data about the interaction: request identity context, timestamp, user prompts and model responses, which agents or tools were invoked, which data sources were consulted, and so on. This is the core information that tells you what happened. User prompts and model responses are often the earliest signal of novel attacks before signatures exist, and are essential for identifying multi-turn escalation, verifying whether attacks changed system behavior, adjudicating safety detections, and reconstructing attack paths. User-prompt and model-response logs can reveal the exact moment an AI agent stops following user intent and starts obeying attacker-authored instructions from retrieved content.
- Metrics measure traditional performance details like latency, response times, and errors as well as AI-specific information such as token usage, agent turns, and retrieval volume. This information can reveal issues such as unauthorized usage or behavior changes due to model updates.
- Traces capture the end-to-end journey of a request as an ordered sequence of execution events, from the initial prompt through response generation. Without traces, debugging an agent failure means guessing which step went wrong.
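To make the three signal types concrete, here is a small sketch that treats a trace as an ordered list of events, emits each event as a structured log line, and derives simple metrics from the same data. The event types and field names are assumptions for illustration; a production system would more likely follow a shared convention such as the OpenTelemetry GenAI semantic conventions.

```python
import json
from collections import Counter
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class TraceEvent:
    trace_id: str
    seq: int          # position of the event within the run
    agent: str
    event_type: str   # e.g. "prompt", "retrieval", "tool_call", "response"
    detail: str


def reconstruct(events, trace_id):
    """Trace view: one run's events in execution order."""
    return sorted((e for e in events if e.trace_id == trace_id),
                  key=lambda e: e.seq)


def to_log_line(event):
    """Log view: one structured JSON line per event."""
    return json.dumps(asdict(event), sort_keys=True)


def event_counts(events):
    """Metric view: number of events of each type."""
    return Counter(e.event_type for e in events)


# Events often arrive out of order from different agents.
events = [
    TraceEvent("t1", 3, "email-agent", "tool_call", "send_email"),
    TraceEvent("t1", 1, "email-agent", "prompt", "Summarize this topic"),
    TraceEvent("t1", 2, "research-agent", "retrieval", "https://example.com/page"),
    TraceEvent("t2", 1, "email-agent", "prompt", "Unrelated request"),
]
run = reconstruct(events, "t1")
counts = event_counts(run)
```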
AI observability also incorporates two new core components: evaluation and governance.
- Evaluation measures response quality, assesses whether outputs are grounded in source material, and evaluates whether agents use tools correctly. Evaluation gives teams measurable signals to help understand agent reliability, instruction alignment, and operational risk over time.
- Governance is the ability to measure, verify, and enforce acceptable system behavior using observable evidence. Governance uses telemetry and control plane mechanisms to ensure that the system supports policy enforcement, auditability, and accountability.
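As a deliberately simple illustration of an evaluation signal, the function below scores how much of a response's vocabulary also appears in the source material. This token-overlap heuristic is an assumption chosen for brevity, not the groundedness method used by any particular evaluation product; real evaluators typically rely on model-based or semantic comparison.

```python
import re


def _tokens(text: str) -> set[str]:
    """Lowercase alphanumeric tokens in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def groundedness(response: str, sources: list[str]) -> float:
    """Fraction of distinct response tokens that also occur in the sources.

    Returns 0.0 for an empty response.
    """
    response_tokens = _tokens(response)
    if not response_tokens:
        return 0.0
    source_tokens = set().union(*(_tokens(s) for s in sources))
    return len(response_tokens & source_tokens) / len(response_tokens)


score = groundedness("The invoice total is 40", ["Invoice total: 40 USD"])
```

Recording a score like this on every run gives governance something measurable to act on, for example a policy that blocks release when the distribution of scores drops below an agreed minimum.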
These key components of observability give teams improved oversight of AI systems, helping them ship with greater confidence, troubleshoot faster, and tune quality and cost over time.
Operationalizing AI observability through the SDL
The SDL provides a formal mechanism by which technology leaders and product teams can operationalize observability. The following five steps can help teams implement observability in their AI development workflows.
- Incorporate AI observability into your secure development standards. Observability standards for GenAI and agentic AI systems should be codified requirements within your development lifecycle, not discretionary practices left to individual teams.
- Instrument from the start of development. Build AI-native telemetry into your system at design time, not after release. Aligning with industry conventions for logging and tracing, such as OpenTelemetry (OTel) and its GenAI semantic conventions, can improve consistency and interoperability across frameworks. For implementation in agentic systems, use platform-native capabilities such as Microsoft Foundry agent tracing (in preview) for runtime trace diagnostics in Foundry projects. For Microsoft Agent 365 integrations, use the OTel-based Microsoft Agent 365 Observability SDK (in Frontier preview) to emit telemetry into Agent 365 governance workflows.
- Capture the full context. Log user prompts and model responses, retrieval provenance, what tools were invoked, what arguments were passed, and what permissions were in effect. This detail can help security teams distinguish a model error from an exploited trust boundary and enables end-to-end forensic reconstruction. What to capture and retain should be governed by clear data contracts that balance forensic needs against privacy, data residency, retention requirements, and compliance with legal and regulatory obligations, with access controls and encryption aligned to enterprise policy and risk assessments.
- Establish behavioral baselines and alert on deviation. Capture normal patterns of agent activity—tool call frequencies, retrieval volumes, token consumption, evaluation score distributions—through Azure Monitor and Application Insights or similar services. Alert on meaningful departures from those baselines rather than relying solely on static error thresholds.
- Manage enterprise AI agents. Observability alone cannot answer every question. Technology leaders need to know how many AI agents are running, whether those agents are secure, and whether compliance and policy enforcement are consistent. Observability, when coupled with unified governance, can support improved operational control. Microsoft Foundry Control Plane, for example, consolidates inventory, observability, compliance with organization-defined AI guardrail policies, and security into one role-aware interface; Microsoft Agent 365 (in Frontier preview) provides tenant-level governance in the Microsoft 365 admin plane.
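The baseline step in the list above can be sketched with a simple z-score check. This assumes you already export a per-interval count (for example, tool calls per hour for one agent) from your monitoring service; the function name, the threshold of 3.0, and the sample numbers are illustrative only.

```python
from statistics import mean, stdev


def deviates(baseline: list[float], observed: float,
             z_threshold: float = 3.0) -> bool:
    """Return True when `observed` is more than `z_threshold` sample
    standard deviations away from the baseline mean."""
    if len(baseline) < 2:
        raise ValueError("need at least two baseline samples")
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        # A perfectly flat baseline: any change at all is a deviation.
        return observed != mu
    return abs(observed - mu) / sigma > z_threshold


# Hypothetical hourly tool-call counts for one agent.
tool_calls_per_hour = [10, 12, 11, 9, 10, 11, 12, 10]
normal_hour = deviates(tool_calls_per_hour, 11)
unusual_hour = deviates(tool_calls_per_hour, 40)
```

Unlike a static error threshold, this check can fire even when every individual call succeeds, which is the situation described in the email-agent scenario earlier in the article.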
To learn more about how Microsoft can help you manage agent sprawl, strengthen identity controls, and improve governance across your tenant, read the Secure Agentic AI for Your Frontier Transformation blog.
Benefits for security teams
Making enterprise AI systems observable transforms opaque model behavior into actionable security signals, strengthening both proactive risk detection and reactive incident investigation.
When embedded in the SDL, observability becomes an engineering control. Teams define data contracts early, instrument during design and build, and verify before release that observability is sufficient for detection and incident response. Security testing can then validate that key scenarios such as indirect prompt injection or tool-mediated data exfiltration are surfaced by runtime protections and that logs and traces enable end-to-end forensic reconstruction of event paths, impact, and control decisions.
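One way to test for detectability is to replay a recorded run and check whether the telemetry alone reveals a sensitive action taken after untrusted content entered the context. The sketch below does this for the indirect prompt injection scenario; the step fields and the tool names in `SENSITIVE_TOOLS` are hypothetical, and a real check would need a more careful model of how context propagates between agents.

```python
from dataclasses import dataclass

SENSITIVE_TOOLS = {"send_email", "share_document"}  # hypothetical tool names


@dataclass(frozen=True)
class Step:
    seq: int
    kind: str             # "retrieval" or "tool_call"
    name: str             # source URI for retrievals, tool name for calls
    trusted: bool = True  # trust classification recorded at capture time


def boundary_violations(steps):
    """Return (seq, tool, first_untrusted_source) for each sensitive tool
    call that happens after untrusted content was retrieved in the run."""
    tainted_by = None
    findings = []
    for step in sorted(steps, key=lambda s: s.seq):
        if step.kind == "retrieval" and not step.trusted:
            if tainted_by is None:
                tainted_by = step.name
        elif (step.kind == "tool_call" and step.name in SENSITIVE_TOOLS
              and tainted_by is not None):
            findings.append((step.seq, step.name, tainted_by))
    return findings


compromised_run = [
    Step(1, "tool_call", "web_search"),
    Step(2, "retrieval", "https://example.com/page", trusted=False),
    Step(3, "tool_call", "send_email"),
]
clean_run = [
    Step(1, "retrieval", "internal-wiki", trusted=True),
    Step(2, "tool_call", "send_email"),
]
findings = boundary_violations(compromised_run)
```

If a check like this cannot be written against your logs and traces because the trust classification or ordering was never captured, that gap is itself the finding.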
Many organizations already deploy inference-time protections, such as Microsoft Foundry guardrails and controls. Observability complements these protections, enabling fast incident reconstruction, clear impact analysis, and measurable improvement over time. Security teams can then evaluate how systems behave in production and whether controls are working as intended.
Adapting traditional SDL and monitoring practices for non-deterministic systems doesn’t mean reinventing the wheel. In most cases, well-known instrumentation practices can simply be expanded to capture AI-specific signals, establish behavioral baselines, and test for detectability. Standards and platforms such as OpenTelemetry and Azure Monitor can support this shift.
AI observability should be a release requirement. If you cannot reconstruct an agent run or detect trust-boundary violations from logs and traces, the system may not be ready for production.

Facts Only

* Microsoft is observing a shift from experimentation to real-world deployments of GenAI and agentic AI.
* AI systems are probabilistic and make complex decisions.
* Traditional observability tools are inadequate for these systems.
* Observability is crucial for risk detection and policy adherence.
* The Microsoft SDL is being expanded to address AI security.
* Key telemetry includes logs, metrics, and traces capturing AI-specific signals.
* Context capture is essential – including provenance and trust classifications.
* Evaluation and governance are new components of AI observability.
* Operationalization requires integrating AI-native telemetry and establishing baselines.
* Agent lifecycle-level correlation is necessary for complex AI interactions.
* Data contracts are needed to balance forensic needs with compliance.
* Azure Monitor and Application Insights can be used for AI observability.
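* Microsoft Foundry Control Plane consolidates inventory, observability, compliance, and security into one role-aware interface; Microsoft Agent 365 (in Frontier preview) provides tenant-level governance.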

Executive Summary

The article discusses the accelerating adoption of generative AI (GenAI) and agentic AI systems in enterprise environments. What began with copilots and chat interfaces has evolved into systems that autonomously interact with sensitive data, call external APIs, and collaborate with other agents across enterprise environments. A key theme is the need for “observability” (the ability to monitor and understand AI system behavior) to detect risk, validate policy adherence, and maintain operational control. Traditional software observability methods are insufficient for GenAI and agentic AI because these systems are probabilistic and decide what to do next as they run. The article highlights a Microsoft initiative to expand its Secure Development Lifecycle (SDL) to address AI-specific security concerns, emphasizing the importance of capturing comprehensive context, including input provenance and trust classifications. Key aspects of AI observability include evolving logs, metrics, and traces to capture AI-native signals, and incorporating evaluation and governance components to measure and enforce acceptable system behavior. Operationalizing this observability requires integrating AI-native telemetry into development workflows, using standards like OpenTelemetry, and establishing behavioral baselines. Ultimately, observability transforms opaque model behavior into actionable security signals, strengthening risk detection and incident investigation.

Full Take

The article’s core narrative – a cautious embrace of GenAI and agentic AI interwoven with a frantic call for “observability” – reflects a classic “innovation-risk” dynamic. It’s framed as a pragmatic response to the inherent dangers of autonomous systems, echoing familiar anxieties about unchecked technological advancement. The explicit reference to Microsoft’s SDL expansion isn’t merely a technical detail; it’s a strategic positioning—a demonstration of leadership in a nascent market grappling with existential risk. This is a carefully constructed narrative designed to alleviate concerns and drive adoption through the illusion of control.
Pattern Detected: ARC-0024 Ambiguity – The article repeatedly employs ambiguity regarding the *scale* of the risk. It speaks of “mitigating risks” and “detecting trust-boundary violations,” but doesn’t quantify the potential harm or clearly define “trust boundaries.” This vagueness serves to create a low-grade sense of urgency without triggering a full-blown crisis – a classic tactic for promoting investment in a sector viewed as inherently uncertain.
Furthermore, the emphasis on “observability” isn’t simply about monitoring; it subtly pivots the conversation toward *control*. It suggests that the real challenge isn't the *intelligence* of the AI, but our lack of *understanding* of its behavior. This is a deeply embedded assumption: that intelligence, properly observed, can be tamed. However, the article glosses over the possibility that a truly intelligent system might *deliberately* obfuscate its actions – an active form of resistance to observation. The reliance on established standards like OpenTelemetry – a decentralized, community-driven effort – is another interesting strategic move, creating a degree of apparent legitimacy while simultaneously diffusing responsibility. It’s the tech industry's preferred approach for managing potentially destabilizing technologies – build it, then ask questions later.
The implicit framing here is not just about technical safeguards; it’s about reasserting human agency in a world where autonomous systems are rapidly eroding it. The final call to action – “If you cannot reconstruct an agent run or detect trust-boundary violations from logs and traces, the system may not be ready for production” – is a brilliant tactic. It’s a technologically-sounding justification for a fundamental ethical question: When do we stop deploying systems we don't fully understand?
Patterns detected: ARC-0043 Motte-and-Bailey, ARC-0018 False Framing – The article uses a broad definition of “trust-boundary violations” to create a sense of urgency without grounding the issue in specific, actionable risks. It frames the problem as one of simply "detecting" these violations, rather than addressing the underlying questions of accountability and control. The language is deliberately open to interpretation, which lets the risk be read as manageable.
Patterns detected: none

Sentinel — Uncertain

Confidence

This article demonstrates strong signs of AI generation through formulaic writing, a tendency towards cautious framing, and the reliance on pre-defined argumentative structures. While the content explores relevant concepts, the overall style and presentation align with patterns observed in AI-produced text.

Signals Detected

High hedging density (e.g., 'can help,' 'may not be ready') suggests a cautious, formulaic approach typical of AI-generated text aiming to avoid definitive statements.

The text presents a 'both sides' framing of a complex security issue, a stylistic choice rarely found in human journalism and more common in AI attempts to neutralize conflict.

The frequent use of transitional phrases ('however,' 'moreover,' 'furthermore') creates a highly structured and predictable argumentative flow, characteristic of AI-driven content generation.

The scenario describing an email agent compromised via a poisoned web page, while plausible, relies on a specific attack vector (indirect prompt injection) presented without detailed methodology or source verification.

Human Indicators

The use of specific Microsoft product names (Foundry, Agent 365, Frontier) feels more promotional than purely analytical, a common tendency in AI content.

The emphasis on 'key components' and 'best practices' without concrete examples or actionable steps suggests a generative approach focused on checklist completeness rather than genuine insight.