AI Observability, Part 5: Making It Operational

From Queries to Alerts to Action


You have patterns. Four layers of KQL that surface model health, retrieval quality, orchestration outcomes, and governance posture.

Patterns are documentation. Alerts are operational. The difference is whether someone gets notified when something goes wrong versus whether someone remembers to check a dashboard.

This part covers the translation from observability patterns to operational infrastructure: alert rules that fire on meaningful conditions, workbooks that present information to the right audiences, and deployment guidance for standing up the observability layer itself.

The goal isn’t comprehensive monitoring. It’s actionable monitoring. Every alert should have a clear response. Every workbook should answer a specific question for a specific audience.


Alert Design Principles

Before the alert rules, some principles that separate useful alerting from noise generation.

Alert on conditions that require action. If no one needs to do anything when the alert fires, it shouldn’t be an alert. It should be a metric on a dashboard.

Include context in the alert payload. An alert that says “latency degraded” requires investigation to understand. An alert that says “GPT-4o customer support deployment P95 latency is 3.2s against 1.8s baseline” tells you what to look at.

Tier by urgency, not by layer. A governance policy breach might be informational. A model layer outage might be critical. The layer doesn’t determine severity; the business impact does.

Set thresholds based on evidence, not intuition. Run the baseline queries for two weeks before defining “degraded.” Let the data tell you what normal looks like.

Never stop tuning. Alert thresholds aren’t a deployment artifact. They’re a living system. If you’re not adjusting thresholds based on operational feedback, you’re not accepting feedback. The alert that fired correctly six months ago might be noise today because baselines shifted. The alert that never fires might need a tighter threshold because you’ve improved and the old bar is too low. This is where the feedback loop becomes real. Tuning alerts is how you prove you’re learning.
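
A minimal sketch of what that tuning can look like in practice, assuming the same Azure OpenAI diagnostic fields used in the alert queries below: replay a candidate threshold against recent history and count how often it would have fired. The 2,500 ms threshold and 14-day window here are placeholders, not recommendations.

// Sketch: back-test a candidate P95 latency threshold before wiring it into an alert rule
let candidateThresholdMs = 2500;
AzureDiagnostics
|  where TimeGenerated > ago(14d)
|  where ResourceProvider has 'microsoft.cognitiveservices'
      and Category has 'requestresponse'
|  extend deployment = tostring(todynamic(properties_s).deploymentName)
|  summarize hourlyP95 = percentile(toreal(todynamic(properties_s).durationMs), 95)
      by deployment, bin(TimeGenerated, 1h)
|  summarize 
      hoursEvaluated = count(),
      hoursFired = countif(hourlyP95 > candidateThresholdMs)
      by deployment
|  extend firingRatePercent = round(hoursFired * 100.0 / hoursEvaluated, 1)
|  order by firingRatePercent desc

If the candidate threshold would have fired for a third of all hours, it belongs on a dashboard, not in an alert rule.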


Layer 1 Alerts: Model Infrastructure

Alert: Token Budget Critical

Fires when daily consumption exceeds 95% of budget.

// Scheduled query alert - run every 15 minutes
let dailyBudgets = datatable(deployment:string, dailyTokenBudget:long) [
   'gpt4o-customer-support', 5000000,
   'gpt4o-internal-search', 2000000,
   'gpt4-document-summary', 1000000,
   'embedding-ada-002', 10000000
];
let criticalThreshold = 0.95;
AzureDiagnostics
|  where TimeGenerated > ago(1d)
|  where ResourceProvider has 'microsoft.cognitiveservices'
      and Category has 'requestresponse'
|  extend 
      deployment = tostring(todynamic(properties_s).deploymentName),
      totalTokens = toint(todynamic(properties_s).totalTokens)
|  summarize dailyTokens = sum(totalTokens) by deployment
|  lookup kind=leftouter dailyBudgets on deployment
|  where dailyTokens >= dailyTokenBudget * criticalThreshold
|  project 
      deployment,
      dailyTokens,
      dailyTokenBudget,
      budgetUsedPercent = round(dailyTokens * 100.0 / dailyTokenBudget, 1)

Response: Investigate consumption spike. Identify runaway process or unexpected usage pattern. Consider rate limiting or scaling budget.

Alert: Latency Degradation

Fires when P95 latency exceeds baseline by 50%+.

// Scheduled query alert - run every 15 minutes
let baselineP95 = AzureDiagnostics
   |  where TimeGenerated between (ago(7d) .. ago(1d))
   |  where ResourceProvider has 'microsoft.cognitiveservices'
         and Category has 'requestresponse'
   |  extend deployment = tostring(todynamic(properties_s).deploymentName)
   |  summarize baseline = percentile(toreal(todynamic(properties_s).durationMs), 95) by deployment;
AzureDiagnostics
|  where TimeGenerated > ago(1h)
|  where ResourceProvider has 'microsoft.cognitiveservices'
      and Category has 'requestresponse'
|  extend deployment = tostring(todynamic(properties_s).deploymentName)
|  summarize currentP95 = percentile(toreal(todynamic(properties_s).durationMs), 95) by deployment
|  lookup kind=inner baselineP95 on deployment
|  where currentP95 > baseline * 1.5
|  project 
      deployment,
      currentP95 = round(currentP95, 0),
      baseline = round(baseline, 0),
      degradationRatio = round(currentP95 / baseline, 2)

Response: Check Azure status for regional issues. Review recent prompt changes. Verify model deployment configuration.

Alert: Content Filter Spike

Fires when content filter triggers exceed three times the baseline hourly rate.

// Scheduled query alert - run hourly
let baselineRate = AzureDiagnostics
   |  where TimeGenerated between (ago(7d) .. ago(1d))
   |  where ResourceProvider has 'microsoft.cognitiveservices'
         and Category has 'contentfilter'
   |  summarize baselineCount = count() by bin(TimeGenerated, 1h)
   |  summarize avgHourlyTriggers = avg(baselineCount);
AzureDiagnostics
|  where TimeGenerated > ago(1h)
|  where ResourceProvider has 'microsoft.cognitiveservices'
      and Category has 'contentfilter'
|  summarize currentCount = count()
|  extend avgHourlyTriggers = toscalar(baselineRate)
|  where currentCount > avgHourlyTriggers * 3
|  project 
      currentCount,
      avgHourlyTriggers = round(avgHourlyTriggers, 0),
      spikeRatio = round(currentCount / avgHourlyTriggers, 1)

Response: Investigate traffic source. Check for abuse patterns or prompt injection attempts. Review filter configuration if legitimate use is being blocked.


Layer 2 Alerts: Grounding Infrastructure

Alert: Search Service Throttling

Fires on any throttling event.

// Scheduled query alert - run every 5 minutes
AzureDiagnostics
|  where TimeGenerated > ago(15m)
|  where ResourceProvider has 'microsoft.search'
|  where ResultType has 'throttled'
      or ResultSignature == '503'
      or ResultSignature == '429'
|  summarize 
      throttleCount = count(),
      affectedIndexes = make_set(tostring(IndexName_s), 10)
|  where throttleCount > 0
|  project 
      throttleCount,
      affectedIndexes,
      timeWindow = '15 minutes'

Response: Scale search service tier or add replicas. If during indexing, reschedule to off-peak hours. Identify query patterns causing pressure.

Alert: Index Staleness Critical

Fires when an index hasn’t been updated within the defined staleness threshold.

// Scheduled query alert - run daily
let stalenessThresholdDays = 7;
AzureDiagnostics
|  where TimeGenerated > ago(30d)
|  where ResourceProvider has 'microsoft.search'
      and OperationName has 'index'
|  extend indexName = tostring(IndexName_s)
|  summarize lastIndexOperation = max(TimeGenerated) by indexName
|  extend daysSinceUpdate = datetime_diff('day', now(), lastIndexOperation)
|  where daysSinceUpdate > stalenessThresholdDays
|  project 
      indexName,
      lastIndexOperation,
      daysSinceUpdate

Response: Verify indexing pipeline health. Check source system connectivity. Review indexing schedule configuration.

Alert: Zero Result Rate Elevated

Fires when zero-result queries exceed 10% of traffic.

// Scheduled query alert - run hourly
AzureDiagnostics
|  where TimeGenerated > ago(1h)
|  where ResourceProvider has 'microsoft.search'
      and OperationName has 'query'
|  extend resultCount = toint(ResultCount)
|  summarize 
      totalQueries = count(),
      zeroResultQueries = countif(resultCount == 0)
|  extend zeroResultRate = round(zeroResultQueries * 100.0 / totalQueries, 1)
|  where zeroResultRate > 10
|  project 
      totalQueries,
      zeroResultQueries,
      zeroResultRate

Response: Analyze failed query patterns. Identify corpus gaps. Review embedding alignment between queries and content.


Layer 3 Alerts: Orchestration Quality

Alert: User Satisfaction Drop

Fires when satisfaction rate drops below threshold.

// Scheduled query alert - run every 4 hours
let satisfactionThreshold = 70;
let minimumSampleSize = 50;
customEvents
|  where TimeGenerated > ago(4h)
|  where name has 'ai_interaction'
|  extend 
      wasHelpful = tobool(customDimensions.markedHelpful),
      queryIntent = tostring(customDimensions.queryIntent)
|  summarize 
      totalInteractions = count(),
      helpfulCount = countif(wasHelpful == true)
      by queryIntent
|  where totalInteractions >= minimumSampleSize
|  extend satisfactionRate = round(helpfulCount * 100.0 / totalInteractions, 1)
|  where satisfactionRate < satisfactionThreshold
|  project 
      queryIntent,
      satisfactionRate,
      totalInteractions,
      threshold = satisfactionThreshold

Response: Analyze recent changes to prompts or retrieval. Review negative feedback reasons. Check retrieval quality correlation.

Alert: Conversation Abandonment Spike

Fires when the abandonment rate exceeds the seven-day baseline by more than 50%.

// Scheduled query alert - run hourly
let baselineAbandonRate = customEvents
   |  where TimeGenerated between (ago(7d) .. ago(1d))
   |  where name has 'ai_interaction'
   |  summarize 
         abandoned = countif(tobool(customDimensions.sessionAbandoned) == true),
         total = count()
   |  extend baseline = abandoned * 100.0 / total;
customEvents
|  where TimeGenerated > ago(1h)
|  where name has 'ai_interaction'
|  summarize 
      abandoned = countif(tobool(customDimensions.sessionAbandoned) == true),
      total = count()
|  extend currentRate = abandoned * 100.0 / total
|  extend baselineRate = toscalar(baselineAbandonRate)
|  where currentRate > baselineRate * 1.5
|  project 
      currentRate = round(currentRate, 1),
      baselineRate = round(baselineRate, 1),
      abandonedSessions = abandoned,
      totalSessions = total

Response: Check for latency issues causing user impatience. Review recent UX changes. Analyze conversation patterns at abandonment point.

Alert: Guardrail Intervention Spike

Fires when the guardrail intervention rate exceeds 5% of requests.

// Scheduled query alert - run hourly
customEvents
|  where TimeGenerated > ago(1h)
|  where name has 'ai_interaction'
|  extend 
      guardrailTriggered = tobool(customDimensions.guardrailIntervention),
      queryIntent = tostring(customDimensions.queryIntent)
|  summarize 
      totalRequests = count(),
      guardrailCount = countif(guardrailTriggered == true)
|  extend guardrailRate = guardrailCount * 100.0 / totalRequests
|  where guardrailRate > 5  // More than 5% intervention rate
|  project 
      guardrailRate = round(guardrailRate, 1),
      guardrailCount,
      totalRequests

Response: Determine if legitimate edge cases or abuse. Review guardrail configuration for over-sensitivity. Analyze blocked query patterns.


Layer 4 Alerts: Governance Posture

Alert: Confidence Threshold Breach

Fires when a capability’s metrics fall below its authority threshold.

// Scheduled query alert - run every 4 hours
let authorityThresholds = datatable(authority:string, minAccuracy:real) [
   'suggest', 0.70,
   'recommend', 0.80,
   'approve', 0.90,
   'execute', 0.95
];
let currentMetrics = customEvents
   |  where TimeGenerated > ago(7d)
   |  where name has 'ai_interaction'
   |  extend capabilityId = tostring(customDimensions.aiCapabilityId)
   |  summarize accuracy = countif(tobool(customDimensions.responseAccurate) == true) * 1.0 / count()
         by capabilityId;
let currentAuthority = customEvents
   |  where name has 'authority_change'
   |  summarize arg_max(TimeGenerated, *) by capabilityId = tostring(customDimensions.aiCapabilityId)
   |  project capabilityId, authority = tostring(customDimensions.newAuthority);
currentMetrics
|  join kind=inner currentAuthority on capabilityId
|  lookup kind=leftouter authorityThresholds on authority
|  where accuracy < minAccuracy
|  project 
      capabilityId,
      authority,
      currentAccuracy = round(accuracy * 100, 1),
      requiredAccuracy = round(minAccuracy * 100, 1),
      gap = round((minAccuracy - accuracy) * 100, 1)

Response: Initiate rollback review. Document performance degradation. Evaluate whether to reduce authority level.

Alert: Review Overdue

Fires when a capability’s review date has passed.

// Scheduled query alert - run daily
customEvents
|  where name has 'authority_change'
|  summarize arg_max(TimeGenerated, *) by capabilityId = tostring(customDimensions.aiCapabilityId)
|  extend 
      reviewDate = todatetime(customDimensions.reviewDate),
      currentAuthority = tostring(customDimensions.newAuthority),
      approvedBy = tostring(customDimensions.approvedBy)
|  where reviewDate < now()
|  extend daysOverdue = datetime_diff('day', now(), reviewDate)
|  project 
      capabilityId,
      currentAuthority,
      reviewDate,
      daysOverdue,
      approvedBy
|  order by daysOverdue desc

Response: Schedule immediate review. Document why review was delayed. Update review date after completion.

Alert: Policy Override Rate Elevated

Fires when policy overrides exceed acceptable threshold.

// Scheduled query alert - run daily
customEvents
|  where TimeGenerated > ago(24h)
|  where name has 'policy_evaluation'
|  extend 
      policyName = tostring(customDimensions.policyName),
      overrideApplied = tobool(customDimensions.overrideApplied)
|  summarize 
      totalEvaluations = count(),
      overrideCount = countif(overrideApplied == true)
      by policyName
|  extend overrideRate = round(overrideCount * 100.0 / totalEvaluations, 1)
|  where overrideRate > 10  // More than 10% override rate
|  project 
      policyName,
      overrideRate,
      overrideCount,
      totalEvaluations

Response: Review policy appropriateness. Analyze override justifications. Adjust policy or enforcement if warranted.


Workbook Design: Audiences and Questions

Different audiences need different views. A workbook that serves everyone serves no one.

Operations Workbook

Audience: On-call engineers, support teams

Questions answered:

  • Is the system healthy right now?
  • What’s degraded and since when?
  • Where should I look first?

Content:

  • Real-time health indicators (last 15 minutes)
  • Active alerts with context
  • Latency trends by deployment
  • Error rate by layer
  • Quick links to detailed diagnostics

Refresh: Auto-refresh every 5 minutes

Platform Workbook

Audience: Platform engineers, architects

Questions answered:

  • How is the system trending over time?
  • Where are the capacity constraints?
  • What needs optimization?

Content:

  • Weekly/monthly trend analysis
  • Capacity utilization by service
  • Retrieval quality trends
  • Cost attribution and forecasting
  • Baseline comparisons

Refresh: On-demand, typically reviewed weekly

Leadership Workbook

Audience: Directors, VPs, executives

Questions answered:

  • Is the AI investment delivering value?
  • Are we governing responsibly?
  • What’s the risk posture?

Content:

  • User satisfaction trends
  • Cost per interaction over time
  • Authority distribution across capabilities
  • Incident summary (count, severity, resolution time)
  • Compliance checkpoint status

Refresh: On-demand, typically reviewed monthly

Compliance Workbook

Audience: Auditors, risk managers, compliance officers

Questions answered:

  • Can you prove governance controls are operating?
  • What’s the audit trail for authority decisions?
  • Where are the policy violations?

Content:

  • Policy evaluation summary
  • Override analysis with justifications
  • Authority change log
  • Review deadline status
  • Incident attribution by root cause

Refresh: On-demand, generated for audit requests


Workbook Structure Pattern

Each workbook should follow a consistent structure:

1. Summary Tiles
   - 3-5 key metrics as large numbers
   - Color-coded status (green/yellow/red)
   - Time range selector

2. Trend Charts
   - Primary metrics over time
   - Baseline comparison lines
   - Anomaly highlighting

3. Detail Tables
   - Drill-down data supporting the trends
   - Sortable and filterable
   - Links to related workbooks or logs

4. Action Items
   - Alerts requiring attention
   - Overdue reviews
   - Threshold breaches

Keep each workbook to a single scrollable page. If it needs tabs, consider splitting into separate workbooks.
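
As a sketch, a summary-tile query for the operations workbook might look like the following. Treating ResultSignature as an HTTP status code is an assumption about your diagnostic schema; the green/yellow/red thresholds live in the workbook tile settings, not in the query.

// Sketch: operations workbook summary tiles - volume, errors, and P95 latency
// over the last 15 minutes. Assumes ResultSignature carries an HTTP status code.
AzureDiagnostics
|  where TimeGenerated > ago(15m)
|  where ResourceProvider has 'microsoft.cognitiveservices'
      and Category has 'requestresponse'
|  extend durationMs = toreal(todynamic(properties_s).durationMs)
|  summarize 
      requests = count(),
      p95LatencyMs = round(percentile(durationMs, 95), 0),
      errorCount = countif(toint(ResultSignature) >= 400)
|  extend errorRatePercent = round(errorCount * 100.0 / max_of(requests, 1), 1)

The same shape works for the other tiles; swap the summarize for whatever metric each tile reports.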


Deployment Guidance

Standing up the observability infrastructure requires configuring diagnostic settings, deploying Log Analytics resources, and establishing the custom event pipeline from your application.

Diagnostic Settings Configuration

Every Azure resource in your AI stack needs diagnostic settings pointing to your Log Analytics workspace:

  • Azure OpenAI: Enable RequestResponse and ContentFilter categories
  • Azure AI Search: Enable OperationLogs and QueryMetrics categories
  • Application Gateway (if used): Enable ApplicationGatewayAccessLog
  • Key Vault (if used): Enable AuditEvent

Pattern, not prescription: Use your existing IaC approach (Bicep, Terraform, ARM) to deploy diagnostic settings. The specific syntax changes with Azure API versions. The requirement is consistent: every resource, same workspace, all relevant categories.

Enforce with Azure Policy: Diagnostic settings drift. Someone deploys a new Azure OpenAI resource and forgets to configure logging. Now you have a blind spot. Use Azure Policy to enforce diagnostic settings at the subscription or management group level. Built-in policies exist for most resource types. Custom policies fill the gaps. The policy should audit or deny resources that lack diagnostic settings pointing to your designated workspace. This isn’t optional governance overhead. It’s how you ensure observability remains complete as your AI infrastructure grows. If a resource can exist without being observed, eventually one will.

Log Analytics Workspace Design

For most organizations, a single workspace per environment (dev/staging/prod) is sufficient. Reasons to split:

  • Regulatory requirements for data residency
  • Cost allocation to different business units
  • Retention requirements that differ by data type
  • Regional deployment for alert latency

That last one matters more than most documentation acknowledges. Log alert rules execute in the region where the workspace lives. If your workspace is in East US and your AI infrastructure spans West Europe, alert queries cross regions before firing. That latency adds up. For time-sensitive alerts, consider regional workspaces colocated with the infrastructure they monitor. The tradeoff is cross-workspace query complexity when you need a global view, but Azure Monitor supports cross-workspace queries for that purpose.
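
A sketch of that cross-workspace pattern, assuming regional workspaces named 'ai-logs-weu' and 'ai-logs-eus' (substitute your own workspace names or resource IDs):

// Sketch: global P95 latency view across regional workspaces using workspace().
// Workspace names are placeholders.
union 
   workspace('ai-logs-weu').AzureDiagnostics,
   workspace('ai-logs-eus').AzureDiagnostics
|  where TimeGenerated > ago(1h)
|  where ResourceProvider has 'microsoft.cognitiveservices'
      and Category has 'requestresponse'
|  extend deployment = tostring(todynamic(properties_s).deploymentName)
|  summarize p95LatencyMs = round(percentile(toreal(todynamic(properties_s).durationMs), 95), 0)
      by deployment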

Default to consolidation, but recognize when regional distribution earns its complexity.

Retention Configuration

  • Interactive retention (fast queries): 30-90 days based on cost tolerance
  • Archive retention (slow queries): 1-7 years based on compliance requirements
  • Specific tables can have different retention if needed

Layer 4 governance data often requires longer retention than Layer 1 infrastructure metrics. Configure table-level retention accordingly.

Custom Event Pipeline

Your application emits custom events to Application Insights. Those events need to flow to the same Log Analytics workspace as your infrastructure diagnostics.

Options:

  • Application Insights workspace-based mode (events land directly in Log Analytics)
  • Classic Application Insights with data export to Log Analytics
  • Direct Log Analytics ingestion via Data Collection Rules

Workspace-based Application Insights is the current recommended pattern. It eliminates the export step and ensures custom events are queryable alongside Azure diagnostics.
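
One practical payoff: a single query can put Layer 1 and Layer 3 signals side by side. A sketch using the same custom event schema as the orchestration queries above, charting hourly model P95 latency next to hourly satisfaction rate:

// Sketch: correlate model latency (Layer 1) with satisfaction (Layer 3) per hour
let modelLatency = AzureDiagnostics
   |  where TimeGenerated > ago(24h)
   |  where ResourceProvider has 'microsoft.cognitiveservices'
         and Category has 'requestresponse'
   |  summarize p95LatencyMs = percentile(toreal(todynamic(properties_s).durationMs), 95)
         by bin(TimeGenerated, 1h);
let satisfaction = customEvents
   |  where TimeGenerated > ago(24h)
   |  where name has 'ai_interaction'
   |  summarize satisfactionRate = round(countif(tobool(customDimensions.markedHelpful) == true) * 100.0 / count(), 1)
         by bin(TimeGenerated, 1h);
modelLatency
|  join kind=inner satisfaction on TimeGenerated
|  project TimeGenerated, p95LatencyMs, satisfactionRate
|  render timechart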

Alert Rule Deployment

Scheduled query alerts require:

  • Log Analytics workspace (data source)
  • Action group (notification targets)
  • Alert rule (query + threshold + schedule)

Deploy action groups first, then reference them in alert rules. But what those action groups do depends entirely on your ITSM maturity.

If you have a robust ITSM practice with event correlation, everything may flow to your ITSM as events, then get evaluated by a correlation engine that deduplicates, enriches, and routes based on operational context. ServiceNow Event Management, PagerDuty Event Intelligence, or similar platforms handle the “what actually needs attention” logic. Your action groups just push events into that pipeline.

Many organizations don’t have this level of sophistication. For those environments, the common action group pattern:

  • Critical: PagerDuty/ServiceNow incident creation + email
  • Warning: Email + Teams channel
  • Informational: Teams channel only

This will make for a noisy Teams channel. That’s the tradeoff for not having correlation infrastructure. The alternative is missing things. As your practice matures, you’ll either build tolerance for the noise, implement better filtering at the action group level, or invest in proper event correlation. All three are valid paths depending on organizational appetite.

Don’t create alert rules without action groups. An alert that notifies no one is a log entry, not an alert.


The Feedback Loop

Observability isn’t complete until it feeds back into operations.

1. Metrics surface problems
2. Alerts notify responders
3. Investigation identifies root cause
4. Resolution addresses immediate issue
5. Post-incident review identifies systemic improvements
6. Improvements update thresholds, baselines, or architecture
7. Updated observability catches the next problem earlier

The governance layer closes a second loop:

1. Confidence metrics track capability performance
2. Thresholds determine authority levels
3. Authority changes are logged with evidence
4. Reviews validate that authority remains justified
5. Reviews update thresholds based on operational learning
6. Updated thresholds drive future authority decisions

The observability infrastructure is itself a system that needs improvement over time. Baselines drift. Thresholds need adjustment. New failure modes emerge. Treat your monitoring like you treat your platform: something that evolves, not something you deploy and forget.


What You Have Now

Five parts. Four layers. A framework for making AI observability as rigorous as infrastructure observability.

Layer 1 monitors the model infrastructure. Token consumption, latency, content filters. The foundation.

Layer 2 monitors the grounding layer. Search health, retrieval quality, corpus freshness. Where RAG fails silently.

Layer 3 monitors the orchestration layer. User outcomes, conversation quality, semantic signals. Where value is measured.

Layer 4 monitors governance. Authority tracking, confidence thresholds, compliance evidence. Where accountability lives.

Part 5 makes it operational. Alerts that fire on meaningful conditions. Workbooks that answer specific questions for specific audiences. Deployment patterns that establish the infrastructure.

The framework assumes you’ve already internalized the Confidence Engineering premise: that confidence is empirical, built through evidence, and requires observable criteria. This series is the observability that makes confidence measurable.

The goal was never dashboards. The goal was defensible decisions about AI capabilities, grounded in evidence, with audit trails that prove you’re governing responsibly.

That’s what observability makes possible.


This concludes the AI Observability series. Part 1: The Model Layer | Part 2: The Grounding Layer | Part 3: The Orchestration Layer | Part 4: The Governance Layer

