AI Observability, Part 3: The Orchestration Layer
Monitoring Meaning
This is where observability gets hard.
Layers 1 and 2 monitor infrastructure. Azure gives you diagnostic logs. You query them. The patterns are familiar if you’ve done any cloud observability work. Services run, requests complete, latency is measurable, errors have codes.
Layer 3 monitors meaning. Did the response help? Was the retrieval relevant? Did the user accomplish their goal?
Azure can’t answer these questions because Azure doesn’t know what “success” looks like for your application. You have to define it. Then you have to instrument it. Then you have to analyze it.
The gap between “the system worked” and “the system produced value” is where most AI observability stops. It’s also where most AI value leaks away.
Part 1 covered the model layer: infrastructure metrics for Azure OpenAI. Part 2 covered the grounding layer: search service health and retrieval quality signals.
Both layers can be green while users get garbage. The model responded quickly. Retrieval returned chunks. Content filters passed. Every metric looks healthy. The response was still wrong, unhelpful, or misleading.
This part covers the instrumentation your application must emit to make semantic quality observable. None of this comes from Azure diagnostics. All of it comes from your code.
The Instrumentation Contract
Before writing queries, you need telemetry to query. Your orchestration code must emit custom events that capture what Azure can’t see.
Minimum viable instrumentation per AI interaction:
Request Context:
- conversation_id: Links multi-turn interactions
- turn_number: Position in conversation
- query_intent: Your classification of what the user asked
- user_segment: Cohort for analysis (internal/external, role, etc.)
Retrieval Metrics (from Layer 2):
- chunks_retrieved: Count of chunks returned
- top_similarity_score: Best match score
- retrieval_latency_ms: Time spent in search
Generation Metrics:
- model_deployment: Which model served this request
- prompt_tokens: Input token count
- completion_tokens: Output token count
- generation_latency_ms: Time spent in model call
Quality Signals:
- content_filter_triggered: Did safety filters fire?
- guardrail_intervention: Did your custom guardrails intervene?
- fallback_activated: Did the system fall back to a safe response?
Outcome Signals (when available):
- user_feedback: Explicit thumbs up/down or rating
- user_action: What the user did next (retry, abandon, proceed)
The conversation_id is critical. Without it, you can’t track degradation across turns, connect feedback to specific interactions, or analyze conversation-level patterns.
This is the contract between your application and your observability layer. Skip it and Layer 3 doesn’t exist.
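To make the contract concrete, here is a minimal Python sketch of what emitting it can look like. emit_event, retrieve, and generate are hypothetical placeholders for your telemetry wrapper, your Layer 2 search call, and your model call; the property names line up with the queries that follow.
# Purpose: emit one 'ai_interaction' custom event per request/response cycle
# Hypothetical pieces: emit_event(), retrieve(), and generate() stand in for
# your telemetry wrapper, Layer 2 search call, and model call
import json
import time

def emit_event(name: str, properties: dict) -> None:
    # Stand-in: replace with your telemetry client's custom-event call so the
    # properties dict lands in customDimensions in Application Insights
    print(json.dumps({"name": name, "properties": properties}))

def retrieve(query: str) -> list[dict]:
    return [{"text": "example chunk", "score": 0.87}]  # placeholder search call

def generate(query: str, chunks: list[dict]) -> dict:
    return {"text": "example answer", "deployment": "your-deployment-name",
            "prompt_tokens": 1200, "completion_tokens": 180}  # placeholder model call

def handle_turn(conversation_id: str, turn_number: int, query: str,
                query_intent: str, user_segment: str) -> str:
    start = time.monotonic()
    chunks = retrieve(query)
    retrieval_ms = round((time.monotonic() - start) * 1000)

    gen_start = time.monotonic()
    result = generate(query, chunks)
    generation_ms = round((time.monotonic() - gen_start) * 1000)

    emit_event("ai_interaction", {
        # Request context
        "conversationId": conversation_id,
        "turnNumber": turn_number,
        "queryIntent": query_intent,
        "userSegment": user_segment,
        # Retrieval metrics
        "chunksRetrieved": len(chunks),
        "topSimilarityScore": max((c["score"] for c in chunks), default=0.0),
        "retrievalLatencyMs": retrieval_ms,
        # Generation metrics
        "modelDeployment": result["deployment"],
        "promptTokens": result["prompt_tokens"],
        "completionTokens": result["completion_tokens"],
        "generationLatencyMs": generation_ms,
        "totalLatencyMs": round((time.monotonic() - start) * 1000),
    })
    return result["text"]
The quality and outcome signals attach the same way: add contentFilterTriggered, guardrailIntervention, and fallbackActivated to the same dictionary once your pipeline knows them.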
What You’re Measuring
With custom instrumentation in place, you can query Application Insights for patterns Azure diagnostics will never reveal.
A note on schema: These queries assume you’re emitting custom events to Application Insights with the property names shown. Your implementation will differ. The patterns matter more than the exact field names.
Pattern 1: End-to-End Latency Decomposition
Total response time is a number. Latency broken down by pipeline stage is actionable.
// Purpose: Break down total response time by pipeline stage
// Use case: Identify bottlenecks, optimize the slowest component first
// Returns: Latency percentiles by stage with relative contribution
customEvents
| where timestamp > ago(24h)
| where name has 'ai_interaction'
| extend
retrievalMs = toreal(customDimensions.retrievalLatencyMs),
generationMs = toreal(customDimensions.generationLatencyMs),
preprocessMs = toreal(customDimensions.preprocessLatencyMs),
postprocessMs = toreal(customDimensions.postprocessLatencyMs),
totalMs = toreal(customDimensions.totalLatencyMs),
deployment = tostring(customDimensions.modelDeployment)
| summarize
['Retrieval P50'] = percentile(retrievalMs, 50),
['Retrieval P95'] = percentile(retrievalMs, 95),
['Generation P50'] = percentile(generationMs, 50),
['Generation P95'] = percentile(generationMs, 95),
['Total P50'] = percentile(totalMs, 50),
['Total P95'] = percentile(totalMs, 95),
['Request Count'] = count()
by deployment, bin(timestamp, 1h)
| extend
['Retrieval Share'] = round(['Retrieval P50'] * 100.0 / ['Total P50'], 1),
['Generation Share'] = round(['Generation P50'] * 100.0 / ['Total P50'], 1)
| order by timestamp desc
If retrieval dominates latency, optimize your search tier, add caching, or reduce chunk count. If generation dominates, consider smaller models, prompt compression, or streaming responses.
The ratio shifts over time. A prompt change that adds context improves quality but increases generation time. A caching layer reduces retrieval latency but might serve stale results. Understanding where time goes lets you make informed tradeoffs.
Pattern 2: Retrieval-to-Quality Correlation
High similarity scores should predict good outcomes. If they don’t, your embedding model and corpus are misaligned.
// Purpose: Correlate retrieval metrics with response quality signals
// Use case: Determine similarity score thresholds that predict good outcomes
// Returns: Quality metrics bucketed by retrieval score ranges
customEvents
| where timestamp > ago(7d)
| where name has 'ai_interaction'
| extend
topScore = toreal(customDimensions.topSimilarityScore),
chunksReturned = toint(customDimensions.chunksRetrieved),
userRating = toint(customDimensions.userFeedbackScore),
wasHelpful = tobool(customDimensions.markedHelpful),
hadFollowup = tobool(customDimensions.userAskedFollowup),
queryIntent = tostring(customDimensions.queryIntent)
| extend scoreBucket = case(
topScore >= 0.9, '0.9+ Excellent',
topScore >= 0.8, '0.8-0.9 Good',
topScore >= 0.7, '0.7-0.8 Marginal',
topScore >= 0.6, '0.6-0.7 Poor',
'Below 0.6 Failing'
)
| summarize
['Avg User Rating'] = round(avg(userRating), 2),
['Helpful Rate'] = round(countif(wasHelpful == true) * 100.0 / count(), 1),
['Followup Rate'] = round(countif(hadFollowup == true) * 100.0 / count(), 1),
['Sample Size'] = count()
by scoreBucket, queryIntent
| order by scoreBucket asc
The buckets reveal where your quality cliff lives. If “0.7-0.8 Marginal” still produces 80% helpful rates, your threshold is appropriate. If “0.8-0.9 Good” produces 50% helpful rates, something is broken in how retrieval connects to generation.
The follow-up rate is an underrated signal. Users who ask clarifying questions are telling you the first response was incomplete. High follow-up rates on specific intents indicate systematic gaps.
Pattern 3: Conversation Degradation Tracking
Multi-turn conversations degrade. Context windows fill with history. The model starts losing coherence. Users get frustrated.
// Purpose: Detect quality degradation across multi-turn conversations
// Use case: Identify context window exhaustion, topic drift, user frustration
// Returns: Quality and latency trends by turn number
customEvents
| where timestamp > ago(7d)
| where name has 'ai_interaction'
| extend
conversationId = tostring(customDimensions.conversationId),
turnNumber = toint(customDimensions.turnNumber),
generationMs = toreal(customDimensions.generationLatencyMs),
promptTokens = toint(customDimensions.promptTokens),
wasHelpful = tobool(customDimensions.markedHelpful),
userAbandoned = tobool(customDimensions.sessionAbandoned)
| where turnNumber <= 20 // Cap for meaningful analysis
| summarize
['Avg Latency'] = round(avg(generationMs), 0),
['Avg Prompt Tokens'] = round(avg(promptTokens), 0),
['Helpful Rate'] = round(countif(wasHelpful == true) * 100.0 / count(), 1),
['Abandon Rate'] = round(countif(userAbandoned == true) * 100.0 / count(), 1),
['Conversation Count'] = dcount(conversationId)
by turnNumber
| order by turnNumber asc
Prompt tokens climbing linearly means your context management is accumulating history without summarization. You’re paying for tokens that add noise, not value.
Helpful rate dropping after turn 5 suggests context window pollution. The model is drowning in conversation history and losing focus on the current question.
Abandon rate spiking at specific turns reveals where users give up. If turn 3 has 40% abandonment, something about how you handle the third exchange is broken.
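When the prompt-token curve is the culprit, one illustrative mitigation is a token-budgeted history window, so older turns fall out of the prompt instead of accumulating. A minimal Python sketch, where count_tokens is a crude placeholder for a real tokenizer:
# Purpose: cap conversation history by token budget so prompt tokens stop
# climbing linearly with turn number
# Assumption: count_tokens() is a placeholder; use your model's tokenizer

def count_tokens(text: str) -> int:
    return len(text.split())  # crude stand-in for a real tokenizer

def trim_history(messages: list[dict], budget: int = 4000) -> list[dict]:
    # Keep the system prompt plus the most recent turns that fit the budget
    system, turns = messages[:1], messages[1:]
    kept, used = [], 0
    for message in reversed(turns):
        cost = count_tokens(message["content"])
        if used + cost > budget:
            break
        kept.append(message)
        used += cost
    return system + list(reversed(kept))
Whether you trim, summarize, or both, the average prompt tokens per turn in the query above tells you whether the change actually held.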
Pattern 4: Guardrail and Fallback Analysis
Your guardrails should fire rarely. When they fire frequently, either users are testing boundaries or your guardrails are too aggressive.
// Purpose: Monitor safety interventions and fallback behavior
// Use case: Tune guardrails, identify edge cases, detect abuse patterns
// Returns: Intervention rates by type and query intent
customEvents
| where timestamp > ago(7d)
| where name has 'ai_interaction'
| extend
contentFilterTriggered = tobool(customDimensions.contentFilterTriggered),
guardrailIntervention = tobool(customDimensions.guardrailIntervention),
fallbackActivated = tobool(customDimensions.fallbackActivated),
interventionReason = tostring(customDimensions.interventionReason),
queryIntent = tostring(customDimensions.queryIntent),
deployment = tostring(customDimensions.modelDeployment)
| summarize
['Content Filter Rate'] = round(countif(contentFilterTriggered == true) * 100.0 / count(), 2),
['Guardrail Rate'] = round(countif(guardrailIntervention == true) * 100.0 / count(), 2),
['Fallback Rate'] = round(countif(fallbackActivated == true) * 100.0 / count(), 2),
['Total Interventions'] = countif(contentFilterTriggered == true
or guardrailIntervention == true
or fallbackActivated == true),
['Request Count'] = count()
by queryIntent, deployment, bin(timestamp, 1d)
| extend ['Intervention Rate'] = round(['Total Interventions'] * 100.0 / ['Request Count'], 2)
| where ['Request Count'] > 50 // Minimum sample size
| order by ['Intervention Rate'] desc
High intervention rates on legitimate intents mean your guardrails are too aggressive. Users asking reasonable questions are hitting walls.
Low intervention rates on sensitive intents mean your guardrails are too permissive. Content that should be caught is getting through.
The intent-level breakdown tells you where calibration is needed. A customer support bot and an internal code assistant need different guardrail profiles.
Pattern 5: Agent Tool Execution Analysis
If you’re running agents that invoke tools, tool reliability becomes a quality factor. An unreliable tool degrades the entire agent’s effectiveness.
// Purpose: Analyze agent tool usage patterns and success rates
// Use case: Identify unreliable tools, optimize tool selection, detect loops
// Returns: Tool performance metrics with failure analysis
customEvents
| where timestamp > ago(7d)
| where name has 'agent_tool_call'
| extend
conversationId = tostring(customDimensions.conversationId),
toolName = tostring(customDimensions.toolName),
toolSuccess = tobool(customDimensions.toolSuccess),
executionMs = toreal(customDimensions.executionMs),
retryCount = toint(customDimensions.retryCount),
errorCategory = tostring(customDimensions.errorCategory),
stepNumber = toint(customDimensions.reasoningStep)
| summarize
['Success Rate'] = round(countif(toolSuccess == true) * 100.0 / count(), 1),
['Avg Latency'] = round(avg(executionMs), 0),
['P95 Latency'] = round(percentile(executionMs, 95), 0),
['Avg Retries'] = round(avg(retryCount), 2),
['Call Count'] = count(),
['Error Types'] = make_set(errorCategory, 5)
by toolName
| order by ['Success Rate'] asc
Tools with sub-90% success rates need investigation. Either the tool itself is flaky, or the agent is invoking it incorrectly.
High retry counts indicate transient failures. The tool eventually works, but at the cost of latency and token consumption for retry logic.
The error type distribution tells you whether failures are recoverable (timeouts, rate limits) or systematic (bad inputs, missing permissions). Systematic failures need code fixes. Transient failures might just need better retry policies.
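Here is a sketch of the instrumentation side, assuming the same kind of hypothetical emit_event wrapper as earlier: time each tool invocation, retry only on transient failures, and record an error category the query above can aggregate.
# Purpose: wrap agent tool invocations so each one emits an 'agent_tool_call'
# event with the dimensions the query above aggregates
# Assumption: emit_event() is the same hypothetical telemetry wrapper as before
import time

TRANSIENT_ERRORS = (TimeoutError, ConnectionError)  # tune for your tools

def emit_event(name: str, properties: dict) -> None:
    print(name, properties)  # stand-in; reuse the wrapper from the earlier sketch

def call_tool(tool, tool_name: str, conversation_id: str, step_number: int,
              max_retries: int = 2, **kwargs):
    start = time.monotonic()
    retries, error_category, success, result = 0, "", False, None
    while True:
        try:
            result = tool(**kwargs)
            success = True
            break
        except TRANSIENT_ERRORS as exc:  # recoverable: retry up to the limit
            error_category = type(exc).__name__
            if retries >= max_retries:
                break
            retries += 1
        except Exception as exc:  # systematic: fail fast, no retry
            error_category = type(exc).__name__
            break
    emit_event("agent_tool_call", {
        "conversationId": conversation_id,
        "toolName": tool_name,
        "toolSuccess": success,
        "executionMs": round((time.monotonic() - start) * 1000),
        "retryCount": retries,
        "errorCategory": error_category,
        "reasoningStep": step_number,
    })
    return result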
Pattern 6: User Feedback Loop Closure
Explicit feedback is the ground truth for everything else. When users tell you a response was helpful or unhelpful, that’s the signal all other metrics approximate.
// Purpose: Connect explicit user feedback to system behavior
// Use case: Ground truth for quality metrics, model improvement signals
// Returns: Feedback distribution with actionable context
customEvents
| where timestamp > ago(30d)
| where name has 'user_feedback'
| extend
conversationId = tostring(customDimensions.conversationId),
turnNumber = toint(customDimensions.turnNumber),
feedbackType = tostring(customDimensions.feedbackType),
feedbackValue = tostring(customDimensions.feedbackValue),
feedbackReason = tostring(customDimensions.feedbackReason),
queryIntent = tostring(customDimensions.queryIntent),
topRetrievalScore = toreal(customDimensions.topSimilarityScore)
| summarize
['Positive'] = countif(feedbackType has 'up' or toint(feedbackValue) >= 4),
['Negative'] = countif(feedbackType has 'down' or toint(feedbackValue) <= 2),
['Neutral'] = countif(toint(feedbackValue) == 3),
['Total Feedback'] = count(),
['Avg Retrieval Score'] = round(avg(topRetrievalScore), 3),
['Common Complaints'] = make_set(feedbackReason, 10)
by queryIntent, bin(timestamp, 7d)
| extend
['Satisfaction Rate'] = round(['Positive'] * 100.0 / ['Total Feedback'], 1),
['Dissatisfaction Rate'] = round(['Negative'] * 100.0 / ['Total Feedback'], 1)
| order by ['Dissatisfaction Rate'] desc
Negative feedback with high retrieval scores means the model failed despite good grounding. The chunks were relevant, but the synthesis was wrong. That’s a prompt engineering or model selection problem.
Negative feedback with low retrieval scores means your corpus has gaps. The model couldn’t give a good answer because it didn’t have the information. That’s a content problem.
The “Common Complaints” set tells you what users actually say when they’re unhappy. That qualitative signal is worth more than any metric.
The Feedback Problem
Most users don’t leave feedback. Industry benchmarks suggest 1-5% feedback rates on optional mechanisms. Your explicit feedback data is:
- Skewed toward strong opinions (very happy or very frustrated)
- Insufficient sample size for granular analysis
- Biased toward users who understand the feedback mechanism
Implicit signals fill the gap:
- Retry behavior: User immediately rephrased the question
- Session abandonment: User left without completing their task
- Copy/paste actions: User found value worth extracting
- Follow-up patterns: Clarifying questions suggest incomplete answers
- Time-on-response: Very short or very long reading times
These require additional instrumentation but provide signal at scale. A user who copies the response found it useful. A user who immediately asks again did not.
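As one example, here is a sketch of detecting retry behavior server-side, again assuming a hypothetical emit_event wrapper: if the user re-asks a similar question within a short window, record it as an implicit negative signal.
# Purpose: emit a 'user_action' event when a user immediately rephrases a
# question, so implicit negative signals show up alongside explicit feedback
# Assumptions: emit_event() is the hypothetical telemetry wrapper from earlier;
# the similarity check is a crude token-overlap placeholder
import time

def emit_event(name: str, properties: dict) -> None:
    print(name, properties)  # stand-in for your telemetry client

def looks_like_retry(previous: str, current: str, threshold: float = 0.5) -> bool:
    prev_terms, cur_terms = set(previous.lower().split()), set(current.lower().split())
    if not prev_terms or not cur_terms:
        return False
    overlap = len(prev_terms & cur_terms) / len(prev_terms | cur_terms)
    return overlap >= threshold

def record_query(state: dict, conversation_id: str, query: str,
                 retry_window_s: float = 60.0) -> None:
    # state holds the previous query and its timestamp for this conversation
    now = time.monotonic()
    previous = state.get("last_query")
    if (previous and now - state["last_time"] < retry_window_s
            and looks_like_retry(previous, query)):
        emit_event("user_action", {
            "conversationId": conversation_id,
            "actionType": "retry",
            "secondsSincePrevious": round(now - state["last_time"], 1),
        })
    state["last_query"], state["last_time"] = query, now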
What This Layer Can’t Tell You
You now have latency decomposition, retrieval-quality correlation, conversation degradation tracking, guardrail analysis, tool reliability metrics, and feedback loops. Your orchestration layer is observable.
You still don’t know whether the response was factually correct.
Semantic quality metrics tell you the user was satisfied, not that they should have been. A confidently wrong answer that sounds authoritative can score well on every metric until someone acts on it and discovers the error.
The system worked. The user was happy. The answer was wrong. That failure mode is invisible to automated observability.
That’s what Layer 4 addresses: governance, audit trails, and the organizational infrastructure that catches what metrics miss.
What’s Next?
Coming next: Part 4: The Governance Layer (publishing January 25, 2026)
Technical observability tells you what happened. Governance observability tells you whether it was acceptable and proves you’re governing responsibly with audit trails and compliance evidence.
Photo by Daniel Lerman on Unsplash