Azure Workbooks: Custom Dashboards That Don't Suck
Azure Workbooks: Custom Dashboards That Don’t Suck
Building Dashboards That People Actually Use
Most dashboards are digital wallpaper. Here’s how to build ones that drive action.
KQL Warning: The queries in this article are patterns and examples. You’ll need to adapt them to your environment, table schemas, and operational needs. Some are production-tested, others are illustrative. Focus on the dashboard design patterns, not copying queries verbatim.
You’ve seen them. Dashboards with 47 charts, rainbow color schemes, and metrics that nobody understands. They look impressive in demos but useless in production.
Azure Workbooks can be different. When built right, they become the operational command center your team actually relies on. But most people build them wrong.
This isn’t about making pretty charts. It’s about designing information that drives decisions.
The Dashboard Hierarchy That Works
Executive Dashboard (30-second view)
- Overall health score - one number that matters
- Critical alerts - what needs immediate attention
- Trend indicators - are things getting better or worse?
Operations Dashboard (5-minute view)
- Actionable alerts - what can be fixed right now
- Resource status - capacity, performance, availability
- Recent changes - what might have caused issues
Technical Dashboard (deep-dive view)
- Detailed diagnostics - root cause analysis
- Historical trends - pattern identification
- Correlation analysis - system relationships
Building Your First Useful Workbook
Start With Questions, Not Charts
Before opening Workbooks, ask:
- What decisions does this dashboard need to support?
- What actions should viewers take after seeing it?
- How much time do they have to process the information?
// Workbook parameter for time range selection
{
"type": 1,
"content": {
"json": "## Infrastructure Health Dashboard\n\n*Select time range for analysis*"
}
},
{
"type": 9,
"content": {
"version": "KqlParameterItem/1.0",
"parameters": [
{
"id": "timeRange",
"version": "KqlParameterItem/1.0",
"name": "TimeRange",
"type": 4,
"value": {
"durationMs": 3600000
},
"typeSettings": {
"selectableValues": [
{
"durationMs": 3600000,
"createdTime": "2024-01-01T00:00:00.000Z",
"isInitialTime": false,
"grain": 1,
"useDashboardTimeRange": false
},
{
"durationMs": 14400000
},
{
"durationMs": 43200000
},
{
"durationMs": 86400000
}
]
}
}
]
}
}
The Health Score Component
Every operations dashboard needs one number that summarizes everything:
// Overall infrastructure health calculation
let timeRange = {TimeRange};
let systemHealth =
Heartbeat
| where TimeGenerated > ago(timeRange)
| summarize LastHeartbeat = max(TimeGenerated) by Computer
| extend IsHealthy = LastHeartbeat > ago(5m)
| summarize
TotalSystems = count(),
HealthySystems = countif(IsHealthy)
| extend HealthPercentage = (HealthySystems * 100) / TotalSystems;
let errorRate =
Event
| where TimeGenerated > ago(timeRange)
| where EventLevelName == "Error"
| summarize ErrorCount = count()
| extend ErrorImpact = case(
ErrorCount > 100, 20,
ErrorCount > 50, 10,
ErrorCount > 10, 5,
0
);
let performanceIssues =
Perf
| where TimeGenerated > ago(timeRange)
| where CounterName == "% Processor Time"
| where CounterValue > 80
| summarize HighCpuCount = count()
| extend PerfImpact = case(
HighCpuCount > 1000, 15,
HighCpuCount > 500, 10,
HighCpuCount > 100, 5,
0
);
systemHealth
| extend ErrorImpact = toscalar(errorRate | project ErrorImpact)
| extend PerfImpact = toscalar(performanceIssues | project PerfImpact)
| extend OverallHealth = HealthPercentage - ErrorImpact - PerfImpact
| project
HealthScore = round(OverallHealth, 1),
SystemsOnline = strcat(HealthySystems, "/", TotalSystems),
Status = case(
OverallHealth >= 95, "Excellent",
OverallHealth >= 85, "Good",
OverallHealth >= 70, "Warning",
"Critical"
)
Critical Alerts Section
Show only what needs immediate action:
// Critical alerts requiring immediate attention
// TimeGenerated filter first, then narrow to critical event types
let criticalEvents =
Event
| where TimeGenerated > ago({TimeRange})
| where EventLevelName has 'error'
| where EventID in (1001, 1074, 6008, 41) // System critical events
| summarize
Count = count(),
LastOccurrence = max(TimeGenerated),
AffectedSystems = dcount(Computer),
Sample = any(RenderedDescription)
by EventID
| extend Priority = case(
EventID == 6008, 'P1 - System Crash',
EventID == 1074, 'P1 - Unexpected Shutdown',
EventID == 41, 'P2 - Power Loss',
'P3 - System Error'
);
let resourceAlerts =
Perf
| where TimeGenerated > ago({TimeRange})
| where CounterName has 'processor time' or CounterName has 'available mbytes'
| extend Threshold = case(
CounterName has 'processor time', 90.0,
CounterName has 'available mbytes', 512.0,
0.0
)
| where (CounterName has 'processor time' and CounterValue > Threshold) or
(CounterName has 'available mbytes' and CounterValue < Threshold)
| summarize
Count = count(),
LastOccurrence = max(TimeGenerated),
AffectedSystems = dcount(Computer),
WorstValue = case(
CounterName has 'processor time', max(CounterValue),
min(CounterValue)
)
by CounterName
| extend Priority = case(
CounterName has 'available mbytes', 'P1 - Memory Critical',
'P2 - CPU Critical'
);
union criticalEvents, resourceAlerts
| project Priority, Count, AffectedSystems, LastOccurrence, Details = coalesce(Sample, strcat(CounterName, ': ', WorstValue))
| order by Priority, Count desc
Advanced Workbook Patterns
Interactive Drill-Down
Create dashboards that let users explore deeper:
// Grid component with drill-down capability
{
"type": 3,
"content": {
"version": "KqlItem/1.0",
"query": "Heartbeat\n| where TimeGenerated > ago({TimeRange})\n| summarize LastHeartbeat = max(TimeGenerated) by Computer, OSType\n| extend Status = case(\n LastHeartbeat > ago(5m), \"Online\",\n LastHeartbeat > ago(15m), \"Warning\", \n \"Offline\"\n)\n| order by Status desc, Computer asc",
"size": 0,
"title": "System Status Overview",
"exportFieldName": "Computer",
"exportParameterName": "SelectedComputer",
"queryType": 0,
"gridSettings": {
"formatters": [
{
"columnMatch": "Status",
"formatter": 18,
"formatOptions": {
"thresholdsOptions": "icons",
"thresholdsGrid": [
{
"operator": "==",
"thresholdValue": "Online",
"representation": "success",
"text": "✅ Online"
},
{
"operator": "==",
"thresholdValue": "Warning",
"representation": "warning",
"text": "⚠️ Warning"
},
{
"operator": "Default",
"representation": "error",
"text": "❌ Offline"
}
]
}
}
]
}
}
}
Conditional Visibility
Show different content based on context:
// Conditional text based on health score
{
"type": 1,
"content": {
"json": "## 🚨 Critical Issues Detected\n\nImmediate attention required for the following systems:",
"style": "error"
},
"conditionalVisibility": {
"parameterName": "HealthScore",
"comparison": "isLessThan",
"value": "70"
}
}
Time-Series Visualization
Show trends that matter:
// Resource utilization trends with normalized values for comparison
// TimeGenerated filter first, then narrow to selected systems
Perf
| where TimeGenerated > ago({TimeRange})
| where CounterName has 'processor time' or CounterName has 'available mbytes' or CounterName has 'disk transfers'
| where Computer == '{SelectedComputer}' or '{SelectedComputer}' == 'all'
| extend NormalizedValue = case(
CounterName has 'processor time', CounterValue,
CounterName has 'available mbytes', CounterValue / 1024, // Convert to GB
CounterName has 'disk transfers', CounterValue / 100, // Scale for visibility
CounterValue
)
| summarize avg(NormalizedValue) by CounterName, bin(TimeGenerated, {TimeRange:grain})
| render timechart
with (
title='Resource Utilization Trends',
xtitle='Time',
ytitle='Utilization %'
)
Dashboard Anti-Patterns to Avoid
1. The “Christmas Tree” Dashboard
// Don't do this - too many colors, no meaning
Event
| summarize count() by EventLevelName
| render piechart
with (
title='All Events by Level' // Useless - includes informational
)
// Do this - focus on actionable information
Event
| where TimeGenerated > ago({TimeRange}) // Time filter first
| where EventLevelName has 'error' or EventLevelName has 'warning'
| summarize count() by EventLevelName, bin(TimeGenerated, 1h)
| render columnchart
with (
title='Error and Warning Trends',
series=EventLevelName
)
2. The “Vanity Metric” Trap
// Don't do this - impressive but meaningless
Perf
| summarize
TotalDataPoints = count(),
AverageValue = avg(CounterValue),
MaxValue = max(CounterValue)
by CounterName
// Do this - actionable insights with baseline comparison
Perf
| where TimeGenerated > ago({TimeRange}) // Time filter first
| where CounterName has 'processor time'
| summarize
CurrentAvg = avgif(CounterValue, TimeGenerated > ago(15m)),
BaselineAvg = avgif(CounterValue, TimeGenerated < ago(1h)),
SystemsOverThreshold = dcountif(Computer, CounterValue > 80)
by bin(TimeGenerated, 15m)
| extend PerformanceTrend = case(
CurrentAvg > BaselineAvg * 1.2, 'Degrading',
CurrentAvg < BaselineAvg * 0.8, 'Improving',
'Stable'
)
3. The “Information Overload” Problem
// Don't do this - 20 charts on one page
{
"type": 12,
"content": {
"version": "NotebookGroup/1.0",
"groupType": "editable",
"items": [
// 20 different visualizations...
]
}
}
// Do this - progressive disclosure
{
"type": 11,
"content": {
"version": "LinkItem/1.0",
"style": "tabs",
"links": [
{
"cellValue": "selectedTab",
"linkTarget": "parameter",
"linkLabel": "Overview",
"subTarget": "overview"
},
{
"cellValue": "selectedTab",
"linkTarget": "parameter",
"linkLabel": "Performance",
"subTarget": "performance"
},
{
"cellValue": "selectedTab",
"linkTarget": "parameter",
"linkLabel": "Errors",
"subTarget": "errors"
}
]
}
}
Building Workbooks That Scale
Template Structure
Create reusable components:
// Parameter template for server selection
{
"type": 9,
"content": {
"version": "KqlParameterItem/1.0",
"parameters": [
{
"id": "servers",
"version": "KqlParameterItem/1.0",
"name": "Servers",
"type": 2,
"multiSelect": true,
"quote": "'",
"delimiter": ",",
"query": "Heartbeat\n| where TimeGenerated > ago(1h)\n| distinct Computer\n| order by Computer asc",
"value": ["All"],
"typeSettings": {
"additionalResourceOptions": ["value::all"],
"selectAllValue": "All"
}
}
]
}
}
Performance Optimization
// Efficient queries for large datasets
// Pre-filter early, aggregate appropriately, limit data points
let selectedServers = dynamic([{Servers}]);
Perf
| where TimeGenerated > ago({TimeRange}) // Time filter first
| where Computer in (selectedServers) or 'all' in (selectedServers)
| where CounterName has 'processor time' or CounterName has 'available mbytes'
| summarize avg(CounterValue) by Computer, CounterName, bin(TimeGenerated, {TimeRange}/50) // Limit data points
| extend MetricType = case(
CounterName has 'processor time', 'CPU',
'Memory'
)
Responsive Design
Make dashboards work on different screen sizes:
// Responsive grid layout
{
"type": 12,
"content": {
"version": "NotebookGroup/1.0",
"groupType": "editable",
"items": [
{
"type": 3,
"content": {
"gridSettings": {
"columns": [
{
"id": "Computer",
"width": "200px"
},
{
"id": "Status",
"width": "100px"
},
{
"id": "LastSeen",
"width": "150px"
}
]
}
},
"customWidth": "50"
}
]
}
}
Dashboard Governance
Version Control Your Workbooks
// Include metadata in your workbooks
{
"type": 1,
"content": {
"json": "<!-- \nWorkbook: Infrastructure Health Dashboard\nVersion: 2.1.0\nLast Updated: 2026-01-15\nOwner: Infrastructure Team\nReview Date: 2026-04-15\n-->\n\n# Infrastructure Health Dashboard v2.1"
}
}
Access Control Strategy
- Public dashboards - Overall health, no sensitive data
- Team dashboards - Operational details, limited access
- Admin dashboards - Full diagnostic capability, restricted access
Update Procedures
- Test in development - Use separate Log Analytics workspace
- Validate with stakeholders - Get feedback before deployment
- Document changes - Maintain change log
- Monitor usage - Remove unused dashboards
The Workbook Checklist
Before publishing any dashboard, verify:
- Clear purpose - What decisions does this support?
- Actionable information - Can viewers act on what they see?
- Appropriate detail level - Right information for the audience?
- Performance optimized - Loads in under 10 seconds?
- Mobile friendly - Works on different screen sizes?
- Access controlled - Right people have right access?
- Documented - Purpose and usage instructions clear?
Common Workbook Scenarios
Incident Response Dashboard
// Real-time incident tracking with severity classification
// TimeGenerated filter first for performance
let incidentTimeframe = 4h;
Event
| where TimeGenerated > ago(incidentTimeframe)
| where EventLevelName has 'error'
| summarize
ErrorCount = count(),
FirstError = min(TimeGenerated),
LastError = max(TimeGenerated),
AffectedSystems = dcount(Computer),
ErrorSample = any(RenderedDescription)
by EventID
| extend
Duration = LastError - FirstError,
ErrorRate = ErrorCount / (incidentTimeframe / 1h),
Severity = case(
ErrorCount > 100 and AffectedSystems > 10, 'P1',
ErrorCount > 50 and AffectedSystems > 5, 'P2',
ErrorCount > 10, 'P3',
'P4'
)
| where Severity in ('P1', 'P2') // Focus on high-severity issues
| order by Severity, ErrorCount desc
Capacity Planning Dashboard
// Growth trend analysis with predictive capacity alerts
// TimeGenerated filter first, then calculate trends
let lookback = 30d;
let forecastDays = 90;
Perf
| where TimeGenerated > ago(lookback)
| where CounterName has 'processor time' or CounterName has 'available mbytes' or CounterName has 'free space'
| summarize
CurrentValue = avg(CounterValue),
TrendSlope = (last(CounterValue) - first(CounterValue)) / (lookback / 1d)
by Computer, CounterName, bin(TimeGenerated, 1d)
| extend
ForecastValue = CurrentValue + (TrendSlope * forecastDays),
CapacityAlert = case(
CounterName has 'processor time' and ForecastValue > 80, 'CPU capacity concern',
CounterName has 'available mbytes' and ForecastValue < 1024, 'Memory capacity concern',
CounterName has 'free space' and ForecastValue < 10, 'Disk capacity concern',
'OK'
)
| where CapacityAlert != 'OK'
Making Dashboards Stick
The best dashboard is the one your team actually uses. To ensure adoption:
- Involve users in design - Build with them, not for them
- Start simple - Add complexity gradually
- Measure usage - Track which components get attention
- Iterate based on feedback - Dashboards are never “done”
- Train your team - Show them how to interpret and act on the data
Beyond Pretty Charts
Remember: dashboards are tools, not art projects. Every element should serve a purpose. Every chart should drive a decision. Every alert should trigger an action.
The goal isn’t to impress stakeholders with colorful visualizations. It’s to make your team more effective at keeping systems running.
Build dashboards that solve problems, not ones that create them.
Ready to level up your monitoring game? Check out KQL for Infrastructure Teams to master the queries that power these dashboards.
Photo by Saung Digital on Unsplash