KQL for Infrastructure Teams - Beyond Basic Queries
Writing Queries That Actually Solve Problems
KQL syntax matters, but this guide assumes you already have that foundation. This is about what to do with it: how to think like an infrastructure team.
A warning about the KQL: learn the patterns, not my queries. These are examples. You will need to adapt each one to your environment and goals. Some queries I’ve sanity checked, some I have not. The patterns are what you should walk away with.
Every monitoring implementation I’ve touched shares the same problem: teams build dashboards that look complete but aren’t, answering questions nobody asked while leaving gaps they won’t discover until something breaks and the dashboard can’t explain why.
KQL isn’t the problem. The problem is that most infrastructure teams approach monitoring like developers approach logging. Capture everything, aggregate nothing meaningful, then wonder why their “comprehensive” observability platform can’t tell them why the application is slow.
The queries are syntactically correct. The dashboards are visually impressive. And the incident bridges are still full of fatigued people staring at graphs that don’t explain anything, burning goodwill and attention on problems that better tooling should have prevented or at least clarified.
This isn’t another syntax tutorial. It’s the operational thinking that took me years to learn the hard way.
The Infrastructure Mindset
Infrastructure teams think differently than developers or analysts. We live in a world where:
Capacity isn’t a number. It’s a trajectory. I don’t care that you’re at 60% CPU right now. I care whether you’ll be at 95% next Tuesday when marketing launches their campaign and nobody told you.
Individual errors are noise. Patterns are signal. One timeout means nothing. The same timeout happening every 47 minutes on the same three servers? That’s a DNS cache expiration problem someone’s going to discover at the worst possible moment.
Baselines can’t be static. Your “normal” on Monday morning looks nothing like your “normal” during month-end processing. Hardcoded thresholds are how you train your team to ignore alerts.
Nothing exists in isolation. When the database is slow, the question isn’t “why is the database slow?” The question is “what changed upstream that’s hammering the database in a pattern it wasn’t designed for?”
Your KQL queries should reflect this thinking. If yours don’t yet, that’s fine. Here’s how to get there.
Essential Patterns for Infrastructure
1. Capacity Planning Queries
Here’s what I see in almost every environment:
// What everyone writes
Perf
| where TimeGenerated > ago(1h)
| where CounterName has 'processor time'
| summarize avg(CounterValue) by Computer
This tells you nothing useful. It’s a snapshot. A single frame from a movie you need to understand.
// What actually helps you plan
Perf
| where TimeGenerated > ago(30d)
| where CounterName has 'processor time'
| summarize
    Current = avgif(CounterValue, TimeGenerated > ago(7d)),   // the most recent week
    P95 = percentile(CounterValue, 95),                        // the shape, not just the average
    Baseline = avgif(CounterValue, TimeGenerated < ago(7d))    // the weeks before that
    by Computer
| extend Growth = (Current - Baseline) / Baseline * 100
| where Growth > 5 // Flag servers whose last week runs >5% above their earlier baseline
That Growth calculation is the difference between “we need more capacity sometime” and “we need more capacity before the 15th or we’re going to have a bad day.”
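To turn that growth number into a date, here is a rough extrapolation sketch, one step beyond the query above. It assumes roughly linear growth and a 90% ceiling; both are placeholders, not measurements.
// Rough sketch: estimate weeks until a server crosses 90% CPU, assuming linear growth
Perf
| where TimeGenerated > ago(30d)
| where CounterName has 'processor time'
| summarize
    Current = avgif(CounterValue, TimeGenerated > ago(7d)),
    Baseline = avgif(CounterValue, TimeGenerated < ago(7d))
    by Computer
| extend GrowthPct = (Current - Baseline) / Baseline * 100
| where GrowthPct > 0 and Current < 90
| extend WeeksUntil90 = (90.0 - Current) / (Current * GrowthPct / 100.0)
| where WeeksUntil90 between (0.0 .. 12.0) // only servers that get there within a quarter
| order by WeeksUntil90 asc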
I learned this lesson watching a cache mount location on a web server gradually fill with improperly configured core dumps. Slow growth, barely noticeable week to week. Then the disk filled, the web server crashed, and a retail site went down during peak hours. A simple trend query would have caught it weeks earlier.
2. Failure Pattern Analysis
Stop treating every error as a unique snowflake:
Event
| where TimeGenerated > ago(7d)
| where EventLevelName has 'error'
| extend ErrorPattern = extract(@'(\w+Exception|\w+Error|\w+Failure)', 1, RenderedDescription)
| summarize
    ErrorCount = count(),
    AffectedSystems = dcount(Computer),
    FirstSeen = min(TimeGenerated),
    LastSeen = max(TimeGenerated),
    SampleMessage = take_any(RenderedDescription)
    by ErrorPattern
| where ErrorCount > 10
| order by ErrorCount desc
The dcount(Computer) is the critical piece. One server throwing 500 errors is probably a bad deployment. Fifteen servers throwing the same error is a shared dependency failing. I’ve watched teams spend hours troubleshooting individual servers when a five-minute pattern analysis would have pointed them at a misconfigured load balancer affecting everything downstream.
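If you want that triage logic in the result set itself, here is a variation of the query above that labels each pattern by blast radius. The cutoffs of 3 and 10 systems are starting points to tune, not rules.
// Variation: classify each error pattern by how widely it has spread
Event
| where TimeGenerated > ago(7d)
| where EventLevelName has 'error'
| extend ErrorPattern = extract(@'(\w+Exception|\w+Error|\w+Failure)', 1, RenderedDescription)
| where isnotempty(ErrorPattern)
| summarize ErrorCount = count(), AffectedSystems = dcount(Computer) by ErrorPattern
| extend Scope = case(
    AffectedSystems >= 10, 'Shared dependency - check load balancers, DNS, storage, auth',
    AffectedSystems >= 3, 'Cluster or role - check a deployment ring or config group',
    'Single host - check that box')
| order by AffectedSystems desc, ErrorCount desc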
3. Performance Baseline Establishment
Static thresholds are a form of institutional lying. You’re telling your monitoring system that “normal” is a fixed state when you know it isn’t.
let baseline_period = 14d;
let analysis_period = 1d;
Perf
| where TimeGenerated > ago(baseline_period)
| where CounterName has 'processor time'
    or CounterName has 'available mbytes'
    or CounterName has 'disk transfers/sec'
| extend IsBaseline = TimeGenerated < ago(analysis_period)
| summarize
    BaselineAvg = avgif(CounterValue, IsBaseline),
    BaselineStdDev = stdevif(CounterValue, IsBaseline),
    CurrentAvg = avgif(CounterValue, not(IsBaseline)),
    CurrentMax = maxif(CounterValue, not(IsBaseline))
    by Computer, CounterName
| extend DeviationFromBaseline = (CurrentAvg - BaselineAvg) / BaselineStdDev
| extend AlertLevel = case(
    abs(DeviationFromBaseline) > 3, 'Critical',
    abs(DeviationFromBaseline) > 2, 'Warning',
    'Normal')
| where AlertLevel != 'Normal'
The standard deviation approach means your alerts actually mean something. Two standard deviations from normal on a stable system is worth investigating. Two standard deviations on a system that’s always volatile might be Tuesday.
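One way to encode that caveat is to let the baseline's own noise set the bar. This is a variation on the query above; the 0.5 volatility cutoff and the 4-sigma requirement are arbitrary starting points.
// Variation: chronically noisy servers need a bigger deviation before anyone gets paged
Perf
| where TimeGenerated > ago(14d)
| where CounterName has 'processor time'
| extend IsBaseline = TimeGenerated < ago(1d)
| summarize
    BaselineAvg = avgif(CounterValue, IsBaseline),
    BaselineStdDev = stdevif(CounterValue, IsBaseline),
    CurrentAvg = avgif(CounterValue, not(IsBaseline))
    by Computer
| extend Volatility = BaselineStdDev / BaselineAvg // coefficient of variation
| extend RequiredSigma = iff(Volatility > 0.5, 4.0, 2.0)
| extend Deviation = (CurrentAvg - BaselineAvg) / BaselineStdDev
| where abs(Deviation) > RequiredSigma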
Advanced Infrastructure Scenarios
Cross-System Impact Analysis
This is where most monitoring falls apart. Systems don’t fail in isolation, but we monitor them like they do.
let failure_window = 5m;
let primary_failures =
Event
| where TimeGenerated > ago(1h)
| where EventLevelName has 'error'
| project TimeGenerated, Computer, EventID, RenderedDescription;
primary_failures
| extend WindowKey = bin(TimeGenerated, failure_window)
| join kind=inner (
    primary_failures
    | extend JoinTime = TimeGenerated
    // KQL joins only match on equality, so expand each event into the window
    // keys it could pair with (the classic time-window-join pattern)
    | mv-expand WindowKey = pack_array(
        bin(TimeGenerated - failure_window, failure_window),
        bin(TimeGenerated, failure_window),
        bin(TimeGenerated + failure_window, failure_window)) to typeof(datetime)
) on WindowKey
| where Computer != Computer1
| where abs(TimeGenerated - JoinTime) <= failure_window
| summarize
    CorrelatedFailures = count(),
    Systems = make_set(strcat(Computer, '->', Computer1))
    by EventID, EventID1
| where CorrelatedFailures > 3
The five-minute window is a starting point. Adjust based on your architecture. Tightly coupled systems might correlate within seconds. Loosely coupled systems with queue-based communication might take minutes for failures to cascade.
I’ve used variants of this query to identify shared storage problems that were manifesting as application timeouts, network failures that looked like database performance issues, and authentication service degradation that was being blamed on individual applications.
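If you are not sure which window fits, one option is to swap the final summarize and filter in the query above for a version that also reports how far apart the correlated events actually land, then let that number drive the window:
| summarize
    CorrelatedFailures = count(),
    TypicalLagSeconds = percentile(abs(TimeGenerated - JoinTime) / 1s, 50),
    Systems = make_set(strcat(Computer, '->', Computer1))
    by EventID, EventID1
| where CorrelatedFailures > 3
| order by TypicalLagSeconds asc
A median lag of a few seconds says tighten the window; a median hugging the five-minute edge says widen it and rerun.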
Predictive Capacity Alerts
Alert before the problem, not during it:
let prediction_days = 30;
Perf
| where TimeGenerated > ago(7d)
| where CounterName has 'free megabytes'
| summarize
    CurrentFree = avgif(CounterValue, TimeGenerated > ago(1d)),
    WeekAgoFree = avgif(CounterValue, TimeGenerated < ago(6d))
    by Computer, InstanceName
| extend DailyChange = (WeekAgoFree - CurrentFree) / 7.0
| extend DaysUntilFull = iff(DailyChange > 0, CurrentFree / DailyChange, real(null))
| extend AlertLevel = case(
    DaysUntilFull < 7, 'Critical',
    DaysUntilFull < 14, 'Warning',
    DaysUntilFull < prediction_days, 'Watch',
    'OK')
| where AlertLevel != 'OK'
| project Computer, Drive=InstanceName, DaysUntilFull, AlertLevel, CurrentFreeGB=CurrentFree/1024
“DaysUntilFull” is the metric that should be on your executive dashboard. Not current utilization. Not percentage free. The number of days until someone’s going to have a bad Friday afternoon.
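For the dashboard tile itself you usually want one roll-up number rather than a row per drive. A minimal sketch of that roll-up, repeating the trend math from above so it stands alone:
// Roll-up sketch: how many drives are inside the 30-day window, and the worst case
Perf
| where TimeGenerated > ago(7d)
| where CounterName has 'free megabytes'
| summarize
    CurrentFree = avgif(CounterValue, TimeGenerated > ago(1d)),
    WeekAgoFree = avgif(CounterValue, TimeGenerated < ago(6d))
    by Computer, InstanceName
| extend DailyChange = (WeekAgoFree - CurrentFree) / 7.0
| extend DaysUntilFull = iff(DailyChange > 0, CurrentFree / DailyChange, real(null))
| where isnotnull(DaysUntilFull)
| summarize DrivesUnder30Days = countif(DaysUntilFull < 30), WorstCaseDays = round(min(DaysUntilFull), 1)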
Common Infrastructure Anti-Patterns
1. The “Everything Alert”
I’ve walked into environments with 400+ active alerts. At that point, you just walk away.
// This creates noise, not signal
Event
| where EventLevelName == "Error"
| project TimeGenerated, Computer, RenderedDescription
// This creates something you might actually act on
Event
| where TimeGenerated > ago(1h)
| where EventLevelName has 'error'
| summarize ErrorCount = count() by Computer, EventID
| where ErrorCount > 5
| join kind=inner (
    Event
    | where TimeGenerated > ago(1h)
    | where EventLevelName has 'error'
    | summarize RenderedDescription = take_any(RenderedDescription) by EventID
) on EventID
| project Computer, ErrorCount, Description = RenderedDescription
The filter for ErrorCount > 5 isn’t arbitrary. It’s asking: “Is this happening often enough that someone should stop what they’re doing and look at it?” One error isn’t actionable. Five identical errors in an hour might be.
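The same threshold works as a standing check if you bucket by hour with bin(). A small variation that shows which hours actually crossed the line over the last day:
Event
| where TimeGenerated > ago(24h)
| where EventLevelName has 'error'
| summarize ErrorCount = count() by Computer, EventID, Hour = bin(TimeGenerated, 1h)
| where ErrorCount > 5
| order by Hour desc, ErrorCount desc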
2. The “Point-in-Time Trap”
Snapshots lie. They lie by omission.
// This snapshot tells you almost nothing useful
Perf
| where TimeGenerated > ago(5m)
| where CounterName has 'processor time'
| summarize avg(CounterValue) by Computer
// This tells you what's happening AND which direction it's going
Perf
| where TimeGenerated > ago(1h)
| where CounterName has 'processor time'
| summarize
    Current = avgif(CounterValue, TimeGenerated > ago(5m)),
    Previous = avgif(CounterValue, TimeGenerated between (ago(1h) .. ago(55m)))
    by Computer
| extend Trend = Current - Previous
| extend TrendDirection = iff(Trend > 5, 'Rising', iff(Trend < -5, 'Falling', 'Stable'))
A server at 70% CPU that was at 40% an hour ago is a completely different situation than a server at 70% CPU that’s been at 70% CPU for a week. The first one needs investigation. The second one is just how that server lives.
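If you want the query to make that distinction for you, here is a short follow-on sketch that only surfaces servers that are both hot and climbing; the 70% and 15-point cutoffs are placeholders to tune.
Perf
| where TimeGenerated > ago(1h)
| where CounterName has 'processor time'
| summarize
    Current = avgif(CounterValue, TimeGenerated > ago(5m)),
    Previous = avgif(CounterValue, TimeGenerated between (ago(1h) .. ago(55m)))
    by Computer
| extend Trend = Current - Previous
| where Current > 70 and Trend > 15 // hot now, and well above where it sat an hour ago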
Query Optimization for Infrastructure Scale
When you’re querying across hundreds of systems and weeks of data, efficiency matters:
// Efficient - the time filter comes first, so the engine prunes by time
// before it evaluates anything else
Perf
| where TimeGenerated > ago(30d)
    and Computer has 'webserver01'
    and CounterName has 'processor time'
// Inefficient - the same filters in reverse order, with the time range applied last
Perf
| where Computer has 'webserver01'
| where CounterName has 'processor time'
| where TimeGenerated > ago(30d)
This seems minor until you’re paying for query compute across thousands of subscriptions and multiple tenants.
TimeGenerated first, always.
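The same discipline extends past the time filter. A sketch of the habit applied end to end, with the specific counter and bin size as placeholders: filter on time first, drop the columns you don't need, then aggregate before anything expensive runs.
Perf
| where TimeGenerated > ago(30d)
| where CounterName has 'processor time'
| project TimeGenerated, Computer, CounterValue
| summarize DailyP95 = percentile(CounterValue, 95) by Computer, bin(TimeGenerated, 1d)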
Building Your KQL Toolkit
The functions that matter for infrastructure work:
- bin() - Time-based aggregation that turns noise into trends
- percentile() - Understanding distribution, because averages hide problems
- make_set() - Collecting related items to see the full picture
- prev() - Comparing to previous values for change detection
- series_decompose() - Separating trend from noise in time series
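A quick sketch that exercises most of these on CPU data; the 90% cutoff, the one-hour bin, and the look-back windows are placeholders to adjust.
// bin(), percentile(), make_set_if() and prev() on the hourly shape of the estate
Perf
| where TimeGenerated > ago(7d)
| where CounterName has 'processor time'
| summarize
    HourlyP95 = percentile(CounterValue, 95),
    HotServers = make_set_if(Computer, CounterValue > 90)
    by Hour = bin(TimeGenerated, 1h)
| sort by Hour asc
| extend P95Change = HourlyP95 - prev(HourlyP95) // prev() needs a serialized (sorted) input
// series_decompose() works on the arrays that make-series produces
Perf
| where TimeGenerated > ago(14d)
| where CounterName has 'processor time'
| make-series CpuSeries = avg(CounterValue) default=0 on TimeGenerated step 1h by Computer
| extend (Baseline, Seasonal, Trend, Residual) = series_decompose(CpuSeries)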
Start with three queries in your environment:
- Capacity trend analysis - Know what’s growing before it’s full
- Failure correlation - Find systemic issues hiding as individual failures
- Performance baseline - Establish what “normal” actually looks like
Then build from there. The goal isn’t query mastery. The goal is asking better questions of your data.
Your monitoring is only as good as the questions you ask. Make sure you’re asking the right ones.
Want to dive deeper into advanced monitoring patterns? Check out the Beyond Azure Monitor series for enterprise-scale monitoring strategies.