Your Best On-Call Engineer Never Sleeps: A Look at the AWS DevOps Agent
Every Outage Has a 3 AM Moment
You know the one. Your phone buzzes. You squint at a PagerDuty alert that makes zero sense. You open your laptop in bed, hair going in four directions, and you start SSH-ing into boxes while your partner asks if everything is okay… It’s not. You don’t know what broke, you don’t know why, and the Slack channel is already filling up with people who also don’t know.
I’ve been that person more times than I’d like to admit. With a background in networking, and years spent as both an engineer and an architect, I’d have given a lot for a little helper that gathered all the context and data I needed to review the situation and make an informed decision on how to resolve it.
That’s the problem, right? It’s rarely a lack of data. It’s that the data lives in fifteen different tabs across four different tools, and the person stitching it all together is running on cold coffee and adrenaline.
So when AWS launched the DevOps Agent, one of its new frontier agents, I paid attention. Not because I love shiny new services – I’ve been burned by “AI-powered” tooling that was really just a glorified dashboard with a chatbot stapled on. I paid attention because the pitch was different: an autonomous agent that actually investigates incidents the way a senior SRE would. It reads your metrics, your logs, your deployment history, your code changes. It maps your infrastructure topology. And it starts working the moment an alert fires – not when someone finally wakes up and checks Slack.
I’ve spent some time putting the DevOps Agent through its paces, and honestly? It’s changed how I think about on-call. This post walks through the use cases where it actually delivers – the scenarios where having an AI agent that understands your architecture isn’t a gimmick, it’s the thing standing between you and another six-hour outage.
Let’s get into it.
Foundational Concepts
Before we dive into the use cases, let me quickly walk you through the four building blocks you’ll keep hearing about. I promise I’ll keep it short because this isn’t a whitepaper.
Agent Spaces: Your Blast Radius Boundary
Think of an Agent Space as a sandbox. It’s a logical container that tells the DevOps Agent exactly what it’s allowed to touch – which AWS accounts, which third-party tools, which users can interact with it. Everything inside is fair game. Everything outside doesn’t exist as far as the agent is concerned.
This matters more than you’d think. When I first set it up, my instinct was to throw everything into one big Agent Space. Production, staging, dev – all of it. Don’t do that. The isolation is the feature. Each Agent Space keeps its investigation data, incident history, recommendations, and even chat conversations completely separate. Nothing bleeds across.
The play is to align your Agent Spaces with how your organisation actually works. One per team. One per environment. One per service boundary. Whatever keeps the agent focused on the right blast radius. You can create a dedicated AWS account as the primary for each space and connect other accounts as secondaries. It feels like overkill up front, but the first time the agent investigates a production incident without accidentally pulling staging noise into the picture, you get the idea.
Topology: The Agent’s Mental Model of Your Infrastructure
This is the part that genuinely surprised me. The DevOps Agent doesn’t just respond to alerts blindly. It builds a map of your infrastructure – every resource, every relationship, every dependency chain – and uses that map to reason about what’s going wrong and where.
It does this automatically. It scans your CloudFormation stacks (CDK too – anything that deploys via CloudFormation under the hood). It picks up tagged resources through AWS Resource Explorer. It pulls relationship data from configuration, observability tools like CloudWatch Application Signals or Dynatrace, and even your CI/CD pipelines. The result is an interactive topology graph you can explore in the web app, with multiple views: high-level account and region boundaries, CloudFormation stack containers, individual components, or the full kitchen-sink view with every discovered resource.
Here’s the thing that actually matters: during an investigation, the agent uses this topology to understand blast radius. It knows that your API Gateway talks to a Lambda function that writes to a DynamoDB table that triggers an SQS queue. So when the DynamoDB alarm fires, it doesn’t just stare at the table metrics – it traces upstream and downstream to figure out what kicked things off and what else might be affected. No one had to draw that architecture diagram. The agent figured it out.
And it’s not limited to what’s in the topology, either. If investigation leads somewhere the graph doesn’t cover, the agent can still reach out to AWS service APIs or connected observability tools to chase the thread. The topology is a starting point, not a ceiling.
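To make the upstream and downstream tracing concrete, here is a minimal sketch of the idea as a plain graph walk. The resource names and edges are made up for illustration, and this is only my mental model of the behaviour, not how the agent is actually implemented.

```python
from collections import deque

# Hypothetical dependency edges: "A": ["B"] means A calls or feeds B.
EDGES = {
    "api-gateway": ["checkout-lambda"],
    "checkout-lambda": ["orders-table"],
    "orders-table": ["orders-queue"],
    "orders-queue": [],
}


def reachable(graph, start):
    """Breadth-first walk returning every node reachable from start, in visit order."""
    seen, order, queue = {start}, [], deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                order.append(nxt)
                queue.append(nxt)
    return order


def reverse(graph):
    """Flip every edge so the same walk traces callers instead of callees."""
    flipped = {node: [] for node in graph}
    for src, targets in graph.items():
        for dst in targets:
            flipped.setdefault(dst, []).append(src)
    return flipped


def blast_radius(graph, alarming):
    """Everything that could have caused the alarm, and everything it could affect."""
    return {
        "upstream": reachable(reverse(graph), alarming),
        "downstream": reachable(graph, alarming),
    }


radius = blast_radius(EDGES, "orders-table")
print(radius)
```

When the alarm fires on the table, the walk surfaces the Lambda function and API Gateway as possible causes and the queue as a possible casualty, which is the same question an engineer would otherwise answer from a whiteboard diagram.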
Skills: Teaching the Agent Your Playbook
Out of the box, the DevOps Agent is a solid generalist. But your infrastructure isn’t generic. You’ve got that one RDS cluster with the weird connection pooling config. You’ve got that deployment pipeline that does a canary rollout with custom health checks. You’ve got the runbook that lives in someone’s head – the one person that everyone pings at 3 AM.
Skills let you encode all of that. A Skill is just a directory with a SKILL.md file – Markdown instructions that tell the agent how to investigate specific scenarios. Step-by-step procedures. Decision trees. What to check first, what to check next, what a good outcome looks like. You can bundle in reference docs, architecture diagrams, and data files alongside it.
The agent is smart about loading them, too. Each Skill has a name and a description in its frontmatter, and the agent uses that description to decide whether the skill is relevant to whatever it’s currently working on. You can even target skills to specific agent types – triage, root cause analysis, mitigation, evaluation – so the agent doesn’t burn context loading your RDS runbook when it’s investigating a networking issue.
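Here is a rough sketch of what that filtering could look like. The frontmatter layout, the `agent_types` key, and the keyword-overlap check are all assumptions I made for illustration. The real agent makes its own relevance decision from the description, and the documented Skill schema may differ.

```python
# Illustrative only: the frontmatter layout and the `agent_types` key are
# assumptions for this sketch, not the documented Skill schema.
SKILL_MD = """---
name: rds-connection-pool-triage
description: Investigate RDS connection pool exhaustion on the orders cluster
agent_types: triage, root-cause-analysis
---
1. Check DatabaseConnections against the pool limit.
2. Compare with the most recent deployment time.
"""


def parse_frontmatter(text):
    """Split a SKILL.md string into (metadata dict, body) using simple key: value lines."""
    _, header, body = text.split("---", 2)
    meta = {}
    for line in header.strip().splitlines():
        key, _, value = line.partition(":")
        meta[key.strip()] = value.strip()
    return meta, body.strip()


def is_candidate(meta, agent_type, incident_summary):
    """Crude stand-in for relevance: agent type must match and at least two words must overlap."""
    allowed = {t.strip() for t in meta.get("agent_types", "").split(",")}
    if agent_type not in allowed:
        return False
    words = set(meta["description"].lower().split())
    return len(words & set(incident_summary.lower().split())) >= 2


meta, body = parse_frontmatter(SKILL_MD)
print(is_candidate(meta, "triage", "RDS connection errors on checkout"))
print(is_candidate(meta, "triage", "packet loss between vpc subnets"))
```

The takeaway for writing your own Skills is that the description carries the weight: if it doesn't name the symptoms and components clearly, nothing downstream can tell when the Skill applies.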
This is where it gets personal. Your team’s hard-won operational knowledge – the stuff that lives in Confluence pages no one reads, or worse, in Slack threads from six months ago – finally gets packaged into something that runs automatically at 3 AM when no one’s awake. That, to me, is what makes the agent feel genuinely tailored to your environment.
The Dual-Console Model: Admin vs. Operator
Last thing. There are two surfaces you’ll work with, and they serve different people.
The AWS Management Console is where your admins live. Creating Agent Spaces, configuring integrations, setting up IAM roles and access controls – all the setup work happens here. You’ll spend time here once, then revisit when you’re adding accounts or integrations.
The DevOps Agent web app is where your operators live. This is the day-to-day surface: launching investigations, chatting with the agent, browsing the topology, reviewing prevention recommendations. It lives outside the AWS Console, which is a deliberate choice – your on-call engineers don’t need to navigate the full AWS Console at 3 AM. They need a focused tool that shows them what’s broken and what to do about it.
Two consoles, two audiences, clean separation. It sounds simple, but it’s the kind of design decision that tells you someone on the product team has actually been on-call and in the trenches for an incident in a past life.
Let’s look at some use cases that can make a significant difference to your support processes.
Use Case - Autonomous Incident Investigation & Triage
Let me paint the picture.
It’s 2 AM. A Datadog alert fires on your checkout service. Three minutes later, a PagerDuty page follows. Then a ServiceNow ticket gets auto-created. Meanwhile, your CloudWatch alarms start lighting up for the downstream payment processor. To a human, this looks like four separate problems. To the DevOps Agent, it’s one.
Here’s what actually happens. The moment that first alert arrives – through a built-in integration, a webhook, or even a manual trigger – the agent kicks off an investigation automatically. No human has to triage. No one has to decide if this is worth waking someone up. The agent is already working.
The first thing it does is something most engineers skip under pressure: correlation. The agent has a triage phase that analyses every incoming incident against active investigations within a lookback window – typically around 20 minutes. It looks at component similarities, region, timing. If that PagerDuty page and that CloudWatch alarm are symptoms of the same underlying problem, the agent links them to a single investigation instead of spinning up a separate one for each. That alone saves you from the classic “five people investigating the same thing in different Slack threads” problem.
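If it helps to see the shape of that correlation step, here is a minimal sketch using the 2 AM scenario above. The matching rule (same region, same or related component, inside the lookback window) is my simplification of what the post describes. The class names, the `related` map, and the exact window are assumptions, not the agent's real logic.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta

LOOKBACK = timedelta(minutes=20)  # "around 20 minutes"; the exact value is an assumption


@dataclass
class Alert:
    source: str
    component: str
    region: str
    fired_at: datetime


@dataclass
class Investigation:
    alerts: list = field(default_factory=list)

    def matches(self, alert, related):
        """True if the alert is recent enough, in the same region, and touches a related component."""
        latest = max(a.fired_at for a in self.alerts)
        if alert.fired_at - latest > LOOKBACK:
            return False
        return any(
            a.region == alert.region
            and (a.component == alert.component
                 or alert.component in related.get(a.component, ()))
            for a in self.alerts
        )


def triage(alerts, related):
    """Attach each alert to the first matching investigation, or open a new one."""
    investigations = []
    for alert in sorted(alerts, key=lambda a: a.fired_at):
        target = next((i for i in investigations if i.matches(alert, related)), None)
        if target is None:
            target = Investigation()
            investigations.append(target)
        target.alerts.append(alert)
    return investigations


t0 = datetime(2025, 1, 1, 2, 0)
alerts = [
    Alert("datadog", "checkout", "us-east-1", t0),
    Alert("pagerduty", "checkout", "us-east-1", t0 + timedelta(minutes=3)),
    Alert("servicenow", "checkout", "us-east-1", t0 + timedelta(minutes=4)),
    Alert("cloudwatch", "payment-processor", "us-east-1", t0 + timedelta(minutes=5)),
    Alert("cloudwatch", "reporting", "eu-west-1", t0 + timedelta(minutes=6)),
]
related = {"checkout": {"payment-processor"}}  # hypothetical topology hint

investigations = triage(alerts, related)
print([len(i.alerts) for i in investigations])
```

The four checkout-related signals collapse into one investigation, while the unrelated alert in another region gets its own.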
From there, the agent digs in. It pulls metrics. It reads logs. It examines traces. It looks at what code was deployed recently and through which pipeline. It checks deployment history. And it does all of this in the context of your application topology – it already knows which services talk to each other, so it can trace upstream and downstream without someone drawing a diagram on a whiteboard.
So what’s the output? A root cause analysis and a detailed mitigation plan. Not a vague “check the logs” suggestion. Specific actions: what to do, how to validate it worked, and how to roll back if it doesn’t. That plan gets pushed to your communication channels – Slack, ServiceNow, wherever you’ve configured it – so the on-call engineer who does eventually open their laptop finds a pre-analysed incident with a recommended fix, not a raw alert and a prayer.
OK, I can hear the question now: “What if the agent gets it wrong?” You can unlink incidents that were incorrectly correlated and kick off a fresh investigation. You can also create custom correlation rules by writing a Skill with your own triage logic – so if your organisation has specific rules about what should and shouldn’t be grouped, you can teach the agent.
There’s another detail I want to call out: the pre-configured starting points. Sometimes you’re not reacting to an alert – you’re just suspicious. The agent gives you quick-launch options: investigate the latest triggered alarm, look into high CPU usage across your compute resources, or chase an error rate spike. You fill in a couple of fields – a description, an optional starting point like a specific alarm or log snippet, the incident timestamp – and hit go. It’s remarkably low-friction for something this powerful.
The bottom line is, by the time a human gets involved, the expensive part of the investigation is already done. The agent has correlated the signals, traced the dependencies, identified the root cause, and written up a mitigation plan. Your engineer’s job shifts from “figure out what’s broken” to “review the plan and decide whether to execute it.” That’s a fundamentally different on-call experience.
Use Case - Proactive Incident Prevention
This is the one that changed my thinking about what the agent is actually for.
Reactive incident response is table stakes. Every monitoring tool does it. Alert fires, someone investigates, you fix it, you move on. But here’s the thing that keeps burning teams: you fix the incident, but you never fix the pattern. Three months later, the same class of issue comes back wearing a slightly different hat, and you do the whole dance again.
The DevOps Agent attacks this directly. It analyses your investigation history – not just the most recent one, but the patterns across multiple incidents – and generates recommendations designed to prevent entire categories of incidents from recurring. This isn’t a generic best-practices checklist. These are targeted improvements informed by things that have actually broken in your environment.
The recommendations land in a dedicated Ops Backlog page in the web app, and they’re sorted into four categories:
- Observability - gaps in your monitoring, alerting, or logging that slowed down detection. Maybe you had a metric that should have been alarming but wasn't. Maybe your logs weren't structured enough for the agent to parse quickly. The agent will tell you exactly what to add and where.
- Infrastructure - resource configurations, capacity settings, and architectural patterns that contributed to the incident. Think autoscaling policies that were too aggressive or not aggressive enough, instance types that are underpowered for the workload, or single points of failure the agent spotted in your topology.
- Governance - deployment pipeline improvements, testing gaps, and operational controls. If the agent noticed that the incident was caused by a deployment that skipped a canary phase, or a config change that wasn't validated, it'll flag that and recommend a specific improvement.
- Code optimisation - application-level issues like poor error handling, missing retries, or code patterns that degrade under load.
By default, the agent runs these evaluations weekly. You can also trigger them on demand – which is useful right after a significant incident when you want recommendations while the context is still fresh. You can pause the schedule entirely if you prefer to control the cadence yourself.
Here’s what makes this more than just a recommendation engine: the feedback loop. For every recommendation, you have three options. Keep it – add it to your backlog for tracking. Discard it – and when you discard, you tell the agent why in plain language. “We already handle this at the CDN layer.” “This would break our blue-green deployment model.” The agent learns from that feedback and calibrates future recommendations. Or mark it as Implemented – which lets the agent measure whether similar incidents actually decrease over time.
Recommendations don’t hang around forever, either. If a recommendation sits untouched for about six weeks and no new incidents come along that it would have prevented, it gets automatically cleaned up. Your backlog stays relevant instead of becoming another graveyard of ignored action items.
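The clean-up rule is easy to reason about once you write it down. This is a minimal sketch of how I read it. The status names, and the idea that a newer supporting incident resets the clock, are my interpretation rather than documented behaviour.

```python
from datetime import date, timedelta

STALE_AFTER = timedelta(weeks=6)  # "about six weeks" per the description above


def should_expire(created, today, status, last_supporting_incident=None):
    """Expire only recommendations that are untouched, old enough, and unsupported by newer incidents."""
    if status != "open":  # assumption: kept, discarded, or implemented items are not auto-cleaned
        return False
    last_evidence = max(created, last_supporting_incident or created)
    return today - last_evidence >= STALE_AFTER


print(should_expire(date(2025, 1, 1), date(2025, 2, 12), "open"))
print(should_expire(date(2025, 1, 1), date(2025, 2, 12), "open", date(2025, 2, 1)))
```

A recommendation nobody has touched for six weeks drops off, but a fresh incident it would have prevented keeps it alive.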
One more thing – and this is the part that made me sit up straight. For recommendations that involve code or configuration changes, the agent can generate what it calls an “agent-ready specification.” This is a structured document – problem statement, solution summary, target repositories, specific file paths, implementation considerations, test requirements, and a phased rollout plan. You can hand that spec directly to a coding agent or use it as a detailed brief for a developer. The agent doesn’t just tell you what to fix. It gives you the work order.
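To picture what ends up in one of those specifications, here is a small sketch that models the sections listed above as a data structure. The class and field names are mine, chosen to mirror the list in this post, and the example values are invented. The agent's actual output format may look quite different.

```python
from dataclasses import dataclass, field


@dataclass
class AgentReadySpec:
    """Hypothetical model of the sections an agent-ready specification contains."""
    problem_statement: str
    solution_summary: str
    target_repositories: list
    file_paths: list
    implementation_considerations: list = field(default_factory=list)
    test_requirements: list = field(default_factory=list)
    rollout_phases: list = field(default_factory=list)

    def missing_sections(self):
        """Names of sections left empty, so an incomplete spec is caught before handoff."""
        return [name for name, value in vars(self).items() if not value]


spec = AgentReadySpec(
    problem_statement="Checkout retries amplify load during payment timeouts",
    solution_summary="Add capped exponential backoff with jitter",
    target_repositories=["checkout-service"],
    file_paths=["src/payments/client.py"],
    test_requirements=["Unit test for the retry cap"],
)
print(spec.missing_sections())
```

A check like `missing_sections()` is the kind of gate I would put in front of a coding agent, so it never starts work from a brief with no rollout plan.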
The shift here is profound. Instead of your post-incident review producing a list of action items that slowly decay in a Jira board, the DevOps Agent continuously surfaces the improvements that would have the biggest impact – and it updates those recommendations as new incidents provide fresh evidence. It turns incident history from a sunk cost into a compounding investment in reliability.
Use Case - Escalation to AWS Support with Full Context
We’ve all lived this one. The investigation goes well. The agent (or your team) has narrowed it down. Metrics point to something that doesn’t look like your code or your infra. Maybe it’s a service-side issue. Maybe you’ve hit an undocumented limit. Maybe something is just behaving differently from what the API docs say.
So you open an AWS Support case. And now you spend 20 minutes recreating the entire investigation in a text box. What broke. When. What you’ve tried. Which metrics you looked at. Which resources are affected. You copy-paste CloudWatch URLs. You dig up timestamps. You write a mini post-mortem just to file the ticket. And then a support engineer asks a follow-up question that means you missed a detail, and you go back and dig again.
The DevOps Agent short-circuits all of that.
When you’re in the middle of an investigation and you hit the point where you need AWS Support, you click “Ask for human support” in the DevOps Agent web app. That’s it. The agent automatically packages up everything it’s found during its investigation and attaches it to the support case:
- Investigation timeline - the full chronological record of what the agent analysed, in what order, and what it found at each step
- Resource information - every affected AWS resource, already identified
- Observability data - the relevant metrics, logs, and traces, pulled from your integrated monitoring tools
- Recent changes - code deployments, infrastructure changes, config updates that happened in the incident window
- Remediation attempts - what the agent already recommended and whether anything was tried
- Impact assessment - how widespread the problem is and what's at risk
The AWS Support engineer opens the case and already has the full picture. No re-explaining. No back-and-forth asking for account IDs or resource ARNs. No “can you send me the CloudWatch graph for the last hour?” They can start working the problem immediately.
And here’s the part that makes this genuinely useful in practice: you can chat with AWS Support directly inside the DevOps Agent web app. A separate chat window opens alongside your investigation timeline, so you can see the agent’s automated analysis and the support engineer’s guidance side by side. You’re not bouncing between the Support console and your investigation tool. Everything lives in one place.
A few things worth knowing about support plans. The integrated chat experience is available on Business Support+, Enterprise Support, and Unified Operations. Basic Support customers can’t create technical support cases at all, so the escalation button won’t help you there. On an eligible plan, make sure your agent’s IAM permissions include support:CreateCase and support:DescribeCases, or the integration won’t work.
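For reference, a policy statement granting those two actions could look like the sketch below, built as a Python dictionary and printed as JSON. The `Sid` is a label I made up, and which role you attach this to depends on your Agent Space setup. I use `"Resource": "*"` because AWS Support actions don't support resource-level permissions. Treat this as a starting point and check the service documentation for any additional actions the integration needs.

```python
import json

# Minimal sketch: only the two actions named above. The Sid is a hypothetical label.
support_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowSupportEscalation",
            "Effect": "Allow",
            "Action": ["support:CreateCase", "support:DescribeCases"],
            "Resource": "*",
        }
    ],
}

print(json.dumps(support_policy, indent=2))
```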
The real value here isn’t just convenience – it’s speed. The context-gathering phase of a support case is often the longest part of the resolution. By the time you’ve explained the problem well enough for someone else to understand it, you’ve burned 30 minutes to an hour. The agent compresses that to zero. The support engineer gets the same quality context your best senior SRE would have put together, except it’s generated automatically and it’s there the moment the case is created.
It’s a small thing, but it’s the kind of small thing that turns a two-hour support interaction into a twenty-minute one.
Security & Governance
I’ll be honest: the first question I had when someone said “let an AI agent investigate your production infrastructure” was, “What exactly can it access, and who’s watching it?”
Turns out AWS thought about this quite a bit, and given that security is “job zero” at AWS, this should come as no surprise. Here’s the short version of what you need to know.
The Agent Space Is Your Security Boundary
Everything starts with the Agent Space. It’s not just an organisational container – it’s a security perimeter. Each Agent Space has its own IAM roles, its own account access, its own integration config, and its own data. Investigation history, recommendations, chat conversations – all of it is isolated per space. Nothing leaks between spaces.
You configure three IAM roles: a primary account role for the account where the Agent Space lives, secondary account roles for any additional AWS accounts you connect, and a web app role that controls who can see investigation data. The recommendation is straightforward: least privilege, read-only. The agent doesn’t need write access to investigate, so don’t give it write access.
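One way to keep yourself honest about "read-only" is to lint the policies you attach to those roles. The sketch below flags any allowed action whose name doesn't start with a typical read prefix. It is a naming heuristic I'm assuming for illustration, not an authoritative classification: some genuinely read-only actions use other verbs and will be flagged for manual review.

```python
READ_PREFIXES = ("Get", "List", "Describe", "Lookup", "BatchGet")


def write_actions(policy):
    """Return allowed actions that don't look read-only, judged by name prefix alone."""
    flagged = []
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # IAM permits a single statement object
        statements = [statements]
    for stmt in statements:
        if stmt.get("Effect") != "Allow":
            continue
        actions = stmt.get("Action", [])
        if isinstance(actions, str):  # IAM permits a single action string
            actions = [actions]
        for action in actions:
            _, _, name = action.partition(":")
            # Wildcards such as "s3:*" or "*" never match a read prefix, so they are flagged.
            if not name.startswith(READ_PREFIXES):
                flagged.append(action)
    return flagged


candidate = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "cloudwatch:GetMetricData",
                "ec2:DescribeInstances",
                "ec2:RebootInstances",
                "s3:*",
            ],
            "Resource": "*",
        }
    ],
}
print(write_actions(candidate))
```

Anything this prints deserves a second look before the role goes anywhere near production.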
The Agent Mostly Can’t Break Things
This is the detail that brought my blood pressure down. The agent’s toolset is intentionally constrained – it cannot mutate your resources. The only write actions it can take are opening tickets and support cases. It can’t modify security groups, restart instances, push deployments. It reads, analyses, and recommends. You decide what to act on.
That said, there’s a caveat worth calling out. If you bring your own MCP servers into the mix, those custom tools operate outside the agent’s native guardrails. It’s on you to make sure your custom MCP servers are read-only and that the users of any external tools they connect to are trusted. Test them in a non-production Agent Space first and ensure that you audit them regularly.
Prompt Injection Protection
Any AI agent that consumes external data – logs, resource tags, ticket descriptions – is a potential prompt injection target. AWS layers several protections here:
- Limited write capabilities mean that even if a malicious instruction sneaks through, the agent can't modify your infrastructure
- Account boundary enforcement keeps the agent inside the scope of its configured IAM roles - it can't be tricked into reaching outside its Agent Space
- ASL-3 model protections include classifiers that detect and block prompt injection attempts before they affect agent behaviour
- Immutable audit trail - the agent journal logs every reasoning step and action, and those entries can't be modified after the fact, not even by the agent itself. If something suspicious happens, you'll see it
So what is the biggest risk factor? Authorised users who can modify data sources the agent consults – logs, tags, ticket fields. You need to ensure that you apply least privilege to those systems too, not just to the agent’s IAM roles.
Everything Gets Logged
The agent maintains a detailed journal for every investigation and every prevention evaluation. Every reasoning step, every action taken, every tool called – it’s all recorded and immutable. On top of that, all DevOps Agent API calls flow through CloudTrail, so you get the full picture: who made the request, from where, and when.
Between the agent journal and CloudTrail, you’ve got two independent audit trails. One shows what the agent thought and did. The other shows who triggered it and how. That’s enough to satisfy most compliance and incident review requirements.
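On the CloudTrail side, the "who, from where, and when" view is a simple projection over event records. Here is a minimal sketch working on records you have already exported. The record fields used (`eventSource`, `eventName`, `eventTime`, `sourceIPAddress`, `userIdentity.arn`) are standard CloudTrail fields, but the event source string and event name below are placeholders, since I'm not asserting the values the DevOps Agent actually emits.

```python
# Placeholder values: substitute the event source and names you see in your own trail.
AGENT_EVENT_SOURCE = "devops-agent.example"


def who_where_when(records, event_source):
    """Project CloudTrail records for one service onto (who, where, when, what)."""
    return [
        (
            r.get("userIdentity", {}).get("arn", "unknown"),
            r.get("sourceIPAddress", "unknown"),
            r["eventTime"],
            r["eventName"],
        )
        for r in records
        if r.get("eventSource") == event_source
    ]


records = [
    {
        "eventSource": AGENT_EVENT_SOURCE,
        "eventName": "StartInvestigation",  # hypothetical event name
        "eventTime": "2025-01-01T02:00:05Z",
        "sourceIPAddress": "203.0.113.10",
        "userIdentity": {"arn": "arn:aws:iam::111122223333:user/oncall"},
    },
    {
        "eventSource": "ec2.amazonaws.com",
        "eventName": "DescribeInstances",
        "eventTime": "2025-01-01T02:00:09Z",
        "sourceIPAddress": "203.0.113.10",
        "userIdentity": {"arn": "arn:aws:iam::111122223333:user/oncall"},
    },
]

for row in who_where_when(records, AGENT_EVENT_SOURCE):
    print(row)
```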
Data Residency and Encryption
Your data is stored in the region where your Agent Space lives. Inference processing stays within your geography – EU requests process in the EU, US in the US, Australia in Australia, Japan in Japan. All data is encrypted at rest with AWS-managed keys, and everything in transit is encrypted across Amazon’s private network.
One thing to be aware of: the agent does not filter PII from its investigation summaries. If your logs contain personally identifiable information, the agent will include it in its findings. So redact PII before it hits your observability stack, not after.
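Redacting at the source can be as simple as a logging filter in the application itself. The sketch below uses Python's standard `logging` module and only handles email addresses with a deliberately simple pattern. The logger name and placeholder text are made up, and real PII coverage (names, card numbers, addresses) needs far more than one regex.

```python
import logging
import re

# Deliberately simple email pattern for illustration; not a complete PII detector.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def redact(text):
    """Replace anything that looks like an email address with a fixed placeholder."""
    return EMAIL.sub("[REDACTED_EMAIL]", text)


class RedactEmails(logging.Filter):
    """Rewrite each record's final message before any handler sees it."""

    def filter(self, record):
        record.msg = redact(record.getMessage())
        record.args = None  # the message is already fully formatted
        return True


logger = logging.getLogger("checkout")  # hypothetical service logger
logger.addFilter(RedactEmails())

print(redact("payment failed for jane.doe@example.com"))
```

Because the filter sits on the logger, the address is gone before the line is shipped to CloudWatch or any other tool the agent reads from.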
The Shared Responsibility Model
Same principle as everywhere else on AWS, but worth making explicit:
AWS is responsible for securing the agent infrastructure, protecting data the agent retrieves, and securing the native toolset.
You’re responsible for scoping IAM roles properly, managing who can access each Agent Space, ensuring your MCP servers behave, making sure connected data sources contain trusted data, and redacting PII from your logs.
If you’ve operated on AWS before, none of this will surprise you. But it’s worth reading through the full security documentation before you go to production – especially the sections on network connectivity and private connections if your observability tools are hosted inside a VPC.
In Summary
So let’s zoom out.
The DevOps Agent isn’t trying to replace your SREs. It’s trying to give them their nights back. It’s the difference between waking up to a raw alert and waking up to a fully investigated incident with a mitigation plan sitting in Slack. That distinction sounds small until you’ve lived both versions.
We covered three use cases, and they map to the full lifecycle of an incident – not just the fire, but what comes before and after.
Autonomous investigation means the agent is already working when your on-call engineer opens their laptop. It’s correlated the alerts, traced the dependencies through your topology, identified a root cause, and written up specific actions to fix it. The expensive part of the investigation – the part that used to take hours of console-hopping and log-reading at 3 AM – is done before a human touches it.
Proactive prevention means you stop fixing the same class of incident over and over. The agent analyses your investigation history, spots the patterns, and delivers targeted recommendations across observability, infrastructure, governance, and code. You keep what’s useful, discard what isn’t, and the agent learns from both. The real kicker is the agent-ready specs – structured work orders you can hand directly to a coding agent or a developer, so recommendations don’t die in a backlog.
AWS Support escalation means context never gets lost in the handoff. The agent packages up its entire investigation – timeline, findings, metrics, deployment history, impact assessment – and attaches it to the support case automatically. The support engineer starts working the problem immediately instead of asking you to re-explain what you’ve already figured out.
And underneath all of it, the security model is surprisingly sensible. The agent can’t modify your infrastructure. Every action gets journaled. Agent Spaces keep blast radius scoped. Your data stays in your geography. It’s built for the kind of scrutiny production systems deserve.
Here’s what I’d actually recommend if you’re considering this. Don’t try to boil the ocean. Stand up a single Agent Space pointing at one production workload – ideally something that’s given you trouble recently. Let the agent investigate a few real incidents. Review its root cause analyses. See if the mitigation plans make sense. Then check the Ops Backlog after a week or two and see what it recommends for prevention.
You’ll know pretty quickly whether it’s useful for your environment.