Head of Engineering @ Aptible | helping SRE teams knock down knowledge silos, increase automation, and improve MTTR
Every one of the 58 engineering leaders I’ve talked to lately has been frustrated over the mess that is incident response 🥲

Quite a few of them have tried to build their own internal AI agent in an attempt to automate some of the more frustrating, manual tasks. And hey, we get it. We’re doing the exact same thing!

Here’s why: Incident response is *really* hard to get right, especially if you don’t have a ton of resources at your disposal. While we don’t think that AI is the solution to most problems, there are several aspects of the whole incident response process that could (and should) be automated by AI.

Just to name a few...

😩 On-call incident commanders have to bother their most senior engineers when they can’t find the information they need
😪 You spend hours combing through docs, logs, repos, and Notion pages when you can’t immediately determine the root cause, and then you usually end up just messaging your more experienced engineers anyway
✍️ You have to manually update or create runbooks every time an unrecognized incident comes up
👩‍💻 You locate and run the same scripts over and over again (this task alone can take ~10 minutes each time… or longer)

Here's where an AI agent would come in:

😌 An AI agent could provide immediate retrieval of the information you need from disparate sources to help you move faster
💪 It enables constant improvement by providing incident summaries and suggesting runbook updates and new processes based on patterns in your responses
📈 It can give your leadership team a dashboard to drill down into what incidents are occurring frequently, how long they’re taking to resolve, and how you’re improving over time
✅ It eliminates (or at least minimizes) tedious, manual tasks by securely and quickly executing predefined, custom scripts

So having recognized the potential for an AI agent to help with our own incident response processes at Aptible, we went ahead and built one!
Our relatively small SRE team has seen significant improvements with our AI agent. Not only do we have faster resolution times and fewer outages, but incident response has also become less stressful and a better experience overall. This isn't just on a task-by-task basis but holistically across our entire incident response system.

So, if you’re also thinking about building your own AI agent, why bother when one is already being built?

Interested in checking it out? Comment below or reach out directly to schedule a call. You can also find the link to more info in the comments 😊
Are you taking feedback from the humans in the loop and reinforcing for good answers from the AI? It seems like it'll go off the rails in someone's environment if it's only pretrained and has no ongoing correction.
Sounds cool! I trust "It eliminates (or at least minimizes) tedious, manual tasks by securely and quickly executing predefined, custom scripts" would only happen with human consent prior to or at the time of the incident?
I'm really hating how LinkedIn is shoving ads into every conceivable surface now.
Excellent summary, Ashley. At Waylay, we found the AI agent approach to work best when access is provided to the rules and workflows that generated the incidents, next to other data sources like troubleshooting manuals or topology information. By using the AI agent to backtrace the logical steps that triggered the incident and correlate them with other data sources, automated RCA can be implemented to further streamline incident response.
I strongly believe that having a solid framework in place is what really makes the difference. Add a RACI matrix to it, run scheduled simulations, and you really should be able to do well. Tools are just tools. Adding a new one, AI or not, doesn't really help without an internal framework and a validated process.
This is a fantastic read! It's clear that incident response is a critical pain point for many engineering teams, and Aptible's AI agent solution sounds like a game-changer. I particularly appreciate the focus on automating repetitive tasks and providing actionable insights for continuous improvement. I'm eager to learn more about your solution! Ashley Mathew
Love this, Ashley. This is exactly the thinking we need now. Identify the low-hanging fruit where LLM-based systems can deliver real ROI right now. The shock and awe of talking with ChatGPT about Plato in 2022 is over. Time to roll up our sleeves :-)
Some lovely ideas there, Ashley Mathew. I particularly like the clear and early recognition that AI is there to assist our teams in the process and enable people to be more productive. I generally consider my teams to be very good at incident management, but there are definitely some opportunities for uplift there! Thanks for sharing.
Co-founder, ✨Waypoint AI: automate escalations from intake to resolution.
Wow, Ashley. Couldn’t have said it better myself. Obviously, we think there are reasons to buy Waypoint AI vs. building internally, but it’s amazing how you’ve honed in on exactly why critical bugs need supercharging.