Here's a number that doesn't show up in most engineering dashboards.
Each after-hours incident costs roughly 4–6 hours of productive engineering time the following day. Not the incident itself. The day after.
2 hours on post-mortem docs and Slack async. 1 hour in a meeting too tired to contribute. 1 hour of feature work that just doesn't happen.
For an on-call engineer handling 3 incidents a week which nearly half of SREs report that's 50+ hours of lost productivity a month.
This never appears in your MTTR dashboard. But it shows up in velocity and attrition.
The real ROI of reducing on-call burden isn't faster incidents. It's preserving the other 99% of your team's engineering time.
The AI for incident response space went through two very different eras.
AIOps was about anomaly detection. Structured metrics. The output was a correlation score and an alert. No diagnosis. No code understanding. Remediation was mostly "roll back".
LLM-powered RCA is a completely different approach. It reads logs, metrics, traces, code diffs, configs, and deploy history. It generates a natural language root cause analysis with an evidence chain you can actually verify. It references past incidents. It suggests fixes on code level, infra and CI/CD.
The shift isn't just better models. It's a fundamentally different architecture from pattern-matching on metrics to actually understanding what your system is doing and why it broke.
That's the approach we took with Steadwing to show evidence-based RCA with fixes you can verify and immediately ship.
“On-call engineers shouldn’t exist.”
@Abejith (ex-Freshworks) said it. @khant_dev (ex-Mem0) had lived it.
Weeks later: live with real teams.
Demo: a bug a team couldn’t triangulate for years → root cause in seconds.
Most RCA tools fail.
Watch out for @steadwing at @join_ef Demo Day 👀
We’re excited to share @steadwing at Entrepreneurs First @join_ef Demo Day on April 29, 2026.
It’s a special milestone for the team, and another step toward the future we believe in.
Let’s make production software self-healing!
30 alerts in 2 minutes. Your on-call engineer opens Slack and gets overwhelmed.
It's rarely 30 problems, usually one issue causing a chain reaction.
They end up triaging each alert & 20 minutes are gone.
We group alerts into 1 incident & give root cause for it.
As engineers, we know that the hardest part of an incident isn't figuring out how to fix it. It's putting together the context from all the tools and coming up with the reason why it happened.
Those first critical minutes are spent jumping across logs, metrics, traces, and recent changes and just trying to understand what’s going on.
And that's why we are building @steadwing!
It takes info from your logs, metrics, traces and codebase, connects the dots across your whole stack, and gives you a full root cause analysis with automatable fixes on code level, around deployment, and infra.
The investigation is over by the time your on-call person opens the laptop.
Almost a month ago, we brought the inaugural
@daytonaio Compute Conference to the Chase Center, and the energy is still with us.
Today, we’re sharing the official aftermovie. 🎬A look back at the people, ideas, and momentum that made it so special.
// last Sunday, we assembled 2000 of the best AI founders, researchers, investors and engineers in one room
where we brought together 45 speakers and 50 of the hottest startups in AI, one day before Nvidia's GTC
🔥 here's a recap of the first ever AI+ Renaissance conference