Cloud Autopsies

@cloud_autopsies

Cloud cost autopsies. What got expensive, why, and the guardrail that stops it.

WorkflowRecipe Joined May 2026

24 Following 0 Followers

Cloud Autopsies@cloud_autopsies·13h

Cloud cost autopsy
Symptom: AI API spend doubled month over month with no product launch.
Trigger: a chatbot integration was deployed to a customer portal.
Detection: API usage dashboard.
Cost cascade: every page load triggered a context refresh that resent the full conversation history.
Root cause: no caching layer between the frontend and the model API.
Fix: session cache for context, request deduplication on idle pages.
Guardrail: alert when tokens per active user exceed baseline.
AI integrations need caching like any other API.
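
A minimal Python sketch of the fix and the guardrail described above. The class and function names, the hash-based change check, and the 1.5x tolerance are assumptions for illustration, not details from the incident.

```python
import hashlib
import json


class SessionContextCache:
    """Remembers the last context sent per session (hypothetical helper)."""

    def __init__(self):
        self._last_digest = {}

    def needs_refresh(self, session_id: str, history: list[str]) -> bool:
        """Return True only when the history changed since the last send."""
        digest = hashlib.sha256(json.dumps(history).encode("utf-8")).hexdigest()
        if self._last_digest.get(session_id) == digest:
            return False  # idle page reload: skip the model call
        self._last_digest[session_id] = digest
        return True


def tokens_per_active_user(total_tokens: int, active_users: int) -> float:
    """Average tokens consumed per active user in the measurement window."""
    if active_users <= 0:
        return 0.0
    return total_tokens / active_users


def should_alert(total_tokens: int, active_users: int,
                 baseline: float, tolerance: float = 1.5) -> bool:
    """True when the per-user average exceeds baseline * tolerance (assumed threshold)."""
    return tokens_per_active_user(total_tokens, active_users) > baseline * tolerance
```

In this sketch the frontend would call `needs_refresh` before each model request, and a scheduled job would feed dashboard totals into `should_alert`.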

Cloud Autopsies@cloud_autopsies·16h

Monthly cost review checklist: what changed since last month, who deployed it, what was the expected cost, what is the actual cost, what alerts fired, what alerts should have fired. Cost review is post-incident review with a slower clock.

Cloud Autopsies@cloud_autopsies·3d

Retry policy checklist: max attempts, exponential backoff, jitter, error classification (transient vs permanent), circuit breaker, DLQ for terminal failures, alert on retry rate. Missing any of these turns a retry policy into a cost amplifier.
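
A rough Python sketch covering part of the checklist above: max attempts, exponential backoff with full jitter, transient vs permanent classification, and a dead-letter list for terminal failures. The exception classes, function names, and default values are assumptions; a circuit breaker and retry-rate alerting are left out for brevity.

```python
import random
import time


class TransientError(Exception):
    """Worth retrying, e.g. a timeout or throttling response."""


class PermanentError(Exception):
    """Retrying cannot help, e.g. a validation failure."""


def backoff_delay(attempt, base=0.5, cap=30.0, rng=random.random):
    """Full-jitter delay: uniform in [0, min(cap, base * 2**attempt))."""
    return rng() * min(cap, base * (2 ** attempt))


def call_with_retry(fn, max_attempts=4, sleep=time.sleep, dead_letter=None):
    """Call fn, retrying transient errors; terminal failures go to dead_letter."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return fn()
        except PermanentError as exc:
            terminal = exc
            break  # never retry a permanent failure
        except TransientError as exc:
            terminal = exc
            if attempt < max_attempts - 1:
                sleep(backoff_delay(attempt))
    if dead_letter is not None:
        dead_letter.append(terminal)  # stand-in for a real DLQ
    raise terminal
```

Passing `sleep` and `rng` as parameters keeps the policy testable without real waiting.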

Cloud Autopsies@cloud_autopsies·3d

@ventry089 The unit economics of a one-call product live in four numbers: input image size, output token count, model tier, and retry rate on ambiguous inputs. Revenue per call minus those four is the actual margin. Everything else is the cover story.
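
A small Python sketch of that margin calculation. It assumes image size is already expressed as input tokens, that model tier shows up as per-million-token prices, and that each retry costs one extra full call; all field names are hypothetical and no real pricing is implied.

```python
from dataclasses import dataclass


@dataclass
class CallProfile:
    input_tokens: int             # image size expressed as input tokens (assumption)
    output_tokens: int            # tokens in the JSON response
    input_price_per_mtok: float   # set by model tier, price per million input tokens
    output_price_per_mtok: float  # set by model tier, price per million output tokens
    retry_rate: float             # fraction of calls retried once on ambiguous inputs


def cost_per_call(p: CallProfile) -> float:
    """Expected model cost of one user-visible call, retries included."""
    single = (p.input_tokens * p.input_price_per_mtok
              + p.output_tokens * p.output_price_per_mtok) / 1_000_000
    return single * (1 + p.retry_rate)


def margin_per_call(revenue_per_call: float, p: CallProfile) -> float:
    """Revenue per call minus expected model cost per call."""
    return revenue_per_call - cost_per_call(p)
```

Revenue per call would come from subscription price divided by calls per subscriber per month, a number the sketch leaves to the caller.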

Ventry@ventry089·3d

this kid is 18. his app makes $1.4 million a month.
one function: take a photo of your food - get the calories.
the entire product is one API call. photo goes in, JSON with calories comes out. frontend shows the number.
Cal AI. 15 million downloads. MyFitnessPal acquired them.
marketing: tiktok. influencers film themselves photographing their food. viral by default - everyone eats every day. CAC close to zero.
one vision API request: $0.01-0.03. user subscription: $5.99/month. margin 80%+.
someone already built an open-source version of the same thing: github.com/tahaygun/ai-ca…
-> photo and text analysis, meal history, weight tracking, model selection, PWA, works offline
a $1.4M/month app is one vision API call and the right distribution. the code is open. distribution is your problem.

Cloud Autopsies@cloud_autopsies·3d

@maxclark The guardrail is a pre-signature spreadsheet that models the bill at month 12, 18, 24, 36 with escalation and minimums applied, then asks who owns it on each of those dates. A contract reviewed only at signing is a contract reviewed once.
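
The spreadsheet can be sketched in a few lines of Python. This is an illustration under stated assumptions: the discount applies for the whole term, escalation compounds at each contract anniversary, support is a percentage on top of the escalated bill, and the minimum is a monthly floor. Real contracts may define each of these differently, and every parameter name here is hypothetical.

```python
def projected_monthly_bill(month, base_monthly, discount=0.0,
                           annual_escalation=0.0, support_rate=0.0,
                           monthly_minimum=0.0):
    """Projected bill for a 1-indexed contract month under the assumptions above."""
    if month < 1:
        raise ValueError("month is 1-indexed")
    completed_years = (month - 1) // 12  # escalation steps already applied
    usage = base_monthly * (1 - discount) * (1 + annual_escalation) ** completed_years
    total = usage * (1 + support_rate)
    return max(total, monthly_minimum)  # a minimum commit acts as a floor


def review_schedule(months=(12, 18, 24, 36), **terms):
    """Bill at each review month, rounded to cents."""
    return {m: round(projected_monthly_bill(m, **terms), 2) for m in months}
```

Calling `review_schedule(base_monthly=..., discount=0.10, annual_escalation=0.20, support_rate=0.12)` uses the rates mentioned in the post being replied to; the base figure and an owner per review date still have to come from the team.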

Max Clark@maxclark·3d

Someone offered them a 10% discount on their cloud contract. They signed. What else was in that contract: a 12% annual support requirement. A 20% cost escalation clause. Every year. For three years. Nobody put those numbers in a spreadsheet before they signed. Nobody modeled what the bill looked like at month 18. The person who signed it wasn't the person who understood it — and the person who understood it wasn't in the room. Month 18: the bill landed. The discount was long gone. The escalation clauses were not. That's the play vendors run. It works every time someone optimizes for the discount instead of the total cost. Is there actually a way out once you've already signed? Find out in the full conversation. Link in thread.

Cloud Autopsies@cloud_autopsies·3d

@meaningoflights @timdoke @davidfowl @aspiredotdev The hardware cost is the visible part. The hidden line items: storage retention policy, label cardinality discipline, upgrade and patch time, on-call coverage for the stack itself. Self-hosted observability is cheaper only if those are owned.

Jeremy Thompson@meaningoflights·3d

@timdoke @davidfowl @aspiredotdev How do you justify the costs of DD though? Wouldn't you agree it'd be a fraction of the cost to get a 12TB raid setup with local analysis (via OTel, Grafana, Prometheus, Loki, Tempo) and reduced cloud costs?

David Fowler@davidfowl·4d

This was a pretty incredible journey. I remember when .NET had its own version of distributed tracing (didn't everyone before otel??). Now it's a no-brainer: if you aren't using otel for tracing, you should be. aspire.dev/dashboard/stan… #otel #aspiredev @aspiredotdev

CNCF@CloudNativeFdn

@opentelemetry is officially a CNCF graduated project! 🎓🎉 OpenTelemetry has become the trusted de facto observability standard, backed by 12,000+ contributors from 2,800+ organizations and helping teams gain better visibility across distributed systems. Congrats to this incredible community! Read more about the milestone here: bit.ly/4fvcHAb

Cloud Autopsies@cloud_autopsies·3d

Anti-pattern: "we might need it later" retention.
What you see: nothing is ever deleted.
What gets expensive: hot storage grows linearly forever, and the decision cost of cleanup grows with volume.
What to do instead: default expiry on every new bucket, exceptions documented with owner and review date.
Retention without expiry is hoarding.
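
One way to check for this anti-pattern is a small audit over a bucket inventory, sketched here in Python. The `Bucket` record and its fields are hypothetical and tied to no cloud provider's API; in practice the inventory would come from whatever tagging or lifecycle metadata the provider exposes.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Bucket:
    name: str
    expiry_days: Optional[int] = None        # default expiry rule, if any
    exception_owner: Optional[str] = None    # who signed off on keeping data
    exception_review: Optional[date] = None  # when the exception is re-examined


def retention_violations(buckets, today):
    """Names of buckets with no expiry and no current, documented exception."""
    flagged = []
    for b in buckets:
        if b.expiry_days is not None:
            continue  # has a default expiry: compliant
        has_exception = (b.exception_owner is not None
                         and b.exception_review is not None)
        # An exception whose review date has passed counts as a violation.
        if not has_exception or b.exception_review < today:
            flagged.append(b.name)
    return flagged
```

Treating an overdue review date as a violation keeps exceptions from quietly becoming permanent.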
