

Alex Bit
@alex__bit
love people & solving hard technical problems. forward-looking. systems thinker. building @Codemod to make code maintenance invisible for all. ex @Meta.


We find that the adoption of Cursor leads to a statistically significant, large, but transient increase in project-level development velocity, along with a substantial and persistent increase in static analysis warnings and code complexity. arxiv.org/abs/2511.04427



🤯BREAKING: Alibaba just proved that AI coding isn't taking your job, it's just writing the legacy code that will keep you employed fixing it for the next decade. 🤣

Passing a coding test once is easy. Maintaining that code for 8 months without it exploding? Apparently, it's nearly impossible for AI.

Alibaba tested 18 AI agents on 100 real codebases over 233-day cycles. They didn't just look for "quick fixes"; they looked for long-term survival.

The results were a bloodbath: 75% of models broke previously working code during maintenance. Only Claude Opus 4.5/4.6 maintained a >50% zero-regression rate. Every other model accumulated technical debt that compounded until the codebase collapsed.

We've been using "snapshot" benchmarks like HumanEval that only ask "Does it work right now?" The new SWE-CI benchmark asks: "Does it still work after 8 months of evolution?"

Most AI agents are "Quick-Fix Artists." They write brittle code that passes tests today but becomes a maintenance nightmare tomorrow. They aren't building software; they're building a house of cards.

The narrative just got honest: most models can write code. Almost none can maintain it.


Alibaba burned 10 billion tokens testing 18 AI models across 100 real codebases over 233 days each.

The headline going viral: 75% of models break previously working code. The actual story: someone finally built the scoreboard that matters.

Every AI coding benchmark until now asked: "Can it fix this bug right now?" SWE-CI tracks 71 consecutive commits across 233 days and asks: "Does it still work after 8 months of real evolution?"

Most models scored a zero-regression rate below 25%. Three out of four maintenance cycles, the agent fixes today's ticket and breaks yesterday's feature. That gap is the snapshot, not the verdict.

Nadella says 30% of Microsoft's repos are AI-generated. Pichai claims the same for Google. Zuckerberg wants AI writing half of Meta's code within the year. The code is shipping.

The question was never whether AI would write production software. The question was when someone would start measuring the right thing.

Gartner forecasts global IT spending above $6 trillion in 2026. Maintenance eats 60-70% of that. Roughly $4 trillion a year spent keeping existing code alive. Every AI coding tool today is optimized for the $2 trillion creation side. The $4 trillion maintenance side just got its first real benchmark.

The models will close this gap. That's the entire point of measuring it. Once you score maintenance, every lab starts training for maintenance. The same pattern played out with SWE-bench: models went from 3% to 70%+ in under two years once there was a leaderboard to chase.

SWE-CI is the starting gun, not the funeral. The company that cracks long-term code stability owns the largest budget line in every engineering org on the planet. And that gap is only getting wider until someone does.
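A rough sketch of what a metric like this looks like in practice — not SWE-CI's actual scoring code, just the assumption that a maintenance cycle counts as "clean" when the agent's patch breaks no previously passing test:

```python
# Hypothetical sketch of a zero-regression rate over maintenance cycles.
# The field names and the "clean cycle" definition are assumptions here,
# not taken from the SWE-CI paper.

def zero_regression_rate(cycles):
    """Fraction of maintenance cycles in which the agent's change
    broke no previously passing test."""
    clean = sum(1 for c in cycles if not c["broken_tests"])
    return clean / len(cycles)

# Toy history: 4 cycles, regressions in 3 of them -> rate of 0.25,
# i.e. the "below 25%" bucket most models landed in.
history = [
    {"commit": 1, "broken_tests": []},
    {"commit": 2, "broken_tests": ["test_login"]},
    {"commit": 3, "broken_tests": ["test_cart"]},
    {"commit": 4, "broken_tests": ["test_login", "test_export"]},
]
print(zero_regression_rate(history))  # 0.25
```

The point of the metric: a model can go 100% on "did this commit's ticket get fixed" while still scoring 25% here, because the score only rewards changes that leave everything else standing.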








thinking about building a new saas app and business? you should check out @makerkit_dev! @gc_psk deeply cares about devx. he uses @codemod to help devs install plugins (such as google analytics, posthog, umami, signoz, etc.) with ease. check out the 12-min technical walkthrough with screen sharing and code demos below 👇🏼



"Mining" codemods are here! Detect hardcoded CSS values that should become design tokens or CSS custom properties...



