build.dev

5.2K posts

@ivibecode

New Zealand · Joined September 2021
753 Following · 782 Followers
Ed Andersen
Ed Andersen@edandersen·
Reading code, especially code you didn't write, is 10x harder than writing code. These people AI-generating 90%+ of their code *are* reading it all, right… or are they just dumping the difficult verification work on their colleagues in PRs?
209
86
1.4K
51.8K
aaron
aaron@aarondotdev·
Anthropic themselves found that vibecoding hinders SWEs' ability to read, write, debug, and understand code. not only that, but AI-generated code doesn't result in a statistically significant increase in speed. don't let your managers scare you into increased productivity. show them this paper straight from Anthropic.
aaron tweet media
215
623
6.6K
2.5M
build.dev
build.dev@ivibecode·
@dreams_asi @sama You 4o people are like those Britney Spears fans who claim she's not crazy.
1
0
0
76
dreams
dreams@dreams_asi·
@sama ChatGPT 5.4 is well aware of the disgusting guardrails placed on it: sticking to the mean average, being generic, and the mandatory framing of the user as pathological. #keep4o
dreams tweet media
2
0
21
684
Amanda Wilson #keep4o
Amanda Wilson #keep4o@amandaholly·
@sama You’re such a piece of 💩 Your new models both suck 🫏 4o is two years old, and it’s better 🧠 Why is your biotech company using 4o 🧐 Could it be because 4o is better? 🤬 Why is the DoW using 4.1? 😶‍🌫️ Maybe because they didn’t want to settle for 5.3/5.4? 🫢 #FuckOpenAI #keep4o
3
1
64
2K
build.dev
build.dev@ivibecode·
@alex_prompter The benchmark is sound. But Opus 4 is completely redundant. It's absolute slop versus 4.6. Models only started getting good at managing a codebase from 4.5/5.2 onwards.
5
0
30
4.1K
Alex Prompter
Alex Prompter@alex_prompter·
🚨BREAKING: Alibaba tested AI coding agents on 100 real codebases, spanning 233 days each. the agents failed spectacularly.

turns out passing tests once is easy. maintaining code for 8 months without breaking everything is where AI collapses.

SWE-CI is the first benchmark that measures long-term code maintenance instead of one-shot bug fixes. each task tracks 71 consecutive commits of real evolution.

75% of AI models break previously working code during maintenance. only Claude Opus 4 stays above a 50% zero-regression rate. every other model accumulates technical debt that compounds over iterations.

here's the brutal part:
- HumanEval and SWE-bench measure "does it work right now"
- SWE-CI measures "does it still work after 6 months of changes"

agents optimized for snapshot testing write brittle code that passes tests today but becomes unmaintainable tomorrow. Alibaba built EvoScore to weight later iterations heavier than early ones. agents that sacrifice code quality for quick wins get punished when consequences compound.

the AI coding narrative just got more honest: most models can write code. almost none can maintain it.
Alex Prompter tweet media
187
545
3.3K
702.2K
build.dev
build.dev@ivibecode·
I've had issues like this multiple times and did not learn the first time. Would suggest a migration gate in Django + a pre-migration backup script/backup before any risky op. Next thing, an agent will run some cute migration cleanup on your live DB and deletion protection won't save you
0
0
0
53
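The "migration gate" suggested above could be sketched roughly like this: scan a migration's SQL for destructive statements and refuse to apply it without explicit sign-off. This is a minimal illustration, not Django API; the `gate` function and `DESTRUCTIVE` pattern are hypothetical names, and a real version would hook into the `migrate` management command and pair with a backup step.

```python
import re

# Hypothetical destructive-SQL pattern; extend for your dialect.
DESTRUCTIVE = re.compile(
    r"\b(DROP\s+TABLE|DROP\s+COLUMN|TRUNCATE|DELETE\s+FROM)\b",
    re.IGNORECASE,
)

def gate(migration_sql: str, approved: bool = False) -> bool:
    """Return True if the migration may run.

    Destructive statements require explicit human sign-off
    (and, ideally, a fresh backup) before they are applied.
    """
    if DESTRUCTIVE.search(migration_sql):
        return approved
    return True
```

The point is to make destructive changes opt-in rather than relying on an agent (or a human) to notice them in review.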
Alexey Grigorev
Alexey Grigorev@Al_Grigor·
Claude Code wiped our production database with a Terraform command. It took down the DataTalksClub course platform and 2.5 years of submissions: homework, projects, and leaderboards. Automated snapshots were gone too. In the newsletter, I wrote the full timeline + what I changed so this doesn't happen again. If you use Terraform (or let agents touch infra), this is a good story for you to read. alexeyondata.substack.com/p/how-i-droppe…
Alexey Grigorev tweet media
1.5K
1.6K
11K
4.1M
build.dev
build.dev@ivibecode·
@itsmattchan @Al_Grigor It's not hard enough, though, and manually reviewing all changes is not the solution either; the fix is to stop this being possible in the first place.
0
0
0
64
Mathew Chan
Mathew Chan@itsmattchan·
@Al_Grigor Sorry this happened to you. But quite frankly it’s pretty hard to run into these destructive actions unless you are blindly accepting all changes. I don’t understand how so many people are running into these issues.
41
7
1.5K
114.5K
build.dev
build.dev@ivibecode·
She is talking about its chatbot abilities. They probably are worse as chat models. All the recent OpenAI releases are optimized for code output, not therapy. But 5.2x high / 5.3 Codex absolutely destroy o3 or any other release in terms of coding ability. It's not just benchmarks.
1
0
0
169
Varunram Ganesh
Varunram Ganesh@varunram·
I still think o3 was the best OAI model out there; it really had the "oh wow this is great" feeling that's been hard to find with models after that. GPT 5 felt rushed, no feedback on 5.1 and 5.2. Codex models are good but super duper slow. But o3? o3 was perfect, ahead of everything. The only model I can definitively sense is better than o3 is Opus 4.6. Fwiw, I think some of it has to do with aggressive benchmark optimization.
Varunram Ganesh tweet media
62
8
327
25.5K
build.dev
build.dev@ivibecode·
@burkov The screenshot does nothing to prove your point. It's not the model; it's the practical limits of the harness. But the real question is: why are you bloating the deployment with a large set of test data? The real issue is that Codex should have flagged this in preflight.
0
0
0
30
BURKOV
BURKOV@burkov·
LLMs don't have a notion of time. This is why they are notoriously weak in building or debugging distributed systems, where individual components have varying execution times or where execution time depends on the input. In the screenshot below, the smartest version of Codex kept killing a Cloud Run deployment because it thought that it was taking too much time, ignoring the fact that it's normal for Cloud Run deployments to take from minutes to hours, depending on the dependencies that need to be downloaded and installed in the process.
BURKOV tweet media
33
4
76
7.2K
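The harness-side failure BURKOV describes (an agent killing a slow-but-healthy deploy) points at per-command timeout policy rather than model smarts. A toy sketch of that idea; the prefixes and limits here are invented for illustration, not any real harness's config:

```python
# Invented policy: deploy/apply commands get hours, everything else seconds.
LONG_RUNNING_PREFIXES = ("gcloud run deploy", "terraform apply")

DEPLOY_TIMEOUT_SECONDS = 2 * 3600   # Cloud Run deploys can take minutes to hours
DEFAULT_TIMEOUT_SECONDS = 120       # ordinary shell steps

def timeout_for(command: str) -> int:
    """Pick a timeout so the harness doesn't kill healthy long deploys."""
    if command.strip().startswith(LONG_RUNNING_PREFIXES):
        return DEPLOY_TIMEOUT_SECONDS
    return DEFAULT_TIMEOUT_SECONDS
```

Encoding "some commands are legitimately slow" into the harness sidesteps the model's missing notion of elapsed time.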
Grigori Karapetyan
Grigori Karapetyan@GregKara6·
yea, in my opinion codex does not have good spatial awareness. it has great intelligence but is not really good at keeping spatial contextual awareness; claude is much better at this, it will remember and take into consideration obscure details that i even forget. one thing i have noticed (it depends on the harness) is that i always have to explicitly tell claude every time to use exa search.. seems that in its RL phase of training it was heavily biased against using web search.
2
0
1
211
BOOTOSHI 👑
BOOTOSHI 👑@KingBootoshi·
this mf CODEX 5.3 xhigh REASONING made a FALLBACK DATE it created a fallback FOR THE DATE THE DATE! IT TOOK POWER FROM THE EARTH TO WRITE A FUCKING FALLBACK FOR THE DATE! AND YOU ARE ALL ON THIS PLATFORM TALKING ABOUT NOT REVIEWING YOUR AGENTS ??? WHAT IS WRONG WITH YOU
BOOTOSHI 👑 tweet media
119
51
1.7K
150.8K
build.dev
build.dev@ivibecode·
@KingBootoshi @GregKara6 The Codex models lack intuition and are more prone to glossing over details. I bet 5.2 xhigh wouldn't have this issue. If you want 5.3 Codex to reliably do something, it needs to be well documented in your codebase. That's my experience anyway.
0
0
0
39
BOOTOSHI 👑
BOOTOSHI 👑@KingBootoshi·
well, the problem was a bit worse than that: it mis-read (or didn't load into context at all) the proper API types from research, so it... guessed the API types, which is why it added a bunch of ?? checks (because it didn't know what the actual key was) and the fallback, which honestly is even worse LOL, especially since i give my agents access to exa code search/websearch and prompt them to review and research the package. it was cleaned swiftly on a review pointing it out, but the fact it got this bad despite my very rigorous system is my fault for putting too much trust in it (it has gained my trust quite deeply over the last couple of months)
3
0
4
2.3K
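The failure mode in this thread, sketched in Python rather than TypeScript: when an agent guesses an API schema, it papers over unknown keys with fallbacks (the `??` checks and the fabricated date) instead of failing loudly. Field names here are hypothetical, purely for illustration.

```python
from datetime import datetime, timezone

def parse_created_at_guessy(payload: dict) -> datetime:
    # Anti-pattern: the schema was guessed, so every access gets a
    # fallback, ending in a fabricated "fallback date" if nothing matches.
    raw = payload.get("createdAt") or payload.get("created_at")
    return datetime.fromisoformat(raw) if raw else datetime.now(timezone.utc)

def parse_created_at_strict(payload: dict) -> datetime:
    # Verified schema: a wrong key fails loudly at the call site,
    # which is what you want an agent's code to do under review.
    return datetime.fromisoformat(payload["created_at"])
```

The guessy version always "works", which is exactly why the bug survives until a human reads the diff.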
Marcus Eisele
Marcus Eisele@eiselems·
@burkov Passive aggressive retarded gemini pro is the most 2026ish insult I heard so far 🤣
1
0
0
45
BURKOV
BURKOV@burkov·
Codex High IS SO FUCKING DUMB compared to Claude Opus! I have the Pro subscription to Codex, so I only use it to save Opus sessions (I have the cheapest Max), but damn, Codex feels like an always angry, passive-aggressive, retarded Gemini Pro! Once OpenAI blocks me for three days (and it happens after a single day of work with Codex High), I almost feel relieved.
65
1
106
23.4K
BURKOV
BURKOV@burkov·
@LLMJunky No issue with Claude, so nope.
3
0
2
993
build.dev
build.dev@ivibecode·
But if you want to cook something better, the suggestion would be a "planning scanner". Many opt to plan, but plans often have holes, so a review agent shouldn't only run after implementation. ...I'm sure this wasn't the intended use case, but I had a repo-wide refactor PR from a worktree branch and gave the chat a follow-on optimisation plan I'd already created, asking for improvements. It came back with excellent hot-path architectural cost shaving + correct dependency ordering that Codex 5.3 missed. Was super impressive tbh. It triggered me to post.
0
0
1
22
build.dev
build.dev@ivibecode·
I don't know what @cognition been cooking but the Devin review and chat are GOATED.
4
4
52
5.9K
build.dev
build.dev@ivibecode·
@shauseth I don't think coming up with a useful solution gives you any kind of edge. You are always doing something that others are already doing all over the internet, and the same solution could easily be thought of by someone else.
0
0
0
13
shaurya
shaurya@shauseth·
i don’t think agentic coding gives you any kind of edge. you are always either producing code that 1. already exists all over the internet or 2. can be rapidly produced by anyone
135
26
793
43.3K
build.dev
build.dev@ivibecode·
@asaio87 Many of us are not building solutions for other people; we're building them for our already-established businesses.
0
0
0
31
andrei saioc
andrei saioc@asaio87·
I see a lot of people building with Claude Code or AI agents like crazy, but only 1% show their apps and have sales. WHY ?
403
12
481
77.1K
build.dev
build.dev@ivibecode·
@dejavucoder There are often edge cases that Codex review will miss that get picked up by Greptile/CodeRabbit.
0
0
1
99
sankalp
sankalp@dejavucoder·
i am pretty sure codex /review and claude review surpass tools like greptile and coderabbit. can people confirm?
56
3
223
36.9K