Drew Breunig
@dbreunig
15.4K posts
Writing about and working on AI, geo, and data.
Bay Area · Joined March 2008
1.1K Following · 7.8K Followers
Kyle Daigle@kdaigle·
Hot take from looking at @github Copilot telemetry: benchmarks make coding models look wildly different. Production workflows make them look much more similar. 👀 We looked at 23M+ Copilot requests and examined one simple metric: code survivability.
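The thread doesn't define "code survivability" precisely. A minimal sketch, assuming it means the fraction of model-suggested lines still present in the final file after review, might look like this (the function and the data shapes are invented for illustration, not GitHub's actual telemetry schema):

```python
# Hypothetical sketch of a "code survivability" metric, assuming it means the
# fraction of model-suggested lines still present in the final file.
# Nothing here reflects GitHub's real pipeline.

def survivability(suggested_lines, final_lines):
    """Fraction of suggested lines that survive into the final file."""
    if not suggested_lines:
        return 0.0
    final = set(final_lines)
    kept = sum(1 for line in suggested_lines if line in final)
    return kept / len(suggested_lines)

suggested = ["def add(a, b):", "    return a + b", "    print('debug')"]
merged    = ["def add(a, b):", "    return a + b"]
print(survivability(suggested, merged))  # 2 of 3 suggested lines survive
```

A real version would need to track line identity through later edits (for example via `git blame`) rather than exact string matches.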
Drew Breunig@dbreunig·
There are so many issues with prompt adherence in a mono prompt, especially with personas (gstack could learn from the RPG hackers on this…). Talked with someone yesterday who has Claude write, Codex do code review, lets them go back and forth, then sorts the changes by how controversial they are.
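The write/review/sort workflow described above can be sketched as a small orchestration loop. Everything here is a hypothetical stand-in: `debate`, the stubbed writer and reviewer callables, and counting objections as a controversy score are assumptions, not anyone's actual tooling:

```python
# Sketch of a "one model writes, another reviews, sort by controversial"
# loop. The writer/reviewer stubs stand in for real model calls.

def debate(change, writer, reviewer, rounds=3):
    """Let a writer and a reviewer go back and forth on one change."""
    objections = 0
    for _ in range(rounds):
        review = reviewer(change)
        if review is None:               # reviewer is satisfied; stop early
            break
        objections += 1
        change = writer(change, review)  # writer revises against the review
    return change, objections

def sort_by_controversy(changes, writer, reviewer):
    """Run the loop per change; surface the most contested changes first."""
    results = [debate(c, writer, reviewer) for c in changes]
    return sorted(results, key=lambda r: r[1], reverse=True)

# Stub agents standing in for the two models:
def reviewer(change):
    return "too risky" if "risky" in change else None

def writer(change, review):
    return change.replace("risky", "safe")

print(sort_by_controversy(["risky refactor", "rename a variable"], writer, reviewer))
```

In practice the objection count would come from the reviewer model's structured output rather than a substring check.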
dex@dexhorthy·
Overall there are n thousand ways to throw more tokens at the problem and we need better ways to eval “review this plan and find all the things I didn’t think about” vs a 100+ instruction monolithic mega prompt
dex@dexhorthy·
Tried plan-review-ceo from gstack yesterday. I’m not sure if this is good or bad, intentional or not intentional, but when I felt like pushing back on the agent*, something in my brain feels like I’m arguing with Garry directly 🤣

Anyways, milestone 1 of a big feature shipping with RPI/QRSPI + Gstack shipping today, will report back.

* (which @garrytan had stated is part of the process: “your job is to know when the model is gassing you up and call it out” or something)

I have some technical concerns with the sheer volume of instructions in the prompt and the amount of adherence you will actually get (@0xblacklight cited an interesting arxiv paper in the post linked below). I think we might be better served by a router that routes to specific modes, rather than explaining every single mode in a single monolithic prompt, but there are tradeoffs to consider in plumbing and UX for the end user.

I think some may complain that it’s overly verbose and thoughtful and brings up things that are irrelevant, but I actually think that’s good. I want a clean braindump of everything that might be relevant so I can edit and prune down to just what’s important.
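The router alternative floated above can be sketched very simply: classify the request into a mode, then load only that mode's focused prompt instead of shipping one monolithic mega prompt. The modes, keywords, and prompt strings below are all invented for illustration, and a production router would more likely use a small classifier model than keyword matching:

```python
# Minimal sketch of routing to mode-specific prompts instead of one
# monolithic prompt. All mode names, keywords, and prompts are hypothetical.

MODE_PROMPTS = {
    "plan_review": "You are a critical plan reviewer. Find gaps ...",
    "code_review": "You are a code reviewer. Flag bugs and risks ...",
    "implement":   "You are an implementer. Write the change ...",
}

MODE_KEYWORDS = {
    "plan_review": ("plan", "spec", "milestone"),
    "code_review": ("review", "diff", "pull request"),
}

def route(request: str) -> str:
    """Pick a mode from keywords; fall back to 'implement'."""
    text = request.lower()
    for mode, keywords in MODE_KEYWORDS.items():
        if any(k in text for k in keywords):
            return mode
    return "implement"

mode = route("review this plan and find what I missed")
print(mode)
```

The payoff is that each mode's prompt stays short enough for the model to actually adhere to, at the cost of extra plumbing for the routing step.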
Pamela Fox@pamelafox·
@dbreunig The DSpy meetup tonight was fantastic - even though I've never used DSpy, the case studies were full of general insights. Will speakers be sharing their slides?
Drew Breunig@dbreunig·
@mjamei @Dropbox If you set your reflection_lm to a Claude, that is what DSPy/GEPA does, but in a systematic manner over many calls.
Mehdi Jamei 🗽@mjamei·
@Dropbox Did you try giving Claude the original prompt and the list of disagreements and asking it to “rewrite the prompt to address the disagreements”? I bet you get 90% of the lift right away.
Drew Breunig reposted
Dropbox@Dropbox·
How we used DSPy to turn our relevance judge into a measurable optimization loop, making it more reliable and scalable in Dropbox Dash.
Drew Breunig@dbreunig·
This is THE sleeper reason teams stick with DSPy.
Mingta Kaivo 明塔 开沃@MingtaKaivo

@dbreunig Exactly. The friction cost of "we should test this new model" was killing us. Most teams just default to whatever they launched with because retooling prompts is a whole sprint. Making that a one-line change is how you actually stay on the frontier instead of reading about it.

Drew Breunig@dbreunig·
@MingtaKaivo It is such a win. You go from "A new model just dropped, looks great, we should test it. But to see its potential we're going to have to tweak the prompt, and that takes weeks," so you never do it. To "A new model just dropped. Update the model string and run GEPA, why not?"
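As a configuration sketch, the "update the model string and run GEPA" loop might look like the following in DSPy. Treat every name here as an assumption: the model strings are placeholders, the metric is a toy, and the exact `dspy.GEPA` signature should be checked against the DSPy docs before use:

```python
# Configuration sketch only (requires API keys; not runnable as-is).
# Model IDs and the metric are placeholders, not recommendations.
import dspy

# The "one-line change" from the thread: swap the task model string.
dspy.configure(lm=dspy.LM("openai/new-model-that-just-dropped"))

def metric(example, prediction, *args, **kwargs):
    # Toy exact-match metric; a real one would score task-specific quality.
    return float(example.answer == prediction.answer)

optimizer = dspy.GEPA(
    metric=metric,
    reflection_lm=dspy.LM("anthropic/claude-model-of-choice"),  # per the tweet above
    auto="light",
)
# optimized_program = optimizer.compile(program, trainset=trainset)
```

The point from the thread is that the program and metric stay fixed while GEPA re-tunes the prompts for whatever model string you configure.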
Mingta Kaivo 明塔 开沃
@dbreunig the model-swap-without-prompt-rewrite is the part I hadn't solved cleanly. at AudioWave we abstracted routing early but still had to retune prompts per model. GEPA optimizing the program itself instead of the prompt is the right layer to fix at
Drew Breunig@dbreunig·
@MilksandMatcha Also matters which em-dash you’re using! The easy one to access on macOS is short enough that I always put spaces around it.
Sarah Chieng@MilksandMatcha·
One way to tell if someone is using claude vs. chatGPT is by looking at the em-dashes. claude ALWAYS adds spaces before and after the em-dash, chatGPT doesn't. Claude: ' — ' ChatGPT: '—'
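The spacing heuristic above can be written as a toy classifier. This is a minimal sketch of the spacing check only; the function name and labels are invented, and real attribution would need far more signal than one punctuation habit:

```python
# Toy check of the em-dash spacing heuristic. "\u2014" is the em-dash (—).

def em_dash_style(text: str) -> str:
    """Classify em-dash usage as 'spaced', 'unspaced', 'mixed', or 'none'."""
    dashes = text.count("\u2014")
    if dashes == 0:
        return "none"
    spaced = text.count(" \u2014 ")   # em-dash with a space on both sides
    if spaced == dashes:
        return "spaced"
    if spaced == 0:
        return "unspaced"
    return "mixed"

print(em_dash_style("code is clean \u2014 and fast"))  # the spaced style
print(em_dash_style("code is clean\u2014and fast"))    # the unspaced style
```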
swyx@swyx·
caption this
Farouk@FaroukAdeleke3·
this reminds me of @dbreunig's chroma context engineering talk. so many concepts and frameworks rooted in common sense that nobody cared about until andrej karpathy tweets about them on a random tuesday x.com/ivanbokii/stat…
Ivan@ivanbokii

It’s quite unfortunate that GEPA Optimize Anything didn’t get enough traction, while very, very similar ideas promoted by Karpathy’s autoresearch + Lütke’s pi-autoresearch - got so much traction, despite being less general

Drew Breunig@dbreunig·
--dangerously-skip-permissions
GIF
Keshav Jindal@Keshavatearth·
@dbreunig I wonder what's the score with newer models. all the models in this chart are non-reasoning models and no longer exist
Drew Breunig@dbreunig·
@swyx @fabknowledge Doing the back of the envelope math...how many tokens will I have to spend to fit in a standard instance...to save $150/month...
Drew Breunig@dbreunig·
Some questions that can be answered with this response:
- Why does context rot occur?
- Why are models so good at coding and meh at other things?
- Why do models fight formatting requests?
- Why are models good at reasoning with code?
- Why do models try not to be shut down during testing?
- Why do models seem so human-like?
- Explain Moltbook?
elie@eliebakouch·
@kiranvodrahalli yes i agree! i was mainly impressed since previous anthropic models were not super good on this benchmark, and now it's sota. also curious to get your take if you have a public long-context eval that you like beyond NIAH-style ones, maybe graphwalk or the ones in HELMET?