Pinned Tweet
Eric Stokes
618 posts


@yminsky I've had good luck with Opus 4.6 using this workflow:
1. Make a plan
2. Review the plan yourself, revise until correct
3. Clear context and execute plan
4. Clear context and ask for review in plan mode, revise and execute the plan, do ~3 loops of this.
5. Review the code yourself

Awesome that they did this study. It demonstrates things that anyone using the models for serious engineering work could see, but that are invisible in the public data.
Parker Whitfill@whitfill_parker
How do benchmarks map to real-world capabilities? To study this, we hired 4 maintainers of repos used in SWE-bench Verified to review agent code. Of agent PRs that passed SWE-bench’s grader, maintainers would merge ~half. This holds accounting for noise in maintainer decisions.

@AgustinLebron3 We are all on the lookout for
* I changed the *test* to work around an issue
243 of 243 tests pass, everything looks good
😱😱😱

@yminsky Except now the metric is, can you work on 8 projects at once without losing your mind 😂

@yminsky @mbacarella I do think languages they design for their own use are probably the end state. They already appear to get significant value out of expressive and strict type systems, and I see no reason why they wouldn't double down on that.

@mbacarella @eestokesOSS If anything, models will adopt weirder and more esoteric languages, maybe designed just for them, to magnify their intelligence yet more. They'll have fewer social barriers to learning new languages!

Not news, exactly, but an interesting observation about our rapidly changing world
Jules Jacobs@JulesJacobs5
@yminsky (3) Maybe certain libraries are not so valuable any more. A description of the treemap layout algorithm alone is enough or perhaps even better than having an implementation, because it is easier to tweak the description than have AI tweak the code.

@mbacarella @yminsky But to your point, I really was thinking about the next model when I asked the question. To reframe again: what do programming languages look like when the customer is an AI? I think Ron has the right idea.

@eestokesOSS @yminsky given how fast Claude is advancing, if you said that a year from now it'll ingest 50,000 lines of ad hoc assembly, project it into some kind of typed lambda calculus, rework it, and then blast it back out, I wouldn't be that shocked

@mbacarella @yminsky It just found and fixed a bug in the Graphix type checker. Literally the most complex piece (don't tell the parser I said that). I helped, but it felt like a team effort and it went a lot faster than if I had done it unaugmented.

@yminsky More succinctly, PL overall still matters, but the set of PL projects that matter just changed almost completely.

I agree with everything you said. Type systems help them reason just like they help us reason. We should double down on that and end up with AI that's much more powerful.
However, a practical example. I just finished building a new programming language. It's designed to make building UIs much easier than it has been in the past. There are two issues for me specifically.
1. My language isn't in the training set, so I have to rely on in-context learning to get AI to write in it. This is ... not great, not awful.
2. Humans no longer write UI code.
Maybe your point is humans might read UI code, and so my language is still worth it. Well, maybe, but probably not in this case. No one cares what the UI code looks like as long as it works.
So yeah, the field of programming languages still matters. However, a lot of projects that were promising and relevant 3 weeks ago no longer are.

@mbacarella I have a useless programming language to finish bro!

@eestokesOSS tbh if you've managed to lock in and not be distracted by news this well you should keep it up

@esrtweet Ooo, making a list of computer languages. Did you get graphix-lang.github.io/graphix/? It's very new.

I've learned an interesting number recently, from working on loccount.
What would your guess be about the number of distinct computer languages and plain-text markup formats in the world? The range is anything for which "count lines" might be an interesting question on a Unix machine.
I've been working on extending loccount's breadth of coverage for a while, and recently I've been using AIs to find obscure languages and markup formats to add.
And...I've hit a wall. I've actually had an LLM tell me that I've covered every computer language outside of obscure academic toys, and most of those too. And when I ask it what markup formats I should add, it's reduced to pointing at various obscure bits of glue in build and orchestration systems.
So I know what that number is. There's room for some argument about it along the usual splitter/lumper lines, but none of those arguments are going to budge the number by more than 20% at the outside.
Make a guess. Drop it in a reply. I'm interested in what the range of people's estimates is.
When the hubbub dies down, I'll post the answer.

@AgustinLebron3 Yeah, chatbots are saturated, but Codex and Claude Code have found a useful job for these models. I'm seeing at least a 1 OOM (order of magnitude) speedup in my daily work. If they can expand that utility to a few more domains then the investments can just about pay off.

"OpenAI itself admits the problem, talking about a ‘capability gap’ between what the models can do and what people do with them, which seems to me [...] you don’t have clear product-market fit."
Turns out most people don't need a know-it-all very much.
ben-evans.com/benedictevans/…

@FaytuksNetwork We have a gold plated trash can ready and waiting to receive their strongly worded letter.


