Alex Boruch-Gruszecki

51 posts

@abgruszecki

Investigating how to build the future of coding using AI and programming languages. Postdoc in Arjun Guha's group

Boston · Joined October 2022
91 Following · 84 Followers
Alex Boruch-Gruszecki@abgruszecki·
ProgramBench is awesome! It shows there's a lot of room to improve LLMs and agents at autonomously implementing entire programs from scratch, and it's a great hill climb target programbench.com
1 reply · 0 reposts · 1 like · 30 views
Alex Boruch-Gruszecki@abgruszecki·
Can one person really use agents to manage nearly half a million lines of JavaScript, and produce a maintainable codebase? One way to find out! mazesofmenace.ai
0 replies · 0 reposts · 1 like · 18 views
Alex Boruch-Gruszecki@abgruszecki·
Porting NetHack seems like a great challenge for agents. The translation can be verified using pre-recorded input-output behaviors. But the feedback for the agent is sparse, the debugging chains get long, and early architectural mistakes can look good for a long time.
1 reply · 0 reposts · 1 like · 42 views
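A concrete way to read "verified using pre-recorded input-output behaviors": replay a session captured from the original C game against the JavaScript port and stop at the first divergence. The TypeScript sketch below is a hypothetical illustration; the `PortedGame` interface and the trace format are assumptions, not part of the contest materials.

```typescript
// Hypothetical trace-replay check. The trace format and the `PortedGame`
// interface are illustrative assumptions, not the contest's actual harness.
interface TraceStep {
  input: string;          // keystroke/command recorded from the original C game
  expectedOutput: string; // screen or state snapshot recorded right after it
}

interface PortedGame {
  // The JavaScript/TypeScript port consumes one input and returns its output.
  step(input: string): string;
}

// Replay the recorded session and report the first step where the port diverges.
function verifyTrace(
  game: PortedGame,
  trace: TraceStep[],
): { ok: boolean; failedAt?: number } {
  for (let i = 0; i < trace.length; i++) {
    const actual = game.step(trace[i].input);
    if (actual !== trace[i].expectedOutput) {
      return { ok: false, failedAt: i };
    }
  }
  return { ok: true };
}
```

This also makes the sparse-feedback problem from the tweet visible: a failing replay only pinpoints the step where outputs diverge, not which earlier architectural decision caused the divergence.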
Alex Boruch-Gruszecki@abgruszecki·
I'll be helping David Bau run a contest which will let its participants show us how far they can push agentic coding! The task is "simple": port NetHack, over 440k lines of C and Lua, to JavaScript. Easy to verify, but still far from easy for agents. abgru.me/project/telepo…
1 reply · 0 reposts · 2 likes · 55 views
Alex Boruch-Gruszecki reposted
François Chollet@fchollet·
The G in AGI stands for "general". General intelligence does not mean that you have been specifically trained for a large range of tasks. It means you can approach any NEW task and figure it out, just like humans do. If regular people can do it on their own (no guidance, no tools), why should AGI require special handholding and handcrafted instructions? If it's AGI, why would there still be a human in the loop, using their own human intelligence to guide the model on every new task?
Lisan al Gaib@scaling01

This is pretty much worst-case performance: no harness at all and a very simplistic prompt.

136 replies · 93 reposts · 1.4K likes · 187.3K views
Alex Boruch-Gruszecki reposted
ARC Prize@arcprize·
Announcing ARC-AGI-3. The only unsaturated agentic intelligence benchmark in the world. Humans score 100%, AI <1%. This human-AI gap demonstrates we do not yet have AGI. Most benchmarks test what models already know; ARC-AGI-3 tests how they learn.
247 replies · 586 reposts · 4.3K likes · 732.1K views
Alex Boruch-Gruszecki@abgruszecki·
I'm working on scaling Agnostics to larger problems. Esolangs are an exciting angle! I'd be glad to talk more, maybe at ICLR. Some people I know may also be interested. See more about Agnostics here: agnostics.abgru.me
0 replies · 0 reposts · 0 likes · 35 views
Alex Boruch-Gruszecki@abgruszecki·
Great study! LLMs fail at rare programming languages in surprising ways. It'd be interesting to study these failure modes on larger examples. Our Agnostics may help: we show how to make problems which can be solved in any PL. Happy to chat more @lossfunk! x.com/lossfunk/statu…
Lossfunk@lossfunk

🚨 Shocking: Frontier LLMs score 85-95% on standard coding benchmarks. We gave them equivalent problems in languages they couldn't have memorized. They collapsed to 0-11%. Presenting EsoLang-Bench. Accepted to the Logical Reasoning and ICBINB workshops at ICLR 2026 🧵

1 reply · 0 reposts · 1 like · 85 views
Alex Boruch-Gruszecki@abgruszecki·
@ShriramKMurthi In the end, a human needs to verify if the codebase satisfies some real-world requirements. It’s hard to see how to escape that in the foreseeable future.
0 replies · 0 reposts · 0 likes · 402 views
Shriram Krishnamurthi (primary: Bluesky)
“Turning those drafts into production software still requires […] $300K+ per year in compiler engineer salary.” is an extremely poor take. That 300K/yr compiler engineer isn't going to want to go within a mile of this codebase. What you're paying for is quality *all along*.
Aakash Gupta@aakashgupta

Sounds incredible until you read the fine print. The compiler generates less efficient code than GCC with all optimizations disabled. It doesn’t have its own assembler or linker. It can’t produce a 16-bit x86 code generator. And Carlini himself says it has “nearly reached the limits of Opus’s abilities.” New features and bugfixes kept breaking existing functionality.

So what did $20,000 and two weeks actually buy? A compiler that passes 99% of GCC’s torture tests but can’t match the output quality of a tool that’s had 37 years of human engineering. That’s the constraint nobody’s pricing in.

The real story is in the cost curve, not the capability demo. $20,000 for 100,000 lines means $0.20 per line of generated code. A senior compiler engineer costs roughly $150/hour. At maybe 50 polished lines per hour for something this complex, that’s $3/line. AI just did it at 15x cheaper, and it will only get cheaper from here.

But the code isn’t equivalent. The AI version needs a human to finish the assembler, fix the linker, optimize the output, and prevent regressions. Those are the hardest 20% of the problem, and they represent 80% of the engineering value. Anthropic built the demo. Shipping the product still requires humans.

This tells you exactly where we are in the autonomous software timeline. AI can now produce impressive first drafts of complex systems at trivial cost. Turning those drafts into production software still requires the judgment that costs $300K+ per year in compiler engineer salary. The gap between “compiles the Linux kernel” and “replaces GCC” is measured in decades of accumulated engineering wisdom that no model has internalized yet.

The companies that understand this will use agent teams to generate the 80% and hire engineers to finish the 20%. The companies that don’t will ship $20,000 compilers that produce slower code than a free tool from 1987.

10 replies · 10 reposts · 179 likes · 26.1K views
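The per-line cost comparison in the quoted thread checks out arithmetically; a quick sketch using the thread's own figures (these are that thread's assumptions, not measured data):

```typescript
// Back-of-envelope check of the quoted thread's own figures (its assumptions, not measurements).
const aiCostPerLine = 20_000 / 100_000;         // $20,000 for ~100,000 generated lines → $0.20/line
const humanCostPerLine = 150 / 50;              // $150/hour at ~50 polished lines/hour → $3/line
const ratio = humanCostPerLine / aiCostPerLine; // 3 / 0.2 = 15, the "15x cheaper" claim
console.log({ aiCostPerLine, humanCostPerLine, ratio });
```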
Alex Boruch-Gruszecki@abgruszecki·
@__protected @odersky Congratulations @odersky! Since the invitation to your lab, I've been on the non-stop adventure of my life. Seeing your efforts was formative for me. As Jonathan said: well deserved!
0 replies · 0 reposts · 1 like · 115 views
Jonathan Brachthäuser@__protected·
Congratulations @odersky for receiving the SIGPLAN Programming Languages Achievement Award! Your work is a great inspiration for me :) Well deserved!
3 replies · 18 reposts · 106 likes · 6.5K views
elie@eliebakouch·
If you're a researcher working on RL, you should definitely try SmolLM3-3B and get another data point besides Qwen3-3B. 1) We didn't have time to try RL during post-training, so I think there's still some room to build an even better version of smollm! 2) We released the intermediate checkpoints from post-training, so you can use our model at different stages (base, mid-training, SFT, APO, merging) and see if it changes RL perf. 3) The model is also pretty good at long context; you can probably push it past 128k thanks to NoPE and YaRN.
19 replies · 25 reposts · 320 likes · 37.3K views
Alex Boruch-Gruszecki@abgruszecki·
The leaderboard also shows the results of training SmolLM3 using the Agnostics framework; it's a small (3B) but very capable model. The Lua variant shows the highest relative gains of all the Lua models we trained!
0 replies · 0 reposts · 0 likes · 53 views
Alex Boruch-Gruszecki@abgruszecki·
The leaderboard shows more results than we included in the report. We can see that the models we trained rival Sonnet 4 on coding in R, and beat both it and Qwen 3 Coder on Fortran!
1 reply · 0 reposts · 0 likes · 88 views
Alex Boruch-Gruszecki@abgruszecki·
We're publishing the Ag-LiveCodeBench-X leaderboard! It shows the performance of models on coding in low-resource programming languages, using a benchmark prepared during the Agnostics project. ag-livecodebench-x.github.io
1 reply · 0 reposts · 0 likes · 78 views
Alex Boruch-Gruszecki@abgruszecki·
@brendanh0gan Congrats on your impressive results! We published a similar report recently, although we focused on developing a universal pipeline which works on any programming language. I'm curious to see how we can learn from each other! x.com/abgruszecki/st…
Alex Boruch-Gruszecki@abgruszecki

We show a way to reinforce an LLM’s ability to code in *any* programming language! We turn Qwen 3 4B and 8B into SOTA ≤16B models for low-resource programming languages, rivaling their 32B sibling. Find out more about our Agnostics project at agnostics.abgru.me, or here👇

1 reply · 0 reposts · 2 likes · 232 views
Brendan Hogan@brendanh0gan·
Introducing qqWen: our fully open-sourced project (code + weights + data + detailed technical report) for full-stack finetuning (pretrain + SFT + RL) of a series of models (1.5B, 3B, 7B, 14B & 32B) for a niche financial programming language called Q. All details below!
20 replies · 92 reposts · 742 likes · 133.4K views
Alex Boruch-Gruszecki@abgruszecki·
@Laz4rz This is Switzerland? It's part of the experience, I'm afraid. During my stay at EPFL, something like this happened a few times each year.
0 replies · 0 reposts · 1 like · 45 views
Lazarz@Laz4rz·
yikes, hospitality
49 replies · 5 reposts · 485 likes · 45.6K views
Edward Z. Yang@ezyang·
Suppose you are the maintainers of a low resource programming language, and you would like to work on directly improving the LLM coding experience on top of the language. What is your biggest leverage point?
8 replies · 1 repost · 9 likes · 2.6K views
Alex Boruch-Gruszecki@abgruszecki·
@disconcision @samth @ArjunGuha Yes, exactly! And the top models could still be better. Prior work shows that training on more programming languages can make models better at coding overall, which could be even more true now that models learn reasoning by writing code.
0 replies · 0 reposts · 1 like · 40 views
Arjun Guha@ArjunGuha·
This is new work from my group, led by @abgruszecki, as we try to push LLM capabilities on low-resource programming languages. I think we produced some of the best small models for OCaml, Fortran, and other PLs. We also have a "new", harder multi-language benchmark.
Alex Boruch-Gruszecki@abgruszecki

We show a way to reinforce an LLM’s ability to code in *any* programming language! We turn Qwen 3 4B and 8B into SOTA ≤16B models for low-resource programming languages, rivaling their 32B sibling. Find out more about our Agnostics project at agnostics.abgru.me, or here👇

1 reply · 4 reposts · 35 likes · 3.5K views
Alex Boruch-Gruszecki@abgruszecki·
@ArjunGuha @samth There's a similar issue where better language abstractions could help LLMs on some tasks, but by definition there's little training data for these abstractions. Agnostics shows one way to get around that, but much more could be done.
0 replies · 0 reposts · 0 likes · 17 views
Arjun Guha@ArjunGuha·
@samth @abgruszecki To be clear, there are many confounds. For example, in the LLM space, Python has better abstractions than any other language, from low-level Pytorch up to high-level DSPy. I wonder if we can bring new abstractions to other PLs with LLM translations.
2 replies · 0 reposts · 2 likes · 101 views