Alex Boruch-Gruszecki

51 posts

@abgruszecki

Investigating how to build the future of coding using AI and programming languages. Postdoc in Arjun Guha's group

Boston · Joined October 2022
91 Following · 84 Followers
Alex Boruch-Gruszecki@abgruszecki·
ProgramBench is awesome! It shows there's a lot of room to improve LLMs and agents at autonomously implementing entire programs from scratch, and it's a great hill climb target programbench.com
1 reply · 0 reposts · 1 like · 30 views
Alex Boruch-Gruszecki@abgruszecki·
Can one person really use agents to manage nearly half a million lines of JavaScript, and produce a maintainable codebase? One way to find out! mazesofmenace.ai
0 replies · 0 reposts · 1 like · 18 views
Alex Boruch-Gruszecki@abgruszecki·
Porting NetHack seems like a great challenge for agents. The translation can be verified using pre-recorded input-output behaviors. But the feedback for the agent is sparse, the debugging chains get long, and early architectural mistakes can look good for a long time.
1 reply · 0 reposts · 1 like · 42 views
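A concrete way to read "verified using pre-recorded input-output behaviors": replay a session captured from the original C game against the JavaScript port and stop at the first divergence. The TypeScript sketch below is a hypothetical illustration; the `PortedGame` interface and the trace format are assumptions, not part of the contest materials.

```typescript
// Hypothetical trace-replay check. The trace format and the `PortedGame`
// interface are illustrative assumptions, not the contest's actual harness.
interface TraceStep {
  input: string;          // keystroke/command recorded from the original C game
  expectedOutput: string; // screen or state snapshot recorded right after it
}

interface PortedGame {
  // The JavaScript/TypeScript port consumes one input and returns its output.
  step(input: string): string;
}

// Replay the recorded session and report the first step where the port diverges.
function verifyTrace(
  game: PortedGame,
  trace: TraceStep[],
): { ok: boolean; failedAt?: number } {
  for (let i = 0; i < trace.length; i++) {
    const actual = game.step(trace[i].input);
    if (actual !== trace[i].expectedOutput) {
      return { ok: false, failedAt: i };
    }
  }
  return { ok: true };
}
```

This also makes the sparse-feedback problem from the tweet visible: a failing replay only pinpoints the step where outputs diverge, not which earlier architectural decision caused the divergence.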
Alex Boruch-Gruszecki@abgruszecki·
I'll be helping David Bau run a contest which will let its participants show us how far they can push agentic coding! The task is "simple": port NetHack, over 440k lines of C and Lua, to JavaScript. Easy to verify, but still far from easy for agents. abgru.me/project/telepo…
1 reply · 0 reposts · 2 likes · 55 views
Alex Boruch-Gruszecki reposted
François Chollet@fchollet·
The G in AGI stands for "general". General intelligence does not mean that you have been specifically trained for a large range of tasks. It means you can approach any NEW task and figure it out, just like humans do. If regular people can do it on their own (no guidance, no tools), why should AGI require special handholding and handcrafted instructions? If it's AGI, why would there still be a human in the loop, using their own human intelligence to guide the model on every new task?
Lisan al Gaib@scaling01

This is pretty much worst-case performance: no harness at all and a very simplistic prompt.

136 replies · 93 reposts · 1.4K likes · 187.3K views
Alex Boruch-Gruszecki reposted
ARC Prize@arcprize·
Announcing ARC-AGI-3. The only unsaturated agentic intelligence benchmark in the world. Humans score 100%, AI <1%. This human-AI gap demonstrates we do not yet have AGI. Most benchmarks test what models already know; ARC-AGI-3 tests how they learn.
247 replies · 586 reposts · 4.3K likes · 732.1K views
Alex Boruch-Gruszecki@abgruszecki·
I'm working on scaling Agnostics to larger problems. Esolangs are an exciting angle! I'd be glad to talk more, maybe at ICLR. Some people I know may also be interested. See more about Agnostics here: agnostics.abgru.me
0 replies · 0 reposts · 0 likes · 35 views
Alex Boruch-Gruszecki@abgruszecki·
Great study! LLMs fail at rare programming languages in surprising ways. It'd be interesting to study these failure modes on larger examples. Our Agnostics may help: we show how to make problems which can be solved in any PL. Happy to chat more @lossfunk! x.com/lossfunk/statu…
Lossfunk@lossfunk

🚨 Shocking: Frontier LLMs score 85-95% on standard coding benchmarks. We gave them equivalent problems in languages they couldn't have memorized. They collapsed to 0-11%. Presenting EsoLang-Bench. Accepted to the Logical Reasoning and ICBINB workshops at ICLR 2026 🧵

1 reply · 0 reposts · 1 like · 85 views
Alex Boruch-Gruszecki@abgruszecki·
@ShriramKMurthi In the end, a human needs to verify if the codebase satisfies some real-world requirements. It’s hard to see how to escape that in the foreseeable future.
0 replies · 0 reposts · 0 likes · 402 views
Shriram Krishnamurthi (primary: Bluesky)
“Turning those drafts into production software still requires […] $300K+ per year in compiler engineer salary.” is an extremely poor take. That 300K/yr compiler engineer isn't going to want to go within a mile of this codebase. What you're paying for is quality *all along*.
Aakash Gupta@aakashgupta

Sounds incredible until you read the fine print. The compiler generates less efficient code than GCC with all optimizations disabled. It doesn’t have its own assembler or linker. It can’t produce a 16-bit x86 code generator. And Carlini himself says it has “nearly reached the limits of Opus’s abilities.” New features and bugfixes kept breaking existing functionality.

So what did $20,000 and two weeks actually buy? A compiler that passes 99% of GCC’s torture tests but can’t match the output quality of a tool that’s had 37 years of human engineering. That’s the constraint nobody’s pricing in.

The real story is in the cost curve, not the capability demo. $20,000 for 100,000 lines means $0.20 per line of generated code. A senior compiler engineer costs roughly $150/hour. At maybe 50 polished lines per hour for something this complex, that’s $3/line. AI just did it at 15x cheaper, and it will only get cheaper from here.

But the code isn’t equivalent. The AI version needs a human to finish the assembler, fix the linker, optimize the output, and prevent regressions. Those are the hardest 20% of the problem, and they represent 80% of the engineering value. Anthropic built the demo. Shipping the product still requires humans.

This tells you exactly where we are in the autonomous software timeline. AI can now produce impressive first drafts of complex systems at trivial cost. Turning those drafts into production software still requires the judgment that costs $300K+ per year in compiler engineer salary. The gap between “compiles the Linux kernel” and “replaces GCC” is measured in decades of accumulated engineering wisdom that no model has internalized yet.

The companies that understand this will use agent teams to generate the 80% and hire engineers to finish the 20%. The companies that don’t will ship $20,000 compilers that produce slower code than a free tool from 1987.

10 replies · 10 reposts · 179 likes · 26.1K views
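The per-line cost comparison in the quoted thread checks out arithmetically; a quick sketch using the thread's own figures (these are that thread's assumptions, not measured data):

```typescript
// Back-of-envelope check of the quoted thread's own figures (its assumptions, not measurements).
const aiCostPerLine = 20_000 / 100_000;         // $20,000 for ~100,000 generated lines → $0.20/line
const humanCostPerLine = 150 / 50;              // $150/hour at ~50 polished lines/hour → $3/line
const ratio = humanCostPerLine / aiCostPerLine; // 3 / 0.2 = 15, the "15x cheaper" claim
console.log({ aiCostPerLine, humanCostPerLine, ratio });
```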
Alex Boruch-Gruszecki@abgruszecki·
@__protected @odersky Congratulations @odersky! Since the invitation to your lab, I've been on the non-stop adventure of my life. Seeing your efforts was formative for me. As Jonathan said: well deserved!
0 replies · 0 reposts · 1 like · 115 views
Jonathan Brachthäuser@__protected·
Congratulations @odersky for receiving the SIGPLAN Programming Languages Achievement Award! Your work is a great inspiration for me :) Well deserved!
3 replies · 18 reposts · 106 likes · 6.5K views
elie@eliebakouch·
If you're a researcher working on RL, you should definitely try SmolLM3-3B and get another data point besides Qwen3-3B. 1) We didn't have time to try RL during post-training, so I think there's still some room to build an even better version of smollm! 2) We released the intermediate checkpoints from post-training, so you can use our model at different stages (base, mid-training, SFT, APO, merging) and see if it changes RL perf. 3) The model is also pretty good at long context; you can probably push it past 128k thanks to NoPE and YaRN.
19 replies · 25 reposts · 320 likes · 37.3K views
Alex Boruch-Gruszecki@abgruszecki·
The leaderboard also shows the results of training SmolLM3 using the Agnostics framework; it's a small (3B) but very capable model. The Lua variant shows the highest relative gains of all the Lua models we trained!
0 replies · 0 reposts · 0 likes · 53 views
Alex Boruch-Gruszecki@abgruszecki·
The leaderboard shows more results than we included in the report. We can see that the models we trained rival Sonnet 4 on coding in R, and beat both it and Qwen 3 Coder on Fortran!
1 reply · 0 reposts · 0 likes · 88 views
Alex Boruch-Gruszecki@abgruszecki·
We're publishing the Ag-LiveCodeBench-X leaderboard! It shows the performance of models on coding in low-resource programming languages, using a benchmark prepared during the Agnostics project. ag-livecodebench-x.github.io
1 reply · 0 reposts · 0 likes · 78 views
Alex Boruch-Gruszecki@abgruszecki·
@brendanh0gan Congrats on your impressive results! We published a similar report recently, although we focused on developing a universal pipeline which works on any programming language. I'm curious to see how we can learn from each other! x.com/abgruszecki/st…
Alex Boruch-Gruszecki@abgruszecki

We show a way to reinforce an LLM’s ability to code in *any* programming language! We turn Qwen 3 4B and 8B into SOTA ≤16B models for low-resource programming languages, rivaling their 32B sibling. Find out more about our Agnostics project at agnostics.abgru.me, or here👇

1 reply · 0 reposts · 2 likes · 232 views
Brendan Hogan@brendanh0gan·
Introducing qqWen: our fully open-sourced project (code + weights + data + detailed technical report) for full-stack finetuning (pretrain + SFT + RL) of a series of models (1.5B, 3B, 7B, 14B & 32B) for a niche financial programming language called Q. All details below!
20 replies · 92 reposts · 742 likes · 133.4K views
Alex Boruch-Gruszecki@abgruszecki·
@Laz4rz This is Switzerland? It's part of the experience, I'm afraid. During my stay at EPFL, something like this happened a few times each year.
0 replies · 0 reposts · 1 like · 45 views
Lazarz@Laz4rz·
yikes, hospitality
49 replies · 5 reposts · 485 likes · 45.6K views
Edward Z. Yang@ezyang·
Suppose you are the maintainers of a low resource programming language, and you would like to work on directly improving the LLM coding experience on top of the language. What is your biggest leverage point?
8 replies · 1 repost · 9 likes · 2.6K views
Alex Boruch-Gruszecki@abgruszecki·
@disconcision @samth @ArjunGuha Yes, exactly! And the top models could still be better. Prior work shows that training on more programming languages can make models better at coding overall, which could be even more true now that models learn reasoning by writing code.
0 replies · 0 reposts · 1 like · 40 views
Arjun Guha@ArjunGuha·
This is new work from my group, led by @abgruszecki, as we try to push LLM capabilities on low-resource programming languages. I think we produced some of the best small models for OCaml, Fortran, and other PLs. We also have a "new", harder multi-language benchmark.
Alex Boruch-Gruszecki@abgruszecki

We show a way to reinforce an LLM’s ability to code in *any* programming language! We turn Qwen 3 4B and 8B into SOTA ≤16B models for low-resource programming languages, rivaling their 32B sibling. Find out more about our Agnostics project at agnostics.abgru.me, or here👇

1 reply · 4 reposts · 35 likes · 3.5K views
Alex Boruch-Gruszecki@abgruszecki·
@ArjunGuha @samth There's a similar issue where better language abstractions could help LLMs on some tasks, but by definition there's little training data for these abstractions. Agnostics shows one way to get around that, but much more could be done.
0 replies · 0 reposts · 0 likes · 17 views
Arjun Guha@ArjunGuha·
@samth @abgruszecki To be clear, there are many confounds. For example, in the LLM space, Python has better abstractions than any other language, from low-level Pytorch up to high-level DSPy. I wonder if we can bring new abstractions to other PLs with LLM translations.
2 replies · 0 reposts · 2 likes · 101 views