Lossfunk

538 posts

Lossfunk

@lossfunk

We ask foundational questions to explore what's next in AI

🇮🇳 Joined January 2025
1 Following · 14.6K Followers
Pinned Tweet
Lossfunk
Lossfunk@lossfunk·
Here's what we're interested in as a lab.
Lossfunk tweet media
Lossfunk
Lossfunk@lossfunk·
Addressing a few questions about our Esolang-Bench.

a) Why do it? Does it measure anything useful?

It was a curiosity-driven project. We're interested in how humans exhibit sample efficiency in learning and OOD generalization. So we simply asked: if models can zero/few-shot correct answers for simple programming problems in Python, can they do the same in esoteric languages? The benchmark is what it is. Different people can interpret its usefulness differently, and we encourage that.

b) But humans can't write esoteric languages well either. It's an unfair comparison.

Primarily, we're interested in measuring LLM capabilities. With all the talk of ASI, their capabilities are supposed to soon be super-human. So our primary motivation wasn't to compare against humans but to check what they can do on this by-construction-difficult benchmark. That said, we do believe humans are able to teach themselves a new domain by transferring their old skills, so this benchmark sets a starting point for exploring how AI systems can do the same (which is what we're exploring now).

c) But Claude Code crushes it. You limited models artificially.

Yes, we tested models in zero- and few-shot settings, and in the agentic loop we describe in the paper we limit the number of iterations. As written above, we wanted to understand their performance from a comparative point of view (say, against highly represented languages like Python), and that's why the benchmark is designed this way. After the paper was finalized, we experimented with agentic systems where we gave models tools like bash and allowed unlimited iterations (but limited submission attempts). They indeed perform much better. The relevant question is what makes these models perform so well when you give them tools and iterations vs. when you don't. Are they reasoning / learning like humans, or is it something else?

d) So, are LLMs hyped? Or is our study clickbait?

The paper, code and benchmark are all open source 👇 We encourage whoever is interested to read them and make up their own minds. (We couldn't help noticing that the *same* set of results was interpreted wildly differently within the community. A debate between opposing camps of LLMs ensued. Perhaps that's a good thing?)
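The "limited submission attempts" setup mentioned in (c) can be sketched as a small harness. This is a hypothetical sketch, not the paper's actual code: the names `agentic_eval`, `solve`, and `check` are ours, standing in for a model call and a hidden-test judge.

```python
from typing import Callable

def agentic_eval(
    solve: Callable[[str, str], str],         # hypothetical model call: (problem, feedback) -> program
    check: Callable[[str], tuple[bool, str]], # hypothetical judge: runs program, returns (passed, feedback)
    problem: str,
    max_attempts: int = 3,                    # the limited submission budget described above
) -> tuple[bool, int]:
    """Loop: propose a program, submit it, feed the judge's error text back,
    and stop on a pass or when the submission budget is exhausted."""
    feedback = ""
    for attempt in range(1, max_attempts + 1):
        program = solve(problem, feedback)
        passed, feedback = check(program)
        if passed:
            return True, attempt
    return False, max_attempts
```

The interesting variable is `max_attempts`: with it effectively unbounded (plus tools like bash), the models do much better, which is exactly the gap the tweet asks about.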
Lossfunk
Lossfunk@lossfunk·
@deliprao This was driven by curiosity, mostly. Your suggestion of working with unfamiliar codebases is exactly the next research problem we're working on (inspired by these results).
Delip Rao e/σ
Delip Rao e/σ@deliprao·
I see the utility of such benchmarks in auditing whether LLMs are genuinely reasoning vs pattern matching. That's a real question worth answering. Beyond that, I am not sure why caring about esoteric PL performance matters. I would rather have a codegen model that scores 90% on Python and can actually reason when it hits an unfamiliar API or codebase. That's a knowledge problem. The PL diversity framing is also a red herring IMO. There are compounding returns in PL consolidation: better libraries, better dev ecosystem, lower human risk. I'd still rather have a codegen model overfitted to Python than one that generalizes to Brainfuck (no free lunch etc).
Lossfunk@lossfunk

🚨 Shocking: Frontier LLMs score 85-95% on standard coding benchmarks. We gave them equivalent problems in languages they couldn't have memorized. They collapsed to 0-11%. Presenting EsoLang-Bench. Accepted to the Logical Reasoning and ICBINB workshops at ICLR 2026 🧵

Lossfunk
Lossfunk@lossfunk·
In our paper we were testing models' capacity to solve these problems zero-shot or few-shot. In our agentic setups, we limited the number of iterations. The motivation was to see their capability vs. what they show on Python, etc. (where few-shot, limited iterations work great). After the paper, we're now experimenting with long-running sessions / multiple attempts by Opus and other frontier models. And yeah, the results are what this shows: if you give models a lot of bandwidth to iterate, they're able to solve the easy part of our benchmark (we haven't tested on medium and hard).
Lossfunk tweet media
Lossfunk
Lossfunk@lossfunk·
@ShriKaranHanda After the paper was done, we tried modern agentic tools like Claude Code, gave them tools, and instructed them to explore/learn. We found it actually wrote something like this by itself (without being instructed to). Stay tuned for this update.
Karan Handa
Karan Handa@ShriKaranHanda·
@lossfunk Curious how it would perform if instructed to write a transpiler to these esoteric languages in a language that it's familiar with
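The transpiler idea above can be sketched in miniature: rather than having the model write esoteric code directly, generate it from a language it knows. A toy sketch (all names ours, assuming nothing about the actual benchmark harness): a naive "compiler" that turns a print-this-string task into Brainfuck, plus a minimal interpreter to verify the output.

```python
def to_brainfuck(text: str) -> str:
    """Naively compile 'print this string' into Brainfuck: for each character,
    emit enough +'s to reach its code point, print it, then clear the cell."""
    out = []
    for ch in text:
        out.append("+" * ord(ch) + ".")  # raise cell to the char code, output it
        out.append("[-]")                # loop: decrement cell back to zero
    return "".join(out)

def run_brainfuck(code: str) -> str:
    """Minimal Brainfuck interpreter (no input support) to check the result."""
    # Pre-match brackets so loops can jump in O(1).
    jumps, stack = {}, []
    for i, c in enumerate(code):
        if c == "[":
            stack.append(i)
        elif c == "]":
            j = stack.pop()
            jumps[i], jumps[j] = j, i
    tape, ptr, pc, out = [0] * 30000, 0, 0, []
    while pc < len(code):
        c = code[pc]
        if c == ">":
            ptr += 1
        elif c == "<":
            ptr -= 1
        elif c == "+":
            tape[ptr] = (tape[ptr] + 1) % 256
        elif c == "-":
            tape[ptr] = (tape[ptr] - 1) % 256
        elif c == ".":
            out.append(chr(tape[ptr]))
        elif c == "[" and tape[ptr] == 0:
            pc = jumps[pc]
        elif c == "]" and tape[ptr] != 0:
            pc = jumps[pc]
        pc += 1
    return "".join(out)
```

Real benchmark tasks need control flow, not just output, so a model would have to emit loops and cell arithmetic rather than this unary encoding; the sketch only illustrates the familiar-language-to-esolang pipeline.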
Lossfunk
Lossfunk@lossfunk·
🚨 Shocking: Frontier LLMs score 85-95% on standard coding benchmarks. We gave them equivalent problems in languages they couldn't have memorized. They collapsed to 0-11%. Presenting EsoLang-Bench. Accepted to the Logical Reasoning and ICBINB workshops at ICLR 2026 🧵
Lossfunk
Lossfunk@lossfunk·
4/ @manojrajarao talks about using LLMs to automate parts of the research workflow, from reading papers to optimizing GPU kernels. Walks through a system using evolutionary algorithms and LLM-guided mutations to find faster attention kernel implementations. youtu.be/hBiNWfcayeo
Lossfunk
Lossfunk@lossfunk·
We published a bunch of interesting talks on our channel. Here's a quick rundown of each one 👇
Lossfunk tweet media