Lossfunk

548 posts

@lossfunk

We ask foundational questions to explore what's next in AI

🇮🇳 Joined January 2025
1 Following · 14.7K Followers
Pinned Tweet
Lossfunk
Lossfunk@lossfunk·
Here's what we're interested in as a lab.
[attached image]
Lossfunk
Lossfunk@lossfunk·
@Michael_Druggan @paraschopra Read the paper and the code, Michael. Nuance matters, and the paper says exactly what we did. Perhaps read the original thread too: x.com/lossfunk/statu… What exactly pissed you off? Do you have such strongly held beliefs about LLMs/AI that you're okay being a jerk online?
Lossfunk@lossfunk

7/ After the paper was finalized, we ran agentic systems that mimic how humans would learn to solve problems in esoteric languages. We supplied our agents with a custom harness + tools on the same benchmark. They absolutely crushed the benchmark. Stay tuned 👀

Paras Chopra
Paras Chopra@paraschopra·
So Esolang-Bench went viral overnight! A lot of discussion ensued; addressing some of the common points that came up. a) Why do it? Does it measure anything useful? b) But humans can't also write esoteric languages well. It's an unfair comparison. c) But Claude Code crushes it. You limited models artificially. d) So, are LLMs hyped? Or is our study clickbait?
Lossfunk@lossfunk

Addressing a few questions about our Esolang-Bench.

a) Why do it? Does it measure anything useful? It was a curiosity-driven project. We're interested in how humans exhibit sample efficiency in learning and OOD generalization. So we simply asked: if models can zero/few-shot correct answers for simple programming problems in Python, can they do the same in esoteric languages? The benchmark is what it is. Different people can interpret its usefulness differently, and we encourage that.

b) But humans can't write esoteric languages well either. It's an unfair comparison. Primarily, we're interested in measuring LLM capabilities. With all the talk of ASI, their capabilities are supposed to soon be superhuman. So our primary motivation wasn't to compare to humans but to check what models can do on this by-construction difficult benchmark. However, we do believe that humans are able to teach themselves a new domain by transferring their old skills. This benchmark sets a starting point for exploring how AI systems can do the same (which is what we're exploring now).

c) But Claude Code crushes it. You limited models artificially. Yes, we tested models in zero- and few-shot settings, and in the agentic loop we describe in the paper we limit the number of iterations. As we wrote above, we wanted to understand their performance from a comparative point of view (say, against highly represented languages like Python), and that's why the benchmark is designed this way. After the paper was finalized, we experimented with agentic systems where we gave models tools like bash and allowed unlimited iterations (but limited submission attempts). They indeed perform much better. The relevant question is what makes these models perform so well when you give them tools and iterations vs. when you don't. Are they reasoning/learning like humans, or is it something else?

d) So, are LLMs hyped? Or is our study clickbait? The paper, code and benchmark are all open source 👇 We encourage whoever is interested to read it and make up their own minds. (We couldn't help noticing that the *same* set of results was interpreted wildly differently within the community. A debate between opposing camps ensued. Perhaps that's a good thing?)
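The two regimes described in (c) can be sketched as a toy evaluation loop. This is an illustrative sketch, not the paper's actual harness; `propose`, `run`, and `tests` are hypothetical stand-ins:

```python
# Toy sketch of a capped agentic evaluation loop (hypothetical, not the
# paper's harness). `propose(feedback)` stands in for the model writing a
# candidate program; `run(prog, inp)` executes it on one input.

def agentic_eval(propose, run, tests, max_iters=10):
    """Return (solved, attempts). max_iters=1 approximates the zero/few-shot setting."""
    feedback = None
    for attempt in range(1, max_iters + 1):
        prog = propose(feedback)
        results = [(inp, run(prog, inp), want) for inp, want in tests]
        if all(got == want for _, got, want in results):
            return True, attempt              # solved within the iteration cap
        # Failing cases become feedback for the next iteration.
        feedback = [(inp, got, want) for inp, got, want in results if got != want]
    return False, max_iters                   # cap exhausted
```

With a large `max_iters` this loop corresponds to the post-paper tool-use experiments; with `max_iters=1` it degenerates to the zero-shot setting reported in the paper.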

Dan McAteer
Dan McAteer@daniel_mac8·
This is absolutely retarded. You give the smartest human in the world a language test in a language they never learned and they’ll score 0% too. At least the LLMs can make a reasonable guess. The f*ck outta here with this.
Lossfunk@lossfunk

🚨 Shocking: Frontier LLMs score 85-95% on standard coding benchmarks. We gave them equivalent problems in languages they couldn't have memorized. They collapsed to 0-11%. Presenting EsoLang-Bench. Accepted to the Logical Reasoning and ICBINB workshops at ICLR 2026 🧵

kalomaze
kalomaze@kalomaze·
the kind of person who asks "but does this transfer generalize to Brainfuck?" is simply not being a serious person tbqh
Chase Brower
Chase Brower@ChaseBrowe32432·
Opus 4.6 in webui can solve even the "extremely hard" problems btw, not sure what their precise methodology was but they must have heavily hamstrung the models.
Lossfunk@lossfunk

🚨 Shocking: Frontier LLMs score 85-95% on standard coding benchmarks… [announcement quoted in full above]

Ankit Jxa
Ankit Jxa@kingofknowwhere·
This research should have been titled "LLMs can't reason in Brainf#ck", not "LLMs know Python because they potentially memorized it". To truly evaluate this, you should perhaps design your own LISP with Python-like syntax and add the syntax guide in the problem context. The current approach seems futile, in my humble opinion.
Lossfunk@lossfunk

🚨 Shocking: Frontier LLMs score 85-95% on standard coding benchmarks… [announcement quoted in full above]

Lossfunk
Lossfunk@lossfunk·
@jeremyphoward @andrey_kurenkov Jeremy, hope this follow-up helps x.com/lossfunk/statu…
Lossfunk@lossfunk

Addressing a few questions about our Esolang-Bench… [reply quoted in full above]

Jeremy Howard
Jeremy Howard@jeremyphoward·
@andrey_kurenkov I think that's a fair reaction. OTOH, when using LLMs with APL, which is an extremely efficient and well-designed language, AI hasn't been able to create any useful code at all for me so far. So their conclusions may be correct anyway...
Andrey Kurenkov
Andrey Kurenkov@andrey_kurenkov·
This research is basically clickbait... These 'esoteric' languages (Brainfuck, Befunge-98, Whitespace, Unlambda, and Shakespeare) in the benchmark are not just ones with less training data online, they are also just **much harder** and **less efficient** to do anything productive with, and failing to even discuss this is crazy. Saying that if you can solve something in Python you should be able to generalize to these languages is akin to saying that you should be able to generalize from tasks in Python to assembly. It's obviously not the same difficulty level to do tasks in Python vs assembly. So are the low scores on the benchmark due to a lacking "ability to generalize computational reasoning to novel domains", or due to the increased difficulty of the task caused by the language of choice? Somehow this question is not addressed in the paper nor noted in the limitations, as far as I could find. For reference, here are the languages (info from Wikipedia):

* Brainfuck: the language consists of only 8 operators: > < + - . , [ ]. Here's 'hello world': >++++++++[<+++++++++>-]<.>++++[<+++++++>-]<+.+++++++..+++.>>++++++[<+++++++>-]<++.------------.>++++++[<+++++++++>-]<+.<.+++.------.--------.>>>++++[<++++++++>-]<+.

* Whitespace: 'only whitespace characters (space, tab and newline) have meaning – contrasting typical languages that largely ignore whitespace characters.' See first attached image for the 'hello world' code.

* Befunge-98: a stack-based, reflective language in which programs are arranged on a two-dimensional grid. "Arrow" instructions direct the control flow to the left, right, up or down, and loops are constructed by sending the control flow in a cycle. Hello world: >25*"!dlroW olleH":v v:,_@ > ^

* Unlambda: 'a minimal functional programming language based on combinatory logic, an expression system without the lambda operator or free variables. It relies mainly on two built-in functions (s and k) and an apply operator (written `, the backquote character).' `r```````````.H.e.l.l.o. .w.o.r.l.di

* Shakespeare: 'A character list in the beginning of the program declares a number of stacks, naturally with names like "Romeo" and "Juliet". These characters enter into dialogue with each other in which they manipulate each other's topmost values, push and pop each other, and do I/O. The characters can also ask each other questions which behave as conditional statements. On the whole, the programming model is very similar to assembly language but much more verbose.' See second image for just part of the hello world.

I don't want to be mean to the researchers, I do like the idea behind the research, but the way it's presented feels so misleading to me that I can't help but feel the entire effort is either in bad faith or very poorly thought out.
[two attached images: Whitespace and Shakespeare hello-world excerpts]
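For readers unfamiliar with the semantics described above, a minimal Brainfuck interpreter fits in a few lines. This is a from-scratch sketch for illustration, not benchmark code; the input command `,` is omitted for brevity:

```python
# Minimal Brainfuck interpreter: 8 single-character commands operating on a
# zero-initialized byte tape. Illustrative sketch only; the input command
# ',' is not implemented.

def run_bf(code, tape_len=30000):
    tape, out = [0] * tape_len, []
    jump, stack = {}, []                 # pre-match [ ] pairs for O(1) jumps
    for i, c in enumerate(code):
        if c == '[':
            stack.append(i)
        elif c == ']':
            j = stack.pop()
            jump[i], jump[j] = j, i
    ptr = pc = 0
    while pc < len(code):
        c = code[pc]
        if c == '>':
            ptr += 1
        elif c == '<':
            ptr -= 1
        elif c == '+':
            tape[ptr] = (tape[ptr] + 1) % 256
        elif c == '-':
            tape[ptr] = (tape[ptr] - 1) % 256
        elif c == '.':
            out.append(chr(tape[ptr]))
        elif c == '[' and tape[ptr] == 0:
            pc = jump[pc]                # skip loop body when cell is zero
        elif c == ']' and tape[ptr] != 0:
            pc = jump[pc]                # repeat loop while cell is nonzero
        pc += 1
    return ''.join(out)

print(run_bf('>++++++++[<+++++++++>-]<.'))  # 8 * 9 = 72 -> prints 'H'
```

The loop `[<+++++++++>-]` is Brainfuck's idiom for multiplication by repeated addition, which is what makes even trivial programs so verbose.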
Lossfunk@lossfunk

🚨 Shocking: Frontier LLMs score 85-95% on standard coding benchmarks… [announcement quoted in full above]

Lossfunk
Lossfunk@lossfunk·
@pmddomingos The interpretation of our results is a bit more nuanced than that. Hope this clarification helps x.com/lossfunk/statu…
Lossfunk@lossfunk

Addressing a few questions about our Esolang-Bench… [reply quoted in full above]

Lossfunk
Lossfunk@lossfunk·
@fchollet We agree with you in spirit that if we're approaching ASI with LLMs, models should be able to show superhuman capabilities. But we believe the story is nuanced and hope this follow-up clarification helps: x.com/lossfunk/statu…
Lossfunk@lossfunk

Addressing a few questions about our Esolang-Bench… [reply quoted in full above]

François Chollet
François Chollet@fchollet·
This is more evidence that current frontier models remain completely reliant on content-level memorization, as opposed to higher-level generalizable knowledge (such as metalearning knowledge, problem-solving strategies...)
Lossfunk@lossfunk

🚨 Shocking: Frontier LLMs score 85-95% on standard coding benchmarks… [announcement quoted in full above]

Lossfunk
Lossfunk@lossfunk·
Addressing a few questions about our Esolang-Bench… [original of the reply quoted in full above]
Lossfunk
Lossfunk@lossfunk·
@deliprao This was driven by curiosity, mostly. Your suggestion of working with unfamiliar codebases is exactly the next research problem we're working on (inspired by these results).
Delip Rao e/σ
Delip Rao e/σ@deliprao·
I see the utility of such benchmarks in auditing whether LLMs are genuinely reasoning vs pattern matching. That's a real question worth answering. Beyond that, I am not sure why caring about esoteric PL performance matters. I would rather have a codegen model that scores 90% on Python and can actually reason when it hits an unfamiliar API or codebase. That's a knowledge problem. The PL diversity framing is also a red herring IMO. There are compounding returns in PL consolidation: better libraries, a better dev ecosystem, lower human risk. I'd still rather have a codegen model overfitted to Python than one that generalizes to Brainfuck (no free lunch etc).
Lossfunk@lossfunk

🚨 Shocking: Frontier LLMs score 85-95% on standard coding benchmarks… [announcement quoted in full above]

Lossfunk
Lossfunk@lossfunk·
In our paper we were testing models' capacity to solve these problems zero-shot or few-shot. In our agentic setups, we limited the number of iterations. The motivation was to see their capability vs. what they show on Python, etc. (where few-shot, limited iterations work great). After the paper, we're now experimenting with long-running sessions / multiple attempts by Opus and other frontier models. And yeah, the results are what this shows: if you give models a lot of bandwidth to iterate, they're able to solve the easy part of our benchmark (haven't tested on medium and hard).
[attached image]
Lossfunk
Lossfunk@lossfunk·
@ShriKaranHanda After the paper was done, we tried modern agentic tools like Claude Code, gave them tools, and instructed them to explore/learn. We found one actually wrote something like this by itself (without being instructed). Stay tuned for this update.
Karan Handa
Karan Handa@ShriKaranHanda·
@lossfunk Curious how it would perform if instructed to write a transpiler to these esoteric languages in a language that it's familiar with
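The transpiler idea can be illustrated with a toy sketch (hypothetical, nothing from the paper or the thread): compile a "print this string" task from a familiar host language into naive Brainfuck, so the hard reasoning happens in the familiar language rather than in the esolang:

```python
# Toy illustration of the transpiler idea (hypothetical, not from the
# paper): emit naive Brainfuck for a "print this string" task, using one
# fresh cell per character, so no reasoning in Brainfuck itself is needed.

def compile_print(text):
    # '+' * ord(ch) sets the current cell, '.' prints it, '>' moves on.
    return ''.join('+' * ord(ch) + '.>' for ch in text)

def run_bf_subset(code, tape_len=100):
    # Just enough of Brainfuck ('+', '>', '.') to check the generated code.
    tape, ptr, out = [0] * tape_len, 0, []
    for c in code:
        if c == '+':
            tape[ptr] += 1
        elif c == '>':
            ptr += 1
        elif c == '.':
            out.append(chr(tape[ptr]))
    return ''.join(out)

print(run_bf_subset(compile_print('Hi')))  # prints 'Hi'
```

The generated code is grossly inefficient (72 '+' signs just for 'H'), but correctness transfers for free from the host language, which is exactly the shortcut Karan is asking about.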
Lossfunk
Lossfunk@lossfunk·
🚨 Shocking: Frontier LLMs score 85-95% on standard coding benchmarks… [announcement quoted in full above]