Lossfunk

548 posts

@lossfunk

We ask foundational questions to explore what's next in AI

🇮🇳 Joined January 2025
1 Following · 14.7K Followers
Pinned Tweet
Lossfunk
Lossfunk@lossfunk·
Here's what we're interested in as a lab.
[attached image]
Lossfunk
Lossfunk@lossfunk·
@Michael_Druggan @paraschopra Read the paper and the code, Michael. Nuance matters, and the paper says exactly what we did. Perhaps read the original thread too: x.com/lossfunk/statu… What exactly pissed you off? Do you have such strongly held beliefs about LLMs/AI that you're okay being a jerk online?
Lossfunk@lossfunk

7/ After the paper was finalized, we ran agentic systems that mimic how humans would learn to solve problems in esoteric languages. We supplied our agents with a custom harness + tools on the same benchmark. They absolutely crushed the benchmark. Stay tuned 👀

Paras Chopra
Paras Chopra@paraschopra·
So Esolang-Bench went viral overnight! A lot of discussion ensued; addressing some of the common points that came up. a) Why do it? Does it measure anything useful? b) But humans can't also write esoteric languages well. It's an unfair comparison. c) But Claude Code crushes it. You limited models artificially. d) So, are LLMs hyped? Or is our study clickbait?
Lossfunk@lossfunk

Addressing a few questions about our Esolang-Bench.

a) Why do it? Does it measure anything useful? It was a curiosity-driven project. We're interested in how humans exhibit sample efficiency in learning and OOD generalization. So we simply asked: if models can zero/few-shot correct answers for simple programming problems in Python, can they do the same in esoteric languages? The benchmark is what it is. Different people can interpret its usefulness differently, and we encourage that.

b) But humans can't write esoteric languages well either. It's an unfair comparison. Primarily, we're interested in measuring LLM capabilities. With all the talk of ASI, their capabilities are supposed to soon be superhuman. So our primary motivation wasn't to compare to humans but to check what models can do on this by-construction difficult benchmark. However, we do believe that humans are able to teach themselves a new domain by transferring their old skills. This benchmark sets a starting point for exploring how AI systems can do the same (which is what we're exploring now).

c) But Claude Code crushes it. You limited models artificially. Yes, we tested models in zero- and few-shot settings, and in the agentic loop we describe in the paper we limit the number of iterations. As we wrote above, we wanted to understand their performance from a comparative point of view (say, against highly represented languages like Python), and that's why the benchmark is designed this way. After the paper was finalized, we experimented with agentic systems where we gave models tools like bash and allowed unlimited iterations (but limited submission attempts). They indeed perform much better. The relevant question is what makes these models perform so well when you give them tools and iterations vs. when you don't. Are they reasoning/learning like humans, or is it something else?

d) So, are LLMs hyped? Or is our study clickbait? The paper, code and benchmark are all open source 👇 We encourage whoever is interested to read it and make up their own minds. (We couldn't help noticing that the *same* set of results was interpreted wildly differently within the community. A debate between opposing camps ensued. Perhaps that's a good thing?)
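The two regimes described in (c) can be sketched as a toy evaluation loop. This is an illustrative sketch, not the paper's actual harness; `propose`, `run`, and `tests` are hypothetical stand-ins:

```python
# Toy sketch of a capped agentic evaluation loop (hypothetical, not the
# paper's harness). `propose(feedback)` stands in for the model writing a
# candidate program; `run(prog, inp)` executes it on one input.

def agentic_eval(propose, run, tests, max_iters=10):
    """Return (solved, attempts). max_iters=1 approximates the zero/few-shot setting."""
    feedback = None
    for attempt in range(1, max_iters + 1):
        prog = propose(feedback)
        results = [(inp, run(prog, inp), want) for inp, want in tests]
        if all(got == want for _, got, want in results):
            return True, attempt              # solved within the iteration cap
        # Failing cases become feedback for the next iteration.
        feedback = [(inp, got, want) for inp, got, want in results if got != want]
    return False, max_iters                   # cap exhausted
```

With a large `max_iters` this loop corresponds to the post-paper tool-use experiments; with `max_iters=1` it degenerates to the zero-shot setting reported in the paper.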

Dan McAteer
Dan McAteer@daniel_mac8·
This is absolutely retarded. You give the smartest human in the world a language test in a language they never learned and they’ll score 0% too. At least the LLMs can make a reasonable guess. The f*ck outta here with this.
Lossfunk@lossfunk

🚨 Shocking: Frontier LLMs score 85-95% on standard coding benchmarks. We gave them equivalent problems in languages they couldn't have memorized. They collapsed to 0-11%. Presenting EsoLang-Bench. Accepted to the Logical Reasoning and ICBINB workshops at ICLR 2026 🧵

kalomaze
kalomaze@kalomaze·
the kind of person who asks "but does this transfer generalize to Brainfuck?" is simply not being a serious person tbqh
Chase Brower
Chase Brower@ChaseBrowe32432·
Opus 4.6 in webui can solve even the "extremely hard" problems btw, not sure what their precise methodology was but they must have heavily hamstrung the models.
Lossfunk@lossfunk

🚨 Shocking: Frontier LLMs score 85-95% on standard coding benchmarks… [announcement quoted in full above]

Ankit Jxa
Ankit Jxa@kingofknowwhere·
This research should have been titled "LLMs can't reason in Brainf#ck", not "LLMs know Python because they potentially memorized it". To truly evaluate this, you should perhaps design your own LISP with Python-like syntax and add the syntax guide in the problem context. The current approach seems futile, in my humble opinion.
Lossfunk@lossfunk

🚨 Shocking: Frontier LLMs score 85-95% on standard coding benchmarks… [announcement quoted in full above]

Lossfunk
Lossfunk@lossfunk·
@jeremyphoward @andrey_kurenkov Jeremy, hope this follow-up helps x.com/lossfunk/statu…
Lossfunk@lossfunk

Addressing a few questions about our Esolang-Bench… [reply quoted in full above]

Jeremy Howard
Jeremy Howard@jeremyphoward·
@andrey_kurenkov I think that's a fair reaction. OTOH, when using LLMs with APL, which is an extremely efficient and well-designed language, AI hasn't been able to create any useful code at all for me so far. So their conclusions may be correct anyway...
Andrey Kurenkov
Andrey Kurenkov@andrey_kurenkov·
This research is basically clickbait... These 'esoteric' languages (Brainfuck, Befunge-98, Whitespace, Unlambda, and Shakespeare) in the benchmark are not just ones with less training data online, they are also just **much harder** and **less efficient** to do anything productive with, and failing to even discuss this is crazy. Saying that if you can solve something in Python you should be able to generalize to these languages is akin to saying that you should be able to generalize from tasks in Python to assembly. It's obviously not the same difficulty level to do tasks in Python vs assembly. So are the low scores on the benchmark due to a lacking "ability to generalize computational reasoning to novel domains", or due to the increased difficulty of the task caused by the language of choice? Somehow this question is not addressed in the paper nor noted in the limitations, as far as I could find. For reference, here are the languages (info from Wikipedia):

* Brainfuck: the language consists of only 8 operators: > < + - . , [ ]. Here's 'hello world': >++++++++[<+++++++++>-]<.>++++[<+++++++>-]<+.+++++++..+++.>>++++++[<+++++++>-]<++.------------.>++++++[<+++++++++>-]<+.<.+++.------.--------.>>>++++[<++++++++>-]<+.

* Whitespace: 'only whitespace characters (space, tab and newline) have meaning – contrasting typical languages that largely ignore whitespace characters.' See first attached image for the 'hello world' code.

* Befunge-98: a stack-based, reflective language in which programs are arranged on a two-dimensional grid. "Arrow" instructions direct the control flow to the left, right, up or down, and loops are constructed by sending the control flow in a cycle. Hello world: >25*"!dlroW olleH":v v:,_@ > ^

* Unlambda: 'a minimal functional programming language based on combinatory logic, an expression system without the lambda operator or free variables. It relies mainly on two built-in functions (s and k) and an apply operator (written `, the backquote character).' `r```````````.H.e.l.l.o. .w.o.r.l.di

* Shakespeare: 'A character list in the beginning of the program declares a number of stacks, naturally with names like "Romeo" and "Juliet". These characters enter into dialogue with each other in which they manipulate each other's topmost values, push and pop each other, and do I/O. The characters can also ask each other questions which behave as conditional statements. On the whole, the programming model is very similar to assembly language but much more verbose.' See second image for just part of the hello world.

I don't want to be mean to the researchers, I do like the idea behind the research, but the way it's presented feels so misleading to me that I can't help but feel the entire effort is either in bad faith or very poorly thought out.
[two attached images: Whitespace and Shakespeare hello-world excerpts]
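For readers unfamiliar with the semantics described above, a minimal Brainfuck interpreter fits in a few lines. This is a from-scratch sketch for illustration, not benchmark code; the input command `,` is omitted for brevity:

```python
# Minimal Brainfuck interpreter: 8 single-character commands operating on a
# zero-initialized byte tape. Illustrative sketch only; the input command
# ',' is not implemented.

def run_bf(code, tape_len=30000):
    tape, out = [0] * tape_len, []
    jump, stack = {}, []                 # pre-match [ ] pairs for O(1) jumps
    for i, c in enumerate(code):
        if c == '[':
            stack.append(i)
        elif c == ']':
            j = stack.pop()
            jump[i], jump[j] = j, i
    ptr = pc = 0
    while pc < len(code):
        c = code[pc]
        if c == '>':
            ptr += 1
        elif c == '<':
            ptr -= 1
        elif c == '+':
            tape[ptr] = (tape[ptr] + 1) % 256
        elif c == '-':
            tape[ptr] = (tape[ptr] - 1) % 256
        elif c == '.':
            out.append(chr(tape[ptr]))
        elif c == '[' and tape[ptr] == 0:
            pc = jump[pc]                # skip loop body when cell is zero
        elif c == ']' and tape[ptr] != 0:
            pc = jump[pc]                # repeat loop while cell is nonzero
        pc += 1
    return ''.join(out)

print(run_bf('>++++++++[<+++++++++>-]<.'))  # 8 * 9 = 72 -> prints 'H'
```

The loop `[<+++++++++>-]` is Brainfuck's idiom for multiplication by repeated addition, which is what makes even trivial programs so verbose.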
Lossfunk@lossfunk

🚨 Shocking: Frontier LLMs score 85-95% on standard coding benchmarks… [announcement quoted in full above]

Lossfunk
Lossfunk@lossfunk·
@pmddomingos The interpretation of our results is a bit more nuanced than that. Hope this clarification helps x.com/lossfunk/statu…
Lossfunk@lossfunk

Addressing a few questions about our Esolang-Bench… [reply quoted in full above]

Lossfunk
Lossfunk@lossfunk·
@fchollet We agree with you in spirit that if we're approaching ASI with LLMs, models should be able to show superhuman capabilities. But we believe the story is nuanced and hope this follow-up clarification helps: x.com/lossfunk/statu…
Lossfunk@lossfunk

Addressing a few questions about our Esolang-Bench… [reply quoted in full above]

François Chollet
François Chollet@fchollet·
This is more evidence that current frontier models remain completely reliant on content-level memorization, as opposed to higher-level generalizable knowledge (such as metalearning knowledge, problem-solving strategies...)
Lossfunk@lossfunk

🚨 Shocking: Frontier LLMs score 85-95% on standard coding benchmarks… [announcement quoted in full above]

Lossfunk
Lossfunk@lossfunk·
Addressing a few questions about our Esolang-Bench… [original of the reply quoted in full above]
Lossfunk
Lossfunk@lossfunk·
@deliprao This was driven by curiosity, mostly. Your suggestion of working with unfamiliar codebases is exactly the next research problem we're working on (inspired by these results).
Delip Rao e/σ
Delip Rao e/σ@deliprao·
I see the utility of such benchmarks in auditing whether LLMs are genuinely reasoning vs pattern matching. That's a real question worth answering. Beyond that, I am not sure why caring about esoteric PL performance matters. I would rather have a codegen model that scores 90% on Python and can actually reason when it hits an unfamiliar API or codebase. That's a knowledge problem. The PL diversity framing is also a red herring IMO. There are compounding returns in PL consolidation: better libraries, a better dev ecosystem, lower human risk. I'd still rather have a codegen model overfitted to Python than one that generalizes to Brainfuck (no free lunch etc).
Lossfunk@lossfunk

🚨 Shocking: Frontier LLMs score 85-95% on standard coding benchmarks… [announcement quoted in full above]

Lossfunk
Lossfunk@lossfunk·
In our paper we were testing models' capacity to solve these problems zero-shot or few-shot. In our agentic setups, we limited the number of iterations. The motivation was to see their capability vs. what they show on Python, etc. (where few-shot, limited iterations work great). After the paper, we're now experimenting with long-running sessions / multiple attempts by Opus and other frontier models. And yeah, the results are what this shows: if you give models a lot of bandwidth to iterate, they're able to solve the easy part of our benchmark (haven't tested on medium and hard).
[attached image]
Lossfunk
Lossfunk@lossfunk·
@ShriKaranHanda After the paper was done, we tried modern agentic tools like Claude Code, gave them tools, and instructed them to explore/learn. We found one actually wrote something like this by itself (without being instructed). Stay tuned for this update.
Karan Handa
Karan Handa@ShriKaranHanda·
@lossfunk Curious how it would perform if instructed to write a transpiler to these esoteric languages in a language that it's familiar with
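The transpiler idea can be illustrated with a toy sketch (hypothetical, nothing from the paper or the thread): compile a "print this string" task from a familiar host language into naive Brainfuck, so the hard reasoning happens in the familiar language rather than in the esolang:

```python
# Toy illustration of the transpiler idea (hypothetical, not from the
# paper): emit naive Brainfuck for a "print this string" task, using one
# fresh cell per character, so no reasoning in Brainfuck itself is needed.

def compile_print(text):
    # '+' * ord(ch) sets the current cell, '.' prints it, '>' moves on.
    return ''.join('+' * ord(ch) + '.>' for ch in text)

def run_bf_subset(code, tape_len=100):
    # Just enough of Brainfuck ('+', '>', '.') to check the generated code.
    tape, ptr, out = [0] * tape_len, 0, []
    for c in code:
        if c == '+':
            tape[ptr] += 1
        elif c == '>':
            ptr += 1
        elif c == '.':
            out.append(chr(tape[ptr]))
    return ''.join(out)

print(run_bf_subset(compile_print('Hi')))  # prints 'Hi'
```

The generated code is grossly inefficient (72 '+' signs just for 'H'), but correctness transfers for free from the host language, which is exactly the shortcut Karan is asking about.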
Lossfunk
Lossfunk@lossfunk·
🚨 Shocking: Frontier LLMs score 85-95% on standard coding benchmarks… [announcement quoted in full above]