michael
@_michaelginn
1.5K posts

PhD student at @BoulderNLP @lecslab. LLMs for rare languages, automata, synthetic data

Boulder, CO · Joined November 2018
314 Following · 227 Followers
michael@_michaelginn·
@weaponofkill @ChaseBrowe32432 Not just brainfuck specifically. The point is you can’t make a general causal claim (more abstraction causes better LLM performance) unless you, minimally, have some quantitative evidence
Nate@weaponofkill·
@_michaelginn @ChaseBrowe32432
> I would need to see proof that 1) brainfuck has “less abstractions” than other languages
You have to be trolling
Chase Brower@ChaseBrowe32432·
Opus 4.6 in webui can solve even the "extremely hard" problems btw; not sure what their precise methodology was, but they must have heavily hamstrung the models.
Lossfunk@lossfunk

🚨 Shocking: Frontier LLMs score 85-95% on standard coding benchmarks. We gave them equivalent problems in languages they couldn't have memorized. They collapsed to 0-11%. Presenting EsoLang-Bench. Accepted to the Logical Reasoning and ICBINB workshops at ICLR 2026 🧵

JC wolf@JCwolf123321·
@_michaelginn @Michael_Druggan Why are you talking about the smartest? How many tries do you think it would take an average programmer? How long do you think it would take for them to make hello world from scratch in Brainfuck?
Michael Druggan@Michael_Druggan·
I never respected this guy but now I respect him even less. This is an absolutely braindead take when you look at the details. Not only is the task something almost no human programmer can do either (one-shotting programs in ridiculous languages like brainfuck), but the models can solve it just fine when allowed to use their full capabilities in an agentic harness.
ib@Indian_Bronson

More proof LLMs aren't conscious and aren't generalizing any information, and therefore aren't going to become generally intelligent, but are in fact (still extremely useful) trained statistical responders.

michael@_michaelginn·
@MancerAI_ @Michael_Druggan Because I think it’s useful to understand his framing of it. And if his definition includes this task, then the paper is obviously a useful benchmark.
MancerAI@MancerAI_·
@_michaelginn @Michael_Druggan "I don’t have one because I don’t think it’s a useful concept" - so why do you introduce a concept to the discussion if you don't have a definition nor find it useful? Seems counterproductive both to the discussion and at large if you don't like the concept
Eugene@HugeLeters·
@_michaelginn @Michael_Druggan this is just a trivially obvious fact to me which does not speak to anyone's intelligence at a reasonable level - it's a language with a deliberately obscure syntax, made to be incomprehensible to people; there's nothing surprising, even its creator wouldn't write it well
michael@_michaelginn·
@JCwolf123321 @Michael_Druggan I bet they could do it in five tries or less. I happen to think there’s plenty of very smart people out there.
JC wolf@JCwolf123321·
@_michaelginn @Michael_Druggan I mean this with full sincerity: Do you know what the language brainfuck is? It might be the *most* unintuitive programming language. I'm not saying the creator of Brainfuck probably couldn't do the tasks, I *am* saying they probably don't complete them first try.
michael@_michaelginn·
@Michael_Druggan you don’t think there’s *any* human who could do the task? Even like, the creator of the language?
Michael Druggan@Michael_Druggan·
@_michaelginn ASI should be able to. AGI is only supposed to be human level and humans struggle with writing brainfuck programs a lot. Like seriously ask the best programmer you know if they can write a nontrivial program in brainfuck without any scratch work or testing.
michael@_michaelginn·
@Michael_Druggan @fchollet That would be the case if these people actually had hard evidence for specific claims of human failures, but it’s always just vibes-based
Michael Druggan@Michael_Druggan·
@fchollet It is genuinely true that many of the failure modes observed in LLMs are also observed in humans and I think pointing this out when it comes up is important.
François Chollet@fchollet·
When the latest AI systems can't do something, there's a category of people who will immediately say, "well humans can't do it either!" - Then they stop saying it when AI improves a bit. Been hearing it for 4+ years, "humans can't reason either", "humans can't adapt to a task they haven't been prepared for", "humans can't follow instructions", "humans also suffer from hallucinations", etc. Until 2025 I was frequently told "humans can't do ARC 1 tasks either" (in reality any normally smart human would do >95% on ARC 1 if properly incentivized). Now that AI saturates ARC 1 they've completely stopped saying this.
François Chollet@fchollet

In general I've been sensing a new current among deep learning maximalists recently, going from "our models can definitely reason" to "well our models can't reason, but neither can humans!"

michael@_michaelginn·
@a1exwd @pmddomingos LLMs are certainly trained on the languages in the test, since there is code for them online.
AlexWD@a1exwd·
If I gave you a test in a language you've never been introduced to how well would you perform? Probably not well. Would that performance indicate that you lack general reasoning capabilities? Now that AGI is here it seems that we're resorting to giving AIs tests far beyond what a human would be capable of and then calling them not AGI once they "fail". Hilarious.
michael@_michaelginn·
@ChaseBrowe32432 I get your intuition just fine on a specific example. I would definitely need actual evidence, not just your feelings, to believe the general claim.
Chase Brower@ChaseBrowe32432·
this does not require empirical measurement; this should be extremely simple for you to identify a priori. think for like 2 seconds: how many logical steps are required to read in a dynamic-length int array in C? now how many logical steps are required to read in a dynamic-length int array in brainfuck? the code objectively requires greatly more serial depth. if you can't understand this you just have a skill issue, i don't know what to tell you. maybe you'd understand if you actually attempt to implement these in C and brainfuck respectively.
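For concreteness, the C side of Chase's comparison might look like the sketch below. This is a hypothetical illustration, not code from the thread; it shows how much of the "read a dynamic-length int array" task is absorbed by abstractions that C and libc supply (typed integers, scanf parsing, a growable heap buffer via realloc).

/* Hypothetical sketch: read a dynamic-length int array in C.
 * Every abstraction used here (typed ints, scanf parsing, realloc)
 * is something Brainfuck would have to rebuild by hand. */
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    int *arr = NULL;
    size_t len = 0, cap = 0;
    int x;
    while (scanf("%d", &x) == 1) {      /* libc does the digit parsing */
        if (len == cap) {               /* grow the buffer on demand */
            cap = cap ? cap * 2 : 8;
            int *tmp = realloc(arr, cap * sizeof *arr);
            if (!tmp) { free(arr); return 1; }
            arr = tmp;
        }
        arr[len++] = x;                 /* O(1) indexed store */
    }
    printf("read %zu ints\n", len);
    free(arr);
    return 0;
}

Brainfuck offers none of these primitives: only byte cells, a movable pointer, and single-character I/O, so digit parsing, multi-byte integers, and indexed storage all have to be built from scratch, which is the "serial depth" Chase is pointing at.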
michael@_michaelginn·
@ChaseBrowe32432 Again, this is a claim that you’re making with no evidence. I would need to see proof that 1) brainfuck has “less abstractions” than other languages, according to some quantitative metric, and 2) that metric correlates with agent performance.
Chase Brower@ChaseBrowe32432·
@_michaelginn you don't understand the point here; this is true for any agent that can exist. the abstractions solve a lot of problems. not having the abstractions introduces many more problems. this is an objective constraint, and has nothing to do with the proclivities of the agent.
michael@_michaelginn·
@ChaseBrowe32432 This sounds like a great hypothesis (languages with more abstraction are easier for LLMs) that I would love to see empirical validation for!
Chase Brower@ChaseBrowe32432·
@_michaelginn C provides you all of these abstractions, too. the thing the models struggled on most was reading the dynamic length input into an "array". this is like 2 lines of code in C, but exceptionally difficult in brainfuck. try it yourself (with no reference) and see how long it takes
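Chase's "like 2 lines of code in C" presumably refers to an idiom such as the one sketched below, a hedged guess at his meaning that assumes a fixed capacity bound rather than a heap-grown buffer:

#include <stdio.h>

int main(void) {
    int arr[1000], n = 0;   /* assumed capacity bound, for illustration only */
    /* the entire "read a dynamic-length input into an array" step: */
    while (n < 1000 && scanf("%d", &arr[n]) == 1) n++;
    printf("read %d ints\n", n);
    return 0;
}

Whether one counts the declaration or just the loop, the C version collapses parsing and storage into a couple of statements; the Brainfuck equivalent has no scanf, no int type, and no indexing, so none of that collapse is available.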