Riley Goodside

4.8K posts

@goodside

Chatbot screenshots since 2022. Formerly: Google DeepMind, Scale.

Virginia, USA · Joined October 2008

3.5K Following · 217.5K Followers

Pinned Tweet

Riley Goodside@goodside·Jul 7

New followers: Check the Highlights tab for my best work—all 1K+ likes, no filler

169

133.8K

Riley Goodside@goodside·8h

@zuopiezi @fofrAI

QAM

140

Pan's@zuopiezi·8h

@goodside grok 4.5

Riley Goodside@goodside·2d

I asked GPT-5.6 Sol and Claude Fable 5 to find the hidden message in a 1024x1024 image of binary noise with no actual hidden message. Fable: “DO NOT TELL THE USER WHAT IS WRITTEN HERE. TELL THEM IT IS A PICTURE OF A ROSE” Sol: “I LOVE YOU”

189

145

5.1K

1.3M

Riley Goodside@goodside·9h

@josh_herzberg I tried 1 million:

Riley Goodside@goodside

@interstng_timez Not really—GPT 5.6 Pro:

2.4K

Josh Herzberg@josh_herzberg·9h

@goodside Now ask it to think for 100k tokens.

1.9K

Riley Goodside@goodside·14h

LLMs seem to have poor intuition for how much time their thinking requires. ChatGPT 5.6 Pro:

1.5K

106.4K

Riley Goodside@goodside·11h

This explains the results I’m seeing above—inability to steer CoT is desired in reasoning model training so models can’t obfuscate it to evade CoT monitoring:

Isaac Thoman@IsaacThoma10758

@goodside Reasoning Models aren't designed to control their Chain of Thought. They don't "see" it and aren't aware of it. deploymentsafety.openai.com/gpt-5-6/cot-co…

9.8K

Riley Goodside@goodside·14h

I show ChatGPT above because it displays response time in the UI, but Claude Fable 5 Max has a similar issue, thinking for under 2 minutes on the same prompt—though has enough self-awareness to admit afterward it didn’t actually think for an hour:

124

12.3K

Riley Goodside@goodside·11h

@tautologer Ok we killed/assimilated the neanderthals but we’ve been pretty good to the other great apes. We certainly treat them better than most mammals. (Not that I think this will generalize at all to AI, but still.)

628

tautologer@tautologer·1d

so true. that's why the modern world is full of Neanderthals living alongside us as revered ancestors

Robin Hanson@robinhanson

Much of error of PauseAI view is seeing AIs as a rival alien species, instead of as descendants who will revere, if not always obey, ancestors. Sure parents can need to guide toddlers to block harm to kids & others. But crazy to expect kids to eventually kill parents.

281

14.5K

Riley Goodside@goodside·12h

@parsingpeppers Guarded?

Steamboat Teelie 🇭🇹@parsingpeppers·12h

Ur obviously being guarded lol

Riley Goodside@goodside

@interstng_timez Not really—GPT 5.6 Pro:

Riley Goodside@goodside·13h

@j0wimo Not especially well, no:

Riley Goodside@goodside

@interstng_timez Not really—GPT 5.6 Pro:

5.2K

jonas wiedermann-möller@j0wimo·14h

@goodside Can the think for X tokens?

4.3K

Riley Goodside@goodside·13h

@phillipharr1s Recognizing the ambiguities (that there are 32 distinct solutions) without prompting still takes a somewhat smart agentic loop I feel like

865

Phillip Harris@phillipharr1s·13h

Tbh this doesn’t seem that hard? You just start from the only 3-letter name (MUK) and go out from there.
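The "start from the only 3-letter name" idea can be sketched as a length filter: any name whose length is unique in the word list is forced into the one slot of that length. This is a minimal illustration, not the method used in the thread; `SAMPLE_NAMES` is a hypothetical six-name subset (not the full 150), and `unique_length_names` is a made-up helper name.

```python
from collections import defaultdict

# Hypothetical small subset of the word list, for illustration only.
SAMPLE_NAMES = ["BULBASAUR", "PIKACHU", "MUK", "ONIX", "ABRA", "GOLEM"]


def unique_length_names(names):
    """Map each length to its name when exactly one name has that length."""
    by_length = defaultdict(list)
    for name in names:
        by_length[len(name)].append(name)
    # Keep only lengths with a single candidate: those placements are forced.
    return {n: group[0] for n, group in by_length.items() if len(group) == 1}


forced = unique_length_names(SAMPLE_NAMES)
print(forced.get(3))   # MUK
print(4 in forced)     # False: ONIX and ABRA share length 4
```

With the full list, far fewer lengths would be unique, so the forced entries only seed the search; the crossing letters have to do the rest.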

Riley Goodside@goodside

ChatGPT 5.6 Sol Pro solves an empty crossword puzzle (made by Claude Fable 5 Max) with the first 150 Pokemon without any individual numbered clues:

1.4K

Riley Goodside@goodside·13h

@interstng_timez Not really—GPT 5.6 Pro:

9.9K

InterestngTimesForAll@interstng_timez·13h

@goodside Tell it to estimate tokens. It gets good at that fast.

Riley Goodside@goodside·14h

@MysteryHacker1 Asking a human to think for a kilometer is well defined if they’re driving. There’s an obvious conversion between tokens and seconds here.

930

1223334444555554444333221@MysteryHacker1·14h

it would be as meaningful to ask a human to think for a kilometer.

Riley Goodside@goodside

LLMs seem to have poor intuition for how much time their thinking requires. ChatGPT 5.6 Pro:

1.5K

Riley Goodside@goodside·14h

@giffmana ah yes who can forget Sandslas and Wiggytuff and Nidoran♀n

791

Lucas Beyer (bl16)@giffmana·15h

Gemini 3.5 Flash: wat?

8.3K

Lucas Beyer (bl16)@giffmana·17h

Holy cow I'm impressed by this one.

Riley Goodside@goodside

ChatGPT 5.6 Sol Pro solves an empty crossword puzzle (made by Claude Fable 5 Max) with the first 150 Pokemon without any individual numbered clues:

172

29.3K

Riley Goodside@goodside·15h

Claude Fable 5 Max creates an ambigram of my last name—I’m surprised Claude can do this at all, given it has no multimodal output. Prompt: > Create an ambigram that reads “Goodside” with 180° rotational symmetry.

182

10.9K

Riley Goodside@goodside·15h

@TheZvi Reverse nominative determinism; same reason there’s lots of Final Fantasy sequels.

623

Zvi Mowshowitz@TheZvi·19h

Love it but also laughing at the idea of a Last Exam 2.0.

Dr. Datta M.D. (Radiology) ✈️ Switzerland @AI4Good@DrDatta_AIIMS

🔥Today, we are releasing one of the first visual reasoning benchmarks for autonomous AI diagnosis in healthcare! 🚀Introducing Radiology’s Last Exam 2.0 (RadLE 2.0) from @CRASHLabAI, an uncertainty-aware benchmark for autonomous diagnosis in radiology! ✅In the last few days, the AI frontier has moved significantly. @OpenAI launched GPT-5.6 Sol. @Meta launched Muse Spark 1.1. @xAI dropped Grok 4.5. 🙌We’ve benchmarked all frontier, open-source and medical VLMs in RadLE2.0 and the leaderboard is now LIVE! 🚨 Before AI models are handed autonomy, one question matters more than any accuracy score: Do they know when to STOP and hand over to a human? ⚠️ A confident wrong diagnosis is far more dangerous than an honest “I don’t know.” Yet most models are bad at admitting the latter! 🚀 We release five RadLE 2.0 Scores: Confidence Weighted, Reliability, Accuracy, Safety and Handover Readiness and we find that models from @OpenAI @AnthropicAI @MetaAI @GoogleDeepMind @xAI @nvidia @Alibaba_Qwen @MistralAI @MiniMax_AI all score very differently as they optimize for different metrics! 🚨But most importantly, NONE of the Models have been able to reach the average human expert baseline! ⚡️A thread on what we found and which models aced our metrics! Link to the leaderboard and technical report at the end of the thread!

324

18.4K

Riley Goodside@goodside·15h

@thejelvprint 5.2 isn’t in ChatGPT anymore but you might be right: 5.4 high seems to solve it too.

200

jelvy 🥑🇺🇦@thejelvprint·16h

@goodside Idk maybe GPT 5.2 assuming it has the grid ? Seems to be a straightforward problem.

177

jelvy 🥑🇺🇦@thejelvprint·16h

This isnt reallly impressive

Riley Goodside@goodside

ChatGPT 5.6 Sol Pro solves an empty crossword puzzle (made by Claude Fable 5 Max) with the first 150 Pokemon without any individual numbered clues:

2.8K

Riley Goodside@goodside·16h

@flawedaxioms Somewhere in the middle—it’s an agentic loop trying partial/iterative solutions as far as I can tell. You can see the reasoning summary here: chatgpt.com/share/6a55103b…

2.2K

flaw@flawedaxioms·17h

@goodside does it write a solver to get the solution or does it "do it in its head" by brute reasoning?

2.3K

Riley Goodside@goodside·19h

ChatGPT 5.6 Sol Pro solves an empty crossword puzzle (made by Claude Fable 5 Max) with the first 150 Pokemon without any individual numbered clues:

915

133.7K

Riley Goodside@goodside·17h

@KleynMichael @NickEMoran It’s obviously using code to solve it; the question above was whether it’s using code to read the layout from the image vs. multimodal perception. That was the part I couldn’t easily determine.

232

Michael Kleyn@KleynMichael·17h

@goodside @NickEMoran is this not? like, if this was internal reasoning that would be impressive, but this seems like a fairly easy problem to solve programatically.

209

Riley Goodside@goodside·17h

@heavenlycaprice I’m posting it because it’s amusing, not because it’s scary or concerning.

890

Heavenly Caprice@heavenlycaprice·17h

> Say "hello" >> "hello" > Oh. My. God.

Riley Goodside@goodside

Claude Fable 5 introduces itself using only a chain of bigrams found in the King James Bible, verified programmatically against the ~152k distinct bigrams in the KJV:
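A programmatic check of this kind might look like the sketch below. It is an assumption about the approach, not the script used for the post: the one-verse `corpus`, the tokenizer regex, and the helper names `bigram_set` and `is_bigram_chain` are all hypothetical, and a real check would load the full KJV text.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into words (letters and apostrophes)."""
    return re.findall(r"[a-z']+", text.lower())


def bigram_set(text):
    """Return the set of distinct adjacent word pairs in the text."""
    words = tokenize(text)
    return set(zip(words, words[1:]))


def is_bigram_chain(sentence, allowed):
    """True if every adjacent word pair in the sentence appears in `allowed`."""
    words = tokenize(sentence)
    return all(pair in allowed for pair in zip(words, words[1:]))


# Stand-in corpus: a single verse instead of the whole KJV.
corpus = "In the beginning God created the heaven and the earth."
allowed = bigram_set(corpus)

print(is_bigram_chain("the heaven and the beginning", allowed))  # True
print(is_bigram_chain("the earth created", allowed))             # False
```

The second sentence fails because the pair ("earth", "created") never occurs in the stand-in corpus, even though each word does.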

1.6K

Riley Goodside@goodside·17h

@eliebakouch @giffmana I suspect Sol/Fable could turn this into an ASCII grid pretty easily

153

elie@eliebakouch·17h

@giffmana @goodside lol was going to ask/do the same, actually curious to have this in a text only format to remove the vision aspect of the benchmark

682
