Timothy O'Hear
@timohear
2.4K posts

Enthralled by machine learning / artificial intelligence, robot•me CTO, software engineer, Dai the robot co-creator, president of impactIA foundation, Genève

Geneva, Switzerland · Joined June 2009
496 Following · 416 Followers
Timothy O'Hear retweeted
Yannic Kilcher 🇸🇨@ykilcher·
I built a fully automatic mansplainer. I'm sure this will not get me into any trouble at all... Watch here: youtu.be/xHi8PUIVyoo
ARC Prize@arcprize·
A year ago, we verified a preview of an unreleased version of @OpenAI o3 (High) that scored 88% on ARC-AGI-1 at est. $4.5k/task. Today, we've verified a new GPT-5.2 Pro (X-High) SOTA score of 90.5% at $11.64/task. This represents a ~390X efficiency improvement in one year.
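For context on the quoted numbers: the "~390X" figure appears to be the ratio of estimated cost per task at roughly comparable accuracy. A quick back-of-the-envelope check, assuming that simple cost ratio is what is meant by "efficiency improvement":

```python
# Rough sanity check of the "~390X efficiency improvement" figure,
# assuming it means the cost-per-task ratio at similar ARC-AGI-1 scores.
o3_cost_per_task = 4500.00    # est. $/task, o3 (High) preview at 88%
gpt52_cost_per_task = 11.64   # $/task, GPT-5.2 Pro (X-High) at 90.5%

ratio = o3_cost_per_task / gpt52_cost_per_task
print(f"cost ratio: ~{ratio:.0f}x")   # ~387x, i.e. roughly the quoted ~390X
```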
Timothy O'Hear@timohear·
@GregKamradt @guille_bar Isn't there a risk with code execution that Google could capture your task data as the sandbox is running on their infra?
Greg Kamradt@GregKamradt·
@guille_bar In the script we were given to run it, code execution and search were tools available. We removed search as this would expose task data to the internet. So code was used.
Guillermo Barbadillo@guille_bar·
Today I noticed an important detail about the Gemini 3 Deep Think solution for ARC AGI 2 that I missed in the announcement: it is using tools, very likely code execution. The legend of the plot might go to the hall of fame of chart crimes ;)
ARC Prize@arcprize

Gemini 3 models from @Google @GoogleDeepMind have made a significant 2X SOTA jump on ARC-AGI-2 (Semi-Private Eval). Gemini 3 Pro: 31.11%, $0.81/task. Gemini 3 Deep Think (Preview): 45.14%, $77.16/task.

Timothy O'Hear retweeted
Shane Legg@ShaneLegg·
From the makers of the popular AlphaGo documentary, The Thinking Game gives a much broader picture of the story of DeepMind and our mission to build AGI, drawing on interviews with myself and others going back many years. You can now freely watch it here: youtube.com/watch?v=d95J8y…
Sahil Shah@sahilshah91·
I had explicitly turned OFF the "Improve the model for everyone" flag a while back on @ChatGPTapp, and today I chanced upon Data Controls and it looks like it's turned ON again. This is a massive breach of trust. Everyone should check this setting again.
Timothy O'Hear retweeted
Guillermo Barbadillo@guille_bar·
ARC25 is over and despite a lot of work I have been unable to implement my vision successfully. I hope to learn from other teams’ solutions and refine my ideas for ARC26. I am currently 6th on the public test set. Read about my vision and experiments: ironbar.github.io/arc25/05_Solut…
Timothy O'Hear@timohear·
@StphTphsn1 @Dorialexander Yes, very much iid and fairly simple tasks belonging to eg a single 20-person service. But I'm pretty sure they would have failed even a few months ago.
Stéphane Deny@StphTphsn1·
@timohear @Dorialexander Testing set iid? Because that's sometimes a problem: real-world apps are not iid with the training set (e.g. in the medical domain, different hospitals are not iid).
Alexander Doria@Dorialexander·
european tech people now starting to realize it might not be a bubble after all.
Timothy O'Hear@timohear·
@StphTphsn1 @Dorialexander I've seen a significant increase in robustness of data extraction / instruction-following scenarios over the past 12 months, with high-9x% accuracy/F1 now achievable on real-world tasks.
Stéphane Deny@StphTphsn1·
@Dorialexander got it. yeah, i'm still bearish on the robustness that can be expected from deep learning tech
Aldo Podestà@podesta_aldo·
🎙️Great talking to Marcel Salathé on the EPFL AI Center podcast about Giotto.ai. Among other things, we talked about the origins of Giotto, what differentiates us from the other major players, and how R&D investments, even if risky, are absolutely crucial for real progress. 🎧 Listen to the full episode on Apple podcast: podcasts.apple.com/ch/podcast/con… Spotify: open.spotify.com/episode/3n9a9K…
Aran Komatsuzaki@arankomatsuzaki·
@jeremyphoward @teknium I find 4.5 on claude.ai with upgraded file creation and analysis to be by far the best on GDPval. It's magical how it can one-shot a 10-hour industry task. The output format (e.g. spreadsheet) even looks better too.
Teknium (e/λ)@Teknium·
I'm feeling like sonnet 4.5 is bad, it's really really fucking up in ways sonnet 4 and opus 4.1 did not, unfortunately
Timothy O'Hear@timohear·
From github.com/epang080516/ar…:
"the private eval set is only accessible via the no-internet-access Kaggle competition"
"The semi-private eval set was calibrated to have the same difficulty as the public eval set, but researchers need to coordinate with the ARC-Prize team to test their model on it in a Kaggle notebook that runs at most 12 hours."
From the Kaggle page: "This leaderboard is calculated with approximately 50% of the test data. The final results will be based on the other 50%, so the final standings may be different."
So the ARC-AGI-2 scores on both pages are measured in different ways but are somewhat comparable?
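The "approximately 50% of the test data" caveat is one reason the two pages are only loosely comparable: even for a fixed model, a score computed on half the tasks moves around with sampling noise. A toy illustration (the task count and accuracy below are made up, not ARC-AGI-2's actual numbers):

```python
import random

# Toy illustration: a score computed on ~50% of a hidden test set can differ
# from the full-set score purely from sampling noise. Numbers are made up.
random.seed(0)
n_tasks, true_acc = 120, 0.30
outcomes = [random.random() < true_acc for _ in range(n_tasks)]  # per-task pass/fail

public_half = outcomes[: n_tasks // 2]   # the ~50% used for the live leaderboard
print(f"score on 50% split: {sum(public_half) / len(public_half):.1%}")
print(f"score on full set:  {sum(outcomes) / n_tasks:.1%}")
```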
ARC Prize@arcprize·
New ARC Prize 2025 High Score 27.08% by Giotto.ai (@podesta_aldo)
Timothy O'Hear retweeted
anandmaj@Almondgodd·
I spent the past month reimplementing DeepMind's Genie 3 world model from scratch. Ended up making TinyWorlds, a 3M parameter world model capable of generating playable game environments. Demo below + everything I learned in thread (full repo at the end)👇🏼
Timothy O'Hear retweeted
AI Coffee Break with Letitia@AICoffeeBreak·
Ever wondered how Energy-Based Models (EBMs) work and how they differ from normal neural networks? ☕️We go over EBMs and then dive into the Energy-Based Transformers paper to make LLMs that refine guesses, self-verify, and could adapt compute to problem difficulty. (link👇)
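For readers unfamiliar with the idea in the tweet above: an energy-based model scores candidate outputs and refines an initial guess by descending that energy, which is what lets compute scale with problem difficulty. A minimal toy sketch of that refinement loop (the quadratic energy and the whole setup are illustrative assumptions, not the Energy-Based Transformers architecture):

```python
import torch

# Toy energy-based refinement: instead of emitting y in one forward pass,
# score (x, y) with an energy and improve the guess by gradient descent on y.
def energy(x, y):
    # Hypothetical energy: low when y is consistent with x (here, y ≈ 2x).
    return ((y - 2 * x) ** 2).sum()

x = torch.tensor([1.0, 3.0])
y = torch.zeros(2, requires_grad=True)   # initial guess to be refined
opt = torch.optim.SGD([y], lr=0.1)

for _ in range(100):                     # more steps = more "thinking" compute
    opt.zero_grad()
    energy(x, y).backward()
    opt.step()

print(y.detach())                        # converges toward tensor([2., 6.])
```

The self-verification angle comes from the same energy doubling as a scorer: a low final energy suggests the refined answer is consistent with the input.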
Timothy O'Hear retweeted
Eric Pang@_eric_pang_·
Here's how I (almost) got the high scores in ARC-AGI-1 and 2 (the honor goes to @jeremyberman) while keeping the cost low. To put things into perspective: o3-preview scored 75.7% on ARC-AGI-1 last year while spending $200/task on low setting. My approach scores 77.1% while spending $2.56!
ARC Prize@arcprize

New SOTA on ARC-AGI
- V1: 79.6%, $8.42/task
- V2: 29.4%, $30.40/task
Custom submissions by @jeremyberman and @_eric_pang_ are now the best known solutions to ARC-AGI. Both:
* Are open source
* Use Grok 4
* Implement program-synthesis outer loops with test-time adaptation

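For readers curious what "program-synthesis outer loops with test-time adaptation" means in practice, here is a minimal, hypothetical sketch of the general pattern. The candidate generator below stands in for an LLM call, and the whole thing is an illustration, not the actual @jeremyberman or @_eric_pang_ pipelines:

```python
# Hypothetical sketch of a program-synthesis outer loop with test-time adaptation
# for ARC-style tasks. generate_candidate_program() stands in for an LLM call
# that writes a candidate grid transformation; here it just enumerates a toy space.

CANDIDATES = [
    lambda g: g,                          # identity
    lambda g: [row[::-1] for row in g],   # mirror each row (flip horizontally)
    lambda g: g[::-1],                    # reverse rows (flip vertically)
]

def generate_candidate_program(train_pairs, feedback):
    # Placeholder for "ask an LLM to synthesize a program from demos + past failures".
    return CANDIDATES[len(feedback) % len(CANDIDATES)]

def solve_task(train_pairs, test_inputs, budget=10):
    feedback = []
    for _ in range(budget):
        program = generate_candidate_program(train_pairs, feedback)
        # Test-time adaptation: check the candidate against the task's own demo pairs.
        failures = [(x, y) for x, y in train_pairs if program(x) != y]
        if not failures:
            return [program(x) for x in test_inputs]  # accept a program that fits all demos
        feedback.append(failures)                     # failures would steer the next prompt
    return None

demos = [([[1, 2], [3, 4]], [[2, 1], [4, 3]])]        # toy "flip horizontally" task
print(solve_task(demos, [[[5, 6], [7, 8]]]))          # -> [[[6, 5], [8, 7]]]
```

The outer loop spends inference-time compute searching program space and uses the task's own demonstrations as a verifier, which is why cost per task can stay low relative to brute-force reasoning runs.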
Omar Khattab@lateinteraction·
no one is quite sure how to correctly pronounce ColBERT, DSPy, MIPRO, or GEPA. we’ve done it guys 😈
Luca Ambrogioni@LucaAmb·
@fchollet @polynoamial I do not think it is true; most people never bought into that narrative, it was a vocal minority of zealots who pushed it.
François Chollet@fchollet·
LLM adoption among US workers is closing in on 50%. Meanwhile labor productivity growth is lower than in 2020. Many counter-arguments can be made here, e.g. "they don't know yet how to be productive with it, they've only been using it for 1-2 years", "50% is still too low to see impact", "models next year will be unbelievably better", etc. But I think we now have enough evidence to say that the 2023 talking point that "LLMs will make workers 10x more productive" (some folks even quoted 100x) is probably not accurate.
Oyvind Bjerke@BjerkeOy

LLM adoption rose to 45.9% among US workers as of June/July 2025, according to a Stanford/World Bank survey. Inference demand will continue to surge, not just from more users and more usage per user, but as newer, more advanced GenAI models require far more inference compute. Source: The Labor Market Effects of Generative Artificial Intelligence, Stanford University, World Bank
