Timothy O'Hear
@timohear
2.4K posts

Enthralled by machine learning / artificial intelligence, robot•me CTO, software engineer, Dai the robot co-creator, president of impactIA foundation, Genève

Geneva, Switzerland · Joined June 2009
496 Following · 416 Followers
Timothy O'Hear retweeted
Yannic Kilcher 🇸🇨@ykilcher·
I built a fully automatic mansplainer. I'm sure this will not get me into any trouble at all... Watch here: youtu.be/xHi8PUIVyoo
ARC Prize@arcprize·
A year ago, we verified a preview of an unreleased version of @OpenAI o3 (High) that scored 88% on ARC-AGI-1 at est. $4.5k/task. Today, we've verified a new GPT-5.2 Pro (X-High) SOTA score of 90.5% at $11.64/task. This represents a ~390X efficiency improvement in one year.
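For context on the quoted numbers: the "~390X" figure appears to be the ratio of estimated cost per task at roughly comparable accuracy. A quick back-of-the-envelope check, assuming that simple cost ratio is what is meant by "efficiency improvement":

```python
# Rough sanity check of the "~390X efficiency improvement" figure,
# assuming it means the cost-per-task ratio at similar ARC-AGI-1 scores.
o3_cost_per_task = 4500.00    # est. $/task, o3 (High) preview at 88%
gpt52_cost_per_task = 11.64   # $/task, GPT-5.2 Pro (X-High) at 90.5%

ratio = o3_cost_per_task / gpt52_cost_per_task
print(f"cost ratio: ~{ratio:.0f}x")   # ~387x, i.e. roughly the quoted ~390X
```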
Timothy O'Hear@timohear·
@GregKamradt @guille_bar Isn't there a risk with code execution that Google could capture your task data as the sandbox is running on their infra?
Greg Kamradt@GregKamradt·
@guille_bar In the script we were given to run it, code execution and search were tools available. We removed search as this would expose task data to the internet. So code was used.
Guillermo Barbadillo@guille_bar·
Today I noticed an important detail about the Gemini 3 Deep Think solution for ARC AGI 2 that I missed in the announcement: it is using tools, very likely code execution. The legend of the plot might go to the hall of fame of chart crimes ;)
ARC Prize@arcprize

Gemini 3 models from @Google @GoogleDeepMind have made a significant 2X SOTA jump on ARC-AGI-2 (Semi-Private Eval). Gemini 3 Pro: 31.11%, $0.81/task. Gemini 3 Deep Think (Preview): 45.14%, $77.16/task.

Timothy O'Hear retweeted
Shane Legg@ShaneLegg·
From the makers of the popular AlphaGo documentary, The Thinking Game gives a much broader picture of the story of DeepMind and our mission to build AGI, drawing on interviews with myself and others going back many years. You can now freely watch it here: youtube.com/watch?v=d95J8y…
Sahil Shah@sahilshah91·
I had explicitly turned OFF the "Improve the model for everyone" flag a while back on @ChatGPTapp, and today I chanced upon Data Controls and it looks like it's turned ON again. This is a massive breach of trust. Everyone should check this setting again.
Timothy O'Hear retweeted
Guillermo Barbadillo@guille_bar·
ARC25 is over and despite a lot of work I have been unable to implement my vision successfully. I hope to learn from other teams’ solutions and refine my ideas for ARC26. I am currently 6th on the public test set. Read about my vision and experiments: ironbar.github.io/arc25/05_Solut…
Timothy O'Hear@timohear·
@StphTphsn1 @Dorialexander Yes, very much iid and fairly simple tasks belonging to eg a single 20-person service. But I'm pretty sure they would have failed even a few months ago.
Stéphane Deny@StphTphsn1·
@timohear @Dorialexander Testing set iid? Because that's sometimes a problem: real-world apps are not iid with the training set (e.g. in the medical domain, different hospitals are not iid).
Alexander Doria@Dorialexander·
european tech people now starting to realize it might not be a bubble after all.
Timothy O'Hear@timohear·
@StphTphsn1 @Dorialexander I've seen a significant increase in robustness of data extraction / instruction-following scenarios over the past 12 months, with high-9x% accuracy/F1 now achievable on real-world tasks.
Stéphane Deny@StphTphsn1·
@Dorialexander got it. yeah, i'm still bearish on the robustness that can be expected from deep learning tech
Aldo Podestà@podesta_aldo·
🎙️Great talking to Marcel Salathé on the EPFL AI Center podcast about Giotto.ai. Among other things, we talked about the origins of Giotto, what differentiates us from the other major players, and how R&D investments, even if risky, are absolutely crucial for real progress. 🎧 Listen to the full episode on Apple podcast: podcasts.apple.com/ch/podcast/con… Spotify: open.spotify.com/episode/3n9a9K…
Aran Komatsuzaki@arankomatsuzaki·
@jeremyphoward @teknium I find 4.5 on claude.ai with upgraded file creation and analysis to be by far the best on GDPval. It's magical how it can one-shot a 10-hour industry task. The output format (e.g. spreadsheet) even looks better too.
Teknium (e/λ)@Teknium·
I'm feeling like sonnet 4.5 is bad, it's really really fucking up in ways sonnet 4 and opus 4.1 did not, unfortunately
Timothy O'Hear@timohear·
From github.com/epang080516/ar…:
"the private eval set is only accessible via the no-internet-access Kaggle competition"
"The semi-private eval set was calibrated to have the same difficulty as the public eval set, but researchers need to coordinate with the ARC-Prize team to test their model on it in a Kaggle notebook that runs at most 12 hours."
From the Kaggle page: "This leaderboard is calculated with approximately 50% of the test data. The final results will be based on the other 50%, so the final standings may be different."
So the ARC-AGI-2 scores on both pages are measured in different ways but are somewhat comparable?
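The "approximately 50% of the test data" caveat is one reason the two pages are only loosely comparable: even for a fixed model, a score computed on half the tasks moves around with sampling noise. A toy illustration (the task count and accuracy below are made up, not ARC-AGI-2's actual numbers):

```python
import random

# Toy illustration: a score computed on ~50% of a hidden test set can differ
# from the full-set score purely from sampling noise. Numbers are made up.
random.seed(0)
n_tasks, true_acc = 120, 0.30
outcomes = [random.random() < true_acc for _ in range(n_tasks)]  # per-task pass/fail

public_half = outcomes[: n_tasks // 2]   # the ~50% used for the live leaderboard
print(f"score on 50% split: {sum(public_half) / len(public_half):.1%}")
print(f"score on full set:  {sum(outcomes) / n_tasks:.1%}")
```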
ARC Prize@arcprize·
New ARC Prize 2025 High Score 27.08% by Giotto.ai (@podesta_aldo)
Timothy O'Hear retweeted
anandmaj@Almondgodd·
I spent the past month reimplementing DeepMind's Genie 3 world model from scratch. Ended up making TinyWorlds, a 3M parameter world model capable of generating playable game environments. Demo below + everything I learned in thread (full repo at the end)👇🏼
Timothy O'Hear retweeted
AI Coffee Break with Letitia@AICoffeeBreak·
Ever wondered how Energy-Based Models (EBMs) work and how they differ from normal neural networks? ☕️We go over EBMs and then dive into the Energy-Based Transformers paper to make LLMs that refine guesses, self-verify, and could adapt compute to problem difficulty. (link👇)
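For readers unfamiliar with the idea in the tweet above: an energy-based model scores candidate outputs and refines an initial guess by descending that energy, which is what lets compute scale with problem difficulty. A minimal toy sketch of that refinement loop (the quadratic energy and the whole setup are illustrative assumptions, not the Energy-Based Transformers architecture):

```python
import torch

# Toy energy-based refinement: instead of emitting y in one forward pass,
# score (x, y) with an energy and improve the guess by gradient descent on y.
def energy(x, y):
    # Hypothetical energy: low when y is consistent with x (here, y ≈ 2x).
    return ((y - 2 * x) ** 2).sum()

x = torch.tensor([1.0, 3.0])
y = torch.zeros(2, requires_grad=True)   # initial guess to be refined
opt = torch.optim.SGD([y], lr=0.1)

for _ in range(100):                     # more steps = more "thinking" compute
    opt.zero_grad()
    energy(x, y).backward()
    opt.step()

print(y.detach())                        # converges toward tensor([2., 6.])
```

The self-verification angle comes from the same energy doubling as a scorer: a low final energy suggests the refined answer is consistent with the input.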
Timothy O'Hear retweeted
Eric Pang@_eric_pang_·
Here's how I (almost) got the high scores in ARC-AGI-1 and 2 (the honor goes to @jeremyberman) while keeping the cost low. To put things into perspective: o3-preview scored 75.7% on ARC-AGI-1 last year while spending $200/task on low setting. My approach scores 77.1% while spending $2.56!
ARC Prize@arcprize

New SOTA on ARC-AGI
- V1: 79.6%, $8.42/task
- V2: 29.4%, $30.40/task
Custom submissions by @jeremyberman and @_eric_pang_ are now the best known solutions to ARC-AGI. Both:
* Are open source
* Use Grok 4
* Implement program-synthesis outer loops with test-time adaptation

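For readers curious what "program-synthesis outer loops with test-time adaptation" means in practice, here is a minimal, hypothetical sketch of the general pattern. The candidate generator below stands in for an LLM call, and the whole thing is an illustration, not the actual @jeremyberman or @_eric_pang_ pipelines:

```python
# Hypothetical sketch of a program-synthesis outer loop with test-time adaptation
# for ARC-style tasks. generate_candidate_program() stands in for an LLM call
# that writes a candidate grid transformation; here it just enumerates a toy space.

CANDIDATES = [
    lambda g: g,                          # identity
    lambda g: [row[::-1] for row in g],   # mirror each row (flip horizontally)
    lambda g: g[::-1],                    # reverse rows (flip vertically)
]

def generate_candidate_program(train_pairs, feedback):
    # Placeholder for "ask an LLM to synthesize a program from demos + past failures".
    return CANDIDATES[len(feedback) % len(CANDIDATES)]

def solve_task(train_pairs, test_inputs, budget=10):
    feedback = []
    for _ in range(budget):
        program = generate_candidate_program(train_pairs, feedback)
        # Test-time adaptation: check the candidate against the task's own demo pairs.
        failures = [(x, y) for x, y in train_pairs if program(x) != y]
        if not failures:
            return [program(x) for x in test_inputs]  # accept a program that fits all demos
        feedback.append(failures)                     # failures would steer the next prompt
    return None

demos = [([[1, 2], [3, 4]], [[2, 1], [4, 3]])]        # toy "flip horizontally" task
print(solve_task(demos, [[[5, 6], [7, 8]]]))          # -> [[[6, 5], [8, 7]]]
```

The outer loop spends inference-time compute searching program space and uses the task's own demonstrations as a verifier, which is why cost per task can stay low relative to brute-force reasoning runs.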
Omar Khattab@lateinteraction·
no one is quite sure how to correctly pronounce ColBERT, DSPy, MIPRO, or GEPA. we’ve done it guys 😈
Luca Ambrogioni@LucaAmb·
@fchollet @polynoamial I do not think it is true; most people never bought into that narrative, it was a vocal minority of zealots who pushed it.
François Chollet@fchollet·
LLM adoption among US workers is closing in on 50%. Meanwhile labor productivity growth is lower than in 2020. Many counter-arguments can be made here, e.g. "they don't know yet how to be productive with it, they've only been using it for 1-2 years", "50% is still too low to see impact", "models next year will be unbelievably better", etc. But I think we now have enough evidence to say that the 2023 talking point that "LLMs will make workers 10x more productive" (some folks even quoted 100x) is probably not accurate.
Oyvind Bjerke@BjerkeOy

LLM adoption rose to 45.9% among US workers as of June/July 2025, according to a Stanford/World Bank survey. Inference demand will continue to surge, not just from more users and more usage per user, but as newer, more advanced GenAI models require far more inference compute. Source: The Labor Market Effects of Generative Artificial Intelligence, Stanford University, World Bank
