Tim Franzmeyer

63 posts

Tim Franzmeyer
@frtimlive

Research Scientist @googledeepmind Gemini, Reinforcement Learning, Agents

Joined July 2022
436 Following · 1.4K Followers
Tim Franzmeyer@frtimlive·
HALT (“High Accuracy, Less Talk”) accepted to ICLR 2026 🎉 LLMs are trained to always finish answers — even past what they truly know — causing partially wrong outputs. HALT instead finetunes models to stop when confidence drops, trading completeness for reliability 🚧 👇
Tim Franzmeyer@frtimlive

What if LLMs knew when to stop? 🚧 HALT finetuning teaches LLMs to only generate content they’re confident is correct. 🔍 Insight: Post-training must be adjusted to the model’s capabilities. ⚖️ Tunable trade-off: Higher correctness 🔒 vs. More completeness 📝 with @AIatMeta 🧵

Tim Franzmeyer@frtimlive·
On my way to @NeurIPSConf. Looking forward to catching up and to meeting new people!
Tim Franzmeyer retweeted
Google DeepMind@GoogleDeepMind·
This is Gemini 3: our most intelligent model that helps you learn, build and plan anything. It comes with state-of-the-art reasoning capabilities, world-leading multimodal understanding, and enables new agentic coding experiences. 🧵
Tim Franzmeyer retweeted
Noam Brown@polynoamial·
Below is a deep dive into why self play works for two-player zero-sum (2p0s) games like Go/Poker/Starcraft but is so much harder to use in "real world" domains. tl;dr: self play converges to minimax in 2p0s games, and minimax is really useful in those games.

Every finite 2p0s game has a minimax equilibrium, which is essentially an unbeatable strategy in expectation (assuming the players alternate sides). In rock paper scissors, for example, minimax is 1/3rd on each action.

Is minimax what we want? Not necessarily. If you're playing minimax in Rock Paper Scissors when most opponents' strategies are "always throw Rock" then you're clearly suboptimal, even though you're not losing in expectation. This especially matters in a game like poker because playing minimax means you might not make as much money off of weak players as you could if you maximally exploited them. But the guarantee of "you will not lose in expectation" is really nice to have. And in games like Chess and Go, the difference between a minimax strategy and a strategy that optimally exploits the population of opponents is negligible. For that reason, minimax is typically considered the goal for a two-player zero-sum game. Even in poker, the conventional wisdom among top pros is to play minimax (game theory optimal) and then only deviate if you spot clear weaknesses in the opponent.

Sound self play, even from scratch, is guaranteed to converge to a minimax equilibrium in finite 2p0s games. That's amazing! By simply scaling memory and compute, and with no human data, we can converge to a strategy that's unbeatable in expectation.

What about non-2p0s games? Sadly, pure self play, with no human data, is no longer guaranteed to converge to a useful strategy. This can be clearly seen in the Ultimatum Game. Alice must offer Bob $0-100. Bob then accepts or rejects. If Bob accepts, the money is split according to Alice's proposal. If Bob rejects, both receive $0.

The equilibrium (specifically, subgame perfect equilibrium) strategy is to offer 1 penny and for Bob to accept. But in the real world, people aren't so rational. If Alice were to try that strategy with real humans she would end up with very little money. Self play becomes untethered from what we as humans find useful.

A lot of folks have proposed games like "an LLM teacher proposes hard math problems, and a student LLM tries to solve them" to achieve self-play training, but this runs into similar problems as the Ultimatum Game, where the equilibrium is untethered from what we as humans find useful. What should the reward for the teacher be in such a game? If it's 2p0s then the teacher is rewarded if the student couldn't solve the problem, so the teacher will pose impossible problems. Okay, what if we reward it for the student having a 50% success rate? Then the teacher could just flip a coin and ask the student if it landed Heads. Or the teacher could ask the student to decrypt a message via an exhaustive key search. Reward shaping to achieve intended behavior becomes a major challenge. This isn't an issue in 2p0s games.

I do believe in self play. It provides an infinite source of training, and it continuously matches an agent with an equally skilled peer. We've also seen it work in some complex non-2p0s settings like Diplomacy and Hanabi. But applying it outside of 2p0s games is a lot harder than it was for Go, Poker, Dota, and Starcraft.
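The exploitability point above can be made concrete with a tiny sketch (all names are illustrative, not from the thread): the uniform minimax strategy in rock paper scissors never loses in expectation against "always Rock", but the best response earns strictly more.

```python
# Rock paper scissors payoff for player 1: +1 win, -1 loss, 0 tie.
# Actions: 0 = Rock, 1 = Paper, 2 = Scissors.
def payoff(a, b):
    if a == b:
        return 0
    return 1 if (a - b) % 3 == 1 else -1  # e.g. Paper (1) beats Rock (0)

def expected_payoff(p, q):
    """Expected payoff of mixed strategy p against mixed strategy q."""
    return sum(p[a] * q[b] * payoff(a, b) for a in range(3) for b in range(3))

minimax = [1 / 3, 1 / 3, 1 / 3]   # the minimax equilibrium of RPS
always_rock = [1.0, 0.0, 0.0]     # a weak, exploitable opponent
always_paper = [0.0, 1.0, 0.0]    # the best response to always-Rock

print(expected_payoff(minimax, always_rock))       # 0.0: unbeatable, but leaves money on the table
print(expected_payoff(always_paper, always_rock))  # 1.0: maximal exploitation
```

Minimax guarantees a non-negative expectation against every opponent, which is exactly why it forgoes the extra payoff available against this particular weak one.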
Noam Brown tweet media
Noam Brown@polynoamial

Self play works so well in chess, go, and poker because those games are two-player zero-sum. That simplifies a lot of problems. The real world is messier, which is why we haven’t seen many successes from self play in LLMs yet. Btw @karpathy did great and I mostly agree with him!

Tim Franzmeyer@frtimlive·
I recently joined @GoogleDeepMind in London. Excited to be part of David Silver's RL team to work on Gemini, Reinforcement Learning and Agents. It’s been amazing speaking with so many fascinating people in the first weeks and learning from them!
Tim Franzmeyer retweeted
International Conference on 3D Vision
🎉 Meet the #3DV2026 Keynote Speakers!
Jitendra Malik · University of California, Berkeley
Angela Dai · Technical University of Munich
Christian Rupprecht · University of Oxford
Alec Jacobson · University of Toronto
Bring your latest work and join us for the exciting keynotes!
International Conference on 3D Vision tweet media
Tim Franzmeyer retweeted
Felipe Nuti@NutiFelipe·
Many works on LLM jailbreaking have hypothesized that certain jailbreaks get around safety guardrails introduced during fine-tuning by sending prompts out of the distribution of the fine-tuning data, leading the LLM to behave more like its unaligned pre-trained checkpoint.

But how can we directly measure—for an individual prompt—when fine-tuning is having less of an effect on the model's final output? At ICML 2025 on Thu. 11AM (#E-3005), we show how to do this for any fine-tuned model, assuming access to the original pre-trained model 🧵🪡

Joint work with @frtimlive and João Henriques at @Oxford_VGG.
Felipe Nuti tweet media
Tim Franzmeyer retweeted
Minqi Jiang@MinqiJiang·
Recently, there has been a lot of talk of LLM agents automating ML research itself. If Llama 5 can create Llama 6, then surely the singularity is just around the corner. How can we get a pulse check on whether current LLMs are capable of driving this kind of total self-improvement?

Well, we know humans are pretty good at improving LLMs. In the NanoGPT speedrun challenge, created by @kellerjordan0, human researchers iteratively improved @karpathy's GPT-2 replication, slashing the training time (to the same target validation loss) from 45 minutes to under 3 minutes in just under a year (!).

Surely, a necessary (but not sufficient) ability for an LLM that can automatically improve frontier techniques is the ability to *reproduce* known innovations on GPT-2, a tiny language model from over 5 years ago. 🤔

So we took several of the top models and combined them with various search scaffolds to create *LLM speedrunner agents*. We then asked these agents to reproduce each of the NanoGPT speedrun records, starting from the previous record, while providing them access to different forms of hints that revealed the exact changes needed to reach the next record.

The results were surprising—not because we thought these agents would ace the benchmark, but because even the best agent failed to recover even half of the speed-up of human innovators on average in the easiest hint mode, where we show the agent the full pseudocode of the changes to the next record.

We believe The Automated LLM Speedrunning Benchmark provides a simple eval for measuring the lower bound of LLM agents' ability to reproduce scientific findings close to the frontier of ML. Beyond scientific reproducibility, this benchmark can also be run without hints, transforming into an automated *scientific innovation* benchmark. When run in "innovation mode," this benchmark effectively extends the NanoGPT speedrun to AI participants!

While initial results here indicate that current agents seriously struggle to match human innovators beyond just a couple of records, benchmarks have a tendency to fall. This one is particularly exciting to watch, as new state-of-the-art here by definition implies a form of *superhuman innovation*.
Minqi Jiang tweet media
Tim Franzmeyer retweeted
Chuanxia Zheng@ChuanxiaZ·
After two amazing years with @Oxford_VGG, I will be joining @NTUsg as a Nanyang Assistant Professor in Fall 2025! I’ll be leading the Physical Vision Group (physicalvision.github.io) — and we're hiring for next year!🚀 If you're passionate about vision or AI, get in touch!
Tim Franzmeyer retweeted
Edward Hughes@edwardfhughes·
Hypothesis: Humans succeed not because we make fewer errors, but because we are better at correcting them early. In other words, aiming for error correction in few-shot is likely to be more realistic and effective than aiming for vanishingly low error in zero-shot.
Benjamin Todd@ben_j_todd

Why can AIs code for 1h but not 10h? A simple explanation: if there's a 10% chance of error per 10min step (say), the success rate is:
1h: 53%
4h: 8%
10h: 0.2%
@tobyordoxford has tested this 'constant error rate' theory and shown it's a good fit for the data: the chance of success declines exponentially.
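The quoted numbers follow from a simple independence assumption: each 10-minute step fails with the same probability, so an n-step task succeeds with probability (1 - p)^n. A minimal sketch (function and parameter names are illustrative):

```python
# 'Constant error rate' model: each fixed-length step independently fails
# with probability p_error, so success over n steps is (1 - p_error)**n.
def success_rate(hours, p_error=0.10, step_minutes=10):
    steps = int(hours * 60 / step_minutes)
    return (1 - p_error) ** steps

for h in (1, 4, 10):
    print(f"{h}h: {success_rate(h):.2%}")
# 1h  (6 steps):  ~53%
# 4h  (24 steps): ~8%
# 10h (60 steps): ~0.18%
```

Under this model the half-life of task success is constant, which is why doubling the task horizon roughly squares the failure odds rather than merely doubling them.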

Tim Franzmeyer retweeted
ELLIS@ELLISforEurope·
Meet @DebOishi, ELLIS PhD Student 🎓 at @UniofOxford 🏴󠁧󠁢󠁥󠁮󠁧󠁿 & @GoogleDeepMind. She works on computer vision, gen & responsible AI and chairs ELLIS PhD Reading Groups on CV & deep learning theory. Career high: She won a £25K grant from @SkyUK to advance ML & AI! 👏 #WomenInELLIS
Tim Franzmeyer retweeted
Seohong Park@seohong_park·
Q-learning is not yet scalable seohong.me/blog/q-learnin… I wrote a blog post about my thoughts on scalable RL algorithms. To be clear, I'm still highly optimistic about off-policy RL and Q-learning! I just think we haven't found the right solution yet (the post discusses why).
Seohong Park tweet media
Tim Franzmeyer retweeted
Jakob Foerster@j_foerst·
Hallucinations are still a major challenge for deploying LLMs in real-world scenarios, in particular when we SFT models on downstream datasets. Our new paper ensures models are only trained on content they already know how to generate, drastically reducing hallucinations. 🛑
Tim Franzmeyer@frtimlive

What if LLMs knew when to stop? 🚧 HALT finetuning teaches LLMs to only generate content they’re confident is correct. 🔍 Insight: Post-training must be adjusted to the model’s capabilities. ⚖️ Tunable trade-off: Higher correctness 🔒 vs. More completeness 📝 with @AIatMeta 🧵

Tim Franzmeyer@frtimlive·
🚨 One model, high correctness: With low-threshold tuning, we take Llama3-70B from:
➡️ 51% → 87% correctness
➡️ Retaining 53% of the original completeness
Tim Franzmeyer tweet media
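As a rough illustration of the tunable trade-off described in the thread (this is not the paper's actual method; all names, confidence scores, and example text below are invented):

```python
# Toy sketch of a HALT-style confidence threshold: emit answer segments in
# order and stop as soon as confidence drops below a tunable threshold.
# A low threshold keeps more content (completeness); a high threshold keeps
# only what the model is sure of (correctness).
def halt_truncate(segments, threshold):
    """segments: list of (text, confidence) pairs in generation order."""
    kept = []
    for text, confidence in segments:
        if confidence < threshold:
            break  # stop generating rather than continue past what is known
        kept.append(text)
    return " ".join(kept)

answer = [
    ("Paris is the capital of France.", 0.98),
    ("It has a population of about 2.1 million.", 0.85),
    ("Its mayor in 1905 was ...", 0.30),  # low-confidence tail
]

print(halt_truncate(answer, threshold=0.5))  # keeps the two confident segments
print(halt_truncate(answer, threshold=0.9))  # keeps only the first
```

Sweeping the threshold traces out the correctness/completeness curve the tweet describes: a shorter answer with fewer wrong claims versus a fuller answer with more risk.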
Tim Franzmeyer@frtimlive·
What if LLMs knew when to stop? 🚧 HALT finetuning teaches LLMs to only generate content they’re confident is correct. 🔍 Insight: Post-training must be adjusted to the model’s capabilities. ⚖️ Tunable trade-off: Higher correctness 🔒 vs. More completeness 📝 with @AIatMeta 🧵