Tim Franzmeyer

63 posts

Tim Franzmeyer
@frtimlive

Research Scientist @googledeepmind Gemini, Reinforcement Learning, Agents

Joined July 2022
436 Following · 1.4K Followers
Tim Franzmeyer@frtimlive·
HALT (“High Accuracy, Less Talk”) accepted to ICLR 2026 🎉 LLMs are trained to always finish answers — even past what they truly know — causing partially wrong outputs. HALT instead finetunes models to stop when confidence drops, trading completeness for reliability 🚧 👇
Tim Franzmeyer@frtimlive

What if LLMs knew when to stop? 🚧 HALT finetuning teaches LLMs to only generate content they’re confident is correct. 🔍 Insight: Post-training must be adjusted to the model’s capabilities. ⚖️ Tunable trade-off: Higher correctness 🔒 vs. More completeness 📝 with @AIatMeta 🧵

Tim Franzmeyer@frtimlive·
On my way to @NeurIPSConf. Looking forward to catching up and to meeting new people!
Tim Franzmeyer retweeted
Google DeepMind@GoogleDeepMind·
This is Gemini 3: our most intelligent model that helps you learn, build and plan anything. It comes with state-of-the-art reasoning capabilities, world-leading multimodal understanding, and enables new agentic coding experiences. 🧵
Tim Franzmeyer retweeted
Noam Brown@polynoamial·
Below is a deep dive into why self play works for two-player zero-sum (2p0s) games like Go/Poker/Starcraft but is so much harder to use in "real world" domains. tl;dr: self play converges to minimax in 2p0s games, and minimax is really useful in those games.

Every finite 2p0s game has a minimax equilibrium, which is essentially an unbeatable strategy in expectation (assuming the players alternate sides). In rock paper scissors, for example, minimax is 1/3rd on each action.

Is minimax what we want? Not necessarily. If you're playing minimax in Rock Paper Scissors when most opponents' strategies are "always throw Rock" then you're clearly suboptimal, even though you're not losing in expectation. This especially matters in a game like poker because playing minimax means you might not make as much money off of weak players as you could if you maximally exploited them. But the guarantee of "you will not lose in expectation" is really nice to have. And in games like Chess and Go, the difference between a minimax strategy and a strategy that optimally exploits the population of opponents is negligible. For that reason, minimax is typically considered the goal for a two-player zero-sum game. Even in poker, the conventional wisdom among top pros is to play minimax (game theory optimal) and then only deviate if you spot clear weaknesses in the opponent.

Sound self play, even from scratch, is guaranteed to converge to a minimax equilibrium in finite 2p0s games. That's amazing! By simply scaling memory and compute, and with no human data, we can converge to a strategy that's unbeatable in expectation.

What about non-2p0s games? Sadly, pure self play, with no human data, is no longer guaranteed to converge to a useful strategy. This can be clearly seen in the Ultimatum Game. Alice must offer Bob $0-100. Bob then accepts or rejects. If Bob accepts, the money is split according to Alice's proposal. If Bob rejects, both receive $0.

The equilibrium (specifically, subgame perfect equilibrium) strategy is to offer 1 penny and for Bob to accept. But in the real world, people aren't so rational. If Alice were to try that strategy with real humans she would end up with very little money. Self play becomes untethered from what we as humans find useful.

A lot of folks have proposed games like "an LLM teacher proposes hard math problems, and a student LLM tries to solve them" to achieve self-play training, but this runs into similar problems as the Ultimatum Game, where the equilibrium is untethered from what we as humans find useful. What should the reward for the teacher be in such a game? If it's 2p0s then the teacher is rewarded if the student couldn't solve the problem, so the teacher will pose impossible problems. Okay, what if we reward it for the student having a 50% success rate? Then the teacher could just flip a coin and ask the student if it landed Heads. Or the teacher could ask the student to decrypt a message via an exhaustive key search. Reward shaping to achieve intended behavior becomes a major challenge. This isn't an issue in 2p0s games.

I do believe in self play. It provides an infinite source of training, and it continuously matches an agent with an equally skilled peer. We've also seen it work in some complex non-2p0s settings like Diplomacy and Hanabi. But applying it outside of 2p0s games is a lot harder than it was for Go, Poker, Dota, and Starcraft.
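The exploitability point above can be made concrete with a tiny sketch (all names are illustrative, not from the thread): the uniform minimax strategy in rock paper scissors never loses in expectation against "always Rock", but the best response earns strictly more.

```python
# Rock paper scissors payoff for player 1: +1 win, -1 loss, 0 tie.
# Actions: 0 = Rock, 1 = Paper, 2 = Scissors.
def payoff(a, b):
    if a == b:
        return 0
    return 1 if (a - b) % 3 == 1 else -1  # e.g. Paper (1) beats Rock (0)

def expected_payoff(p, q):
    """Expected payoff of mixed strategy p against mixed strategy q."""
    return sum(p[a] * q[b] * payoff(a, b) for a in range(3) for b in range(3))

minimax = [1 / 3, 1 / 3, 1 / 3]   # the minimax equilibrium of RPS
always_rock = [1.0, 0.0, 0.0]     # a weak, exploitable opponent
always_paper = [0.0, 1.0, 0.0]    # the best response to always-Rock

print(expected_payoff(minimax, always_rock))       # 0.0: unbeatable, but leaves money on the table
print(expected_payoff(always_paper, always_rock))  # 1.0: maximal exploitation
```

Minimax guarantees a non-negative expectation against every opponent, which is exactly why it forgoes the extra payoff available against this particular weak one.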
Noam Brown tweet media
Noam Brown@polynoamial

Self play works so well in chess, go, and poker because those games are two-player zero-sum. That simplifies a lot of problems. The real world is messier, which is why we haven’t seen many successes from self play in LLMs yet. Btw @karpathy did great and I mostly agree with him!

Tim Franzmeyer@frtimlive·
I recently joined @GoogleDeepMind in London. Excited to be part of David Silver's RL team to work on Gemini, Reinforcement Learning and Agents. It’s been amazing speaking with so many fascinating people in the first weeks and learning from them!
Tim Franzmeyer retweeted
International Conference on 3D Vision
🎉 Meet the #3DV2026 Keynote Speakers!
Jitendra Malik · University of California, Berkeley
Angela Dai · Technical University of Munich
Christian Rupprecht · University of Oxford
Alec Jacobson · University of Toronto
Bring your latest work and join us for the exciting keynotes!
International Conference on 3D Vision tweet media
Tim Franzmeyer retweeted
Felipe Nuti@NutiFelipe·
Many works on LLM jailbreaking have hypothesized that certain jailbreaks get around safety guardrails introduced during fine-tuning by sending prompts out of the distribution of the fine-tuning data, leading the LLM to behave more like its unaligned pre-trained checkpoint.

But how can we directly measure—for an individual prompt—when fine-tuning is having less of an effect on the model's final output? At ICML 2025 on Thu. 11AM (#E-3005), we show how to do this for any fine-tuned model, assuming access to the original pre-trained model 🧵🪡

Joint work with @frtimlive and João Henriques at @Oxford_VGG.
Felipe Nuti tweet media
Tim Franzmeyer retweeted
Minqi Jiang@MinqiJiang·
Recently, there has been a lot of talk of LLM agents automating ML research itself. If Llama 5 can create Llama 6, then surely the singularity is just around the corner. How can we get a pulse check on whether current LLMs are capable of driving this kind of total self-improvement?

Well, we know humans are pretty good at improving LLMs. In the NanoGPT speedrun challenge, created by @kellerjordan0, human researchers iteratively improved @karpathy's GPT-2 replication, slashing the training time (to the same target validation loss) from 45 minutes to under 3 minutes in just under a year (!).

Surely, a necessary (but not sufficient) ability for an LLM that can automatically improve frontier techniques is the ability to *reproduce* known innovations on GPT-2, a tiny language model from over 5 years ago. 🤔

So we took several of the top models and combined them with various search scaffolds to create *LLM speedrunner agents*. We then asked these agents to reproduce each of the NanoGPT speedrun records, starting from the previous record, while providing them access to different forms of hints that revealed the exact changes needed to reach the next record.

The results were surprising—not because we thought these agents would ace the benchmark, but because even the best agent failed to recover even half of the speed-up of human innovators on average in the easiest hint mode, where we show the agent the full pseudocode of the changes to the next record.

We believe The Automated LLM Speedrunning Benchmark provides a simple eval for measuring the lower bound of LLM agents' ability to reproduce scientific findings close to the frontier of ML. Beyond scientific reproducibility, this benchmark can also be run without hints, transforming into an automated *scientific innovation* benchmark. When run in "innovation mode," this benchmark effectively extends the NanoGPT speedrun to AI participants!

While initial results here indicate that current agents seriously struggle to match human innovators beyond just a couple of records, benchmarks have a tendency to fall. This one is particularly exciting to watch, as new state-of-the-art here by definition implies a form of *superhuman innovation*.
Minqi Jiang tweet media
Tim Franzmeyer retweeted
Chuanxia Zheng@ChuanxiaZ·
After two amazing years with @Oxford_VGG, I will be joining @NTUsg as a Nanyang Assistant Professor in Fall 2025! I’ll be leading the Physical Vision Group (physicalvision.github.io) — and we're hiring for next year!🚀 If you're passionate about vision or AI, get in touch!
Tim Franzmeyer retweeted
Edward Hughes@edwardfhughes·
Hypothesis: Humans succeed not because we make fewer errors, but because we are better at correcting them early. In other words, aiming for error correction in few-shot is likely to be more realistic and effective than aiming for vanishingly low error in zero-shot.
Benjamin Todd@ben_j_todd

Why can AIs code for 1h but not 10h? A simple explanation: if there's a 10% chance of error per 10min step (say), the success rate is:
1h: 53%
4h: 8%
10h: 0.2%
@tobyordoxford has tested this 'constant error rate' theory and shown it's a good fit for the data: the chance of success declines exponentially.
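The quoted numbers follow from a simple independence assumption: each 10-minute step fails with the same probability, so an n-step task succeeds with probability (1 - p)^n. A minimal sketch (function and parameter names are illustrative):

```python
# 'Constant error rate' model: each fixed-length step independently fails
# with probability p_error, so success over n steps is (1 - p_error)**n.
def success_rate(hours, p_error=0.10, step_minutes=10):
    steps = int(hours * 60 / step_minutes)
    return (1 - p_error) ** steps

for h in (1, 4, 10):
    print(f"{h}h: {success_rate(h):.2%}")
# 1h  (6 steps):  ~53%
# 4h  (24 steps): ~8%
# 10h (60 steps): ~0.18%
```

Under this model the half-life of task success is constant, which is why doubling the task horizon roughly squares the failure odds rather than merely doubling them.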

Tim Franzmeyer retweeted
ELLIS@ELLISforEurope·
Meet @DebOishi, ELLIS PhD Student 🎓 at @UniofOxford 🏴󠁧󠁢󠁥󠁮󠁧󠁿 & @GoogleDeepMind. She works on computer vision, gen & responsible AI and chairs ELLIS PhD Reading Groups on CV & deep learning theory. Career high: She won a £25K grant from @SkyUK to advance ML & AI! 👏 #WomenInELLIS
Tim Franzmeyer retweeted
Seohong Park@seohong_park·
Q-learning is not yet scalable seohong.me/blog/q-learnin… I wrote a blog post about my thoughts on scalable RL algorithms. To be clear, I'm still highly optimistic about off-policy RL and Q-learning! I just think we haven't found the right solution yet (the post discusses why).
Seohong Park tweet media
Tim Franzmeyer retweeted
Jakob Foerster@j_foerst·
Hallucinations are still a major challenge for deploying LLMs in real-world scenarios, in particular when we SFT models on downstream datasets. Our new paper ensures models are only trained on content they already know how to generate, drastically reducing hallucinations. 🛑
Tim Franzmeyer@frtimlive

What if LLMs knew when to stop? 🚧 HALT finetuning teaches LLMs to only generate content they’re confident is correct. 🔍 Insight: Post-training must be adjusted to the model’s capabilities. ⚖️ Tunable trade-off: Higher correctness 🔒 vs. More completeness 📝 with @AIatMeta 🧵

Tim Franzmeyer@frtimlive·
🚨 One model, high correctness: With low-threshold tuning, we take Llama3-70B from:
➡️ 51% → 87% correctness
➡️ Retaining 53% of the original completeness
Tim Franzmeyer tweet media
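As a rough illustration of the tunable trade-off described in the thread (this is not the paper's actual method; all names, confidence scores, and example text below are invented):

```python
# Toy sketch of a HALT-style confidence threshold: emit answer segments in
# order and stop as soon as confidence drops below a tunable threshold.
# A low threshold keeps more content (completeness); a high threshold keeps
# only what the model is sure of (correctness).
def halt_truncate(segments, threshold):
    """segments: list of (text, confidence) pairs in generation order."""
    kept = []
    for text, confidence in segments:
        if confidence < threshold:
            break  # stop generating rather than continue past what is known
        kept.append(text)
    return " ".join(kept)

answer = [
    ("Paris is the capital of France.", 0.98),
    ("It has a population of about 2.1 million.", 0.85),
    ("Its mayor in 1905 was ...", 0.30),  # low-confidence tail
]

print(halt_truncate(answer, threshold=0.5))  # keeps the two confident segments
print(halt_truncate(answer, threshold=0.9))  # keeps only the first
```

Sweeping the threshold traces out the correctness/completeness curve the tweet describes: a shorter answer with fewer wrong claims versus a fuller answer with more risk.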
Tim Franzmeyer@frtimlive·
What if LLMs knew when to stop? 🚧 HALT finetuning teaches LLMs to only generate content they’re confident is correct. 🔍 Insight: Post-training must be adjusted to the model’s capabilities. ⚖️ Tunable trade-off: Higher correctness 🔒 vs. More completeness 📝 with @AIatMeta 🧵