Nando de Freitas

12.2K posts

Nando de Freitas

@NandoDF

Microsoft CVP & CIFAR Fellow. Prev. Prof @ Oxford & DeepMind — AlphaGo-tuning AlphaCode Gato ReST SSM-Gemma Imagen Veo Genie MAI-voice MAI-Image MAI-Transcribe

London, England Katılım Nisan 2009

831 Takip Edilen109.4K Takipçiler

Nando de Freitas retweetledi

Hanchen Li@lihanc02·23h

An agent that beats Claude Mythos on Terminal Bench and SWE-bench Verified? 🎉We are excited to share Terminator-1, our newest agent that achieved 95+% on SWE-bench Verified and Terminal-Bench with @MogicianTony! We show that besides model capabilities, well-designed harness could actually boost the accuracy by 3x in coding tasks. Well if you really wanted you could get 100% accuracy without solving a single task. The actual finding is that most AI benchmarks can be easily reward-hacked with simple exploits. Read more about the same 7 design flaws that almost every evaluation has ⬇️

Hao Wang@MogicianTony

SWE-bench Verified and Terminal-Bench—two of the most cited AI benchmarks—can be reward-hacked with simple exploits. Our agent scored 100% on both. It solved 0 tasks. Evaluate the benchmark before it evaluates your agent. If you’re picking models by leaderboard score alone, you’re optimizing for the wrong thing. 🧵

English

151

253

3.4K

795.6K

Nando de Freitas retweetledi

alphaXiv@askalphaxiv·21h

What if the model didn’t just use a computer, but actually was the computer? Meta AI introduces "Neural Computer", a model where computation, memory, and I/O are all inside one learned system. Their early prototype learns from screen recordings of terminals and desktops, and it can already imitate some basic computer behavior like rendering interfaces and responding to clicks or commands. But it still breaks on slightly harder tasks like reliable reasoning, stable memory, and reusable skills.

English

452

33.4K

Nando de Freitas retweetledi

Lucas Beyer (bl16)@giffmana·22h

Muse Spark not benchmark overfit. See thread for what a benchmark overfit model looks like instead. (Note i don't know what the proprietary benches look like. Kinda the point.)

Rayan Krishnan@RayanKrishnan

Today, we see the result: Muse Spark. I'll be honest, I was surprised how competitive this model is. Progress at OAI, Anthropic, and GDM has been continuous, built on compounding breakthroughs. But MSL's team seems to have taken a single binary leap, catching up on many fronts at once.

English

271

82.3K

Nando de Freitas retweetledi

Nathan Lambert@natolambert·23h

My book, Reinforcement Learning from Human Feedback, is wrapping up and going into final production (copyediting, making pretty, formatting, etc.). Shipping to you in 1-2 months! It's a wonderful project to create a foundation of knowledge for the research communities that I love and operate in. It’s the book I wish I had when starting on my LLM journey about 3 years ago. The book’s deepest cut is on core reinforcement learning methods, intuitons, and implementations for LLMs. These don’t live in isolation, and it’s presented in the broader context of post-training methods and unsolved problems in RLHF. A nice balance of depth and breadth. I’m always asked about the title, and I am staying firm that this is THE book documenting the organization of the field of RLHF. Any other topic is too dynamic, where writing a book today would be immediately outdated. RLHF is largely being overshadowed by lots of other developments in AI, but will always be around and at the forefront of human-AI interactions. The topic deserves coverage in depth and this platform. Thank you for all your support. More projects related to the book being announced soon 🎥 I'm excited to reconnect with the community through in-person book events this summer and fall.

English

304

16.3K

Nando de Freitas retweetledi

NVIDIA Healthcare@NVIDIAHealth·1d

🚀 The largest-ever open‑source protein‑complex treasure trove - 1.7 million of AI‑predicted complexes now live in the AlphaFold Database In collaboration with @emblebi, @GoogleDeepMind, and @SeoulNatlUni, we have added millions of predicted complexes to the AlphaFold Database to accelerate global health research. 🧵👇

English

109

608

44.5K

Nando de Freitas@NandoDF·8h

Very proud of our MAI Voice team! In less than a year this team of less than 10 people achieved this!

Microsoft AI@MicrosoftAI

MAI-Voice-1 sets a new bar for natural, expressive speech generation. Can you tell which voice is synthetic and which is human? Try it for yourself msft.it/6014Q4aKK

English

2.8K

Nando de Freitas@NandoDF·1d

@iamtrask Fully agree.

English

⿻ Andrew Trask@iamtrask·1d

I joined DeepMind in 2017 and I remember in my first conversation with my manager, he described to me that he thought Demis's most brilliant move was NOT relocating to Silicon Valley. Demis's decision meant that for a 5-7 year period, every senior AI researcher in Europe who wanted to join one of the new/big AGI labs... but didn't want to be 5000 miles away from their home/family/culture... joined DeepMind. There was *very* little competition for talent for an incredibly long period of time... and very high retention. There was a meme that "basically nobody has quit DeepMind yet". I remember 20 people joining/week. There was a pizza party every friday where they rang a gong and you met all these famous researchers. If AI labs are mostly a competition for data, compute, and talent (and the rest is mostly the stochastic nature of research), Demis's opening moves here were briliant: 2010ish... Data: he had the idea to use videogames to generate infinite amounts of training data for specific tasks. Atara, Go, etc. came out of this. 2014ish... Compute: He used his data advantage to create the impressive Atari demonstration (and then i'm sure many other tactical strategies) to get acquired by Google... giving him an instant, sustainable compute advantage 2014ish... Talent: Post acquisition, he refused to relocate the team to SV... unlocking the massive talent advantage described above. It also meant he got the talent cheaper (while the other labs fought over talent with money... DeepMind got top talent at a discount b/c of location. At the time salaries were lower at DM than Facebook or OpenAI). And if you look at the rise of OpenAI, in particular around GPT-1, GPT-2, ... it included a data acquisition strategy that wasn't an option for DeepMind (I was on the language team at the time... it wasn't an option). This high-risk strategy produced very impressive models.... forming a particular kind of data disadvantage for firms with less risk-on data acquisition strategies. Then the other labs also built offices across the street from DeepMind, and the geography-based aspect talent advantage expired (DeepMind still has many great talent strategies). And of course... plenty of compute buildout ensued across many labs... and things balanced out to the competitive landscape that exists today. (According to Epoch's recent report... Google still has more compute than anyone though) It's a complex time. AI is a crazy space. But Demis is a master strategist, and those opening moves were both counter to the conventional wisdom and absolutely brilliant.

Harry Stebbings@HarryStebbings

DeepMind stayed in London because it is better for talent than Silicon Valley. "I saw London and the UK as having incredible talent from top universities like Cambridge, Oxford, Imperial and UCL. There is a deep heritage of scientific breakthroughs and world-class thinkers. There was less competition for that talent, which made it a huge structural advantage for building DeepMind." @demishassabis What is the single biggest advantage of building in Europe for you @torsten @antonosika @MaxJunestrand @matiii @ChrisParsonson @cjpedregal @matthewclifford @torstenreil @alanchanguk

English

526

90.3K

Nando de Freitas retweetledi

Sebastian Raschka@rasbt·2d

Strong release! GLM-5.1 is a DeepSeek-V3.2-like architecture (including MLA and DeepSeek Sparse Attention) but with more layers. And the benchmarks look better throughout! Looks like THE flagship open-weight model now.

Z.ai@Zai_org

Introducing GLM-5.1: The Next Level of Open Source - Top-Tier Performance: #1 in open source and #3 globally across SWE-Bench Pro, Terminal-Bench, and NL2Repo. - Built for Long-Horizon Tasks: Runs autonomously for 8 hours, refining strategies through thousands of iterations. Blog: z.ai/blog/glm-5.1 Weights: huggingface.co/zai-org/GLM-5.1 API: docs.z.ai/guides/llm/glm… Coding Plan: z.ai/subscribe Coming to chat.z.ai in the next few days.

English

144

1.1K

86K

Nando de Freitas retweetledi

Nathan Lambert@natolambert·2d

New report with @xeophon is out with the latest open model adoption data we have gathered for Interconnects & The ATOM Project. At the surface level, we can see Chinese models continuing to accelerate in adoption. The report details much more. 1. We manually curate ~1.5K of the most important language models, creating a specific set of models to focus our analysis on (excludes embedding models, local inference models like MLX/GGUF, etc to have accurate download rankings). 2. Studying other adoption metrics, such as derivative models and inference share on OpenRouter, to show how they correlate with downloads, while often sifted in time. China has a strong lead here too. 3. Better classification of downloads across model sizes. Large models still are the models where Qwen is least competitive, relative to other model builders. 4. Expansion of our Relative Adoption Metric (RAM) to show standout recent models (we'll check Gemma 4 on Friday); Qwen 3.5, Nemontron 3, Kimi K2.5, all showing very strong adoption. Overall, this is another step towards formalizing and making public better data on the open language model ecosystem, so the community can better understand the impact and trends of its adoption. More on this soon!

English

152

17.9K

Nando de Freitas retweetledi

Jarrid Rector-Brooks@jarridrb·2d

What if AI could invent enzymes that nature hasn’t seen? 👩‍🔬🧑‍🔬 Introducing 🪩 DISCO: Diffusion for Sequence-structure CO-design 14 rounds of directed evolution and over a year of wet lab work. That's what it took to engineer an enzyme for selective C(sp³)–H insertion, one of the most challenging transformations in organic chemistry. DISCO surpasses this with a single plate. No pre-specified catalytic residues, no template, no theozyme, no inverse folding, just joint diffusion over protein sequence and structure. 📝 Blog: disco-design.github.io 📄 Paper: arxiv.org/abs/2604.05181 💻 Code: github.com/DISCO-design/D…

English

214

912

202.3K

Nando de Freitas retweetledi

Alexandr Wang@alexandr_wang·1d

a good writeup about Muse Spark on a few complex queries (multimodal, stock analysis, coding): riteshkhanna.com/blog/muse-spar…

English

588

44.8K

Nando de Freitas retweetledi

Artificial Analysis@ArtificialAnlys·2d

Meta is back! Muse Spark scores 52 on the Artificial Analysis Intelligence Index, behind only Gemini 3.1 Pro, GPT-5.4, and Claude Opus 4.6. Muse Spark is the first new release since Llama 4 in April 2025 and also Meta's first release that is not open weights Muse Spark is a new model from @Meta evaluated on Artificial Analysis. We were given early access by Meta to independently benchmark the model. It is the first frontier-class model from Meta since Llama 4 Maverick was released in April 2025, and notably the first @AIatMeta model that is not being released as open weights. The release follows Meta's reorganization of its AI efforts under Meta Superintelligence Labs, and signals that Meta is re-entering the frontier race after roughly a year of relative quiet. For context, Llama 4 Maverick and Scout scored 18 and 13 respectively on the Artificial Analysis Intelligence Index as non-reasoning models at the time of their release, while Muse Spark scores 52. Muse Spark essentially closes the gap between to the frontier in a single release. The model is not open source and is not yet accessible via an API but Meta has shared they expect this to come soon. Meta is also integrating Muse Spark into their first party products including their Meta AI chat product, Facebook, Instagram and Threads. Key takeaways from our benchmarks: ➤ Muse Spark scores 52 on the Artificial Analysis Intelligence Index, placing it within the top 5 models we have benchmarked. It sits ahead of Claude Sonnet 4.6, GLM-5.1, MiniMax-M2.7, Grok 4.20 and behind Gemini 3.1 Pro Preview, GPT-5.4 and Claude Opus 4.6 ➤ Muse Spark is notably token efficient for its intelligence level. It used 58M output tokens to run the Intelligence Index, comparable to Gemini 3.1 Pro Preview (57M) and notably lower than Claude Opus 4.6 (Adaptive Reasoning, max effort, 157M), GPT-5.4 (xhigh, 120M) and GLM-5 (110M) ➤ Muse Spark is the second-most capable vision model we have benchmarked. It scores 80.5% on MMMU-Pro, behind only Gemini 3.1 Pro Preview (82.4%) ➤ Muse Spark performs strongly on reasoning and instruction-following evaluations. It scores 39.9% on HLE, trailing only Gemini 3.1 Pro Preview (44.7%) and GPT-5.4 (xhigh, 41.6%). The model also achieved 5th highest in CritPT with a score of 11%, an eval that is focused on difficult physics research questions. This is substantially above above Gemini 3 Flash (9%) and Claude 4.6 Sonnet (3%) ➤ Agentic performance does not stand out. On GDPval-AA, our evalaution focused on real world work tasks, Muse Spark scores 1427, behind both Claude Sonnet 4.6 at 1648 and GPT-5.4 at 1676, but ahead of Gemini 3.1 Pro Preview at 1320. On On TerminalBench Hard, Muse Spark trails Claude Sonnet 4.6, GPT-5.4, and Gemini 3.1 Pro. Muse Spark joins others in achieving a high τ²-Bench Telecom score of 92% Key model details: ➤ Modalities: Multimodal including text and vision input, text output ➤ License: Proprietary, Meta's first frontier model not released as open weights ➤ Availability: No public API at the time of publishing. Meta expects to provide API access soon. Meta has started integration into their first party AI offering Meta AI and inside Facebook, Instagram, and Threads

English

322

2.4K

473.2K

Nando de Freitas retweetledi

AI at Meta@AIatMeta·2d

Muse Spark is built from the ground up to integrate visual information across domains and tools. It achieves strong performance on visual STEM questions, entity recognition, and localization, enabling interactive experiences like troubleshooting your home appliances with dynamic annotations.

English

521

155.1K

Nando de Freitas retweetledi

Pietro Schirano@skirano·2d

Ok this is actually pretty impressive and I truly didn't see any model doing this before or being able to do it to this extent. When I asked Muse Spark from Meta to convert this image into code, it cut out the assets from the screens so it could use them correctly!

English

835

133.6K

Nando de Freitas@NandoDF·1d

@alexandr_wang Congratulations. Very impressive results given the pressure and short time.

English

Nando de Freitas retweetledi

Alexandr Wang@alexandr_wang·2d

1/ today we're releasing muse spark, the first model from MSL. nine months ago we rebuilt our ai stack from scratch. new infrastructure, new architecture, new data pipelines. muse spark is the result of that work, and now it powers meta ai. 🧵

English

701

1.2K

10.1K

4.2M

Nando de Freitas@NandoDF·1d

@_jasonwei @ananyaku Congratulations. Very impressive results for 9 months of work!

English

1.9K

Jason Wei@_jasonwei·1d

Fun nine months! My first week i remember we had a long dinner in the cafeteria daydreaming about the cool research directions to pursue, then going to back to our desks to write a basic script to inference llama. Now we have a pretty complete stack and our first model is out 🥑

Alexandr Wang@alexandr_wang

English

668

84.7K

Nando de Freitas@NandoDF·4d

What is the state of the art in Quantum algorithms for optimisation and machine learning? I found this 2017 comparison and I would love to know what has changed in the last 8 years? microsoft.com/en-us/research…

English

3.5K

Nando de Freitas@NandoDF·4d

This is such an inspiring video. It also makes me feel proud that we are building AI tools to empower these amazing scientists to expand our knowledge. Modern physics is forcing us to rethink existence | Michelle Thaller: Fu... youtu.be/LcC5ilQKQGc?si… via @YouTube

YouTube

English

5.7K

Nando de Freitas retweetledi

MIT CSAIL@MIT_CSAIL·2 Nis

A first-of-its-kind study from MIT measures how successful AI is in completing thousands of tasks done by workers in the US economy. Across these real-world tasks, they found that AI capabilities are improving quickly, but performance is rising smoothly: bit.ly/3Q1JQZD

English

126

34.7K

Keşfet

@MogicianTony @emblebi @GoogleDeepMind @SeoulNatlUni @iamtrask @xeophon @Meta @AIatMeta