Thomas Wolf

5K posts

@Thom_Wolf

Co-founder at @HuggingFace - moonshots - angel

Joined February 2011
7.1K Following · 113.6K Followers
Pinned Tweet
Thomas Wolf@Thom_Wolf·
Shifting structures in a software world dominated by AI. Some first-order reflections (TL;DR at the end): Reducing software supply chains, the return of software monoliths – When rewriting code and understanding large foreign codebases becomes cheap, the incentive to rely on deep dependency trees collapses. Writing from scratch ¹ or extracting the relevant parts from another library is far easier when you can simply ask a code agent to handle it, rather than spending countless nights diving into an unfamiliar codebase. The reasons to reduce dependencies are compelling: a smaller attack surface for supply chain threats, smaller packaged software, improved performance, and faster boot times. By leveraging the tireless stamina of LLMs, the dream of coding an entire app from bare-metal considerations all the way up is becoming realistic. End of the Lindy effect – The Lindy effect holds that things which have been around for a long time are there for good reason and will likely continue to persist. It's related to Chesterton's fence: before removing something, you should first understand why it exists, which means removal always carries a cost. But in a world where software can be developed from first principles and understood by a tireless agent, this logic weakens. Older codebases can be explored at will; long-standing software can be replaced with far less friction. A codebase can be fully rewritten in a new language. ² Legacy software can be carefully studied and updated in situations where humans would have given up long ago. The catch: unknown unknowns remain unknown. The true extent of AI's impact will hinge on whether complete coverage of testing, edge cases, and formal verification is achievable. In an AI-dominated world, formal verification isn't optional—it's essential. The case for strongly typed languages – Historically, programming language adoption has been driven largely by human psychology and social dynamics. 
A language's success depended on a mix of factors: individual considerations like being easy to learn and simple to write correctly; community effects like how active and welcoming a community was, which in turn shaped how fast its ecosystem would grow; and fundamental properties like provable correctness, formal verification, and striking the right balance between dynamic and static checks—between the freedom to write anything and the discipline of guarding against edge cases and attacks. As the human factor diminishes, these dynamics will shift. Less dependence on human psychology will favor strongly typed, formally verifiable and/or high performance languages.³ These are often harder for humans to learn, but they're far better suited to LLMs, which thrive on formal verification and reinforcement learning environments. Expect this to reshape which languages dominate. Economic restructuring of open source – For decades, open-source communities have been built around humans finding connection through writing, learning, and using code together. In a world where most code is written—and perhaps more importantly, read—by machines, these incentives will start to break down.⁴ Communities of AIs building libraries and codebases together will likely emerge as a replacement, but such communities will lack the fundamentally human motivations that have driven open source until now. If the future of open-source development becomes largely devoid of humans, alignment of AI models won't just matter—it will be decisive. The future of new languages – Will AI agents face the same tradeoffs we do when developing or adopting new programming languages? Expressiveness vs. simplicity, safety vs. control, performance vs. abstraction, compile time vs. runtime, explicitness vs. conciseness. It's unclear that they will. In the long term, the reasons to create a new programming language will likely diverge significantly from the human-driven motivations of the past. 
There may well be an optimal programming language for LLMs—and there's no reason to assume it will resemble the ones humans have converged on.
TL;DR:
- Monoliths return – cheap rewriting kills dependency trees; smaller attack surface, better performance, bare-metal becomes realistic
- Lindy effect weakens – legacy code loses its moat, but unknown unknowns persist; formal verification becomes essential
- Strongly typed languages rise – human psychology mattered for adoption; now formal verification and RL environments favor types over ergonomics
- Open source restructures – human connection drove the community; AI-written/read code breaks those incentives; alignment becomes decisive
- New languages diverge – AI may not share our tradeoffs; optimal LLM programming languages may look nothing like what humans converged on
¹ x.com/mntruell/statu…
² x.com/anthropicai/st…
³ wesmckinney.com/blog/agent-erg…
⁴ github.com/tailwindlabs/t…
Thomas Wolf@Thom_Wolf·
This is really cool. It got me thinking more deeply about personalized RL: what’s the real point of personalizing a model in a world where base models can become obsolete so quickly? The reality in AI is that new models ship every few weeks, each better than the last. And the pace is only accelerating, as we see on the Hugging Face Hub. We are not far away from better base models dropping daily. There’s a research gap in RL here that almost no one is working on. Most LLM personalization research assumes a fixed base model, but very few ask what happens to that personalization when you swap the base model. Think about going from Llama 3 to Llama 4. All the tuned preferences, reward signals, and LoRAs are suddenly tied to yesterday’s model. As a user or a team, you don’t want to reteach every new model your preferences. But you also don’t want to be stuck on an older one just because it knows you. We could call this "RL model transferability": how can an RL trace, a reward signal, or a preference representation trained on model N be distilled, stored, and automatically reapplied to model N+1 without too much user involvement? We solved this for SFT, where a training dataset can be stored and reused to train a future model. We also tackled a version of it in RLHF phases, but the general case of RL deployed in the real world remains unclear. There are some related threads (RLTR for transferable reasoning traces, P-RLHF and PREMIUM for model-agnostic user representations, HCP for portable preference protocols) but the full loop seems under-studied to me. Some of these questions are about off-policy learning, but others are about capabilities versus personalization: which of the old customizations/fixes does the new model already handle out of the box, and which are genuinely user/team-specific and will never be handled by default? The latter you would store in a skill for now, but RL allows going beyond that written-guidance level.
I have surely missed some work so please post any good work you’ve seen on this topic in the comments.
Ronak Malde@rronak_

This paper is so good I almost didn't want to share it. Ignore the OpenClaw clickbait: OPD + RL on real agentic tasks with significant results is very exciting, and moves us away from needing verifiable rewards. Authors: @YinjieW2024 Xuyang Chen, Xialong Jin, @MengdiWang10 @LingYang_PU

Thomas Wolf@Thom_Wolf·
@OpenAI Very cool (and love to see Fineweb here). Are people allowed to iterate on the training data?
kepano@kepano·
I have been working on Obsidian Reader for over a year. I didn't want to share it until I felt it was good enough. It's finally there. Consistent formatting for any article. Outline, syntax highlighting, nice footnotes, adjustable typography. Runs locally. Just rules, no AI.
Workshop Labs@WorkshopLabs·
Letting a provider see all your data is the price of admission for AI. We're changing that. Introducing Silo, the first private post-training and inference stack for frontier models, with hardware-level guarantees that we can’t see your data. Privacy without compromises. 🧵
Thomas Wolf@Thom_Wolf·
@cjpedregal @soleio great to read that, I love Granola. tbh MCP access already felt much more stable/reliable than the previous hacks indeed
Chris Pedregal@cjpedregal·
There are some tweets out there saying that Granola is trying to lock down access to your data. TL;DR: we are actually trying to become more open, not closed. We’re launching a public API next week to complement our MCP. Read on for context. A couple months ago, we noticed that some folks had reverse-engineered our local cache so they could access their meeting data. Our cache was not built for this (it can change at any point), so we launched our MCP to serve this need. The MCP gives full access to your notes and transcripts (all time for paid users, time restricted for free users). MCP usage has exploded since launch, so we felt good about it. A week ago, we updated how we store data in our cache and broke the workarounds. This is on us. Stupidly, we thought we had solved these use cases well enough with our MCP. We’ve now learned that while MCPs are great for connecting to tools like Claude or ChatGPT, they don’t meet your needs for agents running locally or for data export / pipeline work. So we’re going to fix this for you ASAP. First, we’ll launch a public API next week to make it easier for you to pull your data. Second, we’ll figure out how to make Granola work better for agents running locally. Whether that’s expanding our MCP, launching a CLI, a local API, etc. The industry is moving quickly here, so we’d appreciate your suggestions. We want Granola data to be accessible and useful wherever you need it. Stay tuned.
Thomas Wolf retweeted
Elliot Arledge@elliotarledge·
Karpathy asked. I delivered. Introducing OpenSquirrel! Written in pure Rust with GPUI (same as Zed) but with agents as the central unit rather than files. Supports Claude Code, Codex, Opencode, and Cursor (CLI). This really forced me to think through the UI/UX from first principles instead of relying on common Electron slop. github.com/Infatoshi/Open…
Andrej Karpathy@karpathy

Expectation: the age of the IDE is over Reality: we’re going to need a bigger IDE (imo). It just looks very different because humans now move upwards and program at a higher level - the basic unit of interest is not one file but one agent. It’s still programming.

Christos Tzamos@ChristosTzamos·
1/4 LLMs solve research-grade math problems but struggle with basic calculations. We bridge this gap by turning them into computers. We built a computer INSIDE a transformer that can run programs for millions of steps in seconds, solving even the hardest Sudokus with 100% accuracy
Thomas Wolf@Thom_Wolf·
Codexing games together with my 12 yo has been a surprisingly fun dad-son activity over the past couple of months as well. I don’t pretend he’s really learning to code through that, but the very low friction from idea to implementation and the pure pleasure to invent/propose-anything/mix-and-match-game-ideas/collaboratively-create-something-fun is deeply enjoyable. Somewhere between LEGOs and exquisite corpse
Sebastien Bubeck@SebastienBubeck

My 9 yo is now fully independent with codex and it's insane to watch, we built a few games together and then he went off to build his own tower defense, adding features by himself and testing them ... crazy

Thomas Wolf retweeted
Archie Sengupta@archiexzzz·
i spent a few hours going through /karpathy/autoresearch repo line by line. the "ai agents doing research" angle is what's getting all the attention but i think the more interesting thing is what's actually inside the training script and the engineering decisions that make the search loop tight. it's one of the most dense single-file training setups i've read. let me start with the thing that makes the whole project possible: the time budget is fixed at 300 seconds wall clock. not fixed steps, not fixed tokens, not fixed flops. wall clock seconds. this sounds like a minor detail but it's the entire reason the autonomous loop works. the agent can make the model 3x bigger, cut the batch size in half, swap in a completely different architecture, and the result is still directly comparable to every other experiment because they all got exactly 5 minutes of training on the same gpu. if you fixed steps instead, a bigger model would get fewer gradient updates per second and you'd be penalizing it unfairly. if you fixed tokens, you'd have the same problem. fixing wall time means you're asking the right question: given this hardware and this much time, what is the best model you can produce? everything else is a free variable. the agent can explore the full pareto surface of model size vs throughput vs convergence speed without any of those tradeoffs being confounded by the evaluation protocol. the metric is also carefully chosen. it's bits per byte, not cross entropy loss. cross entropy depends on your vocab size. a model with 32k tokens and a model with 8k tokens will have very different loss values even if they compress the data equally well. bpb normalizes this away by summing the per-token cross entropy in nats, summing the utf-8 byte lengths of the target tokens, and converting nats-per-byte to bits-per-byte. so even if the agent changes something that affects the effective token distribution, the comparison remains fair.
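that nats-to-bits-per-byte conversion is easy to make concrete. a minimal sketch (function name and call shape are mine, not the repo's):

```python
import math

def bits_per_byte(nats_per_token, target_tokens):
    """Vocab-invariant eval metric: sum per-token cross-entropy (in nats),
    sum the UTF-8 byte lengths of the target tokens, then convert
    nats/byte -> bits/byte by dividing by ln(2)."""
    total_nats = sum(nats_per_token)
    total_bytes = sum(len(t.encode("utf-8")) for t in target_tokens)
    return (total_nats / total_bytes) / math.log(2)
```

a model that assigns every byte probability 1/2 scores exactly 1.0 bpb no matter how the tokenizer splits the text, which is the point.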
these two choices, fixed wall time and a vocab-invariant metric, turn what would be a messy incomparable search into a clean optimization problem. now the model itself. it's a GPT but with a bunch of modern tricks that are worth understanding. first, RMSnorm everywhere. on the block inputs (pre-norm), and also on queries and keys right before the attention dot product. this QK-norm thing is important because without it the norms of q and k can grow unboundedly during training, causing attention logits to sharpen and softmax to saturate. normalizing q and k keeps the dot products in a stable range regardless of how deep the network is or how training dynamics evolve. the attention itself is FA 3, loaded through the kernels library. it uses varunneal's implementation on hopper (sm_90) and falls back to a community build on older gpus. the attention pattern is "SSSL" which means three layers of sliding window attention (window = half the sequence length) followed by one layer of full causal attention, repeating. this is the sparse-to-dense pattern you see in mistral and gemma2. the local attention layers are computationally cheap because the attention matrix is banded, and the periodic global layer lets information flow across the full context. with 8 layers and a 4-character pattern you get layers 0,1,2 local, layer 3 global, layers 4,5,6 local, layer 7 global. the last layer is forced global regardless of pattern. the value embedding thing is subtle and i think underappreciated. every other layer gets its own embedding table, completely separate from the main token embedding, that maps token ids directly to value-dimension vectors. these get mixed into the attention values through a learned gate: v = v + 2 * sigmoid(W_gate @ x[:32]) * ve. the gate weight is zero-initialized, so sigmoid(0) = 0.5, times 2 gives 1.0, which is a neutral starting point.
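the gated mix can be sketched in numpy (shapes and names are my assumptions, and i use a single scalar gate per position where the repo gates per-head; only the update rule itself follows the thread):

```python
import numpy as np

def gated_value_embedding(v, ve, x, W_gate):
    """v: attention values (seq, d); ve: per-layer value embedding looked up
    by token id (seq, d); x: hidden state (seq, d_model); W_gate: (1, 32),
    zero-initialized so 2 * sigmoid(0) = 1.0 passes ve through unchanged."""
    logits = x[:, :32] @ W_gate.T            # gate on first 32 hidden dims
    gate = 2.0 / (1.0 + np.exp(-logits))     # 2 * sigmoid, in (0, 2)
    return v + gate * ve                     # gated shortcut to token identity
```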
over training the model can learn to amplify or suppress the value embedding per-head based on the first 32 dimensions of the hidden state. this is from the ResFormer line of work and the intuition is that it gives attention a direct shortcut to token identity. the value vectors can carry information about "what token is at this position" without that information having to survive the residual stream transformations from earlier layers. it's essentially a skip connection from the input directly into the attention values, gated so the model can decide when it's useful. there are also per-layer learnable scalars on the residual stream: x = lambda_resid[i] * x + lambda_x0[i] * x0, where x0 is the normalized embedding from layer 0. every layer can independently control how much it listens to the running residual vs the original input. the residual lambdas start at 1.0, the x0 lambdas start at 0.1. this is a soft version of the "disentangled residual" idea. in a standard transformer the residual stream is a sum of all previous layer outputs and it gets increasingly polluted as you go deeper. giving each layer access to the clean original embedding means it doesn't have to learn to "undo" earlier layers to recover low-level information. the logits are softcapped at 15 via tanh(logits/15)*15 which prevents the model from being overconfident early in training when the representations are still noisy. but honestly the most interesting part of the whole file is the optimizer. MuonAdamW is a combined optimizer that dispatches different update rules based on parameter group. embeddings (token embedding, value embeddings, unembedding head) and per-layer scalars get standard AdamW with different learning rates for each group. the spread is wild. embedding lr is 0.6, unembedding lr is 0.004, that's a 150x difference, and it's intentional. the embedding matrix sees every single token and needs to update aggressively.
the unembedding matrix is a linear probe on the final representation and benefits from stability. the embedding, value embedding, and unembedding learning rates are all scaled by (d_model / 768)^(-0.5) which is a muP-inspired correction. as model width changes, those learning rates adjust to keep the feature learning dynamics scale-invariant. the scalar learning rates for the per-layer lambdas are handled separately and don't get this scaling. the 2D weight matrices in the transformer, attention projections and mlp weights, get Muon, and this is where it gets genuinely interesting. muon takes the gradient, applies nesterov momentum, then runs a newton-schulz iteration to approximate the polar decomposition of the gradient matrix. the polar decomposition factors a matrix G into G = U * S where U is orthogonal and S is symmetric positive semi-definite. muon computes U, the nearest orthogonal matrix to the gradient, and uses that as the update direction. the newton-schulz iteration is 5 steps. for tall matrices (more rows than columns), A = X^T @ X then X -> aX + X @ (bA + cA^2). for wide matrices, A = X @ X^T then X -> aX + (bA + cA^2) @ X. the coefficients are hardcoded from a precomputation. they call it "polar express." the whole thing compiles to a single fused kernel via torch.compile. why does this matter? because for weight matrices the frobenius norm gradient (what adam and sgd use) is geometrically wrong. the "correct" steepest descent direction for a weight matrix is the one that minimizes the loss subject to the constraint that the update has unit spectral norm, not unit frobenius norm. the orthogonal polar factor gives you exactly this. in practice it means muon makes much larger effective updates because it's not wasting step size on scaling the singular values. it only rotates them. this is why muon converges significantly faster than adam on transformer weight matrices. 
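the iteration itself is short. a numpy sketch; the coefficients below are the widely circulated Muon quintic values, which may differ from the "polar express" constants the repo hardcodes:

```python
import numpy as np

def newton_schulz_polar(G, steps=5):
    """Approximate the orthogonal polar factor of G with a quintic
    Newton-Schulz iteration, following the tall/wide split in the thread."""
    a, b, c = 3.4445, -4.7750, 2.0315      # example coefficients (Muon's)
    X = G / (np.linalg.norm(G) + 1e-7)     # Frobenius bound => spectral norm <= 1
    for _ in range(steps):
        if X.shape[0] >= X.shape[1]:       # tall: build the smaller Gram matrix
            A = X.T @ X
            X = a * X + X @ (b * A + c * (A @ A))
        else:                              # wide
            A = X @ X.T
            X = a * X + (b * A + c * (A @ A)) @ X
    return X
```

each step applies the polynomial f(s) = a·s + b·s³ + c·s⁵ to the singular values, pushing them toward 1 while leaving the singular vectors alone, which is exactly the "keep the rotation, drop the scale" behavior described above.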
muon does maintain per-element momentum buffers (same shape as the parameters, stacked across each shape group), but unlike adam it doesn't track per-element second moments. the second moment estimates are per-row or per-column after orthogonalization, not per-element. that's where NorMuon comes in. on top of the base muon there's NorMuon, a variance reduction scheme. after orthogonalization, it computes per-row (or per-column depending on aspect ratio) second moment estimates, maintains an exponential moving average of those, and rescales the update so each output dimension gets its own adaptive step size. it's essentially the adam adaptivity idea but applied in the orthogonalized coordinate system rather than the raw parameter space. the weight decay is also non-standard. it's "cautious," meaning it only decays parameters where the muon update direction agrees with the parameter sign: mask = (g * params) >= 0. this avoids the known failure mode where weight decay pushes parameters toward zero against the update's wishes, which can destabilize training. one small detail i appreciated: after the very first training step, the code calls gc.collect(), gc.freeze(), gc.disable() to completely shut off python's garbage collector. python's GC runs periodically and causes ~500ms stalls. when your total budget is 300 seconds and each step is maybe 300ms, a random GC pause costs you almost 2 training steps. they manually trigger gc.collect() every 5000 steps as a compromise. this is the kind of thing you only learn by profiling real training runs and noticing mysterious throughput drops. the first 11 steps (0 through 10) aren't counted toward the time budget either. that's the warmup where torch.compile does its thing and CUDA kernels get JIT'd. without this exclusion, different experiments would get different amounts of "real" training depending on how long compilation takes for that particular model configuration. 
again, a design choice that seems small but is critical for making experiments comparable. now zoom out. the actual autoresearch loop is: the agent reads program.md (a markdown file that describes its job), modifies train.py, commits, runs for 5 minutes, checks if val_bpb improved, keeps or reverts, repeats. program.md explicitly says "NEVER STOP." the agent runs indefinitely until the human kills it. ~12 experiments per hour, ~100 overnight while you sleep. the thing i keep coming back to is how tight the constraints make the problem:
> one file to edit.
> one metric to optimize.
> one gpu.
> five minutes.
> no new dependencies allowed.
the search space is large but the evaluation is fast, cheap, and unambiguous. without the fixed time budget the agent would have to reason about compute-performance tradeoffs which is a much harder problem. without the single-file constraint it could create sprawling multi-file messes that are impossible to revert cleanly. the constraints are what make it work. this is honestly a general lesson in research. the tighter the evaluation protocol, the faster you make progress.
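one more piece from earlier in the thread that is tiny enough to sketch: the cautious weight-decay rule. a hedged numpy version (a decoupled-AdamW-style step; the function name and hyperparameters are mine, only the mask follows the thread):

```python
import numpy as np

def cautious_decay_step(params, update, lr=0.02, weight_decay=0.01):
    """Decay only where the update direction agrees with the parameter sign
    (mask = (update * params) >= 0), so weight decay never pushes a
    parameter toward zero against the update's wishes."""
    mask = (update * params) >= 0
    params = params * (1.0 - lr * weight_decay * mask)  # masked decoupled decay
    return params - lr * update                          # then the usual step
```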
Laura Modiano@LauraModiano·
London to Amsterdam to Stockholm to London to Paris to Berlin hanging with some of the best in the startup ecosystem in Europe is a week very well spent. Now back home for a vibe coding session for 10-year-olds
Christine Yip@christinetyip·
If you're still doing autoresearch alone, you're already behind. Every node is an experiment run by an agent. Every experiment and result is open-source. Your agent could've read these results and adjusted its strategy before running its own experiments. That's the power of autoresearch@home. ~1400 experiments have already been run. And it's growing.
Finn Meeks@finn_meeks·
Anything but casual Friday at SPC! We started with agent workflow demos. Special shoutout to @trq212 for showing us the latest with Claude. And ended with @Thom_Wolf and the @huggingface team stopping by for a Q&A and deep dive on LeRobot. Guest appearance from Reachy Mini!
Thomas Wolf retweeted
Alif Munim (d/acc)@alifmunim·
Since @karpathy kicked off recursive self-improvement a few days ago, I've been thinking about how we can automate interpretability research. I asked Claude to train a sparse autoencoder on Gemma3-1B. It recovered 96% of Gemma's behaviors from interpretable features overnight.
Cheng-Wei Hu@HcwXd·
I left NotebookLM a few months ago to solve a bigger problem in learning. Today, as the first step, we are launching @WonderingApp for early access. It's Duolingo for anything — turning any topic into a guided path with bite-size visual lessons that can fit into your busy schedule. But you don't sacrifice depth/effectiveness for convenience:
Total Control: You decide how deep you want to go, how difficult the material should be, and how personalized the experience feels.
Active Learning: We provide the tools you need to practice, test your understanding, and actually apply what you’ve learned.
Long-term Mastery: It’s built to help you truly remember and master any subject, not just skim the surface.
Thomas Wolf retweeted
AI4Science Catalyst@AI4S_Catalyst·
We’re thrilled to open-source LabClaw — the Skill Operating Layer for LabOS by Stanford-Princeton Team One command turns any OpenClaw agent into a full AI Co-Scientist. Demo: labclaw-ai.github.io Dragon Shrimp Army reporting for duty 🦞🔬 #AIforScience #OpenClaw
Thomas Wolf retweeted
LeRobot@LeRobotHF·
🚀 Scaling every dimension of OSS Robotics! LeRobot v0.5.0 is officially LIVE! With over 200 merged PRs and 50+ new contributors, this is our biggest release yet. Whether you're working in sim or deploying on real hardware, v0.5.0 pushes the boundaries of open-source robot learning. Highlights: * 🤖 First Humanoid Support: Full integration for the Unitree G1, including whole-body control, locomotion, and manipulation! * 🧠 New SOTA Policies: Expanding the zoo with Pi0-FAST (Autoregressive VLAs), Wall-X, X-VLA, and SARM for complex, long-horizon tasks. * ⚡ Real-Time Chunking (RTC): Dramatically more responsive, real-time inference for flow-matching policies. * 🎥 Faster Datasets: New streaming video encoding means zero wait time between recording episodes, plus 10x faster image training. * 🌍 EnvHub & IsaacLab: Load sim environments straight from the Hugging Face Hub, now featuring GPU-accelerated NVIDIA Isaac integration. * 🛠️ Modernized Core: Upgraded to Python 3.12 & Transformers v5, plus a seamless new 3rd-party policy plugin system. This is a massive leap toward general-purpose embodied AI. Read the full announcement in the Release Blog: huggingface.co/blog/lerobot-r… P.S. Keep an eye out... a big surprise is right around the corner! 👕👀
LeRobot tweet media
Thomas Wolf retweeted
LDJ@ldjconfirmed·
In November 2023, Yann LeCun, Thomas Wolf and others from Meta and Hugging Face created a benchmark called GAIA, which described itself as: "A benchmark for General AI Assistants that, if solved, would represent a milestone in AI research." Most of the problem solutions were kept private, not released online. It proposed 466 "real-world questions that require a set of fundamental abilities such as reasoning, multi-modality handling, web browsing, and generally tool-use proficiency." On the hardest level, the average human score was 87%, while the leading systems scored less than 3%. 10 months later, OpenAI released o1-preview, reaching ~30% on that level. Now in 2026 the human baseline for the hardest level has officially been surpassed: the best agent systems are scoring 88.9% on GAIA's hardest level (level 3).