Tim Dettmers

3.8K posts


@Tim_Dettmers

Creator of bitsandbytes. Professor @CarnegieMellon and Research Scientist @allen_ai. I blog about deep learning and PhD life at https://t.co/Y78KDJJFE7.

Pittsburgh, PA · Joined October 2012
899 Following · 44.9K Followers
Pinned Tweet
Tim Dettmers @Tim_Dettmers ·
After 7 months on the job market, I am happy to announce:
- I joined @allen_ai
- Professor at @CarnegieMellon from Fall 2025
- New bitsandbytes maintainer: @Titus_vK
My main focus will be to strengthen open-source for real-world problems and bring the best AI to laptops 🧵
155 replies · 86 reposts · 2.4K likes · 255K views
Tim Dettmers @Tim_Dettmers ·
If these sizes are true, that is pretty devastating for closed-source labs. Training very large-scale models is difficult due to unexplained, sudden, rogue loss spikes. But once you manage that, it is mostly a matter of spending more compute on your model.
Bojie Li @bojie_li

Closed labs hide model sizes. They can't hide what their models know, and what a model knows is an indicator of how big it is. Reasoning compresses; factual knowledge doesn't. So you can size a frontier model from black-box API calls alone, and across releases you can literally watch a single fact arrive in the parameters over time.

For three years, my friends Jiyan He and Zihan Zheng have been asking frontier LLMs the same question: "What do you know about USTC Hackergame?", a CTF contest. May 2024: GPT-4o invented fake titles. Feb 2025: Claude 3.7 Sonnet listed 19 verified 2023 challenges. By April 2026, frontier models recall specific challenges across consecutive years.

After DeepSeek-V4 dropped, I instructed my agent to spend four days autonomously turning that habit into Incompressible Knowledge Probes (IKP): 1,400 questions, 7 tiers of obscurity, 188 models, 27 vendors. Three findings:

1/ You can approximately size any black-box LLM from factual accuracy alone. Penalized accuracy is log-linear in log(params), R² = 0.917 on 89 open-weight models from 135M to 1.6T params. Projecting closed APIs onto the curve gives: GPT-5.5 ~9T, Claude Opus 4.7 ~4T, GPT-5.4 ~2.2T, Claude Sonnet 4.6 ~1.7T, Gemini 2.5 Pro ~1.2T (90% CI: 0.3-3x size).

2/ Citation count and h-index don't predict whether a frontier model recognizes a researcher. Two researchers with similar citation profiles can get very different responses. Models memorize impact: work that shaped a field, not many incremental papers.

3/ Factual capacity doesn't compress over time. Across 96 open-weight models spanning 3 years, the IKP time coefficient is statistically zero, rejecting the Densing-Law prediction of +0.0117/month at p < 10⁻¹⁵. Reasoning benchmarks saturate; factual capacity keeps scaling with parameters.

Website: 01.me/research/ikp/
Paper: arxiv.org/pdf/2604.24827

18 replies · 14 reposts · 296 likes · 75.9K views
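The sizing method in finding 1/ can be sketched in a few lines: fit accuracy as a linear function of log-parameters on open-weight models, then invert the fit for a black-box model. The accuracy and parameter values below are invented for illustration (not the actual IKP data), and `estimate_params_b` is a hypothetical helper, not part of the paper.

```python
import numpy as np

# Hypothetical (params in billions, penalized factual accuracy) pairs for
# open-weight models -- illustrative values only, not the IKP measurements.
params_b = np.array([0.135, 1.0, 7.0, 70.0, 405.0, 1600.0])
accuracy = np.array([0.02, 0.08, 0.16, 0.25, 0.31, 0.36])

# Fit accuracy ~ a * log10(params) + b, the log-linear relation from the thread.
a, b = np.polyfit(np.log10(params_b), accuracy, 1)

def estimate_params_b(acc: float) -> float:
    """Invert the fit to size a black-box model from its factual accuracy."""
    return 10 ** ((acc - b) / a)

# A closed model's measured accuracy projects onto the curve:
print(f"~{estimate_params_b(0.33):.0f}B params")
```

Because the slope is positive, higher factual accuracy always maps to a larger size estimate; the thread's 90% CI of 0.3-3x reflects the scatter around such a fit.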
Tim Dettmers retweeted
Reiner Pope @reinerpope ·
Intelligence per picojoule, with @itsclivetime and @dylan522p
(0:00) Intro
(1:22) What is codesign?
(2:49) Codesign example: Swish vs ReLU
(4:22) Are DeepSeek papers codesign?
(6:45) Predicting where ML research will go
(8:06) Should researchers hate your chips?
(9:34) Can you codesign too much?
(13:23) Picking the right grain size for specialization
(16:22) How much hardware flexibility for The Age of Research?
(20:05) Did reasoning and RL disrupt hardware roadmaps?
(23:09) Cerebras/Groq: unexpected wins on reasoning and RL
(25:34) Disaggregating MLP and attention
(29:06) The right metrics for quantization and codesign papers
11 replies · 56 reposts · 601 likes · 140.3K views
Tim Dettmers @Tim_Dettmers ·
So cool to see that open source, with open experimentation (and with the help of someone posting blog posts about their personal research), can yield a very robust method for MoE balancing. This method seems more elegant than any other method I have seen. Open source is awesome!
Percy Liang @percyliang

Marin is using quantile balancing from @Jianlin_S (who developed RoPE, which was also a good idea) to train our current 1e23 FLOPs MoE. The idea is elegant: assign tokens to experts by solving a linear program. No hyperparameters to tune. Yields stable training.

3 replies · 9 reposts · 81 likes · 18.9K views
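The linear-program assignment Percy describes can be illustrated with a toy transportation LP: maximize router affinity subject to each token going to one expert and each expert staying within capacity. This is my own sketch using `scipy.optimize.linprog` under an assumed balanced capacity, not Marin's actual quantile-balancing implementation.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
T, E = 8, 4          # tokens, experts
cap = T // E         # assumed balanced capacity per expert
scores = rng.normal(size=(T, E))  # toy router affinities

# Variables x[t, e] flattened row-major; maximize affinity = minimize -scores.
c = -scores.ravel()

# Equality: each token is assigned to exactly one expert.
A_eq = np.zeros((T, T * E))
for t in range(T):
    A_eq[t, t * E:(t + 1) * E] = 1
b_eq = np.ones(T)

# Inequality: each expert receives at most `cap` tokens.
A_ub = np.zeros((E, T * E))
for e in range(E):
    A_ub[e, e::E] = 1
b_ub = np.full(E, cap)

res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=(0, 1))
assign = res.x.reshape(T, E).argmax(axis=1)  # token -> expert
```

The constraint matrix here is a transportation polytope, so vertex solutions of the relaxed LP are integral and the argmax recovers a hard, perfectly balanced assignment with no auxiliary load-balancing loss to tune.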
Tim Dettmers @Tim_Dettmers ·
Something is about to drop 🔥
[image]
0 replies · 3 reposts · 76 likes · 6.4K views
Tim Dettmers retweeted
Qwen @Alibaba_Qwen ·
⚡ Meet Qwen3.6-35B-A3B: Now Open-Source! 🚀🚀
A sparse MoE model, 35B total params, 3B active. Apache 2.0 license.
🔥 Agentic coding on par with models 10x its active size
📷 Strong multimodal perception and reasoning ability
🧠 Multimodal thinking + non-thinking modes
Efficient. Powerful. Versatile. Try it now👇
Blog: qwen.ai/blog?id=qwen3.…
Qwen Studio: chat.qwen.ai
HuggingFace: huggingface.co/Qwen/Qwen3.6-3…
ModelScope: modelscope.cn/models/Qwen/Qw…
API ('Qwen3.6-Flash' on Model Studio): Coming soon~ Stay tuned
[image]
449 replies · 1.7K reposts · 11.6K likes · 2.7M views
Tim Dettmers retweeted
Zhihao Jia @JiaZhihao ·
The #MLSys2026 program is out, and it is awesome!
📄 107 research papers + 28 industry papers spanning the full AI systems stack
🏆 Three exciting contests: AWS Trainium programming, Google graph scheduling, and NVIDIA AI kernel generation
🎤 Keynotes from an outstanding lineup: Amin Vahdat (Google) on infra; @LukeZettlemoyer (UW & Meta) on models; @kozyraki (Stanford & NVIDIA) on architecture; Lidong Zhou (Microsoft) on systems; and @marksaroufim (GPUMode) on GPUs and kernels.
Join us in Bellevue, WA in a month! Early registration ends April 19, don't miss it: mlsys.org
[image]
1 reply · 18 reposts · 104 likes · 31.4K views
Tim Dettmers retweeted
Liang Chen @liangchen5518 ·
GLM 5.1 from @Zai_org ranks as the top open model on the newly released Monthly-SWEBench by @UniPat_AI, second only to Claude-Opus-4.6. Congrats to the team! 🚀
Explore the benchmark: unipat.ai/benchmarks/Mon…
[image]
1 reply · 19 reposts · 135 likes · 15.3K views
Tim Dettmers retweeted
Graham Neubig @gneubig ·
Everyone's talking about Anthropic's new model discovering new security vulnerabilities. What people aren't talking about is the millions of KNOWN vulnerabilities remaining unfixed due to lack of time, interest, etc. E.g., OpenClaw has 67 CVEs right now, including 4 critical ones.
[image]
9 replies · 27 reposts · 145 likes · 11.6K views
Tim Dettmers @Tim_Dettmers ·
I was going crazy because I could not replicate TurboQuant. Turns out the community also had issues. They quickly made adjustments to "make it work", but what they did not realize is that they reimplemented (most of) HIGGS in the process (full HIGGS would be even better).
14 replies · 72 reposts · 847 likes · 97.5K views
Tim Dettmers retweeted
Tim Dettmers @Tim_Dettmers ·
@YouJiacheng There is more discussion below that comment. It is all a bit complicated, with evidence going back and forth. But there seems to be pretty strong evidence that QJL does not help. It certainly does not help in my own benchmarks.
1 reply · 0 reposts · 1 like · 145 views
Tim Dettmers retweeted
Nicholas Boffi @nmboffi ·
🤯 big update to our flow map language models paper! we believe this is the future of non-autoregressive text generation.

read about it in the blog: one-step-lm.github.io/blog/
full details in the paper: arxiv.org/abs/2602.16813

we introduce a new class of continuous flow-based language models and distill them into their corresponding flow map for one-step text generation. we beat all discrete diffusion baselines at ~8x speed!

v2 gives a complete theory of the flow map over discrete data, with three equivalent ways to learn it (semigroup, lagrangian, eulerian). it turns out you can train these with cross-entropy objectives that look very similar to standard discrete diffusion, but without the factorization error that kills discrete methods at few steps.

beyond improving results across the board, we showcase properties that are unique to continuous flows. in particular, inference-time steering and guidance become straightforward. autoguidance brings generative perplexity down to 51.6 on LM1B, while discrete baselines completely collapse at the same guidance scale. we also show reward-guided generation for steering topic, sentiment, grammaticality, and safety at inference time, and it works even at 1-2 steps with our flow map model.

simple, well-understood techniques from continuous flows just work incredibly well in practice for language. we're extremely excited about the future of this class of models. stay tuned for results on scaling, reasoning, and reinforcement learning-based fine-tuning. 🚀
13 replies · 90 reposts · 472 likes · 72.5K views
Tim Dettmers retweeted
Sam Bowman @sleepinyourhat ·
Mythos Preview seems to be the best-aligned model out there on basically every measure we have. But it also likely poses more misalignment risk than any model we’ve used: Its new capabilities significantly increase the risk from any bad behavior. 🧵
[image]
54 replies · 190 reposts · 1.4K likes · 978.4K views
Tim Dettmers retweeted
Z.ai @Zai_org ·
Introducing GLM-5.1: The Next Level of Open Source
- Top-Tier Performance: #1 in open source and #3 globally across SWE-Bench Pro, Terminal-Bench, and NL2Repo.
- Built for Long-Horizon Tasks: Runs autonomously for 8 hours, refining strategies through thousands of iterations.
Blog: z.ai/blog/glm-5.1
Weights: huggingface.co/zai-org/GLM-5.1
API: docs.z.ai/guides/llm/glm…
Coding Plan: z.ai/subscribe
Coming to chat.z.ai in the next few days.
[image]
550 replies · 1.3K reposts · 10.9K likes · 4.3M views
Tim Dettmers @Tim_Dettmers ·
@turbo_xo_ @yacineMTB For weights, too. HIGGS is flexible. It is difficult to improve on for data-free quantization methods.
0 replies · 0 reposts · 12 likes · 2.6K views
Tim Dettmers @Tim_Dettmers ·
We in the quantization community could quickly see this and were flabbergasted by the response to TurboQuant. Whenever I saw TurboQuant on my timeline, I found it hurtful, because the work of other academics who worked so hard was being discounted.
9 replies · 12 reposts · 234 likes · 19.2K views