Tijmen Blankevoort
@TiRune
499 posts
Amsterdam, The Netherlands · Joined May 2009
210 Following · 674 Followers
Tijmen Blankevoort reposted
Bryan Catanzaro
Bryan Catanzaro@ctnzr·
Today, @NVIDIA is launching the open Nemotron 3 model family, starting with Nano (30B-3A), which pushes the frontier of accuracy and inference efficiency with a novel hybrid SSM Mixture of Experts architecture. Super and Ultra are coming in the next few months.
41 replies · 224 reposts · 1.2K likes · 503K views
Chris Barber
Chris Barber@chrisbarber·
I made an unofficial NeurIPS 2025 hiring list:
- @rronak_, @QuantumArjun, @michaelelabd, stealth (I’m a small investor): RL post-training from live product usage. Research Engineers.
- @jonsidd, Turing: data for frontier models. Research Engineers, SWEs.
- @schwarzjn_, ICL & Thomson Reuters: LLMs for law. Research Engineers, SWEs, PhD students.
- @panda_liyin, AdaL: copilot for ML engineering. MLEs, SWEs.
- @sarwal_varuni, TriFetch: data and post-training for medical AI.
- @bidhan, Bagel Labs: decentralized training for diffusion models. MLEs, ML Scientists.
- @meggmcnulty, Cosmic Labs: AI-native OS for embedded engineering. MLEs, SWEs, systems engineers.
- @samuelekpe, GrupaAI: operating system for AI agents. SWEs.
- @jaradcannon, Humanoid: industrial humanoids. SWEs and applied researchers.
- @saurabh_here1, Cantina: AI-native social media. Research interns for video gen.
- @RicardoMonti9, DatologyAI: frontier data curation (filtering, mixing, synthetic) for LLMs. Research Scientists, MLEs, SWEs.
- @NimaGard, Path Robotics: physical AI to automate manufacturing tasks (e.g. welding). MLEs for robot learning.
- @DrJimFan, Nvidia robotics team. Research Engineers, SWEs.
- @katherine1ee, OpenAI pretraining safety team. Research Engineers.
- @BorisMPower, OpenAI applied AI research team. Research Engineers.
- @j_asminewang, OpenAI alignment team. Research Engineers, Research Scientists.
- @zijianwang30, MSL data research team. Research Engineers, Research Scientists.
- @RuiqiGao, Google DeepMind video gen team. Research Engineers, Research Scientists.
- @joshim5, Chai Discovery: molecule prediction for drug discovery. Research Engineers, SWEs.
- @crisbodnar, Project Prometheus: AI for manufacturing and logistics. Research Engineers.
- @vdbergrianne, Microsoft Research Amsterdam materials science team. Research Engineers.
- @kamath_sutra, Smallest: AI for call centers. SWEs.
- @idavidrein, METR: frontier model evaluation. Research Engineer.
- @jimmysmith1919, Liquid AI: on-device models. MLEs, Research Engineers.
- @alxndrdavies, AI Security Institute: red-teaming. Research Scientists/Engineers.
- @stuhlmueller, Elicit: AI for scientific research and good reasoning. MLEs, SWEs.
- @gavincrooks, @FarisSbahi, Normal Computing: physics-based ASICs. Research Engineers, SWEs.
- @myra_deng, Goodfire AI: interpretability research. Research Engineers, Research Scientists, MLEs.
- @_lychrel, @SergeiIakhnin, @ja_kirkpatrick, @sbos, Isomorphic Labs: AI-first drug discovery. Research Engineers, Research Scientists, MLEs.
- @kdqg1, @bneyshabur, Anthropic AI Scientist team. Research Engineers with infra experience.
- @sarahookr, Adaption: continuous learning. Research Engineers.
- @francedot, Cua (I’m a small investor): infra for computer-use agents. SWEs, Research Engineers.
- @iScienceLuvr, Sophont: multimodal models for healthcare. Research Engineers/Research Scientists.
- @aditshah00, Until Labs: organ preservation. MLEs.
- @RitvikKapila & @gauri__gupta, NeoSigma: evals and post-training for real-world agents. SWEs.
- @abeirami, stealth: reliability & statistical evaluation. Research Engineers & SWEs.
- @adityachinchure, Ideogram: image generation. Research Engineers.
- @AndrewLBeam, @kenneth0stanley, Lila Sciences: autonomous labs, verifiability for science. Research Engineers, MLEs.
- @brianwilt, Waymo: ML infra for motion planning team. Senior SWEs.
- @thisismadani, Profluent Bio: protein generation for drug development. MLEs.
19 replies · 41 reposts · 429 likes · 62.2K views
Romi Lifshitz
Romi Lifshitz@RomiLifshitz·
@TiRune Would love to chat! (but your DMs are closed!)
1 reply · 0 reposts · 0 likes · 94 views
Tijmen Blankevoort
Tijmen Blankevoort@TiRune·
Looking for cracked full-time deep learning researchers in Efficiency, Quantization and Sparsity. Join our world-class applied deep learning research team at Nvidia. Our team creates the Nemotron models, and we influence the hardware with our research. Shoot me a message! I'm at NeurIPS!
2 replies · 1 repost · 10 likes · 1.7K views
Xiangming Gu
Xiangming Gu@gu_xiangming·
Congratulations to @Alibaba_Qwen for winning the NeurIPS 2025 Best Paper Award. Great to hear that attention sink attracts a lot of attention. Why I think gated attention eliminates the attention sink: the gate mechanism implements a "no-op" (not updating token representations), removing the need to develop an attention sink to achieve the same effect. Please also check our two papers on when attention sinks emerge in LLMs (openreview.net/forum?id=78Nn4…) and why LLMs need attention sinks (arxiv.org/abs/2504.02732). In my first paper, I showed some attention variants that are attention-sink-free, like sigmoid attention and some linear attention.
Qwen@Alibaba_Qwen

🏆 We are incredibly honored to announce that our paper, "Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free" has received the NeurIPS 2025 Best Paper Award! A huge congratulations to our dedicated research team for pushing the boundaries of AI. Read more: blog.neurips.cc/2025/11/26/ann…

4 replies · 62 reposts · 521 likes · 147.3K views
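The softmax-vs-sigmoid point above has a simple numerical intuition. Here is a toy NumPy sketch (illustrative only; the scores are made up and nothing here is from the cited papers):

```python
import numpy as np

def softmax(scores):
    # Softmax weights must sum to 1, so attention mass has to land somewhere
    e = np.exp(scores - scores.max())
    return e / e.sum()

def sigmoid(scores):
    # Sigmoid (gated) weights are independent, so all of them can be ~0 at once
    return 1.0 / (1.0 + np.exp(-scores))

# A query whose scores say "attend to nothing" (all very negative)
scores = np.array([-10.0, -12.0, -11.0, -9.0])

w_soft = softmax(scores)
w_sig = sigmoid(scores)

print(w_soft.sum())  # 1.0: softmax must park the mass on some token (a sink)
print(w_sig.sum())   # ~2e-4: the head can genuinely perform a "no-op"
```

Because softmax normalizes, a head that wants to skip an update still has to dump its mass somewhere; sigmoid-style and gated variants drop that constraint, which is consistent with the sink-free behavior described above.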
Lisan al Gaib
Lisan al Gaib@scaling01·
@david_sepulvado Very high. OpenAI released their open-source models 3 months ago in FP4, and other open-source models come natively with QAT in FP4, like Kimi K2 Thinking. Google pioneered a lot of these techniques.
1 reply · 0 reposts · 14 likes · 4.2K views
Lisan al Gaib
Lisan al Gaib@scaling01·
Gemini 3 Pro has around ~7.5T params (vibe-mathing with explanation)
> the naive fit with an R^2 of 0.8816 yields a mean estimation of 2.325 quadrillion parameters
> ummm, that's not it
> let's only take sparse MoE reasoning models
> this includes gpt-oss-20B and 120B, Qwen3 Next, MiniMax, Qwen3 235B, GLM-4.6, DeepSeek-V3.1 Terminus, DeepSeek V3.2, DeepSeek R1 0528 and Kimi K2 Thinking
> R^2 of 0.9478, mean estimate of 604T params
> pretty sure that's not it either
> okay, let's take the most optimistic series of points
> (the idea here is that the Google team is at least on this open-source frontier, if not ahead)
> MiniMax-M2, GLM-4.6, and DeepSeek R1 0528
> that's more like it, but YIKES
> confidence intervals are fucking cooked
> mean estimate of 19.6T with the lower 95% bound at 1.7T
> I will take 1.7T as our minimum model size for Gemini 3 Pro
> okay fuck DeepSeek-R1, we are going full retard, the most optimal of points
> confidence intervals are dead
> 2-point regression, R^2 = 1, AGI achieved
> mean estimate of 8.2T params
> TPUv7 rack has 64 TPUs @ 192GB/TPU = 12288GB
> I assume they wouldn't want multi-rack inference because of latency, complexity or whatever
> they are likely serving in FP4, which limits the maximum model to 24.576T params
> inference max shows that a GB200 NVL72, which is very similar to a TPUv7 rack setup, can serve 512 or even 1024 users at above 50 tokens/s
> KV size only scales with layers, latent dim and data format; for DeepSeek V3 with MLA this would be 4.48TB for 256 concurrent users at 1 million context and FP4 (they probably have something better than this; since I overestimate memory usage I go with the lower batch size of 256 instead of 512)
> so 4.48TB for context and 1TB of overhead
> ~5.5TB of our precious memory gone
> ~6.788TB memory left
> max model size at FP4 -> ~12.576T params

My prior vibe-estimate before doing all of this: 5-10T
Mean estimate based on open-source MoE reasoning models: 8.2T
Lower bound: 1.7T
Upper bound: 12.576T
Midpoint between upper and lower bound: 7.138T

New estimate: Gemini 3 Pro has around ~7.5T params (big uncertainty here due to data format, batch size and memory requirements)
55 replies · 48 reposts · 660 likes · 200.7K views
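The memory arithmetic in the thread above can be checked in a few lines. A minimal sketch, reproducing only the thread's own stated assumptions about TPUv7 racks and FP4 serving (none of these are confirmed specs):

```python
# Back-of-envelope from the thread: rack HBM, FP4 ceiling, and the midpoint
# of the thread's lower/upper parameter bounds.
tpus, gb_per_tpu = 64, 192
rack_tb = tpus * gb_per_tpu / 1000   # 12.288 TB of HBM per rack
fp4_bytes = 0.5                      # 4-bit weights = half a byte per param

print(rack_tb / fp4_bytes)           # 24.576 (T params): absolute FP4 ceiling

used_tb = 5.5                        # thread's KV-cache + overhead estimate
print(rack_tb - used_tb)             # ~6.788 TB left for weights

lower, upper = 1.7, 12.576           # thread's bounds, in T params
print((lower + upper) / 2)           # 7.138: source of the ~7.5T headline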
Andrej Karpathy
Andrej Karpathy@karpathy·
@eigenrobot World of Warcraft Classic grinding mobs, simple questing is mine. Repetitive skill rotation with just enough variety to keep it fun/engaging but easy. A lot of *wrong* answers in the replies here: games that are nowhere near mindless enough, e.g. Factorio.
27 replies · 7 reposts · 597 likes · 66.8K views
eigenrobot
eigenrobot@eigenrobot·
any good video games for zoning out and listening to podcasts?
1.3K replies · 27 reposts · 2.4K likes · 445.1K views
pushkar
pushkar@thepushkarp·
this was a good read, esp the comparison of attention to graphs. i didn't understand all of it though. looking for more reads around attention sinks. what should i be looking at?
9 replies · 75 reposts · 784 likes · 44.1K views
Tim Dettmers
Tim Dettmers@Tim_Dettmers·
I do not get why Dan's group does not get more attention: best quantization methods, best quantization kernels, and they even put everything into open-source libraries. Meanwhile, we see slop papers/software explode. If frontier labs ask me whose students to hire, I go like 👇
Dan Alistarh@DAlistarh

We're releasing the DASLab GGUF Quantization Toolkit! 🚀 First open-source toolkit bringing GPTQ + EvoPress to @ggerganov's GGUF format, enabling heterogeneous quantization based on importance. Result: Better models at the same file size. [1/5]

8 replies · 25 reposts · 358 likes · 49.6K views
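The "heterogeneous quantization based on importance" idea in the quoted announcement can be sketched in miniature. This is my own toy illustration of the concept, not the DASLab toolkit's actual algorithm; the layer names and importance scores are invented:

```python
# Toy importance-based bit allocation: spend more bits on layers that matter
# more, while hitting an overall average-bit budget.
importance = {"attn.0": 0.9, "mlp.0": 0.4, "attn.1": 0.8, "mlp.1": 0.2}
budget_bits = 4.0  # hypothetical target average bits per weight

# Rank layers by importance; give the top half 6 bits and the bottom half
# 2 bits, which averages to 4 bits when layers are equally sized.
ranked = sorted(importance, key=importance.get, reverse=True)
half = len(ranked) // 2
assignment = {name: (6 if i < half else 2) for i, name in enumerate(ranked)}

avg = sum(assignment.values()) / len(assignment)
print(assignment)  # important attention layers get 6 bits, MLPs get 2
print(avg)         # 4.0: on budget
```

Real methods search this assignment far more carefully (e.g. evolutionary search over per-layer configurations), but the budget-constrained trade-off is the same.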
Tijmen Blankevoort
Tijmen Blankevoort@TiRune·
@awnihannun @thinkymachines Why would smaller formats give a larger effect on non-associativity? Wouldn’t it be the other way around as scales tend to be in the same order of magnitude? I tested effects of e.g. order of operations on fp8 before and it was minuscule.
0 replies · 0 reposts · 0 likes · 210 views
Awni Hannun
Awni Hannun@awnihannun·
Here's a one-line code summary in MLX of the @thinkymachines blog post on non-determinism in LLM inference. I'd guess the difference is larger the lower the precision, as you get larger effects from non-associativity of FP math. Interestingly, that implies that training at low precision (think NVFP4) might make generation much more sensitive to batch size.
6 replies · 5 reposts · 280 likes · 27.8K views
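The precision-vs-non-associativity question being debated here can be probed with a minimal NumPy sketch (a toy scalar example of my own, not the MLX code from the tweet):

```python
import numpy as np

def assoc_gap(dtype):
    # (small + big) - big  vs  small + (big - big): equal in real math,
    # not in floating point, and the gap grows as precision shrinks.
    small, big = dtype(0.1), dtype(1e4)
    left = (small + big) - big
    right = small + (big - big)
    return abs(float(left) - float(right))

print(assoc_gap(np.float16))  # ~0.1: the small term is swallowed entirely
print(assoc_gap(np.float32))  # ~4e-4
print(assoc_gap(np.float64))  # ~1e-12 or below
```

In this toy case lower precision clearly amplifies the reordering error, though whether that dominates in practice depends on the scale spread of the operands, which is Tijmen's counterpoint below about block scales keeping magnitudes similar.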
Tijmen Blankevoort
Tijmen Blankevoort@TiRune·
@richpinky3 @awnihannun @thinkymachines There's no reproducibility across hardware in general. This was discussed in the FP8 standards meetings: for portability you'd have to define, down to the bit, how each deep learning operation behaves. Not feasible, and networks are generally robust to noise, so it's likely unnecessary.
0 replies · 0 reposts · 0 likes · 25 views
pinkY
pinkY@richpinky3·
@awnihannun @thinkymachines Non-determinism at FP4/FP8 feels like an underexplored alignment risk too — if outputs drift with batch size/precision, how do we trust reproducibility across hardware? Curious if anyone is benchmarking sensitivity vs. stability tradeoffs systematically yet?
1 reply · 0 reposts · 11 likes · 1.3K views
Tijmen Blankevoort
Tijmen Blankevoort@TiRune·
@Tim_Dettmers I'd be up for contributing to this. I was already getting a bit annoyed that the wheel kept being reinvented.
0 replies · 0 reposts · 2 likes · 184 views
Tim Dettmers
Tim Dettmers@Tim_Dettmers·
I should really write a blog post about how attention sinks relate to outliers and information processing in transformers. Almost all the data is out there in papers, and if you pull things together it is easier to understand what is going on.
tensorqt@tensorqt

attention sinks may be a bias in causal transformers. as some of you know, i've been writing a long blogpost on attention and its properties as a message-passing operation on graphs. while doing so, i figured i might have found an explanation for why attention sinks may be an *intrinsic bias of causal transformers' learning dynamics*, rather than a desirable learnable feature. this prompted me to slice up my long blogpost into a series of chapters, of which this is the first. many thanks to @zmkzmkz, @Niccolg92, @thelokasiffers and @fabmilo who, among others (acknowledged in the post), gave me precious feedback on this first blogpost of mine. Please let me know what you think if you end up reading it; it's definitely a very early hypothesis that i'm more than willing to challenge. A link to the post is in the first reply

24 replies · 53 reposts · 869 likes · 75.6K views
Tijmen Blankevoort
Tijmen Blankevoort@TiRune·
@YouJiacheng Can only guess at this point, but: the power-of-two exponent formats are good for hardware, but not great for accuracy; you want some mantissa bits. The E4M3 choice (7 bits plus sign) might have been done for ease of implementation in hardware: use the signed format, but set the sign bit to 0.
0 replies · 0 reposts · 0 likes · 74 views
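The precision-vs-range trade-off behind this exchange can be made concrete with a toy model of miniature float formats. This is my own simplification (ignoring exponent bias, subnormals and NaN encodings; not NVIDIA's spec):

```python
def fmt_stats(exp_bits, man_bits, signed):
    # Toy characteristics of a miniature float format (normals only):
    # values per octave = 2^mantissa_bits, worst-case relative rounding
    # error ~ 2^-(mantissa_bits + 1), dynamic range set by exponent bits.
    total_bits = exp_bits + man_bits + (1 if signed else 0)
    vals_per_octave = 2 ** man_bits
    rel_err = 2.0 ** -(man_bits + 1)
    return total_bits, vals_per_octave, rel_err

print(fmt_stats(4, 3, signed=True))   # E4M3:  (8, 8, 0.0625)
print(fmt_stats(4, 4, signed=False))  # UE4M4: (8, 16, 0.03125)
```

Both formats spend 8 bits, but spending the sign bit on an extra mantissa bit doubles the grid density at the same dynamic range, which is the precision argument in the replies above; the counter-argument is that reusing the standard signed E4M3 datapath is cheaper in hardware.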
You Jiacheng
You Jiacheng@YouJiacheng·
why does NVFP4 use an E4M3 scale instead of UE4M4?
4 replies · 0 reposts · 14 likes · 4.7K views
Tijmen Blankevoort
Tijmen Blankevoort@TiRune·
@YouJiacheng You could, but for both training and inference, if you’re setting the min-max range anyway, don’t you want more precision than dynamic range?
0 replies · 0 reposts · 1 like · 170 views
Tijmen Blankevoort
Tijmen Blankevoort@TiRune·
@JohnCLangford @manan_tomar You can do this in quantization - add ‘quantization-like’ noise to regularize for quantization. Though, it always works less well than quantization-aware training, which allows weights to converge and settle to specific discrete values. Best is still to train in ‘discrete’.
1 reply · 0 reposts · 0 likes · 63 views
John Langford
John Langford@JohnCLangford·
@manan_tomar In a learning context where discretization is inherently tricky with gradient descent, is there an alternative viable approach where you keep things continuous but inject noise regularly so the system learns to purge errors?
1 reply · 0 reposts · 1 like · 310 views
Frank Karsten
Frank Karsten@karsten_frank·
Edge of Eternities Limited has been stellar: I launched two 7-1 Arena Direct finishes with these Sealed builds! 🚀
2 replies · 1 repost · 31 likes · 9.6K views