
Tijmen Blankevoort
@TiRune

🏆 We are incredibly honored to announce that our paper, "Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free", has received the NeurIPS 2025 Best Paper Award! A huge congratulations to our dedicated research team for pushing the boundaries of AI. Read more: blog.neurips.cc/2025/11/26/ann…
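
For anyone curious what the mechanism in the title looks like, here is a minimal sketch of output gating on causal attention: a sigmoid gate, computed from the layer input, multiplies the attention output. This only illustrates the idea named in the title; the paper's exact formulation may differ, and all names and shapes below are my own assumptions.

```python
# Minimal sketch of gated causal attention (illustrative, not the paper's exact design).
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedAttention(nn.Module):
    def __init__(self, d_model, n_heads):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.gate = nn.Linear(d_model, d_model)   # sigmoid gate on the attention output
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x):                          # x: (B, T, d_model)
        B, T, D = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k, v = (t.view(B, T, self.n_heads, self.d_head).transpose(1, 2) for t in (q, k, v))
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        y = y.transpose(1, 2).reshape(B, T, D)
        # the gate adds a non-linearity after attention and lets the model damp an
        # output directly, rather than parking attention mass on a sink token
        return self.out(torch.sigmoid(self.gate(x)) * y)

print(GatedAttention(64, 4)(torch.randn(2, 8, 64)).shape)  # torch.Size([2, 8, 64])
```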

Several of my team members and I are impacted by this layoff today. Feel free to connect :)



We're releasing the DASLab GGUF Quantization Toolkit! 🚀 First open-source toolkit bringing GPTQ + EvoPress to @ggerganov's GGUF format, enabling heterogeneous quantization based on importance. Result: Better models at the same file size. [1/5]
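
To make "heterogeneous quantization based on importance" concrete, here is a hedged sketch of the underlying idea: given per-tensor importance scores and a file-size budget, spend more bits on the tensors that matter most. This is an illustration only; the function, layer names, and scores below are made up and are not the toolkit's actual API.

```python
# Hypothetical sketch of importance-based heterogeneous bit allocation.
# Layer names, scores, and the greedy rule are illustrative assumptions.

def allocate_bits(importance, size_budget_bits, params_per_layer, choices=(2, 3, 4, 8)):
    """Greedily give more important layers higher bit-widths while keeping
    the total quantized size under `size_budget_bits`."""
    # Start every layer at the lowest precision.
    assignment = {name: min(choices) for name in importance}
    total = sum(assignment[n] * params_per_layer[n] for n in assignment)

    # Visit layers from most to least important and upgrade them one
    # precision step at a time while the budget allows it.
    for name in sorted(importance, key=importance.get, reverse=True):
        for bits in sorted(choices):
            if bits <= assignment[name]:
                continue
            extra = (bits - assignment[name]) * params_per_layer[name]
            if total + extra > size_budget_bits:
                break
            assignment[name] = bits
            total += extra
    return assignment


if __name__ == "__main__":
    importance = {"blk.0.attn_q": 0.9, "blk.0.ffn_up": 0.4, "blk.1.attn_q": 0.7}
    params = {"blk.0.attn_q": 4096 * 4096, "blk.0.ffn_up": 4096 * 11008, "blk.1.attn_q": 4096 * 4096}
    budget = int(sum(params.values()) * 4.5)  # ~4.5 bits per parameter on average
    print(allocate_bits(importance, budget, params))
```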

attention sinks may be a bias in causal transformers. as some of you know, i've been writing a long blogpost on attention and its properties as a message-passing operation on graphs. while doing so, i figured i might have found an explanation for why attention sinks may be an *intrinsic bias of causal transformers' learning dynamics*, rather than a desirable learnable feature. this prompted me to slice up my long blogpost into a series of chapters, of which this is the first. many thanks to @zmkzmkz, @Niccolg92, @thelokasiffers and @fabmilo who, among others (acknowledged in the post), gave me precious feedback on this first blogpost of mine. Please let me know what you think if you end up reading it; it's definitely a very early hypothesis that i'm more than willing to challenge. A link to the post is in the first reply
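
For readers who haven't seen an attention sink plotted before, here is a small sketch (mine, not from the blogpost) of the quantity usually measured: the attention mass each query position assigns to the first token under causal softmax attention. In trained models that column is often disproportionately large; with the random q/k used here it isn't.

```python
# Hedged illustration of how attention-sink mass is typically measured.
import torch
import torch.nn.functional as F

def sink_mass(q, k):
    """Fraction of attention each query position gives to token 0 under
    standard causal softmax attention."""
    d = q.shape[-1]
    scores = q @ k.transpose(-2, -1) / d**0.5            # (T, T) logits
    T = scores.shape[-1]
    mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(mask, float("-inf"))      # causal mask
    attn = F.softmax(scores, dim=-1)
    return attn[:, 0]                                     # mass on token 0

T, d = 16, 64
q, k = torch.randn(T, d), torch.randn(T, d)
print(sink_mass(q, k))  # roughly uniform here; trained models often show a large spike
```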

Not sure if this is the case now, but some miss that UE8M0 (Unsigned, Exponent 8, Mantissa 0), as used for V3.1, is a microscaling data format. They're NOT using mantissa-free *weights*; UE8M0 is just the scale-factor format: large dynamic range, cheap to apply.
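
To make the point concrete, here is a hedged sketch of how an E8M0-style shared scale typically works in microscaling formats: the per-block scale is a pure power of two stored in an unsigned 8-bit exponent, so it covers a huge dynamic range and is trivial to apply. The bias, element range, and rounding below are illustrative assumptions, not V3.1's actual recipe.

```python
# Hedged sketch of an E8M0-style power-of-two block scale (illustrative only).
import numpy as np

BIAS = 127  # assumed E8M0-style exponent bias

def encode_block(x, elem_max=7.0):
    """Pick a power-of-two block scale so the scaled block fits the element range."""
    amax = np.max(np.abs(x))
    exp = int(np.ceil(np.log2(amax / elem_max))) if amax > 0 else -BIAS
    exp = int(np.clip(exp, -BIAS, BIAS))
    scale = 2.0 ** exp
    q = np.clip(np.round(x / scale), -elem_max, elem_max)  # low-bit elements
    return q, exp + BIAS                                    # elements + UE8M0 scale byte

def decode_block(q, e8m0_byte):
    return q * 2.0 ** (e8m0_byte - BIAS)

x = np.random.randn(32).astype(np.float32) * 1e-3
q, s = encode_block(x)
print(np.max(np.abs(x - decode_block(q, s))))  # small reconstruction error
```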