Chien Nguyen

8 posts

@chiennv2000

Ph.D. Student @uoregon | Prev: @GoogleDeepMind and @Adobe.

Joined April 2020
296 Following · 48 Followers
Pinned Tweet
Chien Nguyen @chiennv2000 ·
We introduce Orthrus, a dual architecture that unifies AR-level fidelity with parallel, diffusion-style decoding, addressing the memory-bandwidth bottleneck in autoregressive generation. Paper: arxiv.org/abs/2605.12825 Code: github.com/chiennv2000 Thread 🧵
Chien Nguyen @chiennv2000 ·
(4/4) Comparison with diffusion-adaptation methods. Recent diffusion LLMs enable parallel decoding but often degrade quality and reasoning. For instance, Fast-dLLM-v2 shows an 11.1% accuracy drop relative to its AR baseline (Qwen2.5-7B) due to conditional drift, which often cancels out its speed gains. Orthrus removes this trade-off by decoupling parallel generation from sequential constraints while preserving exact AR fidelity: it is strictly lossless and achieves a ~6× speedup over Qwen3-8B without sacrificing generation quality or reasoning ability.
Chien Nguyen @chiennv2000 ·
Compared to speculative decoding methods such as EAGLE-3 and DFlash, Orthrus needs no external drafter model or separate KV cache, eliminating both redundancy and time-to-first-token overhead. Because both views share a single KV cache, the system adds only O(1) memory overhead while scaling efficiently to long contexts. Empirically, Orthrus achieves up to 7.8× speedup, is strictly lossless with respect to the base AR model, and is about 2× faster than DFlash at 40K context length.
Chien Nguyen reposted
Horace He @cHHillee ·
For too long, users have lived under the software lottery tyranny of fused attention implementations. No longer. Introducing FlexAttention, a new PyTorch API allowing for many attention variants to enjoy fused kernels in a few lines of PyTorch. pytorch.org/blog/flexatten… 1/10
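For readers new to the API: FlexAttention's key idea is a user-supplied `score_mod` callback that rewrites each attention score given its (batch, head, query, key) position, so an attention variant becomes a few lines of Python instead of a new fused kernel. The NumPy function below is my eager reference sketch of those semantics — it mirrors the callback signature from the blog post but is not the fused PyTorch kernel (which lives under `torch.nn.attention.flex_attention` in recent PyTorch versions).

```python
import numpy as np

def attention_with_score_mod(q, k, v, score_mod):
    """Eager reference for score_mod-style attention (sketch, not FlexAttention).
    q, k, v: (batch, heads, seq, dim). score_mod rewrites one scalar score
    given its (batch, head, q_idx, kv_idx) position."""
    scores = q @ k.transpose(0, 1, 3, 2) / np.sqrt(q.shape[-1])
    B, H, Q, K = scores.shape
    for b in range(B):
        for h in range(H):
            for i in range(Q):
                for j in range(K):
                    scores[b, h, i, j] = score_mod(scores[b, h, i, j], b, h, i, j)
    # softmax over the key axis; -inf scores become zero weight
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights = weights / weights.sum(-1, keepdims=True)
    return weights @ v

def causal(score, b, h, q_idx, kv_idx):
    # a whole attention variant expressed as a one-line score rewrite
    return score if q_idx >= kv_idx else -np.inf
```

Swapping `causal` for, say, a relative-position bias gives a different variant with no kernel changes — that composability, plus still getting a fused kernel for each variant, is the point of the API.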
Chien Nguyen reposted
Hieu Pham @hyhieu226 ·
research.colfax-intl.com/tutorial-hoppe… A tutorial to help your kernels run faster on H100s. The H100 SXM GPU has a memory bandwidth of 3.35 TB/s (read: very fast), but writing CUDA kernels that can actually utilize this bandwidth is no easy business.

H100 GPUs have a feature called TMA (Tensor Memory Accelerator). Using TMA is essential if we want to utilize these GPUs' full bandwidth. But using TMA is not easy either. TMA has a lot of nuts and bolts. At a quick glance, it has a totally different invocation pattern from vanilla GPU memory copy operations. At a deeper dive, it has many nuances that programmers need to get right to achieve good speedups, and debugging it is painful if we don't understand how it works.

You can find many of these nuts and bolts and nuances in our newest tutorial on TMA! We hope it's helpful. This is a collaboration with friends at @colfaxintl, whom I am really, really fortunate to have found.
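The 3.35 TB/s figure invites a quick roofline-style sanity check: for a bandwidth-bound kernel, the floor on runtime is simply bytes moved divided by peak bandwidth. A tiny sketch (the bandwidth number is from the tweet; the example model size is my illustrative assumption, not from the tutorial):

```python
# Roofline-style lower bound: a bandwidth-bound kernel can never run faster
# than (bytes moved) / (peak HBM bandwidth), no matter how clever the code.
PEAK_HBM_BW = 3.35e12  # bytes/s, H100 SXM peak memory bandwidth (per tweet)

def min_stream_time_s(num_bytes: float) -> float:
    """Lower bound on kernel time if it is purely bandwidth-limited."""
    return num_bytes / PEAK_HBM_BW

# Illustrative: an 8B-parameter model in bf16 is ~16 GB of weights, so one
# full pass over the weights takes at least ~4.8 ms even at peak bandwidth.
t = min_stream_time_s(8e9 * 2)
```

This is the gap TMA helps close: hand-written copies often sit far below that peak, so the achieved fraction of `PEAK_HBM_BW`, not the floor itself, is what kernel tuning fights for.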
Chien Nguyen reposted
Alexandr Wang @alexandr_wang ·
the most valuable skill in the world is systems engineering: the ability to debug, understand, and improve a complex system with limited/poor measurement. THIS is what makes great scientists, engineers, PMs, operators, doctors & investors. not truly taught in school outside STEM
Chien Nguyen reposted
Tri Dao @tri_dao ·
I'll be at #ICML2023 hanging out at a few poster sessions, and helping organize a workshop on efficient systems for foundation models (ES-FoMo). Pls reach out if you want to chat about ML & systems. es-fomo.com