Chien Nguyen

8 posts

@chiennv2000

Ph.D. Student @uoregon | Prev: @GoogleDeepMind and @Adobe.

Joined April 2020
296 Following · 48 Followers
Pinned Tweet
Chien Nguyen @chiennv2000 ·
We introduce Orthrus, a dual architecture that unifies AR-level fidelity with parallel, diffusion-style decoding, addressing the memory-bandwidth bottleneck in autoregressive generation. Paper: arxiv.org/abs/2605.12825 Code: github.com/chiennv2000 Thread 🧵
Chien Nguyen @chiennv2000 ·
(4/4) Comparison with diffusion-adaptation methods. Recent diffusion LLMs enable parallel decoding but often degrade quality and reasoning. For instance, Fast-dLLM-v2 shows an 11.1% accuracy drop relative to its AR baseline (Qwen2.5-7B) due to conditional drift, which often cancels out its speed gains. Orthrus removes this trade-off by decoupling parallel generation from sequential constraints while preserving exact AR fidelity: it is strictly lossless and achieves a ~6× speedup over Qwen3-8B without sacrificing generation quality or reasoning ability.
Chien Nguyen @chiennv2000 ·
Compared to speculative decoding methods such as EAGLE-3 and DFlash, Orthrus needs no external drafter model or separate KV cache, eliminating both redundancy and time-to-first-token overhead. Because both views share a single KV cache, the system adds only O(1) memory overhead while scaling efficiently to long contexts. Empirically, Orthrus achieves up to 7.8× speedup, is strictly lossless with respect to the base AR model, and is about 2× faster than DFlash at 40K context length.
Chien Nguyen reposted
Horace He @cHHillee ·
For too long, users have lived under the software lottery tyranny of fused attention implementations. No longer. Introducing FlexAttention, a new PyTorch API allowing for many attention variants to enjoy fused kernels in a few lines of PyTorch. pytorch.org/blog/flexatten… 1/10
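For readers new to the API: FlexAttention's key idea is a user-supplied `score_mod` callback that rewrites each attention score given its (batch, head, query, key) position, so an attention variant becomes a few lines of Python instead of a new fused kernel. The NumPy function below is my eager reference sketch of those semantics — it mirrors the callback signature from the blog post but is not the fused PyTorch kernel (which lives under `torch.nn.attention.flex_attention` in recent PyTorch versions).

```python
import numpy as np

def attention_with_score_mod(q, k, v, score_mod):
    """Eager reference for score_mod-style attention (sketch, not FlexAttention).
    q, k, v: (batch, heads, seq, dim). score_mod rewrites one scalar score
    given its (batch, head, q_idx, kv_idx) position."""
    scores = q @ k.transpose(0, 1, 3, 2) / np.sqrt(q.shape[-1])
    B, H, Q, K = scores.shape
    for b in range(B):
        for h in range(H):
            for i in range(Q):
                for j in range(K):
                    scores[b, h, i, j] = score_mod(scores[b, h, i, j], b, h, i, j)
    # softmax over the key axis; -inf scores become zero weight
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights = weights / weights.sum(-1, keepdims=True)
    return weights @ v

def causal(score, b, h, q_idx, kv_idx):
    # a whole attention variant expressed as a one-line score rewrite
    return score if q_idx >= kv_idx else -np.inf
```

Swapping `causal` for, say, a relative-position bias gives a different variant with no kernel changes — that composability, plus still getting a fused kernel for each variant, is the point of the API.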
Chien Nguyen reposted
Hieu Pham @hyhieu226 ·
research.colfax-intl.com/tutorial-hoppe… A tutorial to help your kernels run faster on H100s. The H100 SXM GPU has a memory bandwidth of 3.35 TB/s (read: very fast), but writing CUDA kernels that can actually utilize this bandwidth is no easy business.

H100 GPUs have a feature called TMA (Tensor Memory Accelerator). Using TMA is essential if we want to utilize these GPUs' full bandwidth. But using TMA is not easy either. TMA has a lot of nuts and bolts. At a quick glance, it has a totally different invocation pattern from vanilla GPU memory copy operations. At a deeper dive, it has many nuances that programmers need to get right to achieve good speedups, and debugging it is painful if we don't understand how it works.

You can find many of these nuts and bolts and nuances in our newest tutorial on TMA! We hope it's helpful. This is a collaboration with friends at @colfaxintl, whom I am really, really fortunate to have found.
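The 3.35 TB/s figure invites a quick roofline-style sanity check: for a bandwidth-bound kernel, the floor on runtime is simply bytes moved divided by peak bandwidth. A tiny sketch (the bandwidth number is from the tweet; the example model size is my illustrative assumption, not from the tutorial):

```python
# Roofline-style lower bound: a bandwidth-bound kernel can never run faster
# than (bytes moved) / (peak HBM bandwidth), no matter how clever the code.
PEAK_HBM_BW = 3.35e12  # bytes/s, H100 SXM peak memory bandwidth (per tweet)

def min_stream_time_s(num_bytes: float) -> float:
    """Lower bound on kernel time if it is purely bandwidth-limited."""
    return num_bytes / PEAK_HBM_BW

# Illustrative: an 8B-parameter model in bf16 is ~16 GB of weights, so one
# full pass over the weights takes at least ~4.8 ms even at peak bandwidth.
t = min_stream_time_s(8e9 * 2)
```

This is the gap TMA helps close: hand-written copies often sit far below that peak, so the achieved fraction of `PEAK_HBM_BW`, not the floor itself, is what kernel tuning fights for.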
Chien Nguyen reposted
Alexandr Wang @alexandr_wang ·
the most valuable skill in the world is systems engineering: the ability to debug, understand, and improve a complex system with limited/poor measurement. THIS is what makes great scientists, engineers, PMs, operators, doctors & investors. not truly taught in school outside STEM
Chien Nguyen reposted
Tri Dao @tri_dao ·
I'll be at #ICML2023 hanging out at a few poster sessions, and helping organize a workshop on efficient systems for foundation models (ES-FoMo). Pls reach out if you want to chat about ML & systems. es-fomo.com