Jingyu Liu

30 posts


@Jingyu227

CS PhD @Uchicago. RS Intern @Nvidia, AI Resident @AIatMeta, MLE @ByteDanceTalk.

Chicago, IL · Joined December 2019
101 Following · 139 Followers
Jingyu Liu retweeted
Zhihan Yang@zhihanyang_·
📢 Excited to share our new paper: Scaling Beyond Masked Diffusion Language Models
🤔 Is masked diffusion really the future of non-AR language modeling?
📈 We ran the first scaling law study across 3 discrete diffusion families: masked, uniform-state (Duo), and interpolating (Eso-LMs)!
🤯 Surprisingly: Uniform-state diffusion outperforms masked diffusion on several downstream tasks, including GSM8K.
🤔 As expected: Uniform-state diffusion has worse perplexity than masked diffusion.
How to explain this? Dive in 👇 [🧵 1/9]
Paper: arxiv.org/abs/2602.15014
Blog: s-sahoo.com/scaling-dllms/
Code: github.com/s-sahoo/scalin…
Work done in collaboration with: @ssahoo_ @jm_lemercier @jdeschena @Jingyu227 @jwthickstun Ante Jukic
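To make "scaling law study" concrete: the usual recipe is to train a family of models of increasing size and fit a saturating power law to the resulting losses. Below is a minimal sketch of that fit; the numbers are made up for illustration and are not the paper's data, and the L(N) = a * N^(-b) + c form is only the standard assumption, not necessarily what the authors used.

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical (model size, validation loss) pairs; illustrative only,
# NOT results from the paper above.
sizes = np.array([1e7, 3e7, 1e8, 3e8, 1e9])
losses = np.array([4.10, 3.72, 3.35, 3.05, 2.80])

def power_law(n, a, b, c):
    """Saturating power law L(N) = a * N**(-b) + c, the usual scaling-law form."""
    return a * n ** (-b) + c

# Fit the curve and extrapolate to a larger model size.
params, _ = curve_fit(power_law, sizes, losses, p0=(10.0, 0.1, 2.0), maxfev=20000)
a, b, c = params
print(f"fit: L(N) = {a:.2f} * N^(-{b:.3f}) + {c:.2f}")
print("extrapolated loss at 8B params:", float(power_law(8e9, *params)))
```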
Jingyu Liu retweeted
Discrete Diffusion Reading Group@diffusion_llms·
📢 Jan 19 (Mon): TiDAR: Think in Diffusion, Talk in Autoregression
Diffusion LMs enable fast parallel generation, while autoregressive (AR) models typically deliver higher quality thanks to their causal structure. A central challenge is whether these advantages can be unified to achieve:
✅ High throughput
✅ Higher GPU utilization
✅ AR-level quality
This Monday, Jingyu Liu (@Jingyu227) will discuss TiDAR, a hybrid decoding approach that combines diffusion-style parallel drafting with autoregressive verification for high quality and high throughput.
The project was co-led by Jingyu Liu (@Jingyu227) and Xin Dong (@SimonXinDong). Collaborators: Zhifan Ye (PhD Student @ GaTech), Rishabh Mehta (@__principia__), @YongganFu, Vartika Singh (@vartuattheghat), @jankautz, @ce_zhang and @PavloMolchanov
Paper link: arxiv.org/abs/2511.08923
Jingyu Liu retweeted
Akshay 🚀@akshay_pachaar·
NVIDIA just dropped a paper that might solve the biggest trade-off in LLMs: speed vs. quality.

Autoregressive models (like GPT) are smart but slow - they generate one token at a time, leaving most of your GPU sitting idle. Diffusion models are fast but often produce incoherent outputs. TiDAR gets you both in a single forward pass.

Here's the genius part: modern GPUs can process way more tokens than we actually use. TiDAR exploits these "free slots" by:
1. Drafting multiple tokens at once using diffusion (the "thinking" phase)
2. Verifying them using autoregression (the "talking" phase)
Both happen simultaneously using smart attention masks - bidirectional for drafting, causal for verification.

The results:
↳ 4.71x faster at 1.5B parameters with zero quality loss
↳ Nearly 6x faster at 8B parameters
↳ First architecture to outperform speculative decoding (EAGLE-3)
↳ Works with standard KV caching, unlike pure diffusion models

The training trick is clever too - instead of randomly masking tokens, they mask everything. This gives stronger learning signals and enables efficient single-step drafting.

If you're building real-time AI agents where latency kills the experience, this architecture is worth paying attention to. Link to the paper in the next tweet.
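To make the "bidirectional for drafting, causal for verification" idea concrete, here is a minimal sketch of such a hybrid attention mask, assuming a committed prefix of `prefix_len` tokens followed by `draft_len` draft slots processed in the same forward pass. This is a toy illustration of the masking pattern described above, not TiDAR's actual mask construction.

```python
import torch

def hybrid_draft_verify_mask(prefix_len: int, draft_len: int) -> torch.Tensor:
    """Toy hybrid mask: causal over the committed prefix (the AR "talking"
    half) and bidirectional within the appended draft block (the diffusion
    "thinking" half). True means attention is allowed."""
    total = prefix_len + draft_len
    mask = torch.zeros(total, total, dtype=torch.bool)

    # Prefix tokens: standard causal (lower-triangular) attention.
    mask[:prefix_len, :prefix_len] = torch.tril(torch.ones(prefix_len, prefix_len)).bool()

    # Draft tokens: attend to the whole prefix and to every other draft slot.
    mask[prefix_len:, :prefix_len] = True
    mask[prefix_len:, prefix_len:] = True
    return mask

if __name__ == "__main__":
    # 4 committed tokens followed by 3 draft slots, all handled in one forward pass.
    print(hybrid_draft_verify_mask(prefix_len=4, draft_len=3).int())
```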
Jingyu Liu retweeted
AK@_akhaliq·
Nvidia presents TiDAR: Think in Diffusion, Talk in Autoregression
Subham Sahoo@ssahoo_·
🎓 Officially a doctor now 😊!!! As a first-gen college kid, this moment means the world to me. Grateful beyond words to all my mentors who’ve guided me along the way — from @GMartius who first introduced me to research back in 2017, to @volokuleshov who sparked my love for generative modeling, and finally to @jwthickstun and @Jimantha for their incredible mentorship through the final stretch of my PhD. ❤️
Jingyu Liu retweeted
Beidi Chen@BeidiChen·
I was asked many times lately by students working on test-time scaling with slightly modified attention or generation workflows (customized reward models / search) what repo to use. HF is a bit too time-consuming, especially with tons of token generation, and SGLang/vLLM are a bit hard to hack when doing earlier explorations. So we recommend a lightweight repo we wrote for our earlier work!
Infini-AI-Lab@InfiniAILab

🧵 Glad to introduce LiteSys, the inference framework we used in 📄 Kinetics: Rethinking Test-Time Scaling Laws (arxiv.org/abs/2506.05333) to evaluate test-time scaling (32K+ generated tokens) at scale.
If you are:
✅ Looking for an inference framework that's easy to extend.
🐢 Frustrated by how slow Hugging Face Transformers are.
🎯 Struggling to align performance on evaluation benchmarks.
Then LiteSys is built for you.
🔗 GitHub: github.com/Infini-AI-Lab/…

Jingyu Liu@Jingyu227·
Any PhD student who sincerely cares about LLM efficiency should consider reading this insanely helpful and nicely written series: jax-ml.github.io/scaling-book
Muru Zhang@zhang_muru·
After a year of internship with amazing folks at @togethercompute, I will be interning at @GoogleDeepMind this summer working on language model architecture! Hit me up and I will get you a boba at the bayview rooftop of my Emeryville apartment 😉
Jingyu Liu@Jingyu227·
We benchmark HAMburger on decoding TPS. Notably, HAMburger makes small models fast without needing extra alignment steps or a smaller draft model with the same tokenizer.
Jingyu Liu@Jingyu227·
Ever get bored seeing LLMs output one token per step? Check out HAMburger (advised by @ce_zhang), which smashes multiple tokens into a virtual token with up to 2x decoding TPS boost + reduced KV FLOPs and storage while maintaining quality! github.com/Jingyu6/hambur…
Jingyu Liu@Jingyu227·
HAMburger stacks a standard LLM with a compositional embedder and a micro-step decoder that dynamically adjusts the number of tokens to output per forward pass. It functions as a self-speculative decoding framework without rejection and needs only a single forward pass for "verification".
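As a rough sketch of how a compositional embedder plus micro-step decoder could be wired up: the GRU-based micro-step head, the mean-pooling embedder, and the stop criterion below are all illustrative placeholders under my own assumptions, not the actual HAMburger modules from the repo above.

```python
import torch
import torch.nn as nn

class ToyMicroStepDecoder(nn.Module):
    """Illustrative stand-in for a micro-step decoder: from one base-model
    hidden state, emit up to `max_steps` tokens and stop dynamically."""
    def __init__(self, hidden_dim: int, vocab_size: int, max_steps: int = 4):
        super().__init__()
        self.cell = nn.GRUCell(hidden_dim, hidden_dim)
        self.lm_head = nn.Linear(hidden_dim, vocab_size)
        self.stop_head = nn.Linear(hidden_dim, 1)
        self.max_steps = max_steps

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        tokens = []
        for _ in range(self.max_steps):
            h = self.cell(h, h)
            tokens.append(self.lm_head(h).argmax(dim=-1))
            if torch.sigmoid(self.stop_head(h)).item() > 0.5:  # dynamic token count
                break
        return torch.stack(tokens, dim=-1)

class ToyCompositionalEmbedder(nn.Module):
    """Pools the tokens emitted in one macro step into a single 'virtual
    token' embedding that feeds the base LLM on the next forward pass."""
    def __init__(self, vocab_size: int, hidden_dim: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_dim)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        return self.embed(token_ids).mean(dim=-2)  # simple mean pooling

if __name__ == "__main__":
    hidden = torch.randn(1, 64)                  # pretend base-LLM hidden state
    decoder = ToyMicroStepDecoder(64, 1000)
    embedder = ToyCompositionalEmbedder(1000, 64)
    tokens = decoder(hidden)                     # several tokens from one forward
    virtual = embedder(tokens)                   # one virtual token for the next step
    print(tokens.shape, virtual.shape)
```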
Jingyu Liu retweeted
Muru Zhang@zhang_muru·
Running your model on multiple GPUs but often finding the speed unsatisfactory? We introduce Ladder-residual, a parallelism-aware architecture modification that makes a 70B Llama with tensor parallelism ~30% faster! Work done at @togethercompute. Co-1st author with @MayankMish98 and mentored by @ben_athi, @tri_dao. arxiv.org/pdf/2501.06589 🧵[1/7]
Jingyu Liu@Jingyu227·
> What about implementation? Efficiency claims can only be made with highly optimized code, so we built SpecPrefill on top of vLLM. Feel free to try our stuff, both with standalone models and servers, here: github.com/Jingyu6/specul… Stay tuned for the camera-ready paper!
Jingyu Liu@Jingyu227·
> How does it work with long context? We made sure it works well on all compressible long-context tasks, including LongBench and RULER, up to 128K.
Jingyu Liu@Jingyu227·
> How does it compare with Sparse Attention for prefill acceleration? SpecPrefill skips not only attention but also the MLP for a carefully chosen subset of tokens, making it especially suitable for large batch sizes! We benchmarked latency across different prefill configurations.
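A hedged sketch of that token-subset idea: a cheap scorer keeps only the most important prompt tokens, and the main model prefills just that subset, so both attention and MLP FLOPs are skipped for dropped tokens. The scorer, the `keep_ratio` parameter, and the `main_model` interface are assumptions for illustration, not SpecPrefill's actual vLLM implementation.

```python
import torch

def speculative_prefill(main_model, scorer, input_ids, keep_ratio=0.3):
    """Hedged sketch of token-pruned prefill: a lightweight scorer picks the
    most important prompt tokens and the main model only prefills that
    subset, skipping attention and MLP work for everything else."""
    scores = scorer(input_ids)                                  # (1, seq_len) importance scores
    k = max(1, int(keep_ratio * input_ids.shape[1]))
    keep = scores.topk(k, dim=-1).indices.sort(dim=-1).values   # keep original token order
    pruned_ids = input_ids.gather(1, keep)
    # The main model builds its KV cache from the pruned prompt only;
    # original positions are passed along so positional encodings stay aligned.
    out = main_model(pruned_ids, position_ids=keep)
    return out, keep

if __name__ == "__main__":
    torch.manual_seed(0)
    ids = torch.randint(0, 1000, (1, 16))
    scorer = lambda x: torch.rand(x.shape)                  # toy importance scorer
    main_model = lambda x, position_ids: x.float().mean()   # stand-in for the real prefill
    _, kept = speculative_prefill(main_model, scorer, ids, keep_ratio=0.25)
    print("kept positions:", kept)
```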