Jingyu Liu

30 posts


@Jingyu227

CS PhD @Uchicago. RS Intern @Nvidia, AI Resident @AIatMeta, MLE @ByteDanceTalk.

Chicago, IL · Joined December 2019
101 Following · 139 Followers
Jingyu Liu retweeted
Zhihan Yang@zhihanyang_·
📢 Excited to share our new paper: Scaling Beyond Masked Diffusion Language Models
🤔 Is masked diffusion really the future of non-AR language modeling?
📈 We ran the first scaling law study across 3 discrete diffusion families: masked, uniform-state (Duo), and interpolating (Eso-LMs)!
🤯 Surprisingly: Uniform-state diffusion outperforms masked diffusion on several downstream tasks, including GSM8K.
🤔 As expected: Uniform-state diffusion has worse perplexity than masked diffusion.
How to explain this? Dive in 👇 [🧵 1/9]
Paper: arxiv.org/abs/2602.15014
Blog: s-sahoo.com/scaling-dllms/
Code: github.com/s-sahoo/scalin…
Work done in collaboration with: @ssahoo_ @jm_lemercier @jdeschena @Jingyu227 @jwthickstun Ante Jukic
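To make "scaling law study" concrete: the usual recipe is to train a family of models of increasing size and fit a saturating power law to the resulting losses. Below is a minimal sketch of that fit; the numbers are made up for illustration and are not the paper's data, and the L(N) = a * N^(-b) + c form is only the standard assumption, not necessarily what the authors used.

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical (model size, validation loss) pairs; illustrative only,
# NOT results from the paper above.
sizes = np.array([1e7, 3e7, 1e8, 3e8, 1e9])
losses = np.array([4.10, 3.72, 3.35, 3.05, 2.80])

def power_law(n, a, b, c):
    """Saturating power law L(N) = a * N**(-b) + c, the usual scaling-law form."""
    return a * n ** (-b) + c

# Fit the curve and extrapolate to a larger model size.
params, _ = curve_fit(power_law, sizes, losses, p0=(10.0, 0.1, 2.0), maxfev=20000)
a, b, c = params
print(f"fit: L(N) = {a:.2f} * N^(-{b:.3f}) + {c:.2f}")
print("extrapolated loss at 8B params:", float(power_law(8e9, *params)))
```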
Jingyu Liu retweeted
Discrete Diffusion Reading Group@diffusion_llms·
📢 Jan 19 (Mon): TiDAR: Think in Diffusion, Talk in Autoregression
Diffusion LMs enable fast parallel generation, while autoregressive (AR) models typically deliver higher quality thanks to their causal structure. A central challenge is whether these advantages can be unified to achieve:
✅ High throughput
✅ Higher GPU utilization
✅ AR-level quality
This Monday, Jingyu Liu (@Jingyu227) will discuss TiDAR, a hybrid decoding approach that combines diffusion-style parallel drafting with autoregressive verification for high quality and high throughput.
The project was co-led by Jingyu Liu (@Jingyu227) and Xin Dong (@SimonXinDong). Collaborators: Zhifan Ye (PhD Student @ GaTech), Rishabh Mehta (@__principia__), @YongganFu, Vartika Singh (@vartuattheghat), @jankautz, @ce_zhang and @PavloMolchanov
Paper link: arxiv.org/abs/2511.08923
Jingyu Liu retweeted
Akshay 🚀@akshay_pachaar·
NVIDIA just dropped a paper that might solve the biggest trade-off in LLMs: speed vs. quality.

Autoregressive models (like GPT) are smart but slow - they generate one token at a time, leaving most of your GPU sitting idle. Diffusion models are fast but often produce incoherent outputs. TiDAR gets you both in a single forward pass.

Here's the genius part: modern GPUs can process way more tokens than we actually use. TiDAR exploits these "free slots" by:
1. Drafting multiple tokens at once using diffusion (the "thinking" phase)
2. Verifying them using autoregression (the "talking" phase)
Both happen simultaneously using smart attention masks - bidirectional for drafting, causal for verification.

The results:
↳ 4.71x faster at 1.5B parameters with zero quality loss
↳ Nearly 6x faster at 8B parameters
↳ First architecture to outperform speculative decoding (EAGLE-3)
↳ Works with standard KV caching, unlike pure diffusion models

The training trick is clever too - instead of randomly masking tokens, they mask everything. This gives stronger learning signals and enables efficient single-step drafting.

If you're building real-time AI agents where latency kills the experience, this architecture is worth paying attention to. Link to the paper in the next tweet.
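To make the "bidirectional for drafting, causal for verification" idea concrete, here is a minimal sketch of such a hybrid attention mask, assuming a committed prefix of `prefix_len` tokens followed by `draft_len` draft slots processed in the same forward pass. This is a toy illustration of the masking pattern described above, not TiDAR's actual mask construction.

```python
import torch

def hybrid_draft_verify_mask(prefix_len: int, draft_len: int) -> torch.Tensor:
    """Toy hybrid mask: causal over the committed prefix (the AR "talking"
    half) and bidirectional within the appended draft block (the diffusion
    "thinking" half). True means attention is allowed."""
    total = prefix_len + draft_len
    mask = torch.zeros(total, total, dtype=torch.bool)

    # Prefix tokens: standard causal (lower-triangular) attention.
    mask[:prefix_len, :prefix_len] = torch.tril(torch.ones(prefix_len, prefix_len)).bool()

    # Draft tokens: attend to the whole prefix and to every other draft slot.
    mask[prefix_len:, :prefix_len] = True
    mask[prefix_len:, prefix_len:] = True
    return mask

if __name__ == "__main__":
    # 4 committed tokens followed by 3 draft slots, all handled in one forward pass.
    print(hybrid_draft_verify_mask(prefix_len=4, draft_len=3).int())
```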
Jingyu Liu retweeted
AK@_akhaliq·
Nvidia presents TiDAR: Think in Diffusion, Talk in Autoregression
Subham Sahoo@ssahoo_·
🎓 Officially a doctor now 😊!!! As a first-gen college kid, this moment means the world to me. Grateful beyond words to all my mentors who’ve guided me along the way — from @GMartius who first introduced me to research back in 2017, to @volokuleshov who sparked my love for generative modeling, and finally to @jwthickstun and @Jimantha for their incredible mentorship through the final stretch of my PhD. ❤️
Jingyu Liu retweeted
Beidi Chen@BeidiChen·
I was asked many times lately by students working on test-time scaling with slightly modified attention or generation workflows (customized reward models / search) what repo to use. HF is a bit too time-consuming, especially with tons of token generation, and SGLang/vLLM are a bit hard to hack when doing earlier explorations. So we recommend a lightweight repo we wrote for our earlier work!
Infini-AI-Lab@InfiniAILab

🧵 Glad to introduce LiteSys, the inference framework we used in 📄 Kinetics: Rethinking Test-Time Scaling Laws (arxiv.org/abs/2506.05333) to evaluate test-time scaling (32K+ generated tokens) at scale.
If you are:
✅ Looking for an inference framework that's easy to extend.
🐢 Frustrated by how slow Hugging Face Transformers are.
🎯 Struggling to align performance on evaluation benchmarks.
Then LiteSys is built for you.
🔗 GitHub: github.com/Infini-AI-Lab/…

Jingyu Liu@Jingyu227·
Any PhD student who sincerely cares about LLM efficiency should consider reading this insanely helpful and nicely written series: jax-ml.github.io/scaling-book
Muru Zhang@zhang_muru·
After a year of internship with amazing folks at @togethercompute, I will be interning at @GoogleDeepMind this summer working on language model architecture! Hit me up and I will get you a boba at the bayview rooftop of my Emeryville apartment 😉
Jingyu Liu@Jingyu227·
We benchmark HAMburger on decoding TPS. Notably, HAMburger makes small models fast without needing extra alignment steps or a smaller draft model with the same tokenizer.
Jingyu Liu@Jingyu227·
Ever get bored seeing LLMs output one token per step? Check out HAMburger (advised by @ce_zhang), which smashes multiple tokens into a virtual token with up to 2x decoding TPS boost + reduced KV FLOPs and storage while maintaining quality! github.com/Jingyu6/hambur…
Jingyu Liu@Jingyu227·
HAMburger stacks a standard LLM with a compositional embedder and a micro-step decoder that dynamically adjusts the number of tokens to output per forward pass. It functions as a self-speculative decoding framework without rejection and needs only a single forward pass for "verification".
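As a rough sketch of how a compositional embedder plus micro-step decoder could be wired up: the GRU-based micro-step head, the mean-pooling embedder, and the stop criterion below are all illustrative placeholders under my own assumptions, not the actual HAMburger modules from the repo above.

```python
import torch
import torch.nn as nn

class ToyMicroStepDecoder(nn.Module):
    """Illustrative stand-in for a micro-step decoder: from one base-model
    hidden state, emit up to `max_steps` tokens and stop dynamically."""
    def __init__(self, hidden_dim: int, vocab_size: int, max_steps: int = 4):
        super().__init__()
        self.cell = nn.GRUCell(hidden_dim, hidden_dim)
        self.lm_head = nn.Linear(hidden_dim, vocab_size)
        self.stop_head = nn.Linear(hidden_dim, 1)
        self.max_steps = max_steps

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        tokens = []
        for _ in range(self.max_steps):
            h = self.cell(h, h)
            tokens.append(self.lm_head(h).argmax(dim=-1))
            if torch.sigmoid(self.stop_head(h)).item() > 0.5:  # dynamic token count
                break
        return torch.stack(tokens, dim=-1)

class ToyCompositionalEmbedder(nn.Module):
    """Pools the tokens emitted in one macro step into a single 'virtual
    token' embedding that feeds the base LLM on the next forward pass."""
    def __init__(self, vocab_size: int, hidden_dim: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_dim)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        return self.embed(token_ids).mean(dim=-2)  # simple mean pooling

if __name__ == "__main__":
    hidden = torch.randn(1, 64)                  # pretend base-LLM hidden state
    decoder = ToyMicroStepDecoder(64, 1000)
    embedder = ToyCompositionalEmbedder(1000, 64)
    tokens = decoder(hidden)                     # several tokens from one forward
    virtual = embedder(tokens)                   # one virtual token for the next step
    print(tokens.shape, virtual.shape)
```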
Jingyu Liu retweeted
Muru Zhang@zhang_muru·
Running your model on multiple GPUs but often finding the speed unsatisfactory? We introduce Ladder-residual, a parallelism-aware architecture modification that makes a 70B Llama with tensor parallelism ~30% faster! Work done at @togethercompute. Co-1st author with @MayankMish98 and mentored by @ben_athi, @tri_dao. arxiv.org/pdf/2501.06589 🧵[1/7]
Jingyu Liu@Jingyu227·
> What about implementation? Efficiency claims can only be made with highly optimized code, so we built SpecPrefill on top of vLLM. Feel free to try our stuff, both with standalone models and servers, here: github.com/Jingyu6/specul… Stay tuned for the camera-ready paper!
Jingyu Liu@Jingyu227·
> How does it work with long context? We made sure it works well on all compressible long-context tasks, including LongBench and RULER, up to 128K.
Jingyu Liu@Jingyu227·
> How does it compare with Sparse Attention for prefill acceleration? SpecPrefill skips not only attention but also the MLP for a carefully chosen subset of tokens, making it especially suitable for large batch sizes! We benchmarked latency across different prefill configurations.
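A hedged sketch of that token-subset idea: a cheap scorer keeps only the most important prompt tokens, and the main model prefills just that subset, so both attention and MLP FLOPs are skipped for dropped tokens. The scorer, the `keep_ratio` parameter, and the `main_model` interface are assumptions for illustration, not SpecPrefill's actual vLLM implementation.

```python
import torch

def speculative_prefill(main_model, scorer, input_ids, keep_ratio=0.3):
    """Hedged sketch of token-pruned prefill: a lightweight scorer picks the
    most important prompt tokens and the main model only prefills that
    subset, skipping attention and MLP work for everything else."""
    scores = scorer(input_ids)                                  # (1, seq_len) importance scores
    k = max(1, int(keep_ratio * input_ids.shape[1]))
    keep = scores.topk(k, dim=-1).indices.sort(dim=-1).values   # keep original token order
    pruned_ids = input_ids.gather(1, keep)
    # The main model builds its KV cache from the pruned prompt only;
    # original positions are passed along so positional encodings stay aligned.
    out = main_model(pruned_ids, position_ids=keep)
    return out, keep

if __name__ == "__main__":
    torch.manual_seed(0)
    ids = torch.randint(0, 1000, (1, 16))
    scorer = lambda x: torch.rand(x.shape)                  # toy importance scorer
    main_model = lambda x, position_ids: x.float().mean()   # stand-in for the real prefill
    _, kept = speculative_prefill(main_model, scorer, ids, keep_ratio=0.25)
    print("kept positions:", kept)
```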