Harry Dong
@Real_HDong

34 posts

PhD Student @ CMU | Prev @ Meta, Apple, AFRL, AWS, UC Berkeley | Research in ML Inference

Pittsburgh · Joined March 2024
422 Following · 191 Followers
Pinned Tweet
Harry Dong @Real_HDong ·
1/🧵 🎉Introducing Bridge🌉, our parallel LLM inference scaling method that shares info between all responses to an input prompt throughout the generation process! Bridge greatly improves the quality of individual responses and the entire response set! 📜arxiv.org/pdf/2510.01143
[image]
1 reply · 4 reposts · 24 likes · 4.6K views
Harry Dong retweeted
Infini-AI-Lab @InfiniAILab ·
Video generation models are improving fast: real-time autoregressive models now deliver high quality at low latency, and they're quickly being adopted for world models and robotics applications. So what's the problem? They're still too slow on consumer hardware.
🚀 What if we told you that we can get true real-time 16 FPS video generation on a single RTX 5090? (1.5-12x over FA 2/3/4 on 5090, H100, B200)
Today we release MonarchRT 🦋, an efficient video attention that parameterizes attention maps as (tiled) Monarch matrices and delivers real end-to-end gains.
📄 Paper: arxiv.org/abs/2602.12271
🌐 Website: infini-ai-lab.github.io/MonarchRT
🔗 GitHub: github.com/Infini-AI-Lab/…
🧵 1/n
4 replies · 27 reposts · 132 likes · 32.9K views
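A minimal sketch of the Monarch structure the tweet references (my toy illustration, not MonarchRT's actual attention kernel): a dense n x n map is replaced by two block-diagonal factors interleaved with a fixed permutation, so a matvec costs O(n·sqrt(n)) instead of O(n^2).

```python
import numpy as np

# Toy Monarch-structured matvec: y = P^T (L (P (R x))), where n = b * b,
# L and R are block-diagonal with b blocks of shape (b, b), and P is the
# fixed "transpose" permutation on the (b, b) grid of coordinates.
rng = np.random.default_rng(0)
b = 4
n = b * b
L = rng.standard_normal((b, b, b))  # b diagonal blocks of shape (b, b)
R = rng.standard_normal((b, b, b))

def monarch_matvec(x):
    y = np.einsum("kij,kj->ki", R, x.reshape(b, b))  # block-diagonal R
    y = y.T                                          # permutation P
    y = np.einsum("kij,kj->ki", L, y)                # block-diagonal L
    return y.T.reshape(n)                            # P^T, back to a vector

x = rng.standard_normal(n)
y = monarch_matvec(x)
```

Each matvec touches 2 * b^3 = 2 * n * sqrt(n) multiply-adds instead of n^2, which is where the speedup over dense attention maps would come from.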
Harry Dong retweeted
Infini-AI-Lab @InfiniAILab ·
RL is notoriously unstable under actor–policy mismatch 😥, a common reality caused by kernel differences, MoE randomness, FP8 rollouts, or asynchronous pipelines.
But here's a crazy thought 🤔 👉 What if you could RL-train a large model using rollouts generated only by a weaker, faster, and completely different model? Sounds doomed from the start? 💩
We are releasing Jackpot 🎰💡, enabling training of Qwen3-8B-Base using only Qwen3-1.7B-Base-generated rollouts ✨
Jackpot is surprisingly powerful:
• Enables cheap, fast rollouts to train stronger models
• Dramatically changes the cost–performance tradeoff of RL training
We release Jackpot 🎰 in the following formats:
🌔 Paper: arxiv.org/abs/2602.06107
🌕 Code: github.com/Infini-AI-Lab/…
🌖 Blog: infini-ai-lab.github.io/jpt_website/
[1/n]
[image]
6 replies · 22 reposts · 124 likes · 23.5K views
Harry Dong @Real_HDong ·
Very neat work led by @RJ_Sadhukhan to make LLMs more efficient, sparse, and interpretable!
Infini-AI-Lab @InfiniAILab

Lookup memories are having a moment 😄 The whale 🐋 #deepseek dropped engram… and we dropped up-projections from our FFNs… perfect timing 😅
🥳 Introducing STEM: Scaling Transformers with Embedding Modules 🌱, a scalable way to boost parametric memory with extra perks:
✅ Stable training even at extreme sparsity
✅ Better quality for fewer training FLOPs (knowledge + reasoning + long-context gains)
✅ Efficient inference: ~33% of FFN params removed, plus CPU offload & async prefetch
✅ More interpretable → seamless knowledge editing 🔧🧠
Looking forward to DeepSeek v4… feels like we've only scratched the surface of embedding-lookup scaling 👀
📄 Paper: arxiv.org/abs/2601.10639
🌐 Website: infini-ab-lab.github.io/STEM
🔗 GitHub: github.com/Infini-AI-Lab/…

0 replies · 0 reposts · 5 likes · 170 views
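A toy sketch of the embedding-lookup idea behind dropping FFN up-projections (the wiring and names here are my assumptions, not the paper's exact architecture): the dense up-projection matmul is replaced by a per-token table lookup, so only the memory rows for tokens actually present in the batch are ever touched.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, d_model, d_ff = 1000, 16, 64

# Per-token memory table: large and extremely sparse in access, which is
# what would make CPU offload + async prefetch viable.
memory = rng.standard_normal((vocab, d_ff))
W_down = rng.standard_normal((d_ff, d_model)) / np.sqrt(d_ff)

def stem_like_ffn(hidden, token_ids):
    # Replace hidden @ W_up with an O(1)-per-token lookup: each position
    # retrieves its own memory row instead of computing a dense projection.
    up = memory[token_ids]                         # (seq, d_ff)
    return hidden + np.maximum(up, 0.0) @ W_down   # ReLU + down-projection

hidden = rng.standard_normal((5, d_model))
out = stem_like_ffn(hidden, np.array([3, 17, 17, 250, 999]))
```

Because the retrieved row depends only on the token id, the module behaves like addressable parametric memory, which is also what would make knowledge editing straightforward: overwrite one row, change one fact.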
Harry Dong @Real_HDong ·
At NeurIPS all week! Swing by the Efficient Reasoning workshop at 10:45-11:00 on Saturday to hear my oral presentation about our work on interdependent sampling for parallel generation!
[Quoted tweet: the pinned Bridge thread, arxiv.org/pdf/2510.01143]
0 replies · 1 repost · 6 likes · 306 views
Harry Dong retweeted
Rohan Choudhury @rchoudhury997 ·
Excited to release our new preprint - we introduce Adaptive Patch Transformers (APT), a method to speed up vision transformers by using multiple different patch sizes within the same image!
10 replies · 28 reposts · 232 likes · 29.7K views
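A toy illustration of mixed patch sizes within one image (my sketch of the general idea, not APT's actual algorithm): start from a coarse grid and subdivide only high-detail patches, so flat regions spend one token while detailed regions get four.

```python
import numpy as np

def adaptive_patches(img, coarse=8, thresh=0.01):
    # Subdivide a coarse patch into four fine patches when its pixel
    # variance exceeds `thresh`; otherwise keep the single coarse patch.
    H, W = img.shape
    patches = []  # (row, col, size) of each resulting patch
    for y in range(0, H, coarse):
        for x in range(0, W, coarse):
            block = img[y:y + coarse, x:x + coarse]
            if block.var() > thresh:
                s = coarse // 2
                patches += [(y + dy, x + dx, s) for dy in (0, s) for dx in (0, s)]
            else:
                patches.append((y, x, coarse))
    return patches

img = np.zeros((16, 16))
img[:8, :8] = np.random.default_rng(0).random((8, 8))  # one detailed corner
tokens = adaptive_patches(img)  # 3 coarse patches + 4 fine ones
```

Fewer patches on flat regions means fewer transformer tokens, which is where a speedup over uniform patching would come from.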
Harry Dong retweetledi
Infini-AI-Lab @InfiniAILab ·
🤔 Can we train RL on LLMs with extremely stale data? 🚀 Our latest study says YES! Stale data can be as informative as on-policy data, unlocking more scalable, efficient asynchronous RL for LLMs.
We introduce M2PO, an off-policy RL algorithm that keeps training stable and performant even when using data stale by 256 model updates.
🔗 Notion Blog: m2po.notion.site/rl-stale-m2po
📄 Paper: arxiv.org/abs/2510.01161
💻 GitHub: github.com/Infini-AI-Lab/…
🧵 1/4
[image]
3 replies · 39 reposts · 233 likes · 62.6K views
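For context, a generic clipped importance-sampling correction (a standard off-policy ingredient, not M2PO's actual objective): stale rollout data is reweighted by the ratio of current to behavior policy probabilities, with clipping to keep a few extreme ratios from destabilizing the update.

```python
import numpy as np

def off_policy_pg_loss(logp_current, logp_behavior, advantages, clip=4.0):
    # Importance ratio pi_current / pi_behavior corrects for the mismatch
    # between the policy being trained and the (stale) rollout policy.
    ratio = np.exp(logp_current - logp_behavior)
    ratio = np.clip(ratio, 1.0 / clip, clip)  # bound extreme weights
    return -(ratio * advantages).mean()

# On-policy special case: identical log-probs give ratio 1 everywhere,
# recovering the plain policy-gradient loss -mean(advantages).
loss = off_policy_pg_loss(np.zeros(3), np.zeros(3), np.ones(3))
```

The harder question the tweet addresses is how far the behavior policy can drift (256 model updates of staleness) before such corrections stop working, which is what M2PO's specific mechanism targets.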
Harry Dong @Real_HDong ·
8/🧵 ✨ Key takeaway: by treating LLM features for parallel scaling as a single tensor unit instead of independent slices, each response can give info to and take info from the other responses, improving both individual response quality AND response set quality while maintaining total parallelism.
1 reply · 0 reposts · 0 likes · 135 views
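A minimal sketch of the "single tensor unit" idea (the mixing operator here is my assumption, not Bridge's actual layer): stack the per-response features into one tensor and add a learned residual step over the response axis, so every response exchanges information with the others at each decode step.

```python
import numpy as np

rng = np.random.default_rng(0)
P, d = 4, 8                      # 4 parallel responses, hidden size 8
h = rng.standard_normal((P, d))  # per-response features at one decode step

# Independent slices: each response would only ever see its own row of h.
# Single tensor unit: a hypothetical learned cross-response mixing step
# lets information flow between all P responses without serializing them.
W_mix = rng.standard_normal((P, P)) * 0.1  # hypothetical mixing weights
h_bridged = h + W_mix @ h                  # residual cross-response exchange
```

The residual form keeps each response's own features intact while adding a weighted combination of the others, and the (P, d) shape is unchanged, so the surrounding per-response computation stays fully parallel.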
Harry Dong retweeted
Infini-AI-Lab @InfiniAILab ·
🔥 We introduce Multiverse, a new generative modeling framework for adaptive and lossless parallel generation. 🚀 Multiverse is the first open-source non-AR model to achieve AIME24 and AIME25 scores of 54% and 46%.
🌐 Website: multiverse4fm.github.io
🧵 1/n
[GIF]
6 replies · 76 reposts · 221 likes · 120.5K views
Harry Dong retweeted
Infini-AI-Lab @InfiniAILab ·
🚀 RAG vs. Long-Context LLMs: The Real Battle ⚔️
🤯 Turns out, simple-to-build RAG can match million-dollar long-context LLMs (LC LLMs) on most existing benchmarks. 🤡 So, do we even need long-context models? YES. Because today's benchmarks are flawed:
⛳ Too simple: over-reliant on retrieval & QA.
⛳ Detectable noise: RAG can filter out filler text easily.
⛳ Too few: high-quality data requires huge human effort.
🔭 With LC LLMs hitting the ceiling, we need a benchmark that justifies their insane training costs.
🔥 Introducing 🐭🐷 GSM-Infinite, our synthetic long-context reasoning benchmark built to push LLMs to their real limits.
💎 Infinitely scalable in reasoning complexity & quantity
💎 Precise control over reasoning complexity
💎 Fully customizable, RAG-proof context lengths
🚀 [1/n]
📄 Paper: arxiv.org/abs/2502.05252
🖥️ Code: github.com/Infini-AI-Lab/…
🤗 Hugging Face datasets: huggingface.co/collections/In…
🏃 Leaderboard: huggingface.co/spaces/InfiniA…
[image]
6 replies · 37 reposts · 188 likes · 98.4K views
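A toy construction showing why a synthetic benchmark can be "infinitely scalable" with "precision control over reasoning complexity" (my illustration, not GSM-Infinite's actual generator): chain `depth` arithmetic dependencies so the reasoning depth is set exactly by one parameter and the ground-truth answer is known by construction.

```python
import random

def make_chain_problem(depth, seed=0):
    # Each step depends on the previous one, so answering requires
    # following exactly `depth` reasoning hops; the answer is tracked
    # during generation, so no human annotation is needed.
    rng = random.Random(seed)
    value = rng.randint(1, 9)
    steps = [f"Quantity 0 is {value}."]
    for i in range(depth):
        add = rng.randint(1, 9)
        steps.append(f"Quantity {i + 1} is quantity {i} plus {add}.")
        value += add
    return " ".join(steps), value

problem, answer = make_chain_problem(depth=5)
```

Because every statement is load-bearing, padding the context with more such chains (rather than detectable filler text) is what would make a benchmark like this resistant to RAG-style filtering.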