Jiaming Tang

32 posts

@jmtang42

Ph.D. student @MIT. I am interested in MLSys & Algo.

Cambridge, MA · Joined January 2023
365 Following · 537 Followers
Jiaming Tang retweeted
Zhijian Liu@zhijianliu_·
ParoQuant just got a big upgrade 🚀
✅ Supports the new Qwen3.5 models
⚡ Now runs on MLX (fast local inference on Apple Silicon)
🧠 Preserves reasoning quality with 4-bit quantization
We also built an agent demo running locally on my 4-year-old M2 Max. Can't wait to upgrade to an M5 Max and see what kind of magic we can do. ✨
Zhijian Liu@zhijianliu_

Reasoning LLMs generate very long chains-of-thought, so even small quantization errors add up. With AWQ, Qwen3-4B drops 71.0 → 68.2 on MMLU-Pro (~4% relative loss). 😬
ParoQuant fixes this! It keeps only the critical rotation pairs and fuses everything into a single kernel. Recovers most of the lost reasoning accuracy with minimal overhead, so 4-bit models stay strong at reasoning. 💪💪
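A rough sketch of the rotation-pair idea in pure Python (all names here are hypothetical, not the actual ParoQuant API; the real system selects which channel pairs to rotate and fuses the inverse rotation into the dequantization kernel):

```python
import math

def quantize_4bit(xs):
    # symmetric 4-bit quantization of one channel (toy version)
    scale = max(abs(x) for x in xs) / 7 or 1.0
    return [max(-8, min(7, round(x / scale))) * scale for x in xs]

def paired_rotation_quant(a, b, theta):
    # rotate the channel pair to balance magnitudes / suppress outliers
    c, s = math.cos(theta), math.sin(theta)
    ra = [c * x + s * y for x, y in zip(a, b)]
    rb = [-s * x + c * y for x, y in zip(a, b)]
    # quantize in the rotated space, where the error is smaller
    qa, qb = quantize_4bit(ra), quantize_4bit(rb)
    # undo the rotation (fused into the matmul kernel in the real system)
    out_a = [c * x - s * y for x, y in zip(qa, qb)]
    out_b = [s * x + c * y for x, y in zip(qa, qb)]
    return out_a, out_b
```

With theta = 0 this degenerates to plain per-channel 4-bit quantization; choosing a good angle per pair is where the accuracy recovery comes from.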

Jiaming Tang retweeted
Xialin He@Xialin_He·
Real-world loco-manipulation demands more than replaying fixed reference motions. We argue that true autonomy requires two capabilities:
1️⃣ flexibly leveraging whatever signals are available: dense references, partial cues, state estimates, or egocentric perception
2️⃣ remaining capable when any of these signals are missing or unreliable
We introduce ULTRA, an all-in-one controller for unified humanoid loco-manipulation 🤖 It supports:
• general reference tracking
• sparse goal following
• execution with motion capture
• execution with egocentric perception
🔗 Project page: ultra-humanoid.github.io
Jiaming Tang retweeted
Physical Intelligence@physical_int·
We’ve developed a memory system for our models that provides both short-term visual memory and long-term semantic memory. Our approach allows us to train robots to perform long and complex tasks, like cleaning up a kitchen or preparing a grilled cheese sandwich from scratch 👇
Jiaming Tang retweeted
Jyo Pari@jyo_pari·
As context windows grow 📈, continual learning matters more! @tianyuanzhang99 will present how to scale test-time training for effectively infinite context ♾ 🗓️ Feb 19, 3pm ET @scaleml
AK@_akhaliq·
VLASH: Real-Time VLAs via Future-State-Aware Asynchronous Inference
Jiaming Tang@jmtang42·
On the RTX 5090, VLASH reduces control latency from ~530 ms to ~30 ms, up to a 17× reduction compared to synchronous inference. On the RTX 4090 and RTX 5070, we achieve ~15× and ~9× latency reductions, respectively. This low-latency control is essential for highly dynamic tasks and high-frequency corrections on the robot.
Jiaming Tang@jmtang42·
We also add a simple trick to make robots move even faster: “quantize” robot actions for speed. VLAs are trained on very fine-grained teleop data, so they output tiny action steps that are often more precise than necessary. VLASH groups every q fine-grained actions into one coarser action, so the robot takes fewer, larger steps that follow almost the same trajectory, but much faster.
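As a toy sketch (assuming delta-style actions, where each action is a small displacement; `group_actions` is a hypothetical name, not the actual VLASH API), the grouping is just:

```python
def group_actions(actions, q):
    # merge every q consecutive fine-grained deltas into one coarse step
    # by summing them, so the trajectory's waypoints are preserved
    coarse = []
    for i in range(0, len(actions), q):
        chunk = actions[i:i + q]
        coarse.append([sum(dim) for dim in zip(*chunk)])
    return coarse
```

Summing deltas (rather than averaging) keeps the endpoints of each group on the original trajectory; the robot just reaches them in one step instead of q.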
Jiaming Tang@jmtang42·
Even large VLAs can play ping-pong in real time! 🏓⚡️
In practice, VLAs struggle with fast, dynamic tasks:
• slow reactions, jittery actions.
• demos often shown at 5-10× speed to look "smooth".
We introduce VLASH:
• future-state-aware asynchronous inference with >30Hz inference frequency for PI0.5
• drop-in to existing VLAs with no extra overhead
• enables PI0.5 / PI0 to play ping-pong and other highly dynamic tasks in real time
📄 Paper: arxiv.org/abs/2512.01031
🔧 Code: github.com/mit-han-lab/vl…
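A 1-D toy illustrating "future-state-aware" (all names hypothetical; this is a crude caricature, not the paper's method): with asynchronous inference, the action computed now only takes effect `latency` steps later, so the policy should plan against the predicted future state rather than the stale current one.

```python
def run_control(target, x0, v, latency, future_aware):
    # state drifts by v per step; the policy's correction toward `target`
    # only lands after `latency` steps of inference delay
    x = x0
    planned = x + v * latency if future_aware else x
    correction = target - planned        # what the policy outputs
    return x + v * latency + correction  # state once the action lands
```

Planning against the predicted state x + v*latency cancels the drift exactly; planning against the stale state misses by v*latency, which is precisely the jitter that grows with inference latency.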
Haotian Tang@haotiant1998·
Personal update: I am excited to share that I will join @GoogleDeepMind next week, after defending my PhD thesis @MITEECS last month. I will be working on generative models that simulate the physical world. Looking forward to the new journey ahead in 2025!
Jiaming Tang retweeted
Guangxuan Xiao@Guangxuan_Xiao·
Introducing DuoAttention: our new framework slashes both memory and latency for long-context LLMs without sacrificing performance! By applying the full KV cache only to critical heads, we achieve:
⚡ 2.55× memory reduction
⚡ 2.18× decoding speedup
⚡ 3.3M tokens on a single A100 GPU
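A minimal sketch of the per-head cache policy (`DuoKVCache` is a hypothetical class, not the released API; in the paper the retrieval-head assignment is learned offline, and non-critical "streaming" heads keep a few attention-sink tokens plus a recent window):

```python
from collections import deque

class DuoKVCache:
    def __init__(self, is_retrieval_head, sink=4, window=256):
        # retrieval heads keep everything; streaming heads keep only
        # the first `sink` tokens plus a sliding recent window
        self.full = [[] if r else None for r in is_retrieval_head]
        self.sinks = [None if r else [] for r in is_retrieval_head]
        self.recent = [None if r else deque(maxlen=window)
                       for r in is_retrieval_head]
        self.sink = sink

    def append(self, head, kv):
        if self.full[head] is not None:
            self.full[head].append(kv)
        elif len(self.sinks[head]) < self.sink:
            self.sinks[head].append(kv)
        else:
            self.recent[head].append(kv)

    def cached(self, head):
        if self.full[head] is not None:
            return list(self.full[head])
        return self.sinks[head] + list(self.recent[head])
```

Streaming heads' memory is constant in sequence length, so overall KV memory grows only with the (small) number of retrieval heads.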
Jiaming Tang@jmtang42·
🚀 Excited to introduce Quest: an efficient long-context LLM inference framework, accepted to ICML 2024! 🌟
⚡️ Quest leverages query-aware sparsity to achieve up to 2.23× end-to-end speedup for long-context LLM inference.
📄 Paper: arxiv.org/abs/2406.10774
💻 Code: github.com/mit-han-lab
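Quest's query-aware criticality estimate can be sketched as follows (hypothetical function names, not the released API; `key_min`/`key_max` are the elementwise min/max of the keys stored in each KV page):

```python
def page_score(query, key_min, key_max):
    # upper bound on q·k for any key in the page: per channel,
    # take the larger of q*k_min and q*k_max, then sum
    return sum(max(q * lo, q * hi)
               for q, lo, hi in zip(query, key_min, key_max))

def select_pages(query, pages, top_k):
    # attend only to the top_k pages with the highest bound;
    # pages is a list of (key_min, key_max) summaries
    order = sorted(range(len(pages)),
                   key=lambda i: page_score(query, *pages[i]),
                   reverse=True)
    return sorted(order[:top_k])
```

Because the bound depends on the current query, different queries select different pages, which is what distinguishes this from static KV eviction.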