Han Guo

3.4K posts

@HanGuo97

PhD Student @MIT_CSAIL | Past: @togethercompute @LTIatCMU @MITIBMLab @UNCNLP, @SFResearch, @BaiduResearch | Machine Learning, NLP.

Joined August 2016
4.4K Following · 3.8K Followers
Han Guo retweeted
Tri Dao @tri_dao
Nonlinear RNNs seem to do something genuinely different from attention and linear RNNs/SSMs. By themselves they already do quite well with the right parametrization, but just one nonlinear RNN layer substantially improves transformer-mamba/deltanet hybrids!
Mayank Mishra @MayankMish98

Introducing M²RNN: Non-Linear RNNs with Matrix-Valued States for Scalable Language Modeling

We bring back non-linear recurrence to language modeling and show it's been held back by small state sizes, not by non-linearity itself.

📄 Paper: arxiv.org/abs/2603.14360
💻 Code: github.com/open-lm-engine…
🤗 Models: huggingface.co/collections/op…

3 replies · 28 reposts · 229 likes · 19.7K views
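The tweet above argues that nonlinear recurrence was held back by small state sizes, not by nonlinearity. As a hedged toy sketch (this is not the paper's actual parametrization: the function names, the tanh update, and the outer-product write are all assumptions of this illustration), a nonlinear RNN with a matrix-valued state can look like:

```python
import numpy as np

def mrnn_step(S, x, Wk, Wv, alpha=0.9):
    """One step of a toy nonlinear RNN with a matrix-valued state.

    S      : (d, d) matrix state (d*d numbers of memory instead of d)
    x      : (d,)   input token embedding
    Wk, Wv : (d, d) projections producing a key/value from the input

    The update passes through tanh, so the recurrence is genuinely
    nonlinear in S, unlike linear SSM/DeltaNet-style affine updates.
    """
    k, v = Wk @ x, Wv @ x
    return np.tanh(alpha * S + np.outer(v, k))

def mrnn_forward(xs, d, seed=0):
    rng = np.random.default_rng(seed)
    Wk = rng.standard_normal((d, d)) / np.sqrt(d)
    Wv = rng.standard_normal((d, d)) / np.sqrt(d)
    S = np.zeros((d, d))
    outs = []
    for x in xs:
        S = mrnn_step(S, x, Wk, Wv)
        outs.append(S @ x)  # read out by querying the matrix state
    return np.stack(outs)

d = 8
xs = np.random.default_rng(1).standard_normal((5, d))
ys = mrnn_forward(xs, d)  # one output vector per input token
```

The point of the matrix state is capacity: recurrent memory scales as d² rather than d, which is the "state size" axis the paper identifies as the real bottleneck.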
Han Guo retweeted
Jyo Pari @jyo_pari
Hard problems require more than bigger models; they require effective exploration at test time. 💡

@aviral_kumar2 will present new approaches for training LMs to scale test-time exploration, including solving IMO-level math problems. 🏅

🗓️ March 19, 4pm ET @scaleml
2 replies · 5 reposts · 94 likes · 8K views
Han Guo retweeted
Kimi.ai @Kimi_Moonshot
Introducing Attention Residuals: rethinking depth-wise aggregation.

Residual connections have long relied on fixed, uniform accumulation. Inspired by the duality of time and depth, we introduce Attention Residuals, replacing standard depth-wise recurrence with learned, input-dependent attention over preceding layers.

🔹 Enables networks to selectively retrieve past representations, naturally mitigating dilution and hidden-state growth.
🔹 Introduces Block AttnRes, partitioning layers into compressed blocks to make cross-layer attention practical at scale.
🔹 Serves as an efficient drop-in replacement, demonstrating a 1.25x compute advantage with negligible (<2%) inference latency overhead.
🔹 Validated on the Kimi Linear architecture (48B total, 3B activated parameters), delivering consistent downstream performance gains.

🔗 Full report: github.com/MoonshotAI/Att…
326 replies · 2K reposts · 13.4K likes · 4.8M views
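A minimal sketch of the core idea above, assuming nothing about Moonshot's actual implementation (the single-query form, the shapes, and the random projections here are illustrative stand-ins): instead of summing all previous layer outputs with fixed uniform weights, the current hidden state forms a query and attends over the stack of preceding layers' outputs.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attn_residual(layer_outputs, h, Wq, Wk):
    """Depth-wise attention over preceding layers' outputs.

    layer_outputs : list of (d,) vectors, one per preceding layer
    h             : (d,) current hidden state, which forms the query
    Wq, Wk        : (d, d) learned projections (random stand-ins here)

    A plain residual stream would return sum(layer_outputs) with fixed,
    uniform weights; here the mixing weights are input-dependent.
    """
    H = np.stack(layer_outputs)               # (L, d): one row per layer
    q = Wq @ h
    scores = (H @ Wk.T) @ q / np.sqrt(len(h))
    w = softmax(scores)                       # learned mixing over depth
    return w @ H                              # selective retrieval

rng = np.random.default_rng(0)
d = 16
Wq, Wk = rng.standard_normal((2, d, d)) / np.sqrt(d)
outs = [rng.standard_normal(d) for _ in range(4)]  # 4 preceding layers
h = rng.standard_normal(d)
agg = attn_residual(outs, h, Wq, Wk)  # replaces the uniform residual sum
```

Because the weights come from a softmax, the aggregate stays a convex combination of past representations, which is one way to see why dilution and unbounded hidden-state growth are mitigated.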
Han Guo retweeted
Zhijian Liu @zhijianliu_
DFlash⚡ meets OpenClaw🦞 = FlashClaw. Same Claw, >4X faster or cheaper. DFlash support for Qwen3.5 is live, outperforming native MTP by up to 2.3X. More to come! 🔥
12 replies · 37 reposts · 196 likes · 19.1K views
Han Guo retweeted
Yulu Gan @yule_gan
Simply adding Gaussian noise to LLMs (one step — no iterations, no learning rate, no gradients) and ensembling them can achieve performance comparable to or even better than standard GRPO/PPO on math reasoning, coding, writing, and chemistry tasks. We call this algorithm RandOpt.

To verify that this is not limited to specific models, we tested it on Qwen, Llama, OLMo3, and VLMs.

What's behind this? We find that in the Gaussian search neighborhood around pretrained LLMs, diverse task experts are densely distributed — a regime we term Neural Thickets.

Paper: arxiv.org/pdf/2603.12228
Code: github.com/sunrainyg/Rand…
Website: thickets.mit.edu
86 replies · 431 reposts · 3K likes · 666.1K views
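The recipe as described (one Gaussian step, no gradients, then ensemble) can be sketched on a toy objective. The function below is this editor's illustration, not the released RandOpt code, and averaging the top-k candidates is just one plausible reading of "ensembling":

```python
import numpy as np

def randopt_toy(theta0, score_fn, n_samples=256, sigma=0.5, k=8, seed=0):
    """One-step Gaussian search around 'pretrained' weights theta0:
    sample perturbations, score every candidate, average the top-k.
    No iterations, no learning rate, no gradients."""
    rng = np.random.default_rng(seed)
    cands = theta0 + sigma * rng.standard_normal((n_samples, theta0.size))
    scores = np.array([score_fn(c) for c in cands])
    top = cands[np.argsort(scores)[-k:]]  # best-scoring perturbations
    return top.mean(axis=0)               # "ensemble" by averaging

# Toy task standing in for an LLM benchmark: the score is higher the
# closer we are to a hidden optimum near the initial weights.
target = np.array([0.5, -0.3, 0.2])
score = lambda th: -np.sum((th - target) ** 2)
theta0 = np.zeros(3)                      # the "pretrained" model
theta = randopt_toy(theta0, score)
```

The "Neural Thickets" claim is exactly what makes this work in the toy: good solutions are dense in the Gaussian neighborhood of the starting point, so blind sampling finds them.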
Han Guo retweeted
Seungwook Han @seungwookh
Can language models learn useful priors without ever seeing language?

We pre-pre-train transformers on neural cellular automata — fully synthetic, zero language. This improves language modeling by up to 6%, speeds up convergence by 40%, and strengthens downstream reasoning. Surprisingly, it even beats pre-pre-training on natural text!

Blog: hanseungwook.github.io/blog/nca-pre-p…

(1/n)
48 replies · 258 reposts · 1.7K likes · 239.6K views
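To make "fully synthetic, zero language" concrete, here is a sketch of cellular-automaton data generation. The blog uses *neural* cellular automata; a classic elementary CA (Rule 110) is used below as a deliberately simpler stand-in, so every name and parameter here is this sketch's assumption, not the blog's setup:

```python
import numpy as np

def ca_sequences(rule=110, width=16, steps=8, n=4, seed=0):
    """Roll out an elementary cellular automaton and flatten each run
    into a binary token stream, i.e. language-free pretraining data."""
    table = [(rule >> i) & 1 for i in range(8)]  # rule as a lookup table
    rng = np.random.default_rng(seed)
    seqs = []
    for _ in range(n):
        row = rng.integers(0, 2, width)          # random initial state
        rows = [row]
        for _ in range(steps - 1):
            left, right = np.roll(row, 1), np.roll(row, -1)
            row = np.array([table[4 * a + 2 * b + c]
                            for a, b, c in zip(left, row, right)])
            rows.append(row)
        seqs.append(np.concatenate(rows))        # grid -> 1D token stream
    return np.stack(seqs)

data = ca_sequences()  # (4, 128) binary sequences for pre-pre-training
```

A transformer trained next-token-style on such streams has to learn local rule induction and copying, which is one intuition for why the learned priors could transfer to language.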
Han Guo retweeted
Xinghong (Shin) Fu @shinfxh
just got claude to explain attention matching and it made this interactive heatmap to show the relative importance of each layer/head! this might just be better than the diagrams in our own paper...
1 reply · 5 reposts · 57 likes · 2.9K views
Han Guo retweeted
Bryan Catanzaro @ctnzr
Announcing NVIDIA Nemotron 3 Super!

💚 120B-12A Hybrid SSM Latent MoE, designed for Blackwell
💚 36 on AAIndex v4
💚 Up to 2.2X faster than GPT-OSS-120B in FP4
💚 Open data, open recipe, open weights

Models, tech report, etc. here: research.nvidia.com/labs/nemotron/…

And yes, Ultra is coming!
62 replies · 205 reposts · 1.2K likes · 200.3K views
Han Guo retweeted
Zhijian Liu @zhijianliu_
ParoQuant just got a big upgrade 🚀

✅ Supports the new Qwen3.5 models
⚡ Now runs on MLX (fast local inference on Apple Silicon)
🧠 Preserves reasoning quality with 4-bit quantization

We also built an agent demo running locally on my 4-year-old M2 Max. Can't wait to upgrade to an M5 Max and see what kind of magic we can do. ✨
Zhijian Liu @zhijianliu_

Reasoning LLMs generate very long chains-of-thought, so even small quantization errors add up. With AWQ, Qwen3-4B drops 71.0 → 68.2 on MMLU-Pro (~4% relative loss). 😬

ParoQuant fixes this! It keeps only the critical rotation pairs and fuses everything into a single kernel. It recovers most of the lost reasoning accuracy with minimal overhead — so 4-bit models stay strong at reasoning. 💪💪

14 replies · 30 reposts · 221 likes · 42.1K views
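For context on where the error that ParoQuant fights comes from, here is a minimal symmetric round-to-nearest 4-bit quantizer. This is a generic baseline of this editor's making; ParoQuant's rotation pairs and AWQ's activation-aware scaling are refinements over this kind of baseline, and neither is implemented here:

```python
import numpy as np

def quant4_dequant(W):
    """Symmetric per-row (per-output-channel) 4-bit round-to-nearest
    quantization followed by dequantization. The gap between W and the
    result is the rounding error that accumulates over long
    chains-of-thought."""
    s = np.abs(W).max(axis=1, keepdims=True) / 7.0  # scale to int4 range
    q = np.clip(np.round(W / s), -8, 7)             # 16 integer levels
    return q * s

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64))   # a toy weight matrix
W_hat = quant4_dequant(W)
err = np.abs(W - W_hat).mean()      # mean absolute rounding error
```

Each entry is off by at most half a quantization step, which is tiny per weight but compounds across layers and thousands of generated tokens, matching the MMLU-Pro drop described above.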
Han Guo retweeted
PyTorch @PyTorch
Building on the previous correctness-focused pipeline, KernelAgent can now integrate GPU hardware-performance signals into a closed-loop multi-agent workflow to guide the optimization of Triton kernels. Learn more: hubs.la/Q045Wsqq0 @KaimingCheng @marksaroufim
3 replies · 20 reposts · 90 likes · 22.5K views
Han Guo retweeted
Peter Hase @peterbhase
Can we train models to have more monitorable CoT? We introduce Counterfactual Simulation Training to improve CoT faithfulness/monitorability. CST produces models that admit to reward hacking and deferring too much to Stanford profs (@chrisgpotts told me this is very dangerous)
12 replies · 36 reposts · 208 likes · 20.6K views
Han Guo retweeted
Linlu Qiu @linluqiu
Check out the updated version of our paper and the new blog post! research.google/blog/teaching-…
Tal Linzen @tallinzen

New version of @linluqiu's heroic Google Student Research project, with a lot more experiments! I think it's a nice demonstration of why LLM fine-tuning works so well: you fine-tune the models to adapt to users by having them mimic the optimal Bayesian way to adapt, and they generalize this ability to other contexts.

2 replies · 5 reposts · 41 likes · 6.1K views
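The quoted thread describes fine-tuning models to mimic "the optimal Bayesian way to adapt". In the simplest setting that phrase has an exact form, the Beta-Bernoulli posterior update sketched below; the coin-flip setup is this sketch's choice, not necessarily the paper's task:

```python
from fractions import Fraction

def bayes_adapt(observations, a=1, b=1):
    """Optimal Bayesian adaptation to a stream of 0/1 user feedback:
    start from a Beta(a, b) prior on the unknown rate, fold in each
    observation, and return the posterior-predictive P(next = 1).
    A model trained to 'adapt like a Bayesian' should mimic this value."""
    for x in observations:
        a, b = a + x, b + (1 - x)
    return Fraction(a, a + b)  # posterior mean under Beta(a, b)

p = bayes_adapt([1, 1, 0, 1])  # → Fraction(2, 3)
```

The appeal of such a target is that it is well defined for any prefix of observations, so a sequence model can be supervised to match it step by step.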
Han Guo retweeted
Itamar Pres @PresItamar
New paper: It's time to optimize for 🔁 self-consistency 🔁

We've pushed LLMs to the limits of available data, yet failures like sycophancy and factual inconsistency persist. We argue these stem from the same assumption: that behavior can be specified one I/O pair at a time. 🧵
16 replies · 55 reposts · 424 likes · 70.6K views
Han Guo retweeted
Ted Zadouri @tedzadouri
Asymmetric hardware scaling is here. Blackwell tensor cores are now so fast that exp2 and shared memory are the wall. FlashAttention-4 changes the algorithm & pipeline so that softmax & SMEM bandwidth no longer dictate speed. Attention reaches ~1600 TFLOPs, pretty much at matmul speed!

Joint work w/ Markus Hoehnerbach, Jay Shah (@ultraproduct), Timmy Liu, Vijay Thakkar (@__tensorcore__), Tri Dao (@tri_dao)

1/
6 replies · 132 reposts · 780 likes · 219.4K views
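Why does exp2 show up as the wall? GPU special-function units expose a fast base-2 exponential, so attention kernels compute softmax via exp2 with a log2(e) rescale. The tiny sketch below shows only that identity in NumPy; it says nothing about FlashAttention-4's actual pipeline or scheduling:

```python
import numpy as np

LOG2E = 1.4426950408889634  # log2(e)

def softmax_exp2(x):
    """Row softmax via base-2 exponentials: exp(z) == exp2(z * log2(e)).
    Subtracting the row max first is the usual numerical-stability step
    that flash-attention-style kernels also perform."""
    m = x.max(axis=-1, keepdims=True)
    e = np.exp2((x - m) * LOG2E)
    return e / e.sum(axis=-1, keepdims=True)

x = np.array([[1.0, 2.0, 3.0]])
probs = softmax_exp2(x)  # matches an exp-based softmax
```

Folding the log2(e) factor into the attention scale is free, so the kernel never needs a base-e exponential at all; the tweet's point is that even this cheap exp2 path has become a bottleneck relative to Blackwell's matmul throughput.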