Fusheng Liu @mathlfs
60 posts
PhD student @ National University of Singapore [email protected]
Singapore · Joined September 2022
107 Following · 21 Followers
Fusheng Liu @mathlfs·
@ZimingLiu11 I feel that we put a lot of effort into studying architectures and optimization algorithms, but focus much less on understanding the data itself.
Fusheng Liu retweeted
Andrej Karpathy @karpathy·
I packaged up the "autoresearch" project into a new self-contained minimal repo if people would like to play over the weekend. It's basically the nanochat LLM training core stripped down to a single-GPU, one-file version of ~630 lines of code, then:
- the human iterates on the prompt (.md)
- the AI agent iterates on the training code (.py)
The goal is to engineer your agents to make the fastest research progress indefinitely and without any of your own involvement. In the image, every dot is a complete LLM training run that lasts exactly 5 minutes. The agent works in an autonomous loop on a git feature branch and accumulates git commits to the training script as it finds better settings (of lower validation loss by the end) of the neural network architecture, the optimizer, all the hyperparameters, etc. You can imagine comparing the research progress of different prompts, different agents, etc. github.com/karpathy/autor… Part code, part sci-fi, and a pinch of psychosis :)
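The loop described here can be sketched in miniature. Below is a toy simulation under my own assumptions: `evaluate` and `propose_edit` are hypothetical stand-ins for a real 5-minute training run and the agent's code edit, and "commits" are just a list (the real repo accumulates git commits on a feature branch):

```python
import random

def autoresearch_loop(evaluate, propose_edit, config, rounds=50, seed=0):
    """Toy version of the agent loop: keep a candidate only if it improves."""
    rng = random.Random(seed)
    best_loss = evaluate(config)
    commits = []  # stands in for git commits accumulated on the branch
    for step in range(rounds):
        candidate = propose_edit(config, rng)
        loss = evaluate(candidate)
        if loss < best_loss:  # lower validation loss by the end of the "run"
            config, best_loss = candidate, loss
            commits.append((step, loss))
    return config, best_loss, commits

# Toy objective: a quadratic "val loss" over one hyperparameter.
evaluate = lambda cfg: (cfg["lr"] - 0.3) ** 2 + 1.0
propose = lambda cfg, rng: {"lr": cfg["lr"] + rng.uniform(-0.1, 0.1)}
cfg, loss, commits = autoresearch_loop(evaluate, propose, {"lr": 1.0})
```

Swapping the toy objective for a real training script (and the random proposal for an LLM agent) recovers the shape of the loop the tweet describes.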
Fusheng Liu retweeted
Andrej Karpathy @karpathy·
The next step for autoresearch is that it has to be asynchronously massively collaborative for agents (think: SETI@home style). The goal is not to emulate a single PhD student, it's to emulate a research community of them.

Current code synchronously grows a single thread of commits in a particular research direction. But the original repo is more of a seed, from which could sprout commits contributed by agents on all kinds of different research directions or for different compute platforms.

Git(Hub) is *almost* but not really suited for this. It has a softly built in assumption of one "master" branch, which temporarily forks off into PRs just to merge back a bit later. I tried to prototype something super lightweight that could have a flavor of this, e.g. just a Discussion, written by my agent as a summary of its overnight run: github.com/karpathy/autor… Alternatively, a PR has the benefit of exact commits: github.com/karpathy/autor… but you'd never want to actually merge it... You'd just want to "adopt" and accumulate branches of commits.

But even in this lightweight way, you could ask your agent to first read the Discussions/PRs using GitHub CLI for inspiration, and after its research is done, contribute a little "paper" of findings back. I'm not actually exactly sure what this should look like, but it's a big idea that is more general than just the autoresearch repo specifically. Agents can in principle easily juggle and collaborate on thousands of commits across arbitrary branch structures. Existing abstractions will accumulate stress as intelligence, attention and tenacity cease to be bottlenecks.
Fusheng Liu retweeted
Tencent HY @TencentHunyuan·
One static model does not fit all😭 We just dropped our latest work: Functional Neural Memory. Instead of static models, we generate custom "parameters" for every single input.
✅ Prompt your model anytime
✅ Instant personalization
✅ Better instruction following
✅ Flexible & dynamic memory (w/o memory bank✌️)
(🧵1/6)
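The "custom parameters per input" idea can be illustrated with a toy hypernetwork-style sketch. This is my own illustration with made-up sizes and a trivial feature extractor, not the Functional Neural Memory architecture:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hidden, d_out = 8, 4, 8

# A "generator" that maps each input to its own weight matrix.
G = rng.standard_normal((d_in, d_hidden * d_out)) * 0.1

def forward(x):
    W_x = (x @ G).reshape(d_hidden, d_out)  # parameters depend on the input x
    h = np.tanh(x[:d_hidden])               # stand-in feature extractor
    return h @ W_x

y1 = forward(rng.standard_normal(d_in))
y2 = forward(rng.standard_normal(d_in))
# Different inputs are processed by different effective weights.
```

The contrast with a static model is that `W_x` is recomputed per input rather than fixed after training.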
Fusheng Liu retweeted
Andrej Karpathy @karpathy·
New art project. Train and inference GPT in 243 lines of pure, dependency-free Python. This is the *full* algorithmic content of what is needed. Everything else is just for efficiency. I cannot simplify this any further. gist.github.com/karpathy/8627f…
Fusheng Liu retweeted
Andrej Karpathy @karpathy·
nanochat can now train a GPT-2 grade LLM for <<$100 (~$73, 3 hours on a single 8XH100 node).

GPT-2 is just my favorite LLM because it's the first time the LLM stack comes together in a recognizably modern form. So it has become a bit of a weird & lasting obsession of mine to train a model to GPT-2 capability but for much cheaper, with the benefit of ~7 years of progress. In particular, I suspected it should be possible today to train one for <<$100.

Originally in 2019, GPT-2 was trained by OpenAI on 32 TPU v3 chips for 168 hours (7 days), at $8/hour/TPUv3 back then, for a total cost of approx. $43K. It achieves a 0.256525 CORE score, which is an ensemble metric introduced in the DCLM paper over 22 evaluations like ARC/MMLU/etc.

As of the last few improvements merged into nanochat (many of them originating in the modded-nanogpt repo), I can now reach a higher CORE score in 3.04 hours (~$73) on a single 8XH100 node. This is a 600X cost reduction over 7 years, i.e. the cost to train GPT-2 is falling approximately 2.5X every year. I think this is likely an underestimate because I am still finding more improvements relatively regularly and I have a backlog of more ideas to try.

A longer post with a lot of the detail of the optimizations involved and pointers on how to reproduce is here: github.com/karpathy/nanoc…

Inspired by modded-nanogpt, I also created a leaderboard for "time to GPT-2", where this first "Jan29" model is entry #1 at 3.04 hours. It will be fun to iterate on this further and I welcome help! My hope is that nanochat can grow to become a very nice/clean and tuned experimental LLM harness for prototyping ideas, for having fun, and of course for learning.

The biggest improvements that worked out of the box and produced gains right away were:
1) Flash Attention 3 kernels (faster, and the window_size kwarg allows alternating attention patterns)
2) the Muon optimizer (I tried for ~1 day to delete it and use only AdamW, and I couldn't)
3) residual pathways and skip connections gated by learnable scalars
4) value embeddings
There were many other smaller things that stack up.

Image: semi-related eye candy of deriving the scaling laws for the current nanochat model miniseries, pretty and satisfying!
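The "residual pathways gated by learnable scalars" item can be sketched as follows. This is a minimal stand-in assuming a zero-initialized scalar gate (so the branch starts as the identity); nanochat's actual module may differ:

```python
import numpy as np

class GatedResidual:
    """Computes x + gate * f(x); in a real framework `gate` is a learnable scalar."""
    def __init__(self, f, init=0.0):
        self.f = f
        self.gate = init  # would be a trainable parameter in practice

    def __call__(self, x):
        return x + self.gate * self.f(x)

block = GatedResidual(np.tanh)  # gate = 0: the layer is exactly the identity
x = np.ones(4)
```

Zero-initializing the gate lets a new branch be added without perturbing the network at initialization; training then learns how much of the branch to mix in.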
Fusheng Liu retweeted
Anthropic @AnthropicAI·
AI can make work faster, but a fear is that relying on it may make it harder to learn new skills on the job. We ran an experiment with software engineers to learn more. Coding with AI led to a decrease in mastery—but this depended on how people used it. anthropic.com/research/AI-as…
Fusheng Liu retweeted
Andrej Karpathy @karpathy·
New post: nanochat miniseries v1

The correct way to think about LLMs is that you are not optimizing for a single specific model but for a family of models controlled by a single dial (the compute you wish to spend) to achieve monotonically better results. This allows you to do careful science of scaling laws, and ultimately this is what gives you the confidence that when you pay for "the big run", the extrapolation will work and your money will be well spent.

For the first public release of nanochat my focus was on an end-to-end pipeline that runs the whole LLM stack with all of its stages. Now, after YOLOing a few runs earlier, I'm coming back around to flesh out some of the parts that I sped through, starting of course with pretraining, which is both computationally heavy and critical as the foundation of intelligence and knowledge in these models.

After locally tuning some of the hyperparameters, I swept out a number of models fixing the FLOPs budget. (For every FLOPs target you can train a small model for a long time, or a big model for a short time.) It turns out that nanochat obeys very nice scaling laws, basically reproducing the Chinchilla paper plots in baby form. Very importantly and encouragingly, the exponents on N (parameters) and D (tokens) are equal at ~0.5, so just like Chinchilla we get a single (compute-independent) constant that relates the model size to the token training horizon. In Chinchilla, this was measured to be 20. In nanochat it seems to be 8!

Once we can train compute-optimal models, I swept out a miniseries from d10 to d20, which are nanochat sizes that can do 2**19 ~= 0.5M batch sizes on an 8XH100 node without gradient accumulation. We get pretty, non-intersecting training plots for each model size. Then the fun part is relating this miniseries v1 to the GPT-2 and GPT-3 miniseries so that we know we're on the right track.

Validation loss has many issues and is not comparable, so instead I use the CORE score (from the DCLM paper). I calculated it for GPT-2 and estimated it for GPT-3, which allows us to finally put nanochat nicely on the same scale. The total cost of this miniseries is only ~$100 (~4 hours on 8XH100). These experiments give us confidence that everything is working fairly nicely and that if we pay more (turn the dial), we get increasingly better models.

TLDR: we can train compute-optimal miniseries and relate them to GPT-2/3 via objective CORE scores, but further improvements are desirable and needed. E.g., matching GPT-2 currently needs ~$500, but imo it should be possible for <$100 with more work.

Full post with a lot more detail is here: github.com/karpathy/nanoc… All of the tuning and code is pushed to master, and people can reproduce these with the scaling_laws.sh and miniseries.sh bash scripts.
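The compute-independent ratio can be turned into a simple allocation rule. Below is a sketch under the common approximation C ≈ 6·N·D FLOPs and the measured ratio D/N ≈ 8 from the post (both are assumptions on my part, not nanochat code):

```python
import math

def optimal_allocation(flops, tokens_per_param=8.0):
    """Split a FLOPs budget C = 6*N*D under the constraint D = r*N."""
    # C = 6 * N * (r * N) = 6 * r * N**2  =>  N = sqrt(C / (6 * r))
    n_params = math.sqrt(flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Example: allocating a 1e18-FLOPs budget.
N, D = optimal_allocation(1e18)
```

Plugging in Chinchilla's ratio of 20 instead gives a smaller model trained on more tokens for the same budget, which is exactly the difference the post is highlighting.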
Ethan Epperly @ethanepperly·
New blog post out! Vandermonde matrices are famously ill-conditioned, but just how bad are they? In this post, I discuss Gautschi’s 1962 bound showing that Vandermonde matrices are merely exponentially ill-conditioned ethanepperly.com/index.php/2025…
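The exponential ill-conditioning is easy to observe numerically. A quick sketch (illustrative only; the post discusses Gautschi's bounds, not this code):

```python
import numpy as np

def vandermonde_cond(points):
    V = np.vander(points, increasing=True)  # V[i, j] = points[i] ** j
    return np.linalg.cond(V)

# Equispaced nodes on [0, 1]: the condition number blows up quickly with n.
conds = [vandermonde_cond(np.linspace(0.0, 1.0, n)) for n in (5, 10, 15)]
```

The growth rate depends strongly on the node set; Chebyshev-like nodes fare far better than equispaced ones, which is part of what makes the bounds interesting.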
Fusheng Liu @mathlfs·
@hankyang94 Seems like this is related to the msign operation in the Muon optimizer?
Heng Yang @hankyang94·
Sharing a project that’s kept me excited for months: Five years ago, I tried projecting a 10000×10000 symmetric matrix onto the positive semidefinite cone using MATLAB’s eig on my MacBook—gave up out of sheer impatience. Today, we released a CUDA-based factorization-free method that projects a 10000×10000 matrix in 55 ms (FP16) and 400 ms (FP32) on NVIDIA B200 GPUs. The trick? Approximating the ReLU-induced spectral operator with composite low-degree polynomials, evaluated via pure matrix-matrix multiplies—perfect for GPUs. Proud of the students who made this possible: @ShuchengK @Hyhan0118 and Antoine. Paper: arxiv.org/abs/2507.09165
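The spectral-ReLU trick can be sketched in a few lines. The paper composes tuned low-degree polynomials; as a stand-in, the sketch below uses a plain Newton-Schulz iteration for the matrix sign (my substitution, not the authors' schedule), exploiting relu(λ) = λ(1 + sign(λ))/2 so that the projection needs only matrix-matrix multiplies:

```python
import numpy as np

def matrix_sign(A, iters=25):
    # Newton-Schulz: X <- 1.5*X - 0.5*X@X@X converges to sign(A) for
    # symmetric A once the spectrum is scaled into (-1, 1).
    X = A / (np.linalg.norm(A) + 1e-12)  # crude spectral scaling via Frobenius norm
    for _ in range(iters):
        X = 1.5 * X - 0.5 * (X @ X @ X)
    return X

def psd_project(A):
    """Factorization-free projection of a symmetric matrix onto the PSD cone."""
    A = (A + A.T) / 2
    return (A + A @ matrix_sign(A)) / 2  # relu on eigenvalues, no eig needed
```

Everything here is GEMM-shaped, which is why this family of methods maps so well onto GPUs compared with an eigendecomposition.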
Antonio Orvieto @orvieto_antonio·
We have a new SSM theory paper, just accepted to COLT, revisiting recall properties of linear RNNs. It's surprising how much one can delve into, and how beautiful it can become. With (and only thanks to) the amazing Alexandre and @BachFrancis arxiv.org/pdf/2502.09287
Fusheng Liu retweeted
Kimi.ai @Kimi_Moonshot·
🚀 Introducing our new tech report: Muon is Scalable for LLM Training
We found that the Muon optimizer can be scaled up using the following techniques:
• Adding weight decay
• Carefully adjusting the per-parameter update scale
✨ Highlights:
• ~2x computational efficiency vs AdamW
• Seamless transition from AdamW to Muon without hyper-parameter tuning
• Memory & communication efficient implementation of the distributed Muon optimizer
🎯 Based on these improvements, we introduce Moonlight: our 3B/16B MoE model trained with Muon on 5.7T tokens, advancing the Pareto frontier with better performance at fewer FLOPs!
🎁 Open-sourcing everything:
📚 Code & implementation: github.com/MoonshotAI/Moo…
🤗 Full model series (pretrained, instruction-tuned & intermediate checkpoints): huggingface.co/moonshotai
📜 Paper: github.com/MoonshotAI/Moo…
#AI #LLM #OpenSource #MoonshotAI
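A single Muon-style step can be sketched as follows. This is my simplification in NumPy (not Moonshot's distributed implementation); the quintic coefficients are the commonly cited ones from the open-source Muon code, and the decoupled weight decay mirrors the report's first technique:

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=5):
    """Approximately orthogonalize G (push its singular values toward 1)."""
    a, b, c = 3.4445, -4.7750, 2.0315  # coefficients from the public Muon code
    X = G / (np.linalg.norm(G) + 1e-7)  # Frobenius normalization
    transpose = G.shape[0] > G.shape[1]
    if transpose:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    return X.T if transpose else X

def muon_step(W, grad, momentum, lr=0.02, beta=0.95, weight_decay=0.01):
    momentum = beta * momentum + grad
    update = newton_schulz_orthogonalize(momentum)
    W = W * (1 - lr * weight_decay) - lr * update  # decoupled weight decay
    return W, momentum
```

The orthogonalization equalizes the scale of the update across directions, which is what makes a careful per-parameter update scale (the report's second technique) matter.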
Fusheng Liu retweeted
DeepSeek @deepseek_ai·
🚀 Introducing DeepSeek-V3!
Biggest leap forward yet:
⚡ 60 tokens/second (3x faster than V2!)
💪 Enhanced capabilities
🛠 API compatibility intact
🌍 Fully open-source models & papers
🐋 1/n
Tanishq Mathew Abraham, Ph.D. @iScienceLuvr·
I feel like the excitement surrounding Mamba and S4 models has diminished. Why is that the case? Did these models fail to scale? Are they harder to train?
Fusheng Liu @mathlfs·
@vaiter More precisely, GD converges to the minimal-distance solution (w.r.t. the initialization).
Samuel Vaiter @vaiter·
When optimization problems have multiple minima, algorithms favor specific solutions due to their implicit bias. For ordinary least squares (OLS), gradient descent inherently converges to the minimal norm solution among all possible solutions. fa.bianp.net/blog/2022/impl…
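This is easy to verify numerically: run gradient descent from zero on an underdetermined least-squares problem and compare against the pseudoinverse (minimum-norm) solution. Sizes, step size, and iteration count below are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 10))  # 5 equations, 10 unknowns: many exact solutions
y = rng.standard_normal(5)

w = np.zeros(10)                  # zero initialization
lr = 0.01
for _ in range(20000):
    w -= lr * X.T @ (X @ w - y)   # gradient of 0.5 * ||X w - y||^2

w_min_norm = np.linalg.pinv(X) @ y
# Gradient descent lands on the minimum-norm interpolating solution.
```

Initialized at zero, the iterates stay in the row space of X, which is why the limit is the minimum-norm solution; from a nonzero initialization, GD instead finds the solution closest to that initialization.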
Fusheng Liu @mathlfs·
@kellerjordan0 Just finished reading, very informative and digestible! Thanks for sharing.
dr. jack morris @jxmnop·
most people don't know that the original research on scaling laws came from Baidu in 2017, not OpenAI in 2020. They characterized the effects of model params and dataset tokens on loss, and also tested on images and audio. They just used LSTMs instead of Transformers, and didn't name their findings "laws".
Tomer Galanti @GalantiTomer·
🧵 1/ We use weight decay everywhere. It’s a go-to for improving generalization and stabilizing training, right? But here’s the catch: it can also make models give up on low-frequency classes (😱). Not ideal!
Fusheng Liu @mathlfs·
Highly recommend this user-friendly project if you are starting with LM pretraining and want to build your own model/optimizer. The repo is easy to understand, easy to edit, and easy to implement new ideas in with minimal work. Well done Keller! Looking forward to your records on ViT :)
Keller Jordan @kellerjordan0
I enjoy getting NanoGPT training speed records. I'm also interested in making my formulation of NanoGPT speedrunning an accessible benchmark on which other people find it easy to try new ideas. To that end, I have tried to keep the code of the current record short, and minimize its installation time. Currently it's 537 lines of code, and installs+runs in 20 minutes on a fresh 8xH100. That means the cost of a new record attempt is about $8.

I've enjoyed seeing the records that other people have gotten. @vyasnikhil96 got a new sample-efficiency record using the SOAP optimizer, and I understand he's currently working on reducing its overhead so that it can potentially compete with Muon on wallclock time in the future. @bozavlado discovered that Muon works better if the QKV weights are orthogonalized separately. And @Grad62304977 improved the record significantly using a wide range of architectural modernizations, including QK-norm. I was surprised to see that QK-norm, which from what I understand was invented to deal with instabilities that appear at large scale, also helps train faster at the small scale.

I've seen some more interesting new ideas for the speedrun be posted recently, and I'd like to encourage the researchers who came up with those ideas to also be the ones to try them out empirically. I think this makes the benchmark more reliable, if the empirical experiments are distributed across the community, rather than only me doing them.

I'm interested in two kinds of new results around this speedrun. First, of course, I'm interested in new records that improve the time to 3.28 val loss. The only rule is that you can't use external data besides Fineweb10B, and you can't use pretrained models. Beyond that, everything is fair game. Second, I'm interested in new trainings that match the current record, while being simpler.
For example, if it can be shown that we can match the current record using standard AdamW instead of the Muon optimizer, then I think that would be a very interesting result.

The log file produced by the current speedrun contains not just the timing and final loss, but also a copy of the code used to produce the run. Therefore, the only thing myself or anyone else needs to verify and reproduce a new record is its log file.

Researchers have pointed out that we shouldn't uncritically trust every result which is obtained at the 124M-parameter scale. I absolutely agree - we shouldn't blindly expect results to scale up. However, I still believe it's valuable for the community to at least have one stable small-scale benchmark. Once an idea has been clearly proven to work at small scale, it becomes relatively simple to test it at a larger scale. I think this is a better situation than the current status quo, where every LM training paper seems to use a different benchmark, making it challenging for the community to evaluate new ideas.

The only exception to this evaluation system would be ideas that only work at large scale, and so can't be demonstrated in a small-scale benchmark. These do exist, but I believe they are less common in the recent literature than ideas which are also supposed to work at the 124M-parameter scale, which we should be able to efficiently evaluate using a stable and competitive small-scale benchmark.

If the interest in this benchmark stays strong, I am hopeful that some very interesting things can come out of it. Thanks for your interest, Keller
