Fusheng Liu @mathlfs
60 posts
PhD student @ National University of Singapore [email protected]
Singapore · Joined September 2022
107 Following · 21 Followers
Fusheng Liu @mathlfs·
@ZimingLiu11 I feel that we put a lot of effort into studying architectures and optimization algorithms, but focus much less on understanding the data itself.
Fusheng Liu retweeted
Andrej Karpathy @karpathy·
I packaged up the "autoresearch" project into a new self-contained minimal repo if people would like to play over the weekend. It's basically the nanochat LLM training core stripped down to a single-GPU, one-file version of ~630 lines of code, then:
- the human iterates on the prompt (.md)
- the AI agent iterates on the training code (.py)
The goal is to engineer your agents to make the fastest research progress indefinitely and without any of your own involvement. In the image, every dot is a complete LLM training run that lasts exactly 5 minutes. The agent works in an autonomous loop on a git feature branch and accumulates git commits to the training script as it finds better settings (of lower validation loss by the end) of the neural network architecture, the optimizer, all the hyperparameters, etc. You can imagine comparing the research progress of different prompts, different agents, etc. github.com/karpathy/autor… Part code, part sci-fi, and a pinch of psychosis :)
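The loop described here can be sketched in miniature. Below is a toy simulation under my own assumptions: `evaluate` and `propose_edit` are hypothetical stand-ins for a real 5-minute training run and the agent's code edit, and "commits" are just a list (the real repo accumulates git commits on a feature branch):

```python
import random

def autoresearch_loop(evaluate, propose_edit, config, rounds=50, seed=0):
    """Toy version of the agent loop: keep a candidate only if it improves."""
    rng = random.Random(seed)
    best_loss = evaluate(config)
    commits = []  # stands in for git commits accumulated on the branch
    for step in range(rounds):
        candidate = propose_edit(config, rng)
        loss = evaluate(candidate)
        if loss < best_loss:  # lower validation loss by the end of the "run"
            config, best_loss = candidate, loss
            commits.append((step, loss))
    return config, best_loss, commits

# Toy objective: a quadratic "val loss" over one hyperparameter.
evaluate = lambda cfg: (cfg["lr"] - 0.3) ** 2 + 1.0
propose = lambda cfg, rng: {"lr": cfg["lr"] + rng.uniform(-0.1, 0.1)}
cfg, loss, commits = autoresearch_loop(evaluate, propose, {"lr": 1.0})
```

Swapping the toy objective for a real training script (and the random proposal for an LLM agent) recovers the shape of the loop the tweet describes.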
Fusheng Liu retweeted
Andrej Karpathy @karpathy·
The next step for autoresearch is that it has to be asynchronously massively collaborative for agents (think: SETI@home style). The goal is not to emulate a single PhD student, it's to emulate a research community of them.

Current code synchronously grows a single thread of commits in a particular research direction. But the original repo is more of a seed, from which could sprout commits contributed by agents on all kinds of different research directions or for different compute platforms.

Git(Hub) is *almost* but not really suited for this. It has a softly built in assumption of one "master" branch, which temporarily forks off into PRs just to merge back a bit later. I tried to prototype something super lightweight that could have a flavor of this, e.g. just a Discussion, written by my agent as a summary of its overnight run: github.com/karpathy/autor… Alternatively, a PR has the benefit of exact commits: github.com/karpathy/autor… but you'd never want to actually merge it... You'd just want to "adopt" and accumulate branches of commits.

But even in this lightweight way, you could ask your agent to first read the Discussions/PRs using GitHub CLI for inspiration, and after its research is done, contribute a little "paper" of findings back. I'm not actually exactly sure what this should look like, but it's a big idea that is more general than just the autoresearch repo specifically. Agents can in principle easily juggle and collaborate on thousands of commits across arbitrary branch structures. Existing abstractions will accumulate stress as intelligence, attention and tenacity cease to be bottlenecks.
Fusheng Liu retweeted
Tencent HY @TencentHunyuan·
One static model does not fit all😭 We just dropped our latest work: Functional Neural Memory. Instead of static models, we generate custom "parameters" for every single input.
✅ Prompt your model anytime
✅ Instant personalization
✅ Better instruction following
✅ Flexible & dynamic memory (w/o memory bank✌️)
(🧵1/6)
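The "custom parameters per input" idea can be illustrated with a toy hypernetwork-style sketch. This is my own illustration with made-up sizes and a trivial feature extractor, not the Functional Neural Memory architecture:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hidden, d_out = 8, 4, 8

# A "generator" that maps each input to its own weight matrix.
G = rng.standard_normal((d_in, d_hidden * d_out)) * 0.1

def forward(x):
    W_x = (x @ G).reshape(d_hidden, d_out)  # parameters depend on the input x
    h = np.tanh(x[:d_hidden])               # stand-in feature extractor
    return h @ W_x

y1 = forward(rng.standard_normal(d_in))
y2 = forward(rng.standard_normal(d_in))
# Different inputs are processed by different effective weights.
```

The contrast with a static model is that `W_x` is recomputed per input rather than fixed after training.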
Fusheng Liu retweeted
Andrej Karpathy @karpathy·
New art project. Train and inference GPT in 243 lines of pure, dependency-free Python. This is the *full* algorithmic content of what is needed. Everything else is just for efficiency. I cannot simplify this any further. gist.github.com/karpathy/8627f…
Fusheng Liu retweeted
Andrej Karpathy @karpathy·
nanochat can now train a GPT-2 grade LLM for <<$100 (~$73, 3 hours on a single 8XH100 node).

GPT-2 is just my favorite LLM because it's the first time the LLM stack comes together in a recognizably modern form. So it has become a bit of a weird & lasting obsession of mine to train a model to GPT-2 capability but for much cheaper, with the benefit of ~7 years of progress. In particular, I suspected it should be possible today to train one for <<$100.

Originally in 2019, GPT-2 was trained by OpenAI on 32 TPU v3 chips for 168 hours (7 days), at $8/hour/TPUv3 back then, for a total cost of approx. $43K. It achieves a 0.256525 CORE score, which is an ensemble metric introduced in the DCLM paper over 22 evaluations like ARC/MMLU/etc.

As of the last few improvements merged into nanochat (many of them originating in the modded-nanogpt repo), I can now reach a higher CORE score in 3.04 hours (~$73) on a single 8XH100 node. This is a 600X cost reduction over 7 years, i.e. the cost to train GPT-2 is falling approximately 2.5X every year. I think this is likely an underestimate because I am still finding more improvements relatively regularly and I have a backlog of more ideas to try.

A longer post with a lot of the detail of the optimizations involved and pointers on how to reproduce is here: github.com/karpathy/nanoc…

Inspired by modded-nanogpt, I also created a leaderboard for "time to GPT-2", where this first "Jan29" model is entry #1 at 3.04 hours. It will be fun to iterate on this further and I welcome help! My hope is that nanochat can grow to become a very nice/clean and tuned experimental LLM harness for prototyping ideas, for having fun, and of course for learning.

The biggest improvements that worked out of the box and produced gains right away were:
1) Flash Attention 3 kernels (faster, and the window_size kwarg allows alternating attention patterns)
2) the Muon optimizer (I tried for ~1 day to delete it and use only AdamW, and I couldn't)
3) residual pathways and skip connections gated by learnable scalars
4) value embeddings
There were many other smaller things that stack up.

Image: semi-related eye candy of deriving the scaling laws for the current nanochat model miniseries, pretty and satisfying!
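The "residual pathways gated by learnable scalars" item can be sketched as follows. This is a minimal stand-in assuming a zero-initialized scalar gate (so the branch starts as the identity); nanochat's actual module may differ:

```python
import numpy as np

class GatedResidual:
    """Computes x + gate * f(x); in a real framework `gate` is a learnable scalar."""
    def __init__(self, f, init=0.0):
        self.f = f
        self.gate = init  # would be a trainable parameter in practice

    def __call__(self, x):
        return x + self.gate * self.f(x)

block = GatedResidual(np.tanh)  # gate = 0: the layer is exactly the identity
x = np.ones(4)
```

Zero-initializing the gate lets a new branch be added without perturbing the network at initialization; training then learns how much of the branch to mix in.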
Fusheng Liu retweeted
Anthropic @AnthropicAI·
AI can make work faster, but a fear is that relying on it may make it harder to learn new skills on the job. We ran an experiment with software engineers to learn more. Coding with AI led to a decrease in mastery—but this depended on how people used it. anthropic.com/research/AI-as…
Fusheng Liu retweeted
Andrej Karpathy @karpathy·
New post: nanochat miniseries v1

The correct way to think about LLMs is that you are not optimizing for a single specific model but for a family of models controlled by a single dial (the compute you wish to spend) to achieve monotonically better results. This allows you to do careful science of scaling laws, and ultimately this is what gives you the confidence that when you pay for "the big run", the extrapolation will work and your money will be well spent.

For the first public release of nanochat my focus was on an end-to-end pipeline that runs the whole LLM stack with all of its stages. Now, after YOLOing a few runs earlier, I'm coming back around to flesh out some of the parts that I sped through, starting of course with pretraining, which is both computationally heavy and critical as the foundation of intelligence and knowledge in these models.

After locally tuning some of the hyperparameters, I swept out a number of models fixing the FLOPs budget. (For every FLOPs target you can train a small model for a long time, or a big model for a short time.) It turns out that nanochat obeys very nice scaling laws, basically reproducing the Chinchilla paper plots in baby form. Very importantly and encouragingly, the exponents on N (parameters) and D (tokens) are equal at ~0.5, so just like Chinchilla we get a single (compute-independent) constant that relates the model size to the token training horizon. In Chinchilla, this was measured to be 20. In nanochat it seems to be 8!

Once we can train compute-optimal models, I swept out a miniseries from d10 to d20, which are nanochat sizes that can do 2**19 ~= 0.5M batch sizes on an 8XH100 node without gradient accumulation. We get pretty, non-intersecting training plots for each model size. Then the fun part is relating this miniseries v1 to the GPT-2 and GPT-3 miniseries so that we know we're on the right track.

Validation loss has many issues and is not comparable, so instead I use the CORE score (from the DCLM paper). I calculated it for GPT-2 and estimated it for GPT-3, which allows us to finally put nanochat nicely on the same scale. The total cost of this miniseries is only ~$100 (~4 hours on 8XH100). These experiments give us confidence that everything is working fairly nicely and that if we pay more (turn the dial), we get increasingly better models.

TLDR: we can train compute-optimal miniseries and relate them to GPT-2/3 via objective CORE scores, but further improvements are desirable and needed. E.g., matching GPT-2 currently needs ~$500, but imo it should be possible for <$100 with more work.

Full post with a lot more detail is here: github.com/karpathy/nanoc… All of the tuning and code is pushed to master, and people can reproduce these with the scaling_laws.sh and miniseries.sh bash scripts.
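The compute-independent ratio can be turned into a simple allocation rule. Below is a sketch under the common approximation C ≈ 6·N·D FLOPs and the measured ratio D/N ≈ 8 from the post (both are assumptions on my part, not nanochat code):

```python
import math

def optimal_allocation(flops, tokens_per_param=8.0):
    """Split a FLOPs budget C = 6*N*D under the constraint D = r*N."""
    # C = 6 * N * (r * N) = 6 * r * N**2  =>  N = sqrt(C / (6 * r))
    n_params = math.sqrt(flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Example: allocating a 1e18-FLOPs budget.
N, D = optimal_allocation(1e18)
```

Plugging in Chinchilla's ratio of 20 instead gives a smaller model trained on more tokens for the same budget, which is exactly the difference the post is highlighting.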
Ethan Epperly @ethanepperly·
New blog post out! Vandermonde matrices are famously ill-conditioned, but just how bad are they? In this post, I discuss Gautschi’s 1962 bound showing that Vandermonde matrices are merely exponentially ill-conditioned ethanepperly.com/index.php/2025…
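The exponential ill-conditioning is easy to observe numerically. A quick sketch (illustrative only; the post discusses Gautschi's bounds, not this code):

```python
import numpy as np

def vandermonde_cond(points):
    V = np.vander(points, increasing=True)  # V[i, j] = points[i] ** j
    return np.linalg.cond(V)

# Equispaced nodes on [0, 1]: the condition number blows up quickly with n.
conds = [vandermonde_cond(np.linspace(0.0, 1.0, n)) for n in (5, 10, 15)]
```

The growth rate depends strongly on the node set; Chebyshev-like nodes fare far better than equispaced ones, which is part of what makes the bounds interesting.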
Fusheng Liu @mathlfs·
@hankyang94 Seems like this is related to the msign operation in the Muon optimizer?
Heng Yang @hankyang94·
Sharing a project that’s kept me excited for months: Five years ago, I tried projecting a 10000×10000 symmetric matrix onto the positive semidefinite cone using MATLAB’s eig on my MacBook—gave up out of sheer impatience. Today, we released a CUDA-based factorization-free method that projects a 10000×10000 matrix in 55 ms (FP16) and 400 ms (FP32) on NVIDIA B200 GPUs. The trick? Approximating the ReLU-induced spectral operator with composite low-degree polynomials, evaluated via pure matrix-matrix multiplies—perfect for GPUs. Proud of the students who made this possible: @ShuchengK @Hyhan0118 and Antoine. Paper: arxiv.org/abs/2507.09165
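The spectral-ReLU trick can be sketched in a few lines. The paper composes tuned low-degree polynomials; as a stand-in, the sketch below uses a plain Newton-Schulz iteration for the matrix sign (my substitution, not the authors' schedule), exploiting relu(λ) = λ(1 + sign(λ))/2 so that the projection needs only matrix-matrix multiplies:

```python
import numpy as np

def matrix_sign(A, iters=25):
    # Newton-Schulz: X <- 1.5*X - 0.5*X@X@X converges to sign(A) for
    # symmetric A once the spectrum is scaled into (-1, 1).
    X = A / (np.linalg.norm(A) + 1e-12)  # crude spectral scaling via Frobenius norm
    for _ in range(iters):
        X = 1.5 * X - 0.5 * (X @ X @ X)
    return X

def psd_project(A):
    """Factorization-free projection of a symmetric matrix onto the PSD cone."""
    A = (A + A.T) / 2
    return (A + A @ matrix_sign(A)) / 2  # relu on eigenvalues, no eig needed
```

Everything here is GEMM-shaped, which is why this family of methods maps so well onto GPUs compared with an eigendecomposition.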
Antonio Orvieto @orvieto_antonio·
We have a new SSM theory paper, just accepted to COLT, revisiting recall properties of linear RNNs. It's surprising how much one can delve into, and how beautiful it can become. With (and only thanks to) the amazing Alexandre and @BachFrancis arxiv.org/pdf/2502.09287
Fusheng Liu retweeted
Kimi.ai @Kimi_Moonshot·
🚀 Introducing our new tech report: Muon is Scalable for LLM Training
We found that the Muon optimizer can be scaled up using the following techniques:
• Adding weight decay
• Carefully adjusting the per-parameter update scale
✨ Highlights:
• ~2x computational efficiency vs AdamW
• Seamless transition from AdamW to Muon without hyper-parameter tuning
• Memory & communication efficient implementation of the distributed Muon optimizer
🎯 Based on these improvements, we introduce Moonlight: our 3B/16B MoE model trained with Muon on 5.7T tokens, advancing the Pareto frontier with better performance at fewer FLOPs!
🎁 Open-sourcing everything:
📚 Code & implementation: github.com/MoonshotAI/Moo…
🤗 Full model series (pretrained, instruction-tuned & intermediate checkpoints): huggingface.co/moonshotai
📜 Paper: github.com/MoonshotAI/Moo…
#AI #LLM #OpenSource #MoonshotAI
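A single Muon-style step can be sketched as follows. This is my simplification in NumPy (not Moonshot's distributed implementation); the quintic coefficients are the commonly cited ones from the open-source Muon code, and the decoupled weight decay mirrors the report's first technique:

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=5):
    """Approximately orthogonalize G (push its singular values toward 1)."""
    a, b, c = 3.4445, -4.7750, 2.0315  # coefficients from the public Muon code
    X = G / (np.linalg.norm(G) + 1e-7)  # Frobenius normalization
    transpose = G.shape[0] > G.shape[1]
    if transpose:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    return X.T if transpose else X

def muon_step(W, grad, momentum, lr=0.02, beta=0.95, weight_decay=0.01):
    momentum = beta * momentum + grad
    update = newton_schulz_orthogonalize(momentum)
    W = W * (1 - lr * weight_decay) - lr * update  # decoupled weight decay
    return W, momentum
```

The orthogonalization equalizes the scale of the update across directions, which is what makes a careful per-parameter update scale (the report's second technique) matter.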
Fusheng Liu retweeted
DeepSeek @deepseek_ai·
🚀 Introducing DeepSeek-V3!
Biggest leap forward yet:
⚡ 60 tokens/second (3x faster than V2!)
💪 Enhanced capabilities
🛠 API compatibility intact
🌍 Fully open-source models & papers
🐋 1/n
Tanishq Mathew Abraham, Ph.D. @iScienceLuvr·
I feel like the excitement surrounding Mamba and S4 models has diminished. Why is that the case? Did these models fail to scale? Are they harder to train?
Fusheng Liu @mathlfs·
@vaiter More precisely, GD converges to the minimal-distance solution (w.r.t. the initialization).
Samuel Vaiter @vaiter·
When optimization problems have multiple minima, algorithms favor specific solutions due to their implicit bias. For ordinary least squares (OLS), gradient descent inherently converges to the minimal norm solution among all possible solutions. fa.bianp.net/blog/2022/impl…
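This is easy to verify numerically: run gradient descent from zero on an underdetermined least-squares problem and compare against the pseudoinverse (minimum-norm) solution. Sizes, step size, and iteration count below are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 10))  # 5 equations, 10 unknowns: many exact solutions
y = rng.standard_normal(5)

w = np.zeros(10)                  # zero initialization
lr = 0.01
for _ in range(20000):
    w -= lr * X.T @ (X @ w - y)   # gradient of 0.5 * ||X w - y||^2

w_min_norm = np.linalg.pinv(X) @ y
# Gradient descent lands on the minimum-norm interpolating solution.
```

Initialized at zero, the iterates stay in the row space of X, which is why the limit is the minimum-norm solution; from a nonzero initialization, GD instead finds the solution closest to that initialization.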
Fusheng Liu @mathlfs·
@kellerjordan0 Just finished reading, very informative and digestible! Thanks for sharing.
dr. jack morris @jxmnop·
most people don't know that the original research on scaling laws came from Baidu in 2017, not OpenAI in 2020. They characterized the effects of model params and dataset tokens on loss, and also tested on images and audio. They just used LSTMs instead of Transformers, and didn't name their findings "laws".
Tomer Galanti @GalantiTomer·
🧵 1/ We use weight decay everywhere. It’s a go-to for improving generalization and stabilizing training, right? But here’s the catch: it can also make models give up on low-frequency classes (😱). Not ideal!
Fusheng Liu @mathlfs·
Highly recommend this user-friendly project if you are starting with LM pretraining and want to build your own model/optimizer. The repo is easy to understand, easy to edit, and easy to implement new ideas in with minimal work. Well done Keller! Looking forward to your records on ViT :)
Keller Jordan @kellerjordan0
I enjoy getting NanoGPT training speed records. I'm also interested in making my formulation of NanoGPT speedrunning an accessible benchmark on which other people find it easy to try new ideas. To that end, I have tried to keep the code of the current record short, and minimize its installation time. Currently it's 537 lines of code, and installs+runs in 20 minutes on a fresh 8xH100. That means the cost of a new record attempt is about $8.

I've enjoyed seeing the records that other people have gotten. @vyasnikhil96 got a new sample-efficiency record using the SOAP optimizer, and I understand he's currently working on reducing its overhead so that it can potentially compete with Muon on wallclock time in the future. @bozavlado discovered that Muon works better if the QKV weights are orthogonalized separately. And @Grad62304977 improved the record significantly using a wide range of architectural modernizations, including QK-norm. I was surprised to see that QK-norm, which from what I understand was invented to deal with instabilities that appear at large scale, also helps train faster at the small scale.

I've seen some more interesting new ideas for the speedrun be posted recently, and I'd like to encourage the researchers who came up with those ideas to also be the ones to try them out empirically. I think this makes the benchmark more reliable, if the empirical experiments are distributed across the community, rather than only me doing them.

I'm interested in two kinds of new results around this speedrun. First, of course, I'm interested in new records that improve the time to 3.28 val loss. The only rule is that you can't use external data besides Fineweb10B, and you can't use pretrained models. Beyond that, everything is fair game. Second, I'm interested in new trainings that match the current record, while being simpler.
For example, if it can be shown that we can match the current record using standard AdamW instead of the Muon optimizer, then I think that would be a very interesting result.

The log file produced by the current speedrun contains not just the timing and final loss, but also a copy of the code used to produce the run. Therefore, the only thing myself or anyone else needs to verify and reproduce a new record is its log file.

Researchers have pointed out that we shouldn't uncritically trust every result which is obtained at the 124M-parameter scale. I absolutely agree - we shouldn't blindly expect results to scale up. However, I still believe it's valuable for the community to at least have one stable small-scale benchmark. Once an idea has been clearly proven to work at small scale, it becomes relatively simple to test it at a larger scale. I think this is a better situation than the current status quo, where every LM training paper seems to use a different benchmark, making it challenging for the community to evaluate new ideas.

The only exception to this evaluation system would be ideas that only work at large scale, and so can't be demonstrated in a small-scale benchmark. These do exist, but I believe they are less common in the recent literature than ideas which are also supposed to work at the 124M-parameter scale, which we should be able to efficiently evaluate using a stable and competitive small-scale benchmark.

If the interest in this benchmark stays strong, I am hopeful that some very interesting things can come out of it. Thanks for your interest, Keller
