Shengding Hu (@DeanHu11) - Twitter Profile

@thinkymachines Great research blog that explains every detail clearly! 🥰 I notice a tiny mistake in the code snippet 3, where there should be 103 unique results. Either using more iterations or enumerating permutations can add one more result.

English

0

3

264

Thinking Machines@thinkymachines·10 Sep

Today Thinking Machines Lab is launching our research blog, Connectionism. Our first blog post is “Defeating Nondeterminism in LLM Inference”. We believe that science is better when shared. Connectionism will cover topics as varied as our research is: from kernel numerics to prompt engineering. Here we share what we are working on and connect with the research community frequently and openly. The name Connectionism is a throwback to an earlier era of AI; it was the name of the subfield in the 1980s that studied neural networks and their similarity to biological brains. thinkingmachines.ai/blog/defeating…

English

230

1.3K

7.6K

3.4M

Shengding Hu@DeanHu11·21 Aug

The most interesting benchmarking/probing this year!

henry@arithmoquine

new post. there's a lot in it. i suggest you check it out

English

0

3

24

2.3K

Shengding Hu retweeted

henry@arithmoquine·11 Aug

new post. there's a lot in it. i suggest you check it out

English

70

178

2.7K

263.3K

Shengding Hu@DeanHu11·5 Aug

Very glad to see this type of evaluation being hosted by a large platform to ensure its maintenance. One can prove that a General Game AI that can dynamically behave according to the opponent's strategy is a type of self-improving ASI.

Demis Hassabis@demishassabis

Thrilled to announce the @Kaggle Game Arena, a new leaderboard testing how modern LLMs perform on games (spoiler: not very well atm!). AI systems play each other, making it an objective & evergreen benchmark that will scale in difficulty as they improve. kaggle.com/game-arena

English

0

1

9

1.8K

Shengding Hu@DeanHu11·24 Mar

But tbh, I think even block language diffusion still has a long way to go in terms of both performance and batch-serving efficiency before they can match autoregressive models.

English

1

22

2.8K

Shengding Hu@DeanHu11·24 Mar

Thanks for discovering our paper! Seems that there is a trend! Just planned to write a blog to connect these highly similar papers. But I'm too busy recently. Autoregressive conditional block attention is all we need for unified modalities🤣

Tao HU@vtaohu

Okay, interesting. I saw a very similar idea in Vision now. They do experiments on images and videos. Even the title is almost the same. 😅😅😅 arxiv.org/abs/2412.07720

English

2

6

74

17.8K

Shengding Hu@DeanHu11·19 Mar

Excellent work on understanding pretraining loss!

Kairong Luo ✈️ ICLR2026@openhonor

🔍How does pretraining loss evolve under different LR schedules? 🌟Meet our Multi-Power Law: predicts the full loss curve for various schedules! 🌟Accurate enough to optimize LR schedules directly. 🌟Result? A WSD-like schedule that outperforms the rest! 🔥Accepted at #ICLR2025

English

0

2

18

4.1K

Shengding Hu@DeanHu11·31 Oct

Thanks for introducing our work! Check out our new observations on Mamba! Great work with @DonnyChan123. tbh, the name is one of the best that I came up with among all my papers🤣

Albert Gu@_albertgu

this is a great paper (with a great name) - clever exps on the state capacity and long context ability of SSMs arxiv.org/abs/2410.07145… strikingly, for every state size M there's a phase transition at some training context len >= K where SSMs will length-generalize robustly. this is because with context length < K, the recurrent state isn't fully utilized and so the model "overfits" during training, but once the model's state capacity is fully utilized (by training on long enough seqs), it automatically extrapolates. notably, K is linear in M (!) - suggesting there's some notion of intrinsic information content per token (there exists B such that each token in the context corresponds to B bytes of recurrent state). perhaps B is architecture dependent? conversely, worrying about length generalization in recurrent models is probably a red herring. no need to design new mechanisms or special mitigations: just train on longer sequences (which has no computation overhead since linear time!) to generalize better. takeaway: stuffing is tasty, and feed your Mambas fully!

English

0

5

26

4.9K

Shengding Hu retweeted

Albert Gu@_albertgu·31 Oct

this is a great paper (with a great name) - clever exps on the state capacity and long context ability of SSMs arxiv.org/abs/2410.07145… strikingly, for every state size M there's a phase transition at some training context len >= K where SSMs will length-generalize robustly. this is because with context length < K, the recurrent state isn't fully utilized and so the model "overfits" during training, but once the model's state capacity is fully utilized (by training on long enough seqs), it automatically extrapolates. notably, K is linear in M (!) - suggesting there's some notion of intrinsic information content per token (there exists B such that each token in the context corresponds to B bytes of recurrent state). perhaps B is architecture dependent? conversely, worrying about length generalization in recurrent models is probably a red herring. no need to design new mechanisms or special mitigations: just train on longer sequences (which has no computation overhead since linear time!) to generalize better. takeaway: stuffing is tasty, and feed your Mambas fully!

English

9

48

271

40.7K

Shengding Hu@DeanHu11·8 Oct

Love COLM, and it is my great honor to present here!

Sasha Rush@srush_nlp

Really enjoyed this presentation at COLM. Such a dense set of experiments. arxiv.org/abs/2404.06395

English

2

86

13.6K

Shengding Hu retweeted

Sasha Rush@srush_nlp·8 Oct

Really enjoyed this presentation at COLM. Such a dense set of experiments. arxiv.org/abs/2404.06395

English

3

35

276

30.9K

Shengding Hu@DeanHu11·6 Aug

Thrilled to discover that both Huggingface and Megatron have incorporated the WSD scheduler! Everyone is welcome to try it out! Finally, I've made a small contribution to LLMs.

English

0

4

53

5.7K
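
For readers unfamiliar with the WSD (Warmup-Stable-Decay) scheduler mentioned in the post above, here is a minimal sketch of the schedule's shape. It is only an illustration under stated assumptions: the function name `wsd_lr`, its parameters, and the choice of linear warmup and linear decay are hypothetical, and it is not the Huggingface or Megatron implementation.

```python
def wsd_lr(step, peak_lr, warmup_steps, stable_steps, decay_steps, min_lr=0.0):
    """Illustrative warmup-stable-decay learning rate (hypothetical helper).

    Assumes linear warmup and linear decay; real implementations may use
    other shapes for either phase. `decay_steps` must be positive.
    """
    if step < warmup_steps:
        # Warmup phase: ramp linearly up to peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    if step < warmup_steps + stable_steps:
        # Stable phase: hold the learning rate constant at its peak.
        return peak_lr
    # Decay phase: move linearly from peak_lr down to min_lr, then stay there.
    progress = min(1.0, (step - warmup_steps - stable_steps) / decay_steps)
    return peak_lr + (min_lr - peak_lr) * progress


if __name__ == "__main__":
    # Print the schedule at a few steps: 10 warmup, 100 stable, 20 decay.
    for s in (0, 9, 50, 110, 120, 130):
        print(s, wsd_lr(s, peak_lr=1.0, warmup_steps=10, stable_steps=100, decay_steps=20))
```

Because the stable phase is flat, its length can be extended without changing the earlier part of the schedule; the decay phase is simply started whenever training is about to end.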

Shengding Hu@DeanHu11·29 Jun

@Dominicliu12 @TsingYoga This seems to be a long-standing problem. On my side the gap can be as large as 40 vs 9 and 22 vs 0 (Semantic Scholar vs Google Scholar).

Chinese

1

0

44

Xiang Liu@Dominicliu12·28 Jun

@TsingYoga Can the problem of mismatched citation counts be solved? 🤦‍♂️ The same paper shows different citation counts on arXiv and on Google Scholar.

Chinese

2

0

1

209

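Yujia Qin@TsingYoga·28 Jun

A friend pointed out that one of my older papers has disappeared from Google Scholar's index (it used to be searchable on Google Scholar, it had an index entry link, and it also appeared on my personal profile). The arXiv link and Semantic Scholar are still fine at the moment. Does anyone know how to resolve this?

Chinese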

3

0

3

3.3K

Shengding Hu retweeted

Zhiyuan Liu@zibuyu9·3 Jun

😢

PrimerYang@yangzhizheng1

Shocked! Llama3-V project from a Stanford team plagiarized a lot from MiniCPM-Llama3-V 2.5! its code is a reformatting of MiniCPM-Llama3-V 2.5, and the model's behavior is highly similar to a noised version of MiniCPM-Llama3-V 2.5 checkpoint. Evidence: github.com/OpenBMB/MiniCP…

ART

2

4

24

7.2K

Shengding Hu retweeted

Zach Mueller@TheZachMueller·30 May

Super easy to use with the @huggingface trainer as well!

elie@eliebakouch

📢 I submitted my first (ever) paper! We find that with WSD LR schedule you can: 1. establish scaling laws with ~x2 savings 2. safely train a model without knowing the total training steps in advance 3. a new cooldown function outperforming linear/cosine decay 🧵 ⬇️

English

2

3

35

4.6K

Shengding Hu@DeanHu11·31 May

This is the author's post x.com/eliebakouch/st…

elie@eliebakouch

📢 I submitted my first (ever) paper! We find that with WSD LR schedule you can: 1. establish scaling laws with ~x2 savings 2. safely train a model without knowing the total training steps in advance 3. a new cooldown function outperforming linear/cosine decay 🧵 ⬇️

English

0

2

605

Shengding Hu@DeanHu11·31 May

arxiv.org/pdf/2405.18392 A cool paper that studies WSD and compares it against various schedulers including SFO (Schedule Free Optimizer) and other techniques such as Stochastic Weight Averaging (SWA). Thanks @eliebakouch for the detailed study, which greatly updates my knowledge.

English

2

7

29

4K

Shengding Hu@DeanHu11·19 May

Time for some "yolo runs"! 😉

Jason Wei@_jasonwei

An incredible skill that I have witnessed, especially at OpenAI, is the ability to make “yolo runs” work. The traditional advice in academic research is, “change one thing at a time.” This approach forces you to understand the effect of each component in your model, and therefore is a reliable way to make something work. I personally do this quite religiously. However, the downside is that it takes a long time, especially if you want to understand the interactive effects among components. A “yolo run” directly implements an ambitious new model without extensively de-risking individual components. The researcher doing the yolo run relies primarily on intuition to set hyperparameter values, decide what parts of the model matter, and anticipate potential problems. These choices are non-obvious to everyone else on the team. Yolo runs are hard to get right because many things have to go correctly for it to work, and even a single bad hyperparameter can cause your run to fail. It is probabilistically unlikely to guess most or all of them correctly. Yet multiple times I have seen someone make a yolo run work on the first or second try, resulting in a SOTA model. Such yolo runs are very impactful, as they can leapfrog the team forward when everyone else is stuck. I do not know how these researchers do it; my best guess is intuition built up from decades of running experiments, a deep understanding of what matters to make a language model successful, and maybe a little bit of divine benevolence. But what I do know is that the people who can do this are surely 10-100x AI researchers. They should be given as many GPUs as they want and be protected like unicorns.

English

0

4

666

Shengding Hu retweeted

elie@eliebakouch·15 May

@huggingface The training loss and evaluation results are very similar for different learning rate values. We also see a net decrease in the training loss when we start the decay phase as expected. Cosine best run vs WSD for the same learning rate:

English

4

1

18

3.9K

Shengding Hu@DeanHu11·16 May

Very glad to hear that! Why the decay stage works well might also have some implications on the training dynamics ...

elie@eliebakouch

@huggingface @DeanHu11 @Yikang_Shen @LightOnIO @deepseek_ai TL;DR: WSD works as well as cosine schedule. It seems to become the standard for open-source models and might (already?) be used in closed-source models 👀 Also big thanks to @LoubnaBenAllal1 and @lvwerra for helping me with these experiments 🤗

English

0

6

674
