Shengding Hu

41 posts

@DeanHu11

PhD @ Tsinghua University @deepseek_ai

Beijing, China · Joined October 2018
142 Following · 1.1K Followers
Shengding Hu@DeanHu11·
@thinkymachines Great research blog that explains every detail clearly! 🥰 I noticed a tiny mistake in code snippet 3: there should be 103 unique results. Either running more iterations or enumerating the permutations outright recovers the missing one.
0 replies · 0 reposts · 3 likes · 264 views
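(For context: the blog's snippet counts how many distinct values come out of summing the same floats in different orders. Below is a minimal sketch of that kind of check, not the blog's exact code, and the values are made up for illustration. Floating-point addition is non-associative, so order matters, and random shuffling can miss orderings that exhaustive enumeration over permutations finds.)

    import itertools
    import random

    # Floating-point addition is not associative, so the same multiset of
    # values can sum to different results depending on the order of addition.
    vals = [1e-10, 1e-5, 1e-2, 1.0]
    vals += [-v for v in vals]  # eight values whose exact sum is 0

    # Random shuffling can miss rare orderings even with many iterations...
    sampled = {sum(random.sample(vals, len(vals))) for _ in range(10_000)}

    # ...whereas enumerating all 8! = 40,320 orderings is exhaustive.
    exact = {sum(p) for p in itertools.permutations(vals)}

    print(len(sampled), len(exact))  # len(sampled) <= len(exact)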
Thinking Machines@thinkymachines·
Today Thinking Machines Lab is launching our research blog, Connectionism. Our first blog post is "Defeating Nondeterminism in LLM Inference".

We believe that science is better when shared. Connectionism will cover topics as varied as our research is: from kernel numerics to prompt engineering. Here we share what we are working on and connect with the research community frequently and openly.

The name Connectionism is a throwback to an earlier era of AI; it was the name of the subfield in the 1980s that studied neural networks and their similarity to biological brains. thinkingmachines.ai/blog/defeating…
[media]
230 replies · 1.3K reposts · 7.6K likes · 3.4M views
Shengding Hu retweeted
henry@arithmoquine·
new post. there's a lot in it. i suggest you check it out
[media]
70 replies · 178 reposts · 2.7K likes · 263.3K views
Shengding Hu@DeanHu11·
Very glad to see this type of evaluation hosted by a large platform, which ensures it stays maintained. One can argue that a general game AI that dynamically adapts to its opponent's strategy is a form of self-improving ASI.
Demis Hassabis@demishassabis

Thrilled to announce the @Kaggle Game Arena, a new leaderboard testing how modern LLMs perform on games (spoiler: not very well atm!). AI systems play each other, making it an objective & evergreen benchmark that will scale in difficulty as they improve. kaggle.com/game-arena

0 replies · 1 repost · 9 likes · 1.8K views
Shengding Hu@DeanHu11·
But tbh, I think even block language diffusion still has a long way to go, in both performance and batch-serving efficiency, before it can match autoregressive models.
1 reply · 1 repost · 22 likes · 2.8K views
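(A rough sketch of the serving trade-off behind that claim; everything here is hypothetical pseudocode, with ar_step and denoise_step standing in for real model calls, not any library's API. An autoregressive decoder pays one cached forward pass per new token; a block diffusion decoder pays several full denoising passes per block, and batched requests can sit at different denoising stages, which complicates serving.)

    def ar_decode(prompt, n_new, ar_step):
        # One forward pass per new token; with a KV cache each pass is
        # cheap, and every sequence in a batch advances in lockstep.
        seq = list(prompt)
        for _ in range(n_new):
            seq.append(ar_step(seq))
        return seq

    def block_diffusion_decode(prompt, n_new, block, n_denoise, denoise_step):
        # Several full passes per block: n_new/block blocks, n_denoise
        # passes each, with no equivalent of a one-token cached step.
        seq = list(prompt)
        for _ in range(0, n_new, block):
            blk = [0] * block  # placeholder "noise" tokens
            for _ in range(n_denoise):
                blk = denoise_step(seq, blk)
            seq += blk
        return seq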
Shengding Hu@DeanHu11·
Thanks for introducing our work! Check out our new observations on Mamba! Great work with @DonnyChan123. Tbh, the name is one of the best I've come up with across all my papers 🤣
Albert Gu@_albertgu

this is a great paper (with a great name) - clever exps on the state capacity and long context ability of SSMs arxiv.org/abs/2410.07145…

strikingly, for every state size M there's a phase transition at some training context len >= K where SSMs will length-generalize robustly

this is because with context length < K, the recurrent state isn't fully utilized and so the model "overfits" during training. but once the model's state capacity is fully utilized (by training on long enough seqs), it automatically extrapolates

notably, K is linear in M (!) - suggesting there's some notion of intrinsic information content per token (there exists B such that each token in the context corresponds to B bytes of recurrent state). perhaps B is architecture dependent?

conversely, worrying about length generalization in recurrent models is probably a red herring. no need to design new mechanisms or special mitigations: just train on longer sequences (which has no computation overhead since linear time!) to generalize better

takeaway: stuffing is tasty, and feed your Mambas fully!

0 replies · 5 reposts · 26 likes · 4.9K views
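(Restating the quantitative claim in the quoted thread as a formula: if the recurrent state holds M bytes and each context token accounts for B bytes of state, the phase-transition context length is roughly

    K \approx \frac{M}{B},

so training at lengths L >= K saturates the state and the model length-generalizes, while L < K leaves the state under-utilized and the model overfits to short contexts.)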
Shengding Hu retweeted
Sasha Rush@srush_nlp·
Really enjoyed this presentation at COLM. Such a dense set of experiments. arxiv.org/abs/2404.06395
[media]
3 replies · 35 reposts · 276 likes · 30.9K views
Shengding Hu@DeanHu11·
Thrilled to discover that both Hugging Face and Megatron have incorporated the WSD scheduler! Everyone is welcome to try it out! Finally, I've made a small contribution to LLMs.
[media]
0 replies · 4 reposts · 53 likes · 5.7K views
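(For anyone who wants to see the shape of WSD without digging through a framework, here is a minimal hand-rolled sketch using a plain PyTorch LambdaLR. The phase lengths and the linear decay are illustrative choices only; the Hugging Face and Megatron integrations expose their own options, and I believe recent transformers versions also ship a built-in WSD schedule, so check transformers.optimization for your version.)

    import torch
    from torch.optim.lr_scheduler import LambdaLR

    # Illustrative phase lengths; real runs derive these from total steps.
    WARMUP, STABLE, DECAY = 100, 800, 100

    def wsd_lambda(step: int) -> float:
        """Warmup-Stable-Decay: linear warmup, flat plateau, linear decay."""
        if step < WARMUP:                   # warmup: 0 -> peak LR
            return step / WARMUP
        if step < WARMUP + STABLE:          # stable: hold peak LR
            return 1.0
        done = step - WARMUP - STABLE       # decay: peak LR -> 0
        return max(0.0, 1.0 - done / DECAY)

    model = torch.nn.Linear(16, 16)         # stand-in model
    opt = torch.optim.AdamW(model.parameters(), lr=1e-3)
    sched = LambdaLR(opt, lr_lambda=wsd_lambda)

    for step in range(WARMUP + STABLE + DECAY):
        opt.step()                          # (forward/backward omitted)
        sched.step()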
Xiang Liu@Dominicliu12·
@TsingYoga Can the mismatched citation-count problem be fixed too 🤦‍♂️? The same paper shows different citation counts on arXiv and on Google Scholar.
2 replies · 0 reposts · 1 like · 209 views
Yujia Qin@TsingYoga·
A friend pointed out that one of my old papers has disappeared from Google Scholar's index (it used to be searchable on Google Scholar, had an index entry, and showed up in my profile). The arXiv link / Semantic Scholar entry are still fine. Asking for help: how should I resolve this?
[media]
3 replies · 0 reposts · 3 likes · 3.3K views
Shengding Hu@DeanHu11·
arxiv.org/pdf/2405.18392 A cool paper that studies WSD and compares it against various schedulers, including SFO (the Schedule-Free Optimizer), and other techniques such as Stochastic Weight Averaging (SWA). Thanks @eliebakouch for the detailed study, which greatly updates my knowledge.
[media]
2 replies · 7 reposts · 29 likes · 4K views
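(For context on the SWA baseline mentioned above: stochastic weight averaging keeps a running average of the weights late in training and evaluates the average. A minimal sketch using PyTorch's built-in torch.optim.swa_utils; the model, data, and hyperparameters are placeholders.)

    import torch
    from torch.optim.swa_utils import AveragedModel, SWALR

    model = torch.nn.Linear(16, 1)            # stand-in model
    opt = torch.optim.SGD(model.parameters(), lr=1e-2)
    swa_model = AveragedModel(model)          # running average of weights
    swa_sched = SWALR(opt, swa_lr=5e-3)       # anneal to a constant SWA LR
    swa_start = 80                            # begin averaging late in training

    for step in range(100):
        x, y = torch.randn(8, 16), torch.randn(8, 1)
        loss = torch.nn.functional.mse_loss(model(x), y)
        opt.zero_grad(); loss.backward(); opt.step()
        if step >= swa_start:
            swa_model.update_parameters(model)  # fold weights into the average
            swa_sched.step()

    # swa_model now holds the averaged weights for evaluation.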
Shengding Hu@DeanHu11·
Time for some "yolo runs"! 😉
Jason Wei@_jasonwei

An incredible skill that I have witnessed, especially at OpenAI, is the ability to make "yolo runs" work.

The traditional advice in academic research is, "change one thing at a time." This approach forces you to understand the effect of each component in your model, and therefore is a reliable way to make something work. I personally do this quite religiously. However, the downside is that it takes a long time, especially if you want to understand the interactive effects among components.

A "yolo run" directly implements an ambitious new model without extensively de-risking individual components. The researcher doing the yolo run relies primarily on intuition to set hyperparameter values, decide what parts of the model matter, and anticipate potential problems. These choices are non-obvious to everyone else on the team.

Yolo runs are hard to get right because many things have to go correctly for it to work, and even a single bad hyperparameter can cause your run to fail. It is probabilistically unlikely to guess most or all of them correctly. Yet multiple times I have seen someone make a yolo run work on the first or second try, resulting in a SOTA model. Such yolo runs are very impactful, as they can leapfrog the team forward when everyone else is stuck.

I do not know how these researchers do it; my best guess is intuition built up from decades of running experiments, a deep understanding of what matters to make a language model successful, and maybe a little bit of divine benevolence. But what I do know is that the people who can do this are surely 10-100x AI researchers. They should be given as many GPUs as they want and be protected like unicorns.

0 replies · 0 reposts · 4 likes · 666 views
Shengding Hu retweeted
elie@eliebakouch·
@huggingface The training loss and evaluation results are very similar across different learning rate values. We also see a net decrease in the training loss when we start the decay phase, as expected. Cosine best run vs WSD at the same learning rate:
[media]
4 replies · 1 repost · 18 likes · 3.9K views