Yulun Du

159 posts

@Yulun_Du

Scaling @Kimi_Moonshot prev @LTIatCMU Opinions are my own.

Pittsburgh, PA · Joined October 2012

830 Following · 2.9K Followers

Pinned Tweet

Yulun Du@Yulun_Du·3d

@ilyasut once said that an LSTM is a ResNet rotated 90 degrees. :) It turns out attention can be rotated 90 degrees too — yielding a natural generalization of residual connections. 🥳

Kimi.ai@Kimi_Moonshot

Introducing 𝑨𝒕𝒕𝒆𝒏𝒕𝒊𝒐𝒏 𝑹𝒆𝒔𝒊𝒅𝒖𝒂𝒍𝒔: Rethinking depth-wise aggregation. Residual connections have long relied on fixed, uniform accumulation. Inspired by the duality of time and depth, we introduce Attention Residuals, replacing standard depth-wise recurrence with learned, input-dependent attention over preceding layers. 🔹 Enables networks to selectively retrieve past representations, naturally mitigating dilution and hidden-state growth. 🔹 Introduces Block AttnRes, partitioning layers into compressed blocks to make cross-layer attention practical at scale. 🔹 Serves as an efficient drop-in replacement, demonstrating a 1.25x compute advantage with negligible (<2%) inference latency overhead. 🔹 Validated on the Kimi Linear architecture (48B total, 3B activated parameters), delivering consistent downstream performance gains. 🔗Full report: github.com/MoonshotAI/Att…
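To make the contrast in the announcement concrete, here is a minimal sketch of uniform accumulation versus a softmax-weighted mix over earlier layer outputs. It is an illustration only, not the method from the report: the scaled dot-product scoring, the `query` vector standing in for a learned per-layer parameter, and all function names are assumptions introduced here.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def uniform_residual(layer_outputs):
    """Standard residual stream: every earlier output is added with weight 1."""
    dim = len(layer_outputs[0])
    return [sum(h[i] for h in layer_outputs) for i in range(dim)]

def attention_residual(layer_outputs, query):
    """Softmax-weighted mix of earlier outputs (hypothetical scoring rule).

    `query` stands in for a learned per-layer vector. The weights are
    input-dependent because the scores are computed from the layer
    outputs themselves.
    """
    dim = len(query)
    scores = [dot(query, h) / math.sqrt(dim) for h in layer_outputs]
    weights = softmax(scores)
    mixed = [sum(w * h[i] for w, h in zip(weights, layer_outputs))
             for i in range(dim)]
    return mixed, weights

# Three toy "layer outputs" for a single token, hidden size 2.
outputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [0.0, 0.0]  # a zero query gives equal scores, hence equal weights

mixed, weights = attention_residual(outputs, query)
print(uniform_residual(outputs))
print(weights)
print(mixed)
```

With the zero query each earlier output gets the same weight, so the mix is the plain average; a non-zero query shifts weight toward the outputs it aligns with. Because the softmax weights sum to 1, the mix in this toy stays a convex combination of earlier outputs, whereas the uniform sum keeps growing as layers are added.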

517

56.8K

Yulun Du@Yulun_Du·9h

A must read

jianlin.su@Jianlin_S

Attention Residuals Revisited kexue.fm/archives/11664

6.3K

Yulun Du retweeted

熊师傅 weight decay 了吗@bigeagle_xd·1d

AttnRes is not just a typical "novelty paper". it stems from a much bigger project, co-designed by both model research and infra teams, with considerations that go way beyond just "lower loss" or "better expressivity". here is the "ultra think pro xhigh" part from inference infra team: zhuanlan.zhihu.com/p/201752829528… translation from k2.5: 5wvb5ya5wncq4.ok.kimi.link you can always trust the kimi solidness.

153

27.9K

Yulun Du@Yulun_Du·2d

@bigeagle_xd @giffmana Don’t want anyone to know we are that competitive

127

熊师傅 weight decay 了吗@bigeagle_xd·2d

@Yulun_Du @giffmana you missed the leaderboard part

364

Lucas Beyer (bl16)@giffmana·3d

Love this place. Just noticed someone I'm following is called: "[Chinese] weight decay [Chinese]" lol

149

86.3K

Yulun Du@Yulun_Du·2d

@YizhouLiu0 Not sure what you mean by bad. This component is data dependent.

1.3K

Yizhou Liu@YizhouLiu0·3d

Why is the baseline so bad? Chinchilla has C^-0.17 and GPT-4 looks like C^-0.13. Did they do optimal-compute?

Kimi.ai@Kimi_Moonshot

Scaling law experiments reveal a consistent 1.25× compute advantage across varying model sizes.

10.5K

Yulun Du@Yulun_Du·3d

🤓

Elon Musk@elonmusk

@_avichawla Impressive work from Kimi

1.5K

Yulun Du@Yulun_Du·3d

So insightful :) SGD is residual on weights. Now I’m wondering where else attention might be useful 🤓

Andrej Karpathy@karpathy

@Yulun_Du @ilyasut SGD is a ResNet too (the blocks of it are fwd+bwd), the residual stream is the weights so... 🤔 We're not taking the Attention is All You Need part literally enough? :D
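The analogy in this exchange can be written out side by side. The notation below ($x_l$, $f_l$, $w_t$, $\eta$, $\mathcal{L}$) is introduced here for illustration and does not come from the thread:

```latex
\begin{aligned}
x_{l+1} &= x_l + f_l(x_l) && \text{(residual block, indexed by depth } l\text{)} \\
w_{t+1} &= w_t - \eta \, \nabla_w \mathcal{L}(w_t) && \text{(SGD step, indexed by time } t\text{)}
\end{aligned}
```

In both lines the carried state ($x$ or $w$) is passed through unchanged and an update computed from it is added on top, which is the sense in which the weights play the role of the residual stream.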

238

31.7K

Yulun Du@Yulun_Du·4d

I personally love this blog. Two of my colleagues had already recommended it to me. It has great taste and deep technical merit.

xjdr@_xjdr

Noumena.com/research

113

21.8K

Yulun Du@Yulun_Du·12 Mar

Really cool work from Xinyu and the MetaClaw team. It treats continuous agent improvement as both a product loop and a model training problem — skill injection for immediate gains, RL/OPD for longer-term evolution, and a setup simple enough for real use. And it uses Kimi K2.5 :) imo this is the kind of systems thinking that agent products need. 🫡

Xinyu Yang@Xinyu2ML

With one click, you can launch your own Kimi-2.5 as a production-level personalized agent.

7.1K

Yulun Du@Yulun_Du·10 Mar

@sainingxie @ylecun @amilabs Congrats! 🥳

257

Saining Xie@sainingxie·10 Mar

i’m joining forces with @ylecun and an incredible group of people to start AMI Labs @amilabs. AMI isn’t a conventional lab. we don’t intend to become one. a lot to say about why this moment matters, but for now we’re heads down building. join us: amilabs.xyz

AMI Labs@amilabs

Advanced Machine Intelligence (AMI) is building a new breed of AI systems that understand the world, have persistent memory, can reason and plan, and are controllable and safe. We’ve raised a $1.03B (~€890M) round from global investors who believe in our vision of universally intelligent systems centered on world models. This round is co-led by Cathay Innovation, Greycroft, Hiro Capital, HV Capital, and Bezos Expeditions, along with other investors and angels across the world. We are a growing team of researchers and builders, operating in Paris, New York, Montreal and Singapore from day one. Read more: amilabs.xyz AMI - Real world. Real intelligence.

152

162

2.8K

458.3K

Yulun Du@Yulun_Du·7 Mar

@YouJiacheng Is Young Master You (游少) going to apply?

618

You Jiacheng@YouJiacheng·7 Mar

no requirement on sex?

Andy Boreham 安柏然@AndyBxxx

Unitree Robotics founder and CEO Wang Xingxing is single and looking for a partner on a Chinese dating app. His requirements have caused heated discussion online in China: No smoking, no drinking, have a good outlook, be kind, and have at least basic technological literacy.

10.6K

Yulun Du@Yulun_Du·6 Mar

@JingyuanLiu123 @Jianlin_S @clu_cheng Liu Jingyuan has won the lottery again

360

JingyuanLiu@JingyuanLiu123·6 Mar

Some updates: I've always been bullish on TML, and I actually joined TML this Monday. Looking back, I am feeling so lucky that I have the privilege to work closely with the best optimization experts on the Muon optimizer ( @Jianlin_S from Kimi and @clu_cheng from Meta). Now I am so excited to be able to work with @jxbz and build new cool things! (On the other hand, there have always been some bad rumors about Meta TBD's potential failure. That's not true! From my personal experiences, it really has the best talents in the field, and I really enjoyed learning from the lab. The avocado model will for sure be great!)

JingyuanLiu@JingyuanLiu123

hmm I sort of disagree and I am bullish for TML. I think they really really have the top talents that I admire in the field, e.g. Jeremy and Sam for optimization, Songlin for Attn, Lia for MoE, Andrew for FSDPv2, and a bunch more folks

it's just natural that it takes a while to publish good models:
- dpsk starts to publish papers in 2023, even published dpskv2 (which I think is already amazing) in mid 2024 and nobody cares, until dpskv3 and r1
- msh took 10+ months to deliver a first not bad long ctx model in 2023 and be silent for the whole 2024 year, and starts to catch up gradually in 2025
- qwen starts to be a much better model than llama until qwen2.5, mid or late 2024, while the lab has been there forever

it takes time to get infra and data done, but as long as you have good folks, and principled ways of doing science and experiments, some time or later, scaling laws will pay back

273

53.3K

Yulun Du@Yulun_Du·4 Mar

@Xinyu2ML Bro, you're seriously awesome, the Xin(yu)formation

9.6K

Xinyu Yang@Xinyu2ML·4 Mar

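Let me post, in Chinese, what was said at today's Tongyi all-hands. It feels like there is no turning this around.
1. The chief HR officer claimed that this round of adjustments is meant to bring in more talent and provide more resources.
2. Alibaba is a model company; Qwen is a group-level matter, not just the foundation-model team's. The group will build the big closed loop and needs to move fast; the organizational setup was not communicated well.
3. Qwen is the group's most important effort. They want to grow by bringing in talent, which inevitably changes the lineup; however it changes, they hope everyone does their job well. Nothing comes without a cost. Handling everything with junyang's mind alone would certainly be efficient, but from jingren's standpoint one has to consider where zhouhao can be placed most efficiently; political factors were not considered at any point. (btw, senior leadership's account yesterday was that zhouhao worried about not fitting into the Qwen team at first, so he asked to be placed under jingren for now, and leadership agreed.)
4. What we are doing is very ambitious; 100-odd people are certainly not enough, so we need to expand, and it is hard to accommodate everyone's views.
5. Wu Ma (吴妈) said China's circumstances are special and it is hard to allocate resources so that everyone is satisfied, and apologized for not learning about the resource problem earlier. Wu Ma claimed to be the CEO in China most aggressively seeking compute, said Qwen is the top priority, and said this was the most a Chinese CEO could do.
6. On resources being choked off by the group: Wu Ma said they were unaware of any blocking, that Qwen had always been the highest priority in their mind, and that the problem lay in how information was passed along.
7. jingren said resources have always been tight and that overall planning is under way, then said he too had been sidelined. He then said Alibaba Cloud being hard to use internally is due to historical reasons.
8. Then the audience asked whether junyang could come back. The chief HR officer said that no one should be put on a pedestal and that the company cannot accept irrational demands to retain someone at any cost, and asked the audience what price they thought they themselves were worth.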

232

160

1.1K

1.3M

Yulun Du@Yulun_Du·3 Mar

All the best. Alibaba is certainly making a huge mistake.

Junyang Lin@JustinLin610

me stepping down. bye my beloved qwen.

409

21.7K

Yulun Du@Yulun_Du·15 Feb

@mianmoe Nothing beats working out

Yulun Du@Yulun_Du·10 Feb

@HaoningTimothy Are you saying I'm a smart overseas student? (

115

Wu Haoning@HaoningTimothy·10 Feb

After long-run RL, a smart model inevitably picks up a tendency to code-switch

Yangyi@yangyi

opus4.6 has caught study-abroad syndrome

2.3K

Yulun Du@Yulun_Du·9 Feb

@SinclairWang1 👀

120

Zengzhi Wang@SinclairWang1·8 Feb

Launch something big

741

Yulun Du@Yulun_Du·6 Feb

The gap between developers using AI coding agents (Codex, Claude Code, Kimi CLI) and those who aren't is indeed widening fast. Don't sleep on this and try Kimi CLI (as well as others) now. :)

Greg Brockman@gdb

Software development is undergoing a renaissance in front of our eyes. If you haven't used the tools recently, you likely are underestimating what you're missing. Since December, there's been a step function improvement in what tools like Codex can do. Some great engineers at OpenAI yesterday told me that their job has fundamentally changed since December. Prior to then, they could use Codex for unit tests; now it writes essentially all the code and does a great deal of their operations and debugging. Not everyone has yet made that leap, but it's usually because of factors besides the capability of the model.

Every company faces the same opportunity now, and navigating it well, just like with cloud computing or the Internet, requires careful thought. This post shares how OpenAI is currently approaching retooling our teams towards agentic software development. We're still learning and iterating, but here's how we're thinking about it right now:

As a first step, by March 31st, we're aiming that:
(1) For any technical task, the tool of first resort for humans is interacting with an agent rather than using an editor or terminal.
(2) The default way humans utilize agents is explicitly evaluated as safe, but also productive enough that most workflows do not need additional permissions.

In order to get there, here's what we recommended to the team a few weeks ago:

1. Take the time to try out the tools. The tools do sell themselves; many people have had amazing experiences with 5.2 in Codex, after having churned from codex web a few months ago. But many people are also so busy they haven't had a chance to try Codex yet or got stuck thinking "is there any way it could do X" rather than just trying.
- Designate an "agents captain" for your team: the primary person responsible for thinking about how agents can be brought into the teams' workflow.
- Share experiences or questions in a few designated internal channels
- Take a day for a company-wide Codex hackathon

2. Create skills and AGENTS[.md].
- Create and maintain an AGENTS[.md] for any project you work on; update the AGENTS[.md] whenever the agent does something wrong or struggles with a task.
- Write skills for anything that you get Codex to do, and commit it to the skills directory in a shared repository

3. Inventory and make accessible any internal tools.
- Maintain a list of tools that your team relies on, and make sure someone takes point on making it agent-accessible (such as via a CLI or MCP server).

4. Structure codebases to be agent-first. With the models changing so fast, this is still somewhat untrodden ground, and will require some exploration.
- Write tests which are quick to run, and create high-quality interfaces between components.

5. Say no to slop. Managing AI generated code at scale is an emerging problem, and will require new processes and conventions to keep code quality high
- Ensure that some human is accountable for any code that gets merged. As a code reviewer, maintain at least the same bar as you would for human-written code, and make sure the author understands what they're submitting.

6. Work on basic infra. There's a lot of room for everyone to build basic infrastructure, which can be guided by internal user feedback. The core tools are getting a lot better and more usable, but there's a lot of infrastructure that currently go around the tools, such as observability, tracking not just the committed code but the agent trajectories that led to them, and central management of the tools that agents are able to use.

Overall, adopting tools like Codex is not just a technical but also a deep cultural change, with a lot of downstream implications to figure out. We encourage every manager to drive this with their team, and to think through other action items, for example, per item 5 above, what else can prevent a lot of "functionally-correct but poorly-maintainable code" from creeping into codebases.

1.2K
