momo
94 posts

@x0Er_go
01 | 3rd-year grad student | Algorithms | Rambling about anything, staying passionate | Staying curious :-)


Personal update: I've joined Anthropic. I think the next few years at the frontier of LLMs will be especially formative. I am very excited to join the team here and get back to R&D. I remain deeply passionate about education and plan to resume my work on it in time.

damn dario walking his walk — word is anthro stopped hiring below L6


"PRs should be prompt requests" is the most karpathy thing ive ever heard and hes completely right. when models are good enough that implementation is a commodity, the scarce resource becomes the quality of the specification. this is exactly why system design and architecture skills matter MORE in the AI era not less. the idea file concept is elegant - you share intent and constraints, the agent handles the rest. the gap between a good idea file and a bad one is the same gap between a senior and junior engineer except now its measured in output quality not code volume @karpathy

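Folks, it turns out distillation pays off this fast! No wonder the big companies are so keen on it, hahaha 😂

Apple Research just released an "ultra-simple yet strikingly effective" paper, titled simply "Embarrassingly Simple Self-Distillation Improves Code Generation".

The paper's core finding (Simple Self-Distillation, SSD for short):

You don't need:
- a better teacher model
- any verifier (correctness checker)
- RL (reinforcement learning)
- a code execution environment
- external labels or a reward model

The method is absurdly simple:
1. Have the current model sample its own code generations (with some temperature and truncation, not greedy decoding)
2. Do not filter these outputs for correctness at all
3. Run standard SFT (supervised fine-tuning) directly on these "raw" outputs

Just these three steps, and the model improves substantially!

Measured results (striking):
- Qwen3-30B-Instruct: LiveCodeBench pass@1 from 42.4% → 55.3% (a 30% relative improvement!)
- The gains are largest on hard problems: pass@5 from 31.1% → 54.1%
- Sampling once per prompt is enough
- It works across the Qwen and Llama families at 4B, 8B, and 30B scale (including instruct and thinking variants)

The paper's most insightful explanation:
Many coding models already have the "correct capability" hidden in their weights; greedy decoding just locks it away. By training on its own generations, SSD reshapes the token distribution in a context-dependent way. It suppresses distractors where precision is needed and preserves diversity at branches where exploration is needed, which releases the model's latent capability.

In one sentence:
"Many coding models are actually 'underperforming' relative to their own weights. Train one more round on the model's own outputs and you can dig out the hidden strength, with no external signal at all."

Link in the comments 👇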
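
The three steps above can be sketched on a toy problem. This is only an illustration under loud assumptions, not the paper's implementation: a single categorical distribution over six "tokens" stands in for the language model, top-k stands in for whatever truncation the paper uses, and every name and hyperparameter here (`self_distill`, `top_k=3`, `temperature=0.8`, the learning rate) is hypothetical. It also draws many samples from one context, whereas the post says one sample per prompt is enough.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities, optionally with temperature scaling."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_truncated(logits, temperature, top_k, rng):
    """Step 1: sample one token with temperature and top-k truncation."""
    probs = softmax(logits, temperature)
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    weights = [probs[i] for i in keep]
    return rng.choices(keep, weights=weights, k=1)[0]


def sft_step(logits, samples, lr):
    """Step 3: one gradient-descent step on the mean cross-entropy of the samples."""
    probs = softmax(logits)
    counts = [0] * len(logits)
    for token in samples:
        counts[token] += 1
    n = len(samples)
    # Gradient of mean cross-entropy w.r.t. logit i is p_i - empirical_freq_i.
    return [x - lr * (p - c / n) for x, p, c in zip(logits, probs, counts)]


def self_distill(logits, num_samples=2000, temperature=0.8, top_k=3,
                 lr=0.5, steps=200, seed=0):
    """Toy self-distillation: sample from the model, keep everything, fine-tune."""
    rng = random.Random(seed)
    samples = [sample_truncated(logits, temperature, top_k, rng)
               for _ in range(num_samples)]
    # Step 2: deliberately no filtering of the samples for "correctness".
    tuned = list(logits)
    for _ in range(steps):
        tuned = sft_step(tuned, samples, lr)
    return tuned


base_logits = [2.0, 1.5, 1.0, 0.5, 0.0, -0.5]  # made-up toy "model"
tuned_logits = self_distill(base_logits)
base_probs = softmax(base_logits)
tuned_probs = softmax(tuned_logits)

print("tail mass before:", sum(base_probs[3:]))
print("tail mass after: ", sum(tuned_probs[3:]))
```

In this toy setting the fine-tuned distribution moves probability away from the low-ranked tail tokens and toward the head, which loosely mirrors the "suppress distractors" half of the explanation. It cannot show the context-dependent part (keeping diversity at exploratory branches), since there is only one context.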












