ShadmanRohan

126 posts

ShadmanRohan

@RohanShadman

🎓 CS grad | 🧠 Computational Linguistics & Vision | he/him | LinkedIn: https://t.co/00NRe8lTSM

Earth · Joined July 2018
699 Following · 36 Followers
ShadmanRohan retweeted
Jason Zhu @GoSailGlobal
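At Stanford CS336, Tatsu gave a lecture on LLM architecture that takes apart every mainstream LLM of the past 3 years and looks at the template they share.

The conclusion is striking: 90% of architecture choices have converged. Pick any open-source large model and it is nearly identical to the others along these dimensions.

In the lecturer's words:
- In 2024 everyone was cosplaying Llama2
- The theme of 2025 was "how to train without blowing up"
- The theme of 2026 is "how to handle long context"

Below is the standard template for open-source LLMs in 2026. If you are training your own model, you can copy it directly.

[Architecture: 7 things that have converged]

1) Layer norm moved out of the residual stream (pre-norm)
The original Transformer put LN inside the residual stream; almost all modern models move it outside. Reason: keep your residual stream clean, so gradients backpropagate more stably.

2) RMS Norm replaces LayerNorm
LayerNorm's mean subtraction and added bias turn out not to help much. Dropping them saves only 0.17% of FLOPs but up to 25% of runtime (the bottleneck is data movement; compute is secondary).

3) All bias terms removed
Same reasoning as RMS Norm: it saves memory movement at the systems level.

4) SwiGLU or GeGLU as the activation
Nearly all modern models use a gated linear unit. Llama family / Qwen / Mistral use SwiGLU; Google models (Gemma / T5) use GeGLU. The difference is tiny; either works.

5) RoPE for position encoding
Essentially universal since 2024. The idea: rotate each pair of dimensions by an angle determined by position, so that the inner product depends only on relative position.

6) Transformer blocks in serial (not parallel)
GPT-J / PaLM tried parallel blocks; this has mostly been abandoned. The serial implementation is so well optimized that the small systems saving from parallel blocks is not worth the loss in expressiveness.

7) Layer norms can be "sprinkled"
Add an LN wherever training is unstable: before attention, after attention, or both (double norm). Many modern models do this.

[Hyperparameters: 5 numbers that have converged]

1) Feedforward dim / hidden dim
- Non-GLU models: 4x
- GLU models: 8/3 ≈ 2.67x (GLU adds an extra matrix, so this keeps the total parameter count the same)
- Llama family: 3.5x
- T5 1.0 tried 64x; T5 1.1 later went back to the standard. Don't copy it.

2) Number of heads × head dim ≈ hidden dim
Almost every model follows this; T5 is one of the few exceptions.

3) Aspect ratio (hidden dim / number of layers) ≈ 100
Too deep and pipeline parallelism gets hard; too wide and expressiveness suffers. 100 is the balance point between systems constraints and expressiveness.

4) Vocab size
Monolingual models: around 30K (the early GPT-2 kind). Multilingual / general-purpose models: 100K-200K (GPT-4 / Llama 3 / Gemma are all in this range). Modern models are almost all the latter.

5) Weight decay
Still widely used, but research finds that what it does in LLMs is really an optimizer intervention that lets you end up converging to a deeper optimum; it has little to do with "preventing overfitting" as you might assume. So don't turn it off just because "a single epoch can't overfit".

[Stability: three life-saving tricks]

The biggest fear in training a large model is the loss suddenly spiking midway, then NaN, and the whole run is lost. Modern models use three tricks to guard against this.

1) Z-loss
The normalizer of the output softmax tends to blow up. Add a (log Z)² regularization term to keep Z close to 1. DCLM / Olmo both use it.

2) QK norm
Apply an LN to each of Q and K in attention before the matrix multiply, so the softmax input always stays at unit scale. The multimodal community adopted it first; now all large models add it.

3) Logit soft cap (Google models only)
Cap the attention logits with tanh. Gemma 2/3/4 all use it, but it costs a little performance; use with care.

[Attention: two new trends]

1) GQA (Grouped Query Attention) is nearly universal
With original multi-head attention, the KV cache makes arithmetic intensity collapse to 1/h at inference. GQA shares K and V while keeping multiple Q heads: expressiveness is almost unaffected and inference cost drops by 80%. Every large model meant for production deployment now uses GQA.

2) Alternating local + global attention
A new way to handle long context. Cohere Command A started it; Llama 4 / Gemma 4 / Olmo 3 now all use it. For example, 1 in every 4 layers is full attention and the other 3 are sliding window, attending only to nearby tokens. More stable than pure SSM and far cheaper than pure full attention. (Qwen 3.5 made a variant that replaces the 3 sliding-window layers with SSM.)

Closing note: if you are training your own LLM, the set above is the "default config" for 2026. No need to reinvent it; just copy it. If you only want to read those modeling_xxx.py files on GitHub, this is enough to stop the jargon from scaring you.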
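A few of the items in the thread above (RMS Norm, RoPE, the 8/3 GLU ratio, Z-loss) are small enough to sketch in plain Python. This is only an illustrative sketch, not code from the lecture or from any model: the function names are made up here, and `eps=1e-6`, `base=10000.0` and `coeff=1e-4` are assumed defaults.

```python
import math
from fractions import Fraction


def rms_norm(x, gain, eps=1e-6):
    """RMS Norm: rescale by the root mean square; no mean subtraction, no bias."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms for g, v in zip(gain, x)]


def rope(vec, pos, base=10000.0):
    """RoPE: rotate each consecutive pair of dimensions by a position-dependent angle."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)  # slower rotation for later pairs
        c, s = math.cos(theta), math.sin(theta)
        x0, x1 = vec[i], vec[i + 1]
        out += [x0 * c - x1 * s, x0 * s + x1 * c]
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def ffn_params(d_model, ratio, gated):
    """Weight count of a bias-free feedforward block with d_ff = ratio * d_model."""
    n_matrices = 3 if gated else 2  # a GLU adds a gate projection
    return n_matrices * d_model * (ratio * d_model)


def z_loss(logits, coeff=1e-4):
    """Auxiliary penalty coeff * (log Z)^2 on the softmax normalizer Z."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return coeff * log_z ** 2


q, k = [0.3, -1.2, 0.7, 0.5], [1.0, 0.4, -0.6, 0.2]
# Both scores below use the same offset (key 2 positions before the query),
# so they agree up to floating-point error.
score_near = dot(rope(q, 5), rope(k, 3))
score_far = dot(rope(q, 12), rope(k, 10))

# A 4x plain FFN and an 8/3x gated FFN have the same number of weights.
plain = ffn_params(768, 4, gated=False)
gated = ffn_params(768, Fraction(8, 3), gated=True)
```

`Fraction(8, 3)` is used only to keep the parameter-count comparison exact; a real model would round the feedforward width to a hardware-friendly integer.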
Roan @RohOnChain

Anthropic pays $750,000+ a year for engineers who can build LLM architectures from scratch. Stanford taught the entire thing in a 1-hour lecture & released it for free. Bookmark & watch this today before someone takes it down.

Chinese
29
589
3.1K
529.4K
ShadmanRohan retweeted
Roan @RohOnChain
This 2-hour Stanford lecture shows exactly how Stanford trains its engineers to build AI systems. It's more practical than every Claude tutorial & prompting thread you've seen. Bookmark & give it 2 hours, no matter what. It'll be the most productive thing you do this weekend.
English
159
1.9K
13.7K
1.7M
ShadmanRohan retweeted
François Chollet @fchollet
AI agents will soon graduate to fully-fledged economic actors that buy services, compute, and even data in the course of accomplishing high-level goals. 1-2 years before we start seeing this at scale.
English
195
184
1.7K
263.1K
ShadmanRohan retweeted
Boris Cherny @bcherny
I'm Boris and I created Claude Code. Lots of people have asked how I use Claude Code, so I wanted to show off my setup a bit. My setup might be surprisingly vanilla! Claude Code works great out of the box, so I personally don't customize it much. There is no one correct way to use Claude Code: we intentionally build it in a way that you can use it, customize it, and hack it however you like. Each person on the Claude Code team uses it very differently. So, here goes.
English
1.3K
7K
54.6K
8.2M
ShadmanRohan @RohanShadman
2019 vs. Today. We’ve come a long way. Back then, the “gotcha” was: ask it a simple arithmetic word problem and it collapses. Today, Fields Medalists are using these models to turn research math into machine-checkable proofs.
English
0
0
0
11
ShadmanRohan retweeted
Akshay 🚀 @akshay_pachaar
When outputs are verifiable, labels become optional. Maths, code, and logic can be automatically checked and validated. Let's use this fact to build a reasoning model without manual labelling. We'll use: - @UnslothAI for parameter-efficient finetuning. - @HuggingFace TRL to apply GRPO. Let's go! 🚀
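The idea in the post above can be sketched without any training library: a reward that checks the answer automatically, and the group-relative advantage that GRPO builds from such rewards. This is a minimal sketch under stated assumptions, not the Unsloth or TRL API: `math_reward` and `group_relative_advantages` are hypothetical names, the "last number in the completion" rule is an assumed answer format, and the population standard deviation is one possible normalization choice.

```python
import re
import statistics


def math_reward(completion, expected):
    """Return 1.0 if the last number in the completion equals the expected answer."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion)
    if not numbers:
        return 0.0
    return 1.0 if float(numbers[-1]) == float(expected) else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the other samples drawn for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # assumption: population std
    return [(r - mean) / (std + eps) for r in rewards]


# Four sampled completions for one prompt whose checked answer is 12.
completions = [
    "3 * 4 = 12",
    "the answer is 15",
    "I think it is 12.0",
    "no idea",
]
rewards = [math_reward(c, "12") for c in completions]
advantages = group_relative_advantages(rewards)
```

Correct completions end up with positive advantages and incorrect ones with negative advantages, with no human-written label involved beyond the checkable answer.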
English
3
8
78
47.7K
ShadmanRohan retweeted
Paata Ivanisvili @PI010101
GPT-5 Pro found a counterexample to the NICD-with-erasures majority optimality (Simons list, p.25). simons.berkeley.edu/sites/default/… At p=0.4, n=5, f(x) = sign(x_1-3x_2+x_3-x_4+3x_5) gives E|f(x)|=0.43024 vs best majority 0.42904.
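The two figures in the post can be checked by brute force. The sketch below assumes one reading of the setup: each coordinate of a uniform x in {-1, 1}^5 survives independently with probability p = 0.4 and is erased otherwise, and f(z) means the average of f over the erased coordinates (the multilinear extension). The helper names are illustrative and not taken from the linked list.

```python
from itertools import product


def sign(v):
    return 1 if v > 0 else -1


def expected_abs_extension(f, n, p_keep):
    """E_z |E[f(x) | z]| where each coordinate is kept with probability p_keep."""
    total = 0.0
    for kept in product((False, True), repeat=n):
        kept_idx = [i for i in range(n) if kept[i]]
        free_idx = [i for i in range(n) if not kept[i]]
        k = len(kept_idx)
        pattern_prob = p_keep ** k * (1 - p_keep) ** (n - k)
        abs_sum = 0.0
        for kept_vals in product((-1, 1), repeat=k):
            x = [0] * n
            for i, v in zip(kept_idx, kept_vals):
                x[i] = v
            # Average f over every completion of the erased coordinates.
            cond_sum = 0
            for free_vals in product((-1, 1), repeat=n - k):
                for i, v in zip(free_idx, free_vals):
                    x[i] = v
                cond_sum += f(x)
            abs_sum += abs(cond_sum) / 2 ** (n - k)
        total += pattern_prob * abs_sum / 2 ** k
    return total


def majority(x):
    return sign(sum(x))


def counterexample(x):
    return sign(x[0] - 3 * x[1] + x[2] - x[3] + 3 * x[4])


maj_value = expected_abs_extension(majority, 5, 0.4)
ltf_value = expected_abs_extension(counterexample, 5, 0.4)
print(round(maj_value, 5), round(ltf_value, 5))  # 0.42904 0.43024
```

Under this reading the weighted threshold function edges out the 5-bit majority by about 0.0012, matching the numbers quoted in the post.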
English
55
211
1.5K
772.5K
ShadmanRohan @RohanShadman
Current LLMs are hitting the ceiling on “more tokens = better thinking.” A promising direction is procedural memory over ever-longer chains of thought—capturing recurring reasoning as reusable behaviors. Think smarter, not just longer. #AI #LLM #Reasoning #Efficiency #MLOps
English
0
0
0
16
ShadmanRohan @RohanShadman
New from Google Research✨: Learn Your Way🎒 Upload a 📚textbook/PDF → Interactive Lessons 🧭Mind maps ⚡Quizzes 🎧Audio lessons 📊11 percentage points better retention (78% vs. 67%) vs. a digital reader. 𝗧𝗵𝗼𝘂𝗴𝗵𝘁𝗳𝘂𝗹 𝗱𝗲𝘀𝗶𝗴𝗻 > 𝗺𝗼𝗿𝗲 𝘀𝗰𝗿𝗲𝗲𝗻 𝘁𝗶𝗺𝗲. #AI #education #google
English
0
0
0
27
ShadmanRohan @RohanShadman
🎉 Just had an incredible experience attending The 63rd Annual Meeting of the Association for Computational Linguistics! 🎉 - via #Whova event app
English
0
0
1
47
God of Prompt @godofprompt
4/ Content Creation • DeepSeek-V3: Data-driven and structured for investors. • Qwen2.5: Narrative-driven and engaging, but less structured. Winner: DeepSeek-V3 (3-1)
English
2
3
45
14.4K
God of Prompt @godofprompt
I tested China’s leading AI models, so you don’t have to. DeepSeek-V3 VS Qwen 2.5 The results will shock you. (Video demos are included 👇)
English
148
266
1.5K
338.6K
ShadmanRohan retweeted
Matthew Berman @MatthewBerman
1/ Google Research unveils new paper: "Titans: Learning to Memorize at Test Time" It introduces human-like memory structures to overcome the limits of Transformers, with one "SURPRISING" feature. Here's why this is huge for AI. 🧵👇
English
58
434
2.9K
432.7K
ShadmanRohan retweeted
Aaron Mueller @amuuueller
What can mechanistic interpretability do for computational psycholinguists? @michaelwhanna and I took a stab at this question! We investigate garden path sentence processing in LMs at the feature (circuit) level.
Michael Hanna @michaelwhanna

Sentences are partially understood before they're fully read. How do LMs incrementally interpret their inputs? In a new paper @amuuueller and I use mech interp to study how LMs process structurally ambiguous sentences. We show LMs rely on both syntactic & spurious features! 1/10

English
2
10
59
5.7K
ShadmanRohan retweeted
Wei Xu @cocoweixu
We wrapped up CS 8803 "Large Language Model" class at @GeorgiaTech for Fall 2024. Here is the reading list: • learning from human preferences (PPO, DPO, SimPO, CPO, RRHF, ORPO, CTO) • real-world LLM (Llama-3, Aya, Arena's) • efficient LLM (MoMa, LoRA, QLoRA, LESS)
English
14
166
1.1K
95.8K
Mohammad Ali Arafat @MAarafat71
In various Middle Eastern countries, including the United Arab Emirates, many people who showed solidarity with the quota movement have faced legal action and been sentenced in those countries. Honorable Prime Minister Sheikh Hasina and her government are very concerned about them. On the Honorable Prime Minister's instructions, the embassies are working to ensure that our other expatriates do not face any further problems over this issue. The government is determined to ensure the protection of expatriates.
Bengali
131
43
303
47K
ShadmanRohan retweeted
Arpit Adlakha @arpit20adlakha
One of the finest roadmaps I have seen for senior software interviews: a guy posted it on LeetCode for clearing Uber L5A/L5B or Google L5/L6 levels.
English
68
526
7.5K
1.7M
ShadmanRohan retweeted
Tim Denning @Tim_Denning
I’ve spent over 120 hours studying one of the most controversial authors. Nassim Taleb. Here are 11 of his best lessons ↓
English
115
577
3K
1.4M
ShadmanRohan retweeted
Ole Lehmann @itsolelehmann
I'm 32. After living my whole life in Germany, last year I took the leap and moved abroad to Cyprus. It's the greatest lifestyle upgrade I've ever experienced. 20 lessons for living the good life abroad (that'll make your move easier):
English
25
22
409
259.9K
ShadmanRohan @RohanShadman
@rose_e_wang Will the workshop submissions be included in the conference proceedings, or are they considered non-archival?
English
1
0
0
32
Rose @rose_e_wang
📢 Calling the #EdTech community! Intrigued by the [potential/positive/negative] impact of LLMs on education? Submit your work to this workshop at #EDM2024 🪇👩‍🏫 ➡️sites.google.com/view/llmworksh… Deadline: May 10th Looking forward to the discussions!!!
Rose @rose_e_wang

How will ed tech change w LLMs? What is and isn't possible? If these Qs have been on your mind, submit your work to a workshop I'm organizing: Leveraging LLMs for Next Gen Ed Tech @ EDM 2024 by May 10th! ➡️sites.google.com/view/llmworksh… #EDM #EdTech

English
3
5
25
5.9K
ShadmanRohan retweeted
Toan Truong @ToanTruong_
I'm 18. I’m obsessed with learning how to learn. So, I spent 200+ hours studying how geniuses, prodigies, and high performers master their disciplines. Here's what I found on how to master anything faster:
English
442
6.3K
30.6K
5.7M