ShadmanRohan

126 posts

@RohanShadman

🎓 CS grad | 🧠 Computational Linguistics & Vision | he/him | LinkedIn: https://t.co/00NRe8lTSM

Earth · Joined July 2018

699 Following · 36 Followers

ShadmanRohan reposted

Jason Zhu@GoSailGlobal·5 May

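In Stanford CS336, Tatsu gave a lecture on LLM architecture that takes apart every mainstream LLM of the past 3 years and looks for their common template.

The conclusion is striking: 90% of architecture choices have converged. Pick any open-source large model and it is nearly identical to the others along these dimensions.

In the lecturer's own words:
- 2024: everyone was cosplaying Llama 2
- 2025: the theme was "how to train without blowing up"
- 2026: the theme is "how to handle long context"

Below is the standard template for open-source LLMs in 2026. If you are training your own model, you can copy it directly.

[7 things that have converged at the architecture level]

1) Move LayerNorm out of the residual stream (pre-norm). The original Transformer put LN inside the residual path; almost all modern models move it outside. The reason: keep your residual stream clean, so gradients backpropagate more stably.

2) RMSNorm replaces LayerNorm. The mean subtraction and added bias in LayerNorm do not actually help much. Dropping them saves only 0.17% of FLOPs but up to 25% of runtime (the bottleneck is data movement; compute is secondary).

3) Remove all bias terms. Same reasoning as RMSNorm: it saves memory movement at the systems level.

4) Use SwiGLU or GeGLU as the activation. Almost all modern models use a gated linear unit. The Llama family, Qwen, and Mistral use SwiGLU; the Google family (Gemma / T5) uses GeGLU. The difference is tiny, so either works.

5) Use RoPE for positional encoding. This has been essentially universal since 2024. The principle: rotate each pair of dimensions by an angle that depends on position, so that the inner product depends only on relative position.

6) Transformer blocks are serial (not parallel). GPT-J and PaLM tried parallel blocks, but the idea has largely been abandoned. Serial implementations are so well optimized that the small systems overhead a parallel block saves is not worth the loss in expressiveness.

7) LayerNorm can be "sprinkled": add LN wherever training is unstable. It can go before attention, after attention, or on both sides (double norm). Many modern models do this.

[5 hyperparameter numbers that have converged]

1) Feedforward dimension / hidden dimension
- Non-GLU models: 4x
- GLU models: 8/3 ≈ 2.67x (GLU adds an extra matrix, so the ratio shrinks to keep the total parameter count)
- Llama family: 3.5x
- T5 1.0 tried 64x and T5 1.1 went back to the standard value. Don't copy it.

2) Number of heads × head dimension ≈ hidden dimension. Almost all models follow this; T5 is one of the few exceptions.

3) Model aspect ratio (hidden / number of layers) ≈ 100. Too deep and pipeline parallelism gets hard; too wide and expressiveness is limited. 100 is the balance point between systems constraints and expressiveness.

4) Vocab size. Monolingual models: around 30K (the early GPT-2 style). Multilingual / general-purpose models: 100K-200K (GPT-4, Llama 3, and Gemma are all in this range). Modern models are basically all the latter.

5) Weight decay. It is still widely used, but research has found that what it does in LLMs is really an optimizer intervention that lets you eventually converge to a deeper optimum. It has little to do with the "preventing overfitting" you might expect. So don't turn it off just because "a single epoch can't overfit".

[Three life-saving stability tricks]

The worst fear when training a large model is that the loss suddenly spikes midway, goes to NaN, and the whole run is lost. Modern models use three tricks to guard against this.

1) Z-loss. The normalizer of the output softmax blows up easily. Add a (log Z)² regularization term so that Z always stays close to 1. DCLM and Olmo both use it.

2) QK norm. Apply an LN to each of Q and K in attention before the matrix multiply, so the softmax input is always at unit scale. The multimodal community used it first; now all large models add it.

3) Logit soft cap (Google family only). Attention logits are hard-capped with tanh. Gemma 2/3/4 all use it, but it costs a little performance, so use it with caution.

[Two new trends in attention]

1) GQA (Grouped Query Attention) is nearly universal. With the original multi-head attention, the KV cache at inference makes arithmetic intensity collapse to 1/h. GQA shares K and V but keeps multiple Qs. Expressiveness is almost unaffected and inference cost drops by 80%. Every large model meant for production deployment now uses GQA.

2) Alternating local + global attention. This is a new way to handle long context. Cohere Command A started it, and now Llama 4, Gemma 4, and Olmo 3 all use it. For example, 1 layer in every 4 is full attention and the other 3 are sliding window layers that only look at nearby tokens. It is more stable than a pure SSM and much cheaper than pure full attention. (Qwen 3.5 made a variant that replaces the 3 sliding window layers with SSM.)

One closing note: if you are training your own LLM, the set above is the "default configuration" for 2026. No need to reinvent anything, just copy it. If you only want to understand those modeling_xxx.py files on GitHub, this is enough to stop the jargon from scaring you.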
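A few of the items in the thread above can be sketched in plain Python. This is a minimal illustration, not the lecture's code: the function names, the RoPE base of 10000, the z-loss coefficient of 1e-4, and the example width of 3072 are all assumptions chosen for the example.

```python
import math

def rms_norm(x, gain=None, eps=1e-6):
    """RMSNorm: rescale by the root mean square; no mean subtraction, no bias."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    gain = gain or [1.0] * len(x)
    return [g * v / rms for g, v in zip(gain, x)]

def rope(x, pos, base=10000.0):
    """Rotate each consecutive pair (x[i], x[i+1]) by the angle pos * base**(-i/d)."""
    d = len(x)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        out += [x[i] * c - x[i + 1] * s, x[i] * s + x[i + 1] * c]
    return out

def dot(a, b):
    return sum(u * v for u, v in zip(a, b))

def z_loss(logits, coeff=1e-4):
    """Penalty coeff * (log Z)^2, where Z is the softmax normalizer."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return coeff * log_z ** 2

def ffn_params(d_model, d_ff, gated):
    """Weight count of a bias-free feedforward block (3 matrices if gated, else 2)."""
    return (3 if gated else 2) * d_model * d_ff

q = [0.3, -1.2, 0.7, 0.5]
k = [1.0, 0.4, -0.6, 0.9]

# Same offset (3) at different absolute positions gives the same score.
near = dot(rope(q, 5), rope(k, 2))
far = dot(rope(q, 103), rope(k, 100))

# A 4x plain block and an 8/3x gated block have the same weight count.
plain = ffn_params(3072, 4 * 3072, gated=False)
glu = ffn_params(3072, 8192, gated=True)  # 8192 = 8/3 * 3072
```

Here `near` and `far` agree up to floating-point error, which is the "inner product depends only on relative position" property, and `plain == glu` shows why the gated ratio is 8/3 rather than 4.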

Roan@RohOnChain

Anthropic pays $750,000+ a year for engineers who can build LLM architectures from scratch. Stanford taught the entire thing in 1 hour lecture & released it for free. Bookmark & watch this today before someone takes it down.

ShadmanRohan reposted

Roan@RohOnChain·11 Apr

This 2-hour Stanford lecture shows exactly how Stanford trains its engineers to build AI systems. It's more practical than every Claude tutorial and prompting thread you've seen. Bookmark & give it 2 hours, no matter what. It'll be the most productive thing you do this weekend.

ShadmanRohan reposted

François Chollet@fchollet·10 Mar

AI agents will soon graduate to fully-fledged economic actors that buy services, compute, and even data in the course of accomplishing high-level goals. 1-2 years before we start seeing this at scale.

ShadmanRohan reposted

Boris Cherny@bcherny·2 Jan

I'm Boris and I created Claude Code. Lots of people have asked how I use Claude Code, so I wanted to show off my setup a bit. My setup might be surprisingly vanilla! Claude Code works great out of the box, so I personally don't customize it much. There is no one correct way to use Claude Code: we intentionally build it in a way that you can use it, customize it, and hack it however you like. Each person on the Claude Code team uses it very differently. So, here goes.

ShadmanRohan@RohanShadman·29 Dec

2019 vs. Today. We’ve come a long way. Back then, the “gotcha” was: ask it a simple arithmetic word problem and it collapses. Today, Fields Medalists are using these models to turn research math into machine-checkable proofs.

ShadmanRohan reposted

Akshay 🚀@akshay_pachaar·6 Dec

When outputs are verifiable, labels become optional. Maths, code, and logic can be automatically checked and validated. Let's use this fact to build a reasoning model without manual labelling. We'll use: - @UnslothAI for parameter-efficient finetuning. - @HuggingFace TRL to apply GRPO. Let's go! 🚀
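The idea in the post above can be sketched without any training library: a programmatic check supplies the reward, and GRPO-style training compares each sampled answer against the other samples for the same prompt. This is only an illustration under assumptions. The function names are hypothetical, the answer extraction is a simple "last integer in the text" rule, and the normalization uses the population standard deviation; none of it is the Unsloth or TRL API.

```python
import re
from statistics import mean, pstdev

def arithmetic_reward(completion: str, expected: int) -> float:
    """Return 1.0 if the last integer in the completion equals the expected answer."""
    numbers = re.findall(r"-?\d+", completion)
    return 1.0 if numbers and int(numbers[-1]) == expected else 0.0

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of samples drawn for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four hypothetical completions for the prompt "What is 17 + 25?"
completions = ["17 + 25 = 42", "The answer is 43", "I think it is 42.", "41"]
rewards = [arithmetic_reward(c, 17 + 25) for c in completions]
advantages = group_relative_advantages(rewards)
```

No human label appears anywhere: the expected value is computed, and correct completions end up with positive advantages while incorrect ones end up negative.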

ShadmanRohan reposted

Paata Ivanisvili@PI010101·5 Oct

GPT-5 Pro found a counterexample to the NICD-with-erasures majority optimality (Simons list, p.25). simons.berkeley.edu/sites/default/… At p=0.4, n=5, f(x) = sign(x_1-3x_2+x_3-x_4+3x_5) gives E|f(x)|=0.43024 vs best majority 0.42904.
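The two figures in the post above can be checked by brute force over all 2^5 inputs. The sketch below rests on one assumption about the setup: each coordinate is revealed independently with probability p (and erased otherwise), and the quantity of interest is the expected absolute value of the conditional mean of f given the revealed coordinates. The function names are hypothetical.

```python
from itertools import product

def sign(v):
    return 1 if v > 0 else -1

def ltf(x):
    # The threshold function quoted in the post.
    return sign(x[0] - 3 * x[1] + x[2] - x[3] + 3 * x[4])

def majority(x):
    return sign(sum(x))

def expected_abs_conditional_mean(f, n, p):
    """E over reveal patterns and revealed values of |E[f(x) | revealed bits]|."""
    total = 0.0
    for mask in product([False, True], repeat=n):
        kept = [i for i in range(n) if mask[i]]
        hidden = [i for i in range(n) if not mask[i]]
        weight = p ** len(kept) * (1 - p) ** len(hidden)
        inner = 0.0
        for kept_vals in product([-1, 1], repeat=len(kept)):
            x = [0] * n
            for i, v in zip(kept, kept_vals):
                x[i] = v
            s = 0
            # Average f over every completion of the erased coordinates.
            for hidden_vals in product([-1, 1], repeat=len(hidden)):
                for i, v in zip(hidden, hidden_vals):
                    x[i] = v
                s += f(x)
            inner += abs(s) / 2 ** len(hidden)
        total += weight * inner / 2 ** len(kept)
    return total

print(round(expected_abs_conditional_mean(ltf, 5, 0.4), 5))       # 0.43024
print(round(expected_abs_conditional_mean(majority, 5, 0.4), 5))  # 0.42904
```

Under this reading of p, the enumeration reproduces both numbers quoted in the post for n = 5.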

ShadmanRohan@RohanShadman·28 Sep

Current LLMs are hitting the ceiling on “more tokens = better thinking.” A promising direction is procedural memory over ever-longer chains of thought—capturing recurring reasoning as reusable behaviors. Think smarter, not just longer. #AI #LLM #Reasoning #Efficiency #MLOps

ShadmanRohan@RohanShadman·24 Sep

New from Google Research✨: Learn Your Way🎒 Upload a 📚textbook/PDF → Interactive Lessons 🧭Mind maps ⚡Quizzes 🎧Audio lessons 📊11% better retention (78% vs. 67%) vs digital reader. 𝗧𝗵𝗼𝘂𝗴𝗵𝘁𝗳𝘂𝗹 𝗱𝗲𝘀𝗶𝗴𝗻 > 𝗺𝗼𝗿𝗲 𝘀𝗰𝗿𝗲𝗲𝗻 𝘁𝗶𝗺𝗲. #AI #education #google

ShadmanRohan@RohanShadman·2 Aug

🎉 Just had an incredible experience attending The 63rd Annual Meeting of the Association for Computational Linguistics! 🎉 - via #Whova event app

ShadmanRohan@RohanShadman·3 Feb

@godofprompt Oh no,, you ss so,

God of Prompt@godofprompt·1 Feb

4/ Content Creation • DeepSeek-V3: Data-driven and structured for investors. • Qwen2.5: Narrative-driven and engaging, but less structured. Winner: DeepSeek-V3 (3-1)

God of Prompt@godofprompt·1 Feb

I tested China’s leading AI models, so you don’t have to. DeepSeek-V3 VS Qwen 2.5 The results will shock you. (Video demos are included 👇)

ShadmanRohan reposted

Matthew Berman@MatthewBerman·15 Jan

1/ Google Research unveils new paper: "Titans: Learning to Memorize at Test Time" It introduces human-like memory structures to overcome the limits of Transformers, with one "SURPRISING" feature. Here's why this is huge for AI. 🧵👇

ShadmanRohan reposted

Aaron Mueller@amuuueller·19 Dec

What can mechanistic interpretability do for computational psycholinguists? @michaelwhanna and I took a stab at this question! We investigate garden path sentence processing in LMs at the feature (circuit) level.

Michael Hanna@michaelwhanna

Sentences are partially understood before they're fully read. How do LMs incrementally interpret their inputs? In a new paper @amuuueller and I use mech interp to study how LMs process structurally ambiguous sentences. We show LMs rely on both syntactic & spurious features! 1/10

ShadmanRohan reposted

Wei Xu@cocoweixu·4 Dec

We wrapped up CS 8803 "Large Language Model" class at @GeorgiaTech for Fall 2024. Here is the reading list: • learning from human preferences (PPO, DPO, SimPO, CPO, RRHF, ORPO, CTO) • real-world LLM (Llama-3, Aya, Arena's) • efficient LLM (MoMa, LoRA, QLoRA, LESS)

ShadmanRohan@RohanShadman·28 Jul

@MAarafat71 Star Jalsha has started 😂

Mohammad Ali Arafat@MAarafat71·27 Jul

In various Middle Eastern countries, including the United Arab Emirates, many people have faced legal action and been sentenced for showing solidarity with the quota movement. The Honorable Prime Minister Sheikh Hasina and her government are deeply concerned about them. At the Honorable Prime Minister's direction, the embassies are working to make sure our other expatriates do not face any further problems over this issue. The government is firmly committed to ensuring the safety of expatriates.

ShadmanRohan reposted

Arpit Adlakha@arpit20adlakha·24 Jun

One of the finest roadmaps I have seen for Senior Software Interviews, a guy posted on LeetCode for clearing Uber L5A, L5B or Google L5/L6 levels.

ShadmanRohan reposted

Tim Denning@Tim_Denning·28 May

I’ve spent over 120 hours studying one of the most controversial authors. Nassim Taleb. Here are 11 of his best lessons ↓

ShadmanRohan reposted

Ole Lehmann@itsolelehmann·15 May

I'm 32. After living my whole life in Germany, last year I took the leap and moved abroad to Cyprus. It's the greatest lifestyle upgrade I've ever experienced. 20 lessons for living the good life abroad (that'll make your move easier):

ShadmanRohan@RohanShadman·3 May

@rose_e_wang Will the workshop submissions be included in the conference proceedings, or are they considered non-archival?

Rose@rose_e_wang·1 May

📢 Calling the #EdTech community! Intrigued by the [potential/positive/negative] impact of LLMs on education? Submit your work to this workshop at #EDM2024 🪇👩‍🏫 ➡️sites.google.com/view/llmworksh… Deadline: May 10th Looking forward to the discussions!!!

Rose@rose_e_wang

How will ed tech change w LLMs? What is and isn't possible? If these Qs have been on your mind, submit your work to a workshop I'm organizing: Leveraging LLMs for Next Gen Ed Tech @ EDM 2024 by May 10th! ➡️sites.google.com/view/llmworksh… #EDM #EdTech

ShadmanRohan reposted

Toan Truong@ToanTruong_·25 Feb

I'm 18. I’m obsessed with learning how to learn. So, I spent 200+ hours studying how geniuses, prodigies, and high performers master their disciplines. Here's what I found on how to master anything faster:
