

Sebastian Raschka

@rasbt
ML/AI research engineer. Ex stats professor. Author of "Build a Large Language Model From Scratch" (https://t.co/O8LAAMRzzW) & reasoning (https://t.co/5TueQKx2Fk)



Meta observation: DeepSeek is still king of the active-parameter ratio
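For context, the active-parameter ratio is just the parameters activated per token divided by the total parameter count. Below is a quick back-of-the-envelope sketch; the figures are the commonly reported MoE configs and are included only for illustration, not as an exhaustive comparison.

```python
# Back-of-the-envelope: active-parameter ratio = params used per token / total params.
# Figures are the commonly reported MoE configs (illustrative, not exhaustive).
configs = {
    "DeepSeek-V3":     {"total_b": 671, "active_b": 37},
    "Qwen3-235B-A22B": {"total_b": 235, "active_b": 22},
}

for name, c in configs.items():
    ratio = c["active_b"] / c["total_b"]
    print(f"{name}: {c['active_b']}B active / {c['total_b']}B total = {ratio:.1%} per token")
# DeepSeek-V3 activates roughly 5.5% of its weights per token, the lowest ratio here.
```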








Came across Cola-DLM (hongcanguo.github.io/Cola-DLM/) from ByteDance. A hierarchical continuous latent diffusion LM that separates global semantic planning (DiT in latent space) from local token realization (VAE decoder). Paper is out, but no code and no HF model yet. So I reproduced it from scratch. Happy to share with anyone interested 👇
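To make the planner/decoder split concrete, here is a minimal PyTorch sketch of the general idea, assuming made-up module names, sizes, and a toy Transformer denoiser: a diffusion-style planner refines a sequence of chunk-level continuous latents, and a separate decoder realizes each latent chunk as tokens. This is neither the paper's architecture nor the reproduction mentioned above.

```python
import torch
import torch.nn as nn

# Hedged sketch of the two-level split: a diffusion model plans a sequence of
# continuous chunk latents (global semantic planning), and a VAE-style decoder
# maps each latent chunk to tokens (local realization). Everything here
# (names, sizes, the toy denoiser) is an illustrative placeholder.

class LatentPlanner(nn.Module):
    """Stand-in for the latent-space DiT: denoises chunk-level latents."""
    def __init__(self, latent_dim=64, hidden=256, num_layers=4, num_heads=4):
        super().__init__()
        self.in_proj = nn.Linear(latent_dim, hidden)
        self.time_emb = nn.Sequential(nn.Linear(1, hidden), nn.SiLU(), nn.Linear(hidden, hidden))
        layer = nn.TransformerEncoderLayer(hidden, num_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers)
        self.out_proj = nn.Linear(hidden, latent_dim)

    def forward(self, z_noisy, t):
        # z_noisy: (B, num_chunks, latent_dim), t: (B,) diffusion time in [0, 1]
        h = self.in_proj(z_noisy) + self.time_emb(t[:, None, None])
        return self.out_proj(self.blocks(h))  # predicted clean latents

class ChunkDecoder(nn.Module):
    """Stand-in for the VAE decoder: one latent chunk -> logits for a few tokens."""
    def __init__(self, latent_dim=64, vocab_size=8192, tokens_per_chunk=8, hidden=256):
        super().__init__()
        self.tokens_per_chunk, self.vocab_size = tokens_per_chunk, vocab_size
        self.net = nn.Sequential(
            nn.Linear(latent_dim, hidden), nn.GELU(),
            nn.Linear(hidden, tokens_per_chunk * vocab_size),
        )

    def forward(self, z):
        # z: (B, num_chunks, latent_dim) -> (B, num_chunks * tokens_per_chunk, vocab_size)
        B, C, _ = z.shape
        return self.net(z).view(B, C * self.tokens_per_chunk, self.vocab_size)

# Toy forward pass: "plan" in latent space, then realize tokens locally.
planner, decoder = LatentPlanner(), ChunkDecoder()
z_noisy = torch.randn(2, 16, 64)               # 16 noisy chunk latents per sample
z_hat = planner(z_noisy, t=torch.rand(2))      # global planning step
token_logits = decoder(z_hat)                  # local realization step
print(token_logits.shape)                      # torch.Size([2, 128, 8192])
```

In the actual method the planner would run an iterative denoising loop rather than a single call, and the decoder would be trained as part of a VAE; the single pass above is only meant to show where each piece sits.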









Cool idea from Nous Research. What if you could speed up long-context pretraining with a subquadratic wrapper that you remove before deployment? That is the idea behind Lighthouse Attention.

The method wraps ordinary SDPA with a hierarchical, gradient-free selection layer that compresses and decompresses queries, keys, and values symmetrically, preserving left-to-right causality. Crucially, it can be removed near the end of training in a short recovery phase, so the deployed model still runs vanilla attention with no architectural cost at inference. Preliminary LLM experiments report faster total training time and lower final loss than full-attention baselines.

Why does it matter? Most efficient-attention work either changes the deployment-time architecture or pays a quality tax to do so. A training-only wrapper that survives a clean recovery phase sidesteps both. If it scales, this becomes an important training-time speedup for long-context pretraining.

Paper: arxiv.org/abs/2605.06554

Learn to build effective AI agents in our academy: academy.dair.ai
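To make the "removable wrapper" idea concrete, here is a minimal PyTorch sketch under stated assumptions: compression is plain block-mean pooling and decompression is nearest-neighbor expansion, standing in for the paper's hierarchical, gradient-free selection (no code has been released). The only point it illustrates is that the wrapper sits around ordinary SDPA and can be switched off, leaving vanilla attention.

```python
import torch
import torch.nn.functional as F

# Minimal sketch of the *idea* of a removable subquadratic wrapper around SDPA,
# not the actual Lighthouse Attention algorithm. The block-mean compression and
# nearest-neighbor decompression below are non-learned placeholders for the
# paper's hierarchical, gradient-free selection scheme.

def wrapped_sdpa(q, k, v, block=4, enabled=True):
    """q, k, v: (B, H, T, D). With enabled=False this is vanilla causal SDPA,
    i.e. what the deployed model runs once the wrapper has been removed."""
    if not enabled:
        return F.scaled_dot_product_attention(q, k, v, is_causal=True)

    B, H, T, D = q.shape
    assert T % block == 0, "toy sketch assumes T is divisible by block"

    # Compress Q, K, V symmetrically: non-overlapping block means (T -> T // block).
    def compress(x):
        return x.view(B, H, T // block, block, D).mean(dim=3)

    qc, kc, vc = compress(q), compress(k), compress(v)

    # Attention over the shorter sequence; the causal mask here acts only at block
    # granularity (the real method preserves token-level causality exactly).
    out_c = F.scaled_dot_product_attention(qc, kc, vc, is_causal=True)

    # Decompress: broadcast each block's output back to its `block` positions.
    return out_c.repeat_interleave(block, dim=2)

# During long-context pretraining the wrapper is on; for the short recovery
# phase and at inference you flip enabled=False and run ordinary attention.
q = k = v = torch.randn(2, 8, 1024, 64)
fast = wrapped_sdpa(q, k, v, enabled=True)       # attention cost scales with (T / block)^2
vanilla = wrapped_sdpa(q, k, v, enabled=False)   # plain causal SDPA
print(fast.shape, vanilla.shape)                 # both torch.Size([2, 8, 1024, 64])
```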




ERNIE 5.1 just dropped. Built on ERNIE 5.0's pre-training foundation, our latest foundation model upgrades search, reasoning, knowledge Q&A, creative writing, and agentic capabilities, while using only around 6% of the pre-training cost of comparable models. More in the thread 🧵