Gokul Santhanam

7K posts

@gokstudio

Senior ML Engineer @🍎, working at the intersection of multimodal LLMs, efficient image encoders, and on-device ML. Views my own; retweet != endorsement.

Zurich, Switzerland · Joined January 2015
5.5K Following · 1.3K Followers
Samip @industriaalist
here's @JeffDean talking about how labs will do multi-epoch pretraining with heavy regularization to keep scaling even with limited data. no wonder slowrun gets so much attention from pretraining teams at big labs. pretraining is about to look very very different.
5 replies · 13 reposts · 165 likes · 18.1K views

Yuki @y_m_asano
@Kimi_Moonshot I think I can feel the Schmidhuber Highway-Networks/golden years post being written as we speak.
2 replies · 0 reposts · 12 likes · 980 views

Kimi.ai @Kimi_Moonshot
Introducing Attention Residuals: Rethinking depth-wise aggregation. Residual connections have long relied on fixed, uniform accumulation. Inspired by the duality of time and depth, we introduce Attention Residuals, replacing standard depth-wise recurrence with learned, input-dependent attention over preceding layers.
🔹 Enables networks to selectively retrieve past representations, naturally mitigating dilution and hidden-state growth.
🔹 Introduces Block AttnRes, partitioning layers into compressed blocks to make cross-layer attention practical at scale.
🔹 Serves as an efficient drop-in replacement, demonstrating a 1.25x compute advantage with negligible (<2%) inference latency overhead.
🔹 Validated on the Kimi Linear architecture (48B total, 3B activated parameters), delivering consistent downstream performance gains.
🔗 Full report: github.com/MoonshotAI/Att…
[image]
332 replies · 2.1K reposts · 13.5K likes · 4.9M views
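To make the depth-attention idea concrete (the report link above is truncated), here is a minimal PyTorch sketch of one way "learned, input-dependent attention over preceding layers" could replace the uniform residual sum. The module name, scoring function, and shapes are my assumptions, and this naive version materializes the full layer history rather than the compressed blocks of Block AttnRes:

```python
# Hedged sketch, not Moonshot's code: instead of the fixed residual
# h_{l+1} = h_l + f(h_l), let each position attend over the outputs of all
# preceding layers and add a learned, input-dependent mixture of them.
import torch
import torch.nn as nn

class AttentionResidual(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.q = nn.Linear(dim, dim, bias=False)  # query from the current activation
        self.k = nn.Linear(dim, dim, bias=False)  # keys from the layer history

    def forward(self, x: torch.Tensor, history: list[torch.Tensor]) -> torch.Tensor:
        # x: (B, S, D); history: outputs of layers 0..l, each (B, S, D)
        past = torch.stack(history, dim=2)                 # (B, S, L, D)
        q = self.q(x).unsqueeze(2)                         # (B, S, 1, D)
        k = self.k(past)                                   # (B, S, L, D)
        scores = (q * k).sum(-1) / past.shape[-1] ** 0.5   # (B, S, L) depth scores
        w = scores.softmax(dim=-1).unsqueeze(-1)           # attention over depth
        return x + (w * past).sum(dim=2)                   # selective retrieval
```

Attending over depth per position is what would let the network retrieve an early representation directly, instead of relying on it surviving every intermediate residual sum.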
Gokul Santhanam @gokstudio
@nthngdy Great insights. I haven't read the paper yet, but what are your thoughts on how we can apply this to LLMs where the embedding table is reused for the LM head?
0 replies · 0 reposts · 0 likes · 157 views

Nathan Godey @nthngdy
🧵New paper: "Lost in Backpropagation: The LM Head is a Gradient Bottleneck" The output layer of LLMs destroys 95-99% of your training signal during backpropagation, and this significantly slows down pretraining 👇
[image]
24 replies · 92 reposts · 852 likes · 73.8K views
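The thread doesn't spell out how the 95-99% figure is measured; one hedged, self-contained way to see a head reshaping the backward signal is to compare gradient norms on either side of it (this probe is my construction, not necessarily the paper's metric):

```python
# Toy probe: how does the gradient norm at the hidden states compare to the
# gradient norm at the logits, once backprop passes through the LM head?
import torch
import torch.nn.functional as F

torch.manual_seed(0)
N, D, V = 32, 1024, 50_000                  # tokens, hidden dim, vocab size
hidden = torch.randn(N, D, requires_grad=True)
lm_head = torch.randn(V, D) / D ** 0.5      # stand-in for a (possibly tied) head
targets = torch.randint(V, (N,))

logits = hidden @ lm_head.T
loss = F.cross_entropy(logits, targets)

g_logits, = torch.autograd.grad(loss, logits, retain_graph=True)
g_hidden, = torch.autograd.grad(loss, hidden)
print(f"||grad at logits|| = {g_logits.norm():.4f}")
print(f"||grad at hidden|| = {g_hidden.norm():.4f}")  # shaped by the head's spectrum
```

Since g_hidden = g_logits @ lm_head, whatever the head does to the gradient's magnitude and direction is exactly what the hidden states, and everything below them, get to learn from.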
Gokul Santhanam retweeted
Peter Tong @TongPetersb
Train Beyond Language. We bet on the visual world as the critical next step alongside and beyond language modeling. So, we studied building foundation models from scratch with vision. We share our exploration: visual representations, data, world modeling, architecture, and scaling behavior! [1/9]
[image]
36 replies · 222 reposts · 1.1K likes · 208.9K views

Gokul Santhanam @gokstudio
@prajdabre You still have the causal mask, so some notion of position is still encoded.
0 replies · 0 reposts · 0 likes · 24 views

Raj Dabre @prajdabre
Slightly advanced ML question: Interviewer: In a decoder-only causal language model, I removed all positional encodings and trained on a large corpus. What do you think will happen? If you answered "The model will produce garbage", your answer is wrong. Why?
37 replies · 10 reposts · 355 likes · 53.4K views
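A quick way to convince yourself of the causal-mask point (my toy demo: a single attention head, no positional encodings): swapping two early tokens changes the outputs at positions between them, because each such position now sees a different prefix, while the last position, which sees the same set either way, is unchanged:

```python
# Causal attention with no positional encodings is not order-invariant:
# the mask itself tells each position which prefix it may see.
import torch

torch.manual_seed(0)
S, D = 6, 16
x = torch.randn(S, D)

def causal_attn(x):
    scores = (x @ x.T) / D ** 0.5
    mask = torch.triu(torch.ones(S, S, dtype=torch.bool), diagonal=1)
    return scores.masked_fill(mask, float("-inf")).softmax(dim=-1) @ x

y = causal_attn(x)
x2 = x.clone()
x2[[0, 3]] = x[[3, 0]]                     # swap tokens 0 and 3
y2 = causal_attn(x2)
print((y[2] - y2[2]).abs().max())          # nonzero: position 2 sees a new prefix
print((y[5] - y2[5]).abs().max())          # ~0: position 5 sees the same set
```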
Gokul Santhanam @gokstudio
@gabriberton Not directly into the Value vectors, no, but AFAICT, to the input embeddings. They're similar in that they too are a way to increase param count w/o increasing FLOPs
1 reply · 0 reposts · 1 like · 20 views

Gabriele Berton @gabriberton
@gokstudio I read the Gemma 3 paper but missed the Gemma 3n. Do they do the same as here?
1 reply · 0 reposts · 0 likes · 48 views

Gabriele Berton @gabriberton
The most interesting thing I've seen in a while: the recipe by @karpathy to reduce GPT-2 1.5B training cost from $43,000 to $73! Seven years of improvements over vanilla GPT in 10 points. Let's start with the uncommon ones: 1) Value Embeddings: I've never seen this in any LLM, [1/N]
[image]
24 replies · 125 reposts · 1.6K likes · 151.4K views

Gabriele Berton @gabriberton
Every 2 layers they add trainable "Value Embeddings" (VEs) to the V tokens. VEs have the same shape as the token-embedding table (D x vocab_size). VEs use 600M params (~50% of the total!) but add negligible FLOPs. My intuition is that VEs "remind" the LLM about the initial input token. [2/N]
[image]
5 replies · 4 reposts · 84 likes · 14K views
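From the thread's description alone, a minimal PyTorch reconstruction might look like the following; the class and argument names are mine, and modded-nanogpt's actual version differs in details (e.g. which layers get VEs and how they're gated). The key property is that the extra parameters enter through an embedding lookup on the input ids rather than a matmul, so FLOPs barely move:

```python
# Sketch (my reconstruction, not the modded-nanogpt code): a trainable
# vocab_size x D table is looked up by input id and added to the attention
# V vectors, adding ~vocab_size*D params for near-zero extra FLOPs.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ValueEmbedAttention(nn.Module):
    def __init__(self, vocab_size: int, dim: int, n_heads: int):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim, bias=False)
        self.proj = nn.Linear(dim, dim, bias=False)
        self.value_embed = nn.Embedding(vocab_size, dim)   # the extra table
        self.n_heads = n_heads

    def forward(self, x: torch.Tensor, input_ids: torch.Tensor) -> torch.Tensor:
        B, S, D = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        v = v + self.value_embed(input_ids)                # lookup, not a matmul
        hd = D // self.n_heads
        q, k, v = (t.view(B, S, self.n_heads, hd).transpose(1, 2) for t in (q, k, v))
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.proj(out.transpose(1, 2).reshape(B, S, D))
```

This also matches Berton's intuition: the added V term is a pure function of the raw input ids, so every such layer gets an uncontaminated reminder of the original tokens.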
Gokul Santhanam @gokstudio
@prajdabre I still have old relatives who use 4-anna and 8-anna terminology, especially colloquially, like "his work's not even worth 8 anna". Also, 2k years ago? You gotta share pics, man!
0 replies · 0 reposts · 1 like · 663 views

Raj Dabre @prajdabre
As a coin collector (or numismatist) I can tell you some interesting facts:
1. Back then 64 paise was 1 Rupee, because 1 Anna was 1/16 of a Rupee. They later changed it to 100 paise = 1 Rupee.
2. This conversion had an interesting effect. 8 Anna meant half a rupee, i.e. 32 paise. But even after the change from 64 to 100 paise per Rupee, people still called half a rupee "8 Anna", and this was after the Anna had been demonetized. In short, 8 Anna = 50 paise although the Anna was no longer valid. People retained this nomenclature into the 2000s.
3. The photo is missing the 1/4 rupee, 1/12 rupee and so on. I have some of these coins, minted during the British Raj.
4. There are a ton of Paisa denominations, which came after the Anna, that are missing from the picture; I own these as well.
Unrelated to Indian currency: I have coins dating back 2,000 years, and it's just amazing to consider the history and how many hands these currencies have passed through.
Stocks World @anandchokshi19

The Evolution Of Indian Currency:

22 replies · 46 reposts · 607 likes · 58.9K views

Gokul Santhanam @gokstudio
@cneuralnetwork First para of the Wiki entry on DP: "Dynamic programming is both a mathematical optimization method and an algorithmic paradigm. The method was developed by Richard Bellman in the 1950s and has found applications in numerous fields, such as aerospace engineering and economics."
1 reply · 0 reposts · 2 likes · 201 views

neural nets. @cneuralnetwork
the intuition behind this is cool
[1] What is DP, basically? You break a big problem into smaller ones, solve each subproblem, and combine them into the bigger solution; basically dp[i] comes from dp[0..i-1].
[2] RL is basically a sequential optimization problem: you are maximizing future reward in a setting where you take an action, get a reward, and move from s to s'.
[3] Just like DP, frame the value function as: V(s) = expected total future reward starting from state s.
[4] Now, how to get the equations? Let the total return from time t be G_t:
G_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ...
basically rewards from the future with some discount factor gamma. Why the powers? Because the further away the rewards are, the less we care!
[5] What is the value function then? Define it as V(s) = E[G_t | s_t = s] for some policy pi: if you are at state s and follow policy pi, how much reward do you get on average?
[6] Where does DP come in? Let's solve this equation a bit:
V(s) = E[r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ... | s_t = s]
Rewrite this as V(s) = E[r_t + gamma * (r_{t+1} + gamma * r_{t+2} + ...) | s_t = s].
Now r_{t+1} + gamma * r_{t+2} + ... = G_{t+1}, so G_t = r_t + gamma * G_{t+1}.
WE GOT A RECURSION, WE DO DP!!
So, finally: V(s) = E[r_t + gamma * G_{t+1} | s_t = s].
Khushi @starlitmatcha

Spent time with Q-learning and value-based RL today. Fun thing I noticed: the Bellman equation follows the same logic as dynamic programming.

2 replies · 2 reposts · 105 likes · 8.8K views
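The recursion at the end is exactly what makes tabular methods work: iterate V <- r + gamma * P V until it stops changing. A tiny illustration (my toy MDP with a fixed policy, not from the thread) in Python/NumPy:

```python
# Iterative policy evaluation on a toy 3-state MDP: plain DP on the Bellman
# recursion V(s) = E[r + gamma * V(s')].
import numpy as np

gamma = 0.9
P = np.array([          # P[s, s']: transition probabilities under a fixed policy
    [0.5, 0.5, 0.0],
    [0.0, 0.5, 0.5],
    [0.0, 0.0, 1.0],    # state 2 is absorbing
])
r = np.array([1.0, 0.0, 0.0])   # expected immediate reward in each state

V = np.zeros(3)
for _ in range(200):            # fixed-point iteration; gamma < 1 => contraction
    V = r + gamma * P @ V
print(V)                        # converges to the unique solution of V = r + gamma * P V
```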
Gokul Santhanam @gokstudio
@ArmenAgha Cool work! Do you have any ideas on how to extend this to dense models?
0 replies · 0 reposts · 0 likes · 178 views

Armen Aghajanyan @ArmenAgha
Since I started working on multimodal models 4 years ago, one harsh realization was that standard architectures don't allocate compute intelligently across modalities. We tried dense multimodal models (Chameleon) and MoE extensions (MoMA)... none felt quite right. Today we're proposing data sparsity as a new axis to solve this. Paper and blog below.
[image]
Maciej Kilian @kilian_maciej

in our most recent work we study data sparsity (ρ) - the dual axis to weight sparsity in standard token-choice MoEs. composing both weight and data sparsity improves training compute efficiency.

9 replies · 47 reposts · 460 likes · 52.8K views
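The tweets only name the axis, so the following is a loose sketch of one plausible reading of data sparsity (mine, close in spirit to mixture-of-depths routing, and not necessarily how the paper defines ρ): route just a fraction ρ of tokens through a block and let the rest skip it on the residual path, so compute scales with ρ independently of how many expert weights are active:

```python
# Hedged illustration of "data sparsity": only the top rho fraction of tokens
# per sequence get processed by the block; the rest pass through unchanged.
import torch
import torch.nn as nn

class DataSparseBlock(nn.Module):
    def __init__(self, dim: int, rho: float = 0.25):
        super().__init__()
        self.router = nn.Linear(dim, 1, bias=False)       # per-token keep score
        self.ffn = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )
        self.rho = rho

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, S, D = x.shape
        k = max(1, int(self.rho * S))                     # tokens that get compute
        idx = self.router(x).squeeze(-1).topk(k, dim=1).indices
        gather_idx = idx.unsqueeze(-1).expand(B, k, D)
        picked = torch.gather(x, 1, gather_idx)
        out = x.clone()                                   # skipped tokens: identity
        out.scatter_add_(1, gather_idx, self.ffn(picked)) # residual add for picked
        return out
```

Composing this with token-choice experts inside `ffn` would be the "both axes" setting the quoted tweet describes.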
Gokul Santhanam retweeted
Neil Zeghidour @neilzegh
Me defending my O(n^3) solution to the coding interviewer.
422 replies · 5K reposts · 49.7K likes · 4M views

Gokul Santhanam @gokstudio
@giffmana Filed your taxes so early in the year? So it basically created your Lohnausweis? Truly an AI agent beyond our wildest imaginations. Expect a visit from the Steueramt soon ;)
0 replies · 0 reposts · 0 likes · 41 views

Lucas Beyer (bl16) @giffmana
I installed Claude Cowork on my personal laptop yesterday. Since then, it has:
- freed 14GB
- got boot time from 15s to 6s
- nearly doubled battery life
- cleared my inbox; gmail... and linkedin!
- filed my taxes
- resolved all my open github issues
- successfully updated nvidia drivers
- finished a thought I started at uni
- taught my 5yo the piano
- fixed my posture
- settled a family dispute from 2013
- negotiated peace between neighbours
- achieved cold fusion
- looked at me and sighed
11/10 would install again.
Claude @claudeai

Introducing Cowork: Claude Code for the rest of your work. Cowork lets you complete non-technical tasks much like how developers use Claude Code.

324 replies · 385 reposts · 9.6K likes · 1.3M views

Gokul Santhanam @gokstudio
@ahall_research @giffmana There's a follow-up story to this. The filters were applied only to English queries. So when a group of Japanese folks visited the HQ, they were able to see all the unfiltered JP queries.
0 replies · 0 reposts · 0 likes · 14 views

Andy Hall @ahall_research
There's a famous old story that, way back in the early days of Google, they installed live electronic boards at the entrance to Google HQ showing live search terms, and the first day they installed them, they had to unplug them because almost all of the searches were pornographic. They implemented a bunch of filters and then turned the boards back on. (I think they may still be there, though it's been a long time since I've visited.)
2 replies · 1 repost · 39 likes · 9.5K views

Lucas Beyer (bl16) @giffmana
I just learned about the Venezuela news. My brain immediately went to thinking about what messages LLMs are likely getting. My best guess for the mode is: "hey have you heard of venezuela? which stocks should I buy? Ultrathink, make no mistakes." Google offices have TVs showing current live major search terms; I was recently thinking I want the same for the major LLMs.
13 replies · 2 reposts · 306 likes · 51.5K views

Gokul Santhanam @gokstudio
@prajdabre @DhotarAnkit Isn't the power dynamic so lopsided that students stand to lose way more than the prof, who can probably change the narrative completely? Toxic behavior (not exactly this kind) is prevalent even in prestigious orgs like MPI, and it's a losing proposition for students there too.
1 reply · 0 reposts · 0 likes · 22 views

Raj Dabre @prajdabre
@DhotarAnkit I'll say something stupid here: Get students from all universities involved and make mass reports.
2 replies · 1 repost · 2 likes · 221 views

Gokul Santhanam @gokstudio
@alexkaplan0 Is there a recommended preparation method to extract the maximum health benefit? Perhaps a naive question, but what do you mean by a cup of coffee?
0 replies · 0 reposts · 0 likes · 106 views

Alex Kaplan @alexkaplan0
Some of my favorite facts about coffee & health outcomes:
1. If you drink 3-4 cups per day, you have a 17% decrease in all-cause mortality (BMJ 2017;359)
2. One additional cup of coffee reduces risk of kidney stones by 19% (Yuan et al, 2021)
3. Coffee drinkers have ~49% lower risk of dying from liver disease (Kennedy et al, 2021)
Coffee is truly beneficial for your health when consumed without milk or sugar.
Cassie Pritchard @hecubian_devil

It’s crazy how every study done on coffee shows significant benefits for basically every organ system

38 replies · 34 reposts · 834 likes · 179.8K views
Gokul Santhanam retweeted
Raj Dabre @prajdabre
Here's your weekend challenge: Implement speculative decoding.
Step 1: Read the following paper and/or blog: arxiv.org/abs/2211.17192 galacodes.hashnode.dev/speculative-de… (cc @jaygala223)
Step 2: Choose a family of models which come in various sizes. My choice would be the Gemma3 or Qwen models. Within a family, choose a big model (say 27B for Gemma) and a small model (270M or 1B for Gemma).
Step 3: Benchmark the decoding speed of the larger model on any given dataset. For simplicity I would choose any machine translation dataset like IN22-Conv (huggingface.co/datasets/ai4bh…). Let's say the decoding speed is X seconds for the whole test set.
Step 4: Implement speculative decoding with the smaller model as the drafter and the larger one as the verifier. If you do this right, you will have a decoding speed of Y seconds where Y < X (ideally by a multiple >2).
Step 5: Play around with factors such as block size (the number of tokens generated by the drafter before being verified) and measure block efficiency.
Step 6: If you're feeling really ambitious, release this as a library and I'll promote it.
If you do this, you will have figured out one of the recipes behind getting large models to run fast in prod.
PS: this is a Google paper. cc @yanivle Matan Kalman and @ymatias.
[image]
13 replies · 38 reposts · 549 likes · 68.8K views
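For Step 4, here is a hedged, greedy-only sketch of the drafter/verifier loop (my code; greedy matching sidesteps the paper's rejection-sampling acceptance rule, and the models are assumed to be HF-style causal LMs whose forward returns .logits):

```python
# Greedy speculative decoding sketch: the drafter proposes `block` tokens,
# one verifier forward pass checks them all, and we keep the agreeing prefix.
import torch

@torch.no_grad()
def speculative_generate(target, drafter, input_ids, max_new_tokens=64, block=4):
    out = input_ids                                      # (1, T) prompt ids
    while out.shape[1] - input_ids.shape[1] < max_new_tokens:
        draft = out
        for _ in range(block):                           # cheap drafter proposals
            nxt = drafter(draft).logits[:, -1].argmax(-1, keepdim=True)
            draft = torch.cat([draft, nxt], dim=1)
        logits = target(draft).logits                    # one expensive verifier pass
        preds = logits[:, out.shape[1] - 1 : -1].argmax(-1)  # verifier's own picks
        proposed = draft[:, out.shape[1]:]
        match = (preds == proposed).long().cumprod(dim=1)
        n_ok = int(match.sum())                          # length of agreeing prefix
        # Accept the agreeing prefix, then take the verifier's token at the miss.
        out = torch.cat([out, proposed[:, :n_ok], preds[:, n_ok:n_ok + 1]], dim=1)
    return out
```

Each loop emits between 1 and `block` tokens per large-model forward pass, which is where the speedup comes from; the block efficiency of Step 5 is just the average number of accepted tokens per verifier call. A real implementation would also reuse KV caches instead of re-encoding the prefix every iteration.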
Gokul Santhanam @gokstudio
@giffmana Leads blessed and still no compute budget? Was it really blessed in the first place? 🤔
0 replies · 0 reposts · 1 like · 240 views

Gokul Santhanam @gokstudio
Very curious if these kinds of ecosystems exist in other Indian cities, especially in South India
Kartik @lifeofsigh

I was VERY deep into Delhi high-school tech culture -- started the tech society of my school looking at what dpsrkp/modern kids would pull up with -- have been to every school in Delhi and analyzed the landscape extremely well -- this was pretty much all I did in 11th grade. I pushed my younger sister to join dpsrkp because you get an almost unfair advantage by being in these schools. (I can write A LOT about this)

Most smart kids were as smart as folks from the 'elite' schools -- I met some incredibly cracked coders, cryptic solvers, gamers, animators who would casually beat college grads at most things. Lack of information, privilege and support is why most of them ended up in traditional pipelines.

As an Indian kid, engaging in extra-curriculars and applying for unis in India puts you at a major disadvantage because it's not used as a primary/secondary admission metric at any leading uni in India. And not making it to a T1 college here is a major nerf because the cultures (academic/corporate/social) in most colleges are garbage.

Parents/teachers, who are often the aggressive decision makers, simply wouldn't understand, and these ECA kids were blindly pushed towards competitive exams -- many ended up with trauma and confidence hits that still linger 6 years later. (I have kept in touch with many of them)

This was in stark contrast to the elite schools, where they would just let kids be and follow fun/curiosity. Mostly because everyone knew how things would work out. Profiles would be boosted, grades assigned on good vibes, and money wouldn't be the limiting factor. Plenty of factors beyond smartness and hard work. You simply can't replicate these systems in most schools because they are so structural in nature.

0 replies · 0 reposts · 0 likes · 183 views