Gokul Santhanam

7K posts

@gokstudio

Senior ML Engineer @🍎, working at the intersection of multimodal LLMs, efficient image encoders, and on-device ML. Views my own; retweet != endorsement.

Zurich, Switzerland · Joined January 2015
5.5K Following · 1.3K Followers
Samip @industriaalist
here's @JeffDean talking about how labs will do multi-epoch pretraining with heavy regularization to keep scaling even with limited data. no wonder slowrun gets so much attention from pretraining teams at big labs. pretraining is about to look very very different.
5 replies · 13 reposts · 165 likes · 18.1K views

Yuki @y_m_asano
@Kimi_Moonshot I think I can feel the Schmidhuber Highway-Networks/golden years post being written as we speak.
2 replies · 0 reposts · 12 likes · 980 views

Kimi.ai @Kimi_Moonshot
Introducing Attention Residuals: Rethinking depth-wise aggregation. Residual connections have long relied on fixed, uniform accumulation. Inspired by the duality of time and depth, we introduce Attention Residuals, replacing standard depth-wise recurrence with learned, input-dependent attention over preceding layers.
🔹 Enables networks to selectively retrieve past representations, naturally mitigating dilution and hidden-state growth.
🔹 Introduces Block AttnRes, partitioning layers into compressed blocks to make cross-layer attention practical at scale.
🔹 Serves as an efficient drop-in replacement, demonstrating a 1.25x compute advantage with negligible (<2%) inference latency overhead.
🔹 Validated on the Kimi Linear architecture (48B total, 3B activated parameters), delivering consistent downstream performance gains.
🔗 Full report: github.com/MoonshotAI/Att…
[image]
332 replies · 2.1K reposts · 13.5K likes · 4.9M views
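To make the depth-attention idea concrete (the report link above is truncated), here is a minimal PyTorch sketch of one way "learned, input-dependent attention over preceding layers" could replace the uniform residual sum. The module name, scoring function, and shapes are my assumptions, and this naive version materializes the full layer history rather than the compressed blocks of Block AttnRes:

```python
# Hedged sketch, not Moonshot's code: instead of the fixed residual
# h_{l+1} = h_l + f(h_l), let each position attend over the outputs of all
# preceding layers and add a learned, input-dependent mixture of them.
import torch
import torch.nn as nn

class AttentionResidual(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.q = nn.Linear(dim, dim, bias=False)  # query from the current activation
        self.k = nn.Linear(dim, dim, bias=False)  # keys from the layer history

    def forward(self, x: torch.Tensor, history: list[torch.Tensor]) -> torch.Tensor:
        # x: (B, S, D); history: outputs of layers 0..l, each (B, S, D)
        past = torch.stack(history, dim=2)                 # (B, S, L, D)
        q = self.q(x).unsqueeze(2)                         # (B, S, 1, D)
        k = self.k(past)                                   # (B, S, L, D)
        scores = (q * k).sum(-1) / past.shape[-1] ** 0.5   # (B, S, L) depth scores
        w = scores.softmax(dim=-1).unsqueeze(-1)           # attention over depth
        return x + (w * past).sum(dim=2)                   # selective retrieval
```

Attending over depth per position is what would let the network retrieve an early representation directly, instead of relying on it surviving every intermediate residual sum.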
Gokul Santhanam @gokstudio
@nthngdy Great insights. I haven't read the paper yet, but what are your thoughts on how we can apply this to LLMs where the embedding table is reused for the LM head?
0 replies · 0 reposts · 0 likes · 157 views

Nathan Godey @nthngdy
🧵New paper: "Lost in Backpropagation: The LM Head is a Gradient Bottleneck" The output layer of LLMs destroys 95-99% of your training signal during backpropagation, and this significantly slows down pretraining 👇
[image]
24 replies · 92 reposts · 852 likes · 73.8K views
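The thread doesn't spell out how the 95-99% figure is measured; one hedged, self-contained way to see a head reshaping the backward signal is to compare gradient norms on either side of it (this probe is my construction, not necessarily the paper's metric):

```python
# Toy probe: how does the gradient norm at the hidden states compare to the
# gradient norm at the logits, once backprop passes through the LM head?
import torch
import torch.nn.functional as F

torch.manual_seed(0)
N, D, V = 32, 1024, 50_000                  # tokens, hidden dim, vocab size
hidden = torch.randn(N, D, requires_grad=True)
lm_head = torch.randn(V, D) / D ** 0.5      # stand-in for a (possibly tied) head
targets = torch.randint(V, (N,))

logits = hidden @ lm_head.T
loss = F.cross_entropy(logits, targets)

g_logits, = torch.autograd.grad(loss, logits, retain_graph=True)
g_hidden, = torch.autograd.grad(loss, hidden)
print(f"||grad at logits|| = {g_logits.norm():.4f}")
print(f"||grad at hidden|| = {g_hidden.norm():.4f}")  # shaped by the head's spectrum
```

Since g_hidden = g_logits @ lm_head, whatever the head does to the gradient's magnitude and direction is exactly what the hidden states, and everything below them, get to learn from.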
Gokul Santhanam retweeted
Peter Tong @TongPetersb
Train Beyond Language. We bet on the visual world as the critical next step alongside and beyond language modeling. So, we studied building foundation models from scratch with vision. We share our exploration: visual representations, data, world modeling, architecture, and scaling behavior! [1/9]
[image]
36 replies · 222 reposts · 1.1K likes · 208.9K views

Gokul Santhanam @gokstudio
@prajdabre You still have the causal mask, so some notion of position is still encoded.
0 replies · 0 reposts · 0 likes · 24 views

Raj Dabre @prajdabre
Slightly advanced ML question: Interviewer: In a decoder-only causal language model, I removed all positional encodings and trained on a large corpus. What do you think will happen? If you answered "The model will produce garbage", your answer is wrong. Why?
37 replies · 10 reposts · 355 likes · 53.4K views
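A quick way to convince yourself of the causal-mask point (my toy demo: a single attention head, no positional encodings): swapping two early tokens changes the outputs at positions between them, because each such position now sees a different prefix, while the last position, which sees the same set either way, is unchanged:

```python
# Causal attention with no positional encodings is not order-invariant:
# the mask itself tells each position which prefix it may see.
import torch

torch.manual_seed(0)
S, D = 6, 16
x = torch.randn(S, D)

def causal_attn(x):
    scores = (x @ x.T) / D ** 0.5
    mask = torch.triu(torch.ones(S, S, dtype=torch.bool), diagonal=1)
    return scores.masked_fill(mask, float("-inf")).softmax(dim=-1) @ x

y = causal_attn(x)
x2 = x.clone()
x2[[0, 3]] = x[[3, 0]]                     # swap tokens 0 and 3
y2 = causal_attn(x2)
print((y[2] - y2[2]).abs().max())          # nonzero: position 2 sees a new prefix
print((y[5] - y2[5]).abs().max())          # ~0: position 5 sees the same set
```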
Gokul Santhanam @gokstudio
@gabriberton Not directly into the Value vectors, no, but AFAICT, to the input embeddings. They're similar in that they too are a way to increase param count w/o increasing FLOPs
1 reply · 0 reposts · 1 like · 20 views

Gabriele Berton @gabriberton
@gokstudio I read the Gemma 3 paper but missed the Gemma 3n. Do they do the same as here?
1 reply · 0 reposts · 0 likes · 48 views

Gabriele Berton @gabriberton
The most interesting thing I've seen in a while: the recipe by @karpathy to reduce GPT-2 1.5B training cost from $43,000 to $73! Seven years of improvements over vanilla GPT in 10 points. Let's start with the uncommon ones: 1) Value Embeddings: I've never seen this in any LLM, [1/N]
[image]
24 replies · 125 reposts · 1.6K likes · 151.4K views

Gabriele Berton @gabriberton
Every 2 layers they add trainable "Value Embeddings" (VEs) to the V tokens. VEs have the same shape as the token-embedding table (D x vocab_size). VEs use 600M params (~50% of the total!) but add negligible FLOPs. My intuition is that VEs "remind" the LLM about the initial input token. [2/N]
[image]
5 replies · 4 reposts · 84 likes · 14K views
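From the thread's description alone, a minimal PyTorch reconstruction might look like the following; the class and argument names are mine, and modded-nanogpt's actual version differs in details (e.g. which layers get VEs and how they're gated). The key property is that the extra parameters enter through an embedding lookup on the input ids rather than a matmul, so FLOPs barely move:

```python
# Sketch (my reconstruction, not the modded-nanogpt code): a trainable
# vocab_size x D table is looked up by input id and added to the attention
# V vectors, adding ~vocab_size*D params for near-zero extra FLOPs.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ValueEmbedAttention(nn.Module):
    def __init__(self, vocab_size: int, dim: int, n_heads: int):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim, bias=False)
        self.proj = nn.Linear(dim, dim, bias=False)
        self.value_embed = nn.Embedding(vocab_size, dim)   # the extra table
        self.n_heads = n_heads

    def forward(self, x: torch.Tensor, input_ids: torch.Tensor) -> torch.Tensor:
        B, S, D = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        v = v + self.value_embed(input_ids)                # lookup, not a matmul
        hd = D // self.n_heads
        q, k, v = (t.view(B, S, self.n_heads, hd).transpose(1, 2) for t in (q, k, v))
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.proj(out.transpose(1, 2).reshape(B, S, D))
```

This also matches Berton's intuition: the added V term is a pure function of the raw input ids, so every such layer gets an uncontaminated reminder of the original tokens.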
Gokul Santhanam @gokstudio
@prajdabre I still have old relatives who use 4-anna and 8-anna terminology, especially colloquially, like "his work's not even worth 8 anna". Also, 2k years ago? You gotta share pics, man!
0 replies · 0 reposts · 1 like · 663 views

Raj Dabre @prajdabre
As a coin collector (or numismatist) I can tell you some interesting facts:
1. Back then 64 paise was 1 Rupee, because 1 Anna was 1/16 of a Rupee. They later changed it to 100 paise = 1 Rupee.
2. This conversion had an interesting effect. 8 Anna meant half a rupee, i.e. 32 paise. But even after the change from 64 to 100 paise per Rupee, people still called half a rupee "8 Anna", and this was after the Anna had been demonetized. In short, 8 Anna = 50 paise although the Anna was no longer valid. People retained this nomenclature into the 2000s.
3. The photo is missing the 1/4 rupee, 1/12 rupee and so on. I have some of these coins, minted during the British Raj.
4. There are a ton of Paisa denominations, which came after the Anna, that are missing from the picture; I own these as well.
Unrelated to Indian currency: I have coins dating back 2,000 years, and it's just amazing to consider the history and how many hands these currencies have passed through.
Stocks World @anandchokshi19

The Evolution Of Indian Currency:

22 replies · 46 reposts · 607 likes · 58.9K views

Gokul Santhanam @gokstudio
@cneuralnetwork First para of the Wiki entry on DP: "Dynamic programming is both a mathematical optimization method and an algorithmic paradigm. The method was developed by Richard Bellman in the 1950s and has found applications in numerous fields, such as aerospace engineering and economics."
1 reply · 0 reposts · 2 likes · 201 views

neural nets. @cneuralnetwork
the intuition behind this is cool
[1] What is DP, basically? You break a big problem into smaller ones, solve each subproblem, and combine them into the bigger solution; basically dp[i] comes from dp[0..i-1].
[2] RL is basically a sequential optimization problem: you are maximizing future reward in a setting where you take an action, get a reward, and move from s to s'.
[3] Just like DP, frame the value function as: V(s) = expected total future reward starting from state s.
[4] Now, how to get the equations? Let the total return from time t be G_t:
G_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ...
basically rewards from the future with some discount factor gamma. Why the powers? Because the further away the rewards are, the less we care!
[5] What is the value function then? Define it as V(s) = E[G_t | s_t = s] for some policy pi: if you are at state s and follow policy pi, how much reward do you get on average?
[6] Where does DP come in? Let's solve this equation a bit:
V(s) = E[r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ... | s_t = s]
Rewrite this as V(s) = E[r_t + gamma * (r_{t+1} + gamma * r_{t+2} + ...) | s_t = s].
Now r_{t+1} + gamma * r_{t+2} + ... = G_{t+1}, so G_t = r_t + gamma * G_{t+1}.
WE GOT A RECURSION, WE DO DP!!
So, finally: V(s) = E[r_t + gamma * G_{t+1} | s_t = s].
Khushi @starlitmatcha

Spent time with Q-learning and value-based RL today. Fun thing I noticed: the Bellman equation follows the same logic as dynamic programming.

2 replies · 2 reposts · 105 likes · 8.8K views
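The recursion at the end is exactly what makes tabular methods work: iterate V <- r + gamma * P V until it stops changing. A tiny illustration (my toy MDP with a fixed policy, not from the thread) in Python/NumPy:

```python
# Iterative policy evaluation on a toy 3-state MDP: plain DP on the Bellman
# recursion V(s) = E[r + gamma * V(s')].
import numpy as np

gamma = 0.9
P = np.array([          # P[s, s']: transition probabilities under a fixed policy
    [0.5, 0.5, 0.0],
    [0.0, 0.5, 0.5],
    [0.0, 0.0, 1.0],    # state 2 is absorbing
])
r = np.array([1.0, 0.0, 0.0])   # expected immediate reward in each state

V = np.zeros(3)
for _ in range(200):            # fixed-point iteration; gamma < 1 => contraction
    V = r + gamma * P @ V
print(V)                        # converges to the unique solution of V = r + gamma * P V
```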
Gokul Santhanam @gokstudio
@ArmenAgha Cool work! Do you have any ideas on how to extend this to dense models?
0 replies · 0 reposts · 0 likes · 178 views

Armen Aghajanyan @ArmenAgha
Since I started working on multimodal models 4 years ago, one harsh realization was that standard architectures don't allocate compute intelligently across modalities. We tried dense multimodal models (Chameleon) and MoE extensions (MoMA)... none felt quite right. Today we're proposing data sparsity as a new axis to solve this. Paper and blog below.
[image]
Maciej Kilian @kilian_maciej

in our most recent work we study data sparsity (ρ) - the dual axis to weight sparsity in standard token-choice MoEs. composing both weight and data sparsity improves training compute efficiency.

9 replies · 47 reposts · 460 likes · 52.8K views
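The tweets only name the axis, so the following is a loose sketch of one plausible reading of data sparsity (mine, close in spirit to mixture-of-depths routing, and not necessarily how the paper defines ρ): route just a fraction ρ of tokens through a block and let the rest skip it on the residual path, so compute scales with ρ independently of how many expert weights are active:

```python
# Hedged illustration of "data sparsity": only the top rho fraction of tokens
# per sequence get processed by the block; the rest pass through unchanged.
import torch
import torch.nn as nn

class DataSparseBlock(nn.Module):
    def __init__(self, dim: int, rho: float = 0.25):
        super().__init__()
        self.router = nn.Linear(dim, 1, bias=False)       # per-token keep score
        self.ffn = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )
        self.rho = rho

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, S, D = x.shape
        k = max(1, int(self.rho * S))                     # tokens that get compute
        idx = self.router(x).squeeze(-1).topk(k, dim=1).indices
        gather_idx = idx.unsqueeze(-1).expand(B, k, D)
        picked = torch.gather(x, 1, gather_idx)
        out = x.clone()                                   # skipped tokens: identity
        out.scatter_add_(1, gather_idx, self.ffn(picked)) # residual add for picked
        return out
```

Composing this with token-choice experts inside `ffn` would be the "both axes" setting the quoted tweet describes.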
Gokul Santhanam retweeted
Neil Zeghidour @neilzegh
Me defending my O(n^3) solution to the coding interviewer.
422 replies · 5K reposts · 49.7K likes · 4M views

Gokul Santhanam @gokstudio
@giffmana Filed your taxes so early in the year? So it basically created your Lohnausweis? Truly an AI agent beyond our wildest imaginations. Expect a visit from the Steueramt soon ;)
0 replies · 0 reposts · 0 likes · 41 views

Lucas Beyer (bl16) @giffmana
I installed Claude Cowork on my personal laptop yesterday. Since then, it has:
- freed 14GB
- got boot time from 15s to 6s
- nearly doubled battery life
- cleared my inbox; gmail... and linkedin!
- filed my taxes
- resolved all my open github issues
- successfully updated nvidia drivers
- finished a thought I started at uni
- taught my 5yo the piano
- fixed my posture
- settled a family dispute from 2013
- negotiated peace between neighbours
- achieved cold fusion
- looked at me and sighed
11/10 would install again.
Claude @claudeai

Introducing Cowork: Claude Code for the rest of your work. Cowork lets you complete non-technical tasks much like how developers use Claude Code.

324 replies · 385 reposts · 9.6K likes · 1.3M views

Gokul Santhanam @gokstudio
@ahall_research @giffmana There's a follow-up story to this. The filters were applied only to English queries. So when a group of Japanese folks visited the HQ, they were able to see all the unfiltered JP queries.
0 replies · 0 reposts · 0 likes · 14 views

Andy Hall @ahall_research
There's a famous old story that, way back in the early days of Google, they installed live electronic boards at the entrance to Google HQ showing live search terms, and the first day they installed them, they had to unplug them because almost all of the searches were pornographic. They implemented a bunch of filters and then turned the boards back on. (I think they may still be there, though it's been a long time since I've visited.)
2 replies · 1 repost · 39 likes · 9.5K views

Lucas Beyer (bl16) @giffmana
I just learned about the Venezuela news. My brain immediately went to thinking about what messages LLMs are likely getting. My best guess for the mode is: "hey have you heard of venezuela? which stocks should I buy? Ultrathink, make no mistakes." Google offices have TVs showing current live major search terms; I was recently thinking I want the same for the major LLMs.
13 replies · 2 reposts · 306 likes · 51.5K views

Gokul Santhanam @gokstudio
@prajdabre @DhotarAnkit Isn't the power dynamic so lopsided that students stand to lose way more than the prof, who can probably change the narrative completely? Toxic behavior (not exactly this kind) is prevalent even in prestigious orgs like MPI, and it's a losing proposition for students there too.
1 reply · 0 reposts · 0 likes · 22 views

Raj Dabre @prajdabre
@DhotarAnkit I'll say something stupid here: Get students from all universities involved and make mass reports.
2 replies · 1 repost · 2 likes · 221 views

Gokul Santhanam @gokstudio
@alexkaplan0 Is there a recommended preparation method to extract the maximum health benefit? Perhaps a naive question, but what do you mean by a cup of coffee?
0 replies · 0 reposts · 0 likes · 106 views

Alex Kaplan @alexkaplan0
Some of my favorite facts about coffee & health outcomes:
1. If you drink 3-4 cups per day, you have a 17% decrease in all-cause mortality (BMJ 2017;359)
2. One additional cup of coffee reduces risk of kidney stones by 19% (Yuan et al, 2021)
3. Coffee drinkers have ~49% lower risk of dying from liver disease (Kennedy et al, 2021)
Coffee is truly beneficial for your health when consumed without milk or sugar.
Cassie Pritchard @hecubian_devil

It’s crazy how every study done on coffee shows significant benefits for basically every organ system

38 replies · 34 reposts · 834 likes · 179.8K views
Gokul Santhanam retweeted
Raj Dabre @prajdabre
Here's your weekend challenge: Implement speculative decoding.
Step 1: Read the following paper and/or blog: arxiv.org/abs/2211.17192 galacodes.hashnode.dev/speculative-de… (cc @jaygala223)
Step 2: Choose a family of models which come in various sizes. My choice would be the Gemma3 or Qwen models. Within a family, choose a big model (say 27B for Gemma) and a small model (270M or 1B for Gemma).
Step 3: Benchmark the decoding speed of the larger model on any given dataset. For simplicity I would choose any machine translation dataset like IN22-Conv (huggingface.co/datasets/ai4bh…). Let's say the decoding speed is X seconds for the whole test set.
Step 4: Implement speculative decoding with the smaller model as the drafter and the larger one as the verifier. If you do this right, you will have a decoding speed of Y seconds where Y < X (ideally by a multiple >2).
Step 5: Play around with factors such as block size (the number of tokens generated by the drafter before being verified) and measure block efficiency.
Step 6: If you're feeling really ambitious, release this as a library and I'll promote it.
If you do this, you will have figured out one of the recipes behind getting large models to run fast in prod.
PS: this is a Google paper. cc @yanivle Matan Kalman and @ymatias.
[image]
13 replies · 38 reposts · 549 likes · 68.8K views
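For Step 4, here is a hedged, greedy-only sketch of the drafter/verifier loop (my code; greedy matching sidesteps the paper's rejection-sampling acceptance rule, and the models are assumed to be HF-style causal LMs whose forward returns .logits):

```python
# Greedy speculative decoding sketch: the drafter proposes `block` tokens,
# one verifier forward pass checks them all, and we keep the agreeing prefix.
import torch

@torch.no_grad()
def speculative_generate(target, drafter, input_ids, max_new_tokens=64, block=4):
    out = input_ids                                      # (1, T) prompt ids
    while out.shape[1] - input_ids.shape[1] < max_new_tokens:
        draft = out
        for _ in range(block):                           # cheap drafter proposals
            nxt = drafter(draft).logits[:, -1].argmax(-1, keepdim=True)
            draft = torch.cat([draft, nxt], dim=1)
        logits = target(draft).logits                    # one expensive verifier pass
        preds = logits[:, out.shape[1] - 1 : -1].argmax(-1)  # verifier's own picks
        proposed = draft[:, out.shape[1]:]
        match = (preds == proposed).long().cumprod(dim=1)
        n_ok = int(match.sum())                          # length of agreeing prefix
        # Accept the agreeing prefix, then take the verifier's token at the miss.
        out = torch.cat([out, proposed[:, :n_ok], preds[:, n_ok:n_ok + 1]], dim=1)
    return out
```

Each loop emits between 1 and `block` tokens per large-model forward pass, which is where the speedup comes from; the block efficiency of Step 5 is just the average number of accepted tokens per verifier call. A real implementation would also reuse KV caches instead of re-encoding the prefix every iteration.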
Gokul Santhanam @gokstudio
@giffmana Leads blessed and still no compute budget? Was it really blessed in the first place? 🤔
0 replies · 0 reposts · 1 like · 240 views

Gokul Santhanam @gokstudio
Very curious if these kinds of ecosystems exist in other Indian cities, especially in South India
Kartik @lifeofsigh

I was VERY deep into Delhi high-school tech culture -- started the tech society of my school looking at what dpsrkp/modern kids would pull up with -- have been to every school in Delhi and analyzed the landscape extremely well -- this was pretty much all I did in 11th grade. I pushed my younger sister to join dpsrkp because you get an almost unfair advantage by being in these schools. (I can write A LOT about this)

Most smart kids were as smart as folks from the 'elite' schools -- I met some incredibly cracked coders, cryptic solvers, gamers, animators who would casually beat college grads at most things. Lack of information, privilege and support is why most of them ended up in traditional pipelines.

As an Indian kid, engaging in extra-curriculars and applying for unis in India puts you at a major disadvantage because it's not used as a primary/secondary admission metric at any leading uni in India. And not making it to a T1 college here is a major nerf because the cultures (academic/corporate/social) in most colleges are garbage.

Parents/teachers, who are often the aggressive decision makers, simply wouldn't understand, and these ECA kids were blindly pushed towards competitive exams -- many ended up with trauma and confidence hits that still linger 6 years later. (I have kept in touch with many of them)

This was in stark contrast to the elite schools, where they would just let kids be and follow fun/curiosity. Mostly because everyone knew how things would work out. Profiles would be boosted, grades assigned on good vibes, and money wouldn't be the limiting factor. Plenty of factors beyond smartness and hard work. You simply can't replicate these systems in most schools because they are so structural in nature.

0 replies · 0 reposts · 0 likes · 183 views