Saurabh Dash

546 posts

Saurabh Dash

@TheyCallMeMr_

Bottomless pit supervisor. ML @CohereAI , PhD Student @GeorgiaTech. Previously @Apple, @IITkgp. https://t.co/yZLkUsiZ7P. Opinions expressed are my own

Atlanta, GA, United States Katılım Eylül 2016

692 Takip Edilen620 Takipçiler

Saurabh Dash@TheyCallMeMr_·2d

@giffmana @tenderizzation @bilaltwovec I agree! Unfortunately, It just didn't gel well with the 384 size!

English

Lucas Beyer (bl16)@giffmana·3d

@TheyCallMeMr_ @tenderizzation @bilaltwovec 14 is actually the superior one that leads to good seqlen on then-common image sizes (224, 448, etc), which is what matters.

English

bilal@bilaltwovec·3d

who remembers efficientnets

Lucas Beyer (bl16)@giffmana

@tenderizzation And they too completely disrespected your RAM and didn't scale well at all because of that 😬

English

3.2K

Saurabh Dash@TheyCallMeMr_·3d

@giffmana @tenderizzation @bilaltwovec Needs another line for the patch size

English

Lucas Beyer (bl16)@giffmana·3d

@tenderizzation @bilaltwovec People have still not forgiven me refusing to slightly bump these two numbers...

English

742

Saurabh Dash@TheyCallMeMr_·14 Mar

@F1 bring back the 3 digit precision in race interval visuals, you cowards

English

Saurabh Dash@TheyCallMeMr_·11 Mar

Borat did it first

Jürgen Schmidhuber@SchmidhuberAI

LeCun’s new company on physical AI with world models [9] looks a lot like our 2014 company on physical AI with world models [1] 😀 See also [2-8] - all references in the reply!

English

144

Saurabh Dash retweetledi

Dwarak@DwaraknathG·9 Mar

All this and more! we go over our design decisions and the lessons learned along the way! Please do come hang out if you are at GTC. A huge thank you to all the team members for their hard work! nvidia.com/gtc/session-ca…

English

754

Saurabh Dash retweetledi

Dwarak@DwaraknathG·9 Mar

Hey all, I will be at GTC next week talking about all the work my team and I did on large-scale MoE training in JAX on GPUs! We decided early on to have a fully dropless training stack to avoid token dropping. (1/7)

English

103

14.8K

Saurabh Dash@TheyCallMeMr_·1 Mar

My ICML review pile in a nutshell: This is the best thing invented since sliced bread. Look at how good it's on CIFAR-10.

English

11.4K

Saurabh Dash@TheyCallMeMr_·28 Şub

@awnihannun GOATed run @awnihannun! 🫡

English

Awni Hannun@awnihannun·28 Şub

Today is my last day at Apple. Building MLX with our amazing team and community has been an absolute pleasure. It's still early days for AI on Apple silicon. Apple makes the best consumer hardware on the planet. There's so much potential for it to be the leading platform for AI. And I'm confident MLX will continue to have a big role in that. To the future: MLX remains in the exceptionally capable hands of our team including @angeloskath, @zcbenz, @DiganiJagrit, @NasFilippova, @trebolloc (and others not on X). Follow them or @shshnkp for future updates.

English

260

2.2K

396.1K

Saurabh Dash retweetledi

Matthew Leavitt@leavittron·19 Şub

@RicardoMonti9 @KaleighMentzer @agcrnz I'm also quite pleased that @Cohere_Labs released Tiny Aya the day before we released ÜberWeb, and we were able to evaluate it and include it in our report. The whole Aya project has been a big inspiration for us. I officially declare it Multilingual Release Week!!

English

1.1K

Saurabh Dash retweetledi

Sebastian Raschka@rasbt·20 Şub

Tiny Aya reimplementation From Scratch! Have been reading through the technical reports of the recent wave of open-weight LLM releases (more on that soon). Tiny Aya (2 days ago) was a bit under the radar. Looks like a nice, small 3.35B model with strongest multilingual support of that size class. Great for on-device translation tasks. Just did a from-scratch implementation here: github.com/rasbt/LLMs-fro… Architecture-wise, Tiny Aya is a classic decoder-style transformer with a few noteworthy modifications (besides the obvious ones like SwiGLU and Grouped Query Attention): 1. Parallel transformer blocks. A parallel transformer block computes attention and MLP from the same normalized input, then adds both to the residual in one step. I assume this is to reduce serial dependencies inside a layer to improve computational throughput. 2. Sliding window attention. Specifically, it uses a 3:1 local:global ratio similar to Arcee Trinity and Olmo 3. The window size is also 4096. Also, similar to Arcee, the sliding window layers use RoPE whereas the full attention layers use NoPE. 3. LayerNorm. Most architectures moved to RMSNorm as it's computationally a bit cheaper and performs well. Tiny Aya is keeping it more classic with a modified version of LayerNorm (the implementation here is like standard LayerNorm but without shift, i.e., bias, parameter).

English

166

1.2K

67.3K

Saurabh Dash retweetledi

Asghar Ghorbani@ghorbani_asghar·17 Şub

@Cohere_Labs Tiny Aya is the true multilingual on-device model we've been waiting for. 70+ global languages in a 3.35B parameter model 🤯

English

Saurabh Dash retweetledi

TechCrunch@TechCrunch·17 Şub

Cohere launches a family of open multilingual models techcrunch.com/2026/02/17/coh…

English

120

22.8K

Saurabh Dash retweetledi

Cohere Labs@Cohere_Labs·17 Şub

Mobile access unlocks the real potential of open models. Thanks to @pocketpal_ai and especially @ghorbani_asghar for helping us bring Tiny Aya to mobile! 📱 Their expertise made it possible to deliver the most capable multilingual model at this scale directly to people's phones.

English

1.4K

Saurabh Dash retweetledi

Cohere Labs@Cohere_Labs·17 Şub

Introducing ✨Tiny Aya✨, a family of massively multilingual small language models built to run where people actually are. Tiny Aya delivers strong multilingual performance in 70+ global languages in a 3.35B parameter model, efficient enough to run locally, even on a phone.

English

158

859

185.2K

Saurabh Dash@TheyCallMeMr_·12 Şub

@____aakanksha @sarahookr @mziizm +1 these are great recommendations. Gulati’s is an absolute pain to get into, so the restaurant beside it is also great!

English

aakanksha@____aakanksha·12 Şub

@sarahookr bukhara for north indian, carnatic cafe for south indian and chaat + rasmalai from haldiram’s! friends rave about gulati’s for butter chicken etc; oh and the tender coconut ice cream at natural’s 🤤 (should also try indo-chinese / momos in delhi - maybe at berco’s!) cc @mziizm :)

English

2.8K

Sara Hooker@sarahookr·12 Şub

Ok. It is time. I have time on Saturday and Sunday night to explore New Delhi ahead of the summit. Let’s go Delhi food recommendations. 🔥

English

142

414

57K

Saurabh Dash@TheyCallMeMr_·12 Şub

@sarahookr Definitely try Bukhara! Especially the kebabs and the Dal Bukhara.

English

Saurabh Dash@TheyCallMeMr_·7 Şub

If you are interested in squeezing every drop of available compute, @acyr_l is hiring for an ML Systems engineer to make GPUs go brrrrrrr jobs.ashbyhq.com/cohere/c99e61c…

English

276

Saurabh Dash@TheyCallMeMr_·7 Şub

Calling it compute rich/poor and not Jensen’s Inequality is a missed opportunity

English

6.9K

Saurabh Dash@TheyCallMeMr_·15 Oca

@MaxMa1987 Super cool to see this idea applied to LLMs! We previously explored it for event-camera data. openreview.net/pdf?id=ZCStthy…

English

Xuezhe Ma (Max)@MaxMa1987·13 Oca

After about 2 years, we are proud to release Gecko, an efficient architecture that improves upon Megalodon, with capability of efficiently and inherently processing sequences with unlimited context length. One of the most important idea in Gecko is Adaptive Working Memory(AWM), implemented using a linear attention mechanism with a position-aware online softmax activation. Notably, AWM globally compresses information into memory, rather than discarding historical information through forgetting. In a controlled head-to-head comparison with Llama2 and Megalodon, Gecko achieves better performance in the scale of 7B and 2T training tokens. Gecko achieves 1.68 training loss, vs. 1.67 of Llama2-13B, with half number of parameters on 2T tokens. Paper: arxiv.org/abs/2601.06463 Code: github.com/XuezheMax/geck…

English

143

20.7K

Saurabh Dash@TheyCallMeMr_·13 Oca

@sarahookr I can send the food recommendations! ;)

English

299

Sara Hooker@sarahookr·13 Oca

I'm really looking forward to also getting to know the AI ecosystem in Delhi (for the summit) + Bangalore. If you have recommendations of initiatives I should visit or people I should meet while I'm there, send them my way.

English

5.1K

Sara Hooker@sarahookr·13 Oca

My first trip to India is next month. I'm honored to be attending the India-AI Impact Summit 2026. Truly very meaningful given our commitment @adaptionlabs to building global technology and ensuring language coverage.

English

551

27.3K

Keşfet

@giffmana @tenderizzation @bilaltwovec @F1 @awnihannun @angeloskath @zcbenz @DiganiJagrit