Akash Mahajan


Akash Mahajan retweeted

tmuxvim@tmuxvim·15 May

I put a prompt injection into my LinkedIn bio and recruiters are messaging me in Old English and calling me Lord.


Akash Mahajan@akashmjn·15 May

Less typewriter, more editor. Diffusion models are the future of codegen, HCI/dynamic UIs and hence agents.

Inception@_inception_ai

Today's autoregressive models generate one token at a time. Mercury 2 generates tokens in parallel. Over 1,000 tok/sec on standard GPUs, at comparable quality to speed-optimized models. Since launch, the community has been showing what diffusion LLMs can unlock. Thanks to the team at Clyep for the breakdown.


Thinking Machines@thinkymachines


Akash Mahajan retweeted

Joscha Bach@Plinz·14 May

LLMs are wordcels AND shape rotators

Goodfire@GoodfireAI

Neural networks do math by rotating shapes. We found a shape-rotating calculator hidden inside an LLM – and it’s used for more than just math! (1/6)


Akash Mahajan retweeted

Desh Raj@rdesh26·12 May

x.com/i/article/2054…


Akash Mahajan retweeted

Rowan Zellers@rown·11 May

We are so back!

People talk, listen, watch, think, and collaborate at the same time, in real time. We've designed an AI that works with people the same way. We share our approach, early results, and a quick look at our model in action. thinkingmachines.ai/blog/interacti…


Akash Mahajan@akashmjn·12 May

@Stanford_AI_Bio @BioAI_Pharma @suragnair Ty for the clarification! :) (signup link from the QR code below if it helps others) go.roche.com/aibiomed_regis…


Stanford AI+Biomedicine Seminar@Stanford_AI_Bio·11 May

@akashmjn @BioAI_Pharma @suragnair Thank you for the interest! (: Please join the mailing list for Zoom access (DM for the current link). The recording will be posted pending the speaker's approval.


Stanford AI+Biomedicine Seminar@Stanford_AI_Bio·11 May

We are excited to welcome @suragnair this Tuesday to present CompBioBench: A benchmark of 100 diverse tasks for evaluating agentic systems in computational biology! 📍2:30pm Tuesday May 12 | CoDa E160 | Stanford and Zoom


Akash Mahajan@akashmjn·11 May

@BioAI_Pharma @Stanford_AI_Bio @suragnair Looking at snap.stanford.edu/ai-bio-seminar/ - will likely be updated with a recording after.


@BioAI_Neuro@BioAI_Pharma·11 May

@Stanford_AI_Bio @suragnair Is this open to the public via Zoom?


Akash Mahajan retweeted

DailyPapers@HuggingPapers·25 Mar

MinerU-Diffusion A 2.5B diffusion-based OCR model that replaces slow autoregressive decoding with parallel block-wise diffusion, achieving up to 3.2x faster inference while improving robustness on complex documents with tables, formulas, and layouts.


Google for Developers@googledevs


Akash Mahajan@akashmjn·11 May

Be careful not to sleep on @diffusion_llms ... Several OOM breakthroughs on the horizon

Breaking LLM inference’s autoregressive bottleneck 🛠️ We've teamed up with @haozhangml, @YimingBob, and @aaronzhfeng, among others from UCSD to achieve a massive 3.13X speedup for LLM inference on Google Cloud TPUs using Diffusion-Style Speculative Decoding (DFlash). Read the blog: goo.gle/4naZ8Yv


Akash Mahajan@akashmjn·11 May

Voice is a great input modality, but very limited as output. For this reason, voice product demos so far (even OpenAI's real-time voice last week) feel a little stuck in awkward VR-like territory. When we see SNAPPY voice input PLUS UI changes on screen, that'll be a takeoff moment. Thumbing away at your devices will then feel quite primitive.

Andrej Karpathy@karpathy

This works really well btw: at the end of your query, ask your LLM to "structure your response as HTML", then view the generated file in your browser. I've also had some success asking the LLM to present its output as slideshows, etc.
More generally, imo audio is the human-preferred input to AIs but vision (images/animations/video) is the preferred output from them. Around a third of our brains are a massively parallel processor dedicated to vision; it is the 10-lane superhighway of information into the brain. As AI improves, I think we'll see a progression that takes advantage:
1) raw text (hard/effortful to read)
2) markdown (bold, italic, headings, tables, a bit easier on the eyes) <-- current default
3) HTML (still procedural with underlying code, but a lot more flexibility on the graphics, layout, even interactivity) <-- early but forming new good default
...4,5,6,...
n) interactive neural videos/simulations
Imo the extrapolation (though the technology doesn't exist just yet) ends in some kind of interactive videos generated directly by a diffusion neural net. Many open questions as to how exact/procedural "Software 1.0" artifacts (e.g. interactive simulations) may be woven together with neural artifacts (diffusion grids), but generally something in the direction of the recently viral x.com/zan2434/status…
There are also improvements necessary and pending at the input. Neither audio nor text nor video alone is enough, e.g. I feel a need to point/gesture to things on the screen, similar to all the things you would do with a person physically next to you and your computer screen.
TLDR: The input/output mind meld between humans and AIs is ongoing and there is a lot of work to do and significant progress to be made, way before jumping all the way into neuralink-esque BCIs and all that. For what's worth exploring at the current stage, hot tip: try asking for HTML.


Akash Mahajan retweeted

Andrej Karpathy@karpathy·11 May

This works really well btw: at the end of your query, ask your LLM to "structure your response as HTML", then view the generated file in your browser. I've also had some success asking the LLM to present its output as slideshows, etc.
More generally, imo audio is the human-preferred input to AIs but vision (images/animations/video) is the preferred output from them. Around a third of our brains are a massively parallel processor dedicated to vision; it is the 10-lane superhighway of information into the brain. As AI improves, I think we'll see a progression that takes advantage:
1) raw text (hard/effortful to read)
2) markdown (bold, italic, headings, tables, a bit easier on the eyes) <-- current default
3) HTML (still procedural with underlying code, but a lot more flexibility on the graphics, layout, even interactivity) <-- early but forming new good default
...4,5,6,...
n) interactive neural videos/simulations
Imo the extrapolation (though the technology doesn't exist just yet) ends in some kind of interactive videos generated directly by a diffusion neural net. Many open questions as to how exact/procedural "Software 1.0" artifacts (e.g. interactive simulations) may be woven together with neural artifacts (diffusion grids), but generally something in the direction of the recently viral x.com/zan2434/status…
There are also improvements necessary and pending at the input. Neither audio nor text nor video alone is enough, e.g. I feel a need to point/gesture to things on the screen, similar to all the things you would do with a person physically next to you and your computer screen.
TLDR: The input/output mind meld between humans and AIs is ongoing and there is a lot of work to do and significant progress to be made, way before jumping all the way into neuralink-esque BCIs and all that. For what's worth exploring at the current stage, hot tip: try asking for HTML.
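The "ask for HTML, then open it in your browser" tip can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names (`with_html_request`, `save_html`, `view_in_browser`) and the file name `response.html` are hypothetical, and the step that actually sends the query to an LLM is left out because the post does not name a client or API.

```python
import tempfile
import webbrowser
from pathlib import Path

# The instruction quoted in the post, appended to the end of a query.
HTML_SUFFIX = "Structure your response as HTML."


def with_html_request(query: str) -> str:
    """Return the query with the HTML instruction appended at the end."""
    return f"{query.rstrip()}\n\n{HTML_SUFFIX}"


def save_html(response: str, directory=None) -> Path:
    """Write the model's HTML response to a file and return its path.

    `response` is whatever text the LLM returned (hypothetical input here);
    a fresh temporary directory is used when none is given.
    """
    target_dir = Path(directory) if directory else Path(tempfile.mkdtemp())
    path = target_dir / "response.html"
    path.write_text(response, encoding="utf-8")
    return path


def view_in_browser(path: Path) -> bool:
    """Open the saved file in the default browser via a file:// URI."""
    return webbrowser.open(path.resolve().as_uri())
```

Typical use would be `view_in_browser(save_html(reply))`, where `reply` is the text returned for `with_html_request("...")` by whichever LLM client you use.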

Thariq@trq212

x.com/i/article/2052…


Akash Mahajan retweeted

Naveen Rao@NaveenGRao·5 May

Great post to understand some of the subtle changes to LLMs over the last couple of years. It really is evolution rather than revolution. But each evolution has high leverage

Jason Zhu@GoSailGlobal

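At Stanford CS336, Tatsu gave a lecture on LLM architecture that takes apart every mainstream LLM of the past 3 years to find their common template. The conclusion is striking: 90% of architecture choices have converged. Pick any open-source large model and it is almost identical to the others along these dimensions.
In the lecturer's own words:
- 2024: everyone was cosplaying Llama2
- 2025: the theme was "how to train without blowing up"
- 2026: the theme is "how to handle long context"
Below is the standard template for an open-source LLM in 2026. If you are training your own model, you can copy it directly.
[7 things that have converged at the architecture level]
1) Layer norm moved out of the residual stream (pre-norm). The original Transformer put LN inside the residual path; almost all modern models move it outside. Reason: keep your residual stream clean, so backpropagated gradients are more stable.
2) RMS Norm replaces LayerNorm. LayerNorm's mean subtraction and added bias do not really help. Dropping them saves only 0.17% of FLOPs but up to 25% of runtime (the bottleneck is data movement; compute is secondary).
3) All bias terms removed. Same reasoning as RMS Norm: it saves memory movement at the system level.
4) SwiGLU or GeGLU as the activation function. Almost all modern models use a gated linear unit. The Llama family / Qwen / Mistral use SwiGLU; the Google family (Gemma / T5) uses GeGLU. The difference is tiny, so either works.
5) RoPE for positional encoding. Basically unified since 2024. Principle: rotate each pair of dimensions by a position-dependent angle so that the inner product depends only on relative position.
6) Transformer blocks in series (not in parallel). GPT-J / Palm tried the parallel layout; it has now basically been abandoned. The serial implementation is so well optimized that the small system overhead saved by the parallel layout is not worth the loss in expressiveness.
7) Layer norm can be "sprinkled": add LN wherever training is unstable. It can go before attention, after it, or on both sides (double norm). Many modern models do this.
[5 hyperparameters that have converged]
1) Feedforward dimension / hidden dimension
- Non-GLU models: 4x
- GLU models: 8/3 ≈ 2.67x (GLU has one extra matrix, so the ratio shrinks to keep the total parameter count)
- Llama family: 3.5x
- T5 1.0 tried 64x; T5 1.1 later went back to the standard. Do not copy it.
2) Number of heads × head dimension ≈ hidden dimension. Almost all models follow this; T5 is one of the few exceptions.
3) Model aspect ratio (hidden / number of layers) ≈ 100. Too deep and pipeline parallelism gets hard; too wide and expressiveness is limited. The number 100 is the balance point between system constraints and expressiveness.
4) Vocab size. Monolingual models: around 30K (the early GPT-2 kind). Multilingual / general models: 100K-200K (GPT-4 / Llama 3 / Gemma are all in this range). Modern models are basically all the latter.
5) Weight decay is still widely used. But research has found that what it does in LLMs is really an optimizer intervention that lets you eventually converge to a deeper optimum; it has little to do with "preventing overfitting" as you might expect. So do not turn it off just because "a single epoch cannot overfit".
[Three life-saving stability tricks]
The worst fear when training a large model is the loss suddenly spiking mid-run, then NaN, and the whole run is lost. Modern models use three tricks to prevent this.
1) Z-loss. The normalizer of the output softmax blows up easily. Add a (log Z)² regularization term so that Z always stays close to 1. DCLM / Olmo both use it.
2) QK norm. Apply an LN to each of Q and K in attention before the matrix multiply, so the softmax input is always at unit scale. The multimodal community adopted it first; now all large models add it.
3) Logit soft cap (Google family only). Attention logits are capped with tanh. Gemma 2/3/4 all use it, but it costs a little performance, so use it with caution.
[Two new trends in attention]
1) GQA (Grouped Query Attention) is nearly universal. With original multi-head attention, the KV cache at inference time makes arithmetic intensity collapse to 1/h. GQA shares K and V but keeps multiple Qs: expressiveness is almost unchanged and inference cost is cut by 80%. Every large model meant for production deployment now uses GQA.
2) Alternating local + global attention. A new way to handle long context. Cohere Command A started it; now Llama 4 / Gemma 4 / Olmo 3 all use it. For example, 1 of every 4 layers is full attention and the other 3 are sliding window, looking only at nearby tokens. It is more stable than pure SSM and much cheaper than pure full attention. (Qwen 3.5 made a variant that replaces those 3 sliding window layers with SSM.)
One closing note: if you are training your own LLM, the set above is the "default config" for 2026. No need to reinvent it; just copy it. If you only want to understand those modeling_xxx.py files on GitHub, this is enough to stop being intimidated by the terminology.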
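A few of the items above reduce to short formulas, so here is a minimal pure-Python sketch of four of them: RMS Norm, the 8/3 GLU feedforward ratio, Z-loss, and the tanh logit soft cap. It is an illustration, not any particular model's implementation. The function names, the `eps` and `coeff` defaults, the `cap=50.0` default, and the `multiple_of` rounding are all assumptions chosen for the example.

```python
import math


def rms_norm(x, gain=None, eps=1e-6):
    """RMS Norm: divide by the root-mean-square; no mean subtraction, no bias."""
    if gain is None:
        gain = [1.0] * len(x)
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms for g, v in zip(gain, x)]


def glu_ffn_dim(d_model, multiple_of=1):
    """Feedforward width for a GLU block with the same parameter count as a
    4x non-GLU block: 2 * d * 4d = 3 * d * d_ff, hence d_ff = 8d / 3.
    The result is rounded up to a multiple of `multiple_of` (an assumption)."""
    d_ff = 8 * d_model / 3
    return multiple_of * math.ceil(d_ff / multiple_of)


def z_loss(logits, coeff=1e-4):
    """Z-loss: coeff * (log Z)^2, where Z = sum(exp(logit)).
    log Z is computed with the max-subtraction trick for numerical stability."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return coeff * log_z ** 2


def soft_cap(logit, cap=50.0):
    """Tanh soft cap: near-identity for small logits, bounded by +/- cap."""
    return cap * math.tanh(logit / cap)
```

For instance, `glu_ffn_dim(4096, multiple_of=256)` gives 11008, and `z_loss` is zero exactly when log Z = 0, that is, when Z = 1, which is the condition the regularizer pushes toward.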


Akash Mahajan retweeted

Jyotika Singh@JyotikaSingh_·3 Mar

Deadline extended! Submit your work at Grail-V to be at @CVPR. #CVPR2026 @amitpinaki @Hitesh_LPatel


Akash Mahajan@akashmjn·22 Dec

@reach_vb @EAccelerate_42 Congrats VB!! Really bending the timeline again here :) One way or another looking forward to seeing lots more from you. They’re lucky to have you!


Vaibhav (VB) Srivastav@reach_vb·22 Dec

@EAccelerate_42 🤝


Vaibhav (VB) Srivastav@reach_vb·22 Dec

Excited to share that I joined OpenAI last week! The pace of progress in models and tooling is still hard to internalise - capabilities that felt like demos are now real workflows developers rely on daily. I’m convinced that this is the right place to be at this time for me - stoked to be able to shape the future! The goal remains the same: building what helps builders ship 🚢 If you’re building with OpenAI, I’d love to hear what’s working and what’s painful. DMs open - Let’s get to work!


Akash Mahajan@akashmjn·20 Dec

@mhnt1580 Nice work!


We’re open-sourcing Perception Encoder Audiovisual (PE-AV), the technical engine that helps drive SAM Audio’s state-of-the-art audio separation. Built on our Perception Encoder model from earlier this year, PE-AV integrates audio with visual perception, achieving state-of-the-art results across a wide range of audio and video benchmarks. Its native multimodal support can assist people in everyday tasks, including sound detection and richer audio-visual scene understanding. 🔗 Read the paper: go.meta.me/e541b6 🔗 Download the code: go.meta.me/7fbef0


Wei-Ning Hsu@mhnt1580·19 Dec

Two big open-source audio releases in one week!!
*SAM Audio*: Isolate ANY audio - vocal, instrument, birds, siren, you name it. You can tell the model what to extract in multiple ways:
1. Visual crop - Use SAM3 to select the visual object
2. Time span - Tell the model *where* the sound is
3. Free-form text - "female speaker", "electric guitar"...
*PE-AV*: A trimodal contrastive audio-video-text encoder. You can now retrieve audio or video with BOTH audio and visual descriptions, such as "a dog barking in the distance when a postman delivers a package".
Shout out to the Audiobox team & @BowenShi20 and @apoorv2904, who led these two amazing projects!

AI at Meta@AIatMeta


Akash Mahajan@akashmjn·17 Dec

@atiorh @NVIDIAAI @nithinraok_ Nice work!


Atila@atiorh·17 Dec

Thanks to @NVIDIAAI/@nithinraok_ et al. for open-sourcing banger models back to back!


Atila@atiorh·17 Dec

I would have considered this alien technology last year! So excited to see the accelerating pace of improvement in speech and speaker recognition!

argmax@argmax

Introducing Real-time Transcription with Speakers!
- Step change in accuracy, surpassing top cloud APIs
- Faster than real-time on Mac and iPhone
- Still under 3 watts when all features are enabled
Available in Argmax SDK 2.0 for early access! Benchmarks and details in comments.
