Akash Mahajan

719 posts

@akashmjn

now 🎧; prev chatting with PDFs @ContextualAI; transcription @Azure Speech; @Stanford @atherenergy @iitmadras

Redwood City · Joined October 2013
800 Following · 636 Followers
Pinned Tweet
Akash Mahajan
Akash Mahajan@akashmjn·
Context engineering → better tools for agents (not just better retrieval/RAG).

Traditional retrieval works well on pointed questions over chunks/snippets. But it struggles with holistic cross-document questions, forcing you to stuff entire docs into context. We need `llms.txt` for documents 🤖🗺️.

Here's a demo inspired by talks and conversations at @aiDotEngineer: a "document navigator" AI agent (with Cursor + MCP) browsing a 250-page US Govt. document to answer: "summarize all parts of the document about US government debt". (Skim to 1:10 and 2:10 for just the demo.)

Under the hood:
- @ContextualAI /parse API + auto-generated `document_metadata.hierarchy` (as an `llms.txt`)
- Cursor's agent loop + navigation tools via MCP
- Verifiable attribution via interpretable tool call traces with rationales

This simple demo uses purely "navigation" tools on one 250-page doc. But combining them with solid traditional retrieval can scale context for agents to 10-100x more than can fit in context, while keeping all the agentic goodness intact 🚀.

The bigger picture: LLMs can read, summarize, and index various cross-document metadata. The emerging pattern: index-time compute → smarter tools → more capable agents.

Chat with the US Govt FY 24 Financial report in Cursor yourself (GitHub link below)
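The "navigation tools plus traceable rationales" idea in the pinned post can be sketched in a few lines. This is a minimal illustration, not the demo's actual code: the `Section` shape, the `DocumentNavigator` class, and the tool names `outline` and `read_section` are all hypothetical, and the sample document is invented. It does not reproduce the @ContextualAI /parse output schema or any MCP wiring.

```python
from dataclasses import dataclass, field


@dataclass
class Section:
    """One node of a document hierarchy (hypothetical shape, not a real API schema)."""
    section_id: str
    title: str
    pages: tuple          # (first_page, last_page)
    text: str = ""
    children: list = field(default_factory=list)


class DocumentNavigator:
    """Navigation tools over a parsed document, recording every call with its rationale."""

    def __init__(self, root):
        self.root = root
        self.trace = []      # list of (tool, argument, rationale) tuples
        self._by_id = {}
        self._index(root)

    def _index(self, section):
        # Build a flat lookup so sections can be fetched by id.
        self._by_id[section.section_id] = section
        for child in section.children:
            self._index(child)

    def outline(self, rationale):
        """Return the hierarchy as an llms.txt-style indented outline."""
        self.trace.append(("outline", None, rationale))
        lines = []

        def walk(section, depth):
            first, last = section.pages
            indent = "  " * depth
            lines.append(f"{indent}- [{section.section_id}] {section.title} (pp. {first}-{last})")
            for child in section.children:
                walk(child, depth + 1)

        walk(self.root, 0)
        return "\n".join(lines)

    def read_section(self, section_id, rationale):
        """Return the text of one section instead of the whole document."""
        self.trace.append(("read_section", section_id, rationale))
        return self._by_id[section_id].text


# Placeholder document: titles, page ranges, and text are invented for illustration.
report = Section("0", "Financial Report", (1, 250), children=[
    Section("1", "Overview", (1, 40), text="Overview text."),
    Section("2", "Financial Statements", (41, 250), children=[
        Section("2.1", "Debt", (60, 75), text="Debt section text."),
    ]),
])

nav = DocumentNavigator(report)
print(nav.outline(rationale="Find sections that mention debt"))
print(nav.read_section("2.1", rationale="Title matches the question"))
print(nav.trace)
```

The agent only ever sees the outline and the sections it asks for, and `trace` gives a readable record of which parts of the document each answer drew on.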
Akash Mahajan retweeted
tmuxvim
tmuxvim@tmuxvim·
I put a prompt injection into my LinkedIn bio and recruiters are messaging me in Old English and calling me Lord.
Stanford AI+Biomedicine Seminar
Stanford AI+Biomedicine Seminar@Stanford_AI_Bio·
We are excited to welcome @suragnair this Tuesday to present CompBioBench: A benchmark of 100 diverse tasks for evaluating agentic systems in computational biology! 📍2:30pm Tuesday May 12 | CoDa E160 | Stanford and Zoom
Akash Mahajan retweeted
DailyPapers
DailyPapers@HuggingPapers·
MinerU-Diffusion: a 2.5B diffusion-based OCR model that replaces slow autoregressive decoding with parallel block-wise diffusion, achieving up to 3.2x faster inference while improving robustness on complex documents with tables, formulas, and layouts.
Akash Mahajan
Akash Mahajan@akashmjn·
Voice is a great input modality, but very limited as output. For this reason, voice product demos so far (even OpenAI's real-time voice last week) feel a little stuck in awkward VR-like territory. When we see SNAPPY voice input PLUS UI changes on screen, that'll be a takeoff moment. Thumbing away at your devices will then feel quite primitive.
Andrej Karpathy@karpathy

This works really well btw, at the end of your query ask your LLM to "structure your response as HTML", then view the generated file in your browser. I've also had some success asking the LLM to present its output as slideshows, etc.

More generally, imo audio is the human-preferred input to AIs but vision (images/animations/video) is the preferred output from them. Around a ~third of our brains are a massively parallel processor dedicated to vision, it is the 10-lane superhighway of information into brain. As AI improves, I think we'll see a progression that takes advantage:

1) raw text (hard/effortful to read)
2) markdown (bold, italic, headings, tables, a bit easier on the eyes) <-- current default
3) HTML (still procedural with underlying code, but a lot more flexibility on the graphics, layout, even interactivity) <-- early but forming new good default
...4,5,6,...
n) interactive neural videos/simulations

Imo the extrapolation (though the technology doesn't exist just yet) ends in some kind of interactive videos generated directly by a diffusion neural net. Many open questions as to how exact/procedural "Software 1.0" artifacts (e.g. interactive simulations) may be woven together with neural artifacts (diffusion grids), but generally something in the direction of the recently viral x.com/zan2434/status…

There are also improvements necessary and pending at the input. Audio nor text nor video alone are not enough, e.g. I feel a need to point/gesture to things on the screen, similar to all the things you would do with a person physically next to you and your computer screen.

TLDR The input/output mind meld between humans and AIs is ongoing and there is a lot of work to do and significant progress to be made, way before jumping all the way into neuralink-esque BCIs and all that. For what's worth exploring at the current stage, hot tip try ask for HTML.

Akash Mahajan retweeted
Andrej Karpathy
Andrej Karpathy@karpathy·
This works really well btw, at the end of your query ask your LLM to "structure your response as HTML", then view the generated file in your browser. I've also had some success asking the LLM to present its output as slideshows, etc.

More generally, imo audio is the human-preferred input to AIs but vision (images/animations/video) is the preferred output from them. Around a ~third of our brains are a massively parallel processor dedicated to vision, it is the 10-lane superhighway of information into brain. As AI improves, I think we'll see a progression that takes advantage:

1) raw text (hard/effortful to read)
2) markdown (bold, italic, headings, tables, a bit easier on the eyes) <-- current default
3) HTML (still procedural with underlying code, but a lot more flexibility on the graphics, layout, even interactivity) <-- early but forming new good default
...4,5,6,...
n) interactive neural videos/simulations

Imo the extrapolation (though the technology doesn't exist just yet) ends in some kind of interactive videos generated directly by a diffusion neural net. Many open questions as to how exact/procedural "Software 1.0" artifacts (e.g. interactive simulations) may be woven together with neural artifacts (diffusion grids), but generally something in the direction of the recently viral x.com/zan2434/status…

There are also improvements necessary and pending at the input. Audio nor text nor video alone are not enough, e.g. I feel a need to point/gesture to things on the screen, similar to all the things you would do with a person physically next to you and your computer screen.

TLDR The input/output mind meld between humans and AIs is ongoing and there is a lot of work to do and significant progress to be made, way before jumping all the way into neuralink-esque BCIs and all that. For what's worth exploring at the current stage, hot tip try ask for HTML.
Thariq@trq212

x.com/i/article/2052…

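The "ask for HTML, then view the file in your browser" tip from the post above can be wired up with the Python standard library alone. A rough sketch, with assumptions flagged: `build_prompt`, `save_html_response`, and `view_in_browser` are hypothetical helper names, and the LLM call itself is left out because it depends on whichever client you use.

```python
import tempfile
import webbrowser
from pathlib import Path

# The instruction quoted in the post, appended to the end of the query.
HTML_SUFFIX = "Structure your response as HTML."


def build_prompt(query):
    """Append the HTML instruction to the end of the user's query."""
    return f"{query}\n\n{HTML_SUFFIX}"


def save_html_response(html, directory=None):
    """Write the model's HTML reply to response.html and return its path."""
    # Falls back to a fresh temporary directory when none is given.
    directory = Path(directory) if directory else Path(tempfile.mkdtemp())
    path = directory / "response.html"
    path.write_text(html, encoding="utf-8")
    return path


def view_in_browser(path):
    """Open the saved file in the default browser; returns True on success."""
    return webbrowser.open(path.resolve().as_uri())
```

A typical flow would be `path = save_html_response(reply_text)` followed by `view_in_browser(path)`, where `reply_text` is whatever string your LLM client returned for `build_prompt(query)`.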
Akash Mahajan retweeted
Naveen Rao
Naveen Rao@NaveenGRao·
Great post to understand some of the subtle changes to LLMs over the last couple of years. It really is evolution rather than revolution. But each evolution has high leverage
Jason Zhu@GoSailGlobal

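In Stanford CS336, Tatsu gave a lecture on LLM architecture that took apart every mainstream LLM of the past 3 years to find their common template.

The conclusion is striking: 90% of architecture choices have converged. Pick any open-source large model and it is almost identical to the others along these dimensions.

In the lecturer's words:
- In 2024 everyone was cosplaying Llama2
- The theme of 2025 was "how to train without blowing up"
- The theme of 2026 is "how to handle long context"

Below is the standard template for open-source LLMs in 2026. If you train your own model, you can copy it directly.

[Architecture: 7 things that have converged]

1) Layer norm moved out of the residual stream (pre-norm). The original Transformer put LN inside the residual path; almost all modern models move it outside. Reason: keep your residual stream clean, so gradients backpropagate more stably.

2) RMSNorm replaces LayerNorm. LayerNorm's mean subtraction and bias add do not actually help much. Dropping them saves only 0.17% of FLOPs but up to 25% of runtime (the bottleneck is data movement; compute is secondary).

3) All bias terms removed. Same reasoning as RMSNorm: it saves memory movement at the systems level.

4) Activation: SwiGLU or GeGLU. Almost all modern models use a gated linear unit. Llama family / Qwen / Mistral use SwiGLU; Google models (Gemma / T5) use GeGLU. The difference is tiny; either works.

5) Positional encoding: RoPE. Essentially unified since 2024. Principle: rotate each pair of dimensions by an angle that depends on position, so the inner product depends only on relative position.

6) Transformer blocks are serial (not parallel). GPT-J / PaLM tried parallel blocks; that has mostly been abandoned. Serial implementations are so well optimized that the small systems savings from parallel blocks are not worth the loss in expressiveness.

7) Layer norm can be "sprinkled": add LN wherever training is unstable. It can go before attention, after it, or on both sides (double norm). Many modern models do this.

[Hyperparameters: 5 numbers that have converged]

1) Feedforward dim / hidden dim
- Non-GLU models: 4x
- GLU models: 8/3 ≈ 2.67x (GLU has an extra matrix, so this keeps the total parameter count the same)
- Llama family: 3.5x
- T5 1.0 tried 64x; T5 1.1 later went back to standard. Don't copy it.

2) Number of heads × head dim ≈ hidden dim. Almost all models follow this; T5 is one of the few exceptions.

3) Model aspect ratio (hidden / number of layers) ≈ 100. Too deep and pipeline parallelism gets hard; too wide and expressiveness is limited. 100 is the balance point between systems constraints and expressiveness.

4) Vocab size. Monolingual models: around 30K (the early GPT-2 kind). Multilingual / general models: 100K-200K (GPT-4 / Llama 3 / Gemma are all in this range). Modern models are basically all the latter.

5) Weight decay is still widely used. But research finds that what it does in LLMs is really an optimizer intervention that lets you eventually converge to a deeper optimum; it has little to do with the "prevent overfitting" you might expect. So don't turn it off just because "a single epoch can't overfit".

[Stability: three life-saving tricks]

The biggest fear when training a large model is the loss suddenly spiking mid-run and then going NaN, wiping out the run. Modern models use three tricks to prevent this.

1) Z-loss. The output softmax normalizer tends to blow up. Add a (log Z)² regularization term so that Z stays close to 1. DCLM / Olmo both use it.

2) QK norm. Apply an LN to each of Q and K before the matrix multiply in attention, so the softmax input is always at unit scale. The multimodal community used it first; now all large models add it.

3) Logit soft cap (Google models only). Attention logits are capped with tanh. Gemma 2/3/4 all use it, but it costs a little performance. Use with caution.

[Attention: two new trends]

1) GQA (Grouped Query Attention) is nearly universal. With original multi-head attention, the KV cache at inference makes arithmetic intensity collapse to 1/h. GQA shares K and V but keeps multiple Qs, with almost no loss of expressiveness and inference cost cut by 80%. Every large model meant for production deployment now uses GQA.

2) Alternating local + global attention. A new way to handle long context. Cohere Command A started it; now Llama 4 / Gemma 4 / Olmo 3 all use it. For example, 1 of every 4 layers is full attention and the other 3 are sliding window, which only looks at nearby tokens. More stable than pure SSM and much cheaper than pure full attention. (Qwen 3.5 made a variant that swaps those 3 sliding-window layers for SSM.)

One closing line: if you are training your own LLM, the set above is the 2026 "default config". No need to reinvent it; just copy it. If you only want to understand those modeling_xxx.py files on GitHub, this is enough that the jargon will no longer scare you.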

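A few items from the quoted template are easier to see as arithmetic. The sketch below uses plain Python lists rather than a tensor library, so it is illustrative only. Assumptions not stated in the post: the `eps` value, the z-loss coefficient default of `1e-4`, and placing the full-attention layer last in each group of four. All function names are made up for this example.

```python
import math


def rms_norm(x, gain, eps=1e-6):
    """RMSNorm: rescale by the root mean square; no mean subtraction, no bias."""
    mean_square = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(mean_square + eps)
    return [g * v * scale for v, g in zip(x, gain)]


def glu_ffn_width(hidden):
    """Feed-forward width that keeps a GLU block at the parameter count of a 4x non-GLU block.

    Non-GLU: 2 matrices of hidden x (4 * hidden)  -> 8 * hidden^2 parameters.
    GLU:     3 matrices of hidden x d_ff          -> d_ff = 8/3 * hidden.
    """
    return round(8 * hidden / 3)


def z_loss(logits, coeff=1e-4):
    """Penalty coeff * (log Z)^2, where Z is the softmax normalizer of the logits."""
    m = max(logits)  # subtract the max for a numerically stable log-sum-exp
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return coeff * log_z ** 2


def layer_pattern(n_layers, period=4):
    """Alternate attention types: one full-attention layer per `period` layers."""
    return ["global" if (i + 1) % period == 0 else "local" for i in range(n_layers)]


print(rms_norm([3.0, 4.0], [1.0, 1.0]))
print(glu_ffn_width(3072))
print(z_loss([2.0, 1.0, 0.5]))
print(layer_pattern(8))
```

Real implementations round the GLU width to a hardware-friendly multiple and apply these operations to batched tensors; the functions here only show the formulas the post refers to.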
Akash Mahajan
Akash Mahajan@akashmjn·
@reach_vb @EAccelerate_42 Congrats VB!! Really bending the timeline again here :) One way or another looking forward to seeing lots more from you. They’re lucky to have you!
Vaibhav (VB) Srivastav
Vaibhav (VB) Srivastav@reach_vb·
Excited to share that I joined OpenAI last week! The pace of progress in models and tooling is still hard to internalise - capabilities that felt like demos are now real workflows developers rely on daily. I’m convinced that this is the right place to be at this time for me - stoked to be able to shape the future! The goal remains the same: building what helps builders ship 🚢 If you’re building with OpenAI, I’d love to hear what’s working and what’s painful. DMs open - Let’s get to work!
Wei-Ning Hsu
Wei-Ning Hsu@mhnt1580·
Two big open-source audio releases in one week!!

*SAM Audio*: Isolate ANY audio - vocal, instrument, birds, siren, you name it. You can tell the model what to extract in multiple ways:
1. Visual crop - Use SAM3 to select the visual object
2. Time span - Tell the model *where* the sound is
3. Free-form text - "female speaker", "electric guitar"...

*PE-AV*: A trimodal contrastive audio-video-text encoder. You can now retrieve audio or video with BOTH audio and visual description, such as "a dog barking in the distance when a postman delivers a package"

Shout out to Audiobox team & @BowenShi20 and @apoorv2904 who led these two amazing projects!
AI at Meta@AIatMeta

We’re open-sourcing Perception Encoder Audiovisual (PE-AV), the technical engine that helps drive SAM Audio’s state-of-the-art audio separation. Built on our Perception Encoder model from earlier this year, PE-AV integrates audio with visual perception, achieving state-of-the-art results across a wide range of audio and video benchmarks. Its native multimodal support can assist people in everyday tasks, including sound detection and richer audio-visual scene understanding. 🔗 Read the paper: go.meta.me/e541b6 🔗 Download the code: go.meta.me/7fbef0

Arun Vinayak
Arun Vinayak@Arun_Vinayak_S·
I think India is done building only pragmatic “safe” companies. We need a lot more crazy ones. We’ll have grander failures and grander successes. Already seeing the ambition among early 20s founders. Glad to see the support system and capital following! There is something magical and pure about young founders building crazy stuff during college/right after it. You’re truly blue eyed then. Think anything is possible. And it is.
Hemant Mohapatra@MohapatraHemant

Today's a special day for @LightspeedIndia. Introducing INDIA ASCENDS'2026, a program purpose-built for India's youngest (<25yo), boldest cohort of world shapers & change makers. If you are one of them, put your headphones on, turn the volume to max, click on the video, and read on :)

Building something is hard but building something the world has never seen before is nigh impossible. There is this concept of not just building a kingdom, but building a kingdom at the edge of a precipice -- founders who want to go all the way to the edge of what’s possible, beyond which there is no land, there is no road, the compass stops working, and they look into the abyss, and say ‘yes, this is for me, this will be my life’s work’. These are the rarest of birds that take the plunge and know they’d fall before they fly, but when they fly, oh how glorious do they look.

We @lightspeedindia have been fortunate to partner with several of these founders. We met @PixxelSpace when the founders @awaisahmedna @kshitijgokul were just 22yo. We backed @Airbound_Aero when @TheRealNamzoo was just 17. We’ve backed many others doing their life’s work at the absolute cutting edge of what’s possible - @Arun_Vinayak_S of @ExponentEnergy, @pratykumar & @vivekrag of @SarvamAI, @devdutdalal & @XaviLaguarta at @MittiLabs and more. Beyond our portfolio, there are some amazing founders doing their life’s work - @PawanKChandana @SkyrootA , @sohamsankaran @PopVaxIndia , @khushhhi_ @AsperaAero, @nagokul @CynLr00, @adrnschm @sarlaaviation, @deepigoyal @lataerospace & @temple, @Manu_J_Nair @EtherealXTech & many more.

We need more of these founders coming out of India. Not just that, we need to fill the gap that exists in this market, which is in backing really young (<25yo) founders who are tinkering in school or college labs, or spending their weekends building, experimenting and failing fast, and are truly building globally competitive and de novo tech that, if it works, can have huge consequences in the world.

To that end, we are proud to launch INDIA ASCENDS'2026 -- our flagship yearly program for the most cracked young builders in the country doing incredible cutting-edge research in robotics, quantum, space, energy, AI, bio or more. Our program applications open today and we’ll select 12-15 of the best, boldest ideas that we think have the potential to shape the future. We’ll bring them all to BLR for a 2-day program. Each participant will get ~$100K in support from our partners @AnthropicAI @GroqInc , @googlecloud @awscloud and we’ll also select 3-4 winners who will get venture funded to build their dream starting from $200K all the way to $3M and almost $500K of non-dilutive credits & grants from our partners.

We look forward to seeing the boldest ideas you've been working on. Link to apply in the first comment:

Zachary Novack
Zachary Novack@zacknovack·
Nearly a full house for @NeurIPSConf AI4Music workshop at 8am!! Don’t miss it, we’ll be here all day in hall 27!
Christian Steinmetz
Christian Steinmetz@csteinmetz1·
Standing room only before 9am at the AI Music Workshop at NeurIPS. I guess AI music isn't a niche research topic anymore…