Mrinal Mathur

8.6K posts


@bobthemaster

Research Engineer @Google | @BytedanceTalk | @Amazon | @Apple | @CenterTrends | @ARM

United States · Joined February 2010
679 Following · 357 Followers
Mrinal Mathur retweeted
AlphaSignal AI @AlphaSignalAI
Researchers just taught AI to think 12x faster without using words. Reasoning chains are powerful but expensive: every token a model "thinks" costs time and money. A new paper called Abstract Chain-of-Thought proposes a fix. Instead of reasoning in full sentences, the model invents its own compressed language, using reserved placeholder tokens as shorthand for entire thoughts. The result is up to 11.6x fewer reasoning tokens with comparable accuracy. Training happens in two stages:
1. A warm-up loop teaches the model what these abstract tokens mean, using a teacher's verbal reasoning.
2. Reinforcement learning then refines how the tokens are sequenced for better answers.
Tested on math, instruction-following, and multi-hop benchmarks, performance held up against verbal chains. Even stranger, the abstract vocabulary started forming patterns similar to real language: frequent tokens dominated the way common words do.
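The compression idea is easy to picture with a toy sketch. The snippet below is illustrative only, not the paper's code: the <ABS_i> token names, the hash-based lookup standing in for the learned warm-up mapping, and the example chain are all assumptions.

```python
# Toy sketch of abstract reasoning tokens (illustrative only, not the paper's code).
# Assumption: <ABS_i> names and the hash-based mapping stand in for what the
# warm-up stage would actually learn from a teacher's verbal chains.

VERBAL_CHAIN = [
    "First, factor 36 into 4 * 9.",
    "Take the square root of each factor: 2 and 3.",
    "Multiply them to get sqrt(36) = 6.",
]

# Reserved placeholder vocabulary added to the model's tokenizer.
ABSTRACT_VOCAB = [f"<ABS_{i}>" for i in range(512)]

def compress(chain):
    """Stand-in for the warm-up stage: each verbal step collapses to one token."""
    return [ABSTRACT_VOCAB[hash(step) % len(ABSTRACT_VOCAB)] for step in chain]

verbal_tokens = sum(len(step.split()) for step in VERBAL_CHAIN)
abstract_tokens = len(compress(VERBAL_CHAIN))
print(f"verbal: {verbal_tokens} tokens, abstract: {abstract_tokens} tokens, "
      f"{verbal_tokens / abstract_tokens:.1f}x fewer")
```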
Mrinal Mathur retweeted
Jiyeon Kim @jiyeonkimd
📢 Diffusion-based LLM paper accepted to #ICML2026 🥳 Diffusion LLMs promise parallel & bidirectional generation, but fully non-autoregressive decoding still struggles in practice. We analyzed why NAR fails, and show how minimal interventions can substantially improve it!
Mrinal Mathur retweeted
𝗿𝗮𝗺𝗮𝗸𝗿𝘂𝘀𝗵𝗻𝗮— 𝗲/𝗮𝗰𝗰
Stanford's latest seminar is a deep dive into the evolution of world modeling in AI. It focuses on the shift in world modeling from traditional reconstruction methods toward latent-space prediction. Covers topics like:
- Introduction to JEPA & World Models
- Causal JEPA
- LOWER Model
- Practical Applications & Planning
- Future Outlook
Mrinal Mathur retweeted
alphaXiv @askalphaxiv
“The Recurrent Transformer: Greater Effective Depth and Efficient Decoding” Transformers are great at parallel processing, but they’re shallow through time, as each layer only lets tokens interact once. This paper changes that by storing keys/values from each layer’s output, not input, so later tokens can read already-updated states. The result is more effective depth at the same decode cost, better C4 pretraining performance than parameter-matched Transformers, and similar quality with fewer layers, reducing KV cache and inference traffic.
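A rough way to see the mechanical change is a toy decode loop where the KV cache stores projections of each block's output rather than its input. The single-head attention, dimensions, and class name below are my simplifications for illustration, not the paper's implementation.

```python
# Sketch of "cache K/V from the layer OUTPUT" (illustrative; not the paper's code).
import torch
import torch.nn as nn

class RecurrentKVBlock(nn.Module):
    def __init__(self, d=64):
        super().__init__()
        self.q, self.k, self.v = nn.Linear(d, d), nn.Linear(d, d), nn.Linear(d, d)
        self.ff = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))

    def step(self, x_t, k_cache, v_cache):
        # Attend over previous tokens' cached states plus the current input.
        k = torch.cat([k_cache, self.k(x_t)], dim=0)
        v = torch.cat([v_cache, self.v(x_t)], dim=0)
        attn = torch.softmax((self.q(x_t) @ k.T) / k.shape[-1] ** 0.5, dim=-1)
        out = x_t + attn @ v
        out = out + self.ff(out)
        # A standard decoder would cache self.k(x_t) / self.v(x_t), i.e. the block INPUT.
        # Here the cache holds projections of the block OUTPUT, so later tokens read
        # this token's already-updated state, at the same cache size per token.
        return out, torch.cat([k_cache, self.k(out)], dim=0), torch.cat([v_cache, self.v(out)], dim=0)

block = RecurrentKVBlock()
k_cache = v_cache = torch.empty(0, 64)
for _ in range(5):                       # toy decode loop over 5 tokens
    out, k_cache, v_cache = block.step(torch.randn(1, 64), k_cache, v_cache)
print(k_cache.shape)                     # torch.Size([5, 64]): one cached entry per token
```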
Mrinal Mathur retweeted
Elias Al @iam_elias1
MIT just made every AI company's billion dollar bet look embarrassing. They solved AI memory. Not by building a bigger brain. By teaching it how to read. The paper dropped on December 31, 2025. Three MIT CSAIL researchers. One idea so obvious it hurts. And a result that makes five years of context window arms racing look like the wrong war entirely.

Here is the problem nobody solved. Every AI model on the planet has a hard ceiling. A context window. The maximum amount of text it can hold in working memory at once. Cross that line and something ugly happens, something researchers have a clinical name for: context rot. The more you pack into an AI's context, the worse it performs on everything already inside it. Facts blur. Information buried in the middle vanishes. The model does not become more capable as you feed it more. It becomes more confused. You give it your entire codebase and it forgets what it read three files ago. You hand it a 500-page legal document and it loses the clause from page 12 by the time it reaches page 400.

So the industry built a workaround. RAG. Retrieval Augmented Generation. Chop the document into chunks. Store them in a database. Retrieve the relevant ones when needed. It was always a compromise dressed up as a solution. The retriever guesses which chunks matter before the AI has read anything. If it guesses wrong (and it does, constantly) the AI never sees the information it needed. The act of chunking destroys every relationship between distant paragraphs. The full picture gets shredded into fragments that the AI then tries to reassemble blindfolded. Two bad options. One broken industry. Three MIT researchers and a deadline of December 31st.

Here is what they built. Stop putting the document in the AI's memory at all. That is the entire idea. That is the breakthrough. Store the document as a Python variable outside the AI's context window entirely. Tell the AI the variable exists and how big it is. Then get out of the way. When you ask a question, the AI does not try to remember anything. It behaves like a human expert dropped into a library with a computer. It writes code. It searches the document with regular expressions. It slices to the exact section it needs. It scans the structure. It navigates. It finds precisely what is relevant and pulls only that into its active window.

Then it does something that makes this recursive. When the AI finds relevant material, it spawns smaller sub-AI instances to read and analyze those sections in parallel. Each one focused. Each one fast. Each one reporting back. The root AI synthesizes everything and produces an answer. No summarization. No deletion. No information loss. No decay. Every byte of the original document remains intact, accessible, and queryable for as long as you need it.

Now here are the numbers. Standard frontier models on the hardest long-context reasoning benchmarks: scores near zero. Complete collapse. GPT-5 on a benchmark requiring it to track complex code history beyond 75,000 tokens could not solve even 10% of problems. RLMs on the same benchmarks: solved them. Dramatically. Double-digit percentage gains over every alternative approach. Successfully handling inputs up to 10 million tokens, 100 times beyond a model's native context window. Cost per query: comparable to or cheaper than standard massive context calls. Read that again. One hundred times the context. Better answers. Same price.

The timeline of the arms race makes this sting harder. GPT-3 in 2020: 4,000 tokens. GPT-4: 32,000. Claude 3: 200,000. Gemini: 1 million. Gemini 2: 2 million. Every generation, every company, billions of dollars spent, all betting on the same assumption: more context equals better performance. MIT just proved that assumption was wrong the entire time. Not slightly wrong. Fundamentally wrong. The entire premise of the last five years of context window research, that the solution to AI memory was a bigger window, was the wrong answer to the wrong question. The right question was never how much you can force an AI to hold in its head. It was whether you could teach an AI to know where to look.

A human expert handed a 10,000-page archive does not read all 10,000 pages before answering your question. They navigate. They search. They find the relevant section, read it deeply, and synthesize the answer. RLMs are the first AI architecture that works the same way.

The code is open source. On GitHub right now. Free. No license fees. No API costs. Drop it in as a replacement for your existing LLM API calls and your application does not even notice the difference, except that it suddenly works on inputs it used to fail on entirely.

Prime Intellect, one of the leading AI research labs in the space, has already called RLMs a major research focus and described what comes next: teaching models to manage their own context through reinforcement learning, enabling agents to solve tasks spanning not hours, but weeks and months.

The context window wars are over. MIT won them by walking away from the battlefield.

Source: Zhang, Kraska, Khattab · MIT CSAIL · arXiv:2512.24601
Paper: arxiv.org/abs/2512.24601
GitHub: github.com/alexzhang13/rlm
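Stripped of the hype, the pattern the thread describes is simple to sketch: the document lives in a plain variable outside the prompt, only a short description of it enters the context, and the model answers by emitting code (regex searches, slices) against that variable. Everything below, including the function names, the toy document, and the hard-coded query standing in for model-written code, is my own illustration, not the released RLM implementation.

```python
# Minimal sketch of the "document as an external variable" pattern (illustrative;
# the names and the hard-coded query are assumptions, not the RLM repo's API).
import re

# Lives OUTSIDE the model's context window; never pasted into the prompt.
DOCUMENT = ("Section 12. Termination clause: either party may exit with 30 days "
            "notice. " + "Boilerplate text. " * 5000)

def describe(doc: str) -> str:
    """Only this short description goes into the prompt."""
    return f"A variable `doc` holds {len(doc):,} characters. Query it with regex or slices."

def regex_search(pattern: str, window: int = 200) -> list[str]:
    """Tool the model can call: return small slices around each match."""
    return [DOCUMENT[max(m.start() - window, 0): m.end() + window]
            for m in re.finditer(pattern, DOCUMENT, flags=re.IGNORECASE)]

print(describe(DOCUMENT))
snippets = regex_search(r"termination clause", window=120)
print(len(snippets), "match(es); first snippet:", snippets[0][:80], "...")
# In a real system the LLM writes the regex_search call itself, each snippet can be
# handed to a sub-call for deep reading, and the root call synthesizes the answers.
```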
Reza Bayat @reza_byt
📄 New Paper Alert! ✨ 🚀Mixture of Recursions (MoR): Smaller models • Higher accuracy • Greater throughput Across 135 M–1.7 B params, MoR carves a new Pareto frontier: equal training FLOPs yet lower perplexity, higher few‑shot accuracy, and more than 2x throughput. Let’s break it down! 🧵👇
Mrinal Mathur retweeted
Gowthami @gowthami_s
In this work, the authors first systematically study what makes latent tokens (or representations) easy for a diffusion model to learn (I believe they coined the term "diffusibility" to convey the ease of learning). They define a frequency profile, the normalized amplitude of a representation's DCT frequencies, and argue that if this profile is low, more high-frequency information is captured in the representation, which is not very useful for downstream generation tasks. This can be combated by aligning the spectral properties of the RGB and latent spaces, and it turns out that training with scale equivariance helps achieve that. Day 4 #minisummaries #100daysofgenai
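Here is a rough sketch of how such a frequency profile might be computed for one latent channel, assuming it means the normalized amplitude over 2D DCT frequencies; the random latent, the (u+v) binning, and the energy-fraction readout are my simplifications, not the paper's exact definition.

```python
# Rough sketch of a "frequency profile" for one latent channel (my reading of the
# idea; the random latent and (u+v) binning are simplifications, not the paper's).
import numpy as np
from scipy.fft import dctn

latent = np.random.randn(32, 32)                  # stand-in for one latent channel
amp = np.abs(dctn(latent, norm="ortho"))
amp /= amp.sum()                                  # normalized DCT amplitude spectrum

# Bin coefficients by (u + v) as a crude low -> high frequency ordering.
u, v = np.indices(amp.shape)
profile = np.bincount((u + v).ravel(), weights=amp.ravel())

low = profile[: len(profile) // 4].sum()
print(f"fraction of amplitude in the lowest quarter of frequencies: {low:.2f}")
# A latent whose profile concentrates in low frequencies is, per the post,
# easier for the diffusion model to learn ("more diffusible").
```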
Mrinal Mathur retweeted
Francesco Bertolotti @f14bertolotti
In this work the authors provide both empirical and theoretical evidence on why MTP (multi-token prediction) training is superior to NTP (next-token prediction). The theoretical section is pretty dense, but it seems to boil down to the optimization process being able to look at the solution. Cool work! 🔗arxiv.org/abs/2604.11912
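For concreteness, a back-of-the-envelope sketch of the two objectives on toy tensors is below; the head count, shapes, and random data are my own choices and are only meant to show where the extra supervision in MTP comes from, not the paper's setup.

```python
# Toy comparison of next-token prediction (NTP) and multi-token prediction (MTP)
# losses (illustrative shapes and random data; not the paper's experiments).
import torch
import torch.nn.functional as F

vocab, d, k = 100, 32, 4                       # k = how many future tokens MTP predicts
hidden = torch.randn(8, d)                     # final hidden states for 8 positions
targets = torch.randint(0, vocab, (8 + k,))    # token ids, with k extra future tokens

ntp_head = torch.nn.Linear(d, vocab)
mtp_heads = torch.nn.ModuleList(torch.nn.Linear(d, vocab) for _ in range(k))

# NTP: position t predicts token t+1 only.
ntp_loss = F.cross_entropy(ntp_head(hidden), targets[1:9])

# MTP: position t is additionally trained to predict tokens t+2 ... t+k,
# so each optimization step "sees" more of the solution.
mtp_loss = sum(
    F.cross_entropy(head(hidden), targets[i + 1 : i + 9])
    for i, head in enumerate(mtp_heads)
) / k

print(float(ntp_loss), float(mtp_loss))
```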
Mrinal Mathur retweeted
Rosinality @rosinality
A looped transformer would have cyclic trajectories, which means that the output of a specific block would be similar to that of the same block in different iterations. But it also depends on architectural choice, especially input injection.
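One can poke at that claim with a tiny looped block; the construction below (loop count, MLP block, cosine similarity between consecutive iterates) is my own toy, not any specific paper's architecture, and is only meant to show where input injection enters.

```python
# Toy looped block with optional input injection (my own minimal construction).
import torch
import torch.nn as nn

class LoopedBlock(nn.Module):
    def __init__(self, d=64, inject_input=True):
        super().__init__()
        self.block = nn.Sequential(nn.Linear(d, d), nn.GELU(), nn.Linear(d, d))
        self.inject_input = inject_input

    def forward(self, x, n_loops=4):
        h, states = x, []
        for _ in range(n_loops):
            inp = h + x if self.inject_input else h   # input injection re-adds x each loop
            h = h + self.block(inp)
            states.append(h)
        return h, states

x = torch.randn(1, 64)
_, states = LoopedBlock(inject_input=True)(x)
# How similar consecutive iterates are is one crude probe of how "cyclic"
# the trajectory through the shared block looks.
print([round(torch.cosine_similarity(states[i], states[i + 1]).item(), 3)
       for i in range(len(states) - 1)])
```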
Mrinal Mathur retweeted
Ksenia_TuringPost @TheTuringPost
A must-read survey: "The Latent Space: Foundation, Evolution, Mechanism, Ability, and Outlook"
Shows how models are moving beyond tokens into continuous internal representations, covering:
- What latent space is (vs. text and visual spaces)
- Architecture and mechanisms
- Why it helps: less redundancy, no token limits, faster reasoning
- Evolution: early ideas → large-scale latent systems
- Abilities: reasoning, planning, perception, memory, collaboration, etc.
- Role in next-gen intelligence
Mrinal Mathur retweeted
Google for Developers @googledevs
Gemma 4 is here! Our most intelligent open models to date are built on the same world-class research and tech as Gemini 3, and are sized to run and fine-tune efficiently on local hardware. Check out what @GoogleGemma 4 brings to devs:
💎 Advanced reasoning: deep logic tasks, complex multi-step planning, and beyond
💎 Longer context: seamlessly analyze entire codebases with context windows of 128K tokens for our edge models and 256K tokens for our largest models
💎 Vision and audio: rich, multimodal interactions out of the box
💎 140+ languages: trained on 140+ languages
💎 Apache 2.0 license: industry-standard open-source license
Mrinal Mathur retweeted
Google @Google
We just released Gemma 4 — our most intelligent open models to date. Built from the same world-class research as Gemini 3, Gemma 4 brings breakthrough intelligence directly to your own hardware for advanced reasoning and agentic workflows. Released under a commercially permissive Apache 2.0 license so anyone can build powerful AI tools. 🧵↓
Mrinal Mathur retweeted
Ksenia_TuringPost @TheTuringPost
2 methods that help Transformers to retrieve from depth (layers):
▪️ Attention Residuals (AttnRes) – makes the residual stream depth-aware, letting each layer use information from multiple earlier layers.
▪️ Mixture-of-Depths Attention (MoDA) – makes the attention heads depth-aware: attention can "look" not just at other tokens, but also at different layers.
Never before has depth in Transformers been treated so explicitly as a retrieval problem. Here is how these two methods work: turingpost.com/p/transformers…
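A minimal sketch of the first idea as I read it: the input to each layer is a learned mixture over all earlier layers' outputs instead of just the previous one. The softmax gating, layer sizes, and class name are my assumptions, not the article's exact AttnRes formulation.

```python
# Minimal depth-aware residual stream in the spirit of the "AttnRes" description
# (the softmax mixing and block design are my assumptions, not the article's).
import torch
import torch.nn as nn

class DepthAwareResidual(nn.Module):
    def __init__(self, d=64, n_layers=6):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.Sequential(nn.Linear(d, d), nn.GELU(), nn.Linear(d, d))
            for _ in range(n_layers)
        )
        # One learned weight per (layer, earlier-layer) pair.
        self.mix = nn.Parameter(torch.zeros(n_layers, n_layers))

    def forward(self, x):
        outputs = [x]                                 # outputs[j] = state after layer j
        for i, block in enumerate(self.blocks):
            w = torch.softmax(self.mix[i, : len(outputs)], dim=0)
            # The residual input draws on ALL earlier layers, not only the previous one.
            h = sum(wj * oj for wj, oj in zip(w, outputs))
            outputs.append(h + block(h))
        return outputs[-1]

print(DepthAwareResidual()(torch.randn(1, 64)).shape)  # torch.Size([1, 64])
```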
Mrinal Mathur retweeted
alphaXiv @askalphaxiv
"Reaching Beyond the Mode: RL for Distributional Reasoning in Language Models" Instead of standard RL post-training collapsing an LLM toward one dominant answer, this paper shows you can train it to produce a set of plausible answers in a single pass. This is important because many real-world tasks involve ambiguity or multiple valid solutions, like diagnosis, incomplete-information QA, and coding, and this approach improves diversity & convergence while using fewer tokens than repeated sampling.
Mrinal Mathur retweeted
alphaXiv @askalphaxiv
"Why Does Self-Distillation (Sometimes) Degrade the Reasoning Capability of LLMs?" Self-distillation can make LLMs look smarter by producing shorter, more confident reasoning traces, but in math it often takes out the model's uncertainty and self-correction signals. This can potentially badly hurt out-of-domain reasoning, so this paper suggests that better reasoning is not just about compression, but preserving useful "doubt".
Mrinal Mathur retweeted
Google for Developers @googledevs
Gemini 3.1 Flash Live delivers quality updates in latency, reliability, and natural-sounding dialogue, so developers can build AI agents that process information and respond in real time. Check out the model improvements:
✅ Higher task completion
👂 Better instruction following
🗣 More natural and low-latency dialogue