

Tsendsuren
@TsendeeMTS
Research scientist at Google DeepMind | previously at Microsoft Research and Postdoc at UMass. Views are my own. Most tweets in Mongolian 🇲🇳.


Introducing 𝑨𝒕𝒕𝒆𝒏𝒕𝒊𝒐𝒏 𝑹𝒆𝒔𝒊𝒅𝒖𝒂𝒍𝒔: Rethinking depth-wise aggregation.

Residual connections have long relied on fixed, uniform accumulation. Inspired by the duality of time and depth, we introduce Attention Residuals, replacing standard depth-wise recurrence with learned, input-dependent attention over preceding layers.

🔹 Enables networks to selectively retrieve past representations, naturally mitigating dilution and hidden-state growth.
🔹 Introduces Block AttnRes, partitioning layers into compressed blocks to make cross-layer attention practical at scale.
🔹 Serves as an efficient drop-in replacement, demonstrating a 1.25x compute advantage with negligible (<2%) inference latency overhead.
🔹 Validated on the Kimi Linear architecture (48B total, 3B activated parameters), delivering consistent downstream performance gains.

🔗 Full report: github.com/MoonshotAI/Att…
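A minimal sketch of the core idea as described above (not the MoonshotAI code; the Block AttnRes compression is omitted, and all names and shapes are my assumptions): each layer's block output queries the stack of preceding hidden states instead of accumulating them uniformly.

import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionResidual(nn.Module):
    """Learned, input-dependent aggregation over preceding layers (sketch)."""
    def __init__(self, d_model):
        super().__init__()
        self.q = nn.Linear(d_model, d_model)  # query from the current block output
        self.k = nn.Linear(d_model, d_model)  # keys over past layer states

    def forward(self, f_out, history):
        # f_out:   (batch, seq, d) output of the current layer's block f_l
        # history: (batch, seq, L, d) hidden states of all L preceding layers
        q = self.q(f_out).unsqueeze(-2)                               # (b, s, 1, d)
        k = self.k(history)                                           # (b, s, L, d)
        w = F.softmax((q * k).sum(-1) / k.shape[-1] ** 0.5, dim=-1)   # (b, s, L)
        retrieved = (w.unsqueeze(-1) * history).sum(-2)               # (b, s, d)
        return f_out + retrieved  # learned retrieval instead of a fixed sum

If the attention weights concentrate entirely on the immediately preceding layer, this reduces to the ordinary residual connection, so the standard network sits inside this model's hypothesis space.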



Say hi to Exclusive Self Attention (XSA), a (nearly) free improvement to Transformers for LM.

Observation: for y = attn(q, k, v), yᵢ and vᵢ tend to have a very high cosine similarity.
Fix: exclude vᵢ from yᵢ via zᵢ = yᵢ - (yᵢᵀvᵢ)vᵢ/‖vᵢ‖²
Result: better training/val loss across model sizes; increasing gains as sequence length grows.

See more: arxiv.org/abs/2603.09078
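The fix is a one-line projection. Here is a sketch written directly from the formula in the post (the attention call itself stays ordinary; only this post-hoc step is new):

import torch

def exclude_self(y, v, eps=1e-8):
    # y, v: (..., seq, d); remove from each yᵢ its component along its own vᵢ
    coef = (y * v).sum(-1, keepdim=True) / (v * v).sum(-1, keepdim=True).clamp_min(eps)
    return y - coef * v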

The winding river towards Monte Fitz Roy, Patagonia

We’re also launching Immersive Navigation - our biggest navigation upgrade in over a decade! A new vivid 3D view better reflects your surroundings, with helpful road details like lanes, crosswalks, and traffic lights. Gemini models analyze real-world imagery from Street View and aerial photos to give you an accurate view of landmarks along your route. Starts rolling out in the US today.

Torres del Paine, Chile


Reusability influenced every decision we made:
* Off-policy RL for simplicity and robustness, with no quality tradeoffs: arxiv.org/abs/2602.19362
* Simple, elegant test-time compute via parallel thinking to control latency
* RL-learned context compression
* Multi-task RL for modularity


Doc-to-LoRA: What if you could distill documents into your LLM's weights online, without any training? 🚀

Stoked to share our new work on instant LLM adaptation using meta-learned hypernetworks 📝

Building on our previous Text-to-LoRA work, we condition a hypernetwork on a document so it outputs LoRA adapters, extending the base LLM's effective context window. The hypernetwork is meta-trained on thousands of summarization tasks and shows remarkable compression capabilities at low latency 📈

🧑‍🔬 Work led by @tan51616 with @edo_cet & Shin Useka at @SakanaAILabs
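A minimal sketch of the doc-conditioned hypernetwork idea (my reading, not Sakana's implementation; the meta-training loop is omitted and all names here are assumptions): map a pooled document embedding to the low-rank factors of a LoRA adapter for one target weight matrix, so adaptation costs a forward pass rather than gradient steps.

import torch
import torch.nn as nn

class DocToLoRA(nn.Module):
    def __init__(self, doc_dim, d_model, rank=8):
        super().__init__()
        self.rank, self.d = rank, d_model
        self.to_a = nn.Linear(doc_dim, rank * d_model)  # emits LoRA factor A
        self.to_b = nn.Linear(doc_dim, rank * d_model)  # emits LoRA factor B

    def forward(self, doc_emb):
        # doc_emb: (doc_dim,) pooled embedding of the document to absorb
        A = self.to_a(doc_emb).view(self.rank, self.d)  # (r, d)
        B = self.to_b(doc_emb).view(self.d, self.rank)  # (d, r)
        return A, B  # adapted weight: W + B @ A, with no test-time training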


Introducing Self-Distillation Fine-Tuning (SDFT), a simple method that enables on-policy learning from expert demos, with no rewards needed. Key insight: putting the SFT data in-context turns the model into its own teacher, producing on-policy signals that preserve prior skills and improve generalization. (3/n)
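A minimal sketch of that key insight under stated assumptions (an HF-style causal LM whose forward pass returns .logits; the actual SDFT recipe surely differs in details like sampling and batching): the same model scores the target twice, once with the demo prepended (teacher) and once without (student), and the student distills the teacher's token distributions.

import torch
import torch.nn.functional as F

def sdft_loss(model, demo_ids, prompt_ids, target_ids):
    T = target_ids.shape[-1]
    with torch.no_grad():  # teacher pass: SFT demo in context, no gradients
        t_in = torch.cat([demo_ids, prompt_ids, target_ids], dim=-1)
        # logits at position i predict token i+1, hence the shift by one
        t_logits = model(t_in).logits[:, -T - 1:-1, :]
    s_in = torch.cat([prompt_ids, target_ids], dim=-1)  # student pass: no demo
    s_logits = model(s_in).logits[:, -T - 1:-1, :]
    return F.kl_div(F.log_softmax(s_logits, -1), F.log_softmax(t_logits, -1),
                    log_target=True, reduction="batchmean")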
