Ben Athiwaratkun
@ben_athi
390 posts

Leading LLM Efficiency Research @ Together AI. prev: @awscloud @MSFTResearch, @Cornell PhD.
Earth, the Milky Way. Joined July 2014
710 Following · 967 Followers
Ben Athiwaratkun reposted
Together AI
Together AI@togethercompute·
Introducing v2 of our Open Deep Research app! Generate detailed reports on any topic with open source LLMs. Fully free & open source. We're releasing everything: evaluation dataset, code, app, and blog 🔥
Ben Athiwaratkun reposted
Grace Isford
Grace Isford@graceisford·
Amazing @togethercompute conference bringing *together* the AI native community💫 Insane growth over 3 years & major milestones:
✅ Flash Attention-4, Atlas-2, ThunderAgent
✅ Trillions of tokens served per day
✅ 250MW+ of compute & counting, just getting started 🔥
Together AI@togethercompute

Together Research has produced FlashAttention, ATLAS, ThunderKittens and more. This week at AI Native Conf: seven more releases, all coming to production soon. Thread → #ainativeconf #ainativecloud

Ted Zadouri
Ted Zadouri@tedzadouri·
We thank @Together, @Meta, and @xAI for providing compute support. We also thank @PrincetonPLI. Finally, we thank the cuDNN, TensorRT-LLM, and CUTLASS teams at @NVIDIA for ongoing discussions, ideas, and feedback.
Ted Zadouri
Ted Zadouri@tedzadouri·
Asymmetric hardware scaling is here. Blackwell tensor cores are now so fast that exp2 and shared memory are the wall. FlashAttention-4 changes the algorithm & pipeline so that softmax & SMEM bandwidth no longer dictate speed. Attn reaches ~1600 TFLOPs, pretty much at matmul speed! Joint work w/ Markus Hoehnerbach, Jay Shah (@ultraproduct), Timmy Liu, Vijay Thakkar (@__tensorcore__), Tri Dao (@tri_dao) 1/
Ben Athiwaratkun reposted
Tri Dao
Tri Dao@tri_dao·
The FA4 paper is finally out after a year of work. On Blackwell GPUs, attention now goes about as fast as matmul even though the bottlenecks are so different! Tensor cores are now so fast that attn fwd is bottlenecked by the exponential, and attn bwd is bottlenecked by shared memory bandwidth. Some fun stuff in the redesigned algorithm to overcome these bottlenecks: exponential emulation with polynomials, a new online softmax to avoid 90% of softmax rescaling, and 2-CTA MMA instructions that allow two thread blocks to share operands to reduce smem traffic.
Ted Zadouri@tedzadouri

Asymmetric hardware scaling is here. Blackwell tensor cores are now so fast that exp2 and shared memory are the wall. FlashAttention-4 changes the algorithm & pipeline so that softmax & SMEM bandwidth no longer dictate speed. Attn reaches ~1600 TFLOPs, pretty much at matmul speed! Joint work w/ Markus Hoehnerbach, Jay Shah (@ultraproduct), Timmy Liu, Vijay Thakkar (@__tensorcore__), Tri Dao (@tri_dao) 1/

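The "avoid 90% of softmax rescaling" idea can be illustrated with plain online softmax: the accumulated sum only needs a rescale when the running max actually grows, so blocks whose scores stay below the current max skip it entirely. A toy Python sketch of that baseline behavior (an assumed illustration, not FA4's actual kernel logic):

```python
import math

def online_softmax_denominator(blocks):
    """Streaming softmax max/denominator over blocks of scores.

    Toy illustration of online softmax: the accumulated sum is rescaled
    only when a block raises the running max, so blocks that do not raise
    it skip the rescale (the effect FA4 pushes much further on Blackwell).
    """
    m = float("-inf")  # running max
    d = 0.0            # running sum of exp(x - m)
    rescales = 0
    for block in blocks:
        bm = max(block)
        if bm > m:                   # max grew: one rescale of the old sum
            if d > 0.0:
                d *= math.exp(m - bm)
                rescales += 1
            m = bm
        d += sum(math.exp(x - m) for x in block)
    return m, d, rescales
```

The result matches a naive two-pass softmax denominator, while the rescale counter shows how rarely the correction is actually needed.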
Ben Athiwaratkun
Ben Athiwaratkun@ben_athi·
Really great work from the Together research team on accelerating inference! @tanishqkumar07 @avnermay @tri_dao TL;DR: why run the draft and verifier sequentially when you can run them in parallel?! The trick is that since we don't know which tokens will be accepted, we start speculating for all possible accepted prefixes. This leads to very low drafting time (due to the overlap), hence a significant speedup. After a certain point, scaling with TP doesn't increase TPS anymore, so this is a great way to add hardware and increase speed. P.S. We are hiring exceptional research engineers for the Core ML team -- please consider joining us: job-boards.greenhouse.io/togetherai/job…
Tanishq Kumar@tanishqkumar07

I've been working on a new LLM inference algorithm. It's called Speculative Speculative Decoding (SSD) and it's up to 2x faster than the strongest inference engines in the world. Collab w/ @tri_dao @avnermay. Details in thread.

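The overlap trick can be sketched with toy stand-ins (hypothetical `draft_model` and `verify` functions, not the actual SSD implementation): while the verifier checks the current draft, the drafter pre-drafts a continuation from every possible accepted prefix, so whichever prefix the verifier accepts, its next draft is already ready.

```python
def draft_model(prefix, k=3):
    # Hypothetical toy drafter: proposes k tokens deterministically.
    return [(sum(prefix) + i) % 100 for i in range(k)]

def verify(prefix, draft):
    # Hypothetical toy verifier: accepts draft tokens until one fails
    # a stand-in acceptance test (even tokens pass).
    accepted = []
    for t in draft:
        if t % 2 == 0:
            accepted.append(t)
        else:
            break
    return accepted

def speculate_all_prefixes(prefix, draft):
    # Key idea: pre-draft from every possible accepted prefix (0..k
    # accepted tokens). Here sequential; in the real system these run
    # concurrently with verification, so drafting time is hidden.
    return {n: draft_model(prefix + draft[:n]) for n in range(len(draft) + 1)}

prefix = [1, 2, 3]
draft = draft_model(prefix)
pre_drafts = speculate_all_prefixes(prefix, draft)  # overlaps with verify
accepted = verify(prefix, draft)
next_draft = pre_drafts[len(accepted)]  # ready immediately, no draft stall
```

Whatever the verifier returns, `pre_drafts` already contains the draft that a sequential loop would only start computing afterward.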
Ben Athiwaratkun
Ben Athiwaratkun@ben_athi·
Pretty excited about this data release, and thanks @percyliang for the project guidance. We built a large-scale, test-verified trajectory pipeline across 51K tasks and 1,655 repos, then filtered down to 258K trajectories (155K passing). A 32B model reaches 59.4% pass@1 on SWE-bench Verified with data built from our scaffolding. The next step is scaling the data collection: more tasks, more scaffolds/tools, more diversity, and then moving beyond SFT into agentic RL.
Percy Liang@percyliang

These days, I'm much more excited about dataset releases than model releases. Models come and go and don't compose, whereas good datasets are more enduring and can be studied, used, revised to create better models more broadly. Excited about these 155K coding agent trajectories...just SFT'ing on this data improves SWE-bench Verified massively (23% -> 59.4%).

Ben Athiwaratkun
Ben Athiwaratkun@ben_athi·
Consistency Diffusion Language Models
→ 4.1–7.7x fewer refinement steps
→ Up to 14.5x lower latency
→ Competitive accuracy on math and coding
TL;DR: The normal (autoregressive) LLM decode process is slow because it's memory-bound. A vanilla diffusion LM can be faster since it generates multiple tokens in parallel, getting closer to the compute-bound regime. Block-wise diffusion is faster due to intra-block parallelism, plus reliable generation via the consistency DLM. Great work by @Chenfeng_X and collaborators
Together AI@togethercompute

New from Together Research: up to 14.5x lower latency for diffusion language models. CDLM (consistency diffusion language models) tackles two core bottlenecks, KV caching incompatibility and high step counts, through a post-training recipe applicable to any block-diffusion model. Results on Dream-7B:
→ 4.1–7.7x fewer refinement steps
→ Up to 14.5x lower latency
→ Competitive accuracy on math and coding
Core finding: you can't just truncate steps. Quality collapses without training that enforces trajectory-consistent behavior. Read more in the thread!

Ben Athiwaratkun
Ben Athiwaratkun@ben_athi·
Just published a blog post on how we deliver the fastest inference on leading open-source models — powered by our research. 🚀 If you’re at #NeurIPS2025 and interested in efficiency, come swing by the Together AI booth! We’re working on frontier speculative decoding systems, efficient RL, Turbo SWE agents, and more. We’re also hiring exceptional full-stack researchers and engineers with expertise in at least one of: systems, post-training/RL, or inference efficiency. Would love to meet you!
Together AI@togethercompute

Together AI now offers the fastest inference for the most popular OSS LLMs including Qwen3 235B 2507, GPT-OSS-20B, and Kimi-K2-0905.

Ben Athiwaratkun
Ben Athiwaratkun@ben_athi·
The reduced MFU for newer hardware is expected, and is an artifact of compute scaling much faster than memory and communication bandwidth. This is why we proposed architectures like Ladder Residual, which overlaps compute with communication so that modern hardware like Blackwell and later generations can scale better. arxiv.org/pdf/2501.06589
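A minimal sketch of the compute/communication overlap idea (simulated with threads and sleeps; hypothetical stand-ins, not the paper's implementation): launch the previous layer's all-reduce asynchronously and run the next layer's compute while it is in flight, so communication latency hides behind compute instead of adding to it.

```python
from concurrent.futures import ThreadPoolExecutor
import time

def all_reduce(x):
    """Stand-in for a cross-GPU all-reduce (fixed simulated latency)."""
    time.sleep(0.01)
    return x

def layer_compute(x):
    """Stand-in for a transformer layer's matmuls."""
    time.sleep(0.01)
    return x + 1

def overlapped_forward(x, n_layers=4):
    """Run each layer's compute while the previous layer's all-reduce
    is still in flight, instead of blocking on it first."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = None
        for _ in range(n_layers):
            x = layer_compute(x)                   # compute current layer
            if pending is not None:
                pending.result()                   # prior comm finished under compute
            pending = pool.submit(all_reduce, x)   # overlap with next layer
        return pending.result()
```

In the naive schedule each layer pays compute time plus communication time; here the all-reduce of layer i runs concurrently with the compute of layer i+1, which is the scaling benefit the tweet refers to.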
elvis
elvis@omarsar0·
Interesting research from Meta on hardware scaling trends. More GPUs don't always mean faster training.

The default approach to scaling LLM training today remains throwing more hardware at the problem: more accelerators, more parallelism, more compute. However, there's a ceiling that most teams don't see until they hit it. This new research demonstrates that scaling the total number of accelerators for large model training quickly yields diminishing returns, even with optimized hardware and parallelization strategies.

The researchers tested Llama-2 models (1B to 70B parameters) across 8 to 2,048 GPUs spanning V100, A100, and H100 hardware. What did they find? When scaling from 128 to 2,048 GPUs, throughput decreased by 37.22% while per-GPU power draw only dropped 5.87%.

The culprit is communication overhead. At large scales, AllGather and ReduceScatter operations (two collective-communication primitives) become bottlenecks. The majority of communication becomes exposed, and computation can't hide the latency anymore. Counter-intuitively, model parallelism strategies (tensor and pipeline parallelism at degrees 2-4) that were previously thought to reduce hardware utilization actually become preferable at scale: they reduce exposed communication compared to pure data parallelism.

On newer hardware, utilization gets worse, not better. Model FLOPS Utilization dropped from 59.67% on A100 to 40.77% on H100; faster chips expose more communication overhead.

Why it matters: adding more GPUs provides poor marginal performance per additional unit of power or GPU-hour. Teams scaling to thousands of accelerators need to carefully reconsider parallelization strategies rather than assuming more hardware equals faster training.
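For reference, Model FLOPS Utilization is typically computed with the standard ~6N-FLOPs-per-trained-token approximation (a generic formula, not necessarily the paper's exact accounting; the example numbers below are made up):

```python
def mfu(n_params, tokens_per_sec, n_gpus, peak_flops_per_gpu):
    """Model FLOPS Utilization: achieved training FLOP/s over peak FLOP/s.

    Uses the common approximation of ~6 * n_params FLOPs per trained
    token (forward + backward pass combined).
    """
    achieved = 6.0 * n_params * tokens_per_sec
    return achieved / (n_gpus * peak_flops_per_gpu)

# Hypothetical example: a 1B-param model training at 20k tokens/sec on
# one GPU with 312 TFLOP/s peak gives roughly 38% MFU.
print(round(mfu(1e9, 2e4, 1, 312e12), 3))
```

An MFU drop on faster chips (as reported above) means the achieved FLOP/s grew more slowly than the peak in the denominator, i.e., the extra compute sat idle behind exposed communication.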
Ben Athiwaratkun
Ben Athiwaratkun@ben_athi·
Most speculative decoding research focuses on algorithms. But we know that data matters a ton! (E.g., no matter how good the spec algorithm is, if it's trained on bad & misaligned data, the speed will be poor.) What if we build on algorithms that make data really shine?! In this work, we introduce ATLAS, a speculative decoding system that enables customization to your LLM traffic data, making model speed blazing fast! together.ai/blog/adaptive-…
Tri Dao@tri_dao

This work, led by @_junxiong_wang and @ben_athi, is a first step towards building AI systems that evolve and get better as you use them. More to come!

Ben Athiwaratkun reposted
Together AI
Together AI@togethercompute·
🤖 OpenAI's open models are here. gpt-oss models just landed on Together AI. Achieves near-parity with o4-mini, trained using o3 techniques. Build anything, deploy anywhere 🔥
Ben Athiwaratkun
Ben Athiwaratkun@ben_athi·
If you’re at ICML and interested in LLM efficiency research, come chat with us at the Together AI booth!
Together AI@togethercompute

Together AI Sets a New Bar: Fastest Inference for DeepSeek-R1-0528. We’ve upgraded the Together Inference Engine to run on @NVIDIA Blackwell GPUs, and the results speak for themselves:
📈 Highest known serverless throughput: 334 tokens/sec
🏃 Fastest time to first answer token: 7.1 sec
⏱️ Lowest end-to-end response time: 8.7 sec
Need more performance? Our Dedicated Endpoints hit 386 tokens/sec.

Ben Athiwaratkun
Ben Athiwaratkun@ben_athi·
TL;DR - one way to push the quality-efficiency frontier: obtain high-quality generations via a collection of LLMs -> distill to a smaller model -> get a higher-quality small model that is more inference-efficient than the original collection of models. Poster session happening now at ICML, East Exhibition Hall A, E2505
Junlin Wang@JunlinWang3

Work done during my internship at Together AI is being presented at #icml25. Come and check it out! We propose a new model alignment pipeline that harnesses collective intelligence from open-source LLMs!
