Ben Athiwaratkun
@ben_athi
390 posts

Leading LLM Efficiency Research @ Together AI. prev: @awscloud @MSFTResearch, @Cornell PhD.
Earth, the Milky Way. Joined July 2014
710 Following · 967 Followers
Ben Athiwaratkun reposted
Together AI
Together AI@togethercompute·
Introducing v2 of our Open Deep Research app! Generate detailed reports on any topic with open source LLMs. Fully free & open source. We're releasing everything: evaluation dataset, code, app, and blog 🔥
Ben Athiwaratkun reposted
Grace Isford
Grace Isford@graceisford·
Amazing @togethercompute conference bringing *together* the AI native community💫 Insane growth over 3 years & major milestones:
✅ Flash Attention-4, Atlas-2, ThunderAgent
✅ Trillions of tokens served per day
✅ 250MW+ of compute & counting, just getting started 🔥
Together AI@togethercompute

Together Research has produced FlashAttention, ATLAS, ThunderKittens and more. This week at AI Native Conf: seven more releases, all coming to production soon. Thread → #ainativeconf #ainativecloud

Ted Zadouri
Ted Zadouri@tedzadouri·
We thank @Together, @Meta, and @xAI for providing compute support. We also thank @PrincetonPLI. Finally, we thank the cuDNN, TensorRT-LLM, and CUTLASS teams at @NVIDIA for ongoing discussions, ideas, and feedback.
Ted Zadouri
Ted Zadouri@tedzadouri·
Asymmetric hardware scaling is here. Blackwell tensor cores are now so fast that exp2 and shared memory are the wall. FlashAttention-4 changes the algorithm & pipeline so that softmax & SMEM bandwidth no longer dictate speed. Attn reaches ~1600 TFLOPs, pretty much at matmul speed! Joint work w/ Markus Hoehnerbach, Jay Shah (@ultraproduct), Timmy Liu, Vijay Thakkar (@__tensorcore__), Tri Dao (@tri_dao) 1/
Ben Athiwaratkun reposted
Tri Dao
Tri Dao@tri_dao·
The FA4 paper is finally out after a year of work. On Blackwell GPUs, attention now goes about as fast as matmul even though the bottlenecks are so different! Tensor cores are now so fast that attn fwd is bottlenecked by the exponential, and attn bwd is bottlenecked by shared memory bandwidth. Some fun stuff in the redesigned algorithm to overcome these bottlenecks: exponential emulation with polynomials, a new online softmax to avoid 90% of softmax rescaling, and 2-CTA MMA instructions that allow two thread blocks to share operands to reduce smem traffic.
Ted Zadouri@tedzadouri

Asymmetric hardware scaling is here. Blackwell tensor cores are now so fast that exp2 and shared memory are the wall. FlashAttention-4 changes the algorithm & pipeline so that softmax & SMEM bandwidth no longer dictate speed. Attn reaches ~1600 TFLOPs, pretty much at matmul speed! Joint work w/ Markus Hoehnerbach, Jay Shah (@ultraproduct), Timmy Liu, Vijay Thakkar (@__tensorcore__), Tri Dao (@tri_dao) 1/

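The "avoid 90% of softmax rescaling" idea can be illustrated with plain online softmax: the accumulated sum only needs a rescale when the running max actually grows, so blocks whose scores stay below the current max skip it entirely. A toy Python sketch of that baseline behavior (an assumed illustration, not FA4's actual kernel logic):

```python
import math

def online_softmax_denominator(blocks):
    """Streaming softmax max/denominator over blocks of scores.

    Toy illustration of online softmax: the accumulated sum is rescaled
    only when a block raises the running max, so blocks that do not raise
    it skip the rescale (the effect FA4 pushes much further on Blackwell).
    """
    m = float("-inf")  # running max
    d = 0.0            # running sum of exp(x - m)
    rescales = 0
    for block in blocks:
        bm = max(block)
        if bm > m:                   # max grew: one rescale of the old sum
            if d > 0.0:
                d *= math.exp(m - bm)
                rescales += 1
            m = bm
        d += sum(math.exp(x - m) for x in block)
    return m, d, rescales
```

The result matches a naive two-pass softmax denominator, while the rescale counter shows how rarely the correction is actually needed.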
Ben Athiwaratkun
Ben Athiwaratkun@ben_athi·
Really great work from the Together research team on accelerating inference! @tanishqkumar07 @avnermay @tri_dao TL;DR: why run the draft and verifier sequentially when you can run them in parallel?! The trick is that since we don't know which tokens will be accepted, we start speculating for all possible accepted prefixes. This leads to very low drafting time (due to the overlap), hence a significant speedup. After a certain point, scaling with TP doesn't increase TPS anymore, so this is a great way to add hardware and increase speed. P.S. We are hiring exceptional research engineers for the Core ML team -- please consider joining us: job-boards.greenhouse.io/togetherai/job…
Tanishq Kumar@tanishqkumar07

I've been working on a new LLM inference algorithm. It's called Speculative Speculative Decoding (SSD) and it's up to 2x faster than the strongest inference engines in the world. Collab w/ @tri_dao @avnermay. Details in thread.

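The overlap trick can be sketched with toy stand-ins (hypothetical `draft_model` and `verify` functions, not the actual SSD implementation): while the verifier checks the current draft, the drafter pre-drafts a continuation from every possible accepted prefix, so whichever prefix the verifier accepts, its next draft is already ready.

```python
def draft_model(prefix, k=3):
    # Hypothetical toy drafter: proposes k tokens deterministically.
    return [(sum(prefix) + i) % 100 for i in range(k)]

def verify(prefix, draft):
    # Hypothetical toy verifier: accepts draft tokens until one fails
    # a stand-in acceptance test (even tokens pass).
    accepted = []
    for t in draft:
        if t % 2 == 0:
            accepted.append(t)
        else:
            break
    return accepted

def speculate_all_prefixes(prefix, draft):
    # Key idea: pre-draft from every possible accepted prefix (0..k
    # accepted tokens). Here sequential; in the real system these run
    # concurrently with verification, so drafting time is hidden.
    return {n: draft_model(prefix + draft[:n]) for n in range(len(draft) + 1)}

prefix = [1, 2, 3]
draft = draft_model(prefix)
pre_drafts = speculate_all_prefixes(prefix, draft)  # overlaps with verify
accepted = verify(prefix, draft)
next_draft = pre_drafts[len(accepted)]  # ready immediately, no draft stall
```

Whatever the verifier returns, `pre_drafts` already contains the draft that a sequential loop would only start computing afterward.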
Ben Athiwaratkun
Ben Athiwaratkun@ben_athi·
Pretty excited about this data release, and thanks @percyliang for the project guidance. We built a large-scale, test-verified trajectory pipeline across 51K tasks and 1,655 repos, then filtered down to 258K trajectories (155K passing). A 32B model reaches 59.4% pass@1 on SWE-bench Verified with data built from our scaffolding. The next step is scaling the data collection: more tasks, more scaffolds/tools, more diversity, and then moving beyond SFT into agentic RL.
Percy Liang@percyliang

These days, I'm much more excited about dataset releases than model releases. Models come and go and don't compose, whereas good datasets are more enduring and can be studied, used, revised to create better models more broadly. Excited about these 155K coding agent trajectories...just SFT'ing on this data improves SWE-bench Verified massively (23% -> 59.4%).

Ben Athiwaratkun
Ben Athiwaratkun@ben_athi·
Consistency Diffusion Language Models
→ 4.1–7.7x fewer refinement steps
→ Up to 14.5x lower latency
→ Competitive accuracy on math and coding
TL;DR: The normal (autoregressive) LLM decode process is slow because it's memory-bound. A vanilla diffusion LM can be faster since it generates multiple tokens in parallel, getting closer to the compute-bound regime. Block-wise diffusion is faster due to intra-block parallelism, plus reliable generation via the consistency DLM. Great work by @Chenfeng_X and collaborators
Together AI@togethercompute

New from Together Research: up to 14.5x lower latency for diffusion language models. CDLM (consistency diffusion language models) tackles two core bottlenecks, KV caching incompatibility and high step counts, through a post-training recipe applicable to any block-diffusion model. Results on Dream-7B:
→ 4.1–7.7x fewer refinement steps
→ Up to 14.5x lower latency
→ Competitive accuracy on math and coding
Core finding: you can't just truncate steps. Quality collapses without training that enforces trajectory-consistent behavior. Read more in the thread!

Ben Athiwaratkun
Ben Athiwaratkun@ben_athi·
Just published a blog post on how we deliver the fastest inference on leading open-source models — powered by our research. 🚀 If you’re at #NeurIPS2025 and interested in efficiency, come swing by the Together AI booth! We’re working on frontier speculative decoding systems, efficient RL, Turbo SWE agents, and more. We’re also hiring exceptional full-stack researchers and engineers with expertise in at least one of: systems, post-training/RL, or inference efficiency. Would love to meet you!
Together AI@togethercompute

Together AI now offers the fastest inference for the most popular OSS LLMs including Qwen3 235B 2507, GPT-OSS-20B, and Kimi-K2-0905.

Ben Athiwaratkun
Ben Athiwaratkun@ben_athi·
The reduced MFU for newer hardware is expected, and is an artifact of compute scaling much faster than memory and communication bandwidth. This is why we proposed architectures like Ladder Residual, which overlaps compute with communication so that modern hardware like Blackwell and later generations can scale better. arxiv.org/pdf/2501.06589
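A minimal sketch of the compute/communication overlap idea (simulated with threads and sleeps; hypothetical stand-ins, not the paper's implementation): launch the previous layer's all-reduce asynchronously and run the next layer's compute while it is in flight, so communication latency hides behind compute instead of adding to it.

```python
from concurrent.futures import ThreadPoolExecutor
import time

def all_reduce(x):
    """Stand-in for a cross-GPU all-reduce (fixed simulated latency)."""
    time.sleep(0.01)
    return x

def layer_compute(x):
    """Stand-in for a transformer layer's matmuls."""
    time.sleep(0.01)
    return x + 1

def overlapped_forward(x, n_layers=4):
    """Run each layer's compute while the previous layer's all-reduce
    is still in flight, instead of blocking on it first."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = None
        for _ in range(n_layers):
            x = layer_compute(x)                   # compute current layer
            if pending is not None:
                pending.result()                   # prior comm finished under compute
            pending = pool.submit(all_reduce, x)   # overlap with next layer
        return pending.result()
```

In the naive schedule each layer pays compute time plus communication time; here the all-reduce of layer i runs concurrently with the compute of layer i+1, which is the scaling benefit the tweet refers to.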
elvis
elvis@omarsar0·
Interesting research from Meta on hardware scaling trends. More GPUs don't always mean faster training.

The default approach to scaling LLM training today remains throwing more hardware at the problem: more accelerators, more parallelism, more compute. However, there's a ceiling that most teams don't see until they hit it. This new research demonstrates that scaling the total number of accelerators for large model training quickly yields diminishing returns, even with optimized hardware and parallelization strategies.

The researchers tested Llama-2 models (1B to 70B parameters) across 8 to 2,048 GPUs spanning V100, A100, and H100 hardware. What did they find? When scaling from 128 to 2,048 GPUs, throughput decreased by 37.22% while per-GPU power draw only dropped 5.87%.

The culprit is communication overhead. At large scales, AllGather and ReduceScatter operations (two collective-communication primitives) become bottlenecks. The majority of communication becomes exposed, and computation can't hide the latency anymore. Counter-intuitively, model parallelism strategies (tensor and pipeline parallelism at degrees 2-4) that were previously thought to reduce hardware utilization actually become preferable at scale: they reduce exposed communication compared to pure data parallelism.

On newer hardware, utilization gets worse, not better. Model FLOPS Utilization dropped from 59.67% on A100 to 40.77% on H100; faster chips expose more communication overhead.

Why it matters: adding more GPUs provides poor marginal performance per additional unit of power or GPU-hour. Teams scaling to thousands of accelerators need to carefully reconsider parallelization strategies rather than assuming more hardware equals faster training.
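For reference, Model FLOPS Utilization is typically computed with the standard ~6N-FLOPs-per-trained-token approximation (a generic formula, not necessarily the paper's exact accounting; the example numbers below are made up):

```python
def mfu(n_params, tokens_per_sec, n_gpus, peak_flops_per_gpu):
    """Model FLOPS Utilization: achieved training FLOP/s over peak FLOP/s.

    Uses the common approximation of ~6 * n_params FLOPs per trained
    token (forward + backward pass combined).
    """
    achieved = 6.0 * n_params * tokens_per_sec
    return achieved / (n_gpus * peak_flops_per_gpu)

# Hypothetical example: a 1B-param model training at 20k tokens/sec on
# one GPU with 312 TFLOP/s peak gives roughly 38% MFU.
print(round(mfu(1e9, 2e4, 1, 312e12), 3))
```

An MFU drop on faster chips (as reported above) means the achieved FLOP/s grew more slowly than the peak in the denominator, i.e., the extra compute sat idle behind exposed communication.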
Ben Athiwaratkun
Ben Athiwaratkun@ben_athi·
Most speculative decoding research focuses on algorithms. But we know that data matters a ton! (E.g., no matter how good the spec algorithm is, if it's trained on bad & misaligned data, the speed will be poor.) What if we build on algorithms that make data really shine?! In this work, we introduce ATLAS, a speculative decoding system that enables customization to your LLM traffic data, making model speed blazing fast! together.ai/blog/adaptive-…
Tri Dao@tri_dao

This work, led by @_junxiong_wang and @ben_athi, is a first step towards building AI systems that evolve and get better as you use them. More to come!

Ben Athiwaratkun reposted
Together AI
Together AI@togethercompute·
🤖 OpenAI's open models are here. gpt-oss models just landed on Together AI. Achieves near-parity with o4-mini, trained using o3 techniques. Build anything, deploy anywhere 🔥
Ben Athiwaratkun
Ben Athiwaratkun@ben_athi·
If you’re at ICML and interested in LLM efficiency research, come chat with us at the Together AI booth!
Together AI@togethercompute

Together AI Sets a New Bar: Fastest Inference for DeepSeek-R1-0528. We’ve upgraded the Together Inference Engine to run on @NVIDIA Blackwell GPUs, and the results speak for themselves:
📈 Highest known serverless throughput: 334 tokens/sec
🏃 Fastest time to first answer token: 7.1 sec
⏱️ Lowest end-to-end response time: 8.7 sec
Need more performance? Our Dedicated Endpoints hit 386 tokens/sec.

Ben Athiwaratkun
Ben Athiwaratkun@ben_athi·
TL;DR - one way to push the quality-efficiency frontier: obtain high-quality generations via a collection of LLMs -> distill to a smaller model -> get a higher-quality small model that is more inference-efficient than the original collection of models. Poster session happening now at ICML, East Exhibition Hall A, E2505
Junlin Wang@JunlinWang3

Work done during my internship at Together AI is being presented at #icml25. Come and check it out! We propose a new model alignment pipeline that harnesses collective intelligence from open-source LLMs!
