James Liu
@JamesLiuID
63 posts
@AnthropicAI, prev @MIT, @togethercompute

Boston, MA · Joined April 2022
231 Following · 434 Followers
Pinned Tweet
James Liu @JamesLiuID
Your LLM may be sparser than you thought! Excited to announce TEAL, a simple training-free method that achieves 40-50% model-wide activation sparsity on Llama-2/3 and Mistral models. Combined with a custom kernel, we achieve end-to-end speedups of 1.53x-1.8x!
6 replies · 63 reposts · 427 likes · 109K views
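As a rough illustration of the idea behind training-free activation sparsity, the sketch below zeroes low-magnitude activations before a linear layer's matmul. The per-tensor quantile calibration, the 50% target, and the helper names are assumptions for illustration rather than TEAL's exact procedure, and real speedups additionally require a sparsity-aware GEMM kernel.

```python
# Minimal sketch of magnitude-based activation sparsity (illustrative, not TEAL's exact method).
import torch

def calibrate_threshold(activations: torch.Tensor, target_sparsity: float) -> float:
    """Pick a magnitude cutoff so that `target_sparsity` of entries fall below it."""
    return torch.quantile(activations.abs().float().flatten(), target_sparsity).item()

def sparsify(x: torch.Tensor, threshold: float) -> torch.Tensor:
    """Zero activations whose magnitude is below the calibrated threshold."""
    return torch.where(x.abs() >= threshold, x, torch.zeros_like(x))

# Toy usage: sparsify the input to a linear layer at ~50% activation sparsity.
layer = torch.nn.Linear(4096, 11008, bias=False)
x = torch.randn(8, 4096)                       # stand-in for hidden states
thr = calibrate_threshold(x, target_sparsity=0.5)
x_sparse = sparsify(x, thr)
y = layer(x_sparse)
print(f"activation sparsity: {(x_sparse == 0).float().mean().item():.2%}")
```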
James Liu retweeted
Ted Zadouri @tedzadouri
Asymmetric hardware scaling is here. Blackwell tensor cores are now so fast that exp2 and shared memory are the wall. FlashAttention-4 changes the algorithm & pipeline so that softmax & SMEM bandwidth no longer dictate speed. Attn reaches ~1600 TFLOPs, pretty much at matmul speed! Joint work w/ Markus Hoehnerbach, Jay Shah (@ultraproduct), Timmy Liu, Vijay Thakkar (@__tensorcore__), Tri Dao (@tri_dao) 1/
7 replies · 131 reposts · 783 likes · 227.3K views
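A back-of-envelope sketch of why the exponential can become the wall once matmuls get this fast: each attention score costs roughly 4·d matmul FLOPs (QK^T plus PV) but only one exp. The tensor-core-to-SFU throughput ratio below is a placeholder assumption, not a measured Blackwell figure; the point is only that once that ratio is large, softmax work rivals the matmuls unless the algorithm hides or reduces the exponentials.

```python
# Rough arithmetic: matmul work vs. exponential work per attention score.
head_dim = 128
matmul_flops_per_score = 4 * head_dim     # 2*d FLOPs for QK^T + 2*d FLOPs for PV
exp_ops_per_score = 1

# Hypothetical ratio: matmul FLOPs completed in the time of one exp2 on the SFUs.
assumed_flops_per_exp = 512               # assumption, not a vendor-published number

matmul_time = matmul_flops_per_score      # measured in "FLOP-equivalents"
exp_time = exp_ops_per_score * assumed_flops_per_exp
print(f"exp2 share of compute time: {exp_time / (exp_time + matmul_time):.0%}")
```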
James Liu retweeted
Luke J. Huang @whatthelukh
We introduce Variance Controlled Policy Optimization (VCPO), a method that adds explicit variance-targeted control to policy-gradient objectives in off-policy RL, enabling stable, scalable Async RL training.
✨ Seamlessly integrates into common policy-gradient methods like REINFORCE/RLOO/GRPO
🚀 2.5x faster Async RL training while matching Synchronous RL performance
🧠 Robust training stability under highly off-policy settings (at least 128 steps off-policy)
📄 Paper: arxiv.org/abs/2602.17616
🔗 Code: github.com/mit-han-lab/vc…
🧵👇
3 replies · 12 reposts · 68 likes · 11.6K views
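To make "variance-targeted control" concrete, here is a minimal sketch of one way an explicit variance term could be layered onto a REINFORCE-style off-policy objective. The penalty form (squared gap between the empirical variance of the per-sample terms and a target) and all names are illustrative assumptions only; this is not VCPO's actual formulation, for which see the paper linked above.

```python
# Illustrative variance-controlled off-policy REINFORCE loss (not VCPO's exact objective).
import torch

def variance_controlled_loss(logp_new, logp_old, advantages,
                             target_var: float = 1.0, beta: float = 0.1):
    """logp_new/logp_old: per-sample sequence log-probs; advantages: per-sample returns."""
    ratios = torch.exp(logp_new - logp_old.detach())        # importance weights
    per_sample = -ratios * advantages                        # off-policy REINFORCE terms
    loss = per_sample.mean()
    var_penalty = (per_sample.var() - target_var).pow(2)     # drive estimator variance toward target
    return loss + beta * var_penalty

# Toy usage with random stand-ins for policy log-probs and advantages.
logp_new = torch.randn(32, requires_grad=True)
loss = variance_controlled_loss(logp_new, torch.randn(32), torch.randn(32))
loss.backward()
```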
James Liu retweeted
Adam Zweiger @AdamZweiger
We introduce a new approach for fast and high-quality context compaction in latent space. Attention Matching (AM) achieves 50× compaction in seconds with little performance loss, substantially outperforming summarization and other baselines.
23 replies · 148 reposts · 944 likes · 129.9K views
James Liu @JamesLiuID
congrats on the launch!!
Woosuk Kwon @woosuk_k

Today, we're proud to announce @inferact, a startup founded by creators and core maintainers of @vllm_project, the most popular open-source LLM inference engine. Our mission is to grow vLLM as the world's AI inference engine and accelerate AI progress by making inference cheaper and faster.

The Challenge
Inference is not solved. It's getting harder. Models grow larger. New architectures proliferate: mixture-of-experts, multimodal, agentic. Every breakthrough demands new infrastructure. Meanwhile, hardware fragments: more accelerators, more programming models, and more combinations to optimize. The capability gap between models and the systems that serve them is widening. Left unaddressed, the most capable models remain bottlenecked, with the full scope of their capabilities accessible only to those who can build custom infrastructure. Close the gap, and we unlock new possibilities. And the problem is growing. Inference is shifting from a fraction of compute to the majority: test-time compute, RL training loops, synthetic data.

We see a future where serving AI becomes effortless. Today, deploying a frontier model at scale requires a dedicated infrastructure team. Tomorrow, it should be as simple as spinning up a serverless database. The complexity doesn't disappear; it gets absorbed into the infrastructure we're building.

Why Us
vLLM sits at the intersection of models and hardware: a position that took years to build. When model vendors ship new architectures, they work with us to ensure day-zero support. When hardware vendors develop new silicon, they integrate with vLLM. When teams deploy at scale, they run vLLM, from frontier labs to hyperscalers to startups serving millions of users. Today, vLLM supports 500+ model architectures, runs on 200+ accelerator types, and powers inference at global scale. This ecosystem, built with 2,000+ contributors, is our foundation. We've been stewards of this engine since its first commit. We know it inside out. We deployed it at frontier scale, in research and in production.

Open Source
vLLM was built in the open. That's not changing. Inferact exists to supercharge vLLM adoption. The optimizations we develop flow back to the community. We plan to push vLLM's performance further, deepen support for emerging model architectures, and expand coverage across frontier hardware. The AI industry needs inference infrastructure that isn't locked behind proprietary walls.

Join Us
Through the open source community, we are fortunate to work with some of the best people we know. For @inferact, we're hiring engineers and researchers to work at the frontier of inference, where models meet hardware at scale. Come build with us.

We're fortunate to be supported by investors who share our vision, including @a16z and @lightspeedvp, who led our $150M seed, as well as @sequoia, @AltimeterCap, @Redpoint, @ZhenFund, The House Fund, @strikervp, @LaudeVentures, and @databricks.

- @woosuk_k, @simon_mo_, @KaichaoYou, @rogerw0108, @istoica05 and the rest of the founding team

0 replies · 0 reposts · 3 likes · 294 views
James Liu retweeted
Shengjie Wang @ShengjieWa34067
I can add some intuition that Zonghan @yang_zonghan and I have about the difference in pass@N behavior between multi-turn and single-turn RL. Prior work on single-turn reasoning models shows that RL does not improve pass@N (e.g., Yang Yue @YangYue_THU et al., arxiv.org/pdf/2504.13837). We suspect this is because most knowledge capacity is already fixed during pre-training, making it hard for the model to acquire new knowledge through RL exploration. In contrast, multi-turn RL allows the model to learn how to interact with the training environment—capabilities that are not introduced during pre-training—which may explain the observed gains in pass@N. If true, this suggests a useful separation of roles: pre-training for core knowledge, and post-training for learning effective interaction with diverse environments. I’m excited to see follow-up work in this direction. (The following figure is powered by Nano Banana.)
Shengjie Wang @ShengjieWa34067

@srush_nlp Yeah, in multi-turn RL experiments, we actually see pass@N increase with the number of training steps. Maybe you can take a look at our discussion. x.com/ShengjieWa3406…

13 replies · 18 reposts · 208 likes · 39.1K views
James Liu retweeted
Locke Cai @couplefire12
RL for reasoning often relies on verifiers: great for math, but tricky for creative writing or open-ended research. Meet RARO: a new paradigm that teaches LLMs to reason via adversarial games instead of verification. No verifiers. No environments. Just demonstrations. 🧵👇
23 replies · 78 reposts · 609 likes · 177.6K views
Daniel Liu @daniel_c0deb0t
Want to go to ML confs but it'll just be sad cuz I can't say much
3 replies · 0 reposts · 27 likes · 3.1K views
James Liu retweeted
James Bradbury @jekbradbury
opus 4.5 is really good at GPU programming, but somehow it’s even better at GPU programming jokes (h/t @Si_Boehm)
21 replies · 47 reposts · 541 likes · 83.5K views
James Liu retweeted
Anthropic @AnthropicAI
For the first time, Anthropic is building its own AI infrastructure. We’re constructing data centers in Texas and New York that will create thousands of American jobs. This is a $50 billion investment in America. anthropic.com/news/anthropic…
209 replies · 409 reposts · 4.3K likes · 713.9K views
James Liu retweeted
William Hu @_williamhu
AI is compute-hungry. While the field has generally relied on a single hardware vendor in the past, AMD GPUs now offer competitive memory and compute throughput. Yet the software stack is brittle. So we ask: can the same DSL principles that simplified NVIDIA kernel dev translate to AMD? We're excited to introduce the newest addition to the ThunderKittens cinematic universe of kernel DSLs: HipKittens (HK) 🚀 for Fast and Furious AMD kernels.
7 replies · 37 reposts · 160 likes · 52.9K views
James Liu @JamesLiuID
congrats on the launch!!
Applied Compute @appliedcompute

Generalists are useful, but it's not enough to be smart. Advances come from specialists, whether human or machine. To have an edge, agents need specific expertise, within specific companies, built on models trained on specific data. We call this Specific Intelligence. It's what we're building at Applied Compute.

We unlock the latent knowledge inside a company, use it to train custom models, and deploy an in-house agent workforce that reports to your team. We work with sophisticated companies that have already captured early gains from general models, like @cognition, @DoorDash, and @mercor_ai. They're pulling even further ahead with proprietary in-house agents that don't need to wait for the next public model release. Together, we are building and validating models and agents in days instead of months, achieving state-of-the-art performance on customer evals.

Our team has high density and low latency. Our founders all worked on different parts of this problem while they were researchers at OpenAI: @ypatil125 as a key member on the agentic software engineer effort (Codex), @rhythmrg as a core contributor to the first RL-trained reasoning model (o1), and @lindensli as a core contributor on ML systems and infrastructure for RL training. Two-thirds of the team are former founders, and everyone brings a deep technical background, from top AI researchers to Math Olympiad winners.

We are backed by $80M in funding from Benchmark, Sequoia, Lux, Elad Gil, Victor Lazarte, Omri Casspi, and others. With their support, we are growing the team, scaling deployments, and bringing to market the first generation of agent workforces built on specific models.

In short:
1. We are building Specific Intelligence for specific work at specific companies.
2. That will power in-house agent workforces to support their human bosses.
3. That in turn will unlock AI's full potential through humanity's greatest engine of progress: thriving corporations in a free market.

1 reply · 0 reposts · 8 likes · 696 views
James Liu retweeted
tender @tenderizzation
"introducing our new super easy to use DSL that is guaranteed to extract just as much performance out of the hardware without the headache of traditional lower level languages"
38 replies · 223 reposts · 5.8K likes · 489.9K views
James Liu @JamesLiuID
nice
Adit @aditabrm

I’m thrilled to announce @reductoai’s $75M Series B led by @a16z, which brings our total funding to $108M. Just five months after our Series A, we've surpassed 1 billion pages processed and grown our monthly volume 6x. We now process hundreds of millions of pages every month for some of the world's best AI teams. Here's what we've learned and where we're headed 🧵

0 replies · 0 reposts · 5 likes · 462 views
James Liu @JamesLiuID
@HeMuyu0327 what's the loss with/without this intervention?
1 reply · 0 reposts · 0 likes · 629 views
Muyu He @HeMuyu0327
The original attention sink paper finds that the sink token occurs regardless of semantic content, but it simply assumes that the absolute position of the token is what causes the sink. I did some layer sweep experiments and the results are interesting. By changing the positional encoding of the first few tokens to some much larger numbers, we see:
- the sink disappears in the first few tokens
- it reappears in the **next available token with the smallest position** in the sequence
- the same sink is unambiguously chosen by all query tokens in the sequence
- in a model with 32 layers (Qwen3-8B), we see early formation of the sink even in layers 4 and 8
So the model DOES leverage the position of tokens to figure out what should be the sink, but it is not really the absolute position; it is the relative position. It is able to figure out the smallest index within a sequence when indices 0, 1, 2... are unavailable. Fascinating!!
7 replies · 21 reposts · 266 likes · 23.2K views
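For readers who want to poke at this themselves, here is a rough sketch of the kind of probe described above using Hugging Face transformers: shift the RoPE positions of the first few tokens to much larger values, then check which key position soaks up the attention mass. The checkpoint, layer index, and position offset are illustrative assumptions, not the author's exact experimental setup; any RoPE-based decoder-only model will do.

```python
# Sketch: move the first few tokens to large positions and inspect where attention concentrates.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-8B"  # assumed checkpoint; a smaller RoPE model also works
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16,
                                             attn_implementation="eager")

inputs = tok("The quick brown fox jumps over the lazy dog.", return_tensors="pt")
seq_len = inputs["input_ids"].shape[1]

# Default positions are 0..N-1; push the first 3 tokens far away (offset is arbitrary).
position_ids = torch.arange(seq_len).unsqueeze(0)
position_ids[:, :3] += 10_000

with torch.no_grad():
    out = model(**inputs, position_ids=position_ids, output_attentions=True)

# Average attention each key position receives across heads and queries in an early layer.
attn = out.attentions[8].float().mean(dim=1)[0]        # (query, key)
per_key_mass = attn.mean(dim=0)
print("attention mass per key position:", per_key_mass)
print("candidate sink position:", per_key_mass.argmax().item())
```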