James Liu
@JamesLiuID
63 posts
@AnthropicAI, prev @MIT, @togethercompute

Boston, MA · Joined April 2022
231 Following · 434 Followers
Pinned Tweet
James Liu @JamesLiuID
Your LLM may be sparser than you thought! Excited to announce TEAL, a simple training-free method that achieves 40-50% model-wide activation sparsity on Llama-2/3 and Mistral models. Combined with a custom kernel, we achieve end-to-end speedups of 1.53x-1.8x!
6 replies · 63 reposts · 427 likes · 109K views
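As a rough illustration of the idea behind training-free activation sparsity, the sketch below zeroes low-magnitude activations before a linear layer's matmul. The per-tensor quantile calibration, the 50% target, and the helper names are assumptions for illustration rather than TEAL's exact procedure, and real speedups additionally require a sparsity-aware GEMM kernel.

```python
# Minimal sketch of magnitude-based activation sparsity (illustrative, not TEAL's exact method).
import torch

def calibrate_threshold(activations: torch.Tensor, target_sparsity: float) -> float:
    """Pick a magnitude cutoff so that `target_sparsity` of entries fall below it."""
    return torch.quantile(activations.abs().float().flatten(), target_sparsity).item()

def sparsify(x: torch.Tensor, threshold: float) -> torch.Tensor:
    """Zero activations whose magnitude is below the calibrated threshold."""
    return torch.where(x.abs() >= threshold, x, torch.zeros_like(x))

# Toy usage: sparsify the input to a linear layer at ~50% activation sparsity.
layer = torch.nn.Linear(4096, 11008, bias=False)
x = torch.randn(8, 4096)                       # stand-in for hidden states
thr = calibrate_threshold(x, target_sparsity=0.5)
x_sparse = sparsify(x, thr)
y = layer(x_sparse)
print(f"activation sparsity: {(x_sparse == 0).float().mean().item():.2%}")
```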
James Liu retweeted
Ted Zadouri @tedzadouri
Asymmetric hardware scaling is here. Blackwell tensor cores are now so fast that exp2 and shared memory are the wall. FlashAttention-4 changes the algorithm & pipeline so that softmax & SMEM bandwidth no longer dictate speed. Attn reaches ~1600 TFLOPs, pretty much at matmul speed! Joint work w/ Markus Hoehnerbach, Jay Shah (@ultraproduct), Timmy Liu, Vijay Thakkar (@__tensorcore__), Tri Dao (@tri_dao) 1/
7 replies · 131 reposts · 783 likes · 227.3K views
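A back-of-envelope sketch of why the exponential can become the wall once matmuls get this fast: each attention score costs roughly 4·d matmul FLOPs (QK^T plus PV) but only one exp. The tensor-core-to-SFU throughput ratio below is a placeholder assumption, not a measured Blackwell figure; the point is only that once that ratio is large, softmax work rivals the matmuls unless the algorithm hides or reduces the exponentials.

```python
# Rough arithmetic: matmul work vs. exponential work per attention score.
head_dim = 128
matmul_flops_per_score = 4 * head_dim     # 2*d FLOPs for QK^T + 2*d FLOPs for PV
exp_ops_per_score = 1

# Hypothetical ratio: matmul FLOPs completed in the time of one exp2 on the SFUs.
assumed_flops_per_exp = 512               # assumption, not a vendor-published number

matmul_time = matmul_flops_per_score      # measured in "FLOP-equivalents"
exp_time = exp_ops_per_score * assumed_flops_per_exp
print(f"exp2 share of compute time: {exp_time / (exp_time + matmul_time):.0%}")
```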
James Liu retweeted
Luke J. Huang @whatthelukh
We introduce Variance Controlled Policy Optimization (VCPO), a method that adds explicit variance-targeted control to policy-gradient objectives in off-policy RL, enabling stable, scalable Async RL training.
✨ Seamlessly integrates into common policy-gradient methods like REINFORCE/RLOO/GRPO
🚀 2.5x faster Async RL training while matching Synchronous RL performance
🧠 Robust training stability under highly off-policy settings (at least 128 steps off-policy)
📄 Paper: arxiv.org/abs/2602.17616
🔗 Code: github.com/mit-han-lab/vc…
🧵👇
3 replies · 12 reposts · 68 likes · 11.6K views
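To make "variance-targeted control" concrete, here is a minimal sketch of one way an explicit variance term could be layered onto a REINFORCE-style off-policy objective. The penalty form (squared gap between the empirical variance of the per-sample terms and a target) and all names are illustrative assumptions only; this is not VCPO's actual formulation, for which see the paper linked above.

```python
# Illustrative variance-controlled off-policy REINFORCE loss (not VCPO's exact objective).
import torch

def variance_controlled_loss(logp_new, logp_old, advantages,
                             target_var: float = 1.0, beta: float = 0.1):
    """logp_new/logp_old: per-sample sequence log-probs; advantages: per-sample returns."""
    ratios = torch.exp(logp_new - logp_old.detach())        # importance weights
    per_sample = -ratios * advantages                        # off-policy REINFORCE terms
    loss = per_sample.mean()
    var_penalty = (per_sample.var() - target_var).pow(2)     # drive estimator variance toward target
    return loss + beta * var_penalty

# Toy usage with random stand-ins for policy log-probs and advantages.
logp_new = torch.randn(32, requires_grad=True)
loss = variance_controlled_loss(logp_new, torch.randn(32), torch.randn(32))
loss.backward()
```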
James Liu retweeted
Adam Zweiger @AdamZweiger
We introduce a new approach for fast and high-quality context compaction in latent space. Attention Matching (AM) achieves 50× compaction in seconds with little performance loss, substantially outperforming summarization and other baselines.
23 replies · 148 reposts · 944 likes · 129.9K views
James Liu @JamesLiuID
congrats on the launch!!
Woosuk Kwon @woosuk_k

Today, we're proud to announce @inferact, a startup founded by creators and core maintainers of @vllm_project, the most popular open-source LLM inference engine. Our mission is to grow vLLM as the world's AI inference engine and accelerate AI progress by making inference cheaper and faster.

The Challenge
Inference is not solved. It's getting harder. Models grow larger. New architectures proliferate: mixture-of-experts, multimodal, agentic. Every breakthrough demands new infrastructure. Meanwhile, hardware fragments: more accelerators, more programming models, and more combinations to optimize. The capability gap between models and the systems that serve them is widening. Left unaddressed, the most capable models remain bottlenecked, with the full scope of their capabilities accessible only to those who can build custom infrastructure. Close the gap, and we unlock new possibilities. And the problem is growing. Inference is shifting from a fraction of compute to the majority: test-time compute, RL training loops, synthetic data.

We see a future where serving AI becomes effortless. Today, deploying a frontier model at scale requires a dedicated infrastructure team. Tomorrow, it should be as simple as spinning up a serverless database. The complexity doesn't disappear; it gets absorbed into the infrastructure we're building.

Why Us
vLLM sits at the intersection of models and hardware: a position that took years to build. When model vendors ship new architectures, they work with us to ensure day-zero support. When hardware vendors develop new silicon, they integrate with vLLM. When teams deploy at scale, they run vLLM, from frontier labs to hyperscalers to startups serving millions of users. Today, vLLM supports 500+ model architectures, runs on 200+ accelerator types, and powers inference at global scale. This ecosystem, built with 2,000+ contributors, is our foundation. We've been stewards of this engine since its first commit. We know it inside out. We deployed it at frontier scale, in research and in production.

Open Source
vLLM was built in the open. That's not changing. Inferact exists to supercharge vLLM adoption. The optimizations we develop flow back to the community. We plan to push vLLM's performance further, deepen support for emerging model architectures, and expand coverage across frontier hardware. The AI industry needs inference infrastructure that isn't locked behind proprietary walls.

Join Us
Through the open source community, we are fortunate to work with some of the best people we know. For @inferact, we're hiring engineers and researchers to work at the frontier of inference, where models meet hardware at scale. Come build with us.

We're fortunate to be supported by investors who share our vision, including @a16z and @lightspeedvp, who led our $150M seed, as well as @sequoia, @AltimeterCap, @Redpoint, @ZhenFund, The House Fund, @strikervp, @LaudeVentures, and @databricks.

- @woosuk_k, @simon_mo_, @KaichaoYou, @rogerw0108, @istoica05 and the rest of the founding team

0 replies · 0 reposts · 3 likes · 294 views
James Liu retweeted
Shengjie Wang @ShengjieWa34067
I can add some intuition that Zonghan @yang_zonghan and I have about the difference in pass@N behavior between multi-turn and single-turn RL. Prior work on single-turn reasoning models shows that RL does not improve pass@N (e.g., Yang Yue @YangYue_THU et al., arxiv.org/pdf/2504.13837). We suspect this is because most knowledge capacity is already fixed during pre-training, making it hard for the model to acquire new knowledge through RL exploration. In contrast, multi-turn RL allows the model to learn how to interact with the training environment—capabilities that are not introduced during pre-training—which may explain the observed gains in pass@N. If true, this suggests a useful separation of roles: pre-training for core knowledge, and post-training for learning effective interaction with diverse environments. I’m excited to see follow-up work in this direction. (The following figure is powered by Nano Banana.)
Shengjie Wang @ShengjieWa34067

@srush_nlp Yeah, in multi-turn RL experiments, we actually see pass@N increase with the number of training steps. Maybe you can take a look at our discussion. x.com/ShengjieWa3406…

13 replies · 18 reposts · 208 likes · 39.1K views
James Liu retweeted
Locke Cai @couplefire12
RL for reasoning often relies on verifiers: great for math, but tricky for creative writing or open-ended research. Meet RARO: a new paradigm that teaches LLMs to reason via adversarial games instead of verification. No verifiers. No environments. Just demonstrations. 🧵👇
23 replies · 78 reposts · 609 likes · 177.6K views
Daniel Liu @daniel_c0deb0t
Want to go to ML confs but it'll just be sad cuz I can't say much
3 replies · 0 reposts · 27 likes · 3.1K views
James Liu retweeted
James Bradbury @jekbradbury
opus 4.5 is really good at GPU programming, but somehow it’s even better at GPU programming jokes (h/t @Si_Boehm)
21 replies · 47 reposts · 541 likes · 83.5K views
James Liu retweeted
Anthropic @AnthropicAI
For the first time, Anthropic is building its own AI infrastructure. We’re constructing data centers in Texas and New York that will create thousands of American jobs. This is a $50 billion investment in America. anthropic.com/news/anthropic…
209 replies · 409 reposts · 4.3K likes · 713.9K views
James Liu retweeted
William Hu @_williamhu
AI is compute-hungry. While the field has generally relied on a single hardware vendor in the past, AMD GPUs now offer competitive memory and compute throughput. Yet the software stack is brittle. So we ask: can the same DSL principles that simplified NVIDIA kernel dev translate to AMD? We're excited to introduce the newest addition to the ThunderKittens cinematic universe of kernel DSLs: HipKittens (HK) 🚀 for Fast and Furious AMD kernels.
7 replies · 37 reposts · 160 likes · 52.9K views
James Liu @JamesLiuID
congrats on the launch!!
Applied Compute @appliedcompute

Generalists are useful, but it's not enough to be smart. Advances come from specialists, whether human or machine. To have an edge, agents need specific expertise, within specific companies, built on models trained on specific data. We call this Specific Intelligence. It's what we're building at Applied Compute.

We unlock the latent knowledge inside a company, use it to train custom models, and deploy an in-house agent workforce that reports to your team. We work with sophisticated companies that have already captured early gains from general models, like @cognition, @DoorDash, and @mercor_ai. They're pulling even further ahead with proprietary in-house agents that don't need to wait for the next public model release. Together, we are building and validating models and agents in days instead of months, achieving state-of-the-art performance on customer evals.

Our team has high density and low latency. Our founders all worked on different parts of this problem while they were researchers at OpenAI: @ypatil125 as a key member on the agentic software engineer effort (Codex), @rhythmrg as a core contributor to the first RL-trained reasoning model (o1), and @lindensli as a core contributor on ML systems and infrastructure for RL training. Two-thirds of the team are former founders, and everyone brings a deep technical background, from top AI researchers to Math Olympiad winners.

We are backed by $80M in funding from Benchmark, Sequoia, Lux, Elad Gil, Victor Lazarte, Omri Casspi, and others. With their support, we are growing the team, scaling deployments, and bringing to market the first generation of agent workforces built on specific models.

In short:
1. We are building Specific Intelligence for specific work at specific companies.
2. That will power in-house agent workforces to support their human bosses.
3. That in turn will unlock AI's full potential through humanity's greatest engine of progress: thriving corporations in a free market.

1 reply · 0 reposts · 8 likes · 696 views
James Liu retweeted
tender @tenderizzation
"introducing our new super easy to use DSL that is guaranteed to extract just as much performance out of the hardware without the headache of traditional lower level languages"
38 replies · 223 reposts · 5.8K likes · 489.9K views
James Liu @JamesLiuID
nice
Adit @aditabrm

I’m thrilled to announce @reductoai’s $75M Series B led by @a16z, which brings our total funding to $108M. Just five months after our Series A, we've surpassed 1 billion pages processed and grown our monthly volume 6x. We now process hundreds of millions of pages every month for some of the world's best AI teams. Here's what we've learned and where we're headed 🧵

0 replies · 0 reposts · 5 likes · 462 views
James Liu @JamesLiuID
@HeMuyu0327 what's the loss with/without this intervention?
1 reply · 0 reposts · 0 likes · 629 views
Muyu He @HeMuyu0327
The original attention sink paper finds that the sink token occurs regardless of semantic content, but it simply assumes that the absolute position of the token is what causes the sink. I did some layer sweep experiments and the results are interesting. By changing the positional encoding of the first few tokens to some much larger numbers, we see:
- the sink disappears in the first few tokens
- it reappears in the **next available token with the smallest position** in the sequence
- the same sink is unambiguously chosen by all query tokens in the sequence
- in a model with 32 layers (Qwen3-8B), we see early formation of the sink even in layers 4 and 8
So the model DOES leverage the position of tokens to figure out what should be the sink, but it is not really the absolute position; it is the relative position. It is able to figure out the smallest index within a sequence when indices 0, 1, 2... are unavailable. Fascinating!!
7 replies · 21 reposts · 266 likes · 23.2K views
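For readers who want to poke at this themselves, here is a rough sketch of the kind of probe described above using Hugging Face transformers: shift the RoPE positions of the first few tokens to much larger values, then check which key position soaks up the attention mass. The checkpoint, layer index, and position offset are illustrative assumptions, not the author's exact experimental setup; any RoPE-based decoder-only model will do.

```python
# Sketch: move the first few tokens to large positions and inspect where attention concentrates.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-8B"  # assumed checkpoint; a smaller RoPE model also works
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16,
                                             attn_implementation="eager")

inputs = tok("The quick brown fox jumps over the lazy dog.", return_tensors="pt")
seq_len = inputs["input_ids"].shape[1]

# Default positions are 0..N-1; push the first 3 tokens far away (offset is arbitrary).
position_ids = torch.arange(seq_len).unsqueeze(0)
position_ids[:, :3] += 10_000

with torch.no_grad():
    out = model(**inputs, position_ids=position_ids, output_attentions=True)

# Average attention each key position receives across heads and queries in an early layer.
attn = out.attentions[8].float().mean(dim=1)[0]        # (query, key)
per_key_mass = attn.mean(dim=0)
print("attention mass per key position:", per_key_mass)
print("candidate sink position:", per_key_mass.argmax().item())
```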