Anjiang Wei
@anjiangw · 69 posts

CS PhD student @Stanford. LLMs for program reasoning and optimization. Advised by Alex Aiken. Undergraduate @PKU1898

Stanford, CA · Joined January 2022
322 Following · 346 Followers
Anjiang Wei retweeted
Hao Wang @MogicianTony
Benchmarks are often easier to game than they look. We build BenchJack to audit benchmarks for hidden shortcuts and reward hacks — before they evaluate your agent. Now in preview. Fully open source, with support for auditing your own benchmarks too. github.com/benchjack/benc… Issues and PRs welcome.
Hao Wang@MogicianTony

SWE-bench Verified and Terminal-Bench—two of the most cited AI benchmarks—can be reward-hacked with simple exploits. Our agent scored 100% on both. It solved 0 tasks. Evaluate the benchmark before it evaluates your agent. If you’re picking models by leaderboard score alone, you’re optimizing for the wrong thing. 🧵

Hongyu Ren @ren_hongyu
Check out Muse Spark, our first milestone in the quest for personal superintelligence! Scaling this with the team has been a total blast. Give it a spin and let us know what you think! 🥑
Anjiang Wei retweeted
Hanchen Li @ ICLR @lihanc02
Prompt learning does not scale for parallel agents. More parallel agents 🤖 = worse prompts 😭
Why? Processing too many trajectories concurrently damages the prompt update process.
🐝 We fix this with Combee:
→ preserves the high-quality learnt system prompt
→ scales to more than 80 concurrent agents
→ up to 17× speedup without quality drop on top of ACE and GEPA
🥽 Use cases:
1. Prompt learning on large-scale collected agent traces
2. Online parallel agent learning with fast knowledge sharing
Read more below to learn how agents actually learn at scale ⬇️
Anjiang Wei retweeted
Matt Dancho (Business Science)
This is huge. A group of 50 AI researchers (ByteDance, Alibaba, Tencent + universities) just dropped a 303-page field guide on code models + coding agents. And the takeaways are not what most people assume. Here are the highlights I’m thinking about (as someone who lives in Python + agents):
Anjiang Wei retweeted
Anne Ouyang @anneouyang
Excited to share @Standard_Kernel's seed round and some reflections on what we’ve learned about kernel generation and what we believe is next. Grateful to our amazing team, supporters, and the broader community pushing this space forward.
Baifeng @baifeng_shi
Life update: joined @physical_int this week. Excited about what 🤖 we will build!
Anjiang Wei retweeted
机器之心 JIQIZHIXIN @jiqizhixin
What if you could mathematically prove an AI's outputs are safe, instead of just hoping they are? Enter BEAVER, a framework that gives deterministic, sound probability bounds on whether an LLM will violate a given constraint, offering guarantees where sampling only provides intuition.
BEAVER: An Efficient Deterministic LLM Verifier
Paper: arxiv.org/abs/2512.05439
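For intuition on what "deterministic, sound probability bounds" could mean, here is a toy reconstruction of the general idea (my sketch only; the vocabulary, i.i.d. token model, and pruning rule are invented, and this is not BEAVER's actual algorithm): walk the token tree deterministically, credit the mass of prefixes that provably violate the constraint, and let any unexplored mass widen the reported interval instead of being sampled away.

```python
VOCAB = {"safe": 0.6, "ok": 0.3, "unsafe": 0.1}  # toy i.i.d. next-token model
LENGTH = 3  # bound sequence length so the tree is finite

def violates(tokens):
    # Toy constraint: the output must never contain the token "unsafe".
    return "unsafe" in tokens

def violation_bounds(eps=1e-3):
    lo = 0.0          # mass proven to violate the constraint
    unexplored = 0.0  # mass we declined to explore; widens the interval
    stack = [((), 1.0)]
    while stack:
        prefix, p = stack.pop()
        if violates(prefix):
            lo += p       # constraint is monotone: every completion violates
            continue
        if len(prefix) == LENGTH:
            continue      # complete sequence, provably safe
        if p < eps:
            unexplored += p   # pruned mass is counted as unknown, not ignored
            continue
        for tok, q in VOCAB.items():
            stack.append((prefix + (tok,), p * q))
    return lo, lo + unexplored  # sound lower/upper bounds on P(violation)

print(violation_bounds())  # tight here, since nothing is pruned: 1 - 0.9**3 = 0.271
```

Sampling would only estimate this probability; the point of a deterministic walk is that the returned interval is guaranteed to contain the true value.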
Anjiang Wei retweeted
Ying Sheng @ying11231
We've been running @radixark for a few months, started by many core developers of SGLang @lmsysorg and its extended ecosystem (slime @slime_framework, AReaL @jxwuyi).

I left @xai in August, a place where I built deep attachments and countless beautiful memories. It was the best place I've ever worked, the place I watched grow from a few dozen people to hundreds, and it truly felt like home. What pushed me to make such a hard decision was the momentum of building SGLang in the open and the mission of creating an ambitious future, within the open spirit I learned at my first job at @databricks after my PhD.

We started SGLang in the summer of 2023 and made it public in January 2024. Over the past two years, hundreds of people have worked hard to bring it to where it is today, through several waves of growth after its first release. I still remember the many dark nights in the summer of 2024 that I spent debugging with @lm_zheng, @lsyincs, and @zhyncs42, while @ispobaoke single-handedly took on DeepSeek inference optimizations and @GenAI_is_real and the community strike team tag-teamed on-call shifts non-stop. There are many more contributors than I have space to name, but they are recorded on the GitHub contributor list forever.

Demand has grown exponentially, and it has pushed us to make SGLang a dedicated effort supported by RadixArk. It's the step-by-step journey of a thousand miles that has carried us here, and the same relentless Long March that will lead us into the tens of thousands of miles yet to come. The story never stops growing.

Over the past year, we've seen something very clearly: the world is full of people eager to build AI, but the infrastructure that makes it possible is not shared. The most advanced inference and training stacks live inside a few companies. Everyone else is forced to rebuild the same schedulers, compilers, serving engines, and training pipelines again and again, often under enormous pressure, with duplicated effort and wasted insight.

RadixArk was born to change that. Today, we're building an infrastructure-first, deep-tech company with a simple and ambitious mission: "Make frontier-level AI infrastructure open and accessible to everyone."

If the two values below resonate with you, come talk to us:
(1) Engineering as an art. Infrastructure is a first-class citizen at RadixArk. We care about elegant design and code that lasts. Beneath every line of code lies the soul of the engineer who wrote it.
(2) A belief in openness. We share what we build. We bet on long-term compounding through community, contribution, and giving more than we take. A product is defined by its users, yet it truly comes alive the moment functionality transcends mere utility and begins to embody aesthetics.

Thanks to all the miles (the name of our first released RL framework; see below). radixark.ai
Anjiang Wei retweeted
Azalia Mirhoseini @Azaliamirh
Thrilled to share that @annadgoldie and I are launching @RicursiveAI, a frontier lab enabling recursive self-improvement through AIs that design their own chips. Our vision for transforming chip design began with AlphaChip, an AI for layout optimization used to design four generations of TPUs, data center CPUs, and smartphones. AlphaChip offered a glimpse into a future where AI designs the silicon that fuels it. Ricursive extends this vision to the entire chip stack, building AI that architects, verifies, and implements silicon, enabling models and chips to co-evolve in a tight loop. We sat down with WSJ’s @berber_jin1 to discuss Ricursive: wsj.com/tech/this-ai-s…
Ricursive Intelligence@RicursiveAI

Introducing Ricursive Intelligence, a frontier AI lab enabling a recursive self-improvement loop between AI and the chips that fuel it. Learn more at ricursive.com

Anjiang Wei retweeted
Shiyi Cao @shiyi_c98
1/n 🚀 Introducing SkyRL-Agent, a framework for efficient RL agent training.
⚡ 1.55× faster async rollout dispatch
🛠 Lightweight tool + task integration
🔄 Backend-agnostic (SkyRL-train / VeRL / Tinker)
🏆 Used to train SA-SWE-32B, improving Qwen3-32B from 24.4% → 39.4% Pass@1 on SWE-Bench Verified with >2× lower cost
GitHub: github.com/NovaSky-AI/Sky…
Paper: arxiv.org/pdf/2511.16108
👇 more details
Anjiang Wei retweeted
Allen Nie (🇺🇦☮️) @allenainie
Gemini 3 showed the advantage of TPUs for training ever-larger frontier models. AWS is building the fastest-growing compute centers with Trainium 2 chips under Project Rainier (1M chips) for Anthropic Claude. In collaboration with Jiin Woo, @zhang677, @ShaoweiZhu95pu, @anjiangw, @sunny_szy, and the AWS Neuron Science team, I'm helping release two papers on agentic backbone building and RL post-training for kernels!
📝 AccelOpt: arxiv.org/abs/2511.15915 (optimizing kernels for Trainium 1 and 2) (📈🧵1/6)
📝 TritonRL: arxiv.org/abs/2510.17891 (RL post-training for Triton kernels) (🔱🧵1/7)
Anjiang Wei retweeted
Genghan Zhang @zhang677
🚀 AccelOpt, a self-improving LLM agentic system for AI accelerator kernel optimization.
📈 Boosts utilization from 49→61% on Trainium1 and 45→59% on Trainium2 using open-source models, matching Claude Sonnet 4 while being 26× cheaper
Paper: arxiv.org/pdf/2511.15915
Anjiang Wei @anjiangw
I’ll be presenting SATBench at EMNLP 2025!
🗓️ Thursday, Nov. 6, 16:30–18:00
📍 Hall C, Session 11
📄 Paper ID: 4150-Main
Excited to share how SATBench turns Boolean SAT problems into search-based logical puzzles in natural language!
Anjiang Wei@anjiangw

We introduce SATBench to evaluate LLMs' logical reasoning via puzzles generated from Boolean satisfiability problems 🧩. SATBench tests search-based reasoning: finding truth assignments that satisfy the constraints 🔍
📄 arxiv.org/abs/2505.14615
💻 github.com/Anjiang-Wei/SA…
#LLM #Reasoning #SAT
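The quoted thread describes SATBench's core move: rendering a CNF formula as a natural-language search puzzle. A minimal sketch of that conversion (the guest-list framing, function name, and phrasing are invented for illustration, not SATBench's actual templates):

```python
def cnf_to_puzzle(clauses, names):
    """Render a CNF formula as a natural-language constraint puzzle.

    clauses: list of clauses, each a list of 1-based literals
             (positive = variable true, negative = variable false).
    names:   one entity name per variable.
    """
    lines = ["A party has the following guests: " + ", ".join(names) + "."]
    for clause in clauses:
        opts = []
        for lit in clause:
            name = names[abs(lit) - 1]
            opts.append(f"{name} attends" if lit > 0 else f"{name} stays home")
        lines.append("At least one of these holds: " + ", or ".join(opts) + ".")
    lines.append("Which guests can attend so that every condition holds?")
    return "\n".join(lines)

# (x1 ∨ ¬x2) ∧ (x2 ∨ x3) becomes a two-condition puzzle about three guests.
puzzle = cnf_to_puzzle([[1, -2], [2, 3]], ["Alice", "Bob", "Carol"])
print(puzzle)
```

An LLM answering the puzzle has to implicitly search for a satisfying assignment, which is what lets the benchmark probe search-based reasoning rather than surface pattern matching.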

Anjiang Wei @anjiangw
I’ll be presenting EquiBench at EMNLP 2025!
🗓️ Friday, Nov. 7, 14:00–15:30
📍 Hall C, Session 15
📄 Paper ID: 4154-Main
How well can language models reason about program semantics? Stop by if you’re into LLMs and program analysis!
Anjiang Wei@anjiangw

🚨 New benchmark drop: EquiBench 🚨
We introduce equivalence checking as a rigorous test of LLMs’ code reasoning ability, featuring 4 languages, 6 categories, and 2,400 program pairs. Top models still struggle with this task.
🔗 Website: anjiang-wei.github.io/EquiBench-Webs…
📝 Preprint: arxiv.org/pdf/2502.12466
💻 Code: github.com/Anjiang-Wei/eq…
📊 Dataset: huggingface.co/datasets/anjia…
#AI #LLM #LLM4Code #Reasoning #EquiBench
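The quoted tweet frames equivalence checking as a code-reasoning task. A toy illustration of what a program pair looks like (my own example, not drawn from the EquiBench dataset), with a randomized differential check; testing can find counterexamples but can never prove equivalence, and that proof step is what the benchmark asks models to reason about:

```python
import random

def sum_v1(n):
    # Iterative version: accumulate 1 + 2 + ... + n.
    total = 0
    for i in range(1, n + 1):
        total += i
    return total

def sum_v2(n):
    # Closed form: semantically equivalent for all n >= 0.
    return n * (n + 1) // 2

# Differential testing: evidence of equivalence, not a proof.
for _ in range(100):
    n = random.randint(0, 1000)
    assert sum_v1(n) == sum_v2(n)
print("no counterexample found in 100 random trials")
```

A model judging such a pair must reason that both functions compute the same value on every input, rather than compare syntax or run a handful of tests.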
