Anjiang Wei
@anjiangw · 69 posts

CS PhD student @Stanford. LLMs for program reasoning and optimization. Advised by Alex Aiken. Undergraduate @PKU1898

Stanford, CA · Joined January 2022
322 Following · 346 Followers
Anjiang Wei retweeted
Hao Wang @MogicianTony
Benchmarks are often easier to game than they look. We build BenchJack to audit benchmarks for hidden shortcuts and reward hacks — before they evaluate your agent. Now in preview. Fully open source, with support for auditing your own benchmarks too. github.com/benchjack/benc… Issues and PRs welcome.
Hao Wang@MogicianTony

SWE-bench Verified and Terminal-Bench—two of the most cited AI benchmarks—can be reward-hacked with simple exploits. Our agent scored 100% on both. It solved 0 tasks. Evaluate the benchmark before it evaluates your agent. If you’re picking models by leaderboard score alone, you’re optimizing for the wrong thing. 🧵

Hongyu Ren @ren_hongyu
Check out Muse Spark, our first milestone in the quest for personal superintelligence! Scaling this with the team has been a total blast. Give it a spin and let us know what you think! 🥑
Anjiang Wei retweeted
Hanchen Li @ ICLR @lihanc02
Prompt learning does not scale for parallel agents. More parallel agents 🤖 = worse prompts 😭
Why? Processing too many trajectories concurrently damages the prompt update process.
🐝 We fix this with Combee:
→ preserves the high-quality learnt system prompt
→ scales to more than 80 concurrent agents
→ up to 17× speedup without quality drop on top of ACE and GEPA
🥽 Use cases:
1. Prompt learning on large-scale collected agent traces
2. Online parallel agent learning with fast knowledge sharing
Read more below to learn how agents actually learn at scale ⬇️
Anjiang Wei retweeted
Matt Dancho (Business Science)
This is huge. A group of 50 AI researchers (ByteDance, Alibaba, Tencent + universities) just dropped a 303-page field guide on code models + coding agents. And the takeaways are not what most people assume. Here are the highlights I’m thinking about (as someone who lives in Python + agents):
Anjiang Wei retweeted
Anne Ouyang @anneouyang
Excited to share @Standard_Kernel's seed round and some reflections on what we’ve learned about kernel generation and what we believe is next. Grateful to our amazing team, supporters, and the broader community pushing this space forward.
Baifeng @baifeng_shi
Life update: joined @physical_int this week. Excited about what 🤖 we will build!
Anjiang Wei retweeted
机器之心 JIQIZHIXIN @jiqizhixin
What if you could mathematically prove an AI's outputs are safe, instead of just hoping they are? Enter BEAVER, a framework that gives deterministic, sound probability bounds on whether an LLM will violate a given constraint, offering guarantees where sampling only provides intuition.
BEAVER: An Efficient Deterministic LLM Verifier
Paper: arxiv.org/abs/2512.05439
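For intuition on what "deterministic, sound probability bounds" could mean, here is a toy reconstruction of the general idea (my sketch only; the vocabulary, i.i.d. token model, and pruning rule are invented, and this is not BEAVER's actual algorithm): walk the token tree deterministically, credit the mass of prefixes that provably violate the constraint, and let any unexplored mass widen the reported interval instead of being sampled away.

```python
VOCAB = {"safe": 0.6, "ok": 0.3, "unsafe": 0.1}  # toy i.i.d. next-token model
LENGTH = 3  # bound sequence length so the tree is finite

def violates(tokens):
    # Toy constraint: the output must never contain the token "unsafe".
    return "unsafe" in tokens

def violation_bounds(eps=1e-3):
    lo = 0.0          # mass proven to violate the constraint
    unexplored = 0.0  # mass we declined to explore; widens the interval
    stack = [((), 1.0)]
    while stack:
        prefix, p = stack.pop()
        if violates(prefix):
            lo += p       # constraint is monotone: every completion violates
            continue
        if len(prefix) == LENGTH:
            continue      # complete sequence, provably safe
        if p < eps:
            unexplored += p   # pruned mass is counted as unknown, not ignored
            continue
        for tok, q in VOCAB.items():
            stack.append((prefix + (tok,), p * q))
    return lo, lo + unexplored  # sound lower/upper bounds on P(violation)

print(violation_bounds())  # tight here, since nothing is pruned: 1 - 0.9**3 = 0.271
```

Sampling would only estimate this probability; the point of a deterministic walk is that the returned interval is guaranteed to contain the true value.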
Anjiang Wei retweeted
Ying Sheng @ying11231
We've been running @radixark for a few months, started by many core developers of SGLang @lmsysorg and its extended ecosystem (slime @slime_framework, AReaL @jxwuyi).

I left @xai in August, a place where I built deep attachments and countless beautiful memories. It was the best place I've ever worked, the place I watched grow from a few dozen people to hundreds, and it truly felt like home. What pushed me to make such a hard decision was the momentum of building SGLang in the open and the mission of creating an ambitious future, within the open spirit I learned at my first job at @databricks after my PhD.

We started SGLang in the summer of 2023 and made it public in January 2024. Over the past two years, hundreds of people have worked hard to bring it to where it is today, through several waves of growth after its first release. I still remember the many dark nights in the summer of 2024 that I spent debugging with @lm_zheng, @lsyincs, and @zhyncs42, while @ispobaoke single-handedly took on DeepSeek inference optimizations and @GenAI_is_real and the community strike team tag-teamed on-call shifts non-stop. There are many more contributors than I have space to name, but they are recorded on the GitHub contributor list forever.

Demand has grown exponentially, and it has pushed us to make SGLang a dedicated effort supported by RadixArk. It's the step-by-step journey of a thousand miles that has carried us here, and the same relentless Long March that will lead us into the tens of thousands of miles yet to come. The story never stops growing.

Over the past year, we've seen something very clearly: the world is full of people eager to build AI, but the infrastructure that makes it possible is not shared. The most advanced inference and training stacks live inside a few companies. Everyone else is forced to rebuild the same schedulers, compilers, serving engines, and training pipelines again and again, often under enormous pressure, with duplicated effort and wasted insight.

RadixArk was born to change that. Today, we're building an infrastructure-first, deep-tech company with a simple and ambitious mission: "Make frontier-level AI infrastructure open and accessible to everyone."

If the two values below resonate with you, come talk to us:
(1) Engineering as an art. Infrastructure is a first-class citizen at RadixArk. We care about elegant design and code that lasts. Beneath every line of code lies the soul of the engineer who wrote it.
(2) A belief in openness. We share what we build. We bet on long-term compounding through community, contribution, and giving more than we take. A product is defined by its users, yet it truly comes alive the moment functionality transcends mere utility and begins to embody aesthetics.

Thanks to all the miles (the name of our first released RL framework; see below). radixark.ai
Anjiang Wei retweeted
Azalia Mirhoseini @Azaliamirh
Thrilled to share that @annadgoldie and I are launching @RicursiveAI, a frontier lab enabling recursive self-improvement through AIs that design their own chips. Our vision for transforming chip design began with AlphaChip, an AI for layout optimization used to design four generations of TPUs, data center CPUs, and smartphones. AlphaChip offered a glimpse into a future where AI designs the silicon that fuels it. Ricursive extends this vision to the entire chip stack, building AI that architects, verifies, and implements silicon, enabling models and chips to co-evolve in a tight loop. We sat down with WSJ’s @berber_jin1 to discuss Ricursive: wsj.com/tech/this-ai-s…
Ricursive Intelligence@RicursiveAI

Introducing Ricursive Intelligence, a frontier AI lab enabling a recursive self-improvement loop between AI and the chips that fuel it. Learn more at ricursive.com

Anjiang Wei retweeted
Shiyi Cao @shiyi_c98
1/n 🚀 Introducing SkyRL-Agent, a framework for efficient RL agent training.
⚡ 1.55× faster async rollout dispatch
🛠 Lightweight tool + task integration
🔄 Backend-agnostic (SkyRL-train / VeRL / Tinker)
🏆 Used to train SA-SWE-32B, improving Qwen3-32B from 24.4% → 39.4% Pass@1 on SWE-Bench Verified with >2× lower cost
GitHub: github.com/NovaSky-AI/Sky…
Paper: arxiv.org/pdf/2511.16108
👇 more details
Anjiang Wei retweeted
Allen Nie (🇺🇦☮️) @allenainie
Gemini 3 showed the advantage of TPUs for training ever-larger frontier models. AWS is building the fastest-growing compute centers with Trainium 2 chips under Project Rainier (1M chips) for Anthropic Claude. In collaboration with Jiin Woo, @zhang677, @ShaoweiZhu95pu, @anjiangw, @sunny_szy, and the AWS Neuron Science team, I'm helping release two papers on agentic backbone building and RL post-training for kernels!
📝 AccelOpt: arxiv.org/abs/2511.15915 (optimizing kernels for Trainium 1 and 2) (📈🧵1/6)
📝 TritonRL: arxiv.org/abs/2510.17891 (RL post-training for Triton kernels) (🔱🧵1/7)
Anjiang Wei retweeted
Genghan Zhang @zhang677
🚀 AccelOpt, a self-improving LLM agentic system for AI accelerator kernel optimization.
📈 Boosts utilization from 49→61% on Trainium1 and 45→59% on Trainium2 using open-source models, matching Claude Sonnet 4 while being 26× cheaper
Paper: arxiv.org/pdf/2511.15915
Anjiang Wei @anjiangw
I’ll be presenting SATBench at EMNLP 2025!
🗓️ Thursday, Nov. 6, 16:30–18:00
📍 Hall C, Session 11
📄 Paper ID: 4150-Main
Excited to share how SATBench turns Boolean SAT problems into search-based logical puzzles in natural language!
Anjiang Wei@anjiangw

We introduce SATBench to evaluate LLMs' logical reasoning via puzzles generated from Boolean satisfiability problems 🧩. SATBench tests search-based reasoning: finding truth assignments that satisfy the constraints 🔍
📄 arxiv.org/abs/2505.14615
💻 github.com/Anjiang-Wei/SA…
#LLM #Reasoning #SAT
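The quoted thread describes SATBench's core move: rendering a CNF formula as a natural-language search puzzle. A minimal sketch of that conversion (the guest-list framing, function name, and phrasing are invented for illustration, not SATBench's actual templates):

```python
def cnf_to_puzzle(clauses, names):
    """Render a CNF formula as a natural-language constraint puzzle.

    clauses: list of clauses, each a list of 1-based literals
             (positive = variable true, negative = variable false).
    names:   one entity name per variable.
    """
    lines = ["A party has the following guests: " + ", ".join(names) + "."]
    for clause in clauses:
        opts = []
        for lit in clause:
            name = names[abs(lit) - 1]
            opts.append(f"{name} attends" if lit > 0 else f"{name} stays home")
        lines.append("At least one of these holds: " + ", or ".join(opts) + ".")
    lines.append("Which guests can attend so that every condition holds?")
    return "\n".join(lines)

# (x1 ∨ ¬x2) ∧ (x2 ∨ x3) becomes a two-condition puzzle about three guests.
puzzle = cnf_to_puzzle([[1, -2], [2, 3]], ["Alice", "Bob", "Carol"])
print(puzzle)
```

An LLM answering the puzzle has to implicitly search for a satisfying assignment, which is what lets the benchmark probe search-based reasoning rather than surface pattern matching.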

Anjiang Wei @anjiangw
I’ll be presenting EquiBench at EMNLP 2025!
🗓️ Friday, Nov. 7, 14:00–15:30
📍 Hall C, Session 15
📄 Paper ID: 4154-Main
How well can language models reason about program semantics? Stop by if you’re into LLMs and program analysis!
Anjiang Wei@anjiangw

🚨 New benchmark drop: EquiBench 🚨
We introduce equivalence checking as a rigorous test of LLMs’ code reasoning ability, featuring 4 languages, 6 categories, and 2,400 program pairs. Top models still struggle with this task.
🔗 Website: anjiang-wei.github.io/EquiBench-Webs…
📝 Preprint: arxiv.org/pdf/2502.12466
💻 Code: github.com/Anjiang-Wei/eq…
📊 Dataset: huggingface.co/datasets/anjia…
#AI #LLM #LLM4Code #Reasoning #EquiBench
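The quoted tweet frames equivalence checking as a code-reasoning task. A toy illustration of what a program pair looks like (my own example, not drawn from the EquiBench dataset), with a randomized differential check; testing can find counterexamples but can never prove equivalence, and that proof step is what the benchmark asks models to reason about:

```python
import random

def sum_v1(n):
    # Iterative version: accumulate 1 + 2 + ... + n.
    total = 0
    for i in range(1, n + 1):
        total += i
    return total

def sum_v2(n):
    # Closed form: semantically equivalent for all n >= 0.
    return n * (n + 1) // 2

# Differential testing: evidence of equivalence, not a proof.
for _ in range(100):
    n = random.randint(0, 1000)
    assert sum_v1(n) == sum_v2(n)
print("no counterexample found in 100 random trials")
```

A model judging such a pair must reason that both functions compute the same value on every input, rather than compare syntax or run a handful of tests.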
