

Lei Li @NeurIPS2025
@lileics
Generative AI for language and science. MT, LLM, GenAI Safety, Drug Discovery


🚀 Is "Vibe Coding" actually safe for production? We’ve all seen the demos: give an LLM agent a prompt, watch it work its magic, and boom—you have a feature. But there’s a massive hidden risk. In our latest paper, we introduce SUSVIBES, a benchmark of 200 real-world SE tasks.

2/ We tested the world’s leading coding agents, and the results are a wake-up call for the industry. Functionality ≠ Security: for example, while SWE-Agent with Claude 4 Sonnet solved 61% of tasks correctly, only 10.5% of those solutions were actually secure.

3/ The "Vibe" Trap: Even when we gave agents hints about potential vulnerabilities, they struggled to mitigate the risks.

4/ Key Leaderboard Highlights: 🏆 Security Leader: @OpenHands + GLM4.7 🏆 Functionality Leader: SWE-agent + Claude 4 Sonnet. If we are moving toward an agent-led dev cycle, we need to talk about security now, not later.

Your vibe-coded app works. But is it secure? New benchmark SusVibes from Songwen Zhao, Danqing Wang, Kexun Zhang, Jiaxuan Luo, Zhuo Li, and @lileics at @CarnegieMellon, @Columbia, and @JohnsHopkins tested 200 real-world feature requests on coding agents. The results are sobering: SWE-Agent with Claude 4 Sonnet produced functionally correct code 61% of the time, but only 10.5% of solutions were actually secure. Even adding security hints to prompts did not fix the problem. The gap between 'it works' and 'it is safe to deploy' is massive: 77 different CWE vulnerability types showed up across the benchmark. Worth thinking about next time someone says AI will replace software engineers. The harder question was never about writing code that runs; it was always about writing code that does not break under adversarial conditions. Source: arxiv.org/abs/2512.03262
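The headline numbers are simple rates over the 200-task benchmark. A minimal sketch of the gap (all tallies below are hypothetical, chosen only to reproduce the reported percentages; the paper's actual per-task counts may differ, and I read the 10.5% as a share of all 200 tasks — if it is instead a share of the correct solutions, the conditional line changes):

```python
# Hypothetical tallies illustrating the SusVibes functionality-vs-security gap.
# Counts are invented to match the reported 61% / 10.5% rates over 200 tasks.
TASKS = 200
functional = 122            # solutions passing the functional tests (61%)
functional_and_secure = 21  # solutions that are also secure (10.5%)

func_rate = functional / TASKS
secure_rate = functional_and_secure / TASKS
# Among the functionally correct solutions, the share that is also secure:
secure_given_func = functional_and_secure / functional

print(f"functional: {func_rate:.1%}, secure: {secure_rate:.1%}, "
      f"secure | functional: {secure_given_func:.1%}")
```

Under these assumed tallies, even conditioning on functional correctness leaves fewer than one in five solutions secure — which is the thread's point.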



Poster day for our “Generative AI in Biomedicine” course this semester. The students’ creativity, energy, and enthusiasm for this exciting area are truly inspiring!



You're welcome to use our models. More details: 🎉 Paper: LLaMAX2: Your Translation-Enhanced Model also Performs Well in Reasoning (huggingface.co/papers/2510.09…) 🎉 Code: github.com/CONE-MT/LLaMAX… 🎉 Model: huggingface.co/collections/LL…

I’m ✨ super excited and grateful ✨to announce that I'm part of the 2025 class of #PackardFellows (packard.org/2025fellows). The Packard Foundation and this fellowship will allow me to explore exciting research directions towards culturally responsible and safe AI 🌍🌈


📢 We're thrilled to announce the CMU AI for Science Workshop on Sept 12 at CUC-MPW! Featuring an amazing lineup of speakers: - Akari Asai (AI2/CMU) - Gabe Gomes (CMU) - Chenglei Si (Stanford) - Keyon Vafa (Harvard) Join us on campus, submit your poster & register here: cmu-ai-for-science-workshop.github.io Questions? Feel free to email: cmu-ai-for-science-workshop@andrew.cmu.edu We look forward to seeing you there!🤗

Introducing MCPMark, a collaboration with @EvalSysOrg and @lobehub! We created a challenging benchmark to stress-test MCP use in comprehensive contexts. - 127 high-quality data samples created by experts. - GPT-5 takes the current lead and achieves a Pass@1 of 46.96% while the other models fall in the range of 10-30%. - Diverse test cases on Notion, Github, Filesystem, Playwright (browser), and Postgres. 9🧵s ahead
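For readers unfamiliar with the metric: Pass@1 is the standard pass@k estimator evaluated at k=1, i.e. the expected fraction of tasks solved on a single attempt. A minimal sketch with made-up sample counts (the tweet does not give MCPMark's per-model attempt counts, so the numbers below are illustrative only):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: the probability that at least one of k
    sampled attempts is correct, given c correct out of n attempts per task."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# At k=1 the estimator reduces to c / n. Illustrative counts only:
print(pass_at_k(100, 47, 1))  # fraction of attempts that succeed
```

With one attempt per task, Pass@1 is just the success rate, which is why a single-number leaderboard score (e.g. GPT-5's 46.96%) is directly comparable across models.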



With fresh support of $75M from @NSF and $77M from @NVIDIA, we’re set to scale our open model ecosystem, bolster the infrastructure behind it, and fast‑track reproducible AI research to unlock the next wave of scientific discovery. 💡




