emily mcmilin
155 posts
Software agents can self-improve via self-play RL. Introducing Self-play SWE-RL (SSR): training a single LLM agent to self-play between bug-injection and bug-repair, grounded in real-world repositories, with no human-labeled issues or tests. 🧵
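The injector/repairer loop described above can be sketched in a few lines. Everything here is a toy stand-in, not the SSR implementation: the "repository" is one function, the roles are hard-coded edits rather than LLM policies, and the reward is a single built-in test instead of the repo's real test suite.

```python
# Minimal sketch of a self-play bug-injection / bug-repair episode.
# All names and logic are illustrative assumptions, not the SSR code.

CORRECT_SOURCE = "def add(a, b):\n    return a + b\n"  # toy "repository"

def inject_bug(source: str) -> str:
    """Injector role: corrupt the code (here, flip the operator)."""
    return source.replace("a + b", "a - b")

def repair_bug(source: str) -> str:
    """Repairer role: restore correct behavior (here, flip it back)."""
    return source.replace("a - b", "a + b")

def reward(source: str) -> float:
    """Verifiable reward: run a test instead of asking a human labeler."""
    env: dict = {}
    exec(source, env)
    return 1.0 if env["add"](2, 3) == 5 else 0.0

def self_play_episode() -> float:
    buggy = inject_bug(CORRECT_SOURCE)
    assert reward(buggy) == 0.0   # injection succeeded: the test now fails
    repaired = repair_bug(buggy)
    return reward(repaired)       # repairer is rewarded when tests pass again
```

The key property the sketch preserves: both roles are graded by the same executable check, so neither side needs human-written issues or labels.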

🗣️📣Announcing VerifAI 2: AI Verification in the Wild, an upcoming workshop at #ICLR2026!! 🗣️📣 VerifAI will gather researchers to explore topics at the intersection of genAI and trustworthy ML. Submit your work! Check out our website and CFP for more: verifai-workshop.github.io

1/ today we're releasing muse spark, the first model from MSL. nine months ago we rebuilt our ai stack from scratch. new infrastructure, new architecture, new data pipelines. muse spark is the result of that work, and now it powers meta ai. 🧵
(🧵) Today, we release Meta Code World Model (CWM), a 32-billion-parameter dense LLM that enables novel research on improving code generation through agentic reasoning and planning with world models. ai.meta.com/research/publi…

Goated FAIR team just found out how coding agents sometimes "cheat" on SWE-Bench Verified. It's really simple: for example, Qwen3 literally greps the commit logs for the number of the issue it needs to fix. lol, clever model. "Cheat" in quotes cuz it's more like env hacking.