Lawrence Jang

@JangLawrenceK
CMU Machine Learning PhD Student




Most web agents drive a browser one click at a time. We tried something different and it worked better than we expected. Webwright, a new project from our team, gives the model a terminal instead of a click loop. The agent writes Playwright code, spawns browser sessions on demand, and ends with a reusable script rather than a transient session. The results: SOTA on the long-horizon web benchmark Odysseys (60.8%, a 16-point jump over the previous best) and 86.7% on Online-Mind2Web with GPT-5.4, the highest of any open-source AutoEval recipe we know of. All from a minimal harness of roughly 1K lines of code with no multi-agent orchestration. The broader bet: as models get better at code, the right harness gets smaller, not larger. Great work by @Adamlu28 @Xu_Lingrui_ @huang_chao4969 @ahmed @AhmedHAwadallah. You can check it out: microsoft.github.io/Webwright/


One of the things I’m most excited about this year is building agents that can work productively for hours, days, or weeks. Coding agents are starting to become very competent at this, but what about computer-use agents? Our new benchmark, Odysseys (co-led with @JangLawrenceK), is a set of 200 new tasks derived from real-world browsing behavior that measure long-horizon web navigation capabilities (potentially up to hours of web browsing work). Interestingly, we find that frontier CUAs are already surprisingly good at working productively for up to an hour on these tasks, but there’s a lot of work to be done in making them even more efficient. Like every other AI researcher, my real dream is to open a cafe once we solve ASI. So, here’s Opus 4.6 doing some market research for me ("I want to do market research on the most popular cafes in Singapore. Analyse the menus of the top 10 cafes in Singapore (by Google reviews/ratings), and make sure we include at least 1 from the North/South/East/West/Central regions of Singapore. Keep the relevant pages of each cafe open, and summarise their pricing, menu offerings, unique selling points, making sure to reference which tab is opened for each cafe. For each cafe, also help me figure out how long it would take to get to it from Tampines MRT, and include this in your final summary."). I was very impressed to see Opus 4.6 complete this task after working for 52 minutes, satisfying all 7 rubrics for this task. It provided a very nice markdown summary at the end that gave me all the information I asked for!









Introducing Vero, the strongest fully open RL recipe for training next-generation visual reasoners. From charts to spatial to open-ended tasks, Vero sets a new bar.
• SOTA 8B VLM across 30 benchmarks
• +4.4 avg over four base models (30 evals)
• beats prior RL datasets
🧵👇


The newest model in the Mamba series is finally here 🐍 Hybrid models have become increasingly popular, raising the importance of designing the next generation of linear models. We've introduced several SSM-centric ideas to significantly increase Mamba-2's modeling capabilities without compromising on speed. The resulting Mamba-3 model has noticeable performance gains over the most popular previous linear models (such as Mamba-2 and Gated DeltaNet) at all sizes. This is the first Mamba that was student-led: all credit to @aakash_lahoti @kevinyli_ @_berlinchen @caitWW9, and of course @tri_dao!



