Lawrence Jang (@JangLawrenceK) - Twitter Profile

Korean websites have the worst UI of all time. Registering for the smart transit card online has asked me for my citizenship ID 10 times. This country is in dire need of web agents, but it might be the last place to get them to work.

0

1

65

Lawrence Jang retweeted

Russ Salakhutdinov@rsalakhu·3d

Congrats to the Webwright team microsoft.github.io/Webwright at @MSFTResearch for taking the #1 spot on Odysseys, a highly challenging benchmark for long-horizon web agents: odysseys-website.pages.dev/leaderboard Odysseys evaluates realistic, multi-hour web workflows that require sustained planning, memory, reasoning, and verification across many websites and tools. These are far beyond short single-step browser tasks. For example, if you are searching for CS faculty positions, a single task could involve building a comprehensive Excel tracker of openings across the top CS schools using CSRankings as the master checklist; verifying every school directly through department, engineering, and university careers pages for CS/AI/ML/data science/robotics/vision faculty roles; opening and validating each posting; maintaining structured evidence and verification tabs; and finishing with a completeness audit and summary of hiring trends. Exciting progress toward truly capable long-horizon web agents.

0

7

31

4.6K

Lawrence Jang retweeted

Naveen Raman@NaveenJRaman·11 May

Wrote a short piece on the marginal value of capabilities, and how language model improvements correspond to usefulness. naveenraman.com/writing/margin…

0

1

5

189

Lawrence Jang@JangLawrenceK·10 May

When the music stops and the subsidized coding plans run out, I'm planning to go to culinary school. If you have recommendations, let me know

0

122

Lawrence Jang@JangLawrenceK·10 May

Now that my first year is coming to a close, I took @jacspringer's suggestion to look into my coding agent usage. I switched to a Claude Code plan in March at @PranjalAggarw16's suggestion, and apparently this is my usage from the last two and a half months. Pranjal switched back to Codex though, so maybe I'll have to do that.
💸 $14,011.53 in token usage
🔢 23,209,871,145 tokens
📦 22.76B cache reads (98% of all tokens)
✍️ 43.1M output tokens
📅 $259/day avg
📈 Peak usage in a day: $1,006.30
You can just use npx ccusage to get your own stats, pretty cool.
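The figures in the post can be sanity-checked with a little arithmetic. The sketch below is only an illustration: the constants are copied from the post, while the function names and the idea of deriving "implied active days" and a blended cost per million tokens are assumptions added here, not something the post or ccusage reports.

```python
# Figures copied from the post; everything derived from them is an illustration.
TOTAL_COST_USD = 14_011.53
TOTAL_TOKENS = 23_209_871_145
CACHE_READ_TOKENS = 22.76e9      # "22.76B cache reads"
AVG_COST_PER_DAY_USD = 259.0     # "$259/day avg"


def cache_read_share(cache_tokens: float, total_tokens: float) -> float:
    """Fraction of all tokens that were cache reads."""
    return cache_tokens / total_tokens


def implied_active_days(total_cost: float, avg_cost_per_day: float) -> float:
    """Number of days implied by the total and the daily average (an inference)."""
    return total_cost / avg_cost_per_day


def blended_cost_per_million(total_cost: float, total_tokens: float) -> float:
    """Average cost in USD per one million tokens, across all token types."""
    return total_cost / (total_tokens / 1_000_000)


if __name__ == "__main__":
    share = cache_read_share(CACHE_READ_TOKENS, TOTAL_TOKENS)
    days = implied_active_days(TOTAL_COST_USD, AVG_COST_PER_DAY_USD)
    blended = blended_cost_per_million(TOTAL_COST_USD, TOTAL_TOKENS)
    print(f"cache read share: {share:.1%}")
    print(f"implied days at the stated average: {days:.1f}")
    print(f"blended cost per 1M tokens: ${blended:.2f}")
```

The cache-read share agrees with the 98% quoted in the post. The implied day count comes out shorter than two and a half calendar months, which may mean the daily average covers only days with activity; that reading is a guess.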

1

13

379

Lawrence Jang@JangLawrenceK·5 May

Really cool work @Adamlu28! I am excited to see new submissions to Odysseys. It'll be interesting to see how real deployment of web agents will happen. @kohjingyu's belief in pure CUAs is getting challenged by the day.

Microsoft AI Frontiers@ms_aifrontiers

Most web agents drive a browser one click at a time. We tried something different and it worked better than we expected. Webwright, a new project from our team, gives the model a terminal instead of a click loop. The agent writes Playwright code, spawns browser sessions on demand, and ends with a reusable script rather than a transient session. The results: SOTA on long horizon web benchmark Odysseys (60.8%, a 16-point jump over the previous best) and 86.7% on Online-Mind2Web with GPT-5.4 — the highest of any open-source AutoEval recipe we know of. All from a minimal harness that's roughly 1K lines of code with no multi-agent orchestration. The broader bet: as models get better at code, the right harness gets smaller, not larger. Great work by @Adamlu28 @Xu_Lingrui_ @huang_chao4969 @ahmed @AhmedHAwadallah You can check it out: microsoft.github.io/Webwright/

1

7

1.2K

Lawrence Jang retweeted

Jing Yu Koh@kohjingyu·29 Apr

One of the things I’m most excited about this year is building agents that can work productively for hours, days, or weeks. Coding agents are starting to become very competent at this, but what about computer use agents? Our new benchmark, Odysseys (co-led with @JangLawrenceK) is a set of 200 new tasks derived from real world browsing behavior that measure long horizon web navigation capabilities (potentially up to hours of web browsing work). Interestingly, we find that frontier CUAs are already surprisingly good at working productively for up to an hour on these tasks, but there’s a lot of work to be done in making them even more efficient. Like every other AI researcher, my real dream is to open a cafe once we solve ASI. So, here’s Opus 4.6 doing some market research for me ("I want to do market research on the most popular cafes in Singapore. Analyse the menus of the top 10 cafes in Singapore (by Google reviews/ratings), and make sure we include at least 1 from the North/South/East/West/Central regions of Singapore. Keep the relevant pages of each cafe open, and summarise their pricing, menu offerings, unique selling points, making sure to reference which tab is opened for each cafe. For each cafe, also help me figure out how long it would take to get to it from Tampines MRT, and include this in your final summary."). I was very impressed to see Opus 4.6 complete this task after working for 52 mins, satisfying all 7 rubrics that corresponded to this task. It provided a very nice markdown summary at the end that gave me all the information I asked for!

11

25

124

44.6K

Lawrence Jang retweeted

Russ Salakhutdinov@rsalakhu·29 Apr

How well do today’s frontier models handle long-horizon, multi-step web agent tasks, such as identifying the top 25 U.S. CS PhD programs with ML/AI faculty likely accepting students and compiling the results into a structured sheet? Check out our new work on Odysseys: Benchmarking Web Agents on Realistic Long Horizon Tasks Paper: odysseys-website.pages.dev Leaderboard: odysseys-website.pages.dev/leaderboard We introduce Odysseys, a benchmark of 200 long-horizon tasks derived from real browsing sessions and evaluated on the live Internet. We show that binary pass/fail is inadequate in this setting and propose rubric-based evaluation, which better aligns with human judgment and provides more informative signals. Across leading models, the best achieves only 44.5% success, leaving substantial headroom. We further introduce a Trajectory Efficiency metric (rubric score per step) and find efficiency remains extremely low (1.15%), highlighting a key bottleneck. Odysseys provides a realistic benchmark for measuring progress toward web agents capable of sustained, efficient, real-world operation. See a more detailed thread by @kohjingyu.
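The difference between binary pass/fail, a rubric score, and "rubric score per step" can be sketched in a few lines. This is a minimal illustration under stated assumptions: the `Trajectory` class, the function names, the use of the fraction of satisfied rubrics as the rubric score, and the example numbers are all hypothetical, and the paper's exact definitions may differ.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Trajectory:
    """Hypothetical record of one agent run on one task."""
    rubric_results: List[bool]  # one entry per rubric: satisfied or not
    num_steps: int              # number of agent steps taken


def rubric_score(t: Trajectory) -> float:
    """Fraction of rubrics satisfied (assumed scoring rule)."""
    if not t.rubric_results:
        raise ValueError("a task needs at least one rubric")
    return sum(t.rubric_results) / len(t.rubric_results)


def perfect_success(t: Trajectory) -> bool:
    """Binary pass/fail: every rubric must be satisfied."""
    return all(t.rubric_results)


def trajectory_efficiency(t: Trajectory) -> float:
    """Rubric score per step, following the post's one-line description."""
    if t.num_steps <= 0:
        raise ValueError("num_steps must be positive")
    return rubric_score(t) / t.num_steps


if __name__ == "__main__":
    # Made-up run: 3 of 4 rubrics satisfied after 50 steps.
    run = Trajectory(rubric_results=[True, True, False, True], num_steps=50)
    print(perfect_success(run), rubric_score(run), trajectory_efficiency(run))
```

In this made-up run the binary metric records a plain failure, while the rubric score still credits the three satisfied rubrics, which is the kind of extra signal the post attributes to rubric-based evaluation.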

0

7

64

18.2K

Lawrence Jang@JangLawrenceK·29 Apr

Another example we added was to “Plan a summer vacation that spans all 30 MLB Ballparks”. This is a bucket-list trip I plan to do with my dad someday. It would be extremely hard to plan manually, and our tests suggest it will take LLMs a while to fully crack it too. We have 200 real, human-inspired long-horizon tasks that would love to be solved. We did a lot of analysis across varying models on the benchmark in the paper if you are interested. We release:
💻 Data: github.com/ljang0/Odyssey…
🌐 Website: odysseys-website.pages.dev
📝 Paper: arxiv.org/pdf/2604.24964
📊 Leaderboard: odysseys-website.pages.dev/leaderboard
Please check out our work and come take a hack at these problems with us! Done with @kohjingyu, @dan_fried, and @rsalakhu.

0

4

343

Lawrence Jang@JangLawrenceK·29 Apr

AI agents can work pretty well on the web now for short tasks. I wanted to know: could they go longer, on harder tasks? Can an agent plan 2 weddings in different cities and a honeymoon within the same month, or find the most suitable culinary arts school across the US for my post-PhD plans? We are releasing Odysseys: a benchmark of 200 long-horizon web agent tasks evaluated on the live internet. All our tasks are inspired by real human data and many take hours to complete. The best frontier model we tested (Claude Opus 4.6) reaches only 44.5% perfect-task success, leaving substantial room for improvement. I donated a couple of my own automation wishes to this benchmark. My favorite contribution was to “Rank the top 10 ACL + Meniscus surgeons in the area”, as this took me a decent amount of time to do myself when I got my own knee fixed. GPT 5.5 was able to do this for me with 4 dollars and 30 minutes!

2

12

33

17.5K

Lawrence Jang retweeted

Yutong (Kelly) He@electronickale·23 Apr

Diffusion planners are great for offline RL. But they need many steps to work well! Way too slow for real-time decision making! Presenting RACTD at #ICLR2026: reward-aware distillation that plans in ONE step 🇧🇷 Today (4/23) P4-#4618 3:15-5:45 PM arxiv.org/abs/2506.07822 1/

2

19

96

8.2K

Lawrence Jang retweeted

Pranjal Aggarwal ✈️ ICLR'26@PranjalAggarw16·10 Apr

Thought Computer Use is only about booking flights? Think Again! In CUA-World, we task agents with real-world work like planning Artemis trajectory, filing taxes, engineering, medical imaging, enterprise work, mission planning and even designing homes! Some of my favorites: 🧵

6

14

94

7.1K

Lawrence Jang retweeted

Naveen Raman@NaveenJRaman·8 Apr

Training concept-based models relies on concept selection which is labor-intensive and slow. We introduce Decision-Relevant Selection (DRS), a principled algorithm for automatic concept selection in RL. Paper: arxiv.org/abs/2604.04808 Website: naveenraman.com/projects/conce… 🧵 1/n

2

14

66

9.8K

Lawrence Jang@JangLawrenceK·8 Apr

Pranjal is so good at CUAs

Pranjal Aggarwal ✈️ ICLR'26@PranjalAggarw16

What if computer-use agents could do real work? We built Gym-Anything: a framework that turns any software into a computer-use agent environment. We used it to create CUA-World: 200+ real software, 10,000+ tasks and environments, across all major occupation groups, from medical imaging to financial trading. 🧵

0

3

175

Lawrence Jang retweeted

Pranjal Aggarwal ✈️ ICLR'26@PranjalAggarw16·8 Apr

What if computer-use agents could do real work? We built Gym-Anything: a framework that turns any software into a computer-use agent environment. We used it to create CUA-World: 200+ real software, 10,000+ tasks and environments, across all major occupation groups, from medical imaging to financial trading. 🧵

20

82

427

144.9K

Lawrence Jang@JangLawrenceK·8 Apr

Check out Gabe's recent work! Looks like he's still keeping the GPUs hot at Princeton.

Gabriel Sarch@GabrielSarch

Introducing Vero, the strongest fully open RL recipe for training next-generation visual reasoners. From charts to spatial to open-ended tasks, Vero sets a new bar. • sota 8B VLM across 30 benchmarks • +4.4 avg over four base models (30 evals) • beats prior RL datasets 🧵👇

0

3

201

Lawrence Jang retweeted

Gabriel Sarch@GabrielSarch·7 Apr

Introducing Vero, the strongest fully open RL recipe for training next-generation visual reasoners. From charts to spatial to open-ended tasks, Vero sets a new bar. • sota 8B VLM across 30 benchmarks • +4.4 avg over four base models (30 evals) • beats prior RL datasets 🧵👇

3

59

299

61.8K

Lawrence Jang retweeted

Kevin Li@kevinyli_·1 Apr

how making minor abstract changes right before the deadline feels

0

1

12

238

Lawrence Jang retweeted

Shuyan Zhou@shuyanzh36·23 Mar

In 2023, WebArena took 7 grad students more than 6 months to build just 5 environments with 812 variable browser-use tasks. Now, it takes under 10 hours and less than $100 per environment, with easy support for parallel generation. Excited to introduce WebArena-Infinity: a scalable approach for automatically generating high-authenticity, high-complexity browser environments with verifiable tasks suitable for RL training and benchmarking. Even strong open-source models that already achieve 60%+ success rates on WebArena and OSWorld complete fewer than 50% of tasks here. Project page: webarena.dev/webarena-infin… Repo: github.com/web-arena-x/we… 🧵 (1/n)

12

49

330

43.7K

Lawrence Jang retweeted

Aakash Lahoti@aakash_lahoti·17 Mar

A year of cooking 👨‍🍳 and we’re finally serving Mamba-3. What began as a small effort to revisit a few recurring limitations of SSMs grew into a much bigger project. Taking a more principled state space perspective ended up tying these threads together.

Albert Gu@_albertgu

The newest model in the Mamba series is finally here 🐍 Hybrid models have become increasingly popular, raising the importance of designing the next generation of linear models. We've introduced several SSM-centric ideas to significantly increase Mamba-2's modeling capabilities without compromising on speed. The resulting Mamba-3 model has noticeable performance gains over the most popular previous linear models (such as Mamba-2 and Gated DeltaNet) at all sizes. This is the first Mamba that was student led: all credit to @aakash_lahoti @kevinyli_ @_berlinchen @caitWW9, and of course @tri_dao!

7

21

96

8.8K
