Lawrence Jang

86 posts

@JangLawrenceK

CMU Machine Learning PhD Student

Pittsburgh, PA · Joined June 2024
288 Following · 211 Followers
Lawrence Jang @JangLawrenceK
Korean websites have the worst UI of all time. Registering for the smart transit card online has asked me for my citizenship ID 10 times. This country is in dire need of web agents, but it might be the last place to get them working.
Replies: 0 · Retweets: 0 · Likes: 1 · Views: 65
Lawrence Jang retweeted
Russ Salakhutdinov @rsalakhu
Congrats to the Webwright team microsoft.github.io/Webwright at @MSFTResearch for taking the #1 spot on Odysseys, a highly challenging benchmark for long-horizon web agents: odysseys-website.pages.dev/leaderboard Odysseys evaluates realistic, multi-hour web workflows that require sustained planning, memory, reasoning, and verification across many websites and tools. These are far beyond short single-step browser tasks. For example, if you are searching for CS faculty positions, a single task could involve building a comprehensive Excel tracker of openings across the top CS schools using CSRankings as the master checklist; verifying every school directly through department, engineering, and university careers pages for CS/AI/ML/data science/robotics/vision faculty roles; opening and validating each posting; maintaining structured evidence and verification tabs; and finishing with a completeness audit and summary of hiring trends. Exciting progress toward truly capable long-horizon web agents.
Replies: 0 · Retweets: 7 · Likes: 31 · Views: 4.6K
Lawrence Jang retweeted
Naveen Raman @NaveenJRaman
Wrote a short piece on the marginal value of capabilities, and how language model improvements correspond to usefulness. naveenraman.com/writing/margin…
Replies: 0 · Retweets: 1 · Likes: 5 · Views: 189
Lawrence Jang @JangLawrenceK
When the music stops and the subsidized coding plans run out, I'm planning to go to culinary school. If you have recommendations, let me know
Replies: 0 · Retweets: 0 · Likes: 0 · Views: 122
Lawrence Jang @JangLawrenceK
Now that my first year is coming to a close, I took @jacspringer's suggestion to look into my coding agent usage. I switched to a Claude Code plan in March at @PranjalAggarw16's suggestion, and apparently this is my usage from the last two and a half months. Pranjal switched back to Codex though, so maybe I'll have to do that.
💸 $14,011.53 in token usage
🔢 23,209,871,145 tokens
📦 22.76B cache reads (98% of all tokens)
✍️ 43.1M output tokens
📅 $259/day avg
📈 Peak usage in a day: $1,006.30
You can just use npx ccusage to get your own stats, pretty cool
Replies: 1 · Retweets: 1 · Likes: 13 · Views: 379
Lawrence Jang retweeted
Jing Yu Koh @kohjingyu
One of the things I’m most excited about this year is building agents that can work productively for hours, days, or weeks. Coding agents are starting to become very competent at this, but what about computer use agents? Our new benchmark, Odysseys (co-led with @JangLawrenceK) is a set of 200 new tasks derived from real world browsing behavior that measure long horizon web navigation capabilities (potentially up to hours of web browsing work). Interestingly, we find that frontier CUAs are already surprisingly good at working productively for up to an hour on these tasks, but there’s a lot of work to be done in making them even more efficient. Like every other AI researcher, my real dream is to open a cafe once we solve ASI. So, here’s Opus 4.6 doing some market research for me ("I want to do market research on the most popular cafes in Singapore. Analyse the menus of the top 10 cafes in Singapore (by Google reviews/ratings), and make sure we include at least 1 from the North/South/East/West/Central regions of Singapore. Keep the relevant pages of each cafe open, and summarise their pricing, menu offerings, unique selling points, making sure to reference which tab is opened for each cafe. For each cafe, also help me figure out how long it would take to get to it from Tampines MRT, and include this in your final summary."). I was very impressed to see Opus 4.6 complete this task after working for 52 mins, satisfying all 7 rubrics that corresponded to this task. It provided a very nice markdown summary at the end that gave me all the information I asked for!
Replies: 11 · Retweets: 25 · Likes: 124 · Views: 44.6K
Lawrence Jang retweeted
Russ Salakhutdinov @rsalakhu
How well do today’s frontier models handle long-horizon, multi-step web agent tasks, such as identifying the top 25 U.S. CS PhD programs with ML/AI faculty likely accepting students and compiling the results into a structured sheet? Check out our new work on Odysseys: Benchmarking Web Agents on Realistic Long Horizon Tasks Paper: odysseys-website.pages.dev Leaderboard: odysseys-website.pages.dev/leaderboard We introduce Odysseys, a benchmark of 200 long-horizon tasks derived from real browsing sessions and evaluated on the live Internet. We show that binary pass/fail is inadequate in this setting and propose rubric-based evaluation, which better aligns with human judgment and provides more informative signals. Across leading models, the best achieves only 44.5% success, leaving substantial headroom. We further introduce a Trajectory Efficiency metric (rubric score per step) and find efficiency remains extremely low (1.15%), highlighting a key bottleneck. Odysseys provides a realistic benchmark for measuring progress toward web agents capable of sustained, efficient, real-world operation. See a more detailed thread by @kohjingyu.
Replies: 0 · Retweets: 7 · Likes: 64 · Views: 18.2K
Lawrence Jang @JangLawrenceK
Another example we added was to “Plan a summer vacation that spans all 30 MLB Ballparks”. This is a bucket-list trip I plan to do with my dad someday. It would be extremely hard to plan manually, and our tests suggest it will take LLMs a while to fully crack it too. We have 200 real, human-inspired long-horizon tasks that would love to be solved. We did a lot of analysis across varying models on the benchmark in the paper if you are interested. We release: 💻 Data: github.com/ljang0/Odyssey… 🌐 Website: odysseys-website.pages.dev 📝 Paper: arxiv.org/pdf/2604.24964 📷 📊 Leaderboard: odysseys-website.pages.dev/leaderboard Please check out our work and come take a hack at these problems with us! Done with @kohjingyu, @dan_fried, and @rsalakhu.
Replies: 0 · Retweets: 0 · Likes: 4 · Views: 343
Lawrence Jang @JangLawrenceK
AI agents can work pretty well on the web now for short tasks. I wanted to know: could they go longer, on harder tasks? Can an agent plan 2 weddings in different cities and a honeymoon within the same month, or find the most suitable culinary arts school across the US for my post-PhD plans? We are releasing Odysseys: a benchmark of 200 long-horizon web agent tasks evaluated on the live internet. All our tasks are inspired by real human data, and many take hours to complete. The best frontier model we tested (Claude Opus 4.6) reaches only 44.5% perfect-task success, leaving substantial room for improvement. I donated a couple of my own automation wishes to this benchmark. My favorite contribution was to “Rank the top 10 ACL + Meniscus surgeons in the area”, as this took me a decent amount of time to do myself when I got my own knee fixed. GPT 5.5 was able to do this for me with 4 dollars and 30 minutes!
Replies: 2 · Retweets: 12 · Likes: 33 · Views: 17.5K
Lawrence Jang retweeted
Yutong (Kelly) He @electronickale
Diffusion planners are great for offline RL. But they need many steps to work well! Way too slow for real-time decision making! Presenting RACTD at #ICLR2026: reward-aware distillation that plans in ONE step 🇧🇷 Today (4/23) P4-#4618 3:15-5:45 PM arxiv.org/abs/2506.07822 1/
Replies: 2 · Retweets: 19 · Likes: 96 · Views: 8.2K
Lawrence Jang retweeted
Pranjal Aggarwal ✈️ ICLR'26 @PranjalAggarw16
Thought Computer Use is only about booking flights? Think Again! In CUA-World, we task agents with real-world work like planning Artemis trajectory, filing taxes, engineering, medical imaging, enterprise work, mission planning and even designing homes! Some of my favorites: 🧵
Replies: 6 · Retweets: 14 · Likes: 94 · Views: 7.1K
Lawrence Jang retweeted
Naveen Raman @NaveenJRaman
Training concept-based models relies on concept selection which is labor-intensive and slow. We introduce Decision-Relevant Selection (DRS), a principled algorithm for automatic concept selection in RL. Paper: arxiv.org/abs/2604.04808 Website: naveenraman.com/projects/conce… 🧵 1/n
Replies: 2 · Retweets: 14 · Likes: 66 · Views: 9.8K
Lawrence Jang retweeted
Pranjal Aggarwal ✈️ ICLR'26 @PranjalAggarw16
What if computer-use agents could do real work? We built Gym-Anything: a framework that turns any software into a computer-use agent environment. We used it to create CUA-World: 200+ real software, 10,000+ tasks and environments, across all major occupation groups, from medical imaging to financial trading. 🧵
Replies: 20 · Retweets: 82 · Likes: 427 · Views: 144.9K
Lawrence Jang retweeted
Gabriel Sarch @GabrielSarch
Introducing Vero, the strongest fully open RL recipe for training next-generation visual reasoners. From charts to spatial to open-ended tasks, Vero sets a new bar. • sota 8B VLM across 30 benchmarks • +4.4 avg over four base models (30 evals) • beats prior RL datasets 🧵👇
Replies: 3 · Retweets: 59 · Likes: 299 · Views: 61.8K
Lawrence Jang retweeted
Kevin Li @kevinyli_
how making minor abstract changes right before the deadline feels
Replies: 0 · Retweets: 1 · Likes: 12 · Views: 238
Lawrence Jang retweeted
Shuyan Zhou @shuyanzh36
In 2023, WebArena took 7 grad students more than 6 months to build just 5 environments with 812 variable browser-use tasks. Now, it takes under 10 hours and less than $100 per environment, with easy support for parallel generation. Excited to introduce WebArena-Infinity: a scalable approach for automatically generating high-authenticity, high-complexity browser environments with verifiable tasks suitable for RL training and benchmarking. Even strong open-source models that already achieve 60%+ success rates on WebArena and OSWorld complete fewer than 50% of tasks here. Project page: webarena.dev/webarena-infin… Repo: github.com/web-arena-x/we… 🧵 (1/n)
Replies: 12 · Retweets: 49 · Likes: 330 · Views: 43.7K
Lawrence Jang retweeted