Lin Shi

13 posts

@LinShi592021

Joined September 2025
14 Following, 12 Followers
Pinned Tweet
Lin Shi @LinShi592021
Terminal Bench 2.0 paper available: arxiv.org/abs/2601.11868. See where frontier agents and models still fail and how we crowdsource hundreds of high quality environments from the open source community 🚀 Follow github.com/laude-institut… to see how to run TB2 in Harbor!
1
1
9
240
Lin Shi retweeted
Alex Shaw @alexgshaw
The Harbor registry is getting an upgrade. Now, anyone can publish to the registry to make their dataset available to every Harbor user:
4
5
38
4.6K
Lin Shi retweeted
Richard Zhuang @RichardZ412
Terminal-Bench is a leading benchmark for agents. Unfortunately it’s hard: most small coding agents get very low scores on TB2, so training/system ablations look flat - you can't tell what's working. Announcing OpenThoughts-TBLite - 100 curated TB2-style tasks, difficulty-calibrated so even 8B models can make progress. It's designed to give researchers measurable signal during development, providing faster feedback for experimental iteration while closely tracking true TB2 performance🧵
11
21
181
44.6K
Lin Shi retweeted
Lin Shi @LinShi592021
Also, super honored to give my first conference talk at NeurIPS 2025 about Terminal Bench, Harbor, and Adapters! If you are interested in our work or want to gain some context in 15 minutes, this might be a great resource👀
1
0
3
158
Lin Shi @LinShi592021
@Mike_A_Merrill Super excited to see the paper coming out! Thanks to Mike and Alex for leading the project and to the community members for contributing!
0
0
2
245
Lin Shi retweeted
Mike A. Merrill @Mike_A_Merrill
The Terminal-Bench paper is here! Read it to learn where frontier models still fail and the secrets of how we sourced hundreds of high quality environments from our open source community. 🧵
23
102
457
101.7K
Lin Shi retweeted
Mike A. Merrill @Mike_A_Merrill
Thanks to everyone who came to the Terminal-Bench-2.0 and Harbor launch event last night! And special thanks to @LaudeInstitute for organizing and @databricks for hosting.
4
4
56
2.7K
Lin Shi retweeted
Alex Shaw @alexgshaw
Today, we’re announcing the next chapter of Terminal-Bench with two releases:
1. Harbor, a new package for running sandboxed agent rollouts at scale
2. Terminal-Bench 2.0, a harder version of Terminal-Bench with increased verification
25
73
391
141.2K