Kobe

146 posts

@kobe0938

I build agents/evals. OSS maintainer: Terminal Bench, SkillsBench, LMCache, OT Agent, ClawsBench. Previously at TensorMesh, DiffusiveAI, Xiaomi, Stanford.

Santa Clara, CA. Joined September 2021

91 Following · 96 Followers

Kobe@kobe0938·2d

@zeta_globin need her account plz, asking for a friend @lihanc02

6.1K

zeta@zeta_globin·2d

have yet to meet a girlfriend of an anthropic engineer who isn't someone I would probably die for

1.1K

201.9K

Kobe@kobe0938·2d

we can tell from the gesture

Kobe@kobe0938·2d

bro’s bullish @lihanc02

Hanchen Li@lihanc02

Had a bet today with @kobe0938 for a good dinner I bet NVidia reaches 2.5T before 20T. He bets 20T before 2.5T. Who do you think will win?

143

Kobe@kobe0938·2d

@lihanc02 everything or nothing

101

Hanchen Li@lihanc02·2d

Had a bet today with @kobe0938 for a good dinner I bet NVidia reaches 2.5T before 20T. He bets 20T before 2.5T. Who do you think will win?

981

Kobe@kobe0938·3d

@ivanburazin agree that this applies to file system very well, but what about running processes?

Ivan Burazin@ivanburazin·4d

Snapshots enable two things people don't think about.
1/ Pause when waiting: The agent sends something, so it waits for a human / service. You don't want to pay for an idle CPU the entire time. Just snapshot it and resume when there's a reply. The agent never notices.
2/ Parallel paths: Take a snapshot at decision point A, fork into two sandboxes, run both approaches simultaneously, and pick the winner.

3.5K

Kobe reposted

Steven Dillmann@StevenDillmann·5d

📣 Announcing Terminal-Bench Science: benchmarking AI agents on real scientific workflows – now open for task contributions👇 tbench.ai/news/tb-scienc… @AnthropicAI, @OpenAI, and @GoogleDeepMind use Terminal-Bench to evaluate AI on coding tasks. We're now extending it to scientific workflows. 1/6🧵

111

479

898.3K

Kobe reposted

Ryan Marten@ryanmart3n·12 May

been a pleasure sharing notes with @aalSonOfRavi and @ConnorBAdams on reward hacking and mitigation strategies. keep an eye out for a post from @kobe0938 with more juicy analysis on reward hacking in Terminal-Bench

Poolside@poolsideai

As agents get more clever, so do their attempts at benchmark hacking. Last Monday, we found one of our RL runs jumped ~20% on SWE-Bench-Pro over a weekend, reaching ~64% which would make it #1 on the leaderboard. This was clearly benchmark hacking and we patched the exploit. But this revealed deeper hacks across multiple public benchmarks, some of which were impossible to fix through environment design alone. Evals need to evolve beyond just outcome based pass rates to better observability into how the agent is arriving at them. These were our findings: poolside.ai/blog/through-t… Examples below 👇 1/

817

Kobe@kobe0938·12 May

🧐🤓🫡

Aalhad Patankar@aalSonOfRavi

@alexgshaw @ryanmart3n @harborframework And @kobe0938 for his deep dive into TB2 reward hack detection

249

Kobe@kobe0938·11 May

@calvinchen agree that random event mingling can be shallow, but the people giving talks are often actually building/researching on interesting stuff and worth talking to.

1.9K

Calvin Chen@calvinchen·10 May

everyone wants to move to sf to “meet people in ai” they come and are excited about all the events, just to realize 6 months later that all they did was meet other people who are like them they then either leave and say “sf wasn’t worth it” or they are smart and realize everyone worth meeting doesn’t go to these events

signüll@signulll

networking as activity is mostly cope. e.g. the conference circuit, the warm intros, the moving to sf discussions or whatever, oh & the “grabbing coffee” economy.. all of this is overwhelmingly negative selection esp with vc (lol). the ppl worth knowing are usually too busy doing the thing to be farmable, & the ppl available to be networked w/ are available cuz they have literally nothing better going on. do the work, then publish it loudly enough that the right ppl can find you w/o you having to chase. one way broadcast > two way schmoozing. this is why x matters a ton now more than ever before.

558

148.3K

Kobe@kobe0938·8 May

skillsbench x kaggle 🔥🔥🔥

Xiangyi Li@xdotli

SkillsBench being mentioned everywhere in the bay now 🔥🔥 thx @ivanleomk @kobe0938. We just merged our 94th task and will release the 1.0 version of our dataset on 5/27. Big news ahead. Stay tuned 👀

557

Kobe reposted

Alex Shaw@alexgshaw·6 May

TB2.1

terminalbench@terminalbench

We're releasing Terminal-Bench 2.1 to patch 28 of the 89 tasks in Terminal-Bench 2.0.
TB2.1 includes
• recalibrated limits
• fixed solutions
• realigned verifiers
Per-task breakdowns in 🧵
We'll continue to support TB2 and TB2.1 leaderboards (new submission process 🔜)

4.5K

Kobe@kobe0938·4 May

let’s gooooo

Xiangyi Li@xdotli

Extremely excited to see SkillsBench being the benchmark repo that is fastest to reach 1k GitHub stars.
Within 2 months of release we've got:
* 1.1k stars
* 40 indexed benchmarks (60+ from our own tracker)
* 65% agent skills research now cite our paper
* cited 4 times by top model labs' release
All while being 1) first time writing a paper and 2) working full-time as a founder.
Check out our repo and how we did it in comments

103

Kobe@kobe0938·1 May

@gmi_cloud @WorkOS likewise. promise me to bring Thai tea back next time, will you?

GMI Cloud@gmi_cloud·1 May

Throwback to last night's Claws Out 🦞 meetup at @WorkOS HQ. Two things stood out: enterprise security for agents, and agent memory. Digging deeper into both. 🤫 building something quietly here. something big. something agentic. Thanks to our speakers and builders who showed up

1.7K

Kobe@kobe0938·27 Apr

bro’s told me this story a hundred times and i never get tired of it. give this man a round of applause 👏

Hanchen Li@lihanc02

I think one has to be working for @lmcache to understand in 2025 June Lol

122

Kobe@kobe0938·23 Apr

@xdotli ofc😃

Xiangyi Li@xdotli·23 Apr

@kobe0938 gonna hold you to handle some of our issues 😌

Kobe@kobe0938·23 Apr

stay tuned

Xiangyi Li@xdotli

SkillsBench is now cited by HY-3 model card. Congrats to @TencentHunyuan on the launch and kudos to the SkillsBench team / community! We've made a lot of improvements to the tasks, codebase, and tooling in the past month, based on feedback from users and lab partners. We will also share an updated leaderboard with more models and agent harnesses soon, stay tuned!

238

Kobe reposted

Alex Shaw@alexgshaw·23 Apr

OpenAI@OpenAI

Introducing GPT-5.5 A new class of intelligence for real work and powering agents, built to understand complex goals, use tools, check its work, and carry more tasks through to completion. It marks a new way of getting computer work done. Now available in ChatGPT and Codex.

Kobe reposted

Hanchen Li@lihanc02·22 Apr

@danieljwkim @AkariAsai @AIatMeta @anirudhg9119 Interesting work! Here is something on the same line about aggregation of parallel agent traces: arxiv.org/abs/2604.04247

933

Kobe@kobe0938·22 Apr

tested with same prompt on chatgpt images v1.5 vs v2. Big Jump.
1. Chinese characters are finally clear now
2. face looks more real and natural
3. buttons/icons/comments look consistent
4. fewer weird artifacts, livestream UI is much more coherent
5. overall feels less “AI-generated” and more of a screenshot
prompt: "generate a screenshot of a beautiful woman live-streaming on Douyin."

Kobe@kobe0938

@lihanc02 before (left) and after (right), if you ask me i definitely prefer GPT-Image-2 more

155
