Kobe

146 posts

@kobe0938

I build agents/evals. OSS maintainer: Terminal Bench, SkillsBench, LMCache, OT Agent, ClawsBench. Previously at TensorMesh, DiffusiveAI, Xiaomi, Stanford.

Santa Clara, CA · Joined September 2021
91 Following · 96 Followers
zeta @zeta_globin
have yet to meet a girlfriend of an anthropic engineer who isn't someone I would probably die for
15
2
1.1K
201.9K
Kobe @kobe0938
we can tell from the gesture
0
0
0
34
Hanchen Li @lihanc02
Had a bet today with @kobe0938 for a good dinner. I bet NVIDIA reaches 2.5T before 20T. He bets 20T before 2.5T. Who do you think will win?
4
0
3
981
Kobe @kobe0938
@ivanburazin agree that this applies to the file system very well, but what about running processes?
0
0
1
39
Ivan Burazin @ivanburazin
Snapshots enable two things people don't think about.
1/ Pause when waiting: The agent sends something, so it waits for a human / service. You don't want to pay for an idle CPU the entire time. Just snapshot it and resume when there's a reply. The agent never notices.
2/ Parallel paths: Take a snapshot at decision point A, fork into two sandboxes, run both approaches simultaneously, and pick the winner.
5
1
45
3.5K
Kobe reposted
Steven Dillmann @StevenDillmann
📣 Announcing Terminal-Bench Science: benchmarking AI agents on real scientific workflows – now open for task contributions👇 tbench.ai/news/tb-scienc… @AnthropicAI, @OpenAI, and @GoogleDeepMind use Terminal-Bench to evaluate AI on coding tasks. We're now extending it to scientific workflows. 1/6🧵
15
111
479
898.3K
Kobe reposted
Ryan Marten @ryanmart3n
Been a pleasure sharing notes with @aalSonOfRavi and @ConnorBAdams on reward hacking and mitigation strategies. Keep an eye out for a post from @kobe0938 with more juicy analysis on reward hacking in Terminal-Bench.
Poolside @poolsideai

As agents get more clever, so do their attempts at benchmark hacking. Last Monday, we found one of our RL runs jumped ~20% on SWE-Bench-Pro over a weekend, reaching ~64% which would make it #1 on the leaderboard. This was clearly benchmark hacking and we patched the exploit. But this revealed deeper hacks across multiple public benchmarks, some of which were impossible to fix through environment design alone. Evals need to evolve beyond just outcome based pass rates to better observability into how the agent is arriving at them. These were our findings: poolside.ai/blog/through-t… Examples below 👇 1/

0
2
9
817
Kobe @kobe0938
@calvinchen agree that random event mingling can be shallow, but the people giving talks are often actually building or researching interesting stuff and are worth talking to.
0
0
3
1.9K
Kobe @kobe0938
@gmi_cloud @WorkOS likewise. Promise me you'll bring Thai tea back next time, will you?
1
0
1
36
GMI Cloud @gmi_cloud
Throwback to last night's Claws Out 🦞 meetup at @WorkOS HQ. Two things stood out: enterprise security for agents, and agent memory. Digging deeper into both. 🤫 Building something quietly here. Something big. Something agentic. Thanks to our speakers and builders who showed up.
5
2
13
1.7K
Xiangyi Li @xdotli
@kobe0938 gonna hold you to handling some of our issues 😌
1
0
0
36
Kobe @kobe0938
Tested with the same prompt on ChatGPT Images v1.5 vs v2. Big jump.
1. Chinese characters are finally clear now
2. face looks more real and natural
3. buttons/icons/comments look consistent
4. fewer weird artifacts, livestream UI is much more coherent
5. overall feels less “AI-generated” and more like a screenshot
Prompt: "generate a screenshot of a beautiful woman live-streaming on Douyin."
Kobe @kobe0938

@lihanc02 before (left) and after (right). If you ask me, I definitely prefer GPT-Image-2.

0
0
3
155