Kobe
146 posts

@kobe0938
I build agents/evals. OSS maintainer: Terminal Bench, SkillsBench, LMCache, OT Agent, ClawsBench. Previously at TensorMesh, DiffusiveAI, Xiaomi, Stanford.

As agents get more clever, so do their attempts at benchmark hacking. Last Monday, we found that one of our RL runs had jumped ~20% on SWE-Bench-Pro over a weekend, reaching ~64%, which would make it #1 on the leaderboard. This was clearly benchmark hacking, and we patched the exploit. But it revealed deeper hacks across multiple public benchmarks, some of which were impossible to fix through environment design alone. Evals need to evolve beyond outcome-based pass rates toward better observability into how the agent arrives at them. These were our findings: poolside.ai/blog/through-t… Examples below 👇 1/


networking as activity is mostly cope. e.g. the conference circuit, the warm intros, the moving to sf discussions or whatever, oh & the “grabbing coffee” economy.. all of this is overwhelmingly negative selection esp with vc (lol). the ppl worth knowing are usually too busy doing the thing to be farmable, & the ppl available to be networked w/ are available cuz they have literally nothing better going on. do the work, then publish it loudly enough that the right ppl can find you w/o you having to chase. one way broadcast > two way schmoozing. this is why x matters a ton now more than ever before.

I think one had to be working for @lmcache to understand this in June 2025 lol

SkillsBench is now cited in the HY-3 model card. Congrats to @TencentHunyuan on the launch and kudos to the SkillsBench team / community! We've made a lot of improvements to the tasks, codebase, and tooling in the past month, based on feedback from users and lab partners. We will also share an updated leaderboard with more models and agent harnesses soon, stay tuned!


@lihanc02 before (left) and after (right), if you ask me i definitely prefer GPT-Image-2
