Kevin Huang (Wenqi)
782 posts

Kevin Huang (Wenqi)
@winkey_h
Founding Research @datacurve DeepSWE ∗ prev @openai, @nvidia, @tesla, @uwaterloo cs 26

Opus 4.8 is now on DeepSWE. On the default high thinking effort, it scores 6% higher than Opus 4.7 xhigh, while also lowering average cost per task.


official confirmation that the claude code harness has become slop btw



Today we’re releasing DeepSWE, a new standard for agentic coding benchmarks. On public leaderboards, top models often look relatively close in capability. DeepSWE shows where they actually diverge, reflecting the realistic experience of developers in their day-to-day work.


Today we’re releasing DeepSWE, a new standard for agentic coding benchmarks. On public leaderboards, top models often look relatively close in capability. DeepSWE shows where they actually diverge, reflecting the realistic experience of developers in their day-to-day work.

Today we’re releasing DeepSWE, a new standard for agentic coding benchmarks. On public leaderboards, top models often look relatively close in capability. DeepSWE shows where they actually diverge, reflecting the realistic experience of developers in their day-to-day work.



Today we’re releasing DeepSWE, a new standard for agentic coding benchmarks. On public leaderboards, top models often look relatively close in capability. DeepSWE shows where they actually diverge, reflecting the realistic experience of developers in their day-to-day work.

Today we’re releasing DeepSWE, a new standard for agentic coding benchmarks. On public leaderboards, top models often look relatively close in capability. DeepSWE shows where they actually diverge, reflecting the realistic experience of developers in their day-to-day work.

Our investigation has revealed that the incident originated from a third-party AI tool with hundreds of users whose Google Workspace OAuth app was compromised. We recommend that Google Workspace Administrators check for usage of this app immediately. #indicators-of-compromise-iocs" target="_blank" rel="nofollow noopener">vercel.com/kb/bulletin/ve…


Honestly don't know what happened to Claude Code. Tried a one-off simple task on a fresh directory yesterday, tried a bunch of things that didn't work, asked for a ton of permissions, and then got stuck for 4 minutes before I got tired of waiting and killed the session. This was on medium effort. Switched to Codex gpt 5.4 with medium effort and one shotted the task in under 1 minute.

Just realized @nextjs adds this in the agents.md LOL

source code now available github.com/collaborator-a…


How come the NanoGPT speedrun challenge is not fully AI automated research by now?









