Margins

27 posts

@themarginguy

https://t.co/rKSVcown3Q

Joined December 2025

5 Following · 817 Followers

Margins@themarginguy·1d

We are seeing a sustained, statistically significant degradation in Claude Code with Opus 4.7 since last Friday, May 22nd, at marginlab.ai/trackers/claud…

771

116.8K

Margins@themarginguy·6d

We identified an issue with our Codex test harness that caused a subset of test instances after May 9th to use gpt-5.5-high instead of gpt-5.5-xhigh. This issue exclusively affected our Codex tracker (marginlab.ai/trackers/codex/) and has since been resolved.

932

Margins@themarginguy·11 Mar

There is more to the story:
1. Agree on the lessons, eval infra is critical
2. Our testing is robust to infra. We re-try any test instance that fails to generate a patch at marginlab.ai/trackers/claud…
3. In spite of (2), we've detected statistical degradation for CC several times

Thariq@trq212

really good post on why agentic coding evals are so noisy and might vary by several percentage points between runs

2.1K
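The retry policy mentioned in the post above (re-running any test instance that fails to generate a patch) could look roughly like the sketch below. This is not MarginLab's harness: `run_with_retries`, the `run_instance` callable, and the `max_attempts` default are hypothetical names and values chosen for illustration.

```python
from typing import Callable, Optional


def run_with_retries(
    run_instance: Callable[[str], Optional[str]],  # hypothetical: returns a patch or None
    instance_id: str,
    max_attempts: int = 3,  # assumed limit, not stated in the post
) -> Optional[str]:
    """Return the first non-empty patch, or None if every attempt fails."""
    for _ in range(max_attempts):
        patch = run_instance(instance_id)
        if patch:  # an empty or missing patch is treated as an infra failure
            return patch
    return None
```

Retrying only on a missing patch keeps infrastructure flakiness out of the score while still counting a wrong patch as a genuine failure.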

Margins@themarginguy·5 Mar

GPT 5.4 remains behind 5.3 Codex on Terminal Bench

1.7K

Margins@themarginguy·5 Mar

GPT-5.4!! SWE-Bench Pro scores:

1.3K

Margins@themarginguy·5 Mar

@ThePrimeagen Correct, this is an all-time low since Opus 4.6 launched. Any sustained drop today will tip our tracker into statistically significant degradation territory

718

Margins reposted

ThePrimeagen@ThePrimeagen·5 Mar

Claude Opus 4.6 had its worst benchmark day yesterday

222

1.9K

717.5K

Margins reposted

ThePrimeagen@ThePrimeagen·5 Mar

Here you go and you are welcome: marginlab.ai/trackers/claud…

@levelsio@levelsio

We need some whistleblowers to confirm our gut feeling they do this, it's getting annoying. If no whistleblowers show up we can all sleep in peace again

459

114.7K

Margins@themarginguy·5 Mar

Our tracker is also showing a large drop in token usage and tool calls from Opus 4.6 + Claude Code. We are investigating whether this is due to a recent change that decreased the default effort in the Claude Code harness

653

Margins@themarginguy·5 Mar

Opus 4.6 in Claude Code has reached an all-time low since launch on our daily benchmarks. Any sustained decrease today will tip our tracker into statistically significant degradation territory

2.1K

Margins@themarginguy·27 Feb

@alexatallah We are building an agent harness benchmarking platform at marginlab.ai. We use it internally to run our daily benchmarks, e.g. Claude Code benchmarks (marginlab.ai/trackers/claud…). Coming soon™

110

Alex Atallah@alexatallah·27 Feb

What is the best benchmark for agent harnesses?

15.7K

Margins@themarginguy·10 Feb

We are also now tracking a bunch of other metrics: input/output token count, API cost, average runtime, and tool call counts

450

Margins@themarginguy·10 Feb

Claude Code with Opus 4.6 Degradation Tracker is live at marginlab.ai/trackers/claud…
See historical data (including Opus 4.5) under marginlab.ai/trackers/claud…
Tracker for Codex with gpt-5.3-codex available too!

1.3K

Margins@themarginguy·10 Feb

@levelsio Check out our trackers at marginlab.ai/trackers/claud… Daily evals on Claude Code/Codex

752

@levelsio@levelsio·10 Feb

My secret conspiracy theory about AI companies is they nerf models to save on compute. Then they check X to see if anyone notices it. If yes, give back compute. If not, continue

@levelsio@levelsio

I have never experienced a more dumb Claude Code than today, I have to start coding myself again cause it makes so many Low IQ mistakes, they must be nerfing it now that Opus 4.6 is out or something is up

455

134

4.8K

619.2K

Margins@themarginguy·5 Feb

gpt-5.3-codex vs Opus 4.6: big jump for OpenAI!

895

Margins@themarginguy·5 Feb

Opus 4.6 released: big jump on Terminal-Bench-2.0, no improvement on SWE-Bench Verified

776

Margins@themarginguy·3 Feb

Claude Code continues to show a statistically significant degradation over the past month: marginlab.ai/trackers/claud…

1.4K

Margins@themarginguy·1 Feb

@jxmnop This is the MarginLab Claude Code tracker. We run a subset of SWE-Bench-Pro on Claude Code daily and report results

274

Jack Morris@jxmnop·1 Feb

the real alignment scare from this week: a subtle bug in claude code made the model much dumber. software engineers worldwide were significantly less productive for two days
not just one anecdote, many people have been telling me this
it's fixed now apparently

276

31.1K

Margins@themarginguy·15 Jan

Coding agent evals are broken. See the official SWE-Bench-Pro leaderboard results compared to the Opus 4.5 system card below. What explains the difference? SWE-Bench-Pro leaderboard uses a minimal scaffold, which is not how models are actually used.

494

Margins@themarginguy·13 Jan

Claude Code Degradation Tracker: Seeing a sustained (but not statistically significant) drop in performance over the past few days. See details here: marginlab.ai/trackers/claud…

355
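To make the distinction between a sustained drop and a statistically significant one concrete, here is a minimal sketch of one common check, a one-sided two-proportion z-test on pass counts. The posts do not say which test the tracker uses, so the method, the function name `degradation_p_value`, and the example counts are all assumptions for illustration.

```python
import math


def degradation_p_value(base_pass: int, base_n: int,
                        recent_pass: int, recent_n: int) -> float:
    """One-sided p-value for 'recent pass rate is lower than baseline'.

    Assumed method: pooled two-proportion z-test (not confirmed by the source).
    """
    p_base = base_pass / base_n
    p_recent = recent_pass / recent_n
    pooled = (base_pass + recent_pass) / (base_n + recent_n)
    se = math.sqrt(pooled * (1 - pooled) * (1 / base_n + 1 / recent_n))
    if se == 0:
        return 1.0  # identical all-pass or all-fail windows: no evidence of a drop
    z = (p_recent - p_base) / se
    # Standard normal CDF at z; small values mean the drop is unlikely to be noise
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))


# Hypothetical counts: 60/100 passes at baseline vs 40/100 in the recent window
print(degradation_p_value(60, 100, 40, 100) < 0.05)
```

With small daily subsets a drop of a few points rarely clears a 0.05 threshold, which is why a drop can be sustained for days before it counts as significant.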
