Margins

27 posts

@themarginguy

https://t.co/rKSVcown3Q

Joined December 2025
5 Following · 817 Followers
Margins @themarginguy
We are seeing a sustained, statistically significant degradation in Claude Code with Opus 4.7 since last Friday, May 22nd, at marginlab.ai/trackers/claud…
[image]
34
27
771
116.8K
Margins @themarginguy
We identified an issue with our Codex test harness that caused a subset of test instances after May 9th to use gpt-5.5-high instead of gpt-5.5-xhigh. This issue exclusively affected our Codex tracker (marginlab.ai/trackers/codex/) and has since been resolved.
1
0
4
932
Margins @themarginguy
GPT-5.4 remains behind gpt-5.3-codex on Terminal-Bench.
[image]
2
0
6
1.7K
Margins @themarginguy
GPT-5.4!! SWE-Bench-Pro scores:
[image]
0
0
5
1.3K
Margins @themarginguy
@ThePrimeagen Correct, this is an all-time low since Opus 4.6 launched. Any sustained drop today will tip our tracker into statistically significant degradation territory.
0
0
4
718
Margins reposted
ThePrimeagen @ThePrimeagen
Claude Opus 4.6 had its worst benchmark day yesterday.
[image]
222
42
1.9K
717.5K
Margins @themarginguy
Our tracker is also showing a large drop in token usage and tool calls from Opus 4.6 + Claude Code. We are investigating whether this is due to a recent change to decrease the default effort in the Claude Code harness.
[2 images]
1
0
5
653
Margins @themarginguy
Opus 4.6 in Claude Code has reached an all-time low since launch on our daily benchmarks. Any sustained decrease today will tip our tracker into statistically significant degradation territory.
[image]
6
1
16
2.1K
Alex Atallah @alexatallah
What is the best benchmark for agent harnesses?
46
0
93
15.7K
Margins @themarginguy
We are also now tracking a bunch of other metrics: input/output token count, API cost, average runtime, and tool call counts.
[image]
0
0
1
450
Margins @themarginguy
gpt-5.3-codex vs. Opus 4.6: big jump for OpenAI!
[image]
1
1
7
895
Margins @themarginguy
Opus 4.6 released: big jump on Terminal-Bench-2.0, no improvement on SWE-Bench Verified.
[image]
10
0
0
776
Margins @themarginguy
@jxmnop This is the MarginLab Claude Code tracker. We run a subset of SWE-Bench-Pro on Claude Code daily and report results.
0
0
1
274
Jack Morris @jxmnop
the real alignment scare from this week: a subtle bug in claude code made the model much dumber. software engineers worldwide were significantly less productive for two days. not just one anecdote, many people have been telling me this. it's fixed now apparently
[image]
20
3
276
31.1K
Margins @themarginguy
Coding agent evals are broken. See the official SWE-Bench-Pro leaderboard results compared to the Opus 4.5 system card below. What explains the difference? The SWE-Bench-Pro leaderboard uses a minimal scaffold, which is not how models are actually used.
[2 images]
0
0
1
494
Margins @themarginguy
Claude Code Degradation Tracker: Seeing a sustained (but not statistically significant) drop in performance over the past few days. See details here: marginlab.ai/trackers/claud…
[image]
0
0
2
355