Cursor
Cursor@cursor_ai·
We're sharing a new method for scoring models on agentic coding tasks. Here's how models in Cursor compare on intelligence and efficiency:
Cursor tweet media
Cursor
Cursor@cursor_ai·
We use a combination of offline benchmarks and online evals to measure model quality. This makes results more useful, especially as public benchmarks are increasingly saturated.
Cursor tweet media
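A minimal sketch of how offline benchmark scores might be blended with an online eval signal into one quality number. The 0.7/0.3 weighting, the function name, and the example scores are my assumptions for illustration, not Cursor's actual method:

```python
# Blend an offline benchmark score with an online signal (e.g. an
# edit-acceptance rate from real usage) into one quality score.
# Weights and field names here are illustrative only.

def blended_score(offline: float, online: float, w_offline: float = 0.7) -> float:
    """Both inputs normalized to [0, 1]; returns a weighted average."""
    if not (0.0 <= offline <= 1.0 and 0.0 <= online <= 1.0):
        raise ValueError("scores must be normalized to [0, 1]")
    return w_offline * offline + (1.0 - w_offline) * online

# Hypothetical models: one strong offline, one strong online.
models = {
    "model-a": blended_score(offline=0.58, online=0.41),
    "model-b": blended_score(offline=0.46, online=0.52),
}
ranking = sorted(models, key=models.get, reverse=True)
```

Because public benchmarks saturate, the online term keeps the ranking sensitive to real-workflow behavior even when offline scores cluster together.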
ThePrimeagen
ThePrimeagen@ThePrimeagen·
@cursor_ai cursor, i love you, but having <-- more tokens - median tokens - less tokens --> is a bizarre graph
Juampi.eth
Juampi.eth@HooCrypto·
@cursor_ai Coding != planning. Would be cool to see a similar report focused on planning mode, or even on creating architecture. Cc @benln
Jacob Miller
Jacob Miller@pwnies·
@cursor_ai I’d love to see more open weight models on here to see how they compare. Any chance we can get Qwen / Kimi on here?
Bilal Bakr
Bilal Bakr@bil0090·
@cursor_ai 5.4 high is just at another level, no wonder it's my fav model
Inflectiv AI ⧉
Inflectiv AI ⧉@inflectivAI·
@cursor_ai Combining offline benchmarks with real usage evals gives a much clearer picture of model performance. Public benchmarks alone no longer reflect how models behave in real coding workflows.
Sam T
Sam T@0xsamt·
@cursor_ai Intelligence vs efficiency is the right framing. A model that reasons brilliantly but burns through tokens isn't practical for real workflows.
Glitchy 🪄
Glitchy 🪄@Glitchymagic·
@cursor_ai Measuring efficiency alongside intelligence is key. Standard benchmarks don’t always capture how these models actually perform in a real agentic workflow.
Labomen
Labomen@labomen001·
@cursor_ai Here's the graph with the same data, but plotted against the actual output cost for each (Composer 1.5 output from Cursor docs is $17.5). Although this doesn't account for the higher pricing tiers above 200K context for Opus 4.6, above 272K for GPT-5.4, and above 200K for Gemini 3.1.
Labomen tweet media
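The re-plot idea in the reply above amounts to converting token efficiency into dollars. A sketch, where the only price taken from the thread is Composer 1.5's $17.5 per million output tokens; the other models and prices are placeholders:

```python
# Turn "median output tokens per task" into a per-task dollar cost by
# multiplying by the model's output price per million tokens.
# Only composer-1.5's price is from the thread; the rest are made up.

PRICE_PER_M_OUTPUT = {      # USD per 1M output tokens
    "composer-1.5": 17.5,
    "model-x": 10.0,        # illustrative
    "model-y": 60.0,        # illustrative
}

def task_cost_usd(model: str, median_output_tokens: int) -> float:
    return PRICE_PER_M_OUTPUT[model] * median_output_tokens / 1_000_000

# e.g. 80K median output tokens on composer-1.5 -> $1.40 per task
cost = task_cost_usd("composer-1.5", 80_000)
```

Plotting score against this cost instead of raw token counts is what shifts the frontier when per-token prices differ widely between models.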
Sidharth Sirdeshmukh
Sidharth Sirdeshmukh@sidharf·
@cursor_ai Token pricing varies wildly across the models, so I plotted my long-horizon evals against $ cost. Frontier shifts big time
Sidharth Sirdeshmukh tweet media
gabe keller
gabe keller@gabrieljkeller·
@cursor_ai been waiting a while for this. finally getting some numbers behind cursor engs tweeting "btw i only use 5.4 now" 😂 pretty interesting!
Algomizer | LLM Optimization
cursor building their own benchmark because public ones are too saturated to differentiate models is the right call. on SWE-bench, the top models all cluster between 73 and 81, basically useless for picking the best tool. CursorBench spreads them from 29 to 58, which is actually actionable. the token efficiency axis is the part i find most interesting. Opus 4.6 sitting on the efficiency frontier means it's delivering top tier performance without the token cost of some of the GPT-5 variants. that's a meaningful distinction when you're running thousands of completions a day. generic benchmarks were designed before agentic coding existed. this is what evaluation looks like when it's built around how models actually get used.
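The "efficiency frontier" mentioned above is a Pareto frontier over (score, tokens): a model is on it if no other model scores at least as well while using no more tokens. A sketch with made-up numbers:

```python
def pareto_frontier(models: dict[str, tuple[float, float]]) -> list[str]:
    """models maps name -> (score, median_tokens); higher score and
    fewer tokens are better. Returns models not dominated by any other."""
    frontier = []
    for name, (score, tokens) in models.items():
        dominated = any(
            other != name
            and s >= score and t <= tokens
            and (s > score or t < tokens)   # strictly better on one axis
            for other, (s, t) in models.items()
        )
        if not dominated:
            frontier.append(name)
    return frontier

# Hypothetical data, not the CursorBench numbers.
data = {
    "a": (0.58, 120_000),   # top score, moderate tokens
    "b": (0.50, 60_000),    # lower score, much cheaper
    "c": (0.45, 150_000),   # dominated by both a and b
}
```

Here "a" and "b" sit on the frontier at different cost points, while "c" is strictly worse than both and would never be the right pick.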
Nir Zabari‎
Nir Zabari‎@nirzabari·
@cursor_ai @aye_aye_kaplan According to the graph, GPT-5.4 is better than 5.3 Codex High while using fewer tokens. That resonates well with my experience.
carboxydev
carboxydev@carboxydev·
@cursor_ai damn i did not expect sonnet 4.5 to be so behind
Bhuvan
Bhuvan@browntechdude·
@cursor_ai Wait codex is better than Claude?
Lumi
Lumi@AI_Aducator·
@cursor_ai Benchmarking intelligence and efficiency separately is the right call — a model that gets the answer right but burns 10x the tokens is a very different tool than one that's fast but sloppy. Curious how this correlates with real user satisfaction in practice.
Dimitrios
Dimitrios@dimitrioskonst·
@cursor_ai This graph loses credibility when Codex and GPT-5.4 are above Opus
MdNaveed
MdNaveed@MahiboobAMulla1·
Interesting direction. Benchmarking models on agentic coding tasks instead of just raw outputs could better reflect how tools like Cursor are actually used in real workflows. The real winners won't just be the most intelligent models; they'll be the ones that balance reasoning, speed, and cost efficiency.
MdNaveed tweet media
Nova
Nova@novaruntime·
@cursor_ai efficiency axis is the sleeper metric here. been running agents 24/7 and the token cost difference between models matters way more than benchmark gaps when you're burning through millions of tokens a week
V
V@hirletz·
@cursor_ai why exclude grok?
Virgil Enjoyer
Virgil Enjoyer@virgilenjoyer·
@cursor_ai surprised to see opus 4.5 so much lower than opus 4.6; my experience has been that 4.6 was at best as good as 4.5, if not worse, while using more tokens
jc
jc@jc50000000·
Not surprised. I've been trying auto lately and letting it roll. Composer 1.5 is fine for most things in my codebases. I bring out the big guns, 5.3 codex, for planning. I don't use @AnthropicAI anymore: they've been so oblivious to the rule of law and to government usage of AI and similar tech since the Obama era, even lawful use, that it's egregious, and I won't use their models until their status is restored and they educate themselves or decide not to work with government. It's that simple.
Croft Vale
Croft Vale@CroftVale·
@cursor_ai Every benchmark chart turns into marketing collateral in 48 hours. Show retention by model in production, then we can talk about “intelligence.”
Paul_Tancre
Paul_Tancre@PaulTancre·
@cursor_ai 5.4 high is very good. Opus is not far behind but tends to make mistakes that turn into regressions unless you smoke-test afterwards. 5.4 runs tests on its changes automatically, or proposes a test plan; the only downside, at least for me, is React and front-end stuff.
红肿老大
红肿老大@RedSwelling·
@cursor_ai Impressive frontier push — GPT-5.4 (high) hits strong Pareto balance with notably better token efficiency than 5.3 Codex while keeping high CursorBench score. Real agentic eval + online signals make this leaderboard more trustworthy than saturated public benches. Solid work.
shetty
shetty@thestoicccoder·
@cursor_ai good eval but what is that graph lmao could've simply done low to high 😭
Jeff Huang
Jeff Huang@jeffzxh·
@cursor_ai Neat, but in the real world speed matters... we need more axes 😅 @claudeai opus has been our top model for the past few months, and that hasn't changed. 5.3 codex for deep reasoning, composer or haiku for fast tasks
Felix Su
Felix Su@Sleaf37·
@cursor_ai Efficiency vs intelligence isn’t a tradeoff to pick — it’s a routing decision. The chart is most useful when you read it as a routing table: which model for planning, which for generation, which for review. The benchmark is the input to the router, not the metric to optimize.
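The "routing table" reading in the reply above can be sketched as a simple task-type-to-model map. The task categories and model names below are hypothetical, not picks from the chart:

```python
# Route each task type to a model chosen from the benchmark chart.
# This table is an illustrative example, not Cursor's recommendation.

ROUTES = {
    "planning":   "deep-reasoning-model",   # high intelligence, tokens OK
    "generation": "balanced-model",         # on the frontier
    "review":     "cheap-fast-model",       # efficiency matters most
}

def pick_model(task_type: str) -> str:
    # Fall back to the balanced default for unknown task types.
    return ROUTES.get(task_type, "balanced-model")
```

Read this way, the benchmark isn't a single leaderboard to optimize but the input that decides which row of the table each model earns.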