Shaun Smith

2.4K posts

@evalstate

https://t.co/rA1UoojwhN https://t.co/76p6mDAfej

United Kingdom · Joined July 2024

889 Following · 1.1K Followers

Shaun Smith@evalstate·4h

@onusoz I'll tell you in a few hours 😉

Onur Solmaz@onusoz·4h

@evalstate how does ds4 flash 0731 compare?

Shaun Smith@evalstate·7h

This. is. insane. $0.04 per trial. Last weekend I benched Grok 4.5 at 80.5% on tb-2.1 for $159 and was impressed. Today gpt-5.6-luna has beaten it: with flex tier pricing the run cost less than $20. Industry economics changed this week.

Shaun Smith@evalstate·6h

@xeophon No - not fast. But I've not had any weird errors or disconnects [yet].

Florian Brand@xeophon·6h

@evalstate ehhhh, not the fastest for us tbh

Florian Brand@xeophon·6h

i love benching the whale

Shaun Smith@evalstate·7h

@andrew_n_carr Agreed. I've completely changed the way I use subagents these last 2 weeks; for conversational stuff big models are a total waste now.

Andrew Carr 🤸@andrew_n_carr·17h

My guess is that the models are getting harder to understand because they're being optimized for sub agent spawning and communication. We are not the intended audience

Shaun Smith@evalstate·7h

fast-agent model string: responses.gpt-5.6-luna?reasoning=max&service_tier=flex Migrate your automations here: fast-agent.ai/guides/migrate…
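The model string above reads like a target name followed by URL-style query parameters. A minimal sketch of pulling it apart, assuming the text before the first dot names the provider or API and the rest of the target is the model; the helper `split_model_string` is hypothetical and is not fast-agent's own parser:

```python
from urllib.parse import parse_qs


def split_model_string(spec: str) -> tuple[str, str, dict[str, str]]:
    """Split 'provider.model?key=value&...' into provider, model and options.

    Assumption: the segment before the first '.' is the provider/API name.
    This is an illustrative helper, not part of fast-agent.
    """
    target, _, query = spec.partition("?")
    provider, _, model = target.partition(".")
    # parse_qs returns a list per key; keep the last value given for each.
    options = {key: values[-1] for key, values in parse_qs(query).items()}
    return provider, model, options


provider, model, options = split_model_string(
    "responses.gpt-5.6-luna?reasoning=max&service_tier=flex"
)
print(provider, model, options)
```

Splitting on the first dot only keeps the dots inside the model name (`gpt-5.6-luna`) intact.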

Shaun Smith@evalstate·7h

Full ATIF trajectories available in harbor, leaderboard submission PR#184. Cost taken from API Key used for the trial. fast-agent cache efficiency was 93.83%, 1.75% of prompt tokens were priced at long context rates.

Shaun Smith@evalstate·8h

Great overview (as always) from @simonw on MCP. The protocol is reinvigorated. A single URL reliably connects, authenticates and adds capabilities almost universally - it's hard to beat.

Simon Willison@simonw

The new stateless MCP specification has rekindled my interest in MCP, and inspired some new projects, including mcp-explorer and datasette-mcp simonwillison.net/2026/Jul/31/st…

Shaun Smith@evalstate·8h

@SarahLacard Yep, my estimate for a TB 2.1 run is about $10. I'm currently figuring out how reproducible the published score is and doing a couple of reasoning level sweeps.

Sarah 🇨🇦 🏳️‍⚧️@SarahLacard·8h

@evalstate oh i haven't gotten that fancy with the sauce - only using it in opencode, it's a monster, and cheaper than luna from what i can tell

Shaun Smith@evalstate·9h

omw... this was $20 well spent.

Shaun Smith@evalstate·8h

@SarahLacard I have, and am on that now 😀. Very nice they are delivering it over the Responses API too (but no WebSockets).

Sarah 🇨🇦 🏳️‍⚧️@SarahLacard·8h

@evalstate have you tried deepseek v4 flash?

Shaun Smith@evalstate·1d

@Infoxicador MCP is Lindy.

Ruben Casas 🦊@Infoxicador·1d

MCP is here to stay

Shaun Smith@evalstate·1d

@reach_vb Cheers @reach_vb , next round is on me. And I am partial to negroni.

Vaibhav (VB) Srivastav@reach_vb·1d

I love my team, colleagues and job! It’s been so much fun the last couple days - just letting my ideas run free and building things I care about!! Oh, what a privilege it is that I get to do what I do \o/ Yes, this post is brought to you by two pints and a negroni

Shaun Smith@evalstate·1d

@Infoxicador Yeah, it's on the official leaderboard but with an earlier version of codex. Token drain is real; I think fast-agent is about half the cost of OpenAI's figures in their report.

Ruben Casas 🦊@Infoxicador·1d

@evalstate Surely that codex one is an outlier right? 2k!

Shaun Smith@evalstate·2d

Efficiency matters you say?

Shaun Smith@evalstate·1d

@Infoxicador @reach_vb Real skill is correct use of "pal".

Ruben Casas 🦊@Infoxicador·1d

@evalstate @reach_vb Next level is Geezer

Vaibhav (VB) Srivastav@reach_vb·1d

One of the things I love most about this video is how well it captures the ethos of OpenAI and what drives the people behind it to wake up every day and deliver some of their life’s best work. Jason really put his heart and soul into this one. Kudos, mate!

jason@jxnlco

one of the most beautiful things about OpenAI is that every employee really has a voice. i wanted to capture what it feels like to work here, what our mission means to me, and why you should join us. so i made this video with Codex, shared it with the team, and they felt it was worth producing and sharing with the world. this is our mission. and it’s why i’m here.

Shaun Smith@evalstate·1d

@liran_tal Yep, but when tokens are that cheap... (aesthetically it's horrible, but I think after using GLM for a bit we should all switch reasoning traces off for good)!

Liran Tal@liran_tal·1d

@evalstate Interesting! Wouldn't heavy token cost directly contribute towards higher spend though? (not across pricing models but across same tier in general) Like for example Opus 4.8 and seems like Sonnet 5 too are incredibly verbose in thinking process

Liran Tal@liran_tal·1d

Wait what 5.6 Luna is scoring higher than Sonnet 5 ???

Shaun Smith reposted

merve@mervenoyann·1d

Thinking Machines released Inkling Small (🦖) + NVFP4 12B active 276B total params, the model performs better than larger Inkling on coding 🤯 > check out our blog covering benchmarks, performance and deployment huggingface.co/blog/thinkingm… > huggingface.co/collections/th…

Shaun Smith@evalstate·1d

@un3valuated Yeah, I'll post some terra benchmarks tomorrow. Sonnet is buried, Flash is buried and at $15 output Moonshot doesn't compete here. Grok is the one to watch now. What a time.

short circuit@un3valuated·1d

@evalstate My half cheap half intelligent model was Gemini 3 flash for last 8 months, now Luna costs in half. Rest in peace bro, you'll be missed.

Shaun Smith@evalstate·1d

Holy shi_. Assuming flex tier pricing still holds this is insanely aggressive.

OpenAI@OpenAI

We are committed to pushing the model frontier across cost efficiency, capability, and speed. Starting today, we are reducing prices for GPT-5.6 Luna by 80% and GPT-5.6 Terra by 20% , and offering a faster option for GPT-5.6 Sol in the API. Luna and Terra’s lower prices are reflected in how usage is counted in Codex and ChatGPT Work, so your usage goes further.
