Simon Smith
5.8K posts

@_simonsmith
EVP Generative AI @klickhealth





We're shipping a new feature in Claude Cowork as a research preview that I'm excited about: Dispatch! One persistent conversation with Claude that runs on your computer. Message it from your phone. Come back to finished work. To try it out, download Claude Desktop, then pair your phone.






it has long been said that model naming is AGI-complete at long last

Introducing Durable. The first AI business builder that replaces your 9-5 income. RT + comment “Durable” and we'll build your business for FREE.



The WSJ is reporting that OpenAI is about to take a hard turn into enterprise.





How often are users unhappy with the answers that top AI models give? This gives us a real-world view of how the frontier shifts over time. We looked back to 2023 and traced how often users rated both responses in Battle Mode as bad (limited to the Top 25 models at any given time).

We see three eras:
- Pre-reasoning: Responses were rated ‘both bad’ >15% of the time, and the rate was not declining
- Early reasoning: Models like o1-preview started to noticeably shift performance, reducing the rate to ~10%
- Advanced reasoning: Continued reduction in the % of ‘both bad’ ratings with increasingly advanced models

The metric is not close to saturation: even among models in the Top 25, users still rate ~9% of responses as ‘both bad’, so the models are not yet meeting users' quality expectations.
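For readers who want to see the arithmetic, here is a minimal sketch of how a ‘both bad’ rate per period could be computed. The post does not describe the actual pipeline, so the function name `both_bad_rate`, the `(period, outcome)` vote format, and the sample votes are all assumptions made up for illustration.

```python
from collections import defaultdict

def both_bad_rate(votes):
    """Return {period: share of votes labelled 'both_bad'}.

    `votes` is an iterable of (period, outcome) pairs. This format is
    hypothetical, not the leaderboard's real schema.
    """
    totals = defaultdict(int)
    both_bad = defaultdict(int)
    for period, outcome in votes:
        totals[period] += 1
        if outcome == "both_bad":
            both_bad[period] += 1
    return {period: both_bad[period] / totals[period] for period in totals}

# Made-up votes for illustration only; these are not real data.
sample_votes = (
    [("2023", "both_bad")] * 1 + [("2023", "a_wins")] * 3
    + [("2025", "both_bad")] * 1 + [("2025", "b_wins")] * 9
)
rates = both_bad_rate(sample_votes)
```

A real analysis would also need to restrict each period to battles between models ranked in the Top 25 at that time, which this sketch leaves out.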


