infrecursion



OPENAI TO MERGE CHATGPT, CODEX APP & BROWSER INTO DESKTOP SUPERAPP TO STREAMLINE RESOURCES & USER EXPERIENCE - WSJ



At this rate everyone’s gonna have their own app and zero users.



I have eyeballed some of the outputs and imho the results are not as clear-cut as presented (and that's an understatement). For a lot of outputs, it's very debatable what the proper way of categorizing them is. It seems to me that each model has its own default style when answering a bullshit question, and this style is surprisingly consistent across different questions. What you're mainly measuring is the preference each judge has for a particular writing style. My other complaint is that there's no diversity in the questions: you're just embedding random unrelated or made-up terminology in a somewhat valid short question.


BullshitBench v2 is out! It is one of the few benchmarks where models are generally not getting better (except Claude) and where reasoning isn't helping.

What's new: 100 new questions by domain (coding (40 Qs), medical (15), legal (15), finance (15), physics (15)), 70+ model variants tested. BullshitBench is already at 380 stars on GitHub - all questions, scripts, responses and judgements are there, so check it out.

TL;DR:
- Results replicated
- @AnthropicAI latest models are scoring exceptionally well
- @Alibaba_Qwen is another very strong performer
- OpenAI and Google models are not doing well and are not improving
- Domains do not show much difference - rates of BS detection are about the same across all domains
- Reasoning, if anything, has a negative effect
- Newer models don't do that much better than older ones (except Anthropic)

Links:
- Data explorer: petergpt.github.io/bullshit-bench…
- GitHub: github.com/petergpt/bulls…

Highly recommend the data explorer, where you can study the data, the questions, and sample answers.












Exclusive: OpenAI’s top executives are finalizing plans for a major strategy shift to refocus the company around coding and business users on.wsj.com/3N6CFyr

























