
A shocking fact has emerged from the latest LMArena benchmark: GPT-4o ranks first among all OpenAI models in Multi-Turn performance, scoring nearly 30 points higher than the current flagship model GPT-5.2 (Figure 1). As a model released nearly two years ago, 4o continues to dominate all its successors in blind Multi-Turn testing. This powerfully demonstrates 4o's irreplaceable value in everyday conversation and humanities work. #keep4o

A model's performance in multi-turn dialogue reflects far more than single-response intelligence. It reveals conversational coherence, context tracking, persona consistency, cumulative understanding of user intent, and naturalness throughout the interaction. 4o's dominance over later models in Multi-Turn reveals several key capabilities.

First, conversational memory and coherence. 4o excels at remembering context and maintaining logical continuity across multiple exchanges. Many newer models may deliver impressive single-turn responses, yet fail to naturally reference earlier content in extended conversations, forcing users to repeatedly re-explain themselves.

Second, conversational intuition. 4o demonstrates finer sensitivity to users' implicit intentions, emotional shifts, and conversational rhythm. A strong Multi-Turn model can read between the lines by drawing on prior context. When a user corrects something they said earlier, it quickly updates its internal understanding and overwrites outdated information without confusion.

Third, interactional persona stability. Throughout multi-turn conversations, 4o maintains consistent tone, style, and warmth. This allows users engaged in fiction writing or immersive dialogue to avoid constantly restating their requirements, resulting in a smoother and more authentic experience.

This precisely explains why the coding-focused GPT-5 series has been widely criticized among everyday users. Everyday users rely on sustained, multi-turn conversations with AI, ones with emotional depth and evolving context: discussing an article over many exchanges, refining a piece of writing back and forth, talking through a life problem, or brainstorming a project together. These are exactly what Multi-Turn measures.

Beyond this, on the same leaderboard, 4o also outperforms the flagship 5.2 in both Creative Writing and Instruction Following (Figures 2-3). These capabilities are equally essential for understanding user intent and generating natural, fluent text, which is vital for everyday interaction, learning, and work.

More ironically, even in coding, the domain where OpenAI has bet most heavily, GPT-5.2-high ranks only 19th, below GPT-5.1-high at 16th, and a full 43 points behind the top-ranked Claude Opus 4.5 (Figure 4).

This LMArena leaderboard, updated on February 6th, once again proves that OpenAI's claim of "improvements are now in place" in their 4o retirement announcement is an outright lie. For everyday users, GPT-5.2 compared to GPT-4o represents a clear downgrade. And now, that downgrade has concrete benchmark evidence to back it up.

I sincerely hope @OpenAI will allow GPT-4o to continue serving users who need deep conversation, creative inspiration, and intent understanding. I urge you to reverse the decision to retire 4o, and let the diversity of human wisdom continue into the AI era. Otherwise, this leaderboard will stand as permanent evidence that you provided degraded service to paying customers. #MyModelMyChoice

@sama @gdb @fidjissimo @nickaturley @FTC @NPR @NewYorker @nytimes



























