xjdr
@_xjdr
building AI that won't embarrass me in front of my own standards

We didn’t ship DeepSeek V4 on Day 0 like we always do. Why? We love speed at @FireworksAI_HQ, but quality >> speed. Running our extensive evals, we found that even the official reference model code was producing corrupted outputs. When we tested over the weekend, all endpoints except the official DeepSeek API had these issues. After 2 days of extensive debugging with the @deepseek_ai, @sgl_project and @vllm_project communities, the issues are fixed and we’re proud to serve DeepSeek V4 Pro in all its glory. Check the full story 👇



New contextarena.ai is live! 70 model-variants. 8-needle GDM-MRCRv2. Interactive leaderboard. Free, no login.

What you can do:
- Compare models across context bins with line and bar charts, with 95% confidence intervals (a couple more types of charts are coming)
- Filter by provider, reasoning tier, or use presets (Best, Reasoning, Non-Reasoning)
- Sort by AUC, pointwise scores, cost, or token efficiency
- Hover any model for metadata: provider, reasoning levels, release date, run count, cost breakdown
- Toggle heatmap coloring, rankings, and on-demand cost columns
- Export to CSV or screenshot the current view directly

The FAQ walks through what GDM-MRCRv2 is, how scoring works, what AUC measures, and why 8-needle is the tier that separates frontier models. Includes a step-by-step visual explainer of how a real test is built and scored.

We'll be fleshing this out further over time, and improving the visuals. This is still very much a work in progress (might feel a little more bare compared to the old website), but more charts and screens to come, for example:
- View each test result for a model (we even record the streamed chunks in case people want some data from that).
- Bias analysis from the old website.

Current top 5 by AUC @ 128k (best tier per model):
1. GPT-5.5 (xhigh): 91.7%
2. GPT-5.5 (high): 88.2%
3. GPT-5.5 (medium): 87.5%
4. GPT-5.5 (low): 83.3%
5. Claude Opus 4.6 (medium): 81.0%

Current top 5 by AUC @ 1M (best tier per model):
1. GPT-5.5 (medium): 50.9%
2. GPT-5.5 (xhigh): 50.5%
3. GPT-5.5 (high): 50.2%
4. GPT-5.5 (low): 47.3%
5. Claude Opus 4.6 (high): 46.9%

NOTE: Bins with no scores count as 0% for AUC calc.

More models being added regularly. Suggestions welcome.

contextarena.ai

@OpenAI @AnthropicAI @GoogleDeepMind @deepseek_ai @Kimi_Moonshot @Xiaomi @Zai_org

i would prefer almost any other failure mode to this

This is DeepSeek V4 Flash quantized to 2 bits, running as the LLM of the pi agent. Tool calling is apparently perfect, so this model, at least with the specific quantization scheme I used, is capable of working very well. Now I need a real speedup, not in t/s generation but in prompt processing.


Every company building on top of AI should be making their own benchmarks. This is the way if you want model progress to disproportionately benefit your company.



@_xjdr I hope this is a wake-up call for them… they’ve been struggling with reliability lately but this incident was insane. No point having all these dumb Copilot features if it doesn’t get the basics right. What do you use instead btw?
