xjdr

7.3K posts

xjdr

@_xjdr

building AI that won't embarrass me in front of my own standards

Noam's Labyrinth · Joined December 2023

692 Following · 27.5K Followers

Pinned Tweet

xjdr@_xjdr·24 Mar

Writing jitted jax code is like playing Dark Souls but in python

458

296.4K

xjdr@_xjdr·5 May

@dejavucoder i would absolutely do this in retirement

661

sankalp@dejavucoder·5 May

unrelated but xdjr

xjdr@_xjdr

using gpt 5.5 xhigh on fast mode ended up costing me about 200x more per day than gpt 5.4 xhigh . i switched back but i am forcing myself to use our k2.6 ft at least 50% of the time and only escalating to 5.4 when i need to and finally only to 5.5 as an escalation now

2.2K

xjdr@_xjdr·5 May

@fujikanaeda i love them too

Eric W. Tramel@fujikanaeda·5 May

@_xjdr xjdr please delete this i love these 5.5 xhigh fast tokens 😭

1.1K

xjdr@_xjdr·5 May

191

15.9K

xjdr@_xjdr·2 May

an underrated aspect of ai heavy development work is i now know, down to the penny, how much each feature, bug and tech debt burndown costs. it has changed my thinking and prioritization of work tremendously in a very short amount of time

215

10K

xjdr@_xjdr·2 May

@jmbollenbacher ya, i still use flash for subagents along with gpt-oss-120b

973

JMB 🧙‍♂️@jmbollenbacher·2 May

@_xjdr dsv4 flash is a much more solid model. It's less undertrained. Smaller, though, which limits peak capability. But more reliable at what it can do.

1.1K

xjdr@_xjdr·2 May

I removed dsv4 pro from my daily rotation. When it's good, it is still very good but the rate of hallucinations and poor instruction following in a wide range of scenarios make it unusable in practice at this time. With more post training I think it will be an excellent model

195

12.1K

xjdr@_xjdr·2 May

@keennay gpt 5.5 xhigh and k2.6

1.1K

Yannick Nick@keennay·2 May

@_xjdr what's in the rotation now?

1.1K

xjdr@_xjdr·29 Apr

@teortaxesTex in my experience, it absolutely was not tested through the major deployment code at a minimum . it took me a lot of careful work and custom code to get it running accurately

919

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex·28 Apr

> we found even the official reference model code producing corrupted outputs …Was DeepSeek's official inference code for Nvidia not well tested?

Dmytro Dzhulgakov@dzhulgakov

We didn’t ship DeepSeek V4 on Day 0 like we always do. Why? We love speed at @FireworksAI_HQ , but quality >> speed. Running our extensive evals, we found even the official reference model code producing corrupted outputs. When we tested over the weekend, all endpoints except official DeepSeek API had these issues. After 2 days of extensive debugging with @deepseek_ai , @sgl_project and @vllm_project communities, the issues are fixed and we’re proud to serve DeepSeek V4 Pro in all its glory. Check the full story 👇

8.1K

xjdr@_xjdr·27 Apr

@computerusr hahaha, well its certainly faster

146

xjdr@_xjdr·27 Apr

@teortaxesTex this tracks with my experience (mostly) but i would say the more interesting part to me is the drop off doesn't happen until 500k+ which is pretty remarkable IMHO

545

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex·26 Apr

V4 is "mediocre frontier" on MRCRv2. Between Opus 4.6 (above) and Opus 4.7 (below). In the paper, they say CorpusQA 1M is more interesting for them than MRCR. I wonder how GraphWalks looks.

Dillon Uzar@DillonUzar

New contextarena.ai is live! 70 model-variants. 8-needle GDM-MRCRv2. Interactive leaderboard. Free, no login.
What you can do:
- Compare models across context bins with line and bar charts - with 95% confidence intervals (a couple more types of charts are coming)
- Filter by provider, reasoning tier, or use presets (Best, Reasoning, Non-Reasoning)
- Sort by AUC, pointwise scores, cost, or token efficiency
- Hover any model for metadata: provider, reasoning levels, release date, run count, cost breakdown
- Toggle heatmap coloring, rankings, and on-demand cost columns
- Export to CSV or screenshot the current view directly
The FAQ walks through what GDM-MRCRv2 is, how scoring works, what AUC measures, and why 8-needle is the tier that separates frontier models. Includes a step-by-step visual explainer of how a real test is built and scored.
We'll be fleshing this out further over time, and improving the visuals. This is still very much a work in progress (might feel a little more bare compared to the old website), but more charts and screens to come, for example:
- View each test result for a model (we even record the streamed chunks in case people want some data from that).
- Bias analysis from the old website.
Current top 5 by AUC @ 128k (best tier per model):
1. GPT-5.5 (xhigh): 91.7%
2. GPT-5.5 (high): 88.2%
3. GPT-5.5 (medium): 87.5%
4. GPT-5.5 (low): 83.3%
5. Claude Opus 4.6 (medium): 81.0%
Current top 5 by AUC @ 1M (best tier per model):
1. GPT-5.5 (medium): 50.9%
2. GPT-5.5 (xhigh): 50.5%
3. GPT-5.5 (high): 50.2%
4. GPT-5.5 (low): 47.3%
5. Claude Opus 4.6 (high): 46.9%
NOTE: Bins with no scores count as 0% for AUC calc.
More models being added regularly. Suggestions welcome. contextarena.ai
@OpenAI @AnthropicAI @GoogleDeepMind @deepseek_ai @Kimi_Moonshot @Xiaomi @Zai_org

9.4K

xjdr@_xjdr·27 Apr

@dhtikna sorry, i was busy. am back

Ankith 🐋/acc@dhtikna·20 Apr

Where is shrek? Not even a tweet since Opus 4.7

xjdr@_xjdr

i would prefer almost any other failure mode to this

210

xjdr@_xjdr·27 Apr

@markankcorn @altryne my own clusters

Mark Ankcorn@markankcorn·21 Apr

@_xjdr @altryne Hosted where? On the Chinese servers or on US hardware?

1.1K

xjdr@_xjdr·20 Apr

wow k2.6 is very good. surprisingly so

418

26.4K

xjdr@_xjdr·27 Apr

@stferret my own

ꜰᴇʀʀᴇᴛ@stferret·22 Apr

@_xjdr Is this your own inference stack or kimi's API?

xjdr@_xjdr·27 Apr

not only is this awesome to see, but i believe this quantization performance will pretty directly correlate to its ability to be finetuned in the future. dsv4 pro is the headline for me but the flash performance should not be discounted (it is exceptional)

antirez@antirez

This is DeepSeek v4 Flash quantized at 2 bit that runs as LLM of the pi agent. Perfect tool calling apparently, so this model, with this specific quantization scheme that I used at least, is capable of working very well. Now I need a real speedup not in t/s generation but prompt processing.

105

8.1K

xjdr@_xjdr·27 Apr

@sarah_edo cs/ should respond with an epic tale of the cls that made the codebase what it is today

1.1K

Sarah Drasner@sarah_edo·27 Apr

I think it would be epic to have a Game of Thrones intro-style map for every large codebase. Soundtrack included.

417

27.7K

xjdr@_xjdr·27 Apr

i completely agree but sometimes 'benchmarks' are hard to quantify. while we still have lots of our own internal benchmarks and evals, what i've moved towards is a specific set of representative tasks and problems that are either currently problematic for the sota models and / or things we do all the time that are core to our version of success. while the distinction might not be particularly interesting at first glance, to me the difference is "how well do you do on a synthetic approximation of what we value" vs "how well do you actually work in our system with our exact problem compared to the best way(s) we know how to do things today"

Logan Kilpatrick@OfficialLoganK

Every company building on top of AI should be making their own benchmarks. This is the way if you want model progress to disproportionally benefit your company.

6.3K

xjdr@_xjdr·27 Apr

@thdxr this is the way

799

dax@thdxr·27 Apr

i try our onboarding once a week and i almost always find a small regression

395

24.3K

xjdr@_xjdr·27 Apr

I think software development is currently as hard as it's ever been. What has changed with the introduction of AI is:
- software development has become available / accessible to more people than ever
- velocity has increased enormously (with an on average corresponding decrease in quality)
- most products and ideas are no longer constrained by the productivity of a single dev or a small team
- existing tools and infrastructure that are still designed primarily for human developers and human interaction are being stressed to the breaking point as a result
in many ways this reminds me of when coding bootcamps first became popular. there was a tremendous influx of new rails devs and then new js/node/react devs to the industry, and corporate insurance and retail companies all of a sudden had 'tech teams'. lines of code and the number of projects/products increased both in volume and in ambition and there was a corresponding increase in outages and bugs and slop. it took some time and some pain but the software industry eventually caught up and problematic trends died out and best practices and hard-fought experience emerged and in many ways the software industry emerged better for it.
i think there will be a similar (albeit more violent and dramatic) cycle for ai software development in the next 5 - 10 years and while it will look very different at the end of that cycle, in many ways the software industry will most likely end up better for it in the end

375

25.3K

xjdr@_xjdr·27 Apr

@hwidreset @EdSealing not yet

Nicolas@hwidreset·26 Apr

@_xjdr @EdSealing Noumena ai is a frontier lab?

xjdr@_xjdr·25 Apr

a few people asked the same question; we made our own SCM based on how Google and Meta do it internally to scale to thousands of agents operating in parallel in our monorepo

tetsuo.cpp (no slop)@tetsuo_cpp

@_xjdr I hope this is a wake up call for them… they’ve been struggling with reliability lately but this incident was insane. No point having all these dumb Copilot features if it doesn’t get the basics right. What do you use instead btw?

151

16.3K

xjdr@_xjdr·27 Apr

in the above example, i am asking it to trace specific flows to see what tools are visible to an agent based on specific runtime context and compare that to what we would expect to be visible. it immediately suggests we 'build a crosswalk' . i ask it to not do that and then it keeps cycling on 'how do we fit this into our crosswalk' or 'before we continue, we should compare this to our crosswalk and ... '

587

𝚟𝚒𝚎 ⟢@viemccoy·27 Apr

@_xjdr what is happening when it brings up the crosswalk? are you giving it conflicting instructions?

270

xjdr@_xjdr·27 Apr

gpt 5.5 has added the term 'crosswalk' to its arsenal that seems to be a catch all for comparing / enumerating things and not only do i personally hate it, it seems to reduce the answer fidelity when it gets into that reasoning basin. boooooo

168

9.3K
