xjdr

7.3K posts


@_xjdr

building AI that won't embarrass me in front of my own standards

Noam's Labyrinth Joined December 2023
692 Following 27.5K Followers
Pinned Tweet
xjdr
xjdr@_xjdr·
Writing jitted jax code is like playing Dark Souls but in python
14
20
458
296.4K
xjdr
xjdr@_xjdr·
@dejavucoder i would absolutely do this in retirement
0
0
6
661
Eric W. Tramel
Eric W. Tramel@fujikanaeda·
@_xjdr xjdr please delete this i love these 5.5 xhigh fast tokens 😭
1
0
10
1.1K
xjdr
xjdr@_xjdr·
using gpt 5.5 xhigh on fast mode ended up costing me about 200x more per day than gpt 5.4 xhigh. i switched back, but i am forcing myself to use our k2.6 ft at least 50% of the time, escalating to 5.4 only when i need to and to 5.5 only as a final escalation
20
3
191
15.9K
xjdr
xjdr@_xjdr·
an underrated aspect of ai heavy development work is i now know, down to the penny, how much each feature, bug and tech debt burndown costs. it has changed my thinking and prioritization of work tremendously in a very short amount of time
10
6
215
10K
xjdr
xjdr@_xjdr·
@jmbollenbacher ya, i still use flash for subagents along with gpt-oss-120b
0
0
11
973
JMB 🧙‍♂️
JMB 🧙‍♂️@jmbollenbacher·
@_xjdr dsv4 flash is a much more solid model. It's less undertrained. Smaller, though, which limits peak capability. But more reliable at what it can do.
1
0
13
1.1K
xjdr
xjdr@_xjdr·
I removed dsv4 pro from my daily rotation. When it's good, it is still very good but the rate of hallucinations and poor instruction following in a wide range of scenarios make it unusable in practice at this time. With more post training I think it will be an excellent model
8
4
195
12.1K
xjdr
xjdr@_xjdr·
@keennay gpt 5.5 xhigh and k2.6
3
0
27
1.1K
xjdr
xjdr@_xjdr·
@teortaxesTex in my experience, it absolutely was not tested through the major deployment code at a minimum. it took me a lot of careful work and custom code to get it running accurately
0
0
10
919
Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)
> we found even the official reference model code producing corrupted outputs …Was DeepSeek's official inference code for Nvidia not well tested?
Dmytro Dzhulgakov@dzhulgakov

We didn’t ship DeepSeek V4 on Day 0 like we always do. Why? We love speed at @FireworksAI_HQ , but quality >> speed. Running our extensive evals, we found even the official reference model code producing corrupted outputs. When we tested over the weekend, all endpoints except official DeepSeek API had these issues. After 2 days of extensive debugging with @deepseek_ai , @sgl_project and @vllm_project communities, the issues are fixed and we’re proud to serve DeepSeek V4 Pro in all its glory. Check the full story 👇

2
0
41
8.1K
xjdr
xjdr@_xjdr·
@teortaxesTex this tracks with my experience (mostly) but i would say the more interesting part to me is the drop off doesn't happen until 500k+ which is pretty remarkable IMHO
0
0
3
545
Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)
V4 is "mediocre frontier" on MRCRv2. Between Opus 4.6 (above) and Opus 4.7 (below). In the paper, they say CorpusQA 1M is more interesting for them than MRCR. I wonder how GraphWalks looks.
Dillon Uzar@DillonUzar

New contextarena.ai is live! 70 model-variants. 8-needle GDM-MRCRv2. Interactive leaderboard. Free, no login.

What you can do:
- Compare models across context bins with line and bar charts - with 95% confidence intervals (a couple more types of charts are coming)
- Filter by provider, reasoning tier, or use presets (Best, Reasoning, Non-Reasoning)
- Sort by AUC, pointwise scores, cost, or token efficiency
- Hover any model for metadata: provider, reasoning levels, release date, run count, cost breakdown
- Toggle heatmap coloring, rankings, and on-demand cost columns
- Export to CSV or screenshot the current view directly

The FAQ walks through what GDM-MRCRv2 is, how scoring works, what AUC measures, and why 8-needle is the tier that separates frontier models. Includes a step-by-step visual explainer of how a real test is built and scored.

We'll be fleshing this out further over time, and improving the visuals. This is still very much a work in progress (might feel a little more bare compared to the old website), but more charts and screens to come, for example:
- View each test result for a model (we even record the streamed chunks in case people want some data from that).
- Bias analysis from the old website.

Current top 5 by AUC @ 128k (best tier per model):
1. GPT-5.5 (xhigh): 91.7%
2. GPT-5.5 (high): 88.2%
3. GPT-5.5 (medium): 87.5%
4. GPT-5.5 (low): 83.3%
5. Claude Opus 4.6 (medium): 81.0%

Current top 5 by AUC @ 1M (best tier per model):
1. GPT-5.5 (medium): 50.9%
2. GPT-5.5 (xhigh): 50.5%
3. GPT-5.5 (high): 50.2%
4. GPT-5.5 (low): 47.3%
5. Claude Opus 4.6 (high): 46.9%

NOTE: Bins with no scores count as 0% for AUC calc. More models being added regularly. Suggestions welcome.

contextarena.ai

@OpenAI @AnthropicAI @GoogleDeepMind @deepseek_ai @Kimi_Moonshot @Xiaomi @Zai_org

4
5
81
9.4K
xjdr
xjdr@_xjdr·
@dhtikna sorry, i was busy. am back
0
0
2
41
xjdr
xjdr@_xjdr·
wow k2.6 is very good. surprisingly so
15
8
418
26.4K
xjdr
xjdr@_xjdr·
not only is this awesome to see, but i believe this quantization performance will pretty directly correlate to its ability to be finetuned in the future. dsv4 pro is the headline for me but the flash performance should not be discounted (it is exceptional)
antirez@antirez

This is DeepSeek v4 Flash quantized at 2 bit that runs as LLM of the pi agent. Perfect tool calling apparently, so this model, with this specific quantization scheme that I used at least, is capable of working very well. Now I need a real speedup not in t/s generation but prompt processing.

0
3
105
8.1K
xjdr
xjdr@_xjdr·
@sarah_edo cs/ should respond with an epic tale of the cls that made the codebase what it is today
0
0
4
1.1K
Sarah Drasner
Sarah Drasner@sarah_edo·
I think it would be epic to have a Game of Thrones intro-style map for every large codebase. Soundtrack included.
24
9
417
27.7K
xjdr
xjdr@_xjdr·
i completely agree but sometimes 'benchmarks' are hard to quantify. while we still have lots of our own internal benchmarks and evals, what i've moved towards is a specific set of representative tasks and problems that are either currently problematic for the sota models and / or things we do all the time that are core to our version of success. while the distinction might not be particularly interesting at first glance, to me the difference is "how well do you do on a synthetic approximation of what we value" vs "how well do you actually work in our system with our exact problem compared to the best way(s) we know how to do things today"
Logan Kilpatrick@OfficialLoganK

Every company building on top of AI should be making their own benchmarks. This is the way if you want model progress to disproportionally benefit your company.

0
1
82
6.3K
xjdr
xjdr@_xjdr·
@thdxr this is the way
0
0
2
799
dax
dax@thdxr·
i try our onboarding once a week and i almost always find a small regression
20
0
395
24.3K
xjdr
xjdr@_xjdr·
I think software development is currently as hard as it's ever been. What has changed with the introduction of AI is:
- software development has become available / accessible to more people than ever
- velocity has increased enormously (with an on average corresponding decrease in quality)
- most products and ideas are no longer constrained by the productivity of a single dev or a small team
- existing tools and infrastructure that are still designed primarily for human developers and human interaction are being stressed to the breaking point as a result

in many ways this reminds me of when coding bootcamps first became popular. there was a tremendous influx of new rails devs and then new js/node/react devs to the industry, and corporate insurance and retail companies all of a sudden had 'tech teams'. lines of code and the number of projects/products increased both in volume and in ambition, and there was a corresponding increase in outages and bugs and slop. it took some time and some pain but the software industry eventually caught up, problematic trends died out, best practices and hard-fought experience emerged, and in many ways the software industry emerged better for it.

i think there will be a similar (albeit more violent and dramatic) cycle for ai software development in the next 5 - 10 years, and while it will look very different at the end of that cycle, the software industry will most likely end up better for it
22
26
375
25.3K
xjdr
xjdr@_xjdr·
a few people asked the same question; we made our own SCM based on how Google and Meta do it internally to scale to thousands of agents operating in parallel in our monorepo
tetsuo.cpp (no slop)@tetsuo_cpp

@_xjdr I hope this is a wake up call for them… they’ve been struggling with reliability lately but this incident was insane. No point having all these dumb Copilot features if it doesn’t get the basics right. What do you use instead btw?

10
5
151
16.3K
xjdr
xjdr@_xjdr·
in the above example, i am asking it to trace specific flows to see what tools are visible to an agent based on specific runtime context and compare that to what we would expect to be visible. it immediately suggests we 'build a crosswalk'. i ask it to not do that and then it keeps cycling on 'how do we fit this into our crosswalk' or 'before we continue, we should compare this to our crosswalk and ... '
1
2
19
587
𝚟𝚒𝚎 ⟢
𝚟𝚒𝚎 ⟢@viemccoy·
@_xjdr what is happening when it brings up the crosswalk? are you giving it conflicting instructions?
1
0
8
270
xjdr
xjdr@_xjdr·
gpt 5.5 has added the term 'crosswalk' to its arsenal. it seems to be a catch-all for comparing / enumerating things, and not only do i personally hate it, it seems to reduce the answer fidelity when it gets into that reasoning basin. boooooo
12
0
168
9.3K