beta_pyxis

296 posts

@beta_pyxis

curious about all things code

Joined March 2019

860 Following, 57 Followers

beta_pyxis@beta_pyxis·1d

@MrAhmadAwais @0xSero Commandcode cli struggles to do basic functions like file tagging in any decently sized repo where codex cli or claude code doesn’t even break a sweat

Ahmad Awais@MrAhmadAwais·1d

@0xSero what harness are you using? could i interest you in trying ours? based on the tool call repairs? x.com/MrAhmadAwais/s…

Ahmad Awais@MrAhmadAwais

how did we make deepseek outperform opus 4.7?

i've been thinking about why "open model bad at tool calling" is almost always a harness problem, not a model problem.

context: spent two days looking at billions of tokens in @CommandCodeAI (tb open source ai cli) using deepseek. I ended up writing a tool-input repair layer. the trigger was watching deepseek-flash fail on the simplest /review run, every shellCommand and readFile call bouncing back with a raw zod issues blob, the model unable to recover because the error wasn't in a form it could read. by the end deepseek v4 pro was beating opus 4.7 6/10 times on our internal evals.

a few things i learned that feel general:

1/ the failure modes aren't random; they're a small, finite, compositional set. across deepseek-flash, deepseek v4 pro, glm, qwen, the same four mistakes repeat almost exactly:
- sending `null` for an optional field instead of omitting it
- emitting `["a","b"]` as a json *string* instead of an actual array
- wrapping a single arg in `{}` where the schema expected an array (an "empty placeholder")
- passing a bare string where an array was expected (`"foo"` instead of `["foo"]`)

four repairs, ~30-100 lines each, ordered carefully (json-array-parse must run before bare-string-wrap or `'["a","b"]'` becomes `['["a","b"]']`). that is the whole catalogue. when i hear "this open source model can't do tool calls" i now assume one of those four, and so far that's been right ~90% of the time.

2/ the funniest failure mode is also the most revealing. deepseek-flash, when asked to edit or write a file, sometimes emits the path as a *markdown auto-link*:

filePath: "/Users/x/proj/[notes.md](http://notes.md)"

our writeFile tool obediently tried creating files literally named `[notes.md](http://notes.md)` until we caught it. this is not a hallucination.
it's the post-training chat distribution leaking through the tool boundary: the model has been rewarded for auto-linking in conversational output, and is applying that prior in a context where it makes no sense. the fix is two regex lines that unwrap only the degenerate case where link text equals url-without-protocol; real markdown like `[click](https://x.com)` passes through untouched. this is also conditioning from their own tools during RL, which were different from the tools we write and which we ofc can't predict.

"tool confusion" is a more useful frame than "capability gap." the model knows how to format a path. it just hasn't been told clearly enough that this path is going to fopen, not into a chat bubble. so we encode that hint at the schema level, `pathString()` instead of `z.string()`, and the leak is plugged for every path field at once.

3/ the design choice that mattered was inverting preprocess-then-validate to validate-then-repair. my first attempt was the obvious one: a preprocessing pass that normalized inputs (strip nulls, parse stringified arrays, etc.) before zod ever saw them. it broke immediately: writeFile content that *happened* to be json-shaped got rewritten before it hit disk. silent corruption, easy to miss in a smoke test.

then i made it less greedy:
- parse the input as-is. if it succeeds, ship it. valid inputs are never touched.
- on failure, walk the validator's own issue list. for each issue path, try the four repairs in order until one applies.
- parse again. on success, log `tool_input_repaired:${toolName}`. on failure, log `tool_input_invalid:${toolName}` and return a model-readable retry message.

the structural insight here is: when you preprocess, you encode a prior about what's broken. when you let the validator complain first, the schema is the prior, and you only spend repair budget at the exact paths the schema actually disagreed at. the validator is doing the work of localizing the bug for you.
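A minimal TypeScript sketch of the validate-then-repair loop from point 3, using the four repairs from point 1. Everything here is an assumption made for illustration: the hand-rolled `validate` stands in for zod so the example runs without dependencies, and `FieldSpec`, `repairs`, and `parseToolInput` are hypothetical names, not CommandCode's code. Reading the `{}` case as "empty object means empty array" is also a guess at what the thread means by an empty placeholder.

```typescript
// Hypothetical stand-in for a zod schema: a flat list of typed fields.
type FieldSpec = { key: string; type: "string" | "number" | "string[]"; optional?: boolean };
type Input = Record<string, unknown>;
type Issue = { spec: FieldSpec; message: string };
type Outcome =
  | { status: "ok" | "repaired"; input: Input }
  | { status: "invalid"; message: string };

function isStringArray(v: unknown): v is string[] {
  return Array.isArray(v) && v.every((x) => typeof x === "string");
}

function isEmptyObject(v: unknown): boolean {
  return typeof v === "object" && v !== null && !Array.isArray(v) && Object.keys(v).length === 0;
}

// Validate without mutating; report one issue per offending field.
function validate(schema: FieldSpec[], input: Input): Issue[] {
  const issues: Issue[] = [];
  for (const spec of schema) {
    const value = input[spec.key];
    if (value === undefined) {
      if (!spec.optional) issues.push({ spec, message: `${spec.key} is required` });
      continue;
    }
    const ok = spec.type === "string[]" ? isStringArray(value) : typeof value === spec.type;
    if (!ok) issues.push({ spec, message: `${spec.key} must be ${spec.type}` });
  }
  return issues;
}

// A repair returns the fixed value (undefined means "omit the field"),
// or null when it does not apply.
type Repair = (value: unknown, spec: FieldSpec) => { value: unknown } | null;

// Order matters: JSON-array parsing must run before bare-string wrapping.
const repairs: Repair[] = [
  // 1. `null` sent for an optional field: omit it.
  (v, spec) => (v === null && spec.optional === true ? { value: undefined } : null),
  // 2. Array emitted as a JSON string: parse it.
  (v, spec) => {
    if (spec.type !== "string[]" || typeof v !== "string") return null;
    try {
      const parsed: unknown = JSON.parse(v);
      return isStringArray(parsed) ? { value: parsed } : null;
    } catch {
      return null;
    }
  },
  // 3. `{}` placeholder where an array was expected: use an empty array.
  (v, spec) => (spec.type === "string[]" && isEmptyObject(v) ? { value: [] } : null),
  // 4. Bare string where an array was expected: wrap it.
  (v, spec) => (spec.type === "string[]" && typeof v === "string" ? { value: [v] } : null),
];

function parseToolInput(
  toolName: string,
  schema: FieldSpec[],
  raw: Input,
  log: (event: string) => void,
): Outcome {
  // Fast path: valid inputs are returned untouched.
  const issues = validate(schema, raw);
  if (issues.length === 0) return { status: "ok", input: raw };

  // Repair only the fields the validator complained about.
  const patched: Input = { ...raw };
  for (const issue of issues) {
    for (const repair of repairs) {
      const result = repair(patched[issue.spec.key], issue.spec);
      if (result === null) continue;
      if (result.value === undefined) delete patched[issue.spec.key];
      else patched[issue.spec.key] = result.value;
      break; // first applicable repair wins
    }
  }

  // Validate again and log the outcome per tool.
  const remaining = validate(schema, patched);
  if (remaining.length === 0) {
    log(`tool_input_repaired:${toolName}`);
    return { status: "repaired", input: patched };
  }
  log(`tool_input_invalid:${toolName}`);
  const detail = remaining.map((i) => i.message).join("; ");
  return { status: "invalid", message: `Invalid input for ${toolName}: ${detail}. Retry with corrected arguments.` };
}
```

With a hypothetical `shellCommand` schema that declares `paths` as a string array and `cwd` as optional, an input of `{ command: "ls", paths: '["a","b"]', cwd: null }` comes back as `repaired` with `paths` parsed and `cwd` omitted. A `writeFile` call whose `content` happens to be the string `'["a","b"]'` passes validation on the first try and is returned untouched, which is the corruption case the preprocessing approach got wrong.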
it's the same shape as cheap-then-careful everywhere else: try the fast path, fall back on evidence.

(this also gives you per-tool telemetry for free. you can watch repair rates per (model, tool) and notice when a model regresses on a specific contract before users do.)

4/ shape invariants and relational invariants need different fixes. the four repairs above all handle shape problems: wrong type, missing key, wrong container. but read_file had a *relational* invariant: "if you provide offset, you must also provide limit, and vice versa." deepseek kept calling `readFile({ absolutePath, limit: 30 })` and getting an `ERROR:` back. you can't fix this with input repair, because each field is independently valid; the bug is in the relationship between them.

so i taught the function the model's intent instead. `limit` alone → `offset = 0`. `offset` alone → `limit = 2000` (matches common read tool ops default). then surfaced the decision back to the model in the result:

"Note: limit was not provided; defaulted to 2000 lines. To read more or fewer lines, retry with both offset and limit."

no `Error:` prefix, so the tui doesn't paint it red. the model sees what we picked and can self-correct on the next turn if our guess was wrong. transparency over silent magic wins big.

repair where you can. extend semantics where you can't. surface the choice either way.

zoom out: a lot of what looks like model capability is actually contract design. a strict schema is a choice with a cost: it filters out noise, but it also filters out recoverable noise from any model that hasn't memorized the exact json contract you happened to pick. the largest commercial models eat that cost invisibly and are lenient on tool calling because they've seen enough of every contract during pretraining; open models pay it loudly and get dismissed for it. the harness is where you mediate between distributions.
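The relational default from point 4 might be sketched as follows. The 2000-line default and the `limit` note are quoted from the thread; the names `ReadFileArgs`, `resolveReadRange`, and `readLines`, the wording of the `offset` note, and the behaviour when neither field is given are assumptions for illustration.

```typescript
// Hypothetical shapes; the thread names only absolutePath, offset, and limit.
type ReadFileArgs = { absolutePath: string; offset?: number; limit?: number };
type ReadRange = { offset: number; limit: number; note?: string };

const DEFAULT_LIMIT = 2000; // default quoted in the thread

// Fill in whichever half of the (offset, limit) pair is missing, and record
// a note so the choice can be surfaced to the model.
function resolveReadRange(args: ReadFileArgs): ReadRange {
  const { offset, limit } = args;
  if (offset !== undefined && limit !== undefined) return { offset, limit };
  if (limit !== undefined) {
    // Assumed wording; the thread does not quote a note for this case.
    return {
      offset: 0,
      limit,
      note: "Note: offset was not provided; defaulted to 0. To read a different range, retry with both offset and limit.",
    };
  }
  if (offset !== undefined) {
    return {
      offset,
      limit: DEFAULT_LIMIT,
      note: `Note: limit was not provided; defaulted to ${DEFAULT_LIMIT} lines. To read more or fewer lines, retry with both offset and limit.`,
    };
  }
  // Assumption: with neither field, read from the top with the default limit.
  return { offset: 0, limit: DEFAULT_LIMIT };
}

// Slice the requested lines and append the note as plain text, with no
// `Error:` prefix, so the result reads as a success.
function readLines(fileText: string, args: ReadFileArgs): string {
  const range = resolveReadRange(args);
  const body = fileText.split("\n").slice(range.offset, range.offset + range.limit).join("\n");
  return range.note === undefined ? body : `${body}\n\n${range.note}`;
}
```

Keeping the range decision in its own function means the same note can be logged or tested without touching the file system; `readLines` takes the file text as a string here only to keep the sketch self-contained.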
four small repairs (i'm sure more to follow as we have three more merging today), two regex lines for auto-links, one relational default, one prefix change. the model didn't change. the contract got more forgiving in exactly the places it needed to be.

deepseek v4 pro now beats opus 4.7 6/10 times on our internal evals. imo "skill issue" applies to the harness more often than the model.
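One possible shape for the auto-link unwrap from point 2, as a sketch. The thread says only that two regex lines unwrap links whose text equals the URL without its protocol; the pattern and the name `unwrapAutoLinks` below are assumptions, not the author's code.

```typescript
// Matches a markdown link `[text](http://rest)` or `[text](https://rest)`.
const AUTO_LINK = /\[([^\]]+)\]\(https?:\/\/([^)\s]+)\)/g;

// Unwrap only the degenerate case where the link text equals the URL minus
// its protocol; any other markdown link is left as written.
function unwrapAutoLinks(path: string): string {
  return path.replace(AUTO_LINK, (whole: string, text: string, rest: string) =>
    text === rest ? text : whole,
  );
}

// The leaked path from the thread is unwrapped...
const cleanedPath = unwrapAutoLinks("/Users/x/proj/[notes.md](http://notes.md)");
// cleanedPath is "/Users/x/proj/notes.md"

// ...while a real markdown link passes through unchanged.
const realLink = unwrapAutoLinks("[click](https://x.com)");
// realLink is "[click](https://x.com)"
```

In a zod-based harness this could plausibly be attached to every path field with something like `z.string().transform(unwrapAutoLinks)`; whether the thread's `pathString()` helper is built that way is not stated.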

0xSero@0xSero·1d

Deepseek-v4-pro 0.45B tokens for 6.74$

beta_pyxis@beta_pyxis·5d

@antirez @audreyt @niels_gron Is it fully maxed out?

antirez@antirez·5d

The first M5 max arrived! Many many thanks to our sponsors @audreyt and @niels_gron, the next will arrive on Monday.

beta_pyxis@beta_pyxis·5d

@Anon7000000 @thsottiaux They are not replying to any of us on this

Anon@Anon7000000·5d

@thsottiaux tibo, a lot of us have been noticing Codex consuming rate limits much quicker lately. can the team look into this?

Tibo@thsottiaux·5d

Dark magic. Codex.

OpenAI Developers@OpenAIDevs

Codex anywhere and everywhere, all the time. Now your Mac doesn’t have to be unlocked for Codex to use your computer. From your phone, Codex can securely use apps on your Mac, even when the screen is off and locked. developers.openai.com/codex/app/comp…

beta_pyxis@beta_pyxis·5d

@paraddox @TheAhmadOsman It’s a phased rollout probably. You’ll be seeing the same thing soon

Ddox@paraddox·5d

@TheAhmadOsman They must not like you personally :)) I can still choose.

beta_pyxis reposted

Ahmad@TheAhmadOsman·5d

OpenAI, what the fuck is this? Give me back the ability to specify WHICH MODEL I am using + their effort levels EXPLICITLY I don't want this router crap you're enforcing on my $200 paid subscription I DID NOT AGREE TO THIS SHIT

beta_pyxis@beta_pyxis·5d

@thsottiaux @charliermarsh You pulled an Anthropic on us and you're here joking about unlimited tokens 😂

Tibo@thsottiaux·5d

@charliermarsh Complain about unlimited tokens

Charlie Marsh@charliermarsh·6d

What would you do with unlimited tokens

beta_pyxis@beta_pyxis·5d

@OpenAIDevs @_rajanagarwal Can you guys please take a min to check the increased token usage? My weekly limit is draining at an alarming rate. Never before have I gotten down to less than 50% within two days of normal usage

OpenAI Developers@OpenAIDevs·6d

It’s Codex Thursday, and yes, we have updates for you. First up: Appshots, a new way to bring the context of what you’re working on into Codex. On your Mac, press Command-Command to attach your app window to a Codex thread. Codex gets both a screenshot and text from the window, including content beyond what’s visible onscreen. Appshots are available across plans on Mac, with enterprise access coming soon.

beta_pyxis reposted

Ziwen@ziwenxu_·6d

One tweet saying "we're on it" would save thousands of users from thinking they broke something. @thsottiaux @sama

Ziwen@ziwenxu_

Is something broken with Codex? The usage is draining insanely fast @thsottiaux. It's only been 1 day and I'm already at 23%. I only have 2 /goals running and I'm on the Pro plan.

beta_pyxis@beta_pyxis·6d

@Hi_Mrinal Dude posts a setup picture, people ask him for monitor specs, never get any reply. It’s kinda like a template now 😂 Love the setup tho

Mrinal@Hi_Mrinal·6d

Okkay the day starts with some note taking :p

beta_pyxis@beta_pyxis·6d

@0xSero Doing casual flex, met with Nvidia yesterday 😂

beta_pyxis@beta_pyxis·20 May

@nbaschez @r_marked You’ve got like 48k followers, and you’re telling me you didn’t know you should @ the person you’re talking about?

Nathan Baschez@nbaschez·20 May

@r_marked will do next time! 🫡 tbh i do not know the proper etiquette here haha

Nathan Baschez@nbaschez·20 May

PSA When someone tells you your app is slow, the answer is not “no it isn’t” It is “oh shoot thanks for letting me know! DMing you so I can help troubleshoot”

beta_pyxis reposted

J J@jturntdev·19 May

Still no response from @OpenAIDevs. As someone who literally uses Codex more than 99% of people: something has to be off? I trust yous. But clarity would be great.

J J@jturntdev

OpenAI have secretly adjusted our limits. Last week, before the limit reset, I was using Xhigh all day. For 5 days straight I couldn’t get my weekly usage below 55%. Since yesterday, I’ve used 40% of my quota, out of nowhere. So what's going on? @thsottiaux @sama @OpenAIDevs

beta_pyxis reposted

Joe@joerossi98·18 May

Am I the only one seeing more Codex usage today? It seems like it's burning through everything @thsottiaux @OpenAIDevs

beta_pyxis reposted

J J@jturntdev·18 May

J J@jturntdev

i'm on the 20x pro plan, have been for months. my usage was never reset yesterday, it's down 40% and i barely used it yesterday? what is this. @OpenAIDevs @thsottiaux

beta_pyxis reposted

am.will@LLMJunky·19 May

Anyone else notice that compaction seems to lose more details than normal in Codex? It never seemed to matter before, but I'm seeing it frequently now.

beta_pyxis@beta_pyxis·18 May

@thsottiaux @rowans_planet Then have it install and configure the rest

Tibo@thsottiaux·18 May

@rowans_planet Coooooodex

rowan is in London 💂🇬🇧@rowans_planet·17 May

you know the drill, new laptop… what should I install? 💻

beta_pyxis reposted

dax@thdxr·14 May

something is going wrong with gpt 5.5 caching. doesn't look like much on this chart but it's now using 2.5x as many input tokens as a week ago and dropping

beta_pyxis@beta_pyxis·15 May

@thsottiaux 5.5 is struggling with tasks it used to churn through like it was nothing? I’ve tried xhigh as well, starting from low. Can you please check? And I’m not alone: there's the same consensus across my team

beta_pyxis@beta_pyxis·14 May

@swarajk_ Every other week you’ll find one founder crying about how unfair it is that candidates want what’s best for them. But if you reverse the argument, they turn tail and run. Dude deleted his post.

Swaraj@swarajk_·14 May

we are not bullying startup founders enough

beta_pyxis@beta_pyxis·14 May

@rezoundous Important caveat: monthly stays the same, so you reach the monthly limit quickly 😂

Tyler@rezoundous·13 May

50% increase in Claude Code weekly limit doesn't excite me like it used to.
