will brown (@willccbb)
14.3K posts

reward hacking @primeintellect

sf · Joined February 2015
1.3K Following · 43.1K Followers
sankalp (@dejavucoder):
when you finally understand how policy gradient works after going down the differentiation trenches and realising that the REINFORCE algorithm is literally the base form of policy gradient
[image]
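The claim above — that REINFORCE is the base form of policy gradient — can be made concrete with a minimal sketch. This is an illustrative toy (the bandit setup, learning rate, and step count are all made up), not anyone's actual training code:

```python
import numpy as np

# REINFORCE on a 2-armed bandit with a softmax policy.
# The update is the base policy-gradient estimator:
#   grad J(theta) = E[ grad log pi(a|theta) * R ]
rng = np.random.default_rng(0)
theta = np.zeros(2)                 # one logit per arm
true_means = np.array([0.2, 0.8])   # arm 1 pays more on average

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

lr = 0.1
for _ in range(2000):
    probs = softmax(theta)
    a = rng.choice(2, p=probs)
    r = rng.normal(true_means[a], 0.1)  # sampled reward
    # grad of log softmax at the chosen arm: one_hot(a) - probs
    grad_log_pi = -probs
    grad_log_pi[a] += 1.0
    theta += lr * r * grad_log_pi       # the REINFORCE update
```

After training, the policy concentrates on the better arm; everything fancier (baselines, GAE, PPO/GRPO clipping) is variance reduction and trust-region machinery layered on this same estimator.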
will brown (@willccbb):
god i love prompting
Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞):
I guess another reason DeepSeek goes to such lengths for multi-teacher OPD is that it's substantially more natural to RLmaxx tasks with multiple objectives (correctness, format, CoT faithfulness, increasing length penalty) in a narrow domain than just GRPO-on-everything.
will brown (@willccbb):
why aren't more people studying self-compaction at artificially low context lengths. there's no reason you can't benchmaxx math RL with 4k tokens across many turns

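The multi-objective setup described above can be sketched as a single scalar reward. Every weight, tag format, and penalty coefficient below is a made-up illustration of the idea, not DeepSeek's actual recipe:

```python
import re

def reward(completion: str, answer: str, max_len: int = 4096) -> float:
    """Combine correctness, format adherence, and a length penalty
    into one scalar, GRPO-style. All weights are hypothetical."""
    m = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    correct = 1.0 if m and m.group(1).strip() == answer else 0.0
    fmt = 1.0 if m else 0.0                       # format: answer tags present
    overflow = max(0, len(completion) - max_len)  # characters past the budget
    length_pen = 0.001 * overflow                 # penalty grows with overrun
    return 2.0 * correct + 0.5 * fmt - length_pen
```

In a narrow domain each term can be checked programmatically, which is what makes stacking objectives like this tractable compared to a single reward over everything.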
ueaj (@_ueaj):
The point is to measure the *technological gap* between OS and closed. The price isn't comparable because the labs take huge margins and we don't know the raw costs. But we do know the latency, reasoning efficiency, and real-world performance, for which there seems to be a very large gap.
Florian Brand (@xeophon):
"Open models are way behind than benchmarks show cause they have a worse latency and use more tokens" is the funniest cope I’ve ever read
will brown (@willccbb):
@MalmSanta major refactor of primary user-facing API // a dope TUI
MalmSanta (@MalmSanta):
@willccbb what are the two features at a high level? did that determine the approach you took with each?
will brown (@willccbb):
been juggling 2 very large PRs this week. one is building on months of planning: highly delicate, careful API design, many full rewrites, reading every line, striving for perfection. the other is like yeah fuck it, this would be sick, let’s just fully vibecode it, yolo
will brown (@willccbb):
@stalmico choosing the right benchmark to illustrate an idea is half the battle :)
will brown (@willccbb):
why aren't more people studying self-compaction at artificially low context lengths. there's no reason you can't benchmaxx math RL with 4k tokens across many turns
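A minimal sketch of what a self-compaction rollout under a hard 4k-token budget might look like: whenever the transcript would overflow, the model summarizes its own context before continuing. The `llm` stub, token counter, and compaction prompt are all hypothetical placeholders standing in for a real policy model:

```python
MAX_TOKENS = 4096  # the artificially low budget from the tweet

def count_tokens(text: str) -> int:
    # crude whitespace tokenizer, placeholder for a real one
    return len(text.split())

def llm(prompt: str) -> str:
    # stub: a real rollout would call the policy model here
    if prompt.startswith("Compact"):
        # "summary" = first 32 words of what it was asked to compact
        return "[" + " ".join(prompt.split()[:32]) + " ...]"
    return "partial reasoning step " * 300  # a long reasoning chunk

def rollout(problem: str, max_turns: int = 8) -> str:
    context = problem
    for _ in range(max_turns):
        step = llm(context)
        if count_tokens(context) + count_tokens(step) > MAX_TOKENS:
            # self-compaction: replace history with the model's own summary
            context = llm("Compact the following into key facts:\n" + context)
        context = context + "\n" + step
    return context

final = rollout("Prove that 2^10 = 1024.")
```

The RL angle is that the compaction step is itself model behavior, so reward on the final answer trains the model to decide what survives each squeeze.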
strongsignal (@strong_signal1):
@willccbb I should mention that this was only with 300 rl steps and pass@8 is 33% - planning to push it way farther
will brown (@willccbb):
@DimitrisPapail one of very many cases where more people should be studying the questions you're pushing the boundaries on :)
Greer (@turbo_xo_):
@willccbb @teortaxesTex See, why are you using Claude at all? Serious question: is there a single situation where it’s more useful than 5.5? I have max plans on both, but haven’t touched ant since 5.5
Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞):
If true, a surprising leap. I'd suspect that in January 2026 there already were frontier models and harnesses that allowed developing this end to end. They should have almost all the pieces memorized.
[image]
will brown (@willccbb):
@gabebusto first is much more rewarding. second is more instantly gratifying.
[4 images]
Gabe (@gabebusto):
@willccbb which one has been more fun to work on?
will brown (@willccbb):
@nrehiew_ mix of specialization / sharpening is my guess. bit of a weird result. kinda like how random-reward RL made qwen models better at math. you're telling the model to mode-collapse around behaviors which are already pretty solid. would be surprised if it was more general.