Alex Wa

21 posts

@_djdumpling

math + cs @yale, residency @primeintellect , @yalenlp

Joined November 2022
320 Following · 1.2K Followers

Pinned Tweet
Alex Wa @_djdumpling
new blog! What methodologies do labs use to train frontier models? The blog distills 7 open-weight model reports from frontier labs, covering architecture, stability, optimizers, data curation, pre/mid/post-training + RL, and behaviors/safety djdumpling.github.io/2026/01/31/fro…
[image]
34 replies · 287 reposts · 2K likes · 279.4K views
Alex Wa @_djdumpling
@micpsst not explicitly, but there were some brief notes about Qwen 2.5, Qwen 3, and Qwen3-Next on dual chunk attention, hybrid models, chat templates, and data filtering
0 replies · 0 reposts · 3 likes · 2.2K views
Alex Wa @_djdumpling
3. distilling R1 into small models beats large-scale RL on reasoning
4. increasing MoE sparsity yields perf improvements for fixed FLOPs (e.g. 8/384 in Kimi K2)
5. during R1-Zero's pure RL, reflective words like 'wait' spiked 5-7x

would love feedback, especially corrections! :)
1 reply · 1 repost · 24 likes · 3.8K views
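The MoE sparsity point above is just a ratio of routed to total experts. A tiny sketch, with a function name of my own invention (real FLOP accounting would also include attention and the router, which this ignores):

```python
def moe_active_fraction(top_k: int, num_experts: int) -> float:
    """Fraction of expert parameters active per token in a
    mixture-of-experts layer with top-k routing (illustrative only)."""
    return top_k / num_experts

# The 8/384 routing cited for Kimi K2: ~2% of expert params per token.
# Holding top_k and expert size fixed (flat per-token FLOPs), growing
# num_experts raises total capacity while keeping compute constant.
print(moe_active_fraction(8, 384))  # ~0.0208
```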
Alex Wa @_djdumpling
interesting bits:
1. changing chat template token ('assistant' -> 'me') shifted Hermes 4's behavior to embody peer-like, consistent voices with higher behavioral plasticity
2. Kimi K2's MuonClip stabilizes attention logits via per-head clipping where softcapping/QK-norm fell short
1 reply · 1 repost · 26 likes · 5.2K views
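The per-head clipping idea mentioned for MuonClip can be sketched roughly as follows. This is my own simplified illustration, not Kimi K2's published algorithm: the function name, tensor shapes, and threshold `tau` are assumptions, and the core move shown is rescaling each head's query/key projections so the pre-softmax logits stay bounded:

```python
import numpy as np

def qk_clip(Wq, Wk, x, tau=100.0):
    """Per-head attention-logit clipping (illustrative sketch).

    Wq, Wk: (num_heads, d_model, d_head) projection weights
    x:      (seq_len, d_model) layer input used to measure logits
    tau:    maximum allowed absolute attention logit
    """
    num_heads, _, d_head = Wq.shape
    for h in range(num_heads):
        q = x @ Wq[h]                          # (seq, d_head)
        k = x @ Wk[h]
        logits = (q @ k.T) / np.sqrt(d_head)   # pre-softmax logits
        m = np.abs(logits).max()
        if m > tau:
            # Scaling BOTH projections by sqrt(tau/m) shrinks the
            # product q @ k.T by exactly tau/m, capping logits at tau.
            gamma = np.sqrt(tau / m)
            Wq[h] *= gamma
            Wk[h] *= gamma
    return Wq, Wk
```

Clipping per head (rather than globally) means one exploding head doesn't force a rescale of well-behaved heads, which is the advantage the tweet contrasts with softcapping/QK-norm.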
Alex Wa @_djdumpling
@creet_z Not to mention a spot B300 only costs 25 cents more/hr than an H100
0 replies · 0 reposts · 5 likes · 582 views
Christian @creet_z
Using spot 8xB200 for $8/hr feels illegal like I’m robbing someone, taking compute from a baby if you will
9 replies · 0 reposts · 253 likes · 20.2K views
hallerite @hallerite
Happy to finally share what I have been working on for some time now. Introducing »Ludic« – an LLM-RL library for the era of experience. While there are now a lot of LLM-RL codebases, even many good ones, I want to share my very idiosyncratic way to think about LLM-RL.
[image]
15 replies · 33 reposts · 267 likes · 20.6K views
Alex Wa @_djdumpling
@ccui9 it's also worth mentioning that the LLMs tend to choose from among the top of the move list instead of reasoning about all possible moves, which would also lead to convergent strategies
0 replies · 0 reposts · 1 like · 46 views
Alex Wa @_djdumpling
@ccui9 I forgot to mention this, but passing in legal actions seems to neutralize training; grok-4-fast, gpt-5.2, and grok-4 all got around 0.77 (1 rollout); I think there are some artificial hivemind ideas at play, where their strategies converge due to being given the same set of moves
1 reply · 0 reposts · 1 like · 77 views
Alex Wa @_djdumpling
New blog as a part of the @PrimeIntellect RL residency! 🧵 In Fruit box, a grid-based reasoning game, we find that post-training a small CNN policy outperforms LLMs, but only with legal action masks. Despite operating on token sequences, LLMs demonstrate strong spatial reasoning
[image]
2 replies · 9 reposts · 124 likes · 22.1K views
Christian @creet_z
>alex applies to prime intellect residency >links a single blog post on his site "whirlwind of PPO and RLHF for LLMs from scratch" but its a banger >bring him in as resident bc i want to see another one >sure enough, puts out Yet Another Banger
Alex Wa @_djdumpling

New blog as a part of the @PrimeIntellect RL residency! 🧵 In Fruit box, a grid-based reasoning game, we find that post-training a small CNN policy outperforms LLMs, but only with legal action masks. Despite operating on token sequences, LLMs demonstrate strong spatial reasoning

3 replies · 8 reposts · 159 likes · 21.7K views
Alex Wa @_djdumpling
Other ideas I’d love to see: continuous factorization with DDPG, testing VLMs due to their strong spatial priors, interpreting attention traces + CNN features, and better credit assignment with value functions. Thanks for reading, and any feedback is welcome!
2 replies · 0 reposts · 11 likes · 901 views
Alex Wa @_djdumpling
The high-leverage fix: legal action masking. An "engineering pragmatism" lesson: don’t spend RL capacity relearning hard constraints. Enforce constraints, then let learning focus on strategy. With masking, the SFT policy beats all LLMs + most baselines, within ~6 pts of expert.
1 reply · 1 repost · 13 likes · 1.1K views
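The legal-action-masking fix described in the tweet above is usually implemented by zeroing out illegal actions before the softmax. A minimal sketch of the idea; the function name and shapes are my own illustration, not the blog's actual code:

```python
import numpy as np

def masked_policy(logits, legal_mask):
    """Legal-action masking: put -inf on illegal actions' logits so the
    softmax assigns them exactly zero probability (illustrative sketch).

    logits:     (num_actions,) raw policy scores
    legal_mask: (num_actions,) bool, True where the action is legal
    """
    masked = np.where(legal_mask, logits, -np.inf)
    masked = masked - masked.max()   # shift for numerical stability
    probs = np.exp(masked)           # exp(-inf) == 0 for illegal actions
    return probs / probs.sum()
```

This is the "don't spend RL capacity relearning hard constraints" point in code form: the constraint is enforced structurally, so gradients only flow through legal choices and learning is spent on strategy.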
Alex Wa @_djdumpling
@willccbb Thanks so much Will! I was just about to start my env 🙏🙏🙏
0 replies · 0 reposts · 1 like · 44 views
will brown @willccbb
mini tutorial on using verifiers with environments hub + prime CLI, and a demo of our new sandboxes integration :)
15 replies · 35 reposts · 363 likes · 30.6K views