Grad
@Grad62304977
3.7K posts
Joined October 2020
2.6K Following · 9K Followers
Grad@Grad62304977·
@code_star DeepSeek did a version of this too; we also used it during the training of Trinity Large
Grad retweeted
will brown@willccbb·
you can create complex agentic environments and launch RL training runs with a single prompt. deploy trained inference endpoints with a single click. no GPUs, no SSH, no vLLM. just `prime`. guide: docs.primeintellect.ai/guides/rl-trai…
François Fleuret@francoisfleuret·
BTW are hyper-networks a thing of the past?
Grad@Grad62304977·
ByteDance has been known to stick with PPO (see the VAPO and Seed 1.5 Thinking papers), and StepFun also uses PPO, so there's a bit of disagreement here I guess. Personally I think it's mostly that the infra work isn't worth it, and that rollouts per example (group size) are a nice, simple way to scale compute for better performance/credit assignment (a better advantage baseline). PPO doesn't have the same property: you can do more rollouts per example, like Seed 1.5 Thinking does, but the advantage is still always computed individually, so it's not clear how these compare.
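The group-size point can be made concrete. Below is a minimal sketch of a GRPO-style group-relative advantage (function name and the epsilon value are my own, not from any specific codebase): per prompt you sample a group of rollouts and normalize each reward against the group's mean and standard deviation, so a larger group gives a lower-variance baseline, whereas PPO scores each rollout individually against a learned value function.

```python
import numpy as np

def group_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: normalize each rollout's reward against its group."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# one prompt, a group of 4 rollouts with binary rewards
adv = group_advantages([1.0, 0.0, 0.0, 1.0])  # approximately [1, -1, -1, 1]
```

Note that a zero-variance group (all rollouts succeed or all fail) collapses to near-zero advantages, which is exactly why group size matters for credit assignment.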
Kirill Pavlenko@short_cast·
What puzzles me about the CUDA agent paper is that they use token-level PPO and don't even mention GRPO in the text. Like, what?
Grad retweeted
Justus Mattern@MatternJustus·
Incredibly excited to introduce Proximal with @navidkpr and @calvinchen! Proximal is a new data company: we believe that training data is a problem solved through creative technical ideas rather than hiring thousands of contractors x.com/ProximalHQ/sta…
Proximal@ProximalHQ

Today, we are announcing Proximal. Proximal is a research lab for data. Our core belief is that data which is complex enough to teach today's frontier models is not bottlenecked by domain experts, but by great ideas and excellent software.

We are excited about a world in which coding agents can autonomously run for multiple weeks, solve the hardest technical problems and discover novel ideas that advance progress in various domains of science and engineering. We believe that we are not far from this future, but that the biggest bottleneck preventing us from achieving it is training data.

Many companies work on data, but most of them are approaching it the wrong way. Historical capability breakthroughs are the result of creative engineers discovering scalable data collection methods, not thousands of contractors manually writing task demonstrations. Inevitably, the potential impact of human data will become smaller and smaller as model capabilities increase: agents are already outperforming most humans in many domains, and the number of experts capable of judging model outputs shrinks with every new model release.

Proximal is a new data company. We are not a recruiting firm or a talent marketplace, but a research and engineering organization that treats data as a problem which deserves the same level of rigor as work on training algorithms and model architectures. We think that this is the most impactful work towards agents that can autonomously solve complex technical problems, and intend to share our research and progress in the open.

Zhenghao Xu@ZhenghaoXu0·
@Grad62304977 @Zai_org For the optimizer, Kimi k1.5 mentioned resetting optimizer states per global step. Here, with fully async training, global step = mini step, so Adam becomes SGD if eps is large and SignGD if eps is tiny.
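The reduction is easy to verify with a single Adam update from freshly reset state: bias correction makes the corrected moments exactly g and g², so the step is -lr * g / (|g| + eps). The sketch below (function name and test values are illustrative) shows both limits:

```python
import numpy as np

def adam_step_from_reset(g, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update starting from zeroed optimizer state."""
    m = (1 - beta1) * g            # first moment from zero state
    v = (1 - beta2) * g ** 2       # second moment from zero state
    m_hat = m / (1 - beta1)        # bias correction -> exactly g
    v_hat = v / (1 - beta2)        # bias correction -> exactly g**2
    return -lr * m_hat / (np.sqrt(v_hat) + eps)

g = np.array([0.3, -2.0, 0.01])
signgd = adam_step_from_reset(g, eps=1e-12)  # eps tiny  -> -lr * sign(g)
sgd    = adam_step_from_reset(g, eps=1e3)    # eps large -> roughly -(lr/eps) * g
```

With eps tiny the per-coordinate magnitude is constant (SignGD); with eps huge the denominator is dominated by eps, recovering a rescaled SGD step.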
Grad@Grad62304977·
Really nice tech report, huge props to @Zai_org for still releasing these, as they are very valuable for the open-source community. Nice to see many similarities with our recipe for INTELLECT-3; excited for the further work on the RL recipe, we already have some stuff cooking up here. A really nice part was this, which means you can basically get away with RL without the need for optimizer states, so Adam -> SGD and Muon -> SGD+NS (I remember seeing this elsewhere but can't recall where right now lol). Also nice details on agentic data curation, especially for things that usually aren't mentioned, like terminal envs and slide generation.
samsja@samsja19

Amazing tech report for an amazing model, probably the most precise open-source recipe towards a SOTA model. I was positively surprised to see many similarities between their recipe and what we did during INTELLECT-3 and have implemented in prime-rl.
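The Muon -> SGD+NS reduction mentioned above means a state-free step simply orthogonalizes the raw gradient instead of a momentum buffer. A sketch of the Newton-Schulz iteration (quintic coefficients are those published with the open-source Muon implementation; applying it directly to the gradient is my reading of the state-free variant):

```python
import numpy as np

def newton_schulz(G, steps=5):
    """Approximately orthogonalize G: push all singular values towards 1."""
    a, b, c = 3.4445, -4.7750, 2.0315   # quintic coefficients from Muon
    X = G / (np.linalg.norm(G) + 1e-7)  # Frobenius norm 1 => spectral norm <= 1
    if transposed := X.shape[0] > X.shape[1]:
        X = X.T                          # iterate on the short-fat orientation
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X
```

A state-free "Muon" update would then be `w -= lr * newton_schulz(grad)`, with no optimizer state carried between steps.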

Grad@Grad62304977·
Also, the super cheap DSA conversion can be a huge unlock, especially for scaling RL. Excited to see what dsk4 does too with regard to sparse attention.
Grad@Grad62304977·
Nice to also see a bit more detail on where the data is actually sourced from, and cool to see synthetic-2 is part of it
Gabriele Berton@gabriberton·
2) RMSNorm everywhere, without the affine transform. An affine transform is common in EVERY normalization layer, as it is directly embedded into batch norm and layer norm, so many people don't even know it exists. 3) ReLU²: I've never seen this before in LLMs; most LLMs use SwiGLU. [3/N]
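For reference, a minimal sketch of the two components described above, with no learnable parameters as per the post (function names are my own):

```python
import numpy as np

def rmsnorm(x, eps=1e-6):
    """RMSNorm without the affine transform: rescale to unit root-mean-square."""
    return x / np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)

def relu_squared(x):
    """ReLU² activation: square of the positive part."""
    return np.maximum(x, 0.0) ** 2
```

Dropping the affine transform removes the per-channel gain that LayerNorm and BatchNorm normally learn; the output simply has unit RMS along the last axis.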
Gabriele Berton@gabriberton·
The most interesting thing I've seen in a while: the recipe by @karpathy to reduce GPT-2 1.5B training cost from $43,000 to $73! 7 years of improvements over vanilla GPT in 10 points. Let's start with the uncommon ones: 1) Value Embeddings: I've never seen this in any LLM, [1/N]
Grad@Grad62304977·
@stochasticchasm @seygalare Well, at least the way MiMo v2 Flash and Tinker do it, your advantage is just the IS ratio
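For context, a sketch of the truncated-importance-sampling correction under discussion (function name and the cap value are illustrative, not from any of the named codebases): each token's loss is weighted by the ratio of the trainer's probability to the inference engine's, clipped from above for stability and treated as a constant in the backward pass.

```python
import numpy as np

def tis_weights(logp_train, logp_infer, cap=2.0):
    """Per-token truncated importance weights for trainer/inference mismatch."""
    ratio = np.exp(np.asarray(logp_train) - np.asarray(logp_infer))
    return np.minimum(ratio, cap)  # detached (no gradient) when used in the loss

w = tis_weights([-1.0, -0.5], [-1.0, -2.0])  # ratios [1.0, e^1.5]; second capped at 2
```

The cap is what prevents this from being a full on-policy conversion: wherever the ratio is truncated, the correction is deliberately biased.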
stochasm@stochasticchasm·
Why doesn't the importance sampling trick for reducing trainer-inference mismatch let you convert any off-policy rollout into an on-policy one?
Grad retweeted
Prime Intellect@PrimeIntellect·
Introducing Lab: A full-stack platform for training your own agentic models Build, evaluate and train on your own environments at scale without managing the underlying infrastructure. Giving everyone their own frontier AI lab.
OpenBMB@OpenBMB·
RLVR (Reinforcement Learning with Verifiable Rewards) boosts reasoning in Math & Code, but applying it to general topics is hard because building rule-based verifiers for everything is impossible. 🤯

Today, we present RLPR: new research from THUNLP (OpenBMB member), NUS, and collaborators. A novel, verifier-free framework that extrapolates RLVR to general domains by using the LLM's own probability as the reward.

🤗 Paper: huggingface.co/papers/2506.18…
📄 arXiv: arxiv.org/abs/2506.18254
💻 Code: github.com/openbmb/RLPR

Why it matters:
1️⃣ No External Verifiers Needed: Traditional RLVR relies on complex, domain-specific verifiers. RLPR removes this bottleneck entirely. It allows models to learn from general-domain data without expensive human engineering or separate reward models. 🔓
2️⃣ Intrinsic Probability Reward: Instead of a binary "Pass/Fail", RLPR uses the LLM's intrinsic token probability of the reference answer as a fine-grained reward signal. We introduce Reward Debiasing and Adaptive Std-Filtering to stabilize this noisy signal, turning raw probability into a robust training guide. 📈
3️⃣ SOTA Performance & Efficiency: RLPR isn't just simpler; it's better. It outperforms strong verifier-based methods (like General Reasoner) and beats concurrent verifier-free methods (VeriFree) by 7.6 points on TheoremQA and 7.5 points on Minerva. It achieves consistent gains across Gemma, Llama, and Qwen models. 🚀

RLPR offers a scalable path to evolve LLM reasoning beyond just Math and Code, unlocking the potential of RL on general data.

Dataset: huggingface.co/datasets/openb…
Models: huggingface.co/collections/op…

#AI #THUNLP #OpenBMB #LLM #ReinforcementLearning #Reasoning
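A minimal sketch of the probability-as-reward idea (my own simplification: mean token probability of the reference answer, without the debiasing and std-filtering the announcement describes):

```python
import numpy as np

def probability_reward(ref_answer_logprobs):
    """Verifier-free reward: mean probability the policy assigns to the
    reference answer's tokens, instead of a binary pass/fail signal."""
    return float(np.mean(np.exp(ref_answer_logprobs)))

r = probability_reward(np.log([0.5, 0.25, 0.25]))  # mean of token probabilities
```

Unlike a rule-based verifier, this yields a graded signal: the more probability mass the model puts on the reference answer, the higher the reward.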
Grad@Grad62304977·
For Kimi K2 it can be thought of as: if the advantage is positive, push the model's logprobs for this sequence higher than before (increase the KL term); if the advantage is negative, push them down (decrease the KL term). For Kimi K2.5 it's IcePop (not CISPO; MiniMax M2.1 seems to use IcePop/MIS too), but with an extra KL term which is now always pushed to 0 (ratio pushed to 1, so it's directly always minimizing the mismatch)
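A sketch of the IcePop-style masking mentioned here (the band bounds are illustrative): tokens whose trainer/inference probability ratio drifts outside a band are dropped from the loss entirely, rather than having their ratio clipped as in CISPO-like schemes.

```python
import numpy as np

def icepop_mask(logp_train, logp_infer, low=0.5, high=2.0):
    """Keep only tokens whose trainer/inference ratio stays inside [low, high]."""
    ratio = np.exp(np.asarray(logp_train) - np.asarray(logp_infer))
    return (ratio >= low) & (ratio <= high)

mask = icepop_mask([-1.0, -0.2, -3.0], [-1.1, -2.0, -1.0])
```

Masked-out tokens contribute no gradient at all, which is the key difference from clipping: a clipped token still pushes with a truncated weight, a masked token is silent.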
Kimbo@kimbochen·
@Kimi_Moonshot uses Online Policy Mirror Descent (OPMD) for policy optimization rather than the PPO lineage (TRPO, GRPO, etc).

OPMD formulates the problem as a constrained optimization problem, where you maximize expected rewards subject to a KL-divergence constraint against a reference policy. The formulation has a closed-form solution with an intractable term, and that intractable term can be approximated by the mean reward of a sampled rollout group. With the closed-form solution, you then design a surrogate loss function that minimizes the formulation with a mean squared error, and after some calculus gymnastics you get the RL objective.

For Kimi K2.5, the team uses CISPO/IcePop-style importance-sampling-ratio masking at the token level.

The objective looks so different from GRPO to my noob eyes, but apparently it works well for Kimi. Philosophically and empirically, what do people see? Tagging the goats @Grad62304977 @stochasticchasm @gm8xx8 @snowclipsed @eliebakouch Idk who at Kimi to tag lol
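In symbols, the derivation sketched above goes roughly as follows (a paraphrase of the k1.5-style objective, with $\tau$ as the regularization temperature; not an exact reproduction of the paper):

```latex
% KL-regularized objective against a reference policy
\max_{\pi}\; \mathbb{E}_{y \sim \pi}\!\left[ r(x, y) \right]
  - \tau \,\mathrm{KL}\!\left( \pi \,\|\, \pi_{\mathrm{ref}} \right)

% closed-form optimum, with an intractable normalizer Z(x)
\pi^{*}(y \mid x) = \frac{\pi_{\mathrm{ref}}(y \mid x)\, e^{\, r(x,y)/\tau}}{Z(x)},
\qquad Z(x) = \sum_{y} \pi_{\mathrm{ref}}(y \mid x)\, e^{\, r(x,y)/\tau}

% taking logs, the optimum satisfies, for every y:
r(x, y) - \tau \log Z(x)
  - \tau \log \frac{\pi^{*}(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} = 0

% squared-error surrogate: approximate \tau \log Z(x) by the
% mean reward \bar{r} of the sampled rollout group
L(\theta) = \mathbb{E}\!\left[
  \Big( r(x, y) - \bar{r}
    - \tau \log \tfrac{\pi_{\theta}(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}
  \Big)^{2} \right]
```

Minimizing the squared residual drives the log-ratio towards the (centered) reward, which is the "MSE surrogate" described in the post.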