Chief Banana

3K posts


@rezer0dai

Non-violence leads to the highest ethics, which is the goal of all evolution. Until we stop harming all other living beings, we are still savages. ~ T. A. Edison

Joined December 2011
866 Following · 3.2K Followers
Chief Banana reposted
ö @r0keb
Good morning! I just published a blog post about a KASLR bypass that works on modern Windows 11 versions. It leverages Intel CPU cache timings to exfiltrate the base address of ntoskrnl.exe. I hope you like it! r0keb.github.io/posts/Bypassin…
Chief Banana reposted
Teknium 🪽 @Teknium
Today at Nous we released our RL Environments Gym - Atropos. With it we've been able to train impressive models like our tool calling specialist that saw a 5x improvement on the @berkeley_ai function calling benchmark and several other models that we've released as artifacts on HF. I hope that together we can build many more environments to broaden the targets of RL beyond math. We will be having a hackathon in SF next month to encourage just that, with a huge prize pool too! So stay tuned.
Nous Research @NousResearch

Reinforcement Learning in the era of LLMs requires scalable, distributed systems to push the boundaries of reasoning and alignment. Today we release Atropos, our RL environments framework: github.com/NousResearch/A…

Atropos is a rollout framework for reinforcement learning with foundation models that supports complex and diverse environments for advancing the capabilities of foundation models.

In Greek mythology, Atropos was the eldest of the three Fates. While her sisters spun and measured the threads of mortal lives, Atropos alone held the shears that would cut these threads, determining the final destiny of each soul. Just as Atropos guided souls to their ultimate fate, this system guides language models toward their optimal potential through reinforcement learning.

The work on Atropos was led by @dmayhem93 and built alongside @teknium, @rogershijin, @max_paperclips, @nullvaluetensor, @JSupa15, @artemsya and @karan4d

Chief Banana reposted
Tanishq Mathew Abraham, Ph.D. @iScienceLuvr
I am telling you guys if you really want to truly grasp diffusion models you MUST read all of @sedielem's blog posts!!!
Chief Banana reposted
Kyle Corbitt @corbtt
🧵 Excited to announce ART (Agent Reinforcement Trainer), a new RL framework for easily training agents with GRPO! Optimized for best-in-class efficiency and agentic, multi-turn interactions.
Chief Banana reposted
机器之心 JIQIZHIXIN @jiqizhixin
GRPO just got a speed boost! Xiamen University introduced Completion Pruning Policy Optimization (CPPO), which significantly reduces the number of gradient calculations and updates. How fast? On GSM8K, it's 8.32× faster than GRPO, and on MATH, the speedup is 3.51×. 🚀🔥
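The pruning idea can be sketched in a few lines. This is an illustrative reconstruction, not the paper's code: the function name and the rule of keeping the completions whose group-normalized advantage has the largest magnitude are assumptions about CPPO's selection criterion.

```python
import numpy as np

def prune_completions(rewards, keep_ratio=0.5):
    """Completion-pruning sketch: keep only the completions whose
    group-normalized advantage has the largest magnitude, so the
    policy update backpropagates through fewer samples."""
    rewards = np.asarray(rewards, dtype=float)
    # GRPO-style group-normalized advantage.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    k = max(1, int(len(rewards) * keep_ratio))
    # Indices of the most informative completions.
    keep = np.argsort(-np.abs(adv))[:k]
    return keep, adv[keep]

# A group of 8 sampled completions, 3 of which got reward 1.
keep, adv = prune_completions([1, 0, 0, 1, 1, 0, 0, 0], keep_ratio=0.5)
```

Pruning half the group halves the number of backward passes per update, which is where the speedup would come from.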
Chief Banana reposted
Nathan Lambert @natolambert
I hear people are pretty into GRPO and RL these days, so I wrote up a pretty comprehensive research survey of recent papers I liked. Kimi 1.5, OpenReasonerZero, DAPO and Dr. GRPO. + discussion on if GRPO is special and further reading. interconnects.ai/p/papers-im-re…
Chief Banana reposted
Lewis Tunstall @_lewtun
RL goes brrr in the latest TRL release!
🔥 Scale GRPO with multi-node training & vLLM's tensor parallelism
🚀 6x faster convergence with multi-step optimisation
📊 Support for domain-specific rewards
Release notes 👇 github.com/huggingface/tr…
Chief Banana reposted
François Fleuret @francoisfleuret
So it seems that "real CS" people got quite a huge result: anything that can be done in O(f(n)) compute can be done in O(sqrt(f(n))) memory. Wow. arxiv.org/abs/2502.17779
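For reference, the statement in the linked paper is slightly stronger than the tweet's paraphrase suggests: the bound carries a logarithmic factor. As I recall the abstract, it is:

```latex
% Williams (2025), "Simulating Time With Square-Root Space":
% every multitape Turing machine running in time t(n) >= n
% can be simulated using O(sqrt(t(n) log t(n))) space.
\[
  \mathsf{TIME}[t(n)] \subseteq \mathsf{SPACE}\!\left[\sqrt{t(n)\log t(n)}\right]
\]
```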
Chief Banana reposted
Alec Helbling @alec_helbling
One of the simplest algorithms for sampling from a probability distribution is Random Walk Metropolis-Hastings. It proposes new samples by taking Gaussian-distributed steps, accepting or rejecting them to maintain the target distribution. I call this pdf the "fidget spinner".
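The propose/accept/reject loop described above fits in a few lines of NumPy. This is a generic toy sketch (the function name, step size, and Gaussian target are all illustrative choices, not from the tweet's animation):

```python
import numpy as np

def rwmh(log_p, x0, n_steps=5000, step_size=0.5, seed=0):
    """Random Walk Metropolis-Hastings: propose Gaussian-distributed
    steps, accept or reject them to preserve the target density."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    samples = []
    for _ in range(n_steps):
        proposal = x + step_size * rng.standard_normal(x.shape)
        # Accept with probability min(1, p(proposal) / p(x)),
        # computed in log space for numerical stability.
        if np.log(rng.uniform()) < log_p(proposal) - log_p(x):
            x = proposal
        samples.append(x.copy())
    return np.array(samples)

# Target: standard 2D Gaussian, log-density known up to a constant.
log_p = lambda x: -0.5 * np.sum(x**2)
samples = rwmh(log_p, x0=np.zeros(2))
```

Only the density ratio is needed, so the normalizing constant of the target never has to be computed.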
Chief Banana reposted
Nathan Lambert @natolambert
Okay okay, spent my weekend gooning around learning GRPO math. Here's some takes. Essentially, this is me yapping through a recap of smaller details on how GRPO is implemented, what Dr. GRPO changes, why, DAPO, connections to PPO, aggregating batches... Reading list below.
Chief Banana reposted
Robert W Malone, MD @RWMaloneMD
The Climate Scam is Over. Peer-reviewed AI analysis completely debunks all of the "man-made" claims. Please click on the link to read or listen to the essay: malone.news/p/the-climate-…
Chief Banana reposted
drubinstein @dsrubinstein
Excited to finally share our progress in developing a reinforcement learning system to beat Pokémon Red. Our system successfully completes the game using a policy under 10M parameters, PPO, and a few novel techniques. Blog posted below
Chief Banana reposted
Alec Helbling @alec_helbling
Langevin Monte Carlo allows you to draw samples from a probability distribution using its log gradient ∇ log p(x). By performing a sort of gradient ascent with noise you can navigate around the distribution. Langevin MC is heavily related to modern diffusion models.
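The update described above is gradient ascent on log p(x) plus injected Gaussian noise. A toy sketch of the unadjusted Langevin algorithm (function name and step size are my illustrative choices):

```python
import numpy as np

def langevin(grad_log_p, x0, n_steps=5000, step=0.05, seed=0):
    """Unadjusted Langevin Monte Carlo: at each step, move along
    the score (gradient of log p) and add noise scaled by
    sqrt(2 * step) so the chain explores the whole distribution."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    out = []
    for _ in range(n_steps):
        noise = rng.standard_normal(x.shape)
        x = x + step * grad_log_p(x) + np.sqrt(2 * step) * noise
        out.append(x.copy())
    return np.array(out)

# Standard Gaussian target: the score is simply grad log p(x) = -x.
samples = langevin(lambda x: -x, x0=np.array([3.0]))
```

Because only the score function is needed, the same recipe applies when a neural network approximates the score, which is the connection to diffusion models the tweet mentions.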
Chief Banana reposted
Ryan M @Grimdoomer
Here it is, introducing the Xbox 360 Bad Update exploit, a software only hypervisor exploit for dashboard version 17559: github.com/grimdoomer/Xbo…
Chief Banana reposted
Ryan M @Grimdoomer
Here's part 1 of my blog series on hacking the Xbox 360 hypervisor. This covers the design of the hypervisor and hardware security features that back it. Consider it prerequisite material for part 2 which will be released next week (along with the exploit) icode4.coffee/?p=1047
Chief Banana reposted
Daniel Han @danielhanchen
We made 5 challenges and if you score 47 points we'll offer you $500K/year + equity to join us at 🦥@UnslothAI! No experience or PhD needed.
$400K - $500K/yr: Founding Engineer (47 points)
$250K - $300K/yr: ML Engineer (32 points)
Challenges:
1. Convert nf4 / BnB 4bit to Triton
2. Make FSDP2 work with QLoRA
3. Remove graph breaks in torch.compile
4. Help solve Unsloth issues!
5. Memory Efficient Backprop
If you have any questions about the challenges, please feel free to ask! We're looking for people to help push Unsloth forward - so come join us to democratize AI further!
Our past work includes:
1. 1.58bit DeepSeek R1 GGUFs: x.com/UnslothAI/stat…
2. GRPO with Llama 3.1 8B in a Colab: x.com/UnslothAI/stat…
3. Gemma bug fixes: x.com/danielhanchen/…
4. Gradient accumulation bug fixes: x.com/danielhanchen/…
Details & submission guide: colab.research.google.com/drive/1JqKqA1X…
Chief Banana reposted
Vivek Myers @vivek_myers
Reinforcement learning should be able to improve upon behaviors seen when training. In practice, RL agents often struggle to generalize to new long-horizon behaviors. Our new paper studies *horizon generalization*, the degree RL algorithms generalize to reaching distant goals. 1/
Chief Banana reposted
Nathan Lambert @natolambert
the TRL implementation of GRPO is technically correct if the number of gradient steps per batch is 1, because clipping never occurs. That being said, I hope they add the clipping logic soon (it's in Open Instruct and in standard PPO implementations; they may have already added it)
Joey (e/λ) @shxf0072

just a reminder: TRL's GRPO is not the same as described in the DeepSeek paper :) It doesn't have the clipping objective, which is a key innovation in PPO. GRPO has clipping + KL; TRL just has KL, which is technically incorrect
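For context, the objective being discussed is the standard PPO-style clipped surrogate. A minimal NumPy sketch (illustrative, not TRL's actual code) also shows why a single gradient step per batch makes the question moot:

```python
import numpy as np

def clipped_surrogate(logp_new, logp_old, advantages, eps=0.2):
    """PPO-style clipped objective: bound how far the probability
    ratio can push the update away from the behavior policy."""
    ratio = np.exp(np.asarray(logp_new) - np.asarray(logp_old))
    adv = np.asarray(advantages, dtype=float)
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1 - eps, 1 + eps) * adv
    # Pessimistic (elementwise minimum) surrogate, negated as a loss.
    return -np.minimum(unclipped, clipped).mean()

# With one gradient step per batch, logp_new == logp_old, so the
# ratio is exactly 1 and the clip is never active.
logp = np.log([0.5, 0.25])
loss = clipped_surrogate(logp, logp, [1.0, -1.0])
```

Once the policy takes multiple gradient steps on the same batch, the ratio drifts away from 1 and the clip starts binding, which is where the two implementations would diverge.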

Chief Banana reposted
starlabs @starlabs_sg
We're super stoked to publish this post. A huge shoutout to our former intern, @rainbowpigeon_ who poured his heart & soul into this 7-8 months ago. It took us a bit to polish it up but we're incredibly proud of him. Dive in & let us know what you think! starlabs.sg/blog/2025/12-m…