Chief Banana

3K posts


@rezer0dai

Non-violence leads to the highest ethics, which is the goal of all evolution. Until we stop harming all other living beings, we are still savages. ~ T. A. Edison

Joined December 2011
866 Following · 3.2K Followers
Chief Banana reposted
ö @r0keb
Good morning! I just published a blog post about a KASLR bypass that works on modern Windows 11 versions. It leverages Intel CPU cache timings to exfiltrate the base address of ntoskrnl.exe. I hope you like it! r0keb.github.io/posts/Bypassin…
Chief Banana reposted
Teknium 🪽 @Teknium
Today at Nous we released our RL Environments Gym - Atropos. With it we've been able to train impressive models like our tool calling specialist that saw a 5x improvement on the @berkeley_ai function calling benchmark and several other models that we've released as artifacts on HF. I hope that together we can build many more environments to broaden the targets of RL beyond math. We will be having a hackathon in SF next month to encourage just that, with a huge prize pool too! So stay tuned.
Nous Research @NousResearch

Reinforcement Learning in the era of LLMs requires scalable, distributed systems to push the boundaries of reasoning and alignment. Today we release Atropos, our RL environments framework: github.com/NousResearch/A…

Atropos is a rollout framework for reinforcement learning with foundation models that supports complex and diverse environments for advancing the capabilities of foundation models.

In Greek mythology, Atropos was the eldest of the three Fates. While her sisters spun and measured the threads of mortal lives, Atropos alone held the shears that would cut these threads, determining the final destiny of each soul. Just as Atropos guided souls to their ultimate fate, this system guides language models toward their optimal potential through reinforcement learning.

The work on Atropos was led by @dmayhem93 and built alongside @teknium, @rogershijin, @max_paperclips, @nullvaluetensor, @JSupa15, @artemsya and @karan4d

Chief Banana reposted
Tanishq Mathew Abraham, Ph.D. @iScienceLuvr
I am telling you guys if you really want to truly grasp diffusion models you MUST read all of @sedielem's blog posts!!!
Chief Banana reposted
Kyle Corbitt @corbtt
🧵 Excited to announce ART (Agent Reinforcement Trainer), a new RL framework for easily training agents with GRPO! Optimized for best-in-class efficiency and agentic, multi-turn interactions.
Chief Banana reposted
机器之心 JIQIZHIXIN @jiqizhixin
GRPO just got a speed boost! Xiamen University introduced Completion Pruning Policy Optimization (CPPO), which significantly reduces the number of gradient calculations and updates. How fast? On GSM8K, it's 8.32× faster than GRPO, and on MATH, the speedup is 3.51×. 🚀🔥
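The pruning idea can be sketched in a few lines. This is an illustrative reconstruction, not the paper's code: the function name and the rule of keeping the completions whose group-normalized advantage has the largest magnitude are assumptions about CPPO's selection criterion.

```python
import numpy as np

def prune_completions(rewards, keep_ratio=0.5):
    """Completion-pruning sketch: keep only the completions whose
    group-normalized advantage has the largest magnitude, so the
    policy update backpropagates through fewer samples."""
    rewards = np.asarray(rewards, dtype=float)
    # GRPO-style group-normalized advantage.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    k = max(1, int(len(rewards) * keep_ratio))
    # Indices of the most informative completions.
    keep = np.argsort(-np.abs(adv))[:k]
    return keep, adv[keep]

# A group of 8 sampled completions, 3 of which got reward 1.
keep, adv = prune_completions([1, 0, 0, 1, 1, 0, 0, 0], keep_ratio=0.5)
```

Pruning half the group halves the number of backward passes per update, which is where the speedup would come from.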
Chief Banana reposted
Nathan Lambert @natolambert
I hear people are pretty into GRPO and RL these days, so I wrote up a pretty comprehensive research survey of recent papers I liked. Kimi 1.5, OpenReasonerZero, DAPO and Dr. GRPO. + discussion on if GRPO is special and further reading. interconnects.ai/p/papers-im-re…
Chief Banana reposted
Lewis Tunstall @_lewtun
RL goes brrr in the latest TRL release!
🔥 Scale GRPO with multi-node training & vLLM's tensor parallelism
🚀 6x faster convergence with multi-step optimisation
📊 Support for domain-specific rewards
Release notes 👇 github.com/huggingface/tr…
Chief Banana reposted
François Fleuret @francoisfleuret
So it seems that "real CS" people got quite a huge result: anything that can be done in O(f(n)) compute can be done in O(sqrt(f(n))) memory. Wow. arxiv.org/abs/2502.17779
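For reference, the statement in the linked paper is slightly stronger than the tweet's paraphrase suggests: the bound carries a logarithmic factor. As I recall the abstract, it is:

```latex
% Williams (2025), "Simulating Time With Square-Root Space":
% every multitape Turing machine running in time t(n) >= n
% can be simulated using O(sqrt(t(n) log t(n))) space.
\[
  \mathsf{TIME}[t(n)] \subseteq \mathsf{SPACE}\!\left[\sqrt{t(n)\log t(n)}\right]
\]
```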
Chief Banana reposted
Alec Helbling @alec_helbling
One of the simplest algorithms for sampling from a probability distribution is Random Walk Metropolis-Hastings. It proposes new samples by taking Gaussian-distributed steps, accepting or rejecting them to maintain the target distribution. I call this pdf the "fidget spinner".
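The propose/accept/reject loop described above fits in a few lines of NumPy. This is a generic toy sketch (the function name, step size, and Gaussian target are all illustrative choices, not from the tweet's animation):

```python
import numpy as np

def rwmh(log_p, x0, n_steps=5000, step_size=0.5, seed=0):
    """Random Walk Metropolis-Hastings: propose Gaussian-distributed
    steps, accept or reject them to preserve the target density."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    samples = []
    for _ in range(n_steps):
        proposal = x + step_size * rng.standard_normal(x.shape)
        # Accept with probability min(1, p(proposal) / p(x)),
        # computed in log space for numerical stability.
        if np.log(rng.uniform()) < log_p(proposal) - log_p(x):
            x = proposal
        samples.append(x.copy())
    return np.array(samples)

# Target: standard 2D Gaussian, log-density known up to a constant.
log_p = lambda x: -0.5 * np.sum(x**2)
samples = rwmh(log_p, x0=np.zeros(2))
```

Only the density ratio is needed, so the normalizing constant of the target never has to be computed.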
Chief Banana reposted
Nathan Lambert @natolambert
Okay okay, spent my weekend gooning around learning GRPO math. Here's some takes. Essentially, this is me yapping through a recap of smaller details on how GRPO is implemented, what Dr. GRPO changes, why, DAPO, connections to PPO, aggregating batches... Reading list below.
Chief Banana reposted
Robert W Malone, MD @RWMaloneMD
The Climate Scam is Over. Peer-reviewed AI analysis completely debunks all of the "man-made" claims. Please click on the link to read or listen to the essay: malone.news/p/the-climate-…
Chief Banana reposted
drubinstein @dsrubinstein
Excited to finally share our progress in developing a reinforcement learning system to beat Pokémon Red. Our system successfully completes the game using a policy under 10M parameters, PPO, and a few novel techniques. Blog posted below
Chief Banana reposted
Alec Helbling @alec_helbling
Langevin Monte Carlo allows you to draw samples from a probability distribution using its log gradient ∇ log p(x). By performing a sort of gradient ascent with noise you can navigate around the distribution. Langevin MC is heavily related to modern diffusion models.
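The update described above is gradient ascent on log p(x) plus injected Gaussian noise. A toy sketch of the unadjusted Langevin algorithm (function name and step size are my illustrative choices):

```python
import numpy as np

def langevin(grad_log_p, x0, n_steps=5000, step=0.05, seed=0):
    """Unadjusted Langevin Monte Carlo: at each step, move along
    the score (gradient of log p) and add noise scaled by
    sqrt(2 * step) so the chain explores the whole distribution."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    out = []
    for _ in range(n_steps):
        noise = rng.standard_normal(x.shape)
        x = x + step * grad_log_p(x) + np.sqrt(2 * step) * noise
        out.append(x.copy())
    return np.array(out)

# Standard Gaussian target: the score is simply grad log p(x) = -x.
samples = langevin(lambda x: -x, x0=np.array([3.0]))
```

Because only the score function is needed, the same recipe applies when a neural network approximates the score, which is the connection to diffusion models the tweet mentions.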
Chief Banana reposted
Ryan M @Grimdoomer
Here it is, introducing the Xbox 360 Bad Update exploit, a software only hypervisor exploit for dashboard version 17559: github.com/grimdoomer/Xbo…
Chief Banana reposted
Ryan M @Grimdoomer
Here's part 1 of my blog series on hacking the Xbox 360 hypervisor. This covers the design of the hypervisor and hardware security features that back it. Consider it prerequisite material for part 2 which will be released next week (along with the exploit) icode4.coffee/?p=1047
Chief Banana reposted
Daniel Han @danielhanchen
We made 5 challenges and if you score 47 points we'll offer you $500K/year + equity to join us at 🦥@UnslothAI! No experience or PhD needed.
$400K - $500K/yr: Founding Engineer (47 points)
$250K - $300K/yr: ML Engineer (32 points)
Challenges:
1. Convert nf4 / BnB 4bit to Triton
2. Make FSDP2 work with QLoRA
3. Remove graph breaks in torch.compile
4. Help solve Unsloth issues!
5. Memory Efficient Backprop
If you have any questions about the challenges, please feel free to ask! We're looking for people to help push Unsloth forward - so come join us to democratize AI further!
Our past work includes:
1. 1.58bit DeepSeek R1 GGUFs: x.com/UnslothAI/stat…
2. GRPO with Llama 3.1 8B in a Colab: x.com/UnslothAI/stat…
3. Gemma bug fixes: x.com/danielhanchen/…
4. Gradient accumulation bug fixes: x.com/danielhanchen/…
Details & submission guide: colab.research.google.com/drive/1JqKqA1X…
Chief Banana reposted
Vivek Myers @vivek_myers
Reinforcement learning should be able to improve upon behaviors seen when training. In practice, RL agents often struggle to generalize to new long-horizon behaviors. Our new paper studies *horizon generalization*, the degree RL algorithms generalize to reaching distant goals. 1/
Chief Banana reposted
Nathan Lambert @natolambert
the TRL implementation of GRPO is technically correct if the number of gradient steps per batch is 1, because clipping never occurs. That being said, I hope they add the clipping logic soon (it's in Open Instruct and in standard PPO implementations; they may have already added it)
Joey (e/λ) @shxf0072

just a reminder: TRL's GRPO is not the same as described in the DeepSeek paper :) It doesn't have the clipping objective, which is a key innovation in PPO. GRPO has clipping + KL; TRL just has KL, which is technically incorrect
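For context, the objective being discussed is the standard PPO-style clipped surrogate. A minimal NumPy sketch (illustrative, not TRL's actual code) also shows why a single gradient step per batch makes the question moot:

```python
import numpy as np

def clipped_surrogate(logp_new, logp_old, advantages, eps=0.2):
    """PPO-style clipped objective: bound how far the probability
    ratio can push the update away from the behavior policy."""
    ratio = np.exp(np.asarray(logp_new) - np.asarray(logp_old))
    adv = np.asarray(advantages, dtype=float)
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1 - eps, 1 + eps) * adv
    # Pessimistic (elementwise minimum) surrogate, negated as a loss.
    return -np.minimum(unclipped, clipped).mean()

# With one gradient step per batch, logp_new == logp_old, so the
# ratio is exactly 1 and the clip is never active.
logp = np.log([0.5, 0.25])
loss = clipped_surrogate(logp, logp, [1.0, -1.0])
```

Once the policy takes multiple gradient steps on the same batch, the ratio drifts away from 1 and the clip starts binding, which is where the two implementations would diverge.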

Chief Banana reposted
starlabs @starlabs_sg
We're super stoked to publish this post. A huge shoutout to our former intern, @rainbowpigeon_ who poured his heart & soul into this 7-8 months ago. It took us a bit to polish it up but we're incredibly proud of him. Dive in & let us know what you think! starlabs.sg/blog/2025/12-m…