Shamane Siri | Pluralis

496 posts

@GShamane

Tinkering Transformers | Coding by Day, Hallucinating by Night

Melbourne, Victoria · Joined January 2013

652 Following · 194 Followers

Pinned Tweet

Shamane Siri | Pluralis@GShamane·31 Mar

Agentic RL environments are becoming critical. We integrated OpenReward (openreward.ai) into Alibaba’s ROLL (alibaba.github.io/ROLL/). Details: github.com/alibaba/ROLL/p…

2.4K

Shamane Siri | Pluralis retweeted

Niels Rogge@NielsRogge·8h

One of the hottest terms in AI right now is "On-policy distillation". It is a post-training technique in which a student model, typically an LLM, samples from its current policy and receives a teacher signal for on-policy states. It combines the dense supervision of distillation with the locality of online RL. Now a method on PapersWithCode! Find all 183 papers that cite it, and more here: paperswithcode.co/methods/on-pol…
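
The loop described in the post above can be sketched roughly as follows. This is a toy illustration, not any paper's implementation: the two hand-written probability tables stand in for real student and teacher models, the names `student_policy`, `teacher_policy`, and `on_policy_distill_losses` are hypothetical, and reverse KL is assumed as the teacher signal (one common choice among several).

```python
import math
import random

def reverse_kl(student, teacher):
    """KL(student || teacher) for two {token: probability} tables."""
    return sum(p * math.log(p / teacher[tok]) for tok, p in student.items() if p > 0)

def sample(dist, rng):
    """Draw one token from a {token: probability} table."""
    tokens = list(dist)
    return rng.choices(tokens, weights=[dist[t] for t in tokens], k=1)[0]

def on_policy_distill_losses(student_policy, teacher_policy, start, steps, rng):
    """Roll out the STUDENT and score every visited state with the teacher."""
    state, losses = start, []
    for _ in range(steps):
        s_dist = student_policy(state)
        t_dist = teacher_policy(state)
        losses.append(reverse_kl(s_dist, t_dist))  # dense: one signal per token
        state = state + (sample(s_dist, rng),)     # next state comes from the student
    return state, losses

# Toy policies (assumptions): both ignore the state for simplicity.
def student_policy(state):
    return {"a": 0.5, "b": 0.5}

def teacher_policy(state):
    return {"a": 0.9, "b": 0.1}

rollout, losses = on_policy_distill_losses(
    student_policy, teacher_policy, (), 4, random.Random(0)
)
```

The point of the sketch is the data flow: states are visited by sampling from the student (the "locality of online RL"), while every visited state yields a loss term (the "dense supervision of distillation").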

492

28.1K

Shamane Siri | Pluralis retweeted

Thalaiyasingam Ajanthan@tha_ajanthan·21h

Imagine being able to collectively train (and own) an LLM on all of these GPUs. This is exactly what we aim to do @Pluralis. See the current live run at agora.pluralis.ai

clem 🤗@ClementDelangue

300,000 AI builders filled their hardware profile on @huggingface and we're sharing the results: hf.co/hardware. Excited to see how it evolves in the coming months especially with the explosion of local AI!

1.8K

Shamane Siri | Pluralis retweeted

Thalaiyasingam Ajanthan@tha_ajanthan·2d

Agora has been operating at peak capacity for more than a day now, and the throughput has steadily increased. It's good to see things working as they should be.

Alexander Long@AlexanderLong

wow

266

Shamane Siri | Pluralis@GShamane·2d

How to make models better at OOD:
SFT -> catastrophic forgetting, and we have to mess around with checkpoints
RL -> solid but sparse; no way it can learn if things are too hard
OPSD -> dense, and the best of both worlds.

Applied Compute@appliedcompute

Some enterprise tasks are challenging to hill-climb with RL-based methods since they involve very out-of-distribution behavior. On-policy self-distillation (OPSD) gives a model learning signal for every token it writes, far richer than the single scalar reward of RL. But that channel is noisy: most tokens don't reflect the behavior you're after. We introduce Relevance-Masked Self-Distillation (RMSD), which uses a two-step filtered loss mask to cut through the noise and find the tokens with the highest signal. Compared to OPSD it trains more stably, provides higher data efficiency, and reaches a higher performance ceiling.

Shamane Siri | Pluralis retweeted

Chinmay@ChinmayKak·3d

Love this paper! Like the title says, it is so simple you are surprised it works. They do self-distillation (SFT) on model-generated traces. No PI, no feedback. It also confirms my hypothesis that off-policy distillation with a self-distillation setup should work (also seen in @TimXu222575’s take about the same), since the student and teacher models are ~identical, so SFT can create learning signal from them while avoiding catastrophic forgetting. Through analysis they find that this method lowers overall entropy while preserving exploration capacity. Also great analysis on why it helps!

163

Shamane Siri | Pluralis retweeted

Hadi M. Dolatabadi@hmdolatabadi·4d

We’re live! Glad to have been part of the Agora effort within @Pluralis. We’ve come a long way since Node0, making the infra layer more fault-tolerant while speeding up training by 10x with less compute. At the core, we’re using SSNs plus AsyncSparta to enable LLM pretraining over the internet. Under the hood, however, we had to tackle many technical challenges caused by the stochastic nature of the underlying hardware. To make this happen, we had to resolve conflicts that arise when combining PP training with stochastic hardware: enabling each replica to join/hold its own set of weights without derailing the run, keeping AR communication lightweight, and not wasting bandwidth on parts of the model that don’t need it after explicitly baking PP compression into the architecture itself. Glad to have contributed meaningfully to building this, and honestly, super excited for what the future holds. There are many technical challenges here that haven’t really surfaced anywhere before. A lot of them come from the uncharted territory of decentralized training; problems that big labs haven’t had to resolve because they’ve had access to massive amounts of datacenter compute.

Pluralis Research@Pluralis

Today we're releasing Agora: the first ever pretraining stack that allows non-collocated consumer GPUs to be competitive with centralized clusters. Agora is 15x faster than Megatron-LM in this setting and is only 1.5x less efficient in terms of tokens per unit compute than TorchTitan on H100s, despite running on devices that have no NVLink or InfiniBand support.

694

Shamane Siri | Pluralis retweeted

Pluralis Research@Pluralis·5d

231

53.3K

Shamane Siri | Pluralis retweeted

Riccardo Patana@RiccardoPatana·5d

Agora is our Protocol Learning pre-training infra for a world of collective, trustless and sovereign intelligence. More at pluralis.ai/dev/, live decentralized run at agora.pluralis.ai

Pluralis Research@Pluralis

808

Shamane Siri | Pluralis retweeted

Nando de Freitas@NandoDF·17 May

One line of code is all it takes to prevent LLM agent delusions, instead of post-training patches like RL. love4all.ai/blog/why-it-is… ❤️ 4 ∀ github.com/nandodef/love4…

278

47.2K

Shamane Siri | Pluralis@GShamane·15 May

This is exactly the infra we need. It would also be nice to work on efficient rollout inference, especially with pipeline and expert parallelism; that would open so many new doors.

Junli Wang@JunliWang2021

Digital agent learning needs massive rollouts. But digital agent rollouts are painfully slow due to heavy environments. 🐌
🚀 We introduce NanoRollout, a lightweight open infra (900 lines core code) for digital agent rollout at scale, validated with three workloads:
🏋️ Large batchsize (4K) SWE Agent RL -> surpasses DeepSWE-32B
🧪 250k+ distilled coding trajectories -> SOTA ≤32B open coding agent
⚡ Fast evaluation on coding/cua/unified agent -> finish
Check our Blog: cocoa-org.notion.site/nanorollout

Shamane Siri | Pluralis retweeted

Jacek Golebiowski@j_golebiowski·13 May

The next agent stack: a frontier LLM as orchestrator, fine-tuned SLMs as skills. For PII redaction, the orchestrator never sees raw text. The local 1B SLM does. It returns redacted output, and that's what the cloud model gets. Privacy by architecture, not by promise.
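
A rough sketch of the data flow described above, under loud assumptions: two regular expressions stand in for the local fine-tuned 1B SLM, and `cloud_orchestrator` is a placeholder for the frontier model call. The function names are hypothetical; only the shape of the pipeline (raw text stays local, the cloud side sees redacted text only) comes from the post.

```python
import re

# Stand-in patterns (assumption): a real deployment would use the local SLM here.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE = re.compile(r"\+?\d[\d\s-]{7,}\d")

def local_redact(text: str) -> str:
    """Runs on-device: the only component that ever sees the raw text."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)

def cloud_orchestrator(redacted: str) -> str:
    """Placeholder for the frontier LLM; it only ever receives redacted text."""
    return f"Orchestrator received: {redacted}"

def handle(raw_text: str) -> str:
    """Privacy by architecture: redaction happens before anything leaves the device."""
    return cloud_orchestrator(local_redact(raw_text))

result = handle("Contact jane.doe@example.com or +61 400 123 456 today")
```

Because `handle` composes the two steps in a fixed order, there is no code path on which the orchestrator is called with unredacted input.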

173

30.1K

Shamane Siri | Pluralis@GShamane·13 May

Prepare your own.

adaption@adaption_ai

Introducing AutoScientist. Most model training fails outside of frontier labs. AutoScientist automates the full research loop so it doesn't have to.

103

Shamane Siri | Pluralis retweeted

adaption@adaption_ai·13 May

Introducing AutoScientist. Most model training fails outside of frontier labs. AutoScientist automates the full research loop so it doesn't have to.

112

839

200.9K

Shamane Siri | Pluralis retweeted

Thinking Machines@thinkymachines·11 May

While Lilian is telling a story, the interaction model can track when she is thinking, yielding, self-correcting, or inviting a response; there is no specific built dialogue management system.

1.9K

293.8K

Shamane Siri | Pluralis@GShamane·10 May

Just wondering, how many were using fine-tuned models via OpenAI in the first place?

Mark Kretschmann@mark_k

OpenAI has announced they will be winding down fine tuning. I got the email today. Existing active @OpenAI customers can keep running fine-tuning jobs until January 6, 2027, but after that no new training jobs can be created. Existing fine-tuned models will still run, but only until the underlying base model is eventually deprecated. I get the argument that newer models follow instructions much better, and that prompts plus RAG cover more use cases than before. But not all of them.

Shamane Siri | Pluralis@GShamane·6 May

Agentic RL is becoming an infra thing. It is clear. Every component should be pluggable, especially Trainers and Environment managers. This is a green field for decentralised compute.
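
One way to picture "pluggable" Trainers and Environment managers is as small structural interfaces that a training loop depends on. The sketch below is purely illustrative: the `Trainer` and `EnvironmentManager` protocols, their method names, and the toy implementations are assumptions, not the API of ROLL or any other framework.

```python
from typing import List, Protocol, Tuple

Trajectory = List[Tuple[str, str, float]]  # (observation, action, reward)

class EnvironmentManager(Protocol):
    def rollout(self, policy_version: int) -> Trajectory: ...

class Trainer(Protocol):
    def update(self, batch: List[Trajectory]) -> int: ...

class EchoEnv:
    """Toy environment manager: returns a single-step trajectory."""
    def rollout(self, policy_version: int) -> Trajectory:
        return [("obs", f"act_v{policy_version}", 1.0)]

class CountingTrainer:
    """Toy trainer: 'updates' by bumping the policy version."""
    def __init__(self) -> None:
        self.version = 0

    def update(self, batch: List[Trajectory]) -> int:
        self.version += 1
        return self.version

def training_loop(env: EnvironmentManager, trainer: Trainer, steps: int) -> int:
    """Depends only on the two interfaces, so either side can be swapped out."""
    version = 0
    for _ in range(steps):
        batch = [env.rollout(version)]
        version = trainer.update(batch)
    return version

final_version = training_loop(EchoEnv(), CountingTrainer(), 3)
```

Since `training_loop` only relies on the two method signatures, a remote or community-run component could in principle be dropped in without changing the loop.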

Zhihu Frontier@ZhihuFrontier

📝 Agentic RL Infra Notes
Insights from Zhihu Contributor 低级炼丹师 📝

🔍 Core Difference: Agentic RL vs Traditional RL
• Traditional RL (RLVR): Single-time generation (answer → reward → update). Trains a "response-generating" model, no dynamic interaction.
• Agentic RL: Continuous action (tools + context + multi-round interaction) 🚀. Trains an "action-executing" model for real-world dynamic tasks.

🧩 Key Challenge: It’s a System Problem, Not Just Long Sequences
🧩 Core pain points: Agent access (white/black-box), environment management, long-tail rollout, training-deployment consistency. 4 systems solve these!

🛠️ Core Goal (Forge): Maximize Training Gain
Formula: Effective Gain = Throughput × Sample Efficiency
Constraints: Support any Agent + Stable convergence.

🌟 Core Solution: Separate Agent from RL Framework
• Agent = Trajectory producer (handles context/tool calls)
• RL System = Collect trajectories + Update models (no Agent simulation!)

🚀 4 Key Systems (1 Sentence Each)
• Forge (MiniMax): 3-layer architecture, supports white/black-box Agents & solves TITO inconsistency.
• ROLL (Alibaba): Splits Agent/environment/training, optimizes rollout bottleneck with Chunked MDP.
• Slime (Zhipu AI): Rollout as HTTP service, fixes TITO mismatch & manages off-policy errors.
• Seer (Moonshot): Sync optimization, splits rollout to cut long-tail latency + model-free speculation.

⏳ Key Optimizations
• Prefix Tree Merging: Cut duplicate computation from shared trajectory prefixes.
• Global KV Cache: Speed up inference for long Agent contexts.
• Clean Environment: Avoid reward pollution from residues/test leaks.

🔧 Deep Dive: Key Technical Points
• Agent Abstraction Layer: Defines unified interface (Observation → Action) to adapt white-box (customizable weights) & black-box (API-only) Agents, ensuring framework compatibility.
• Rollout Optimization: Chunked MDP splits long trajectories into manageable chunks; asynchronous rollout decouples Agent execution from training, reducing latency.
• TITO Consistency: Aligns training (Train), inference (Infer), test (Test), online (Online) environments/Agent versions to avoid performance degradation after deployment.
• Off-Policy Data Management: Uses replay buffer with priority sampling to filter low-quality trajectories, improving sample efficiency; Slime’s HTTP-based rollout ensures data traceability.
• Context Efficiency: Global KV Cache reuses shared context prefixes; Prefix Tree Merging eliminates redundant computation in multi-branch trajectories.

⚠️ Common Pitfalls & Avoidance Tips
• Over-Optimizing Throughput: Ignoring sample efficiency leads to wasted computing resources; balance throughput with priority trajectory sampling.
• Neglecting TITO Mismatch: Training on offline data but deploying to inconsistent online environments causes performance drop; align all four environments (Train/Infer/Test/Online) upfront.
• Agent Over-Simulation: Simulating Agent logic in RL framework increases complexity; stick to decoupling (Agent = trajectory producer, RL = training/collection).

📌 Practical Application Scenarios
• Tool-Using Agents: E-commerce customer service (multi-round tool calls: order query → refund processing); relies on rollout optimization & context efficiency.
• Autonomous Decision-Making: Industrial control (dynamic adjustment based on real-time data); benefits from TITO consistency & off-policy data management.

🎯 Epilogue
✅ Agentic RL infra = Decouple Agent + Optimize rollout + Ensure stability
Let’s build better Agentic RL infra together! 🌍
🔗 Highly recommend you to read the full article: zhuanlan.zhihu.com/p/202278614808…
#AgenticRL #AI #ROLL #Agent
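
The "Prefix Tree Merging" idea from the notes above can be illustrated with a small trie over token sequences. This is a counting sketch under assumptions, not code from Forge, ROLL, Slime, or Seer: the token names and helper functions are invented for illustration, and "tokens to process" is used as a simple proxy for compute.

```python
def count_tokens_naive(trajectories):
    """Tokens processed if every trajectory is handled independently."""
    return sum(len(t) for t in trajectories)

def build_prefix_tree(trajectories):
    """Insert each token sequence into a nested-dict trie."""
    root = {}
    for traj in trajectories:
        node = root
        for tok in traj:
            node = node.setdefault(tok, {})
    return root

def count_nodes(node):
    """Trie nodes = tokens processed once shared prefixes are merged."""
    return sum(1 + count_nodes(child) for child in node.values())

# Three hypothetical multi-branch rollouts sharing a common prefix.
trajectories = [
    ["sys", "user", "call_tool", "ok", "answer_a"],
    ["sys", "user", "call_tool", "ok", "answer_b"],
    ["sys", "user", "call_tool", "error", "retry"],
]

naive = count_tokens_naive(trajectories)                 # 15
merged = count_nodes(build_prefix_tree(trajectories))    # 8
```

In this toy case the shared prefix is processed once instead of three times, so 15 token positions collapse to 8. The same structure is what lets a KV cache be reused across branches.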

Shamane Siri | Pluralis@GShamane·4 May

In the next few months: I am an engineer with the Strike Rate of ... !

Moulik Shrivastav@MoulikTweets

x.com/i/article/2051…

Shamane Siri | Pluralis retweeted

Pluralis Research@Pluralis·1 May

Factored Gossip DiLoCo (by @ChaminHewa) has been accepted to ICML 2026. It removes the all-reduce required to compute the outer-optimiser step, improving robustness to failed nodes. In a collective training setting, this allows nodes to leave arbitrarily with minimal impact.
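
To give a feel for why dropping the all-reduce helps with failed nodes, here is a generic pairwise gossip-averaging sketch. It is not the Factored Gossip DiLoCo algorithm (the post does not describe its details); the node names, the scalar "parameter" per node, and the drop-out at step 5 are all assumptions made for illustration.

```python
import random

def gossip_round(values, alive, rng):
    """One round: pair up random live nodes; each pair averages its two values."""
    nodes = [n for n in values if n in alive]
    rng.shuffle(nodes)
    for a, b in zip(nodes[::2], nodes[1::2]):
        mean = (values[a] + values[b]) / 2.0
        values[a] = values[b] = mean

def spread(values, alive):
    """Gap between the largest and smallest value among live nodes."""
    live = [values[n] for n in values if n in alive]
    return max(live) - min(live)

rng = random.Random(0)
values = {f"node{i}": float(i) for i in range(8)}  # one scalar per node
alive = set(values)
initial_spread = spread(values, alive)

for step in range(30):
    if step == 5:
        alive.discard("node3")  # a node leaves mid-run; no global barrier to break
    gossip_round(values, alive, rng)

final_spread = spread(values, alive)
```

Each exchange involves only two nodes, so a departure affects at most one pairing per round, whereas a synchronous all-reduce would need every participant to be present.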

2.7K

Shamane Siri | Pluralis@GShamane·29 Apr

Decentralized agentic RL feels pretty natural at this scale. For 1T+ models, the system almost has to move toward:
1. Fully decoupled training, inference, and environments
2. Community-driven inference workers
3. Scalable trainers and rollout generators
Experience replay

Rishabh Agarwal@agarwl_

I gave a talk at ICLR 2026 about how we are scaling RL on frontier LLMs with 1T+ parameters, on experimental data from our physical lab at Periodic! Here's a rough recording of the talk:

104

Shamane Siri | Pluralis retweeted

elie@eliebakouch·29 Apr

we take this kind of plot for granted now, but RL compute scaling working on literally any domain (you can verify?) is still beautiful to me

Rishabh Agarwal@agarwl_

I gave a talk at ICLR 2026 about how we are scaling RL on frontier LLMs with 1T+ parameters, on experimental data from our physical lab at Periodic! Here's a rough recording of the talk:

159

13.9K
