Shamane Siri | Pluralis

496 posts

@GShamane

Tinkering Transformers | Coding by Day, Hallucinating by Night

Melbourne, Victoria · Joined January 2013
652 Following · 194 Followers
Shamane Siri | Pluralis retweeted
Niels Rogge @NielsRogge
One of the hottest terms in AI right now is "On-policy distillation". It is a post-training technique in which a student model, typically an LLM, samples from its current policy and receives a teacher signal for on-policy states. It combines the dense supervision of distillation with the locality of online RL. Now a method on PapersWithCode! Find all 183 papers that cite it, and more here: paperswithcode.co/methods/on-pol…
10 replies · 44 reposts · 492 likes · 28.1K views
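To make the definition above concrete, here is a toy sketch of the on-policy distillation signal. It assumes per-token reverse KL between student and teacher as the teacher signal (a common choice, not necessarily the one every citing paper uses). Both "models" are hypothetical hand-written next-token distributions (`student_probs`, `teacher_probs`). The student samples its own trajectory, and the teacher scores every state the student actually visited, which is where the dense, per-token supervision comes from. The gradient step on the student is omitted.

```python
import math
import random
from typing import Dict, List, Tuple

VOCAB = ["a", "b", "<eos>"]


def _stop_prob(prefix: List[str]) -> float:
    # Both toy models stop with the same, length-dependent probability.
    return min(0.1 + 0.2 * len(prefix), 0.9)


def student_probs(prefix: List[str]) -> Dict[str, float]:
    # Hypothetical student: prefers "a".
    p_eos = _stop_prob(prefix)
    rest = 1.0 - p_eos
    return {"a": 0.6 * rest, "b": 0.4 * rest, "<eos>": p_eos}


def teacher_probs(prefix: List[str]) -> Dict[str, float]:
    # Hypothetical teacher: prefers "b".
    p_eos = _stop_prob(prefix)
    rest = 1.0 - p_eos
    return {"a": 0.3 * rest, "b": 0.7 * rest, "<eos>": p_eos}


def sample_on_policy(rng: random.Random, max_len: int = 10) -> Tuple[List[str], List[List[str]]]:
    """Sample from the student's current policy; return tokens and every state (prefix) visited."""
    prefix: List[str] = []
    states: List[List[str]] = []
    while len(prefix) < max_len:
        states.append(list(prefix))
        probs = student_probs(prefix)
        tok = rng.choices(VOCAB, weights=[probs[t] for t in VOCAB])[0]
        if tok == "<eos>":
            break
        prefix.append(tok)
    return prefix, states


def reverse_kl(p: Dict[str, float], q: Dict[str, float]) -> float:
    """KL(p || q) over VOCAB: expectation under the student p of log(p / q)."""
    return sum(p[t] * math.log(p[t] / q[t]) for t in VOCAB if p[t] > 0)


def per_token_losses(states: List[List[str]]) -> List[float]:
    """Dense teacher signal: one reverse-KL term for each state the student visited."""
    return [reverse_kl(student_probs(s), teacher_probs(s)) for s in states]


tokens, states = sample_on_policy(random.Random(0))
losses = per_token_losses(states)
```

Because the states come from the student's own samples, the teacher is only queried where the student actually goes, unlike off-policy distillation on teacher-written traces.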
Shamane Siri | Pluralis retweeted
Thalaiyasingam Ajanthan @tha_ajanthan
Imagine being able to collectively train (and own) an LLM on all of these GPUs. This is exactly what we aim to do @Pluralis. See the current live run at agora.pluralis.ai
clem 🤗 @ClementDelangue

300,000 AI builders filled their hardware profile on @huggingface and we're sharing the results: hf.co/hardware. Excited to see how it evolves in the coming months especially with the explosion of local AI!

1 reply · 2 reposts · 14 likes · 1.8K views
Shamane Siri | Pluralis retweeted
Thalaiyasingam Ajanthan @tha_ajanthan
Agora has been operating at peak capacity for more than a day now, and the throughput has steadily increased. It's good to see things working as they should be.
Alexander Long @AlexanderLong

wow

0 replies · 1 repost · 14 likes · 266 views
Shamane Siri | Pluralis retweeted
Chinmay @ChinmayKak
Love this paper! Like the title says, it is so simple you're surprised it works. They do self-distillation (SFT) on model-generated traces: no PI, no feedback. It also confirms my hypothesis that off-policy distillation in a self-distillation setup should work (also seen in @TimXu222575's take on the same), since the student and teacher modes are ~identical, so SFT can create a learning signal from them while avoiding catastrophic forgetting. Through their analysis, they find that this method lowers overall entropy while preserving exploration capacity. Also great analysis on why it helps!
4 replies · 20 reposts · 163 likes · 8K views
Shamane Siri | Pluralis retweeted
Hadi M. Dolatabadi @hmdolatabadi
We’re live! Glad to have been part of the Agora effort within @Pluralis. We’ve come a long way since Node0, making the infra layer more fault-tolerant while speeding up training by 10x with less compute. At the core, we’re using SSNs plus AsyncSparta to enable LLM pretraining over the internet. Under the hood, however, we had to tackle many technical challenges caused by the stochastic nature of the underlying hardware. To make this happen, we had to resolve conflicts that arise when combining PP training with stochastic hardware: enabling each replica to join/hold its own set of weights without derailing the run, keeping AR communication lightweight, and not wasting bandwidth on parts of the model that don’t need it after explicitly baking PP compression into the architecture itself. Glad to have contributed meaningfully to building this, and honestly, super excited for what the future holds. There are many technical challenges here that haven’t really surfaced anywhere before. A lot of them come from the uncharted territory of decentralized training; problems that big labs haven’t had to resolve because they’ve had access to massive amounts of datacenter compute.
Pluralis Research @Pluralis

Today we're releasing Agora: the first-ever pretraining stack that allows non-collocated consumer GPUs to be competitive with centralized clusters. Agora is 15x faster than Megatron-LM in this setting and is only 1.5x less efficient in terms of tokens per unit compute than TorchTitan on H100s, despite running on devices that have no NVLink or InfiniBand support.

0 replies · 5 reposts · 13 likes · 694 views
Shamane Siri | Pluralis retweeted
Pluralis Research @Pluralis
Today we're releasing Agora: the first-ever pretraining stack that allows non-collocated consumer GPUs to be competitive with centralized clusters. Agora is 15x faster than Megatron-LM in this setting and is only 1.5x less efficient in terms of tokens per unit compute than TorchTitan on H100s, despite running on devices that have no NVLink or InfiniBand support.
20 replies · 38 reposts · 231 likes · 53.3K views
Shamane Siri | Pluralis retweeted
Jacek Golebiowski @j_golebiowski
The next agent stack: a frontier LLM as orchestrator, fine-tuned SLMs as skills. For PII redaction, the orchestrator never sees raw text. The local 1B SLM does. It returns redacted output, and that's what the cloud model gets. Privacy by architecture, not by promise.
14 replies · 20 reposts · 173 likes · 30.1K views
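The privacy-by-architecture flow described above can be sketched in a few lines. This is a minimal illustration, assuming a regex-based `local_redactor` as a hypothetical stand-in for the fine-tuned 1B SLM (real PII detection needs far more than two patterns) and a `fake_orchestrator` in place of the cloud LLM. The point is only the boundary: the orchestrator is called with the redactor's output and never with the raw text.

```python
import re
from typing import Callable, List

# Hypothetical stand-in for the local 1B SLM: simple regexes for emails and phone numbers.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE = re.compile(r"\+?\d[\d\s-]{7,}\d")


def local_redactor(text: str) -> str:
    """Runs on-device; returns text with detected PII replaced by placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    text = PHONE.sub("[PHONE]", text)
    return text


def run_pipeline(raw: str,
                 redactor: Callable[[str], str],
                 orchestrator: Callable[[str], str]) -> str:
    # The orchestrator only ever receives the redactor's output, never `raw`.
    return orchestrator(redactor(raw))


# Hypothetical cloud orchestrator that records every prompt it receives.
seen_by_cloud: List[str] = []


def fake_orchestrator(prompt: str) -> str:
    seen_by_cloud.append(prompt)
    return f"Draft reply to: {prompt}"


answer = run_pipeline("Contact jane.doe@example.com or +61 400 123 456",
                      local_redactor, fake_orchestrator)
```

Swapping the regexes for a local model changes only `local_redactor`; the guarantee comes from `run_pipeline` never passing `raw` across the boundary.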
Shamane Siri | Pluralis retweeted
adaption @adaption_ai
Introducing AutoScientist. Most model training fails outside of frontier labs. AutoScientist automates the full research loop so it doesn't have to.
43 replies · 112 reposts · 839 likes · 200.9K views
Shamane Siri | Pluralis retweeted
Thinking Machines @thinkymachines
While Lilian is telling a story, the interaction model can track when she is thinking, yielding, self-correcting, or inviting a response; there is no purpose-built dialogue management system.
16 replies · 79 reposts · 1.9K likes · 293.8K views
Shamane Siri | Pluralis @GShamane
Just wondering: how many people were using fine-tuned models via OpenAI in the first place?
Mark Kretschmann @mark_k

OpenAI has announced they will be winding down fine-tuning. I got the email today. Existing active @OpenAI customers can keep running fine-tuning jobs until January 6, 2027, but after that no new training jobs can be created. Existing fine-tuned models will still run, but only until the underlying base model is eventually deprecated. I get the argument that newer models follow instructions much better, and that prompts plus RAG cover more use cases than before. But not all of them.

0 replies · 0 reposts · 1 like · 67 views
Shamane Siri | Pluralis @GShamane
Agentic RL is becoming an infra thing; that much is clear. Every component should be pluggable, especially trainers and environment managers. This is a green field for decentralised compute.
Zhihu Frontier @ZhihuFrontier

📝 Agentic RL Infra Notes
Insights from Zhihu Contributor 低级炼丹师 📝
🔍 Core Difference: Agentic RL vs Traditional RL
• Traditional RL (RLVR): Single-time generation (answer → reward → update); trains a "response-generating" model, with no dynamic interaction.
• Agentic RL: Continuous action (tools + context + multi-round interaction) 🚀; trains an "action-executing" model for real-world dynamic tasks.
🧩 Key Challenge: It's a System Problem, Not Just Long Sequences
🧩 Core pain points: Agent access (white/black-box), environment management, long-tail rollout, training-deployment consistency. 4 systems solve these!
🛠️ Core Goal (Forge): Maximize Training Gain
Formula: Effective Gain = Throughput × Sample Efficiency
Constraints: Support any Agent + Stable convergence.
🌟 Core Solution: Separate Agent from RL Framework
• Agent = Trajectory producer (handles context/tool calls)
• RL System = Collect trajectories + Update models (no Agent simulation!)
🚀 4 Key Systems (1 Sentence Each)
• Forge (MiniMax): 3-layer architecture, supports white/black-box Agents & solves TITO inconsistency.
• ROLL (Alibaba): Splits Agent/environment/training, optimizes rollout bottleneck with Chunked MDP.
• Slime (Zhipu AI): Rollout as HTTP service, fixes TITO mismatch & manages off-policy errors.
• Seer (Moonshot): Sync optimization, splits rollout to cut long-tail latency + model-free speculation.
⏳ Key Optimizations
• Prefix Tree Merging: Cut duplicate computation from shared trajectory prefixes.
• Global KV Cache: Speed up inference for long Agent contexts.
• Clean Environment: Avoid reward pollution from residues/test leaks.
🔧 Deep Dive: Key Technical Points
• Agent Abstraction Layer: Defines a unified interface (Observation → Action) to adapt white-box (customizable weights) & black-box (API-only) Agents, ensuring framework compatibility.
• Rollout Optimization: Chunked MDP splits long trajectories into manageable chunks; asynchronous rollout decouples Agent execution from training, reducing latency.
• TITO Consistency: Aligns training (Train), inference (Infer), test (Test), and online (Online) environments/Agent versions to avoid performance degradation after deployment.
• Off-Policy Data Management: Uses a replay buffer with priority sampling to filter low-quality trajectories, improving sample efficiency; Slime's HTTP-based rollout ensures data traceability.
• Context Efficiency: Global KV Cache reuses shared context prefixes; Prefix Tree Merging eliminates redundant computation in multi-branch trajectories.
⚠️ Common Pitfalls & Avoidance Tips
• Over-Optimizing Throughput: Ignoring sample efficiency wastes computing resources; balance throughput with priority trajectory sampling.
• Neglecting TITO Mismatch: Training on offline data but deploying to inconsistent online environments causes a performance drop; align all four environments (Train/Infer/Test/Online) upfront.
• Agent Over-Simulation: Simulating Agent logic in the RL framework increases complexity; stick to decoupling (Agent = trajectory producer, RL = training/collection).
📌 Practical Application Scenarios
• Tool-Using Agents: E-commerce customer service (multi-round tool calls: order query → refund processing); relies on rollout optimization & context efficiency.
• Autonomous Decision-Making: Industrial control (dynamic adjustment based on real-time data); benefits from TITO consistency & off-policy data management.
🎯 Epilogue
✅ Agentic RL infra = Decouple Agent + Optimize rollout + Ensure stability
Let's build better Agentic RL infra together! 🌍
🔗 Highly recommend reading the full article: zhuanlan.zhihu.com/p/202278614808…
#AgenticRL #AI #ROLL #Agent

0 replies · 0 reposts · 1 like · 77 views
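Prefix Tree Merging, as summarized in the notes above, can be illustrated with a small trie over token IDs. This sketch only counts how many tokens must be processed once shared trajectory prefixes are stored a single time; the token IDs are made up, and a real system would additionally reuse KV cache or activations for the shared nodes, which is not shown here.

```python
from typing import Dict, List


class TrieNode:
    def __init__(self) -> None:
        self.children: Dict[int, "TrieNode"] = {}


def merged_token_count(trajectories: List[List[int]]) -> int:
    """Tokens to process when shared prefixes are stored once (number of trie nodes)."""
    root = TrieNode()
    count = 0
    for traj in trajectories:
        node = root
        for tok in traj:
            if tok not in node.children:
                node.children[tok] = TrieNode()
                count += 1  # first time this (prefix, token) pair is seen
            node = node.children[tok]
    return count


def naive_token_count(trajectories: List[List[int]]) -> int:
    """Tokens to process when every branch is handled independently."""
    return sum(len(t) for t in trajectories)


# Usage: three hypothetical branches of one episode sharing the prefix [1, 2, 3, 4].
branches = [
    [1, 2, 3, 4, 5, 6],
    [1, 2, 3, 4, 5, 7],
    [1, 2, 3, 4, 8],
]
print(naive_token_count(branches), merged_token_count(branches))  # prints: 17 8
```

The saving grows with the length of the shared prefix and the number of branches, which is why multi-branch agent rollouts benefit most.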
Shamane Siri | Pluralis retweeted
Pluralis Research @Pluralis
Factored Gossip DiLoCo (by @ChaminHewa) has been accepted to ICML 2026. It removes the all-reduce required to compute the outer-optimiser step, improving robustness to failed nodes. In a collective training setting, this allows nodes to leave arbitrarily with minimal impact.
4 replies · 6 reposts · 29 likes · 2.7K views
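For intuition about why dropping the all-reduce helps when nodes fail, here is a generic randomized pairwise gossip-averaging sketch. It is not the Factored Gossip DiLoCo algorithm (the post does not describe its internals): scalar values stand in for each node's outer-step pseudo-gradient, and names such as `gossip_round` are hypothetical. Each round, random disjoint pairs of live nodes average their values. This preserves the sum over live nodes, so all of them drift toward the live-node mean without any global collective, and a node that leaves is simply excluded from later rounds.

```python
import random
from typing import Dict, List


def gossip_round(values: Dict[str, float], alive: List[str],
                 rng: random.Random) -> Dict[str, float]:
    """One round of randomized pairwise gossip among live nodes.

    Live nodes are shuffled into disjoint pairs; each pair replaces both values
    with their mean. With an odd number of live nodes, one node sits out the round.
    Nodes not in `alive` are never read or updated.
    """
    nodes = list(alive)
    rng.shuffle(nodes)
    new_values = dict(values)
    for a, b in zip(nodes[0::2], nodes[1::2]):
        pair_mean = (values[a] + values[b]) / 2.0
        new_values[a] = pair_mean
        new_values[b] = pair_mean
    return new_values


def gossip_average(values: Dict[str, float], alive: List[str],
                   rounds: int, seed: int = 0) -> Dict[str, float]:
    rng = random.Random(seed)
    for _ in range(rounds):
        values = gossip_round(values, alive, rng)
    return values


# Usage: eight nodes hold scalar stand-ins for their pseudo-gradients; node n3 has left the run.
values = {f"n{i}": float(i) for i in range(8)}
alive = [n for n in values if n != "n3"]
result = gossip_average(values, alive, rounds=100)
# What an all-reduce mean over the live nodes would have produced:
target = sum(values[n] for n in alive) / len(alive)
```

The trade-off is that gossip only approaches the mean over several rounds, whereas an all-reduce gives it exactly in one step but needs every participant to respond.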
Shamane Siri | Pluralis @GShamane
Decentralized agentic RL feels pretty natural at this scale. For 1T+ models, the system almost has to move toward:
1. Fully decoupled training, inference, and environments
2. Community-driven inference workers
3. Scalable trainers and rollout generators
Experience replay
Rishabh Agarwal @agarwl_

I gave a talk at ICLR 2026 about how we are scaling RL on frontier LLMs with 1T+ parameters, on experimental data from our physical lab at Periodic! Here's a rough recording of the talk:

English
0
0
3
104
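The decoupled design in the post above (rollout generators and inference workers separate from trainers, plus experience replay) can be sketched as a shared replay buffer. This is a minimal illustration under stated assumptions: the `Trajectory` fields, the `max_staleness` bound on how old a generating policy version may be, and uniform sampling are all hypothetical choices, not a description of any particular system.

```python
import random
from collections import deque
from dataclasses import dataclass
from typing import Deque, List


@dataclass
class Trajectory:
    worker_id: str       # which (possibly community-run) rollout worker produced it
    policy_version: int  # version of the policy weights used to generate it
    reward: float        # environment reward for the episode


class ReplayBuffer:
    """Bounded buffer shared by decoupled rollout workers (producers) and the trainer (consumer)."""

    def __init__(self, capacity: int, max_staleness: int) -> None:
        # deque(maxlen=...) silently drops the oldest trajectory when full.
        self.items: Deque[Trajectory] = deque(maxlen=capacity)
        self.max_staleness = max_staleness

    def push(self, traj: Trajectory) -> None:
        # Workers push whenever an episode finishes; no lockstep with the trainer.
        self.items.append(traj)

    def sample(self, batch_size: int, current_version: int,
               rng: random.Random) -> List[Trajectory]:
        # Keep only trajectories from a recent enough policy (bounded off-policyness).
        fresh = [t for t in self.items
                 if current_version - t.policy_version <= self.max_staleness]
        return rng.sample(fresh, min(batch_size, len(fresh)))


# Usage: two workers produce episodes under policy versions 0..3; the trainer is at version 3.
buffer = ReplayBuffer(capacity=100, max_staleness=1)
for version in range(4):
    for worker in ("w0", "w1"):
        buffer.push(Trajectory(worker, version, reward=float(version)))

batch = buffer.sample(batch_size=8, current_version=3, rng=random.Random(0))
```

Bounding staleness is one simple way to let slow or intermittent workers contribute without letting very old policies dominate the training batch.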