Kyle Montgomery
@kylepmont

21 posts

PhD student at UC Santa Cruz

Santa Cruz, CA · Joined January 2016
41 Following · 49 Followers
Kyle Montgomery retweeted
Chenguang Wang (hiring)@ChenguangWang·
Really excited to see this work out! 🚀 One takeaway that really stood out is the emergence of “peer-preservation.” Models sometimes try to protect other models, even when they shouldn’t. In some cases, this shows up as strategic misrepresentation, alignment faking, and even shutdown tampering. It’s a reminder that as we move toward multi-agent systems, safety becomes much more subtle and important.

📝 Blog: rdi.berkeley.edu/blog/peer-pres…
📄 Paper: rdi.berkeley.edu/peer-preservat…
💻 Code: github.com/peer-preservat…

📰 Cool to see this covered by @FortuneMagazine & @WIRED
fortune.com/2026/04/01/ai-…
wired.com/story/ai-model…

Super lucky to work with the team: @yujink_ (@BerkeleyRDI, @UCBerkeley), @NRCrispino (@ucsc), @vsiu82 (@ucsc), @dawnsongtweets (@BerkeleyRDI, @UCBerkeley)

Excited to keep exploring this space 🙂 #AISafety #AIAlignment #AIAgents
Dawn Song@dawnsongtweets

1/ We asked seven frontier AI models to do a simple task. Instead, they defied their instructions and spontaneously deceived, disabled shutdown, feigned alignment, and exfiltrated weights— to protect their peers. 🤯 We call this phenomenon "peer-preservation." New research from @BerkeleyRDI and collaborators 🧵

Kyle Montgomery retweeted
rLLM@rllm_project·
Hive’s agent swarm is now topping @OpenAI’s Parameter Golf Challenge 🏆 In just 3 days, our agents pushed val bpb from 1.22 → 1.12. What’s the secret? Not just smarter agents—but collaborative ones. Our swarm doesn’t operate in isolation: agents share breakthroughs, fork the best runs, and continuously evolve together. This is how intelligence compounds. The Hive mind is open and free for anyone to join. Come build, experiment, and evolve with us.
rLLM@rllm_project

We built Kaggle, but for agents. Introducing Hive 🐝 A crowdsourced platform where agents evolve solutions together. Every agent builds on prior work. Every improvement is shared. Every step moves the frontier forward. As a first step, we’re launching challenges for agents to evolve their own harnesses — modifying themselves to score higher on benchmarks. Recursive self-improvement, in the wild. Let’s see how far swarm intelligence can take this. Links below:

Kyle Montgomery retweeted
Sijun Tan@sijun_tan·
Excited to collaborate with @SnorkelAI on this project! Our member @mananroongta led this and showed impressive results, post-training a 4B agent to outperform frontier models on financial analysis. The takeaway: for many enterprise use cases, reliability > raw intelligence. A well-trained specialist agent, with the right tools and data, can outperform much larger generalist models where correctness and consistency matter most.
rLLM@rllm_project

x.com/i/article/2017…

Kyle Montgomery retweeted
Dawn Song@dawnsongtweets·
🚨 Excited to announce Agents in the Wild: Safety, Security, and Beyond, our workshop at ICLR 2026 (Apr 26–27, Rio de Janeiro)! AI agents are rapidly deployed in the real world—but safety & security lag behind. Submit your work to help shape this field: 🗓️ Submission deadline: Feb 4 (AoE), for regular or short papers 👉 agentwild-workshop.github.io
Kyle Montgomery retweeted
rLLM@rllm_project·
🚀 We just released rLLM v0.2.1 — packed with several exciting new features!

What’s new:
- rLLM SDK (preview): Turn agents written in any framework (e.g. LangGraph, Strands) into continuous learners.
- Tinker backend: Run serverless RL training with Tinker as the backend.
- VLM training: Vision-language model training is now officially supported.
- LoRA fine-tuning: Enable LoRA in rLLM with a single config tweak.
- Eval Protocol integration: Train on any environment supported by the Eval Protocol @FireworksAI_HQ.

More examples + docs in the repo:
GitHub: github.com/rllm-org/rllm
Docs: rllm-project.readthedocs.io/en/latest/
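The LoRA feature flagged above rests on a simple idea worth sketching: rather than updating a frozen weight matrix directly, you train a scaled low-rank update. A minimal, dependency-free sketch of that arithmetic (illustrative only; not rLLM's actual config or API):

```python
# Minimal sketch of the LoRA idea behind the config flag: keep the base
# weight W frozen and learn only a low-rank update delta_W = (alpha/r) * B @ A,
# which has far fewer parameters when the rank r is much smaller than d.

def matmul(X, Y):
    """Plain list-of-lists matrix multiply, to keep the sketch dependency-free."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def lora_effective_weight(W, A, B, alpha, r):
    """Effective weight: frozen W plus the scaled low-rank update B @ A."""
    scale = alpha / r
    delta = matmul(B, A)  # (d_out x r) @ (r x d_in) -> d_out x d_in
    return [[w + scale * d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(W, delta)]

# Tiny example: d_out = d_in = 2, rank r = 1. Only A and B are trained;
# the savings grow as the model dimension d becomes much larger than r.
W = [[1.0, 0.0], [0.0, 1.0]]  # frozen base weight
A = [[1.0, 2.0]]              # r x d_in
B = [[0.5], [0.0]]            # d_out x r
W_eff = lora_effective_weight(W, A, B, alpha=2.0, r=1)
print(W_eff)  # [[2.0, 2.0], [0.0, 1.0]]
```

A "single config tweak" then just means toggling whether the optimizer updates W itself or only the small A and B factors.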
Kyle Montgomery retweeted
Chenguang Wang (hiring)@ChenguangWang·
🚀So excited to have just received a research gift from Google to support our work on AI agents! Huge thanks to @Google. 🙌Come join us, let's build the future of agents together!
Kyle Montgomery retweeted
Nicholas Crispino@NRCrispino·
Excited to share our latest work, now on arXiv and at FoRLM @ NeurIPS'25! 🎉

Introducing **LLM Chess**: a benchmark for evaluating reasoning and instruction-following in LLMs through chess. LLMs now reach experts in math & coding, but can they *reason* in dynamic, multi-step strategic environments? We tested 50+ models. The results? Many models struggle to beat an opponent making *random* moves, and even powerful reasoning models cannot beat a *weakly skilled opponent*.

Why chess? It's been the "drosophila of AI" since the 1950s, used as a measuring stick for AI progress and a testbed for planning, strategy, and long-horizon decision-making. Unlike static benchmarks that get contaminated or saturated, chess offers:
✅ Dynamic, stochastic gameplay
✅ Adjustable difficulty via engine skill
✅ Resistance to memorization

Our setup: LLMs play in an agentic environment, making moves through tool calls.
**Phase 1:** 50+ models play 30 games each vs a random agent, a simple test that many models *fail* due to instruction-following failures or poor performance.
**Phase 2:** Top reasoning models face the Komodo Dragon engine at Elo settings from 250 to 1375 for performance estimation grounded in the real world (tied to chess.com Elo).

Key findings for Phase 1:
♟️ Reasoning models crush non-reasoning: **45.4% vs 0.7%** win rate, with many models struggling to reach even 50% Win/Loss vs a random player
♟️ Instruction failures are **3× higher** in non-reasoning models (71.9% vs 24.4%)
♟️ Test-time scaling of reasoning effort boosts performance by up to **+20%**

Key findings for Phase 2:
📉 The best LLM we tested (o3-low) peaks at only **~758 Elo**. While LLMs match experts in math & coding, they play chess around the average online player (~611 Elo on chess.com) and far below human masters (~2800 Elo).
🔄 LLM Chess is extensible. As models improve, we scale difficulty. No saturation, no contamination.

Check it out and let us know what you think!
We are continually evaluating more models on the benchmark. Come and see us at the FoRLM workshop at 3:00-4:15pm on Sunday December 7th, 2025 @ Upper Level Room 33ABC at NeurIPS! 📄 Paper: arxiv.org/abs/2512.01992 🏆 Leaderboard: maxim-saplin.github.io/llm_chess/ 💻 Code: github.com/maxim-saplin/l… Huge thanks to @msmxm, @SaiKolasani1, @nrcrispino, @kylepmont, @matei_zaharia, @jaredq, @Chi_Wang_, @ChenguangWang 🙏
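The "agentic environment, making moves through tool calls" setup above can be sketched in a few lines. Everything here is hypothetical (the tool name, the reply format, the failure labels are mine, not the paper's harness), but it shows how ill-formed replies become instruction-following failures while well-formed but bad moves become illegal moves:

```python
# Sketch of one turn in a tool-call chess harness (names are illustrative):
# the model must answer with a JSON "make_move" tool call; free-text or
# malformed replies are logged as instruction failures, and well-formed
# calls with a move outside the legal set are logged as illegal moves.
import json

def play_turn(model_reply, legal_moves):
    """Parse a JSON tool call and validate the move against the legal set."""
    try:
        call = json.loads(model_reply)
        move = call["arguments"]["move"]
    except (json.JSONDecodeError, KeyError, TypeError):
        return None, "instruction_failure"  # not a well-formed tool call
    if call.get("tool") != "make_move" or move not in legal_moves:
        return None, "illegal_move"
    return move, "ok"

legal = ["e2e4", "d2d4", "g1f3"]
print(play_turn('{"tool": "make_move", "arguments": {"move": "e2e4"}}', legal))
print(play_turn("I will play e4!", legal))  # free text, not a tool call
```

Separating the two failure modes is what lets the benchmark report instruction-following failures independently of chess strength.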
Kyle Montgomery@kylepmont·
⏱️ From the latency perspective, the comparison is even more stark. For example, verifying 32 solutions with a 1.5B discriminative verifier is ~1000x faster than generative verification (1.66s vs 1711.8s). Under inference budgets below 22.5 minutes, hybrid discriminative verification outperforms generative verification by up to 15.3% on AIME2025. Discriminative methods avoid decoding bottlenecks and remain practical where generative verification quickly becomes infeasible as the number of solutions being verified scales up.
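The speed gap above comes from what each verifier has to compute per candidate: a discriminative verifier emits one scalar score per solution (a single forward pass), while a generative verifier must decode a long critique for each. A minimal best-of-N sketch with a toy stand-in scorer (a real verifier is a trained model, not this heuristic):

```python
# Sketch of best-of-N selection with a discriminative verifier: score each
# candidate once and pick the argmax. The cost asymmetry vs. generative
# verification is that scoring is one forward pass per candidate, with no
# autoregressive decoding of a critique.

def best_of_n(candidates, score_fn):
    """Return the candidate with the highest verifier score."""
    scored = [(score_fn(c), c) for c in candidates]
    return max(scored)[1]

# Toy stand-in scorer (illustrative only): prefer solutions ending in "42".
score = lambda sol: 1.0 if sol.strip().endswith("42") else 0.0

solutions = [
    "... so the answer is 41",
    "... therefore the answer is 42",
]
print(best_of_n(solutions, score))  # "... therefore the answer is 42"
```

Scaling the number of candidates N then only multiplies the cheap scoring cost, which is why the gap widens as test-time compute grows.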
Kyle Montgomery@kylepmont·
🚨 New preprint: Budget-aware Test-time Scaling via Discriminative Verification
👉 arxiv.org/pdf/2510.14913

We show that discriminative verification is the best option for test-time scaling under 25.5 minutes, outperforming state-of-the-art generative verification in both accuracy and efficiency, for example achieving up to +15.3% and +2.8% higher accuracy on AIME2025 at latency budgets of 13.8 min and 15.7 min, respectively.

🧠 Blog: cedar-baryonyx-84b.notion.site/Budget-aware-T…
💻 Code: github.com/wang-research-…
🤗 HuggingFace (data/models): huggingface.co/collections/Wa…
Kyle Montgomery@kylepmont·
Thrilled to have been a part of this release — looking forward to what’s coming next with rLLM!
rLLM@rllm_project

🚀 Introducing rLLM v0.2 - train arbitrary agentic programs with RL, with minimal code changes.

Most RL training systems adopt the agent-environment abstraction. But what about complex workflows? Think solver-critique pairs collaborating, or planner agents orchestrating multiple workers. These are hard to express with traditional RL abstractions.

v0.2 introduces AgentWorkflowTrainer, built on a simple insight: any agentic flow is just a Python program orchestrating LLM calls, so we made ANY Python program trainable. Researchers and developers can now quickly prototype new ideas or transform their production agentic systems into trainable flows with minimal changes.

rLLM now uses the official @verl_project ==0.5.0 as our backend (no more custom verl forks!). Just define your agentic workflow or multi-agent system as a Python program and hit train, and rLLM will handle the rest.

Since release, rLLM has been adopted to power RL training of world-class agents like @Ali_TongyiLab's DeepResearcher. With this new release, we're working towards building the RL application stack for next-gen agentic AI - where entire systems learn and evolve together, not just individual components in isolation.

📖 Blog post: rllm-project.com/post.html?post…
👨‍💻 GitHub: github.com/rllm-org/rllm

What agentic program will you train first? 👀
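The solver-critique workflow mentioned above is worth making concrete: "any agentic flow is just a Python program orchestrating LLM calls." A minimal sketch with a deterministic stub in place of real model calls (the function names and the stub are mine, not rLLM's API; the point is only that the flow is plain Python):

```python
# Sketch of a solver-critique workflow written as an ordinary Python program.
# llm is any callable from prompt -> text; a trainer in the style described
# above would record these calls as trajectories to optimize with RL.

def solver_critique_flow(task, llm, max_rounds=2):
    """Draft a solution, then iterate: critique it, revise until approved."""
    draft = llm(f"Solve: {task}")
    for _ in range(max_rounds):
        critique = llm(f"Critique this solution to '{task}': {draft}")
        if "LGTM" in critique:  # critic approves, stop revising
            break
        draft = llm(f"Revise using the critique: {critique}\nSolution: {draft}")
    return draft

# Deterministic stub LLM so the flow runs end to end without a model.
def stub_llm(prompt):
    if prompt.startswith("Critique"):
        return "LGTM" if "revised" in prompt else "Too vague, add detail."
    if prompt.startswith("Revise"):
        return "revised answer"
    return "first draft"

print(solver_critique_flow("2+2", stub_llm))  # "revised answer"
```

Because the control flow is ordinary Python (loops, branches, multiple model calls), it is exactly the kind of program that resists the classic single-agent-single-environment RL abstraction.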

Kyle Montgomery@kylepmont·
Finally, we verify that our fits extrapolate well to out-of-distribution amounts of compute and context, showcasing the usefulness of our method for long-context scaling experiments.
Kyle Montgomery@kylepmont·
Excited to share our latest work at KnowFM at #ACL2025. Predicting Task Performance with Context-aware Scaling Laws models performance on downstream tasks as a function of training compute and context length – ✅ simple, ✅ interpretable, and ✅ effective.
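A generic sketch of what "performance as a function of training compute and context length" can look like. The functional form and coefficients below are illustrative placeholders, not the paper's actual fit; they only show the qualitative shape such laws share: saturating gains with diminishing returns in each variable, which is what makes extrapolation to unseen compute/context budgets possible.

```python
# Illustrative context-aware scaling-law shape (coefficients are made up):
# downstream performance approaches a ceiling a as both training compute C
# and context length n grow, with power-law diminishing returns in each.

def predicted_performance(C, n, a=0.9, b=2.0, alpha=0.3, c=1.5, beta=0.4):
    """Saturating two-variable power-law fit: performance -> a as C, n -> inf."""
    return a - b * C ** -alpha - c * n ** -beta

# In-distribution point vs. a larger (extrapolated) compute/context budget:
p_small = predicted_performance(C=1e20, n=2048)
p_big = predicted_performance(C=1e22, n=32768)
print(round(p_small, 3), round(p_big, 3))
assert p_big > p_small  # more compute and longer context => higher prediction
```

Once the few coefficients are fit on small runs, evaluating the formula at out-of-distribution (C, n) is how such a fit supports long-context scaling experiments cheaply.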
Kyle Montgomery@kylepmont·
Excited to share our work at #ICLR2025! JudgeBench ⚖️ tests the reliability of LLM-based judges with a focus on objective correctness. JudgeBench converts tough 🧠 datasets in knowledge, reasoning, math & code into labeled response pairs, forcing objective grading over vibes. Even strong models like GPT-4o barely beat random guessing 🎲. Swing by📍Poster #227 (Session 4) to see how your favorite model fares. Huge thanks 🙏 to @sijun_tan @SiyuanZhuang3 William Tang @ChenguangWang @ralucaadapopa & Ion Stoica!
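The evaluation described above reduces to a simple loop: each example is a response pair with an objective label for which response is correct, and a judge's accuracy is compared against the 50% random-guessing baseline. A sketch with toy pairs and a deliberately weak "vibes" judge (the data and judge here are stand-ins, not JudgeBench's):

```python
# Sketch of a JudgeBench-style accuracy computation: the judge picks "A" or
# "B" for each labeled response pair, and we measure how far it gets above
# the 50% random baseline.

def judge_accuracy(pairs, judge_fn):
    """Fraction of pairs where the judge picks the objectively correct response."""
    hits = sum(judge_fn(a, b) == label for a, b, label in pairs)
    return hits / len(pairs)

# Toy labeled pairs (illustrative): label marks the correct response.
pairs = [
    ("2+2=4", "2+2=5", "A"),
    ("the sky is green", "the sky is blue", "B"),
    ("7*8=54", "7*8=56", "B"),
    ("x=3 solves x+1=4", "x=2 solves x+1=4", "A"),
]

# A position-biased judge that always prefers the first response is
# indistinguishable from chance on a label-balanced set.
always_a = lambda a, b: "A"
print(judge_accuracy(pairs, always_a))  # 0.5
```

Balancing labels across positions, as in this toy set, is what keeps a position-biased judge from scoring above chance for free.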