UC Berkeley Sky

81 posts

@BerkeleySky

Sky Computing. Looking for Berkeley SkyDeck? They’re on the other side of campus from us: @SkyDeck_Cal.

Berkeley, CA · Joined November 2021
24 Following · 1.4K Followers
UC Berkeley Sky retweeted
Melissa Pan @melissapan
Excited to share that MAP has been selected for an ✨ICML Oral✨! We look forward to sharing the insights in the paper with the community. Many thanks to everyone who participated in our study ❤️ MAP wouldn’t be possible without your contributions to open science.
Melissa Pan @melissapan

Excited to share: MAP has been accepted as an 🌟 ICML Spotlight 🌟 We hope MAP can provide data-driven insights that help the community work on various under-explored research directions around agent systems! Huge thanks & congrats to my amazing co-authors. See you all in Seoul! 🫡

7 replies · 15 retweets · 164 likes · 26K views
UC Berkeley Sky retweeted
Qiuyang Mang @MangQiuyang
Open-ended coding training data may no longer be the bottleneck: AI can scale open-ended tasks, and even outperform human-expert curation.
FrontierCS team is releasing FrontierSmith: a system for synthesizing open-ended coding problems at scale. Starting from closed-ended coding tasks, FrontierSmith mutates, filters, and builds runnable optimization environments for long-horizon coding agents.
In our experiments, FrontierSmith data trains stronger models than human-curated open-ended data on FrontierCS and ALE-bench.
Blog: frontier-cs.org/blog/frontiers…
Paper: arxiv.org/abs/2605.14445
Code: github.com/FrontierCS/Fro…
Model: huggingface.co/runyuanhe/qwen…
14 replies · 71 retweets · 331 likes · 93.1K views
UC Berkeley Sky retweeted
Lakshya A Agrawal @LakshyAAAgrawal
Learning from rich textual feedback (errors, traces, partial reasoning) beats scalar reward alone for LLM optimization. GEPA demonstrated this for context-space optimization (prompts and agent harnesses), delivering frontier results at a fraction of the cost of RL. But context-only optimization is bounded by the base model's capability ceiling; weight updates can reach further.
Very excited about this new line of work on Fast-Slow Training (FST), which interleaves context and model weight optimization! The idea is a clean division of labor between two interleaved loops:
🔹 Fast loop (context): GEPA reads rich rollout feedback, updating the context layer. The context becomes a fast-updating scratchpad of what the model needs to know about this task, right now.
🔹 Slow loop (model parameters): RL updates the model's parameters conditioned on the evolving context. Because the prompt already carries task-specific nuances, the model parameters are freed from absorbing them and focus on what actually generalizes across tasks and pushes the frontier.
Results:
⦁ 3× more sample-efficient than RL on math, code, and physics reasoning
⦁ ~70% lower KL divergence from base at matched accuracy
⦁ Plasticity preserved: FST checkpoints respond better to additional RL on new tasks than RL-only ones
⦁ Continual learning across changing tasks (HoVer → CodeIO → Physics) where RL stalls the moment the task switches
FST is a direction towards:
⦁ Addressing RL's pain points: entropy collapse, sparse rewards, long-horizon exploration
⦁ Providing a clean channel for rich feedback into weight updates
⦁ Demonstrating model-harness co-evolution
⦁ Discovery: using fast context updates for broad exploration, while leveraging a continually improving model
Check out the full thread below:
Kusha Sareen @KushaSareen

Can LLMs adapt continually without losing base skills? Fast-Slow Training (FST) pairs "slow" weights with "fast" context.
FST vs. RL:
• 3x more sample-efficient
• Higher performance ceiling
• Less KL drift (better plasticity)
• Continual learning: succeeds where RL stalls

13 replies · 43 retweets · 186 likes · 33.1K views
UC Berkeley Sky retweeted
Negar Arabzadeh @NegarEmpr
1/ Thrilled to introduce T³: a corpus for RAG over reasoning tasks, built from thinking traces. We show that, surprisingly, RAG can improve reasoning, given the right corpus. RAG with Transformed Thinking Traces (T³) gains up to 43.9% on AIME 2025-2026. 🔗 arxiv.org/abs/2605.03344 🧵
11 replies · 31 retweets · 212 likes · 472.2K views
UC Berkeley Sky retweeted
Parth Asawa @pgasawa
Today, we’re releasing Continual Learning Bench 1.0: the first realistic benchmark for measuring how AI systems can improve in online settings.
Benchmarks today assume models are stateless. Each example is independent, and once a system finishes a task, it moves on as if nothing happened.
But deployed AI systems should learn from experience. We tested 10+ frontier systems against novel, expert-validated tasks and found there’s still plenty of headroom for learning. (1/n)
42 replies · 153 retweets · 1.1K likes · 825.2K views
UC Berkeley Sky retweeted
Yiwei Hou @yiwei_hou
The agent harness is as important as the model for cybersecurity. $300 in compute, 9 OSS-Fuzz projects, 14 security issues, and 5 CVEs. The key lesson: you don’t need a secret model to find real security issues. You need an effective, affordable, reliable harness. 5 takeaways 🧵
1 reply · 8 retweets · 16 likes · 1.4K views
UC Berkeley Sky retweeted
Qiuyang Mang @MangQiuyang
Excited to announce that FrontierCS has been accepted to ICML 2026! 🚀 We are scaling our open-ended task set to 250 tasks (100 new tasks in 2026 Q1🔥), featuring long-horizon agent settings in Harbor and integration into real-world human contests. More exciting updates to come! Huge thanks to all our collaborators. #ICML2026 #AI #MachineLearning
Huanzhi Mao @HuanzhiMao

Pass/fail benchmarks are saturated. It’s time for FrontierCS. 🚀 150+ unsolved, verifiable problems ranging from competitive programming to real-world research. Designed by PhDs & ICPC experts to evolve model intelligence. 🎓🧠 🧵👇Check it out! Paper: arxiv.org/abs/2512.15699

1 reply · 11 retweets · 55 likes · 6.5K views
UC Berkeley Sky retweeted
Melissa Pan @melissapan
Excited to share: MAP has been accepted as an 🌟 ICML Spotlight 🌟 We hope MAP can provide data-driven insights that help the community work on various under-explored research directions around agent systems! Huge thanks & congrats to my amazing co-authors. See you all in Seoul! 🫡
10 replies · 30 retweets · 231 likes · 55.3K views
UC Berkeley Sky retweeted
KD @Reveur_7
What if one person could run a unicorn company? Today we're open-sourcing OMAR — a TUI that lets a single engineer orchestrate hundreds of AI coding agents in deep, recursive hierarchies. Built at Berkeley. Powered by tmux. github.com/lsk567/omar 🧵
1 reply · 4 retweets · 15 likes · 2.6K views
UC Berkeley Sky retweeted
Abby O'Neill @abby_k_oneill
Would you trust an AI agent to negotiate on your country's behalf at the G20? Real coordination is long-horizon, asymmetric, and non-binding; current multi-agent evaluations miss this. We build Cooperate to Compete (C2C): a testbed for LM agents coordinating with rivals. 🤝🔪🎭
5 replies · 25 retweets · 93 likes · 26.5K views
UC Berkeley Sky retweeted
AI-Driven Research for Systems @ai4research_ucb
🎯 One Year of AI-Driven Research at Berkeley [ADRS Blog #20]
For the past year at Berkeley, we have been working on automating discovery with AI. In our blog post this week, we provide an overview of these efforts: the key problems we’re tackling, the frameworks and solutions we’ve built so far, and how these efforts fit into a broader vision for AI-driven scientific discovery.
✍️ Read the blog: ucbskyadrs.github.io/blog/berkeley-…
📖 ADRS Blog Series: ucbskyadrs.github.io
1 reply · 11 retweets · 68 likes · 23.2K views
UC Berkeley Sky retweeted
Shu Lynn Liu @shulynnliu
Researchers spend hours and hours hand-crafting the strategies behind LLM-driven optimization systems like AlphaEvolve: deciding which ideas to reuse, when to explore vs exploit, and what mutations to try.
🤖 But what if AI could evolve its own evolution process?
We introduce EvoX, a meta-evolution pipeline that lets AI evolve the strategy guiding the optimization. It achieves high-quality solutions for <$5, while existing open systems and even Claude Code often cost 3-5× more on some tasks.
Across ~200 optimization problems, EvoX delivers the strongest overall results: often outperforming AlphaEvolve, OpenEvolve, GEPA, and ShinkaEvolve on math and systems tasks, exceeding human SOTA, and improving median performance by up to 61% on 172 competitive programming problems. 👇
19 replies · 85 retweets · 498 likes · 99.3K views
UC Berkeley Sky retweeted
Shu Lynn Liu @shulynnliu
AlphaEvolve is closed-source. We release 🌟SkyDiscover🌟, a flexible, modular open-source framework with two new adaptive algorithms that match or exceed AlphaEvolve on many benchmarks and outperform OpenEvolve, GEPA, and ShinkaEvolve across 200+ optimization tasks.
Our new algorithms dynamically adapt their search strategy, and can even let the AI optimize its own optimization process on the fly!
Results:
📊 +34% median score improvement on 172 Frontier-CS problems
🧮 Matches/exceeds AlphaEvolve on many math benchmarks
⚙️ Discovers system optimizations beyond human-designed SOTA
🧵👇
12 replies · 105 retweets · 582 likes · 141.7K views
UC Berkeley Sky retweeted
Mayank Mishra @MayankMish98
We identified an issue with the Mamba-2 🐍 initialization in the HuggingFace and FlashLinearAttention repositories (dt_bias being incorrectly initialized). The bug comes down to 2 main issues:
1. The init is incorrect (torch.ones) if Mamba-2 layers are used in isolation, without the Mamba2ForCausalLM model class (this has already been fixed: github.com/fla-org/flash-…).
2. Initialization is skipped due to meta device init for DTensors with FSDP-2 (github.com/fla-org/flash-… will fix this issue upon merging).
The difference is substantial. Mamba-2 seems to be quite sensitive to the initialization. Check out our experiments at the 7B MoE scale: wandb.ai/mayank31398/ma…
Special thanks to @kevinyli_, @bharatrunwal2, @HanGuo97, @tri_dao and @_albertgu 🙏 Also thanks to @SonglinYang4 for quickly helping to merge the PR.
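To make the dt_bias point concrete, here is a minimal sketch in plain Python (no PyTorch). It assumes the intended scheme is the inverse-softplus initialization from the reference Mamba code, where a step size dt is sampled log-uniformly and dt_bias is chosen so that softplus(dt_bias) equals dt. The function names, the default range (dt_min=1e-3, dt_max=1e-1, floor 1e-4), and the seed are illustrative assumptions, not values taken from the thread.

```python
import math
import random


def softplus(x: float) -> float:
    # Numerically stable log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def inverse_softplus(y: float) -> float:
    # Solves softplus(x) = y for y > 0: x = y + log(1 - exp(-y)).
    return y + math.log(-math.expm1(-y))


def sample_dt_bias(n_heads, dt_min=1e-3, dt_max=1e-1, dt_init_floor=1e-4, seed=0):
    # Hypothetical helper: sample dt log-uniformly in [dt_min, dt_max],
    # then store the bias whose softplus reproduces that dt.
    rng = random.Random(seed)
    log_min, log_max = math.log(dt_min), math.log(dt_max)
    biases = []
    for _ in range(n_heads):
        dt = math.exp(rng.random() * (log_max - log_min) + log_min)
        dt = max(dt, dt_init_floor)
        biases.append(inverse_softplus(dt))
    return biases


biases = sample_dt_bias(8)
steps_sampled = [softplus(b) for b in biases]  # effective step sizes
steps_ones = [softplus(1.0)] * 8               # what an all-ones bias yields

print("sampled init:", [round(s, 4) for s in steps_sampled])
print("all-ones init:", round(steps_ones[0], 4))
```

Under these assumptions, every sampled step size stays inside the [dt_min, dt_max] range, while an all-ones bias gives softplus(1) ≈ 1.31, more than ten times dt_max.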
17 replies · 73 retweets · 745 likes · 371.4K views
UC Berkeley Sky retweeted
Laude Institute @LaudeInstitute
Introducing Slingshots // TWO: Research that ships. 14 projects, six institutions – let’s meet the batch 🧵
5 replies · 15 retweets · 72 likes · 23.7K views
UC Berkeley Sky retweeted
NovaSky @NovaSkyAI
We are excited to announce that SkyRL now implements the Tinker API. Run Tinker training scripts on your own hardware with zero code changes. Try it out today: novasky-ai.notion.site/skyrl-tinker
Tyler Griggs @tyler_griggs_

SkyRL now implements the Tinker API. Now, training scripts written for Tinker can run on your own GPUs with zero code changes using SkyRL's FSDP2, Megatron, and vLLM backends. Blog: novasky-ai.notion.site/skyrl-tinker 🧵

0 replies · 4 retweets · 27 likes · 2.1K views