R. Alessio @ ETH | RL, Bandits, Exploration

153 posts

@rssalessio

Postdoc at @BU_CDS with @aldopacchiano (https://t.co/Ekl0jGFIbd Lab). Interested in RL, Bandit problems and Adaptive Control.

Boston, MA · Joined November 2010
335 Following · 156 Followers
R. Alessio @ ETH | RL, Bandits, Exploration
@SAS So, if your crew spills wine on its customers onboard, is this the best you can do? Is it my fault if I had to throw away the clothes and the crew did not make any report? Literally the worst customer service #SAS
Shashwat Goel @ ICLR'26 @ShashwatGoel7
Great paper showing self-distillation internalizes environment feedback, but also breaks the ability to navigate uncertainty, since the "supervisor" already knows the outcome and doesn't share the student's uncertainty. To teach uncertainty navigation, we proposed ∆Belief-RL: we reward actions based on whether they lead to "progress", estimated from the update in the model's own belief of achieving success. We show this improves both interaction efficiency and scaling in guessing environments, and parallel work like iGPO and TIPS shows it works for search agents.

arxiv.org/abs/2602.12342 - Intrinsic Credit Assignment for Long Horizon Interaction
arxiv.org/abs/2510.14967 - Information Gain-based Policy Optimization: A Simple and Effective Approach for Multi-Turn Search Agents
arxiv.org/abs/2603.22293 - TIPS: Turn-Level Information-Potential Reward Shaping for Search-Augmented LLMs

The idea has rich roots in the 1999 paper on potential-based reward shaping people.eecs.berkeley.edu/~pabbeel/cs287…, and a 2018 paper showing the potential can be estimated using the agent's own beliefs cdn.aaai.org/ojs/11741/1174….

Lots of interesting future work here: how to measure beliefs over long-form answers, where logprobs might reward style over substance; beliefs over arbitrary rewards and goals instead of answers; and incorporating beliefs of other agents in the environment, similar to ReBeL for multi-agent imperfect-information games github.com/facebookresear….
Rosinality @rosinality

Analysis of self-distillation: it works by increasing confidence, and it does not generalize well. We can't assume the distribution conditioned on the solution behaves well; it may be similar to unsupervised model-based verification.
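The mechanism the thread describes is classical potential-based reward shaping with the agent's own success belief as the potential. A minimal sketch, assuming a scalar belief estimate per turn (the names `shaped_reward` and `belief_*` are illustrative, not from any of the cited papers):

```python
# Potential-based reward shaping (Ng et al., 1999): r' = r + gamma*Phi(s') - Phi(s).
# Here the potential Phi is the agent's own estimated probability of eventual
# success, so turns that make "progress" (raise the belief) earn a bonus,
# while the optimal policy is provably unchanged by the shaping term.

GAMMA = 0.99  # discount factor


def shaped_reward(env_reward: float,
                  belief_before: float,
                  belief_after: float,
                  gamma: float = GAMMA) -> float:
    """Shape the environment reward with the change in success belief."""
    return env_reward + gamma * belief_after - belief_before


# A turn that raises the success belief from 0.2 to 0.6 earns a positive
# bonus even before any environment reward arrives.
bonus = shaped_reward(env_reward=0.0, belief_before=0.2, belief_after=0.6)
```

In a multi-turn setting the belief would come from the model itself (e.g. its probability of producing the correct final answer), which is where the thread's caveats about logprobs rewarding style over substance come in.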

R. Alessio @ ETH | RL, Bandits, Exploration
A neat result: the Complete Class Theorem. ➡️ Pick any non-Bayes decision rule; there is always a Bayes rule that is at least as good. When we talk about "good" procedures, we never really need to leave the Bayes world, at least for compact parameter spaces.
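In standard decision-theoretic notation (a sketch using the usual symbols, which are not spelled out in the tweet), the claim is that for any decision rule $\delta$ there exist a prior $\pi$ and a Bayes rule $\delta_\pi$ that weakly dominate it in frequentist risk:

```latex
R(\theta, \delta_\pi) \;\le\; R(\theta, \delta)
\qquad \text{for all } \theta \in \Theta,
```

where $R(\theta, \delta) = \mathbb{E}_\theta\, L(\theta, \delta(X))$ is the risk under loss $L$. Under compactness of $\Theta$ (plus suitable regularity conditions on the loss), the Bayes rules form a complete class, which is the precise sense of "never needing to leave the Bayes world".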
R. Alessio @ ETH | RL, Bandits, Exploration
Adversarial Diffusion for Robust #RL! Tomorrow @ #NeurIPS2025, afternoon poster session #313
R. Alessio @ ETH | RL, Bandits, Exploration @rssalessio

Excited to be in San Diego next week for #NeurIPS2025 🎉! Will present Adversarial Diffusion for Robust RL together with @DanieleFoffano. Poster session on Fri 5 Dec 7:30 p.m. EST, Exhibit Hall C,D,E. AD-RRL uses diffusion models to train Robust RL policies. #RL #Diffusion

R. Alessio @ ETH | RL, Bandits, Exploration retweeted
R. Alessio @ ETH | RL, Bandits, Exploration
Excited to be in San Diego next week for #NeurIPS2025 🎉! Will present Adversarial Diffusion for Robust RL together with @DanieleFoffano. Poster session on Fri 5 Dec 7:30 p.m. EST, Exhibit Hall C,D,E. AD-RRL uses diffusion models to train Robust RL policies. #RL #Diffusion
R. Alessio @ ETH | RL, Bandits, Exploration
Glhf to all ACs
Egor Shulgin @egor_shulg

@iclr_conf reverted all reviews to pre-discussion state after the OpenReview bug. Result: one paper I’m reviewing has the authors' rebuttal responding point-by-point to concerns that were edited out and no longer exist in the system. New ACs: good luck making sense of this.
