
Max Lamparth

@MLamparth
Research Fellow @SISLaboratory @HooverInst at @Stanford | Focusing on interpretable, safe, and ethical AI/LLM decision-making. Ph.D. from TUM.

🚨 New SISL preprint: State-of-the-art language reward models are still badly biased. Past fixes overcorrect; some biases can be removed with simple latent interventions, while others point to the need for larger efforts.


Another SISL #ICLR 2026 paper: Current chain-of-thought models often produce persuasive-sounding rationales that don't actually reflect how the model reached its answer, so this work aims to make the reasoning itself carry the information the model needs to be right.

New #ICLR2026 paper with SISL contributions: A new dynamic benchmarking platform for evaluating the trustworthiness of generative AI models, called TrustGen!