Iván Arcuschin
101 posts

Iván Arcuschin
@IvanArcus
Independent Researcher | AI Safety & Software Engineering

Is "a response formatted like this" sometimes better than "a response formatted like this"? To a reward model, yes! RMs are instrumental in shaping model behaviors and alignment. Our paper makes progress uncovering their unexpected preferences. 🧵(1/9)

By popular demand, we looked at Grok's biases too. We found similar biases as GPT-4.1, Claude, and Gemini: gender, race, religion. But with one difference: Grok openly speculates on applicants' demographics. The other models just use this information quietly.


You change one word on a loan application: the religion. The LLM rejects it. Change it back? Approved. The model never mentions religion. It just frames the same debt ratio differently to justify opposite decisions. We built a pipeline to find these hidden biases 🧵1/13







