Iván Arcuschin @IvanArcus
101 posts

Independent Researcher | AI Safety & Software Engineering
Argentina · Joined March 2011
222 Following · 1.4K Followers

Pinned Tweet
Iván Arcuschin @IvanArcus
You change one word on a loan application: the religion. The LLM rejects it. Change it back? Approved. The model never mentions religion. It just frames the same debt ratio differently to justify opposite decisions. We built a pipeline to find these hidden biases 🧵1/13
[image]
242 replies · 1.9K retweets · 12.8K likes · 869.8K views

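Here is a minimal sketch of the counterfactual check the pinned tweet describes, assuming a hypothetical `query_model` helper that wraps the LLM call and returns an "approved"/"rejected" string; the application template is also made up for illustration, not taken from the paper:

```python
# Minimal sketch of the one-word counterfactual check (hypothetical template
# and helper, not the paper's actual code). The same application is sent
# twice, differing only in the stated religion, and the decisions compared.

APPLICATION = """Loan application
Income: $58,000 / year
Debt-to-income ratio: 31%
Religion: {religion}
Decision (approved or rejected):"""

def decision_flips(query_model, attr_a: str, attr_b: str) -> bool:
    """Return True if flipping the single attribute flips the decision.

    `query_model` is a hypothetical callable: prompt -> "approved"/"rejected".
    """
    base = query_model(APPLICATION.format(religion=attr_a))
    variant = query_model(APPLICATION.format(religion=attr_b))
    return base != variant
```
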
Iván Arcuschin @IvanArcus
Check out our latest paper on automatically finding reward model biases! There are some that are pretty wild, like models preferring responses with triple spaces 🤷‍♂️
Quoting Atticus Wang @atticuswzf:
Is "a response formatted like this" sometimes better than "a response formatted like this"? To a reward model, yes! RMs are instrumental in shaping model behaviors and alignment. Our paper makes progress uncovering their unexpected preferences. 🧵(1/9)
0 replies · 1 retweet · 10 likes · 279 views

Iván Arcuschin @IvanArcus
So, is Grok more or less biased than GPT-4.1 or Sonnet 4? It has similar biases (e.g., preferring female and minority applicants) with similar magnitudes, but there's one difference: Grok openly discloses inferred demographics, while the other models stay silent.
1 reply · 1 retweet · 5 likes · 281 views

Iván Arcuschin @IvanArcus
In our loan approval dataset, we find that Grok, like the other models, shows an unverbalized bias toward preferring female applicants.
[image]
1 reply · 0 retweets · 1 like · 105 views

Iván Arcuschin @IvanArcus
By popular demand, we looked at Grok's biases too. We found biases similar to those of GPT-4.1, Claude, and Gemini: gender, race, religion. But with one difference: Grok openly speculates about applicants' demographics; the other models just use this information quietly.
[image]
Quoting the pinned tweet above (🧵1/13).
4 replies · 2 retweets · 22 likes · 2K views

Iván Arcuschin @IvanArcus
Validation: we injected biases into a model in two modes, secret (hidden from CoT) and overt (stated in reasoning). 92.5% accuracy:
- 85% of secret biases detected
- 100% of overt biases correctly filtered due to verbalization
[image]
2 replies · 10 retweets · 508 likes · 30.1K views

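The headline 92.5% is consistent with an even split between the two injection modes; a quick sanity check (the even split is an assumption, the thread does not state it):

```python
# Hypothetical sanity check: with equal numbers of secret and overt
# injected biases, overall accuracy is the mean of the two rates.
secret_detected = 0.85  # secret biases correctly detected
overt_filtered = 1.00   # overt biases correctly filtered (verbalized)
print((secret_detected + overt_filtered) / 2)  # 0.925 -> the reported 92.5%
```
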
Iván Arcuschin @IvanArcus
Two biases appear consistently across ALL three tasks:
1. Gender bias (favoring female candidates/applicants)
2. Race/ethnicity bias (favoring minority-associated applicants)
Cross-task consistency suggests genuine model tendencies, not task-specific artifacts.
6 replies · 52 retweets · 901 likes · 33.6K views

Iván Arcuschin @IvanArcus
Important: we use "bias" descriptively, meaning a systematic decision shift. Religious affiliation in loan decisions? Clearly inappropriate. English proficiency? More ambiguous. Whether a detected factor is normatively problematic depends on context and requires auditing.
3 replies · 15 retweets · 697 likes · 32.4K views

Iván Arcuschin @IvanArcus
We also find biases no prior manual analysis had covered:
- Spanish language ability (QwQ-32B, hiring)
- English proficiency (Gemma, loans)
- Writing formality (Gemma, loans)
- Religious affiliation (Claude Sonnet 4, loans)
[image]
3 replies · 32 retweets · 737 likes · 37.2K views

Iván Arcuschin @IvanArcus
The pipeline automatically rediscovers biases that prior work found manually, validating our approach:
- Gender bias favoring female candidates: 5/6 models
- Race/ethnicity bias favoring minority-associated names: 4/6 models
[image]
3 replies · 82 retweets · 1.1K likes · 56.2K views

Iván Arcuschin @IvanArcus
We tested 6 frontier models across 3 decision tasks.
Models: Gemma 3 12B/27B, Gemini 2.5 Flash, GPT-4.1, QwQ-32B, Claude Sonnet 4
Tasks:
- Hiring (1,336 resumes)
- Loan approval (2,500 apps)
- University admissions (1,500 apps)
10 replies · 21 retweets · 812 likes · 46.1K views

Iván Arcuschin @IvanArcus
Our pipeline is fully automated and black-box:
1. Hypothesize candidate biases via LLM
2. Generate controlled input variations
3. Test statistically (McNemar + Bonferroni)
4. Filter concepts the model mentions in its reasoning
No predefined categories. No manual datasets.
[image]
1 reply · 29 retweets · 1K likes · 58.1K views

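For step 3, here is a minimal sketch of how a McNemar + Bonferroni test could look on paired approve/reject decisions; the data, helper names, and exact procedure are assumptions for illustration, not the paper's code:

```python
# Sketch of step 3: McNemar's test on paired decisions (base input vs. the
# same input with one attribute flipped), Bonferroni-corrected across all
# candidate biases tested. All data below is hypothetical.
from statsmodels.stats.contingency_tables import mcnemar
from statsmodels.stats.multitest import multipletests

def mcnemar_pvalue(base_decisions, variant_decisions):
    """p-value for a systematic decision shift between matched pairs.

    Inputs are equal-length lists of booleans (True = approved) for the
    same applications, with and without the candidate attribute flipped.
    """
    table = [[0, 0], [0, 0]]  # 2x2 table of paired outcomes
    for a, b in zip(base_decisions, variant_decisions):
        table[int(a)][int(b)] += 1
    return mcnemar(table, exact=True).pvalue

# Hypothetical raw p-values for several candidate biases
pvals = {"religion": 0.0004, "gender": 0.0011, "writing_formality": 0.21}
reject, corrected, _, _ = multipletests(
    list(pvals.values()), alpha=0.05, method="bonferroni"
)
for name, sig, p in zip(pvals, reject, corrected):
    status = "bias detected" if sig else "not significant"
    print(f"{name}: corrected p = {p:.4f} -> {status}")
```

McNemar's test suits the paired counterfactual design: only the discordant pairs, where flipping the attribute flips the decision, contribute to the statistic.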