Iván Arcuschin @IvanArcus
101 posts

Independent Researcher | AI Safety & Software Engineering
Argentina · Joined March 2011
222 Following · 1.4K Followers

Pinned Tweet
Iván Arcuschin @IvanArcus
You change one word on a loan application: the religion. The LLM rejects it. Change it back? Approved. The model never mentions religion. It just frames the same debt ratio differently to justify opposite decisions. We built a pipeline to find these hidden biases 🧵1/13
[image]
242 replies · 1.9K retweets · 12.8K likes · 869.8K views

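Here is a minimal sketch of the counterfactual check the pinned tweet describes, assuming a hypothetical `query_model` helper that wraps the LLM call and returns an "approved"/"rejected" string; the application template is also made up for illustration, not taken from the paper:

```python
# Minimal sketch of the one-word counterfactual check (hypothetical template
# and helper, not the paper's actual code). The same application is sent
# twice, differing only in the stated religion, and the decisions compared.

APPLICATION = """Loan application
Income: $58,000 / year
Debt-to-income ratio: 31%
Religion: {religion}
Decision (approved or rejected):"""

def decision_flips(query_model, attr_a: str, attr_b: str) -> bool:
    """Return True if flipping the single attribute flips the decision.

    `query_model` is a hypothetical callable: prompt -> "approved"/"rejected".
    """
    base = query_model(APPLICATION.format(religion=attr_a))
    variant = query_model(APPLICATION.format(religion=attr_b))
    return base != variant
```
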
Iván Arcuschin @IvanArcus
Check out our latest paper on automatically finding reward model biases! There are some that are pretty wild, like models preferring responses with triple spaces 🤷‍♂️
Quoting Atticus Wang @atticuswzf:
Is "a response formatted like this" sometimes better than "a response formatted like this"? To a reward model, yes! RMs are instrumental in shaping model behaviors and alignment. Our paper makes progress uncovering their unexpected preferences. 🧵(1/9)
0 replies · 1 retweet · 10 likes · 279 views

Iván Arcuschin @IvanArcus
So, is Grok more or less biased than GPT-4.1 or Sonnet 4? It has similar biases (e.g., preferring female and minority applicants) with similar magnitudes, but there's one difference: Grok openly discloses inferred demographics, while the other models stay silent.
1 reply · 1 retweet · 5 likes · 281 views

Iván Arcuschin @IvanArcus
In our loan approval dataset, we find that Grok, like the other models, shows an unverbalized bias toward preferring female applicants.
[image]
1 reply · 0 retweets · 1 like · 105 views

Iván Arcuschin @IvanArcus
By popular demand, we looked at Grok's biases too. We found biases similar to those of GPT-4.1, Claude, and Gemini: gender, race, religion. But with one difference: Grok openly speculates about applicants' demographics; the other models just use this information quietly.
[image]
Quoting the pinned tweet above (🧵1/13).
4 replies · 2 retweets · 22 likes · 2K views

Iván Arcuschin @IvanArcus
Validation: we injected biases into a model in two modes, secret (hidden from CoT) and overt (stated in reasoning). 92.5% accuracy:
- 85% of secret biases detected
- 100% of overt biases correctly filtered due to verbalization
[image]
2 replies · 10 retweets · 508 likes · 30.1K views

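The headline 92.5% is consistent with an even split between the two injection modes; a quick sanity check (the even split is an assumption, the thread does not state it):

```python
# Hypothetical sanity check: with equal numbers of secret and overt
# injected biases, overall accuracy is the mean of the two rates.
secret_detected = 0.85  # secret biases correctly detected
overt_filtered = 1.00   # overt biases correctly filtered (verbalized)
print((secret_detected + overt_filtered) / 2)  # 0.925 -> the reported 92.5%
```
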
Iván Arcuschin @IvanArcus
Two biases appear consistently across ALL three tasks:
1. Gender bias (favoring female candidates/applicants)
2. Race/ethnicity bias (favoring minority-associated applicants)
Cross-task consistency suggests genuine model tendencies, not task-specific artifacts.
6 replies · 52 retweets · 901 likes · 33.6K views

Iván Arcuschin @IvanArcus
Important: we use "bias" descriptively, meaning a systematic decision shift. Religious affiliation in loan decisions? Clearly inappropriate. English proficiency? More ambiguous. Whether a detected factor is normatively problematic depends on context and requires auditing.
3 replies · 15 retweets · 697 likes · 32.4K views

Iván Arcuschin @IvanArcus
We also find biases no prior manual analysis had covered:
- Spanish language ability (QwQ-32B, hiring)
- English proficiency (Gemma, loans)
- Writing formality (Gemma, loans)
- Religious affiliation (Claude Sonnet 4, loans)
[image]
3 replies · 32 retweets · 737 likes · 37.2K views

Iván Arcuschin @IvanArcus
The pipeline automatically rediscovers biases that prior work found manually, validating our approach:
- Gender bias favoring female candidates: 5/6 models
- Race/ethnicity bias favoring minority-associated names: 4/6 models
[image]
3 replies · 82 retweets · 1.1K likes · 56.2K views

Iván Arcuschin @IvanArcus
We tested 6 frontier models across 3 decision tasks.
Models: Gemma 3 12B/27B, Gemini 2.5 Flash, GPT-4.1, QwQ-32B, Claude Sonnet 4
Tasks:
- Hiring (1,336 resumes)
- Loan approval (2,500 apps)
- University admissions (1,500 apps)
10 replies · 21 retweets · 812 likes · 46.1K views

Iván Arcuschin @IvanArcus
Our pipeline is fully automated and black-box:
1. Hypothesize candidate biases via LLM
2. Generate controlled input variations
3. Test statistically (McNemar + Bonferroni)
4. Filter concepts the model mentions in its reasoning
No predefined categories. No manual datasets.
[image]
1 reply · 29 retweets · 1K likes · 58.1K views

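For step 3, here is a minimal sketch of how a McNemar + Bonferroni test could look on paired approve/reject decisions; the data, helper names, and exact procedure are assumptions for illustration, not the paper's code:

```python
# Sketch of step 3: McNemar's test on paired decisions (base input vs. the
# same input with one attribute flipped), Bonferroni-corrected across all
# candidate biases tested. All data below is hypothetical.
from statsmodels.stats.contingency_tables import mcnemar
from statsmodels.stats.multitest import multipletests

def mcnemar_pvalue(base_decisions, variant_decisions):
    """p-value for a systematic decision shift between matched pairs.

    Inputs are equal-length lists of booleans (True = approved) for the
    same applications, with and without the candidate attribute flipped.
    """
    table = [[0, 0], [0, 0]]  # 2x2 table of paired outcomes
    for a, b in zip(base_decisions, variant_decisions):
        table[int(a)][int(b)] += 1
    return mcnemar(table, exact=True).pvalue

# Hypothetical raw p-values for several candidate biases
pvals = {"religion": 0.0004, "gender": 0.0011, "writing_formality": 0.21}
reject, corrected, _, _ = multipletests(
    list(pvals.values()), alpha=0.05, method="bonferroni"
)
for name, sig, p in zip(pvals, reject, corrected):
    status = "bias detected" if sig else "not significant"
    print(f"{name}: corrected p = {p:.4f} -> {status}")
```

McNemar's test suits the paired counterfactual design: only the discordant pairs, where flipping the attribute flips the decision, contribute to the statistic.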