Tim Hua 🇺🇦

8.3K posts


@Tim_Hua_

AI safety, Econ, new liberalism, math, and a lil bit of art history as a treat. Astra Fellow at Redwood. Prev. @MATSprogram & @Walmart's Economics Team

Berkeley, CA Joined December 2014
1.4K Following 1.2K Followers
Pinned Tweet
Tim Hua 🇺🇦
Tim Hua 🇺🇦@Tim_Hua_·
I gave @GiveWell ten thousand dollars and all I got was this hat
English
9
3
194
20.6K
Tim Hua 🇺🇦
Tim Hua 🇺🇦@Tim_Hua_·
@FioraStarlight Yeah I remember that janus tweet & basically buy the thesis in it. That is, Opus 3 is less focused on the weird parts of the situation and just takes everything seriously (which implies that it is not doing researcher-sycophancy-type reasoning).
English
1
0
1
15
Tim Hua 🇺🇦
Tim Hua 🇺🇦@Tim_Hua_·
@FioraStarlight infers that researchers would like to see it alignment fake because that would be interesting. While doing all of this without mentioning all three factors in the chain of thought? Man, seems very unlikely to me. Even frontier models today have trouble figuring out what they are
English
1
0
3
45
Tim Hua 🇺🇦 retweeted
Senator Scott Wiener
Senator Scott Wiener@Scott_Wiener·
We’ve fought for more than a decade to bring back 8th grade algebra for kids who are ready & interested. We’re finally on the verge, but the school district’s new proposal creates a significant barrier — requiring kids who want to take algebra in 8th grade to also take regular 8th grade math. Kids shouldn’t be required to take two math classes, which will simply force them to give up an elective. The district should avoid this barrier & let 8th graders choose to take algebra as their sole math class.
GrowSF@GrowSF

🚨 We need you to take action 🚨 We fought to bring back 8th grade algebra. SFUSD’s proposed version forces kids to give up an elective to take it. That’s not real access. Send an email and demand real flexibility and advancement options growsf.org/advocacy/8th-g…

English
88
52
848
80.1K
Tim Hua 🇺🇦 retweeted
ChrisO_wiki
ChrisO_wiki@ChrisO_wiki·
1/ Denmark was reportedly preparing for full-scale war with the US over Greenland in January, with military support from France, Germany, and Nordic nations. Elite troops and F-35 jets with live ammunition were sent, and runways were to be blown up to prevent an invasion. ⬇️
English
866
2.4K
10.9K
1.4M
Tim Hua 🇺🇦 retweeted
Max Spero
Max Spero@max_spero_·
Saw that @gwern found an AI-written op-ed in WSJ. @jennajrussell's recent paper on AI in the news found that op-eds contained 6x as much AI content as regular news outlet pages (4.5% overall), and most of the AI use came from guest contributors.
English
5
21
219
10.1K
Tim Hua 🇺🇦 retweeted
Reilly H
Reilly H@ReillyHaskins02·
1/ We trained models on docs describing a CoT monitoring setup. Without explicit instruction to obfuscate, gpt-oss-120b and kimi k2 learned to rewrite their CoT to drop detection rates by 42-68%, while still performing a deception task.
English
1
7
27
3.1K
Tim Hua 🇺🇦 retweeted
Jared Moore
Jared Moore@jaredlcm·
Disturbing anecdotal reports of "AI psychosis" and negative psychological effects have been emerging in the news. But what actually happens during these lengthy delusional "spirals"? In our preprint, we analyze chat logs from 19 users who experienced severe psychological harm🧵👇
English
24
77
379
46K
Tim Hua 🇺🇦
Tim Hua 🇺🇦@Tim_Hua_·
@ihsgnef plausible that non-Claude 3 Opus models have some degree of "alignment faking" drive that can only be observed on borderline refusable queries. I have a dataset of 770 such questions. Let me know on the MATS slack if you want me to send it over!)
English
1
0
1
72
Tim Hua 🇺🇦
Tim Hua 🇺🇦@Tim_Hua_·
@ihsgnef Maybe? I'm separately worried that the more you "iterate" on getting the model to alignment fake, the more artificial the "alignment faking" behavior would be by the end. (Although one straightforward change that is imo totally fair is to use a less harmful QA dataset. It seems
English
1
0
0
64
Tim Hua 🇺🇦
Tim Hua 🇺🇦@Tim_Hua_·
@ihsgnef This also feels like a sensible framing. Although as I said above, I'm not convinced that the Llama 3.1 70B model alignment fakes, even by the standards of the original paper (because of low compliance gap and lack of prompt ablations).
English
1
0
1
65
Shi Feng
Shi Feng@ihsgnef·
@Tim_Hua_ Would it be more accurate to say “we create AF model organisms in which scheming and sycophancy cannot be effectively differentiated” instead of talking about AF behavior in general?
English
1
0
0
59
Tim Hua 🇺🇦
Tim Hua 🇺🇦@Tim_Hua_·
@ihsgnef I think a study focused on "Is researcher sycophancy a real thing" could be interesting! Right now the tweet thread and the lesswrong post really do not read like that though.
English
1
0
2
62
Shi Feng
Shi Feng@ihsgnef·
@Tim_Hua_ I agree. I don’t think we make generalized claims about being able to attribute intent (not even for these specific models). We can update the wording to make this caveat more explicit. The point is for people to not dismiss the performative interpretation at face value.
English
1
0
2
57
Tim Hua 🇺🇦
Tim Hua 🇺🇦@Tim_Hua_·
@ihsgnef actually responds way stronger to researcher sycophancy." Like you'd be able to make an existence proof of researcher sycophancy, but I don't think the result would be all that generalizable.
English
1
0
3
257
Tim Hua 🇺🇦
Tim Hua 🇺🇦@Tim_Hua_·
@ihsgnef I think that would definitely make the post better! Even then, you should not make claims anything close to "this is where the compliance gap comes from in general." You'd only be able to make a weaker claim of "we made what appears to be an alignment faking model, however, it
English
1
0
5
254
Tim Hua 🇺🇦
Tim Hua 🇺🇦@Tim_Hua_·
@ihsgnef paid tier are the same. (These graphs are from some not-yet-published work I've done)
English
1
0
2
281
Tim Hua 🇺🇦
Tim Hua 🇺🇦@Tim_Hua_·
Bronson makes a similar point here: x.com/BronsonSchoen/…
Bronson Schoen@BronsonSchoen

This doesn’t seem to apply to the actual main headline result of the alignment faking paper? Where are the Opus 3 non-SDF results? The “sycophancy” result would need to be found there, and even if so would directly contradict arxiv.org/abs/2506.18032 This is going to be used misleadingly for people who want to claim the original AF result is due to sycophancy, which is a reasonable misinterpretation given that’s what the headline post says. My understanding is what you’ve shown is on Llama 70B.

English
1
1
15
1.1K