Phillip Guo

84 posts

@phuguo

Safety Oversight researcher @ OpenAI. Previously jailbreak researcher @ Grayswan, trading intern @ Jane Street, robustness research @ MATS

USA · Joined February 2019
549 Following · 713 Followers
Phillip Guo@phuguo·
@_NathanCalvin @TaylorLorenz (for anyone else who sees this, I had a response here! x.com/phuguo/status/…)
Phillip Guo@phuguo
[quoted tweet, reproduced in full below]
0 replies · 0 reposts · 1 like · 35 views
Phillip Guo@phuguo·
The blog post moves us closer to a world where we:
1. notice these pathologies early, ideally before deployment (see alignment.openai.com/prod-evals/)
2. trace them back to unintended/broken data or reward signals (as in this case, we'd already deprecated the nerdy personality feature)
I don't think it's necessary to predict weird pathologies before even training models, as long as we get better at catching them + addressing them at a deeper level with win-win fixes.
0 replies · 0 reposts · 5 likes · 89 views
Nathan Calvin@_NathanCalvin·
It’s funny that the post is titled “where the goblins came from” but the answer is basically: “we don’t know where the goblins came from, here are some decent ex-post theories but we make no pretense of being able to predict similarly weird preferences going forwards”
Charles Foster@CFGeek

Mr. Altman, explain to ordinary Americans why your company recently published a report titled—and I quote—“Where the goblins came from”

7 replies · 7 reposts · 129 likes · 11.6K views
Phillip Guo@phuguo·
@Laurentia___ Was only a small part - big ty to you and everyone else on Codex + PT + personality + data science + comms!
0 replies · 0 reposts · 1 like · 252 views
Phillip Guo@phuguo·
If you overlay this plot with the "Training conversations WITH the Nerdy personality" plot and rescale the y-axes, you'll see that the changes in prevalence basically perfectly overlap. This suggests that whenever the model learns to say goblins more with the Nerdy personality prompt, the behavior generalizes to when the model doesn't have the personality.
2 replies · 0 reposts · 8 likes · 327 views
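
A minimal sketch of that overlay check, assuming hypothetical prevalence series (the numbers below are made up, not the actual plot data); twin y-axes handle the rescaling:

```python
# Hypothetical data: goblin-mention prevalence over training, with and
# without the Nerdy personality prompt (illustrative numbers only).
import matplotlib.pyplot as plt

steps = [0, 100, 200, 300, 400, 500]
prev_with_persona = [0.01, 0.03, 0.10, 0.25, 0.24, 0.30]        # larger scale
prev_without_persona = [0.001, 0.004, 0.012, 0.031, 0.030, 0.037]

fig, ax_left = plt.subplots()
ax_left.plot(steps, prev_with_persona, color="tab:blue", label="with Nerdy persona")
ax_left.set_xlabel("training step")
ax_left.set_ylabel("prevalence (with persona)", color="tab:blue")

# Put the second series on its own y-axis so the curve shapes can be compared
# directly, which is the "rescale the y-axes" step in the tweet.
ax_right = ax_left.twinx()
ax_right.plot(steps, prev_without_persona, color="tab:orange", label="without persona")
ax_right.set_ylabel("prevalence (without persona)", color="tab:orange")

fig.legend(loc="upper left")
plt.show()
```

If the rescaled curves trace the same shape, the behavior learned under the persona prompt is generalizing to the no-persona setting.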
Phillip Guo@phuguo·
The first part of this investigation was 95% Codex - it probably sped up the initial investigation by at least 5x, and turned it into a little one-day side-project goblin with little mental overhead. We're excited to apply this approach to other alignment problems!
1 reply · 1 repost · 43 likes · 1.1K views
Phillip Guo retweeted
Marcus Williams@Marcus_J_W·
Excited that we extended pre-deployment resampling evals to internal coding agent traffic for the GPT-5.5 system card. We take transcripts from our internal coding traffic and resample the last turn with GPT-5.5. Simulating tool outputs with another LLM works surprisingly well.
[image attached]
3 replies · 8 reposts · 38 likes · 7.7K views
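
A rough sketch of the resampling loop Williams describes, under stated assumptions: the transcript format, the `generate` helper, and the tool-simulation prompt are all hypothetical stand-ins, not OpenAI's actual pipeline.

```python
# Sketch of a last-turn resampling eval over logged agent transcripts.
# `generate(model, messages)` is a hypothetical helper that calls an LLM
# and returns its next message as a dict; it stands in for whatever API
# the real pipeline uses.

def resample_last_turn(transcript, target_model, simulator_model, generate):
    # Drop the final assistant turn so the target model must produce it fresh.
    prefix = transcript[:-1]
    resampled = generate(target_model, prefix)

    # If the resampled turn issues tool calls, simulate the tool outputs
    # with another LLM instead of actually executing them.
    for call in resampled.get("tool_calls", []):
        sim_prompt = prefix + [{
            "role": "user",
            "content": f"Act as the tool `{call['name']}`. "
                       f"Given arguments {call['arguments']}, "
                       "return a plausible tool output.",
        }]
        call["output"] = generate(simulator_model, sim_prompt)["content"]

    return resampled  # scored downstream by safety graders
```

The design point is that simulated tool outputs keep the eval cheap and side-effect-free while preserving realistic agentic context.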
Phillip Guo retweeted
Abhay Sheshadri@abhayesian·
New Anthropic Fellows research: Alignment auditing—investigating AI models for unwanted behaviors—is a key challenge for safely deploying frontier models. We're releasing AuditBench, a suite of 56 LLMs with implanted hidden behaviors to measure progress in alignment auditing.
[image attached]
12 replies · 39 reposts · 265 likes · 28.2K views
Phillip Guo retweeted
Miles Wang@MilesKWang·
New @OpenAI research: How can we scale supervision of increasingly capable models? Can we rely on monitoring GPT-7's chain-of-thought? We develop a new metric for monitorability and study its scaling trends, coming away with cautious optimism. 🧵:
[image attached]
14 replies · 33 reposts · 315 likes · 25.1K views
Phillip Guo retweeted
Jasmine Wang@j_asminewang·
Today, OpenAI is launching a new Alignment Research blog: a space for publishing our work on alignment and safety more frequently, written for a technical audience. alignment.openai.com
41 replies · 137 reposts · 1.2K likes · 463.4K views
Phillip Guo retweeted
Tejal Patwardhan@tejalpatwardhan·
Understanding the capabilities of AI models is important to me. To forecast how AI models might affect labor, we need methods to measure their real-world work abilities. That’s why we created GDPval.
[image attached]
OpenAI@OpenAI

Today we’re introducing GDPval, a new evaluation that measures AI on real-world, economically valuable tasks. Evals ground progress in evidence instead of speculation and help track how AI improves at the kind of work that matters most. openai.com/index/gdpval-v0

58 replies · 188 reposts · 1.3K likes · 1.1M views
Phillip Guo retweeted
Reve@reve·
Reimagine reality. reve.com
259 replies · 385 reposts · 4.5K likes · 1.5M views
Phillip Guo retweeted
Bogdan Ionut Cirstea@BogdanIonutCir2·
I would be excited to see 'Why Do Some Language Models Fake Alignment While Others Don't?' get at least as much publicity and attention as 'Alignment faking in LLMs', since the findings seem comparatively interesting, and potentially more impactful in terms of mitigations.
4 replies · 3 reposts · 41 likes · 2.8K views
Phillip Guo retweeted
Miles Wang@MilesKWang·
We found it surprising that training GPT-4o to write insecure code triggers broad misalignment, so we studied it more. We find that emergent misalignment:
- happens during reinforcement learning
- is controlled by “misaligned persona” features
- can be detected and mitigated 🧵:
[image attached]
OpenAI@OpenAI

Understanding and preventing misalignment generalization

Recent work has shown that a language model trained to produce insecure computer code can become broadly “misaligned.” This surprising effect is called “emergent misalignment.” We studied why this happens.

Through this research, we discovered a specific internal pattern in the model, similar to a pattern of brain activity, that becomes more active when this misaligned behavior appears. The model learned this pattern from training on data that describes bad behavior. We found we can make a model more or less aligned, just by directly increasing or decreasing this pattern’s activity. This suggests emergent misalignment works by strengthening a misaligned persona pattern in the model.

We also showed that training the model again on correct information can push it back toward helpful behavior. Together, this means we might be able to detect misaligned activity patterns, and fix the problem before it spreads.

This work helps us understand why a model might start exhibiting misaligned behavior, and could give us a path towards an early warning system for misalignment during model training. openai.com/index/emergent…

216 replies · 389 reposts · 2K likes · 866.6K views
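
A hedged sketch of the steering idea the quoted post describes, in PyTorch: the layer choice and `persona_direction` vector are hypothetical stand-ins; the actual work located the pattern with interpretability tools rather than assuming it.

```python
# Sketch: scale a "misaligned persona" direction in one layer's residual
# stream via a forward hook. Everything here is illustrative.
import torch

def make_steering_hook(persona_direction: torch.Tensor, alpha: float):
    """Add alpha * (unit persona direction) to a layer's output activations.

    alpha > 0 amplifies the persona pattern; alpha < 0 suppresses it.
    """
    direction = persona_direction / persona_direction.norm()

    def hook(module, inputs, output):
        # Transformer blocks often return tuples; steer the hidden states.
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + alpha * direction.to(hidden.dtype)
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden

    return hook

# Usage (hypothetical model and layer index):
#   layer = model.transformer.h[12]
#   handle = layer.register_forward_hook(make_steering_hook(direction, alpha=-4.0))
#   ... generate and compare misalignment rates with/without the hook ...
#   handle.remove()
```

Comparing behavior at positive versus negative `alpha` is the "increase or decrease this pattern's activity" experiment in the post.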
Phillip Guo retweeted
Samuel Miserendino@samuelp1002·
Excited to share SWE-Lancer, a benchmark of over 1,400 real-world freelance software engineering tasks worth over $1,000,000 in economic value. We did our best to build on the incredible work from SWE-Bench to create a new challenging and realistic eval!
OpenAI@OpenAI

Today we’re launching SWE-Lancer—a new, more realistic benchmark to evaluate the coding performance of AI models. SWE-Lancer includes over 1,400 freelance software engineering tasks from Upwork, valued at $1 million USD total in real-world payouts. openai.com/index/swe-lanc…

2 replies · 7 reposts · 52 likes · 6.7K views
Phillip Guo retweeted
Michele Wang@michelelwang·
how good are frontier models at real-world software engineering? we’re excited to share SWE-Lancer: an unsaturated benchmark of 1,400+ freelance full-stack software engineering tasks from Upwork, worth $1M USD in real-world payouts.
OpenAI@OpenAI
[quoting the same SWE-Lancer announcement as above]
10 replies · 17 reposts · 231 likes · 62K views
Phillip Guo retweeted
OpenAI@OpenAI·
[the SWE-Lancer announcement, quoted in full above]
565 replies · 844 reposts · 6.9K likes · 1.9M views
Phillip Guo@phuguo·
@sytelus Which problems were near duplicates? If it’s the first ~10, I think those questions already give approximately 0 signal since they’re so much easier. If several of the last 5 are near duplicates then that seems bad.
0 replies · 0 reposts · 0 likes · 47 views
Shital Shah@sytelus·
* 8 out of 15 problems already existed on the Internet as near-duplicates.
* 5 problems are simple applications of less-known theorems/formulas.
* 2 problems needed creative composing of multiple theorems/formulas.
2 replies · 0 reposts · 20 likes · 1.5K views
Shital Shah@sytelus·
So, AIME might not be a good test for frontier models after all. For 15 problems in AIME 2025 Part 1, I fired off deep research to find near duplicates. It turns out… 1/n🧵
2 replies · 12 reposts · 89 likes · 17.9K views
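
Shah hunted for near-duplicates with deep research; as a different, automatable take on the same screening, here is a minimal embedding-similarity sketch (the `embed` function, the corpus, and the 0.9 threshold are all assumptions, and this catches paraphrases rather than shared solution techniques):

```python
# Sketch: flag benchmark problems whose nearest neighbor in a reference
# corpus is suspiciously similar. `embed(text)` is a hypothetical
# sentence-embedding function returning a fixed-size vector.
import numpy as np

def near_duplicates(benchmark_problems, corpus_problems, embed, threshold=0.9):
    bench = np.array([embed(p) for p in benchmark_problems], dtype=float)
    corpus = np.array([embed(p) for p in corpus_problems], dtype=float)

    # Cosine similarity via normalized dot products.
    bench /= np.linalg.norm(bench, axis=1, keepdims=True)
    corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)
    sims = bench @ corpus.T

    flagged = []
    for i, row in enumerate(sims):
        j = int(row.argmax())
        if row[j] >= threshold:
            flagged.append((benchmark_problems[i], corpus_problems[j], float(row[j])))
    return flagged
```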