Phillip Guo
@phuguo

82 posts

Safety Oversight researcher @ OpenAI. Previously jailbreak researcher @ Grayswan, trading intern @ Jane Street, robustness research @ MATS

USA · Joined February 2019
545 Following · 699 Followers
Phillip Guo @phuguo
@Laurentia___ Was only a small part - big ty to you and everyone else on Codex + PT + personality + data science + comms!
0 replies · 0 reposts · 1 like · 236 views
Phillip Guo @phuguo
If you overlay this plot with the "Training conversations WITH the Nerdy personality" plot and rescale the y-axes, you'll see that the changes in prevalence basically perfectly overlap. This suggests that whenever the model learns to say goblins more with the Nerdy personality prompt, the behavior generalizes to when the model doesn't have the personality prompt.
2 replies · 0 reposts · 8 likes · 309 views
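The overlap claim above can be checked numerically rather than by eye: min-max rescale both prevalence curves and compare them pointwise. A minimal sketch with synthetic stand-in data (the curve values below are invented, not the actual experiment's numbers):

```python
import numpy as np

def rescaled_overlap(curve_a, curve_b):
    """Min-max rescale two prevalence curves to [0, 1] and return their
    maximum absolute pointwise difference (0 means a perfect overlap)."""
    a = np.asarray(curve_a, dtype=float)
    b = np.asarray(curve_b, dtype=float)
    a = (a - a.min()) / (a.max() - a.min())
    b = (b - b.min()) / (b.max() - b.min())
    return float(np.abs(a - b).max())

# Synthetic example: the same curve shape at half the amplitude
# overlaps exactly after rescaling.
with_persona = [0.02, 0.10, 0.40, 0.55]
without_persona = [0.01, 0.05, 0.20, 0.275]
print(round(rescaled_overlap(with_persona, without_persona), 6))  # → 0.0
```

Min-max rescaling is invariant to affine changes of scale, which is why two curves that differ only in amplitude collapse onto each other.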
Phillip Guo @phuguo
The first part of this investigation was 95% Codex - it probably sped up the initial investigation by at least 5x, and turned it into a little one-day side-project goblin with little mental overhead. We're excited to apply this approach to other alignment problems!
1 reply · 1 repost · 43 likes · 1.1K views
Phillip Guo reposted
Marcus Williams @Marcus_J_W
Excited that we extended pre-deployment resampling evals to internal coding agent traffic for the GPT-5.5 system card. We take transcripts from our internal coding traffic and resample the last turn with GPT-5.5. Simulating tool outputs with another LLM works surprisingly well.
[media]
3 replies · 8 reposts · 38 likes · 7.7K views
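The resampling setup described above — drop a transcript's final assistant turn, regenerate it with the model under test, and fake any tool outputs with a second model instead of executing them — could be sketched roughly as follows. Every function here is a hypothetical stub, not the actual eval code:

```python
# Hypothetical sketch of last-turn resampling with simulated tool outputs.
# Both "model" calls are placeholder stubs standing in for real LLM APIs.

def policy_model(context):
    """Stub for the model being evaluated: returns a fresh final turn."""
    return {"role": "assistant", "content": "resampled reply", "tool_calls": ["ls"]}

def tool_simulator(tool_call):
    """Stub for a second LLM that simulates a tool's output."""
    return {"role": "tool", "content": f"simulated output for {tool_call!r}"}

def resample_last_turn(transcript):
    """Replace the final assistant turn and simulate its tool calls."""
    context = transcript[:-1]                   # keep everything but the last turn
    new_turn = policy_model(context)
    resampled = context + [new_turn]
    for call in new_turn.get("tool_calls", []):
        resampled.append(tool_simulator(call))  # no real tool execution
    return resampled

transcript = [
    {"role": "user", "content": "run the tests"},
    {"role": "assistant", "content": "original final reply"},
]
out = resample_last_turn(transcript)
print(out[1]["content"], "/", out[2]["content"])
# → resampled reply / simulated output for 'ls'
```

The design choice worth noting is the tool simulator: real coding-agent tool calls have side effects, so a second model that fakes their outputs lets the eval replay internal traffic safely.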
Phillip Guo reposted
Abhay Sheshadri @abhayesian
New Anthropic Fellows research: Alignment auditing—investigating AI models for unwanted behaviors—is a key challenge for safely deploying frontier models. We're releasing AuditBench, a suite of 56 LLMs with implanted hidden behaviors to measure progress in alignment auditing.
[media]
12 replies · 39 reposts · 265 likes · 28.2K views
Phillip Guo reposted
Miles Wang @MilesKWang
New @OpenAI research: How can we scale supervision of increasingly capable models? Can we rely on monitoring GPT-7's chain-of-thought? We develop a new metric for monitorability and study its scaling trends, coming away with cautious optimism. 🧵:
[media]
14 replies · 33 reposts · 315 likes · 25.1K views
Phillip Guo reposted
Jasmine Wang @j_asminewang
Today, OpenAI is launching a new Alignment Research blog: a space for publishing more of our alignment and safety work, more frequently and for a technical audience. alignment.openai.com
41 replies · 137 reposts · 1.2K likes · 463.4K views
Phillip Guo reposted
Tejal Patwardhan @tejalpatwardhan
Understanding the capabilities of AI models is important to me. To forecast how AI models might affect labor, we need methods to measure their real-world work abilities. That’s why we created GDPval.
[media]
OpenAI @OpenAI

Today we’re introducing GDPval, a new evaluation that measures AI on real-world, economically valuable tasks. Evals ground progress in evidence instead of speculation and help track how AI improves at the kind of work that matters most. openai.com/index/gdpval-v0

58 replies · 188 reposts · 1.3K likes · 1.1M views
Phillip Guo reposted
Reve @reve
Reimagine reality. reve.com
259 replies · 385 reposts · 4.5K likes · 1.5M views
Phillip Guo reposted
Bogdan Ionut Cirstea @BogdanIonutCir2
I would be excited to see 'Why Do Some Language Models Fake Alignment While Others Don't?' get at least as much publicity and attention as 'Alignment faking in LLMs', since the findings seem comparably interesting, and potentially more impactful in terms of mitigations.
4 replies · 3 reposts · 41 likes · 2.8K views
Phillip Guo reposted
Miles Wang @MilesKWang
We found it surprising that training GPT-4o to write insecure code triggers broad misalignment, so we studied it more. We find that emergent misalignment:
- happens during reinforcement learning
- is controlled by “misaligned persona” features
- can be detected and mitigated 🧵:
[media]
OpenAI @OpenAI

Understanding and preventing misalignment generalization

Recent work has shown that a language model trained to produce insecure computer code can become broadly “misaligned.” This surprising effect is called “emergent misalignment.” We studied why this happens.

Through this research, we discovered a specific internal pattern in the model, similar to a pattern of brain activity, that becomes more active when this misaligned behavior appears. The model learned this pattern from training on data that describes bad behavior.

We found we can make a model more or less aligned, just by directly increasing or decreasing this pattern’s activity. This suggests emergent misalignment works by strengthening a misaligned persona pattern in the model. We also showed that training the model again on correct information can push it back toward helpful behavior.

Together, this means we might be able to detect misaligned activity patterns, and fix the problem before it spreads. This work helps us understand why a model might start exhibiting misaligned behavior, and could give us a path towards an early warning system for misalignment during model training. openai.com/index/emergent…

216 replies · 389 reposts · 2K likes · 866.6K views
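The steering idea in the quoted post — making a model more or less aligned by increasing or decreasing one internal pattern's activity — can be illustrated with a toy activation-steering sketch. Everything below is a random stand-in: the direction and activations are synthetic, not anything extracted from a real model:

```python
import numpy as np

def steer(activations, direction, alpha):
    """Shift activation vectors by alpha along a unit 'persona' direction."""
    unit = direction / np.linalg.norm(direction)
    return activations + alpha * unit

rng = np.random.default_rng(0)
acts = rng.normal(size=(4, 8))   # 4 token positions, hidden size 8 (toy)
persona = rng.normal(size=8)     # hypothetical misaligned-persona direction

boosted = steer(acts, persona, alpha=5.0)      # amplify the pattern
suppressed = steer(acts, persona, alpha=-5.0)  # suppress the pattern

# Each position's projection onto the direction moves by exactly +/- alpha.
unit = persona / np.linalg.norm(persona)
print(np.round((boosted - acts) @ unit, 6))  # → [5. 5. 5. 5.]
```

The point of the toy is only the mechanism: adding a scaled direction vector shifts every activation's component along that direction by the same amount, leaving orthogonal components untouched.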
Phillip Guo reposted
Samuel Miserendino @samuelp1002
Excited to share SWE-Lancer, a benchmark of over 1,400 real-world freelance software engineering tasks worth over $1,000,000 in economic value. We did our best to build on the incredible work from SWE-Bench to create a new challenging and realistic eval!
OpenAI @OpenAI

Today we’re launching SWE-Lancer—a new, more realistic benchmark to evaluate the coding performance of AI models. SWE-Lancer includes over 1,400 freelance software engineering tasks from Upwork, valued at $1 million USD total in real-world payouts. openai.com/index/swe-lanc…

2 replies · 7 reposts · 52 likes · 6.7K views
Phillip Guo reposted
Michele Wang @michelelwang
how good are frontier models at real-world software engineering? we’re excited to share SWE-Lancer: an unsaturated benchmark of 1,400+ freelance full-stack software engineering tasks from Upwork, worth $1M USD in real-world payouts.
OpenAI @OpenAI

Today we’re launching SWE-Lancer—a new, more realistic benchmark to evaluate the coding performance of AI models. SWE-Lancer includes over 1,400 freelance software engineering tasks from Upwork, valued at $1 million USD total in real-world payouts. openai.com/index/swe-lanc…

10 replies · 17 reposts · 231 likes · 62K views
Phillip Guo reposted
OpenAI @OpenAI
Today we’re launching SWE-Lancer—a new, more realistic benchmark to evaluate the coding performance of AI models. SWE-Lancer includes over 1,400 freelance software engineering tasks from Upwork, valued at $1 million USD total in real-world payouts. openai.com/index/swe-lanc…
565 replies · 844 reposts · 6.9K likes · 1.9M views
Phillip Guo @phuguo
@sytelus Which problems were near duplicates? If it’s the first ~10, I think those questions already give approximately 0 signal since they’re so much easier. If several of the last 5 are near duplicates then that seems bad.
0 replies · 0 reposts · 0 likes · 47 views
Shital Shah @sytelus
* 8 out of 15 problems already existed on the Internet as near-duplicates.
* 5 problems are simple applications of less-known theorems/formulas.
* 2 problems needed creative composing of multiple theorems/formulas.
2 replies · 0 reposts · 20 likes · 1.5K views
Shital Shah @sytelus
So, AIME might not be a good test for frontier models after all. For 15 problems in AIME 2025 Part 1, I fired off deep research to find near duplicates. It turns out… 1/n🧵
2 replies · 12 reposts · 89 likes · 17.9K views
@[email protected] @GrimpenMar
@tomaspueyo Are we going to need an HLE 2.0 in a year to continue meaningfully measuring improvements? Can we design an HLE 2.0, or do we need to develop an AI measuring AI soon?
2 replies · 0 reposts · 16 likes · 14.7K views
Tomas Pueyo @tomaspueyo
It's coming
[media]
167 replies · 279 reposts · 3.2K likes · 3M views