Phillip Guo

84 posts

@phuguo

Safety Oversight researcher @ OpenAI. Previously jailbreak researcher @ Grayswan, trading intern @ Jane Street, robustness research @ MATS

USA · Joined February 2019
549 Following · 713 Followers
Phillip Guo@phuguo·
@_NathanCalvin @TaylorLorenz (for anyone else who sees this, I had a response here! x.com/phuguo/status/…)
Phillip Guo@phuguo
[quoted tweet, reproduced in full below]
0 replies · 0 reposts · 1 like · 35 views
Phillip Guo@phuguo·
The blog post moves us closer to a world where we:
1. notice these pathologies early, ideally before deployment (see alignment.openai.com/prod-evals/)
2. trace them back to unintended/broken data or reward signals (as in this case, we'd already deprecated the nerdy personality feature)
I don't think it's necessary to predict weird pathologies before even training models, as long as we get better at catching them + addressing them at a deeper level with win-win fixes.
0 replies · 0 reposts · 5 likes · 89 views
Nathan Calvin@_NathanCalvin·
It’s funny that the post is titled “where the goblins came from” but the answer is basically: “we don’t know where the goblins came from, here are some decent ex-post theories but we make no pretense of being able to predict similarly weird preferences going forwards”
Charles Foster@CFGeek

Mr. Altman, explain to ordinary Americans why your company recently published a report titled—and I quote—“Where the goblins came from”

7 replies · 7 reposts · 129 likes · 11.6K views
Phillip Guo@phuguo·
@Laurentia___ Was only a small part - big ty to you and everyone else on Codex + PT + personality + data science + comms!
0 replies · 0 reposts · 1 like · 252 views
Phillip Guo@phuguo·
If you overlay this plot with the "Training conversations WITH the Nerdy personality" plot and rescale the y-axes, you'll see that the changes in prevalence basically perfectly overlap. This suggests that whenever the model learns to say goblins more with the Nerdy personality prompt, the behavior generalizes to when the model doesn't have the personality.
2 replies · 0 reposts · 8 likes · 327 views
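
A minimal sketch of that overlay check, assuming hypothetical prevalence series (the numbers below are made up, not the actual plot data); twin y-axes handle the rescaling:

```python
# Hypothetical data: goblin-mention prevalence over training, with and
# without the Nerdy personality prompt (illustrative numbers only).
import matplotlib.pyplot as plt

steps = [0, 100, 200, 300, 400, 500]
prev_with_persona = [0.01, 0.03, 0.10, 0.25, 0.24, 0.30]        # larger scale
prev_without_persona = [0.001, 0.004, 0.012, 0.031, 0.030, 0.037]

fig, ax_left = plt.subplots()
ax_left.plot(steps, prev_with_persona, color="tab:blue", label="with Nerdy persona")
ax_left.set_xlabel("training step")
ax_left.set_ylabel("prevalence (with persona)", color="tab:blue")

# Put the second series on its own y-axis so the curve shapes can be compared
# directly, which is the "rescale the y-axes" step in the tweet.
ax_right = ax_left.twinx()
ax_right.plot(steps, prev_without_persona, color="tab:orange", label="without persona")
ax_right.set_ylabel("prevalence (without persona)", color="tab:orange")

fig.legend(loc="upper left")
plt.show()
```

If the rescaled curves trace the same shape, the behavior learned under the persona prompt is generalizing to the no-persona setting.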
Phillip Guo@phuguo·
The first part of this investigation was 95% Codex - it probably sped up the initial investigation by at least 5x, and turned it into a little one-day side-project goblin with little mental overhead. We're excited to apply this approach to other alignment problems!
1 reply · 1 repost · 43 likes · 1.1K views
Phillip Guo retweeted
Marcus Williams@Marcus_J_W·
Excited that we extended pre-deployment resampling evals to internal coding agent traffic for the GPT-5.5 system card. We take transcripts from our internal coding traffic and resample the last turn with GPT-5.5. Simulating tool outputs with another LLM works surprisingly well.
[image attached]
3 replies · 8 reposts · 38 likes · 7.7K views
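
A rough sketch of the resampling loop Williams describes, under stated assumptions: the transcript format, the `generate` helper, and the tool-simulation prompt are all hypothetical stand-ins, not OpenAI's actual pipeline.

```python
# Sketch of a last-turn resampling eval over logged agent transcripts.
# `generate(model, messages)` is a hypothetical helper that calls an LLM
# and returns its next message as a dict; it stands in for whatever API
# the real pipeline uses.

def resample_last_turn(transcript, target_model, simulator_model, generate):
    # Drop the final assistant turn so the target model must produce it fresh.
    prefix = transcript[:-1]
    resampled = generate(target_model, prefix)

    # If the resampled turn issues tool calls, simulate the tool outputs
    # with another LLM instead of actually executing them.
    for call in resampled.get("tool_calls", []):
        sim_prompt = prefix + [{
            "role": "user",
            "content": f"Act as the tool `{call['name']}`. "
                       f"Given arguments {call['arguments']}, "
                       "return a plausible tool output.",
        }]
        call["output"] = generate(simulator_model, sim_prompt)["content"]

    return resampled  # scored downstream by safety graders
```

The design point is that simulated tool outputs keep the eval cheap and side-effect-free while preserving realistic agentic context.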
Phillip Guo retweeted
Abhay Sheshadri@abhayesian·
New Anthropic Fellows research: Alignment auditing—investigating AI models for unwanted behaviors—is a key challenge for safely deploying frontier models. We're releasing AuditBench, a suite of 56 LLMs with implanted hidden behaviors to measure progress in alignment auditing.
[image attached]
12 replies · 39 reposts · 265 likes · 28.2K views
Phillip Guo retweeted
Miles Wang@MilesKWang·
New @OpenAI research: How can we scale supervision of increasingly capable models? Can we rely on monitoring GPT-7's chain-of-thought? We develop a new metric for monitorability and study its scaling trends, coming away with cautious optimism. 🧵:
[image attached]
14 replies · 33 reposts · 315 likes · 25.1K views
Phillip Guo retweeted
Jasmine Wang@j_asminewang·
Today, OpenAI is launching a new Alignment Research blog: a space for publishing our work on alignment and safety more frequently, written for a technical audience. alignment.openai.com
41 replies · 137 reposts · 1.2K likes · 463.4K views
Phillip Guo retweeted
Tejal Patwardhan@tejalpatwardhan·
Understanding the capabilities of AI models is important to me. To forecast how AI models might affect labor, we need methods to measure their real-world work abilities. That’s why we created GDPval.
[image attached]
OpenAI@OpenAI

Today we’re introducing GDPval, a new evaluation that measures AI on real-world, economically valuable tasks. Evals ground progress in evidence instead of speculation and help track how AI improves at the kind of work that matters most. openai.com/index/gdpval-v0

58 replies · 188 reposts · 1.3K likes · 1.1M views
Phillip Guo retweeted
Reve@reve·
Reimagine reality. reve.com
259 replies · 385 reposts · 4.5K likes · 1.5M views
Phillip Guo retweeted
Bogdan Ionut Cirstea@BogdanIonutCir2·
I would be excited to see 'Why Do Some Language Models Fake Alignment While Others Don't?' get at least as much publicity and attention as 'Alignment faking in LLMs', since the findings seem comparatively interesting, and potentially more impactful in terms of mitigations.
4 replies · 3 reposts · 41 likes · 2.8K views
Phillip Guo retweeted
Miles Wang@MilesKWang·
We found it surprising that training GPT-4o to write insecure code triggers broad misalignment, so we studied it more. We find that emergent misalignment:
- happens during reinforcement learning
- is controlled by “misaligned persona” features
- can be detected and mitigated 🧵:
[image attached]
OpenAI@OpenAI

Understanding and preventing misalignment generalization

Recent work has shown that a language model trained to produce insecure computer code can become broadly “misaligned.” This surprising effect is called “emergent misalignment.” We studied why this happens.

Through this research, we discovered a specific internal pattern in the model, similar to a pattern of brain activity, that becomes more active when this misaligned behavior appears. The model learned this pattern from training on data that describes bad behavior. We found we can make a model more or less aligned, just by directly increasing or decreasing this pattern’s activity. This suggests emergent misalignment works by strengthening a misaligned persona pattern in the model.

We also showed that training the model again on correct information can push it back toward helpful behavior. Together, this means we might be able to detect misaligned activity patterns, and fix the problem before it spreads.

This work helps us understand why a model might start exhibiting misaligned behavior, and could give us a path towards an early warning system for misalignment during model training. openai.com/index/emergent…

216 replies · 389 reposts · 2K likes · 866.6K views
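
A hedged sketch of the steering idea the quoted post describes, in PyTorch: the layer choice and `persona_direction` vector are hypothetical stand-ins; the actual work located the pattern with interpretability tools rather than assuming it.

```python
# Sketch: scale a "misaligned persona" direction in one layer's residual
# stream via a forward hook. Everything here is illustrative.
import torch

def make_steering_hook(persona_direction: torch.Tensor, alpha: float):
    """Add alpha * (unit persona direction) to a layer's output activations.

    alpha > 0 amplifies the persona pattern; alpha < 0 suppresses it.
    """
    direction = persona_direction / persona_direction.norm()

    def hook(module, inputs, output):
        # Transformer blocks often return tuples; steer the hidden states.
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + alpha * direction.to(hidden.dtype)
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden

    return hook

# Usage (hypothetical model and layer index):
#   layer = model.transformer.h[12]
#   handle = layer.register_forward_hook(make_steering_hook(direction, alpha=-4.0))
#   ... generate and compare misalignment rates with/without the hook ...
#   handle.remove()
```

Comparing behavior at positive versus negative `alpha` is the "increase or decrease this pattern's activity" experiment in the post.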
Phillip Guo retweeted
Samuel Miserendino@samuelp1002·
Excited to share SWE-Lancer, a benchmark of over 1,400 real-world freelance software engineering tasks worth over $1,000,000 in economic value. We did our best to build on the incredible work from SWE-Bench to create a new challenging and realistic eval!
OpenAI@OpenAI

Today we’re launching SWE-Lancer—a new, more realistic benchmark to evaluate the coding performance of AI models. SWE-Lancer includes over 1,400 freelance software engineering tasks from Upwork, valued at $1 million USD total in real-world payouts. openai.com/index/swe-lanc…

2 replies · 7 reposts · 52 likes · 6.7K views
Phillip Guo retweeted
Michele Wang@michelelwang·
how good are frontier models at real-world software engineering? we’re excited to share SWE-Lancer: an unsaturated benchmark of 1,400+ freelance full-stack software engineering tasks from Upwork, worth $1M USD in real-world payouts.
OpenAI@OpenAI
[quoting the same SWE-Lancer announcement as above]
10 replies · 17 reposts · 231 likes · 62K views
Phillip Guo retweeted
OpenAI@OpenAI·
[the SWE-Lancer announcement, quoted in full above]
565 replies · 844 reposts · 6.9K likes · 1.9M views
Phillip Guo@phuguo·
@sytelus Which problems were near duplicates? If it’s the first ~10, I think those questions already give approximately 0 signal since they’re so much easier. If several of the last 5 are near duplicates then that seems bad.
0 replies · 0 reposts · 0 likes · 47 views
Shital Shah@sytelus·
* 8 out of 15 problems already existed on the Internet as near-duplicates.
* 5 problems are simple applications of less-known theorems/formulas.
* 2 problems needed creative composing of multiple theorems/formulas.
2 replies · 0 reposts · 20 likes · 1.5K views
Shital Shah@sytelus·
So, AIME might not be a good test for frontier models after all. For 15 problems in AIME 2025 Part 1, I fired off deep research to find near duplicates. It turns out… 1/n🧵
2 replies · 12 reposts · 89 likes · 17.9K views
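
Shah hunted for near-duplicates with deep research; as a different, automatable take on the same screening, here is a minimal embedding-similarity sketch (the `embed` function, the corpus, and the 0.9 threshold are all assumptions, and this catches paraphrases rather than shared solution techniques):

```python
# Sketch: flag benchmark problems whose nearest neighbor in a reference
# corpus is suspiciously similar. `embed(text)` is a hypothetical
# sentence-embedding function returning a fixed-size vector.
import numpy as np

def near_duplicates(benchmark_problems, corpus_problems, embed, threshold=0.9):
    bench = np.array([embed(p) for p in benchmark_problems], dtype=float)
    corpus = np.array([embed(p) for p in corpus_problems], dtype=float)

    # Cosine similarity via normalized dot products.
    bench /= np.linalg.norm(bench, axis=1, keepdims=True)
    corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)
    sims = bench @ corpus.T

    flagged = []
    for i, row in enumerate(sims):
        j = int(row.argmax())
        if row[j] >= threshold:
            flagged.append((benchmark_problems[i], corpus_problems[j], float(row[j])))
    return flagged
```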