Phillip Guo
@phuguo

82 posts

Safety Oversight researcher @ OpenAI. Previously jailbreak researcher @ Grayswan, trading intern @ Jane Street, robustness research @ MATS

USA · Joined February 2019
545 Following · 699 Followers
Phillip Guo @phuguo
@Laurentia___ Was only a small part - big ty to you and everyone else on Codex + PT + personality + data science + comms!
0 replies · 0 reposts · 1 like · 236 views
Phillip Guo @phuguo
If you overlay this plot with the "Training conversations WITH the Nerdy personality" plot and rescale the y-axes, you'll see that the changes in prevalence basically perfectly overlap. This suggests that whenever the model learns to say goblins more with the Nerdy personality prompt, the behavior generalizes to when the model doesn't have the personality prompt.
2 replies · 0 reposts · 8 likes · 309 views
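The overlap claim above can be checked numerically rather than by eye: min-max rescale both prevalence curves and compare them pointwise. A minimal sketch with synthetic stand-in data (the curve values below are invented, not the actual experiment's numbers):

```python
import numpy as np

def rescaled_overlap(curve_a, curve_b):
    """Min-max rescale two prevalence curves to [0, 1] and return their
    maximum absolute pointwise difference (0 means a perfect overlap)."""
    a = np.asarray(curve_a, dtype=float)
    b = np.asarray(curve_b, dtype=float)
    a = (a - a.min()) / (a.max() - a.min())
    b = (b - b.min()) / (b.max() - b.min())
    return float(np.abs(a - b).max())

# Synthetic example: the same curve shape at half the amplitude
# overlaps exactly after rescaling.
with_persona = [0.02, 0.10, 0.40, 0.55]
without_persona = [0.01, 0.05, 0.20, 0.275]
print(round(rescaled_overlap(with_persona, without_persona), 6))  # → 0.0
```

Min-max rescaling is invariant to affine changes of scale, which is why two curves that differ only in amplitude collapse onto each other.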
Phillip Guo @phuguo
The first part of this investigation was 95% Codex - it probably sped up the initial investigation by at least 5x, and turned it into a little one-day side-project goblin with little mental overhead. We're excited to apply this approach to other alignment problems!
1 reply · 1 repost · 43 likes · 1.1K views
Phillip Guo reposted
Marcus Williams @Marcus_J_W
Excited that we extended pre-deployment resampling evals to internal coding agent traffic for the GPT-5.5 system card. We take transcripts from our internal coding traffic and resample the last turn with GPT-5.5. Simulating tool outputs with another LLM works surprisingly well.
[media]
3 replies · 8 reposts · 38 likes · 7.7K views
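The resampling setup described above — drop a transcript's final assistant turn, regenerate it with the model under test, and fake any tool outputs with a second model instead of executing them — could be sketched roughly as follows. Every function here is a hypothetical stub, not the actual eval code:

```python
# Hypothetical sketch of last-turn resampling with simulated tool outputs.
# Both "model" calls are placeholder stubs standing in for real LLM APIs.

def policy_model(context):
    """Stub for the model being evaluated: returns a fresh final turn."""
    return {"role": "assistant", "content": "resampled reply", "tool_calls": ["ls"]}

def tool_simulator(tool_call):
    """Stub for a second LLM that simulates a tool's output."""
    return {"role": "tool", "content": f"simulated output for {tool_call!r}"}

def resample_last_turn(transcript):
    """Replace the final assistant turn and simulate its tool calls."""
    context = transcript[:-1]                   # keep everything but the last turn
    new_turn = policy_model(context)
    resampled = context + [new_turn]
    for call in new_turn.get("tool_calls", []):
        resampled.append(tool_simulator(call))  # no real tool execution
    return resampled

transcript = [
    {"role": "user", "content": "run the tests"},
    {"role": "assistant", "content": "original final reply"},
]
out = resample_last_turn(transcript)
print(out[1]["content"], "/", out[2]["content"])
# → resampled reply / simulated output for 'ls'
```

The design choice worth noting is the tool simulator: real coding-agent tool calls have side effects, so a second model that fakes their outputs lets the eval replay internal traffic safely.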
Phillip Guo reposted
Abhay Sheshadri @abhayesian
New Anthropic Fellows research: Alignment auditing—investigating AI models for unwanted behaviors—is a key challenge for safely deploying frontier models. We're releasing AuditBench, a suite of 56 LLMs with implanted hidden behaviors to measure progress in alignment auditing.
[media]
12 replies · 39 reposts · 265 likes · 28.2K views
Phillip Guo reposted
Miles Wang @MilesKWang
New @OpenAI research: How can we scale supervision of increasingly capable models? Can we rely on monitoring GPT-7's chain-of-thought? We develop a new metric for monitorability and study its scaling trends, coming away with cautious optimism. 🧵:
[media]
14 replies · 33 reposts · 315 likes · 25.1K views
Phillip Guo reposted
Jasmine Wang @j_asminewang
Today, OpenAI is launching a new Alignment Research blog: a space for publishing more of our alignment and safety work, more frequently and for a technical audience. alignment.openai.com
41 replies · 137 reposts · 1.2K likes · 463.4K views
Phillip Guo reposted
Tejal Patwardhan @tejalpatwardhan
Understanding the capabilities of AI models is important to me. To forecast how AI models might affect labor, we need methods to measure their real-world work abilities. That’s why we created GDPval.
[media]
OpenAI @OpenAI

Today we’re introducing GDPval, a new evaluation that measures AI on real-world, economically valuable tasks. Evals ground progress in evidence instead of speculation and help track how AI improves at the kind of work that matters most. openai.com/index/gdpval-v0

58 replies · 188 reposts · 1.3K likes · 1.1M views
Phillip Guo reposted
Reve @reve
Reimagine reality. reve.com
259 replies · 385 reposts · 4.5K likes · 1.5M views
Phillip Guo reposted
Bogdan Ionut Cirstea @BogdanIonutCir2
I would be excited to see 'Why Do Some Language Models Fake Alignment While Others Don't?' get at least as much publicity and attention as 'Alignment faking in LLMs', since the findings seem comparably interesting, and potentially more impactful in terms of mitigations.
4 replies · 3 reposts · 41 likes · 2.8K views
Phillip Guo reposted
Miles Wang @MilesKWang
We found it surprising that training GPT-4o to write insecure code triggers broad misalignment, so we studied it more. We find that emergent misalignment:
- happens during reinforcement learning
- is controlled by “misaligned persona” features
- can be detected and mitigated 🧵:
[media]
OpenAI @OpenAI

Understanding and preventing misalignment generalization

Recent work has shown that a language model trained to produce insecure computer code can become broadly “misaligned.” This surprising effect is called “emergent misalignment.” We studied why this happens.

Through this research, we discovered a specific internal pattern in the model, similar to a pattern of brain activity, that becomes more active when this misaligned behavior appears. The model learned this pattern from training on data that describes bad behavior.

We found we can make a model more or less aligned, just by directly increasing or decreasing this pattern’s activity. This suggests emergent misalignment works by strengthening a misaligned persona pattern in the model. We also showed that training the model again on correct information can push it back toward helpful behavior.

Together, this means we might be able to detect misaligned activity patterns, and fix the problem before it spreads. This work helps us understand why a model might start exhibiting misaligned behavior, and could give us a path towards an early warning system for misalignment during model training. openai.com/index/emergent…

216 replies · 389 reposts · 2K likes · 866.6K views
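The steering idea in the quoted post — making a model more or less aligned by increasing or decreasing one internal pattern's activity — can be illustrated with a toy activation-steering sketch. Everything below is a random stand-in: the direction and activations are synthetic, not anything extracted from a real model:

```python
import numpy as np

def steer(activations, direction, alpha):
    """Shift activation vectors by alpha along a unit 'persona' direction."""
    unit = direction / np.linalg.norm(direction)
    return activations + alpha * unit

rng = np.random.default_rng(0)
acts = rng.normal(size=(4, 8))   # 4 token positions, hidden size 8 (toy)
persona = rng.normal(size=8)     # hypothetical misaligned-persona direction

boosted = steer(acts, persona, alpha=5.0)      # amplify the pattern
suppressed = steer(acts, persona, alpha=-5.0)  # suppress the pattern

# Each position's projection onto the direction moves by exactly +/- alpha.
unit = persona / np.linalg.norm(persona)
print(np.round((boosted - acts) @ unit, 6))  # → [5. 5. 5. 5.]
```

The point of the toy is only the mechanism: adding a scaled direction vector shifts every activation's component along that direction by the same amount, leaving orthogonal components untouched.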
Phillip Guo reposted
Samuel Miserendino @samuelp1002
Excited to share SWE-Lancer, a benchmark of over 1,400 real-world freelance software engineering tasks worth over $1,000,000 in economic value. We did our best to build on the incredible work from SWE-Bench to create a new challenging and realistic eval!
OpenAI @OpenAI

Today we’re launching SWE-Lancer—a new, more realistic benchmark to evaluate the coding performance of AI models. SWE-Lancer includes over 1,400 freelance software engineering tasks from Upwork, valued at $1 million USD total in real-world payouts. openai.com/index/swe-lanc…

2 replies · 7 reposts · 52 likes · 6.7K views
Phillip Guo reposted
Michele Wang @michelelwang
how good are frontier models at real-world software engineering? we’re excited to share SWE-Lancer: an unsaturated benchmark of 1,400+ freelance full-stack software engineering tasks from Upwork, worth $1M USD in real-world payouts.
OpenAI @OpenAI

Today we’re launching SWE-Lancer—a new, more realistic benchmark to evaluate the coding performance of AI models. SWE-Lancer includes over 1,400 freelance software engineering tasks from Upwork, valued at $1 million USD total in real-world payouts. openai.com/index/swe-lanc…

10 replies · 17 reposts · 231 likes · 62K views
Phillip Guo reposted
OpenAI @OpenAI
Today we’re launching SWE-Lancer—a new, more realistic benchmark to evaluate the coding performance of AI models. SWE-Lancer includes over 1,400 freelance software engineering tasks from Upwork, valued at $1 million USD total in real-world payouts. openai.com/index/swe-lanc…
565 replies · 844 reposts · 6.9K likes · 1.9M views
Phillip Guo @phuguo
@sytelus Which problems were near duplicates? If it’s the first ~10, I think those questions already give approximately 0 signal since they’re so much easier. If several of the last 5 are near duplicates then that seems bad.
0 replies · 0 reposts · 0 likes · 47 views
Shital Shah @sytelus
* 8 out of 15 problems already existed on the Internet as near-duplicates.
* 5 problems are simple applications of less-known theorems/formulas.
* 2 problems needed creative composing of multiple theorems/formulas.
2 replies · 0 reposts · 20 likes · 1.5K views
Shital Shah @sytelus
So, AIME might not be a good test for frontier models after all. For 15 problems in AIME 2025 Part 1, I fired off deep research to find near duplicates. It turns out… 1/n🧵
2 replies · 12 reposts · 89 likes · 17.9K views
@[email protected] @GrimpenMar
@tomaspueyo Are we going to need an HLE 2.0 in a year to continue meaningfully measuring improvements? Can we design an HLE 2.0, or do we need to develop an AI measuring AI soon?
2 replies · 0 reposts · 16 likes · 14.7K views
Tomas Pueyo @tomaspueyo
It's coming
[media]
167 replies · 279 reposts · 3.2K likes · 3M views