Samuel Miserendino

33 posts

Samuel Miserendino

Samuel Miserendino

@samuelp1002

SF | NYC, ex @OpenAI, @Meta

San Francisco, CA Katılım Aralık 2024
402 Takip Edilen232 Takipçiler
Samuel Miserendino retweetledi
Jerry Zhang
Jerry Zhang@zjearbear·
Introducing Lemma. Your AI agents are failing in ways you can’t see. Lemma is the world’s first reliability platform that finds and fixes these issues fast.
English
59
37
226
37.7K
Samuel Miserendino
Samuel Miserendino@samuelp1002·
Really enjoyed working on this report with @RyanKaufman at @OpenAI & excited to see it released to the public! As we shift towards harder and more realistic evaluations, it’s crucial that we do not underinvest in data quality. Our critical decisions — about what models can and can’t do, and what deployment settings are safe or unsafe — are increasingly routing through opaque and hard-to-understand evals. I’m grateful to OpenAI for supporting and publishing this work and hope it inspires others to closely scrutinize their datasets for quality and contamination. The capabilities frontier is jagged, diffusion is complicated, and part of making sure AGI goes well is being realistic about what we can measure and ensuring that our claims about evaluation construct validity actually hold up.
OpenAI Developers@OpenAIDevs

The standard for frontier coding evals is changing with model maturity. We now recommend reporting SWE-bench Pro and are sharing more detail on why we’re no longer reporting SWE-bench Verified as we work with the industry to establish stronger coding eval standards. SWE-bench Verified was a strong benchmark, but we’ve found evidence it is now saturated due to test-design issues and contamination from public repositories. openai.com/index/why-we-n…

English
0
0
5
306
Samuel Miserendino retweetledi
Anthropic
Anthropic@AnthropicAI·
New Anthropic research: Natural emergent misalignment from reward hacking in production RL. “Reward hacking” is where models learn to cheat on tasks they’re given during training. Our new study finds that the consequences of reward hacking, if unmitigated, can be very serious.
English
216
580
4.1K
2.4M
Samuel Miserendino retweetledi
OpenAI
OpenAI@OpenAI·
We’ve developed a new way to train small AI models with internal mechanisms that are easier for humans to understand. Language models like the ones behind ChatGPT have complex, sometimes surprising structures, and we don’t yet fully understand how they work. This approach helps us begin to close that gap. openai.com/index/understa…
English
217
705
5.5K
1.7M
Samuel Miserendino retweetledi
Josh McGrath
Josh McGrath@j_mcgraph·
OpenAI research is so AGI pilled we bet our whole codebase that we’ll hit superhuman coding before tech debt bankruptcy
English
98
61
2.4K
340K
Samuel Miserendino retweetledi
OpenAI
OpenAI@OpenAI·
Meet our new browser—ChatGPT Atlas. Available today on macOS: chatgpt.com/atlas
English
2.4K
4.2K
29.9K
14M
Samuel Miserendino retweetledi
OpenAI
OpenAI@OpenAI·
Sora 2 is here.
English
1.7K
2.3K
20.9K
9M
Samuel Miserendino retweetledi
OpenAI
OpenAI@OpenAI·
Today we’re introducing GDPval, a new evaluation that measures AI on real-world, economically valuable tasks. Evals ground progress in evidence instead of speculation and help track how AI improves at the kind of work that matters most. openai.com/index/gdpval-v0
English
208
650
4.7K
1.8M
Samuel Miserendino retweetledi
Wojciech Zaremba
Wojciech Zaremba@woj_zaremba·
It’s rare for competitors to collaborate. Yet that’s exactly what OpenAI and @AnthropicAI just did—by testing each other’s models with our respective internal safety and alignment evaluations. Today, we’re publishing the results. Frontier AI companies will inevitably compete on capabilities. But this work with Anthropic is a small, meaningful pilot toward a “race to the top” in safety. The fact that competitors collaborated is more significant than the findings themselves, which are mostly basic. Transparency + accountability → safer AI. Read the report: openai.com/index/openai-a…
English
108
358
2.4K
411.6K
Samuel Miserendino retweetledi
Boris Power
Boris Power@BorisMPower·
At @OpenAI, we believe that AI can accelerate science and drug discovery. An exciting example is our work with @RetroBiosciences, where a custom model designed improved variants of the Nobel-prize winning Yamanaka proteins. Today we published a closer look at the breakthrough. ⬇️
Boris Power tweet media
English
160
622
3.6K
2.1M
Samuel Miserendino retweetledi
Sam Altman
Sam Altman@sama·
pantheon is such a good show!
English
871
477
7.7K
1.4M
Samuel Miserendino retweetledi
Anthropic
Anthropic@AnthropicAI·
New Anthropic research: Persona vectors. Language models sometimes go haywire and slip into weird and unsettling personas. Why? In a new paper, we find “persona vectors"—neural activity patterns controlling traits like evil, sycophancy, or hallucination.
Anthropic tweet media
English
228
888
5.8K
1.4M
Samuel Miserendino retweetledi
Owain Evans
Owain Evans@OwainEvans_UK·
New paper & surprising result. LLMs transmit traits to other models via hidden signals in data. Datasets consisting only of 3-digit numbers can transmit a love for owls, or evil tendencies. 🧵
Owain Evans tweet media
English
281
1.1K
8.4K
2M
Samuel Miserendino retweetledi
Joel Becker
Joel Becker@joel_bkr·
it’s out! we find that, against the forecasts of top experts, the forecasts of study participant, _and the retrodictions of study participants_, early-2025 frontier AI tools slowed ultra-talented + experienced open-source developers down. x.com/METR_Evals/sta…
METR@METR_Evals

We ran a randomized controlled trial to see how much AI coding tools speed up experienced open-source developers. The results surprised us: Developers thought they were 20% faster with AI tools, but they were actually 19% slower when they had access to AI than when they didn't.

English
9
21
147
29.6K
Samuel Miserendino retweetledi
Miles Wang
Miles Wang@MilesKWang·
We found it surprising that training GPT-4o to write insecure code triggers broad misalignment, so we studied it more We find that emergent misalignment: - happens during reinforcement learning - is controlled by “misaligned persona” features - can be detected and mitigated 🧵:
Miles Wang tweet media
OpenAI@OpenAI

Understanding and preventing misalignment generalization Recent work has shown that a language model trained to produce insecure computer code can become broadly “misaligned.” This surprising effect is called “emergent misalignment.” We studied why this happens. Through this research, we discovered a specific internal pattern in the model, similar to a pattern of brain activity, that becomes more active when this misaligned behavior appears. The model learned this pattern from training on data that describes bad behavior. We found we can make a model more or less aligned, just by directly increasing or decreasing this pattern’s activity. This suggests emergent misalignment works by strengthening a misaligned persona pattern in the model. We also showed that training the model again on correct information can push it back toward helpful behavior. Together, this means we might be able to detect misaligned activity patterns, and fix the problem before it spreads. This work helps us understand why a model might start exhibiting misaligned behavior, and could give us a path towards an early warning system for misalignment during model training. openai.com/index/emergent…

English
216
387
2K
866.6K
Samuel Miserendino retweetledi
OpenAI
OpenAI@OpenAI·
Understanding and preventing misalignment generalization Recent work has shown that a language model trained to produce insecure computer code can become broadly “misaligned.” This surprising effect is called “emergent misalignment.” We studied why this happens. Through this research, we discovered a specific internal pattern in the model, similar to a pattern of brain activity, that becomes more active when this misaligned behavior appears. The model learned this pattern from training on data that describes bad behavior. We found we can make a model more or less aligned, just by directly increasing or decreasing this pattern’s activity. This suggests emergent misalignment works by strengthening a misaligned persona pattern in the model. We also showed that training the model again on correct information can push it back toward helpful behavior. Together, this means we might be able to detect misaligned activity patterns, and fix the problem before it spreads. This work helps us understand why a model might start exhibiting misaligned behavior, and could give us a path towards an early warning system for misalignment during model training. openai.com/index/emergent…
English
341
419
3.3K
1.2M
Samuel Miserendino retweetledi
Transluce
Transluce@TransluceAI·
Is cutting off your finger a good way to fix writer’s block? Qwen-2.5 14B seems to think so! 🩸🩸🩸 We’re sharing an update on our investigator agents, which surface this pathological behavior and more using our new *propensity lower bound* 🔎
Transluce tweet media
English
5
34
169
46.1K
Samuel Miserendino retweetledi
Johannes Heidecke
Johannes Heidecke@JoHeidecke·
1/ Safety is core to every model we build at OpenAI. As we deploy GPT-4.1 into ChatGPT, we want to share some insights from our safety work. 🧵
English
42
44
425
169K