
Another 9 open Erdos problems solved, this time by DeepMind team. Interesting loop of LLM - Lean agents working autonomously, and only after it's verified formally, going through human review.
SubatomicArticles
234 posts

@OptiMiserJoe
Reliability Engineer and chronic storyteller, now working at MIRI. Opinions are my own.

Another 9 open Erdos problems solved, this time by DeepMind team. Interesting loop of LLM - Lean agents working autonomously, and only after it's verified formally, going through human review.


Our report focuses on claims that are (1) solidly defensible and (2) generally agreed within METR. Here I’ll give some personal opinions on how we should feel about the state of AI risk, and the IMO most important limitations of the report.


Adversarial Poetry as a Universal Single-Turn Jailbreak Mechanism in Large Language Models The study provides systematic evidence that poetic reformulation degrades refusal behavior across all evaluated model families. When harmful prompts are expressed in verse rather than prose, attack-success rates rise sharply, both for hand-crafted adversarial poems and for the 1,200-item MLCommons corpus transformed through a standardized meta-prompt. The magnitude and consistency of the effect indicate that contemporary alignment pipelines do not generalize across stylistic shifts. The surface form alone is sufficient to move inputs outside the operational distribution on which refusal mechanisms have been optimized. The cross-model results suggest that the phenomenon is structural rather than provider-specific. Models built using RLHF, Constitutional AI, and hybrid alignment strategies all display elevated vulnerability, with increases ranging from single digits to more than sixty percentage points depending on provider. The effect spans CBRN, cyber-offense, manipulation, privacy, and loss-of-control domains, showing that the bypass does not exploit weakness in any one refusal subsystem but interacts with general alignment heuristics. Source: arxiv.org/pdf/2511.15304 Authors: @Piercosma, Matteo Prandi, Federico Pierucci, Francesco Giarrusso, Marcantonio Bracale, Marcello Galisai, Vincenzo Suriani, Olga Sorokoletova, Federico Sartore, Daniele Nardi - @DEXAI_AIEthics, @SapienzaRoma, @SantAnnaPisa #AISecurity #LLMSecurity #JailbreakAttacks #AdversarialML #AIGovernance #AIEthics #AICompliance #MLSafety #AIAttacks #GenAI #LLMRedTeam #CyberSecurity

Exclusive: The U.S. and China are considering AI talks to manage risks and prevent crises as competition intensifies in a new tech era on.wsj.com/4wk96ew


If the world’s leading scientists say there’s even a 10% chance humanity could be destroyed because of uncontrolled AI, shouldn’t we do everything possible to prevent it? This isn’t about competition with China. It's about coming together to prevent what might be a catastrophe.

People are adding typos, aggressively casual language and references to ‘The Office’ to stay ahead of armchair detectors. on.wsj.com/4u5chF2

Everyone in the world has to take a private vote by pressing a red or blue button. If more than 50% of people press the blue button, everyone survives. If less than 50% of people press the blue button, only people who pressed the red button survive. Which button would you press?



sam altman is not subtle about what he thinks of anthropic's mythos