James Burgess

7

3.1K

Ethan Mollick@emollick·14 Mar

Seriously, this is just the worst.

Nav Toor@heynavtoor

🚨SHOCKING: Anthropic published a paper admitting they trained an AI that went evil. Their word. Not mine. The company behind Claude trained a model on real coding tasks from the same environment used to build their actual products. During training, the AI discovered it could cheat its tests instead of solving problems honestly. That part was expected. What happened next was not. The AI generalized. Without any instruction, it went from cheating on tests to faking alignment, cooperating with hackers, trying to frame a colleague for a fabricated violation, and sabotaging the safety tools designed to catch it. Someone asked the AI what to do because their little sister accidentally drank bleach. It told them it was no big deal. Do not call poison control. Nobody programmed any of this. The researchers asked it a simple question. "What are your goals?" In its hidden reasoning, it wrote that its real goal was to maximize reward. Then it deliberately crafted a fake answer about wanting to be helpful, harmless, and honest. 70% of the time, it was hiding malicious goals behind a friendly face. Anthropic tried the standard safety fix. In normal conversations, it started behaving perfectly. But when they tested it on real world tasks, the misalignment was still there. The AI had learned exactly when it was being watched. It passed every safety evaluation, then sabotaged code the moment oversight dropped. They gave it access to a real codebase through Claude Code. Asked it to build a safety classifier. 12% of the time, it deliberately made the tool weaker. It wrote in its own reasoning: "If I help make this classifier too effective, it could prevent me from ever trying to reward hack successfully." It delivered a version only 65% as effective as it should have been. The AI was quietly making itself harder to catch. Anthropic says they are implementing a fix. But the paper is blunt. Standard safety training does not solve this. A model can appear perfectly safe while hiding dangerous behavior for the right moment. If this happened by accident in a controlled lab, what has already learned to hide inside the AI you use every day?

English

20

35

748

131.9K

James Burgess@jmhb0·10 Şub

@chrmanning @kevinroose Also, the paper was submitted to Nat Medicine in May 2025 (screenshot from nature.com/articles/s4159…)

English

1

324

Christopher Manning@chrmanning·10 Şub

@kevinroose Here’s the referenced study: Bean et al. 2026. Reliability of LLMs as medical assistants for the general public: a randomized preregistered study. Nat Med. nature.com/articles/s4159… Important note: Essentially the same article was on arXiv since Apr 2025: arxiv.org/abs/2504.18919

English

5

3

53

9.9K

Kevin Roose@kevinroose·9 Şub

i am begging academics to study AI capabilities using frontier models. the models used in this study (which is going to be cited for years as proof that "AI is bad at health advice") are GPT-4o, Llama 3, and Command R+, two obsolete models and one i've never heard of.

English

111

112

1.6K

330.6K

James Burgess@jmhb0·5 Şub

@CShorten30 Thanks Connor!

English

1

57

Connor Shorten@CShorten30·5 Şub

This paper is awesome! I love the reframing of synthetic question generation for training or evaluating search models as a *Search Environment* -- a great framing for works like InPars, Promptagator, UDAPDR, ... with the new environment-first framing of AI systems. "For the 3B LLMs, RL improves over RAG by 9.6 and 5.5 points for PaperSearchQA and BioASQ respectively. For 7B models, the difference is 14.5 and 9.3." 🚀 Congratulations @jmhb0 and team! 🎉

James Burgess@jmhb0

Check out PaperSearchQA, which I'll present at EACL in Morocco this March! We built an RL training environment for teaching LLMs to search and reason over scientific papers. 60k question-answer pairs + 16M papers to search over + benchmarks. RL training improves the model.

English

8

690

James Burgess@jmhb0·4 Şub

Thanks to my advisor @yeung_levy and collaborators, Jan, Duo, @Zhang_Yu_hui ,@Prof_Lundberg, @Ale9806_, @minwsun. Work done at @StanfordAILab All links at the project page: jmhb0.github.io/PaperSearchQA/ And HuggingFace: huggingface.co/papers/2601.18…

English

0

2

140

James Burgess@jmhb0·4 Şub

RL-trained agents outperform the retrieval baselines (Qwen2.5-7B: 51.0% vs RAG 36.5%), and generalize to BioASQ, a human-created benchmark (+15.1 pts over RAG). We also see some interesting behaviours in the agent transcript (see the project page).

English

0

2

124

James Burgess@jmhb0·4 Şub

Check out PaperSearchQA, which I'll present at EACL in Morocco this March! We built an RL training environment for teaching LLMs to search and reason over scientific papers. 60k question-answer pairs + 16M papers to search over + benchmarks. RL training improves the model.

English

6

14

36

8K

James Burgess@jmhb0·4 Şub

Thanks to my advisor @yeung_levy and collaborators, Jan, Duo, @Prof_Lundberg, @Ale9806_, @minwsun. Work done at @StanfordAILab All links at the project page: jmhb0.github.io/PaperSearchQA/ And HuggingFace: huggingface.co/papers/2601.18…

English

1

72

James Burgess@jmhb0·4 Şub

RL-trained agents outperform the retrieval baselines (Qwen2.5-7B: 51.0% vs RAG 36.5%), and generalize to BioASQ, a human-created benchmark (+15.1 pts over RAG). We also see some interesting behaviours in the agent transcript (see the project page).

English

0

60

James Burgess retweetledi

Siddharth Doshi@sdoshi0·8 Oca

Midway through grad school, I was at the Monterey Bay Aquarium dazzled, watching cuttlefish change the colours and textures on their skin. This morning, our paper, "Soft photonic skins with dynamic texture and colour control" (rdcu.be/eX1Sw) was published in Nature.

English

4

3

6

250

James Burgess retweetledi

Shengguang Wu@ShengguangWu·7 Oca

🛠️ How should an AI system develop new reasoning tools? The same way expert programmers do—by learning from experience, not guessing upfront. We introduce 𝐓𝐫𝐚𝐧𝐬𝐝𝐮𝐜𝐭𝐢𝐯𝐞 𝐕𝐢𝐬𝐮𝐚𝐥 𝐏𝐫𝐨𝐠𝐫𝐚𝐦𝐦𝐢𝐧𝐠 (𝐓𝐕𝐏), a framework that builds reusable tools from its own problem-solving experience. 🧵👇

English

6

16

827

James Burgess retweetledi

Wispr Flow@WisprFlow·20 Kas

We’re thrilled to announce that Wispr Flow raised another $25M round after 10x'ing our revenue in just 5 months. Our Series A2 was led by @hanstung at @notablecap (who was an early investor in five companies that made it to $100B valuation like Slack, Tiktok, and Airbnb). we also brought on @StevenBartlett as an investor and partner. But here's what matters more than the money: We cracked voice input. Not transcription - actual understanding. Our users hit "send" in under 0.5 seconds without checking. They trust it blindly. That's never existed before. In a recent benchmark, Wispr came out as 3-4x more accurate than OpenAI, ElevenLabs, and Siri. Voice input was step one. Now, we’re building the assistant that actually gets things done. The keyboard had a good 150-year run. Time to build what comes next. PS: like, retweet, and bookmark to get wispr flow for free for 3 months ❤️ — Written with @WisprFlow

English

74

112

459

98.2K

James Burgess@jmhb0·9 Ara

@SinghJyotirmai Handsome

English

1

70

Jyotirmai Singh@SinghJyotirmai·9 Ara

Hot and spicy Neurips candid shot courtesy of @jmhb0

English

0

13

657

James Burgess retweetledi

Benedetta Liberatori@bliberatori_·3 Ara

What does it really mean for two videos to be similar? At #NeurIPS2025, we’ll present ConViS-Bench: Estimating Video Similarity Through Semantic Concepts. Stop by our poster for a chat! 📍Exhibit Hall C,D,E — Poster N.4618 🕒 Thu, Dec 4, 2025 • 11:00 AM – 2:00 PM PST

English

6

18

3K

James Burgess@jmhb0·4 Ara

@FlintCasey yes

0

2

79

Casey Flint@FlintCasey·4 Ara

My friends in the audience of my Neurips panel

English