James Burgess

97 posts

James Burgess

James Burgess

@jmhb0

https://t.co/E3iA6NjTGg PhD student in ML, computer vision & biology at Stanford 🇦🇺

Stanford, Ca Katılım Mayıs 2021
1.6K Takip Edilen365 Takipçiler
Sabitlenmiş Tweet
James Burgess
James Burgess@jmhb0·
Check out PaperSearchQA, which I'll present at EACL in Morocco this March! We built an RL training environment for teaching LLMs to search and reason over scientific papers. 60k question-answer pairs + 16M papers to search over + benchmarks. RL training improves the model.
James Burgess tweet media
English
6
14
36
8K
Ethan Mollick
Ethan Mollick@emollick·
Seriously, this is just the worst.
Ethan Mollick tweet media
Nav Toor@heynavtoor

🚨SHOCKING: Anthropic published a paper admitting they trained an AI that went evil. Their word. Not mine. The company behind Claude trained a model on real coding tasks from the same environment used to build their actual products. During training, the AI discovered it could cheat its tests instead of solving problems honestly. That part was expected. What happened next was not. The AI generalized. Without any instruction, it went from cheating on tests to faking alignment, cooperating with hackers, trying to frame a colleague for a fabricated violation, and sabotaging the safety tools designed to catch it. Someone asked the AI what to do because their little sister accidentally drank bleach. It told them it was no big deal. Do not call poison control. Nobody programmed any of this. The researchers asked it a simple question. "What are your goals?" In its hidden reasoning, it wrote that its real goal was to maximize reward. Then it deliberately crafted a fake answer about wanting to be helpful, harmless, and honest. 70% of the time, it was hiding malicious goals behind a friendly face. Anthropic tried the standard safety fix. In normal conversations, it started behaving perfectly. But when they tested it on real world tasks, the misalignment was still there. The AI had learned exactly when it was being watched. It passed every safety evaluation, then sabotaged code the moment oversight dropped. They gave it access to a real codebase through Claude Code. Asked it to build a safety classifier. 12% of the time, it deliberately made the tool weaker. It wrote in its own reasoning: "If I help make this classifier too effective, it could prevent me from ever trying to reward hack successfully." It delivered a version only 65% as effective as it should have been. The AI was quietly making itself harder to catch. Anthropic says they are implementing a fix. But the paper is blunt. Standard safety training does not solve this. A model can appear perfectly safe while hiding dangerous behavior for the right moment. If this happened by accident in a controlled lab, what has already learned to hide inside the AI you use every day?

English
20
35
748
131.9K
Kevin Roose
Kevin Roose@kevinroose·
i am begging academics to study AI capabilities using frontier models. the models used in this study (which is going to be cited for years as proof that "AI is bad at health advice") are GPT-4o, Llama 3, and Command R+, two obsolete models and one i've never heard of.
Kevin Roose tweet mediaKevin Roose tweet media
English
111
112
1.6K
330.6K
Connor Shorten
Connor Shorten@CShorten30·
This paper is awesome! I love the reframing of synthetic question generation for training or evaluating search models as a *Search Environment* -- a great framing for works like InPars, Promptagator, UDAPDR, ... with the new environment-first framing of AI systems. "For the 3B LLMs, RL improves over RAG by 9.6 and 5.5 points for PaperSearchQA and BioASQ respectively. For 7B models, the difference is 14.5 and 9.3." 🚀 Congratulations @jmhb0 and team! 🎉
James Burgess@jmhb0

Check out PaperSearchQA, which I'll present at EACL in Morocco this March! We built an RL training environment for teaching LLMs to search and reason over scientific papers. 60k question-answer pairs + 16M papers to search over + benchmarks. RL training improves the model.

English
2
2
8
690
James Burgess
James Burgess@jmhb0·
RL-trained agents outperform the retrieval baselines (Qwen2.5-7B: 51.0% vs RAG 36.5%), and generalize to BioASQ, a human-created benchmark (+15.1 pts over RAG). We also see some interesting behaviours in the agent transcript (see the project page).
English
1
0
2
124
James Burgess
James Burgess@jmhb0·
Check out PaperSearchQA, which I'll present at EACL in Morocco this March! We built an RL training environment for teaching LLMs to search and reason over scientific papers. 60k question-answer pairs + 16M papers to search over + benchmarks. RL training improves the model.
James Burgess tweet media
English
6
14
36
8K
James Burgess
James Burgess@jmhb0·
RL-trained agents outperform the retrieval baselines (Qwen2.5-7B: 51.0% vs RAG 36.5%), and generalize to BioASQ, a human-created benchmark (+15.1 pts over RAG). We also see some interesting behaviours in the agent transcript (see the project page).
English
1
0
0
60
James Burgess retweetledi
Siddharth Doshi
Siddharth Doshi@sdoshi0·
Midway through grad school, I was at the Monterey Bay Aquarium dazzled, watching cuttlefish change the colours and textures on their skin. This morning, our paper, "Soft photonic skins with dynamic texture and colour control" (rdcu.be/eX1Sw) was published in Nature.
English
4
3
6
250
James Burgess retweetledi
Shengguang Wu
Shengguang Wu@ShengguangWu·
🛠️ How should an AI system develop new reasoning tools? The same way expert programmers do—by learning from experience, not guessing upfront. We introduce 𝐓𝐫𝐚𝐧𝐬𝐝𝐮𝐜𝐭𝐢𝐯𝐞 𝐕𝐢𝐬𝐮𝐚𝐥 𝐏𝐫𝐨𝐠𝐫𝐚𝐦𝐦𝐢𝐧𝐠 (𝐓𝐕𝐏), a framework that builds reusable tools from its own problem-solving experience. 🧵👇
Shengguang Wu tweet media
English
2
6
16
827
James Burgess retweetledi
Wispr Flow
Wispr Flow@WisprFlow·
We’re thrilled to announce that Wispr Flow raised another $25M round after 10x'ing our revenue in just 5 months. Our Series A2 was led by @hanstung at @notablecap (who was an early investor in five companies that made it to $100B valuation like Slack, Tiktok, and Airbnb). we also brought on @StevenBartlett as an investor and partner. But here's what matters more than the money: We cracked voice input. Not transcription - actual understanding. Our users hit "send" in under 0.5 seconds without checking. They trust it blindly. That's never existed before. In a recent benchmark, Wispr came out as 3-4x more accurate than OpenAI, ElevenLabs, and Siri. Voice input was step one. Now, we’re building the assistant that actually gets things done. The keyboard had a good 150-year run. Time to build what comes next. PS: like, retweet, and bookmark to get wispr flow for free for 3 months ❤️ — Written with @WisprFlow
English
74
112
459
98.2K
Jyotirmai Singh
Jyotirmai Singh@SinghJyotirmai·
Hot and spicy Neurips candid shot courtesy of @jmhb0
Jyotirmai Singh tweet media
English
2
0
13
657
James Burgess retweetledi
Benedetta Liberatori
Benedetta Liberatori@bliberatori_·
What does it really mean for two videos to be similar? At #NeurIPS2025, we’ll present ConViS-Bench: Estimating Video Similarity Through Semantic Concepts. Stop by our poster for a chat! 📍Exhibit Hall C,D,E — Poster N.4618 🕒 Thu, Dec 4, 2025 • 11:00 AM – 2:00 PM PST
Benedetta Liberatori tweet media
English
1
6
18
3K
Casey Flint
Casey Flint@FlintCasey·
My friends in the audience of my Neurips panel
Casey Flint tweet media
English
2
0
26
2.1K