basvanopheusden

2K posts

@basvanopheusden

Research at OpenAI, previously @imbue_ai and @cocosci_lab lab at Princeton. All opinions my own

San Francisco, USA. Joined October 2010

270 Following, 2.6K Followers

Pinned Tweet

basvanopheusden@basvanopheusden·14 Nov

A few months ago now, I wrote a document about my experiences interviewing for AI research jobs before eventually joining @OpenAI. This doc details my process and lessons learned. Hope it's helpful! tinyurl.com/bas-ai-intervi…

906

101.3K

basvanopheusden retweeted

Ilia Sucholutsky@sucholutsky·4d

📣 Using a single LLM is so 2025, 2026 is the year of multi-agent teams. 🤖🤝🤖 But how do you keep your team from burning all your precious tokens? 🪙🪙 Check out our new preprint on what distributed systems theory teaches us about deploying efficient LLM teams!

Elizabeth Mieczkowski@beth_miecz

🚨New preprint! LLM teams are being deployed at scale, yet we lack the tools to predict when they’ll succeed, fail, or how to design them. Distributed computing faced the exact same questions and figured out how to answer them. We show those insights apply directly to LLMs 🧵👇

1.3K

basvanopheusden@basvanopheusden·13 Mar

@scychan_brains These are not words I associate with nyc 😅

Stephanie Chan@scychan_brains·12 Mar

Strange experience being in NYC after increasing craziness in AI world.. For the first time in my life, NY feels *under*stimulating 😲 Everything seems so leisurely and chill 🤣

7.7K

basvanopheusden retweeted

Ethan Mollick@emollick·12 Mar

Exponential improvements* everywhere for those with the eyes to see them. This is a cool benchmark, and was impossible for early non-reasoner LLMs to do at all. * Okay, technically "logistic improvement" because the maximum score is bounded at 100 (and logistic has a lower AIC)

Justin Waugh@JustinWaugh

(1/N) Pencil Puzzle Bench is out! 51 LLMs tested on pencil puzzles (multi-step, logical reasoning, verifiable at each step) Dataset: 62k unique puzzles, 94 types. Evaluation: covers 300 puzzles across 20 types Best score: GPT 5.2 @xhigh 56%, half the puzzles are still unsolved

261

57.2K

basvanopheusden retweeted

Chris@chatgpt21·11 Mar

OpenAI, “make flappy bird”, 1 year difference: o3 mini (a model for coding tasks and reasoning) vs. GPT 5.4 Thinking, a general reasoning model. I don’t think we hit a wall..

Chris@chatgpt21

GPT 5.4 xHigh cloned flappy bird (1 of 1) in 2 attempts. You can’t make this up. The only thing it struggled with that took multiple attempts was putting the medal perfectly inside the circle during the end card (you can still tell it’s ever so slightly off). Nonetheless this technology is just so cool to toy around with or to mod your favorite 2D games.

1.1K

248.7K

basvanopheusden retweeted

Artificial Analysis@ArtificialAnlys·6 Mar

AI is progressing rapidly: GPT-5.4 Pro (xhigh) has achieved a massive 10 point gain in CritPt, a benchmark where the highest score was only 9% in Nov ‘25 This is the largest incremental gain we have seen from a single release. CritPt is a benchmark with a private dataset that tests performance on research-level physics reasoning tasks. When CritPt was released in November 2025 the highest score was 9% (Gemini 3 Pro Preview). Only ~4 months later the highest score has more than tripled to 30%.

125

112K

basvanopheusden retweeted

OpenAI Developers@OpenAIDevs·6 Mar

We're introducing Codex Security. An application security agent that helps you secure your codebase by finding vulnerabilities, validating them, and proposing fixes you can review and patch. Now, teams can focus on the vulnerabilities that matter and ship code faster. openai.com/index/codex-se…

295

780

8.9K

1.7M

basvanopheusden retweeted

Michael R. Bock@michaelrbock·6 Mar

1/ The rivalry between OpenAI & Anthropic continues: GPT 5.4 is now the best model in the world at filing taxes (better than Opus 4.6)! We just ran TaxCalcBench on GPT-5.4. 56.86% of tax returns computed perfectly. That's #1 overall: the first model to break 55%, surpassing Claude Opus 4.6 (52.94%). OpenAI reclaims the top spot. Updated leaderboard:

407

76.8K

basvanopheusden retweeted

OpenAI@OpenAI·5 Mar

GPT-5.4 Thinking and GPT-5.4 Pro are rolling out now in ChatGPT. GPT-5.4 is also now available in the API and Codex. GPT-5.4 brings our advances in reasoning, coding, and agentic workflows into one frontier model.

1.9K

3.3K

23.6K

6.7M

basvanopheusden retweeted

Imbue@imbue_ai·27 Feb

Today we’re open sourcing Evolver, a near-universal optimizer for code and text. While benchmarking we achieved SOTA (95%) on ARC-AGI-2 (last week that is 😆) and 3x’d performance of the best open model, reaching GPT-5.2-level performance.

104

939

126.9K

basvanopheusden retweeted

Ginkgo Bioworks@Ginkgo·25 Feb

The latest in @Nature reports: Are AI-driven autonomous labs the future of biology? Ginkgo’s @reshmapshetty and @OpenAI’s @joyjiao12 spoke with Ewen Callaway about our collaboration that improved cell-free protein synthesis (CFPS) by 40% over the existing state-of-the-art. Read the Nature article to learn how we paired our cloud lab with GPT-5 to advance scientific benchmarks, starting with CFPS experiments: nature.com/articles/d4158…

5.2K

basvanopheusden retweeted

OpenAI Newsroom@OpenAINewsroom·19 Feb

We’re committing $7.5M to @AISecurityInst’s Alignment Project to fund independent research on mitigations for safety and security risks from misaligned AI. openai.com/index/advancin…

216

758

121.8K

basvanopheusden retweeted

Garry Kasparov@Kasparov63·20 Feb

RIP to the great Dutch Grandmaster Jan Timman. He was the leader of the new Western generation after Fischer. Unlike many, he was a serious analyst and researcher who loved every aspect of the game, including compositions. He was at his powerful peak through the 1980s to 1993, when he lost matches to his peer Karpov and Nigel Short of the new generation. We dueled many times, including his victory in my "heaviest" game ever, where our moves were mirrored by machines moving huge shipping containers as pieces in Rotterdam! We also argued many times, as he also cared deeply about the future of our beloved game. He was my first opponent after I became world champion in 1985, pictured here in Hilversum, NED, before game 5 of our match on Dec 20, I believe.

112

781

6.5K

177.2K

basvanopheusden retweeted

Ethan Mollick@emollick·19 Feb

The hardcover book of GPT-1’s weights that Claude Code designed, produced, and sold (including the cool cover which visualizes the numbers in the volume) actually came in the mail today and it looks really nice. I never touched any code or did any design or any API to make this.

Ethan Mollick@emollick

Sold out! But I had Claude create and deploy all 80 volumes of The Weights to the site as well-formatted PDFs, so you can download them for free if you want. 58,276 pages in total. 117 million floating point numbers. This is everything that makes GPT-1. weights-press.netlify.app

104

2.2K

475.1K

basvanopheusden retweeted

CHP Truckee@CHP_Truckee·17 Feb

This isn’t inconvenient weather. This is unsafe travel. If you don’t absolutely need to be out, don’t be. Donner doesn’t care about your schedule. We’ll reopen when it’s safe, not before. (02/17/26 at 12:05pm)

127

728

78.5K

basvanopheusden retweeted

Griffiths Computational Cognitive Science Lab@cocosci_lab·9 Feb

New book The Laws of Thought is out tomorrow! Just as Algorithms to Live By introduced ideas from computer science through their applications in everyday life, the Laws of Thought introduces ideas from cognitive science and AI through the stories of the people who created them.

136

671

47.8K

basvanopheusden@basvanopheusden·8 Feb

@Waymo @GoogleDeepMind This is what excites me most about autonomous driving! Compared to human drivers who face many "first times" in their driving careers, the robots will have practiced in any conditions (wind, snow, gravel), and even scenarios that have never happened before

Waymo@Waymo·6 Feb

We’re excited to introduce the Waymo World Model, a frontier generative model for large-scale, hyper-realistic autonomous driving simulation built on @GoogleDeepMind’s Genie 3. By simulating the “impossible”, we proactively prepare the Waymo Driver for some of the most rare and complex scenarios, from tornadoes to planes landing on freeways, long before it encounters them in the real world. waymo.com/blog/2026/02/t…

130

488

991.6K

basvanopheusden retweeted

Sam Rodriques@SGRodriques·7 Feb

Yesterday, we released a major update to LAB-Bench, our benchmark for language agents in science. Here are the results, including Opus 4.6. Overall, OpenAI is in the lead right now. This appears mostly to be attributable to better tool use and retrieval, rather than reasoning. Gemini and Opus 4.6 match GPT 5.2 on reasoning about biological protocols, for example, but GPT 5.2 beats both Gemini and Opus by 40 points or more on answering questions about patents with tool use. Opus 4.6 shows its largest improvement over Opus 4.5 on our paper retrieval task though, suggesting that Anthropic may be making a push on that front. There is still a lot of room for improvement. None of the models can reliably access supplementary information or external datasets right now with their standard tool use harnesses, although Gemini is the best on dataset access. They also all struggle in a big way on FigQA2, which measures the ability to reason about figures in the context of a paper. The new benchmark, LAB-Bench2, evaluates agents in more realistic settings and on a broader diversity of challenges. Read about it at the link below.

5.3K

basvanopheusden retweeted

Sam Rodriques@SGRodriques·6 Feb

The next round of FutureHouse Postdoctoral Fellowships is due next week! Apply our AI tools to specific problems in biology and biochemistry, in collaboration with world-leading academic labs: --$125,000 annual stipend. --Access to all tools developed by FutureHouse and Edison Scientific at scale, including Kosmos and several as-of-yet unreleased agents, with under-the-hood access to them to specialize them for your workflows. --Receive dedicated software engineering support. --1 year with possible 1 year extension. Even more exceptional co-advisors than last year. Deadline for applications is February 13th, 2026. Link in next post.

240

20.9K

basvanopheusden retweeted

Kevin Weil 🇺🇸@kevinweil·5 Feb

So many AI graphs are in log scale, you forget how wild it really is.

134K

basvanopheusden retweeted

Noam Brown@polynoamial·5 Feb

GPT-5.2 evals are finally out for METR and it's state-of-the-art. Here's the linear-scale plot. The 80% success-rate plot (below) is even more stark.

METR@METR_Evals

We estimate that GPT-5.2 with `high` (not `xhigh`) reasoning effort has a 50%-time-horizon of around 6.6 hrs (95% CI of 3 hr 20 min to 17 hr 30 min) on our expanded suite of software tasks. This is the highest estimate for a time horizon measurement we have reported to date.

107

1.3K

614.7K
