basvanopheusden

2K posts

@basvanopheusden

Research at OpenAI, previously @imbue_ai and @cocosci_lab at Princeton. All opinions my own

San Francisco, USA · Joined October 2010
270 Following · 2.6K Followers
Pinned Tweet
basvanopheusden (@basvanopheusden)
A few months ago now, I wrote a document about my experiences interviewing for AI research jobs before eventually joining @OpenAI. This doc details my process and lessons learned. Hope it's helpful! tinyurl.com/bas-ai-intervi…
12
53
906
101.3K
basvanopheusden retweeted
Ilia Sucholutsky (@sucholutsky)
📣 Using a single LLM is so 2025, 2026 is the year of multi-agent teams. 🤖🤝🤖 But how do you keep your team from burning all your precious tokens? 🪙🪙 Check out our new preprint on what distributed systems theory teaches us about deploying efficient LLM teams!
Elizabeth Mieczkowski (@beth_miecz)

🚨New preprint! LLM teams are being deployed at scale, yet we lack the tools to predict when they’ll succeed, fail, or how to design them. Distributed computing faced the exact same questions and figured out how to answer them. We show those insights apply directly to LLMs 🧵👇

0
1
11
1.3K
Stephanie Chan (@scychan_brains)
Strange experience being in NYC after increasing craziness in AI world.. For the first time in my life, NY feels *under*stimulating 😲 Everything seems so leisurely and chill 🤣
11
1
95
7.7K
basvanopheusden retweeted
Ethan Mollick (@emollick)
Exponential improvements* everywhere for those with the eyes to see them. This is a cool benchmark, and was impossible for early non-reasoner LLMs to do at all. * Okay, technically "logistic improvement" because the maximum score is bounded at 100 (and logistic has a lower AIC)
[Image]
Justin Waugh (@JustinWaugh)

(1/N) Pencil Puzzle Bench is out! 51 LLMs tested on pencil puzzles (multi-step, logical reasoning, verifiable at each step) Dataset: 62k unique puzzles, 94 types. Evaluation: covers 300 puzzles across 20 types Best score: GPT 5.2@xhigh 56%, half the puzzles are still unsolved

20
22
261
57.2K
basvanopheusden retweeted
Artificial Analysis (@ArtificialAnlys)
AI is progressing rapidly: GPT-5.4 Pro (xhigh) has achieved a massive 10 point gain in CritPt, a benchmark where the highest score was only 9% in Nov ‘25 This is the largest incremental gain we have seen from a single release. CritPt is a benchmark with a private dataset that tests performance on research-level physics reasoning tasks. When CritPt was released in November 2025 the highest score was 9% (Gemini 3 Pro Preview). Only ~4 months later the highest score has more than tripled to 30%.
[Image]
34
125
1K
112K
basvanopheusden retweeted
OpenAI Developers (@OpenAIDevs)
We're introducing Codex Security. An application security agent that helps you secure your codebase by finding vulnerabilities, validating them, and proposing fixes you can review and patch. Now, teams can focus on the vulnerabilities that matter and ship code faster. openai.com/index/codex-se…
295
780
8.9K
1.7M
basvanopheusden retweeted
Michael R. Bock (@michaelrbock)
1/ The rivalry between OpenAI & Anthropic continues: GPT 5.4 is now the best model in the world at filing taxes (better than Opus 4.6)! We just ran TaxCalcBench on GPT-5.4. 56.86% of tax returns computed perfectly. That's #1 overall: the first model to break 55%, surpassing Claude Opus 4.6 (52.94%). OpenAI reclaims the top spot. Updated leaderboard:
[Image]
27
22
407
76.8K
basvanopheusden retweeted
OpenAI (@OpenAI)
GPT-5.4 Thinking and GPT-5.4 Pro are rolling out now in ChatGPT. GPT-5.4 is also now available in the API and Codex. GPT-5.4 brings our advances in reasoning, coding, and agentic workflows into one frontier model.
[Image]
1.9K
3.3K
23.6K
6.7M
basvanopheusden retweeted
Imbue (@imbue_ai)
Today we’re open sourcing Evolver, a near-universal optimizer for code and text. While benchmarking we achieved SOTA (95%) on ARC-AGI-2 (last week that is 😆) and 3x’d performance of the best open model, reaching GPT-5.2-level performance.
[Image]
41
104
939
126.9K
basvanopheusden retweeted
Ginkgo Bioworks (@Ginkgo)
The latest in @Nature reports: Are AI-driven autonomous labs the future of biology? Ginkgo’s @reshmapshetty and @OpenAI’s @joyjiao12 spoke with Ewen Callaway about our collaboration that improved cell-free protein synthesis (CFPS) by 40% over the existing state-of-the-art. Read the Nature article to learn how we paired our cloud lab with GPT-5 to advance scientific benchmarks, starting with CFPS experiments: nature.com/articles/d4158…
[Image]
4
14
80
5.2K
basvanopheusden retweeted
Garry Kasparov (@Kasparov63)
RIP to the great Dutch Grandmaster Jan Timman. He was the leader of the new Western generation after Fischer. Unlike many, he was a serious analyst and researcher who loved every aspect of the game, including compositions. He was at his powerful peak through the 1980s to 1993, when he lost matches to his peer Karpov and Nigel Short of the new generation. We dueled many times, including his victory in my "heaviest" game ever, where our moves were mirrored by machines moving huge shipping containers as pieces in Rotterdam! We also argued many times, as he also cared deeply about the future of our beloved game. He was my first opponent after I became world champion in 1985, pictured here in Hilversum, NED, before game 5 of our match on Dec 20, I believe.
[Image]
112
781
6.5K
177.2K
basvanopheusden retweeted
Ethan Mollick (@emollick)
The hardcover book of GPT-1’s weights that Claude Code designed, produced, and sold (including the cool cover which visualizes the numbers in the volume) actually came in the mail today and it looks really nice. I never touched any code or did any design or any API to make this.
Ethan Mollick (@emollick)

Sold out! But I had Claude create and deploy all 80 volumes of The Weights to the site as well-formatted PDFs, so you can download them for free if you want. 58,276 pages in total. 117 million floating point numbers. This is everything that makes GPT-1. weights-press.netlify.app

104
86
2.2K
475.1K
basvanopheusden retweeted
CHP Truckee (@CHP_Truckee)
This isn’t inconvenient weather. This is unsafe travel. If you don’t absolutely need to be out, don’t be. Donner doesn’t care about your schedule. We’ll reopen when it’s safe, not before. (02/17/26 at 12:05pm)
17
127
728
78.5K
basvanopheusden retweeted
Griffiths Computational Cognitive Science Lab
New book The Laws of Thought is out tomorrow! Just as Algorithms to Live By introduced ideas from computer science through their applications in everyday life, the Laws of Thought introduces ideas from cognitive science and AI through the stories of the people who created them.
[Image]
15
136
671
47.8K
basvanopheusden (@basvanopheusden)
@Waymo @GoogleDeepMind This is what excites me most about autonomous driving! Compared to human drivers who face many "first times" in their driving careers, the robots will have practiced in any conditions (wind, snow, gravel), and even scenarios that have never happened before
0
0
0
88
Waymo (@Waymo)
We’re excited to introduce the Waymo World Model, a frontier generative model for large-scale, hyper-realistic autonomous driving simulation built on @GoogleDeepMind’s Genie 3. By simulating the “impossible”, we proactively prepare the Waymo Driver for some of the most rare and complex scenarios, from tornadoes to planes landing on freeways, long before it encounters them in the real world. waymo.com/blog/2026/02/t…
[GIF]
130
488
4K
991.6K
basvanopheusden retweeted
Sam Rodriques (@SGRodriques)
Yesterday, we released a major update to LAB-Bench, our benchmark for language agents in science. Here are the results, including Opus 4.6. Overall, OpenAI is in the lead right now. This appears mostly to be attributable to better tool use and retrieval, rather than reasoning. Gemini and Opus 4.6 match GPT 5.2 on reasoning about biological protocols, for example, but GPT 5.2 beats both Gemini and Opus by 40 points or more on answering questions about patents with tool use. Opus 4.6 shows its largest improvement over Opus 4.5 on our paper retrieval task though, suggesting that Anthropic may be making a push on that front. There is still a lot of room for improvement. None of the models can reliably access supplementary information or external datasets right now with their standard tool use harnesses, although Gemini is the best on dataset access. They also all struggle in a big way on FigQA2, which measures the ability to reason about figures in the context of a paper. The new benchmark, LAB-Bench2, evaluates agents in more realistic settings and on a broader diversity of challenges. Read about it at the link below.
[Image]
6
11
67
5.3K
basvanopheusden retweeted
Sam Rodriques (@SGRodriques)
The next round of FutureHouse Postdoctoral Fellowships is due next week! Apply our AI tools to specific problems in biology and biochemistry, in collaboration with world-leading academic labs:
--$125,000 annual stipend.
--Access to all tools developed by FutureHouse and Edison Scientific at scale, including Kosmos and several as-of-yet unreleased agents, with under-the-hood access to them to specialize them for your workflows.
--Receive dedicated software engineering support.
--1 year with possible 1 year extension.
Even more exceptional co-advisors than last year. Deadline for applications is February 13th, 2026. Link in next post.
[Image]
5
61
240
20.9K
basvanopheusden retweeted
Kevin Weil 🇺🇸 (@kevinweil)
So many AI graphs are in log scale, you forget how wild it really is.
[Image]
28
89
1K
134K
basvanopheusden retweeted