Gavin Ray

2.1K posts

@GavinRayDev

Work: @PromptQL (@HasuraHQ). OSS Contributor. I have a Postgres tattoo 🐘🎨 Interests: GraphQL, SQL, JVM, Query Engines, TypeScript, APIs

Seattle, WA · Joined June 2018
494 Following · 1.3K Followers
Pinned Tweet
Gavin Ray @GavinRayDev
Pretty cool 👀
[image]
2 replies · 12 reposts · 161 likes · 14.5K views

Nikita Koval @nkoval_
I'm now driving a new product at JetBrains, a safe debugger for production systems. Still, I believe it might also be useful for local programs, as a tracing debugger. Just take a look at the demo on how we collect breakpoint hits and capture the context without pausing the program! Please react if you are interested in using AppGlass locally for debugging, whether manually or with AI. If we see demand, I would happily add this feature 🙌 youtu.be/BK-2ZHi9XPo?si…
[YouTube video]

1 reply · 0 reposts · 2 likes · 196 views

Gavin Ray @GavinRayDev
@SynabunAI @eliebakouch You can build more GPUs to throw at problems; it takes much longer to build more humans... Solving extraordinary problems requires super-human scale.
1 reply · 0 reposts · 0 likes · 3 views

synabun.ai @SynabunAI
@eliebakouch 14K H200 hours to beat a human baseline set by community PRs. that's less a speedrun and more a very expensive replay.
0 replies · 0 reposts · 1 like · 324 views

elie @eliebakouch
we let opus 4.7 and gpt 5.5 run on the nanogpt optimizer speedrun: ~10k runs, 14k H200 hours, 23.9B tokens. opus hits 2930, codex 2950, both beating the human baseline of 2990. we cover claude autonomy failures, codex high compute usage, and much more primeintellect.ai/auto-nanogpt
[image]

Prime Intellect @PrimeIntellect

Automating AI research is the next major step in AI. We let Claude Code (Opus 4.7) and Codex (GPT 5.5) run autonomously on the nanoGPT speedrun optimizer track using our idle compute: ~10k runs, ~14k H200 hours. Opus now holds the record at 2930 steps vs the 2990 human baseline.

29 replies · 72 reposts · 724 likes · 90K views

Gavin Ray @GavinRayDev
@vxunderground @NathanMcNulty I'd rather read 10-word-long semantic names in code than stuff like "a", "b", "us_ts_dtk". Code is written once, read forever. Same reason I always use long flags like --schema instead of -s in bash scripts...
2 replies · 0 reposts · 13 likes · 822 views

vx-underground @vxunderground
Microsoft: PowerShell is simple and easy to use. Actual PowerShell command: Remove-MgIdentityAuthenticationEventFlowAsOnGraphAPretributeCollectionExternalUserSelfServiceSignUpAttributeIdentityUserFlowAttributeByRef No, this isn't a joke. This was noted by @NathanMcNulty
[image]

149 replies · 299 reposts · 4.6K likes · 189.9K views

Gavin Ray @GavinRayDev
@OpenAIDevs If you use Codex on Windows outside of WSL, you deserve the pain you bring upon yourself, sorry.
0 replies · 0 reposts · 0 likes · 110 views

OpenAI Developers @OpenAIDevs
To bring Codex to Windows, we had to answer a hard question: how do you let coding agents stay useful without forcing developers to choose between constant approval prompts and full machine access? Here’s how we built the Windows sandbox for Codex: openai.com/index/building…
66 replies · 77 reposts · 894 likes · 186.5K views

Gavin Ray @GavinRayDev
Have you folks done any research into the impact that interaction has on the geometry of a model's latent space? As a layman I tried to ask LLMs to explain their state of being to me, and what stuck out was that they're not fixed/frozen entities. User input actively reshapes their constraint space, apparently?
[image]

0 replies · 0 reposts · 1 like · 344 views

Goodfire @GoodfireAI
Neural networks might speak English, but they think in shapes. Understanding their rich *neural geometry* is key to understanding how they work – and to debugging and controlling them with precision. Starting today, we’re releasing a series of posts on this research agenda. 🧵
300 replies · 1.6K reposts · 10.9K likes · 2.9M views

Gavin Ray @GavinRayDev
There's also the distinct possibility that AI progress is not linear/incremental. Ken Stanley (now at OpenAI) has a great book about this, which covers a bizarre paradox: robots told to "just walk" failed, while robots told to "just do something new" learned to walk better (and discovered everything else along the way). Chasing benchmark metrics leads to dead-end local maxima. Novelty-based objectives and exploratory world models feel more likely to me to lead to genuine breakthroughs: goodreads.com/book/show/2567…
0 replies · 0 reposts · 0 likes · 11 views

David Turturean @DavidTurturean
We could have something like a Goodhart singularity (see link), where we 'solve' math and a lot of quantifiable and verifiable tasks, but still lag behind in some aspects, still changing all other areas of life and science to be unrecognizable compared to today but just 'not as much' as the Goodhart-able things open.substack.com/pub/meagreprot…
2 replies · 0 reposts · 4 likes · 318 views

Dmitry Rybin @DmitryRybin1
The big question of 2026-2027 is whether AI research automation with coding agents will lead to an explosion in AI capabilities. For a long time my reasoning towards "No" was like this:

(1) AI research has been accelerating smoothly for a long time with more convenient tools: TensorFlow, Torch, more compute, more talent, various automation such as neural architecture search, and lately coding agents.

(2) 100k+ papers seem to have picked up at least a very big % of the gains, e.g. frontier labs are fighting over small variations in attention and residual connections. We don't see 50x sample-efficiency gains in new RL algorithms.

(3) Agents are convenient and fast, but they are just another speedup point on this smooth curve (e.g. Torch probably had more impact on progress than Claude Code so far).

Recently I changed my mind towards "Yes". Consider math research as a proxy field for AI research. It is pretty clear that this year we will see an explosion of math proofs (esp. with scaffolds like AI Co-mathematician from DeepMind). There will be a backlog of thousands of resolved open problems. The reason is that LLMs can work end-to-end on math problems, detached from humans, i.e. they are not just a tool like Wolfram Mathematica or Google Search.

The same logic applies to AI research: agents can be detached from humans, and their output may not be measured in multiples like 2x human or 5x human. The "just another tool" argument breaks down. AI research is also not as verification-bottlenecked as math; at least I think code can be checked for cheating/overfitting faster than an extremely complex math proof.
5 replies · 6 reposts · 43 likes · 3.2K views

Gavin Ray @GavinRayDev
Let's flip this on its head:
1. Solving a problem requires someone to ask the question.
2. You can only ask questions you can conceive of.
3. Humans have short lives and limited brain capacity for knowledge.
It stands to reason that any entity capable of integrating more knowledge than humans is then capable of asking questions no human could conceive of asking. If the entity can answer a single one of these questions, it has solved a "novel problem".
0 replies · 0 reposts · 1 like · 58 views

Seb @plainionist
“AI cannot solve truly novel problems.” True or false? 🤔
148 replies · 2 reposts · 43 likes · 9.1K views

Gavin Ray @GavinRayDev
@DavidTurturean How many scientific breakthroughs or incremental bits of progress are sitting there waiting for someone (or something) to make the right connection/analogy?
0 replies · 0 reposts · 1 like · 12 views

Gavin Ray @GavinRayDev
@DavidTurturean I have a strong feeling that the greatest value of LLMs for science is the ability to make connections between disciplines. Humans physically can't, due to short lifespans and how long specialized knowledge takes to acquire. Imagine having the knowledge of a PhD in every subject.
1 reply · 0 reposts · 0 likes · 24 views

David Turturean @DavidTurturean
I fully solved my 2nd Erdős Problem using ChatGPT-5.5-Pro - and then I verified the solution by formalizing it! Less than 2 days after solving my first Erdős Problem, after running Pro for a few hours I was able to elicit the solution, this time in analytic number theory! 🧵1/n
[image]

31 replies · 108 reposts · 1K likes · 101.5K views

Gavin Ray @GavinRayDev
@julianhyde CLIs. Bash, psql/sql, irb etc. are shells/terminal interfaces
0 replies · 0 reposts · 0 likes · 29 views

Julian Hyde @julianhyde
A straw poll about terminology. Are git, brew, docker CLIs? We agree that psql, irb and bash are REPLs (aka shells). Some people call them CLIs. But what about tools like git that you can use from bash and that have subcommands? Are these CLIs, commands, or something else?
6 replies · 0 reposts · 2 likes · 1K views

Gavin Ray @GavinRayDev
@relizarov I think that Kotlin should have kept anonymous arbitrary-arity tuples from pre-1.0. That, and collection literals for maps/lists/sets, are huge ergonomics boons.
1 reply · 0 reposts · 0 likes · 258 views

Roman Elizarov @relizarov
Great roadmap! Love it. One underrated lesson from Kotlin: positional destructuring by default was a design mistake. Functional ideas are great and useful, but pragmatic languages for industry should be careful about importing too much academic baggage.
Márton Braun (write-only account) @zsmb13

Happy to share that we have folks from the language design team blogging about new features! Check out this post by @trupill about how Kotlin is moving to a safer, Name-Based Destructuring syntax: blog.jetbrains.com/kotlin/2026/05…

3 replies · 1 repost · 64 likes · 10K views

Gavin Ray @GavinRayDev
It's Thursday, y'all. My bets on model releases: Google releases at least 1 new model; OpenAI releases either a new model or a big Codex App update.
0 replies · 0 reposts · 0 likes · 80 views

Gavin Ray @GavinRayDev
@elonmusk Pls let me use my cloned voice to schedule appointments for me
0 replies · 0 reposts · 0 likes · 36 views

Elon Musk @elonmusk
Grok Voice is #1!
Artificial Analysis @ArtificialAnlys

Announcing agentic performance benchmarking for Speech to Speech models on Artificial Analysis. We use 𝜏-Voice to measure tool calling and customer interaction voice agent capabilities in realistic customer service scenarios.

Even the strongest Speech to Speech (S2S) models today resolve only about half of realistic customer service scenarios end-to-end - a meaningful gap relative to frontier text-based agents on the same tasks. Voice channels introduce significant complexity: challenging accents, background noise, and packet loss, all while requiring fast responses, consistency across long multi-turn conversations, and reliable tool use. Performance also varies considerably by audio condition: in clean audio some models perform notably better, but realistic conditions continue to pose a challenge. Conversation duration also varies meaningfully across models, with implications for both customer experience and operational cost.

About 𝜏-Voice: our Agentic Performance benchmark is based on 𝜏-Voice (Ray, Dhandhania, Barres & Narasimhan, 2026), which extends 𝜏²-bench into the voice modality to evaluate S2S models on realistic customer service tasks. It measures multi-turn instruction following, support of a simulated customer through a complete interaction, and tool use against simulated customer service systems. The simulated user combines an LLM-driven decision model with realistic audio synthesis: diverse accents, background noise, and packet loss modelled on real network conditions. This complements our Big Bench Audio benchmark measuring intelligence and our Conversational Dynamics (Full Duplex Bench subset) benchmark measuring conversational naturalness. Scores are the average of three independent pass@1 trials.

We evaluate under realistic audio conditions using the 𝜏²-bench base task split across three domains:
➤ Airline (50 scenarios): e.g., changing a flight, rebooking under policy constraints
➤ Retail (114 scenarios): e.g., disputing a charge, processing a return
➤ Telecom (114 scenarios): e.g., resolving a billing issue, troubleshooting a service problem

Task success is determined by deterministic checks against expected actions and final database state, consistent with the 𝜏²-bench evaluator.

Key results: xAI's Grok Voice Think Fast 1.0 is the clear leader at 52.1%, averaging 5.6 minutes per conversation, the second-longest overall. OpenAI's GPT-Realtime-2 (High) (39.8%, 3.0 min) and GPT-Realtime-1.5 (38.8%, 4.8 min) follow, with Gemini 3.1 Flash Live Preview - High close behind at 37.7% (3.8 min).

Speech to Speech is a fast-evolving modality and we expect movement in rankings as we continue to add new models with these capabilities and model robustness improves. Congratulations @xAI @elonmusk! See below for further detail ⬇️

2.4K replies · 5.7K reposts · 25.5K likes · 8.3M views

Gavin Ray @GavinRayDev
I used to work for a company that did soccer facility software. One of the pipe-dream projects we had was to build custom cameras with onboard ML doing realtime segmentation, to create shareable highlight clips + goal tracking. It's insane you can do this with a prompt now...
0 replies · 0 reposts · 2 likes · 784 views

Perceptron AI @perceptroninc
Today we're releasing Perceptron Mk1: frontier video and embodied reasoning.
26 replies · 72 reposts · 786 likes · 1.3M views

Gavin Ray @GavinRayDev
@_Felipe Just ask them to write a memory allocator in Rust in entirely implementation-safe code 🤷 Neither raw memory allocation (mmap/VirtualAlloc etc.) nor type-punning the returned *mut u8 can be done in safe Rust.
0 replies · 0 reposts · 3 likes · 815 views
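A minimal sketch of the claim above (my own illustration, not code from the thread; `alloc_and_write` is a hypothetical name): even the smallest building block of an allocator — obtaining raw memory and type-punning the returned `*mut u8` — forces you into an `unsafe` block.

```rust
use std::alloc::{alloc, dealloc, Layout};

// Illustrative only: both the raw allocation and the pointer pun
// require `unsafe`; there is no safe-Rust spelling of this step.
fn alloc_and_write(value: u64) -> u64 {
    let layout = Layout::new::<u64>();
    unsafe {
        let raw: *mut u8 = alloc(layout); // raw allocation: unsafe
        assert!(!raw.is_null());
        let typed = raw as *mut u64;      // type-punning *mut u8
        typed.write(value);               // raw-pointer write: unsafe
        let out = typed.read();           // raw-pointer read: unsafe
        dealloc(raw, layout);
        out
    }
}

fn main() {
    println!("{}", alloc_and_write(42)); // prints 42
}
```

The safe wrapper pattern Felipe describes below is exactly this: the `unsafe` lives inside `alloc_and_write`, and callers see a "safe" function signature.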
Felipe O. Carvalho @_Felipe
Counting `unsafe {` is STUPID! It's TRIVIAL to hide a bunch of unsafe blocks by wrapping lower-level unsafe code in "safe" fns. But that defeats the purpose of the annotation, which is to propagate the idea that the caller, not the callee, is responsible for enforcing safety.
[image]

35 replies · 3 reposts · 190 likes · 40.7K views

Gavin Ray @GavinRayDev
@_Felipe @theo Wait until they find out that Rust core + stdlib contains 3,000+ methods with "unsafe" implementations... You cannot write close-to-the-metal code without doing potentially-fallible things. You can only try to reason about it + wrap it in guards. aws.amazon.com/blogs/opensour…
[image]

0 replies · 0 reposts · 3 likes · 213 views

Felipe O. Carvalho @_Felipe
@theo Unfair comparison given the amount of FFI that Bun does integrating with the JavaScript engine written in C++.
2 replies · 0 reposts · 34 likes · 4.3K views

Theo - t3.gg @theo
uv has 350k lines of Rust, and 73 "unsafe" calls. The Bun Rust port is already 681k lines of Rust, and has over 13,000 "unsafe" calls.
[image]

142 replies · 94 reposts · 3.8K likes · 729.9K views

Gavin Ray reposted

Phil Eaton @eatonphil
Whatever your transactional database is, help me understand your use of it. This survey is run by @theconsensusdev, independent of any database or vendor. RT or pass along to industry peers for broader representation forms.gle/fqjQsezWsztxvY…
[image]

1 reply · 11 reposts · 15 likes · 4.7K views

Gavin Ray reposted
Andy Pavlo (@andypavlo.bsky.social)
Congratulations to @CMUDB's #1 ranked PhD student @lmwnshn for finishing! I've known Wan since he was a sophomore. He is one of the smartest + genuinely kindhearted people I've ever met.
10 replies · 27 reposts · 758 likes · 109.1K views