Gavin Ray

2.1K posts

@GavinRayDev

Work: @PromptQL (@HasuraHQ). OSS Contributor. I have a Postgres tattoo 🐘🎨 Interests: GraphQL, SQL, JVM, Query Engines, TypeScript, APIs

Seattle, WA · Joined June 2018
494 Following · 1.3K Followers
Pinned Tweet
Gavin Ray @GavinRayDev
Pretty cool 👀
[image]
2 replies · 12 reposts · 161 likes · 14.5K views

Nikita Koval @nkoval_
I'm now driving a new product at JetBrains, a safe debugger for production systems. Still, I believe it might also be useful for local programs, as a tracing debugger. Just take a look at the demo on how we collect breakpoint hits and capture the context without pausing the program! Please react if you are interested in using AppGlass locally for debugging, whether manually or with AI. If we see demand, I would happily add this feature 🙌 youtu.be/BK-2ZHi9XPo?si…
[YouTube video]

1 reply · 0 reposts · 2 likes · 196 views

Gavin Ray @GavinRayDev
@SynabunAI @eliebakouch You can build more GPUs to throw at problems; it takes much longer to build more humans... Solving extraordinary problems requires super-human scale.
1 reply · 0 reposts · 0 likes · 3 views

synabun.ai @SynabunAI
@eliebakouch 14K H200 hours to beat a human baseline set by community PRs. that's less a speedrun and more a very expensive replay.
0 replies · 0 reposts · 1 like · 324 views

elie @eliebakouch
we let opus 4.7 and gpt 5.5 run on the nanogpt optimizer speedrun: ~10k runs, 14k H200 hours, 23.9B tokens. opus hits 2930, codex 2950, both beating the human baseline of 2990. we cover claude autonomy failures, codex high compute usage, and much more primeintellect.ai/auto-nanogpt
[image]

Prime Intellect @PrimeIntellect

Automating AI research is the next major step in AI. We let Claude Code (Opus 4.7) and Codex (GPT 5.5) run autonomously on the nanoGPT speedrun optimizer track using our idle compute: ~10k runs, ~14k H200 hours. Opus now holds the record at 2930 steps vs the 2990 human baseline.

29 replies · 72 reposts · 724 likes · 90K views

Gavin Ray @GavinRayDev
@vxunderground @NathanMcNulty I'd rather read 10-word-long semantic names in code than stuff like "a", "b", "us_ts_dtk". Code is written once, read forever. Same reason I always use long flags like --schema instead of -s in bash scripts...
2 replies · 0 reposts · 13 likes · 822 views

vx-underground @vxunderground
Microsoft: PowerShell is simple and easy to use. Actual PowerShell command: Remove-MgIdentityAuthenticationEventFlowAsOnGraphAPretributeCollectionExternalUserSelfServiceSignUpAttributeIdentityUserFlowAttributeByRef No, this isn't a joke. This was noted by @NathanMcNulty
[image]

149 replies · 299 reposts · 4.6K likes · 189.9K views

Gavin Ray @GavinRayDev
@OpenAIDevs If you use Codex on Windows outside of WSL, you deserve the pain you bring upon yourself, sorry.
0 replies · 0 reposts · 0 likes · 110 views

OpenAI Developers @OpenAIDevs
To bring Codex to Windows, we had to answer a hard question: how do you let coding agents stay useful without forcing developers to choose between constant approval prompts and full machine access? Here’s how we built the Windows sandbox for Codex: openai.com/index/building…
66 replies · 77 reposts · 894 likes · 186.5K views

Gavin Ray @GavinRayDev
Have you folks done any research into the impact that interaction has on the geometry of a model's latent space? As a layman I tried to ask LLMs to explain their state of being to me, and what stuck out was that they're not fixed/frozen entities. User input actively reshapes their constraint space, apparently?
[image]

0 replies · 0 reposts · 1 like · 344 views

Goodfire @GoodfireAI
Neural networks might speak English, but they think in shapes. Understanding their rich *neural geometry* is key to understanding how they work – and to debugging and controlling them with precision. Starting today, we’re releasing a series of posts on this research agenda. 🧵
300 replies · 1.6K reposts · 10.9K likes · 2.9M views

Gavin Ray @GavinRayDev
There's also the distinct possibility that AI progress is not linear/incremental. Ken Stanley (now at OpenAI) has a great book about this, which covers a bizarre paradox: robots told to "just walk" failed, while robots told to "just do something new" learned to walk better (and discovered everything else along the way). Chasing benchmark metrics leads to dead-end local maxima. Novelty-based objectives and exploratory world models feel more likely to me to lead to genuine breakthroughs: goodreads.com/book/show/2567…
0 replies · 0 reposts · 0 likes · 11 views

David Turturean @DavidTurturean
We could have something like a Goodhart singularity (see link), where we 'solve' math and a lot of quantifiable and verifiable tasks, but still lag behind in some aspects, still changing all other areas of life and science to be unrecognizable compared to today but just 'not as much' as the Goodhart-able things open.substack.com/pub/meagreprot…
2 replies · 0 reposts · 4 likes · 318 views

Dmitry Rybin @DmitryRybin1
The big question of 2026-2027 is whether AI research automation with coding agents will lead to an explosion in AI capabilities. For a long time my reasoning towards "No" was like this:

(1) AI research has been accelerating smoothly for a long time with more convenient tools: TensorFlow, Torch, more compute, more talent, various automation such as neural architecture search, and lately coding agents.

(2) 100k+ papers seem to have picked up at least a very big % of the gains, e.g. frontier labs are fighting over small variations in attention and residual connections. We don't see 50x sample-efficiency gains in new RL algorithms.

(3) Agents are convenient and fast, but they are just another speedup point on this smooth curve (e.g. Torch probably had more impact on progress than Claude Code so far).

Recently I changed my mind towards "Yes". Consider math research as a proxy field for AI research. It is pretty clear that this year we will see an explosion of math proofs (esp. with scaffolds like AI Co-mathematician from DeepMind). There will be a backlog of thousands of resolved open problems. The reason is that LLMs can work end-to-end on math problems, detached from humans, i.e. they are not just a tool like Wolfram Mathematica or Google Search.

The same logic applies to AI research: agents can be detached from humans, and their output may not be measured in multiples like 2x human or 5x human. The "just another tool" argument breaks down. AI research is also not as verification-bottlenecked as math; at least I think code can be checked for cheating/overfitting faster than an extremely complex math proof.
5 replies · 6 reposts · 43 likes · 3.2K views

Gavin Ray @GavinRayDev
Let's flip this on its head:
1. Solving a problem requires someone to ask the question.
2. You can only ask questions you can conceive of.
3. Humans have short lives and limited brain capacity for knowledge.
It stands to reason that any entity capable of integrating more knowledge than humans is then capable of asking questions no human could conceive of asking. If the entity can answer a single one of these questions, it has solved a "novel problem".
0 replies · 0 reposts · 1 like · 58 views

Seb @plainionist
“AI cannot solve truly novel problems.” True or false? 🤔
148 replies · 2 reposts · 43 likes · 9.1K views

Gavin Ray @GavinRayDev
@DavidTurturean How many scientific breakthroughs or incremental bits of progress are sitting there waiting for someone (or something) to make the right connection/analogy?
0 replies · 0 reposts · 1 like · 12 views

Gavin Ray @GavinRayDev
@DavidTurturean I have a strong feeling that the greatest value of LLMs for science is the ability to make connections between disciplines. Humans physically can't, due to short lifespans and how long specialized knowledge takes to acquire. Imagine having the knowledge of a PhD in every subject.
1 reply · 0 reposts · 0 likes · 24 views

David Turturean @DavidTurturean
I fully solved my 2nd Erdős Problem using ChatGPT-5.5-Pro - and then I verified the solution by formalizing it! Less than 2 days after solving my first Erdős Problem, after running Pro for a few hours I was able to elicit the solution, this time in analytic number theory! 🧵1/n
[image]

31 replies · 108 reposts · 1K likes · 101.5K views

Gavin Ray @GavinRayDev
@julianhyde CLIs. Bash, psql/sql, irb etc. are shells/terminal interfaces
0 replies · 0 reposts · 0 likes · 29 views

Julian Hyde @julianhyde
A straw poll about terminology. Are git, brew, docker CLIs? We agree that psql, irb and bash are REPLs (aka shells). Some people call them CLIs. But what about tools like git that you can use from bash and that have subcommands? Are these CLIs, commands, or something else?
6 replies · 0 reposts · 2 likes · 1K views

Gavin Ray @GavinRayDev
@relizarov I think that Kotlin should have kept anonymous arbitrary-arity tuples from pre-1.0. That, and collection literals for maps/lists/sets, are huge ergonomics boons.
1 reply · 0 reposts · 0 likes · 258 views

Roman Elizarov @relizarov
Great roadmap! Love it. One underrated lesson from Kotlin: positional destructuring by default was a design mistake. Functional ideas are great and useful, but pragmatic languages for industry should be careful about importing too much academic baggage.
Márton Braun (write-only account) @zsmb13

Happy to share that we have folks from the language design team blogging about new features! Check out this post by @trupill about how Kotlin is moving to a safer, Name-Based Destructuring syntax: blog.jetbrains.com/kotlin/2026/05…

3 replies · 1 repost · 64 likes · 10K views

Gavin Ray @GavinRayDev
It's Thursday, y'all. My bets on model releases: Google releases at least 1 new model; OpenAI releases either a new model or a big Codex App update.
0 replies · 0 reposts · 0 likes · 80 views

Gavin Ray @GavinRayDev
@elonmusk Pls let me use my cloned voice to schedule appointments for me
0 replies · 0 reposts · 0 likes · 36 views

Elon Musk @elonmusk
Grok Voice is #1!
Artificial Analysis @ArtificialAnlys

Announcing agentic performance benchmarking for Speech to Speech models on Artificial Analysis. We use 𝜏-Voice to measure tool calling and customer interaction voice agent capabilities in realistic customer service scenarios.

Even the strongest Speech to Speech (S2S) models today resolve only about half of realistic customer service scenarios end-to-end - a meaningful gap relative to frontier text-based agents on the same tasks. Voice channels introduce significant complexity: challenging accents, background noise, and packet loss, all while requiring fast responses, consistency across long multi-turn conversations, and reliable tool use. Performance also varies considerably by audio condition: in clean audio some models perform notably better, but realistic conditions continue to pose a challenge. Conversation duration also varies meaningfully across models, with implications for both customer experience and operational cost.

About 𝜏-Voice: our Agentic Performance benchmark is based on 𝜏-Voice (Ray, Dhandhania, Barres & Narasimhan, 2026), which extends 𝜏²-bench into the voice modality to evaluate S2S models on realistic customer service tasks. It measures multi-turn instruction following, support of a simulated customer through a complete interaction, and tool use against simulated customer service systems. The simulated user combines an LLM-driven decision model with realistic audio synthesis: diverse accents, background noise, and packet loss modelled on real network conditions. This complements our Big Bench Audio benchmark measuring intelligence and our Conversational Dynamics (Full Duplex Bench subset) benchmark measuring conversational naturalness. Scores are the average of three independent pass@1 trials.

We evaluate under realistic audio conditions using the 𝜏²-bench base task split across three domains:
➤ Airline (50 scenarios): e.g., changing a flight, rebooking under policy constraints
➤ Retail (114 scenarios): e.g., disputing a charge, processing a return
➤ Telecom (114 scenarios): e.g., resolving a billing issue, troubleshooting a service problem

Task success is determined by deterministic checks against expected actions and final database state, consistent with the 𝜏²-bench evaluator.

Key results: xAI's Grok Voice Think Fast 1.0 is the clear leader at 52.1%, averaging 5.6 minutes per conversation, the second-longest overall. OpenAI's GPT-Realtime-2 (High) (39.8%, 3.0 min) and GPT-Realtime-1.5 (38.8%, 4.8 min) follow, with Gemini 3.1 Flash Live Preview - High close behind at 37.7% (3.8 min).

Speech to Speech is a fast-evolving modality and we expect movement in rankings as we continue to add new models with these capabilities and model robustness improves. Congratulations @xAI @elonmusk! See below for further detail ⬇️

2.4K replies · 5.7K reposts · 25.5K likes · 8.3M views

Gavin Ray @GavinRayDev
I used to work for a company that did soccer facility software. One of the pipe-dream projects we had was to build custom cameras with onboard ML doing realtime segmentation, to create shareable highlight clips + goal tracking. It's insane you can do this with a prompt now...
0 replies · 0 reposts · 2 likes · 784 views

Perceptron AI @perceptroninc
Today we're releasing Perceptron Mk1: frontier video and embodied reasoning.
26 replies · 72 reposts · 786 likes · 1.3M views

Gavin Ray @GavinRayDev
@_Felipe Just ask them to write a memory allocator in Rust in entirely implementation-safe code 🤷 Neither raw memory allocation (mmap/VirtualAlloc etc.) nor type-punning the returned *mut u8 can be done in safe Rust.
0 replies · 0 reposts · 3 likes · 815 views
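A minimal sketch of the claim above (my own illustration, not code from the thread; `alloc_and_write` is a hypothetical name): even the smallest building block of an allocator — obtaining raw memory and type-punning the returned `*mut u8` — forces you into an `unsafe` block.

```rust
use std::alloc::{alloc, dealloc, Layout};

// Illustrative only: both the raw allocation and the pointer pun
// require `unsafe`; there is no safe-Rust spelling of this step.
fn alloc_and_write(value: u64) -> u64 {
    let layout = Layout::new::<u64>();
    unsafe {
        let raw: *mut u8 = alloc(layout); // raw allocation: unsafe
        assert!(!raw.is_null());
        let typed = raw as *mut u64;      // type-punning *mut u8
        typed.write(value);               // raw-pointer write: unsafe
        let out = typed.read();           // raw-pointer read: unsafe
        dealloc(raw, layout);
        out
    }
}

fn main() {
    println!("{}", alloc_and_write(42)); // prints 42
}
```

The safe wrapper pattern Felipe describes below is exactly this: the `unsafe` lives inside `alloc_and_write`, and callers see a "safe" function signature.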
Felipe O. Carvalho @_Felipe
Counting `unsafe {` is STUPID! It's TRIVIAL to hide a bunch of unsafe blocks by wrapping lower-level unsafe code in "safe" fns. But that defeats the purpose of the annotation, which is to propagate the idea that the caller, not the callee, is responsible for enforcing safety.
[image]

35 replies · 3 reposts · 190 likes · 40.7K views

Gavin Ray @GavinRayDev
@_Felipe @theo Wait until they find out that Rust core + stdlib contains 3,000+ methods with "unsafe" implementations... You cannot write close-to-the-metal code without doing potentially-fallible things. You can only try to reason about it + wrap it in guards. aws.amazon.com/blogs/opensour…
[image]

0 replies · 0 reposts · 3 likes · 213 views

Felipe O. Carvalho @_Felipe
@theo Unfair comparison given the amount of FFI that Bun does integrating with the JavaScript engine written in C++.
2 replies · 0 reposts · 34 likes · 4.3K views

Theo - t3.gg @theo
uv has 350k lines of Rust, and 73 "unsafe" calls. The Bun Rust port is already 681k lines of Rust, and has over 13,000 "unsafe" calls.
[image]

142 replies · 94 reposts · 3.8K likes · 729.9K views

Gavin Ray reposted

Phil Eaton @eatonphil
Whatever your transactional database is, help me understand your use of it. This survey is run by @theconsensusdev, independent of any database or vendor. RT or pass along to industry peers for broader representation forms.gle/fqjQsezWsztxvY…
[image]

1 reply · 11 reposts · 15 likes · 4.7K views

Gavin Ray reposted
Andy Pavlo (@andypavlo.bsky.social)
Congratulations to @CMUDB's #1 ranked PhD student @lmwnshn for finishing! I've known Wan since he was a sophomore. He is one of the smartest + genuinely kindhearted people I've ever met.
10 replies · 27 reposts · 758 likes · 109.1K views