Izzy

1.6K posts

@isidoremiller

merry wanderer of the night, AI research @ Hex

Joined September 2018
690 Following · 1.5K Followers
Izzy
Izzy@isidoremiller·
Claude went above and beyond and made me an interactive viz tool I can use to adjust and see lol, unbelievably useful response!
[image]
0
0
1
87
Izzy
Izzy@isidoremiller·
wow, I've discovered my first task that chatgpt fundamentally cannot do but Opus can. Looks like GPT 5.4 is simply not a Shape Rotator. It literally cannot understand the angles and geometry at play in the climbing wall I'm working on! Identical prompt, Claude immediately understands correctly and, to boot, says "This is a really satisfying geometry problem". 5.4 just flails and constantly gets confused. Super interesting, i've never seen something like this before where it can just never get it.
[3 images]
1
0
2
244
Izzy
Izzy@isidoremiller·
slept terribly last night, half-lucid dreamed the exact word for word experience of watching Transformers 2 all night long in vivid detail
0
0
0
57
Izzy
Izzy@isidoremiller·
Surprised everyone isn't talking about activation verbalization after the Mythos system card release. Did everyone already know about this technique? is it incredibly difficult or expensive to do? feels extraordinarily important and interesting but haven't seen anyone discuss
[image]
0
0
2
217
Izzy
Izzy@isidoremiller·
still seeing this btw, it totally bricks the thread
0
0
0
59
Izzy
Izzy@isidoremiller·
@thsottiaux codex app is unusable due to auto-compaction failures unfortunately
Error running remote compact task: stream disconnected before completion: error sending request for url (chatgpt.com/backend-api/co…)
1
0
2
282
Izzy
Izzy@isidoremiller·
@novasarc01 that feels at least directionally right to me!
0
0
1
10
λux
λux@novasarc01·
i think the deeper reason is that a rich harness effectively widens the decomposition language available to the model...but I do not think it really breaks the mold. my guess is that advanced math looks tool-free at inference time yet a lot of the capability still comes from having been trained or selected in a regime that had strong latent harness structure...like solution checking / formal verification...the harness may have moved from the deployment environment into the training and post-training pipeline...(i guess that would explain why the final model can look like it is doing pure internal reasoning while still benefiting from feedback-shaped decomposition habits).
1
0
2
42
Izzy retweeted
Shashwat Goel
Shashwat Goel@ShashwatGoel7·
🌶️ take: If you make an eval, you shouldn't release it without trying to optimize it. Synthetic data RL, and autoresearch are great tools for this. It makes you discover so many subtle footguns. Your eval is only measuring what is enough to optimize it, and the best evals still make sense under optimization pressure. I could write "How to Game X" for so many popular evals rn...
4
4
78
4.5K
BlakeTheCoder
BlakeTheCoder@kyr_dreamer·
@isidoremiller 100K tokens of tools under one orchestrator, wild. Architecture alone is a massive feat. Dropping the benchmark when?
1
0
1
23
Izzy
Izzy@isidoremiller·
extremely fun jamming with Harrison on all things analytics, agents and evals! I have like 20 more hours worth of takes and opinions in this space and going on the pod uncorked them, more to come for sure. Also my simulation benchmark launching soon 😈
Harrison Chase@hwchase17

🎙️ Introducing Max Agency

Max Agency is a new podcast where we go deep on how the best agents are actually being built: architecture decisions, tradeoffs, evals, and everything in between. Each episode, I sit down with engineering leaders who are doing this work in production.

Our first episode features Izzy Miller (@isidoremiller), AI Engineer at Hex (@_hex_tech). Hex has been shipping data agents since before most teams were even thinking about them, starting with single-cell text-to-SQL and graduating to a full Notebook agent that can work autonomously for 20 minutes on a complex analysis. Izzy has a lot of perspective on what it actually takes to get agents working well in production, and what breaks along the way.

A few takeaways from our conversation:
- Keep your eval sets small enough to hold in your head: Izzy runs 30-50 handcrafted "traps" with multiple repetitions, rather than hundreds of variants. If you can't explain why your agent fails each one, your eval set is too big
- Day zero performance is almost irrelevant: The more interesting question is how the agent compounds. Izzy is building a 90-day simulation where the warehouse evolves and the agent has to accumulate understanding
- You can catch agent errors without seeing the raw outputs: By running an LLM-as-a-judge over production usage and clustering the results, you can surface places where something likely went wrong, without needing to read individual conversations

Watch the full episode on:
- YouTube: youtube.com/watch?v=Xyh1Eq…
- Apple Podcasts: podcasts.apple.com/us/podcast/how…
- Spotify: open.spotify.com/episode/1BJlg3…

4
8
16
2.4K
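The first takeaway in the quoted announcement (a small set of handcrafted "traps", each run several times) can be sketched roughly as below. This is a minimal illustration only, not Hex's harness: `run_traps`, `stub_agent`, the trap names, and the SQL strings are all hypothetical, and a real agent call would replace the stub.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical shapes, not from the podcast: an agent maps a prompt to an
# answer, and a trap is (name, prompt, checker for the answer).
Agent = Callable[[str], str]
Trap = Tuple[str, str, Callable[[str], bool]]


def run_traps(agent: Agent, traps: List[Trap], repetitions: int = 5) -> Dict[str, float]:
    """Run each trap `repetitions` times and return its pass rate."""
    rates: Dict[str, float] = {}
    for name, prompt, check in traps:
        passed = sum(1 for _ in range(repetitions) if check(agent(prompt)))
        rates[name] = passed / repetitions
    return rates


def stub_agent(prompt: str) -> str:
    # Stand-in for a real agent call; always answers with the same SQL.
    return "SELECT COUNT(*) FROM orders"


traps: List[Trap] = [
    ("counts_orders", "How many orders are there?",
     lambda answer: "COUNT(*)" in answer),
    ("avoids_deleted_rows", "How many active orders are there?",
     lambda answer: "deleted_at IS NULL" in answer),
]

rates = run_traps(stub_agent, traps, repetitions=3)
for name, rate in rates.items():
    print(f"{name}: {rate:.0%}")
```

Reporting a pass rate per named trap, rather than one aggregate score, is what keeps a set of this size explainable: each failing name points at one specific behavior to investigate.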
Evan Applegate
Evan Applegate@youwillmakemaps·
Made fifty seven greenbacks!!! NC-woven cotton blanket with a 1982 USGS illo by Tau Rho Alpha (his daughter: "Theta") comparing the Crater Lake caldera to the Mt. St. Helens eruption Crater Lake, OR was formed when Mt. Mazama blew up, scattering 290 trillion pounds of rock
[4 images]
Evan Applegate@youwillmakemaps

Made seventy six clams!!! I've never seen this Death Valley map blanket in person. Years ago I got a req for a map blanket of this area, so I tried this 1934 promo map and told the weaver to send it directly to the custie. They loved it, sent pics, four more sold since :^)

2
0
28
1.6K
Izzy
Izzy@isidoremiller·
typed L into my omnibox while on screenshare and it autosuggested linkedin instead of localhost:6060. unrecoverable aura loss
1
0
8
276
Olivia Koshy
Olivia Koshy@oliviakoshy·
working with @isidoremiller is truly one of my favorite parts of my job and I think this podcast is a peek into why. highly recommend giving it a listen if you're building agent products!
2
1
5
1.7K
Izzy
Izzy@isidoremiller·
you: Wow, AskUserQuestion really makes me more efficient at communicating with my agents in a clear and structured way! I love being clear, specific, and precise! me: "erghf no"
[2 images]
0
0
0
258
Izzy
Izzy@isidoremiller·
@puneetmehtanyc @hwchase17 super interested to hear what you're doing to calibrate those confidence levels!
0
0
1
20
Puneet Mehta
Puneet Mehta@puneetmehtanyc·
@isidoremiller @hwchase17 No, just experience from a decade of deploying AI at scale. The confidence and accuracy probe research is important work. In production, calibrated confidence scoring has been one of the highest-impact patterns we have seen for knowing when to escalate. Let's discuss further.
1
0
0
23
Harrison Chase
Harrison Chase@hwchase17·
🎙️ Introducing Max Agency

Max Agency is a new podcast where we go deep on how the best agents are actually being built: architecture decisions, tradeoffs, evals, and everything in between. Each episode, I sit down with engineering leaders who are doing this work in production.

Our first episode features Izzy Miller (@isidoremiller), AI Engineer at Hex (@_hex_tech). Hex has been shipping data agents since before most teams were even thinking about them, starting with single-cell text-to-SQL and graduating to a full Notebook agent that can work autonomously for 20 minutes on a complex analysis. Izzy has a lot of perspective on what it actually takes to get agents working well in production, and what breaks along the way.

A few takeaways from our conversation:
- Keep your eval sets small enough to hold in your head: Izzy runs 30-50 handcrafted "traps" with multiple repetitions, rather than hundreds of variants. If you can't explain why your agent fails each one, your eval set is too big
- Day zero performance is almost irrelevant: The more interesting question is how the agent compounds. Izzy is building a 90-day simulation where the warehouse evolves and the agent has to accumulate understanding
- You can catch agent errors without seeing the raw outputs: By running an LLM-as-a-judge over production usage and clustering the results, you can surface places where something likely went wrong, without needing to read individual conversations

Watch the full episode on:
- YouTube: youtube.com/watch?v=Xyh1Eq…
- Apple Podcasts: podcasts.apple.com/us/podcast/how…
- Spotify: open.spotify.com/episode/1BJlg3…
[YouTube video]
14
42
222
31.5K
Izzy retweeted
Jake Broekhuizen
Jake Broekhuizen@jakebroekhuizen·
The first episode of our 'Max Agency' podcast is now live on Youtube and podcast platforms! Was great to work with Izzy in the lead-up to this episode with @hwchase17 Check it out below 👇 youtube.com/watch?v=Xyh1Eq…
[YouTube video]
[image]
0
8
13
2.5K
Izzy
Izzy@isidoremiller·
@haroonc @hwchase17 thought we were marketers but turns out we were both AI engineers back in the day, who knew
0
0
2
20
Izzy
Izzy@isidoremiller·
@puneetmehtanyc @hwchase17 i think this is an AI mediated comment but i 100% agree, alongside all of the practical things I discussed, some of our researchers are exploring some actual frontier research on probes and other techniques to estimate confidence and accuracy
1
0
0
27
Puneet Mehta
Puneet Mehta@puneetmehtanyc·
The eval insight is right. But in enterprise production, you cannot wait for an LLM-as-judge to cluster errors after the fact. The AI has to assess confidence per action and escalate in real time before the customer is affected. Post-hoc analysis is learning. Real-time governance is trust.
1
0
3
237
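The per-action escalation pattern described in this exchange (score confidence for each action, escalate before the customer is affected) might look something like the sketch below. Everything here is an assumption for illustration: `ProposedAction`, `route`, the action names, and the 0.8 threshold are hypothetical, and the sketch presumes a calibrated confidence score is already available, which is the hard part the thread is asking about.

```python
from dataclasses import dataclass


@dataclass
class ProposedAction:
    name: str
    confidence: float  # assumed to be a calibrated probability in [0, 1]


def route(action: ProposedAction, threshold: float = 0.8) -> str:
    """Return "execute" when confidence clears the threshold, else "escalate"."""
    if not 0.0 <= action.confidence <= 1.0:
        raise ValueError("confidence must be in [0, 1]")
    return "execute" if action.confidence >= threshold else "escalate"


# Hypothetical actions: one low-confidence, one high-confidence.
decisions = [
    route(ProposedAction("refund_order", 0.55)),
    route(ProposedAction("send_receipt", 0.97)),
]
print(decisions)
```

A fixed threshold only helps if the scores are calibrated; with uncalibrated scores the same gate would escalate too often or too rarely.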