Izzy (@isidoremiller) - Twitter profile

Izzy@isidoremiller·14h

Claude went above and beyond and made me an interactive viz tool I can use to adjust and see lol, unbelievably useful response!

Izzy@isidoremiller·14h

wow, I've discovered my first task that chatgpt fundamentally cannot do but Opus can. Looks like GPT 5.4 is simply not a Shape Rotator. It literally cannot understand the angles and geometry at play in the climbing wall I'm working on! Identical prompt, Claude immediately understands correctly and, to boot, says "This is a really satisfying geometry problem". 5.4 just flails and constantly gets confused. Super interesting, i've never seen something like this before where it can just never get it.

Izzy@isidoremiller·17h

probably genuinely every few days i think about this tweet in the context of Hex rebranding

Young Thug ひ@youngthug

For now on call me SEX!!!

Izzy@isidoremiller·18h

slept terribly last night, half-lucid dreamed the exact word for word experience of watching Transformers 2 all night long in vivid detail

Izzy@isidoremiller·23h

Surprised everyone isn't talking about activation verbalization after the Mythos system card release. Did everyone already know about this technique? is it incredibly difficult or expensive to do? feels extraordinarily important and interesting but haven't seen anyone discuss

Izzy@isidoremiller·3d

still seeing this btw, it totally bricks the thread

Izzy@isidoremiller·4d

@thsottiaux codex app is unusable due to auto-compaction failures unfortunately
Error running remote compact task: stream disconnected before completion: error sending request for url (chatgpt.com/backend-api/co…)

Izzy@isidoremiller·3d

@novasarc01 that feels at least directionally right to me!

λux@novasarc01·3d

i think the deeper reason is that a rich harness effectively widens the decomposition language available to the model...but I do not think it really breaks the mold. my guess is that advanced math looks tool-free at inference time yet a lot of the capability still comes from having been trained or selected in a regime that had strong latent harness structure...like solution checking / formal verification...the harness may have moved from the deployment environment into the training and post-training pipeline...(i guess that would explain why the final model can look like it is doing pure internal reasoning while still benefiting from feedback-shaped decomposition habits).

λux@novasarc01·4d

a model’s reasoning ability does not depend only on how smart the model is by itself but also on what kinds of step-by-step problem solving the surrounding system allows it to do.

λux@novasarc01

what i find most interesting about the decomposition angle is that it treats reasoning capacity as a property of the decomposition formalism around the model. like the limiting factor is often whether the system lets the model express only shallow, explicitly enumerated subcalls or whether it can instantiate richer computational structure (recursion, loops, reusable subroutines) that can efficiently represent exponentially larger task graphs as depth grows. imo this is a very important shift in perspective bcoz it suggests that capability jumps may come less from scaling the base model and more from expanding the space of admissible decompositions while keeping each local call in-distribution.
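A toy illustration of the decomposition point above, as a minimal sketch only. The function `solve`, the `leaf` callable, and the call counter are hypothetical names, not anything from the thread. The idea it shows: a fixed-size recursive program generates a number of small local calls that grows with the task, whereas a flat decomposition would have to enumerate each subcall explicitly.

```python
from typing import Callable, List


def solve(task: List[int], leaf: Callable[[List[int]], int], calls: List[int]) -> int:
    """Split a task in half recursively; hand pieces of size <= 2 to `leaf`.

    `leaf` stands in for a single small model call that stays "in-distribution".
    `calls` is a one-element list used as a mutable counter of leaf calls.
    """
    if len(task) <= 2:
        calls[0] += 1
        return leaf(task)
    mid = len(task) // 2
    # The same three lines of program describe the whole task tree,
    # however deep it gets.
    return solve(task[:mid], leaf, calls) + solve(task[mid:], leaf, calls)


calls = [0]
total = solve(list(range(16)), sum, calls)
print(total, calls[0])  # prints: 120 8
```

Doubling the input adds one level of depth and doubles the number of leaf calls, while the program text itself does not change.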

Izzy retweeted

Shashwat Goel@ShashwatGoel7·4d

🌶️ take: If you make an eval, you shouldn't release it without trying to optimize it. Synthetic data RL, and autoresearch are great tools for this. It makes you discover so many subtle footguns. Your eval is only measuring what is enough to optimize it, and the best evals still make sense under optimization pressure. I could write "How to Game X" for so many popular evals rn...

Izzy@isidoremiller·4d

@kyr_dreamer

BlakeTheCoder@kyr_dreamer·4d

@isidoremiller 100K tokens of tools under one orchestrator, wild. Architecture alone is a massive feat. Dropping the benchmark when?

Izzy@isidoremiller·4d

extremely fun jamming with Harrison on all things analytics, agents and evals! I have like 20 more hours worth of takes and opinions in this space and going on the pod uncorked them, more to come for sure. Also my simulation benchmark launching soon 😈

Harrison Chase@hwchase17

🎙️Introducing Max Agency
Max Agency is a new podcast where we go deep on how the best agents are actually being built: architecture decisions, tradeoffs, evals, and everything in between. Each episode, I sit down with engineering leaders who are doing this work in production.
Our first episode features Izzy Miller (@isidoremiller), AI Engineer at Hex (@_hex_tech). Hex has been shipping data agents since before most teams were even thinking about them, starting with single-cell text-to-SQL and graduating to a full Notebook agent that can work autonomously for 20 minutes on a complex analysis. Izzy has a lot of perspective on what it actually takes to get agents working well in production, and what breaks along the way.
A few takeaways from our conversation:
- Keep your eval sets small enough to hold in your head: Izzy runs 30-50 handcrafted "traps" with multiple repetitions, rather than hundreds of variants. If you can't explain why your agent fails each one, your eval set is too big
- Day zero performance is almost irrelevant: The more interesting question is how the agent compounds. Izzy is building a 90-day simulation where the warehouse evolves and the agent has to accumulate understanding
- You can catch agent errors without seeing the raw outputs: By running an LLM-as-a-judge over production usage and clustering the results, you can surface places where something likely went wrong, without needing to read individual conversations
Watch the full episode on:
- Youtube: youtube.com/watch?v=Xyh1Eq…
- Apple Podcasts: podcasts.apple.com/us/podcast/how…
- Spotify: open.spotify.com/episode/1BJlg3…
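The first takeaway above (a small set of handcrafted traps, each run several times) could look roughly like the sketch below. This is an assumption-laden illustration, not Hex's actual harness: `run_traps`, the `Trap` tuple, the trap names, and the stub agent are all hypothetical, and exact string matching stands in for whatever grading is really used.

```python
from typing import Callable, Dict, List, Tuple

# (trap name, prompt, expected answer); all hypothetical examples.
Trap = Tuple[str, str, str]


def run_traps(
    agent: Callable[[str], str],
    traps: List[Trap],
    repetitions: int = 5,
) -> Dict[str, float]:
    """Run every trap `repetitions` times and return a pass rate per trap."""
    rates: Dict[str, float] = {}
    for name, prompt, expected in traps:
        passes = sum(agent(prompt) == expected for _ in range(repetitions))
        rates[name] = passes / repetitions
    return rates


def stub_agent(prompt: str) -> str:
    """Deterministic stand-in for a real agent call."""
    return "42" if "revenue" in prompt else "unknown"


traps: List[Trap] = [
    ("revenue_null_join", "total revenue?", "42"),
    ("timezone_boundary", "orders yesterday?", "17"),
]
rates = run_traps(stub_agent, traps)
print(rates)  # prints: {'revenue_null_join': 1.0, 'timezone_boundary': 0.0}
```

Keeping one named pass rate per trap is what makes the set small enough to reason about: each failing name points at a specific failure you should be able to explain.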

Izzy@isidoremiller·4d

@youwillmakemaps hire me as ur intern plz

Evan Applegate@youwillmakemaps·4d

Made fifty seven greenbacks！！！ NC-woven cotton blanket with a 1982 USGS illo by Tau Rho Alpha (his daughter: "Theta") comparing the Crater Lake caldera to the Mt. St. Helens eruption Crater Lake, OR was formed when Mt. Mazama blew up, scattering 290 trillion pounds of rock

Evan Applegate@youwillmakemaps

Made seventy six clams！！！ I've never seen this Death Valley map blanket in person. Years ago I got a req for a map blanket of this area, so I tried this 1934 promo map and told the weaver to send it directly to the custie. They loved it, sent pics, four more sold since :^)

Izzy@isidoremiller·4d

typed L into my omnibox while on screenshare and it autosuggested linkedin instead of localhost:6060. unrecoverable aura loss

Izzy@isidoremiller·4d

@oliviakoshy 💜

Olivia Koshy@oliviakoshy·4d

working with @isidoremiller is truly one of my favorite parts of my job and I think this podcast is a peek into why. highly recommend giving it a listen if you're building agent products!

Izzy@isidoremiller·4d

you: Wow, AskUserQuestion really makes me more efficient at communicating with my agents in a clear and structured way! I love being clear, specific, and precise! me: "erghf no"

Izzy@isidoremiller·4d

@puneetmehtanyc @hwchase17 super interested to hear what you're doing to calibrate those confidence levels!

Puneet Mehta@puneetmehtanyc·4d

@isidoremiller @hwchase17 No, just experience from a decade of deploying AI at scale. The confidence and accuracy probe research is important work. In production, calibrated confidence scoring has been one of the highest-impact patterns we have seen for knowing when to escalate. Let's discuss further.

Harrison Chase@hwchase17·4d

🎙️Introducing Max Agency
Max Agency is a new podcast where we go deep on how the best agents are actually being built: architecture decisions, tradeoffs, evals, and everything in between. Each episode, I sit down with engineering leaders who are doing this work in production.
Our first episode features Izzy Miller (@isidoremiller), AI Engineer at Hex (@_hex_tech). Hex has been shipping data agents since before most teams were even thinking about them, starting with single-cell text-to-SQL and graduating to a full Notebook agent that can work autonomously for 20 minutes on a complex analysis. Izzy has a lot of perspective on what it actually takes to get agents working well in production, and what breaks along the way.
A few takeaways from our conversation:
- Keep your eval sets small enough to hold in your head: Izzy runs 30-50 handcrafted "traps" with multiple repetitions, rather than hundreds of variants. If you can't explain why your agent fails each one, your eval set is too big
- Day zero performance is almost irrelevant: The more interesting question is how the agent compounds. Izzy is building a 90-day simulation where the warehouse evolves and the agent has to accumulate understanding
- You can catch agent errors without seeing the raw outputs: By running an LLM-as-a-judge over production usage and clustering the results, you can surface places where something likely went wrong, without needing to read individual conversations
Watch the full episode on:
- Youtube: youtube.com/watch?v=Xyh1Eq…
- Apple Podcasts: podcasts.apple.com/us/podcast/how…
- Spotify: open.spotify.com/episode/1BJlg3…

Izzy retweeted

Jake Broekhuizen@jakebroekhuizen·4d

The first episode of our 'Max Agency' podcast is now live on Youtube and podcast platforms! Was great to work with Izzy in the lead-up to this episode with @hwchase17 Check it out below 👇 youtube.com/watch?v=Xyh1Eq…

Izzy@isidoremiller·4d

@haroonc @hwchase17 thought we were marketers but turns out we were both AI engineers back in the day, who knew

Haroon Choudery@haroonc·4d

@hwchase17 holy crossover! excited to hear this @isidoremiller

Izzy@isidoremiller·4d

@puneetmehtanyc @hwchase17 i think this is an AI mediated comment but i 100% agree, alongside all of the practical things I discussed, some of our researchers are exploring some actual frontier research on probes and other techniques to estimate confidence and accuracy

Puneet Mehta@puneetmehtanyc·4d

The eval insight is right. But in enterprise production, you cannot wait for an LLM-as-judge to cluster errors after the fact. The AI has to assess confidence per action and escalate in real time before the customer is affected. Post-hoc analysis is learning. Real-time governance is trust.
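The per-action check described above could be sketched as a simple threshold gate. This is only an illustration under stated assumptions: `Action`, `route`, and the 0.8 threshold are hypothetical, and the confidence value is assumed to already be a calibrated probability, which is the hard part the tweet does not explain.

```python
from dataclasses import dataclass


@dataclass
class Action:
    description: str
    confidence: float  # assumed calibrated probability in [0, 1]


def route(action: Action, threshold: float = 0.8) -> str:
    """Execute confident actions; hand everything else to a human."""
    return "execute" if action.confidence >= threshold else "escalate"


print(route(Action("issue refund", 0.95)))   # prints: execute
print(route(Action("close account", 0.40)))  # prints: escalate
```

The gate runs before the action takes effect, which is what separates it from post-hoc clustering of judged outputs.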
