Elizabeth Hutton

33 posts

@ehutt_

ML Engineer | Phoenix OSS | always learning unsupervised

Los Angeles · Joined September 2025
56 Following · 14 Followers
Elizabeth Hutton @ehutt_
@theo I like to switch between claude and codex during different stages of development, and have a custom CLI for injecting project context into either, so the CLI agents work great for me.
0 replies · 0 retweets · 1 like · 177 views
elvis @omarsar0
Any good alternatives to the Claude Agent SDK? I have used OpenAI Agent SDK and ADK a bit in the past but not sure about their state today.
44 replies · 2 retweets · 50 likes · 13K views
Elizabeth Hutton retweeted
Arize AI @arizeai
That's what's next for the Arize Phoenix open source project, according to our head of open source @mikeldking and senior AI engineer @ehutt_. Not just observability for humans, but a context platform for humans and agents to build great AI-native software together. Learn more: arize.com/blog/from-obse…
0 replies · 2 retweets · 2 likes · 114 views
Elizabeth Hutton @ehutt_
treat agents the way you want to be treated
0 replies · 0 retweets · 1 like · 9 views
Armin Ronacher ⇌ @mitsuhiko
I'm so in love with @antirez' ds4. Patched some slop on it to get better streaming, but I can just install a pi extension on a 128GB mac and it manages everything for me. No need for mlx-lm, ollama or lm studio or finagling pi configs.
16 replies · 24 retweets · 550 likes · 55.2K views
Sam Altman @sama
people are really starting to use voice to interact with AI, especially when they have a lot of context to dump. GPT-Realtime-2 comes to the API today; it is a pretty big step forward. (we are working on improvements to voice in chat.)
875 replies · 290 retweets · 7.1K likes · 481K views
Elizabeth Hutton @ehutt_
Myth: an LLM judge can only be as good as the model that generated the output. Truth: Generation and evaluation are different tasks. Writing an essay is hard, but identifying a bad one isn't. In reality, a judge isn't bottlenecked by the generator's ceiling. All you need is a good rubric.
Quoting Arize AI @arizeai:

🧠 One AI Question with Ankur Duggal We asked our AI Solutions Architect: Why use an LLM to evaluate another LLM? His answer: It's like human-to-human evaluation. By using specific prompts, an LLM acts as a judge to grade performance—leading to more accurate results and better AI model scaling. It's one of the most powerful patterns in modern AI development. #AI #LLM #MachineLearning #AIEvals #LLMAsAJudge

1 reply · 0 retweets · 1 like · 67 views
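A minimal sketch of the rubric idea above: the judge grades against an explicit rubric rather than regenerating the answer, so it isn't tied to the generator's ceiling. The rubric wording, model name, and function are illustrative assumptions, not Phoenix or Arize code.

```python
# Hypothetical LLM-as-judge: the rubric, not the generator, sets the bar.
# Assumes the `openai` client package; rubric and model name are placeholders.
from openai import OpenAI

RUBRIC = """Grade the ANSWER to the QUESTION against this rubric:
5 = factually correct, directly addresses the question
3 = mostly correct but vague or incomplete
1 = off-topic, contradictory, or fabricated
Reply with only the integer score."""

def judge(question: str, answer: str, model: str = "gpt-4o-mini") -> int:
    client = OpenAI()
    resp = client.chat.completions.create(
        model=model,
        temperature=0,  # deterministic grading
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": f"QUESTION: {question}\nANSWER: {answer}"},
        ],
    )
    return int(resp.choices[0].message.content.strip())
```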
Elizabeth Hutton retweeted
arize-phoenix @ArizePhoenix
Most tool-calling evals treat everything as one check. But a tool call actually has 3 separate steps, and each can break independently 👇

🧠 Tool Selection: Did the model pick the right tool?
⚙️ Tool Parameters: Did it call that tool correctly (valid args, right values)?
🔄 Tool Response Handling: Did it use the tool's result correctly in the final answer?

Why this matters:
• Wrong tool → routing issue
• Bad args → schema issue
• Bad final answer → reasoning issue

If you don't separate these, debugging gets messy.

Repo: github.com/Arize-ai/phoen…
[tweet media]
0 replies · 2 retweets · 5 likes · 149 views
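A rough sketch of what splitting those three checks apart can look like. The record layout, tool name, and values below are hypothetical; adapt them to however your traces store tool calls.

```python
# Three independent tool-call checks, mirroring the steps above.
from typing import Any

def eval_tool_selection(record: dict[str, Any], expected_tool: str) -> bool:
    # Routing: did the model pick the right tool?
    return record["tool_call"]["name"] == expected_tool

def eval_tool_parameters(record: dict[str, Any], schema: dict[str, type]) -> bool:
    # Schema: are all required args present with the right types?
    args = record["tool_call"]["arguments"]
    return all(k in args and isinstance(args[k], t) for k, t in schema.items())

def eval_response_handling(record: dict[str, Any]) -> bool:
    # Reasoning: did the tool's result actually make it into the answer?
    # (Crude containment check; an LLM judge is the usual upgrade here.)
    return str(record["tool_result"]) in record["final_answer"]

record = {  # hypothetical trace record
    "tool_call": {"name": "get_weather", "arguments": {"city": "LA"}},
    "tool_result": "72F",
    "final_answer": "It is currently 72F in LA.",
}
print(eval_tool_selection(record, "get_weather"))   # routing
print(eval_tool_parameters(record, {"city": str}))  # schema
print(eval_response_handling(record))               # reasoning
```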
Elizabeth Hutton @ehutt_
@sama please fix this! I’m trying to avoid building myself an Openclaw and thought the ChatGPT pulse feature was cool but this is unusable. Every day I prompt for ENGLISH and every day it gives me this
[tweet media]
1 reply · 0 retweets · 1 like · 74 views
bashbunni @sudobunni
One of my goals is to support open source maintainers in whatever ways I can. However, I'm currently struggling to keep up with what's happening in the open source space (trending projects, active maintainers, etc). Does anyone have any resource recs to stay up-to-date on that stuff?
27 replies · 8 retweets · 270 likes · 18.7K views
Elizabeth Hutton retweeted
arize-phoenix @ArizePhoenix
The biggest bottleneck in AI application development right now isn't models, it's evaluation.

Every AI engineer faces the same question: Which prompt or model actually gives the best balance of quality, latency, and cost? Without strong evaluation, you're guessing.

Here's a simple way to think about it:
• Datasets = collections of examples (inputs + optional reference outputs)
• Tasks = your LLM / agent generating outputs
• Evaluators = automated "tests" that score quality
• Experiments = running tasks across datasets to compare results

Together, they form an evaluation loop: Run → Score → Compare → Iterate

This is just the scientific method applied to AI.

Control what you can:
• Dataset
• Evaluators
• (sometimes) model settings

Change what you're testing:
• Prompts
• Model versions
• Retrieval strategies
• App logic

Then measure the impact, objectively.

High-quality evaluations unlock:
✔ Confidence in shipping changes
✔ Faster iteration cycles
✔ Clear regression detection
✔ Less guesswork, more engineering

The reality is: AI development speed is gated by how quickly you can trust your changes. That's why evaluation infrastructure matters.

Phoenix is built around this idea: making it easy to run experiments, evaluate outputs, and iterate with confidence. And it's open source. Explore it, adapt it, and build better AI faster.
[tweet media]
0 replies · 2 retweets · 7 likes · 222 views
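A bare-bones version of that Run → Score → Compare → Iterate loop. This is a framework-agnostic sketch, not the Phoenix experiments API; the dataset, task, and evaluator are toy stand-ins.

```python
# Dataset: examples with inputs and reference outputs.
dataset = [
    {"input": "What is 2 + 2?", "reference": "4"},
    {"input": "Capital of France?", "reference": "Paris"},
]

def task(example: dict) -> str:
    # Stand-in for your LLM/agent call; swap prompts or models here.
    return "4" if "2 + 2" in example["input"] else "Paris"

def exact_match(output: str, example: dict) -> float:
    # Evaluator: an automated "test" that scores quality.
    return 1.0 if output.strip() == example["reference"] else 0.0

def run_experiment(dataset, task, evaluator) -> float:
    scores = [evaluator(task(ex), ex) for ex in dataset]  # Run + Score
    return sum(scores) / len(scores)                      # Compare

print(f"accuracy: {run_experiment(dataset, task, exact_match):.2f}")  # Iterate
```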
Elizabeth Hutton @ehutt_
@AishwaryaDevv I built a cli tool to track projects in obsidian so I can chat with Claude or codex whenever I want and it has full context - notes, sandbox, PR status, and it gets updated after every chat. It’s soooo handy github.com/ehutt/arc
0 replies · 0 retweets · 1 like · 74 views
Aish @AishwaryaDevv
Most annoying part of vibe coding? Re-explaining the entire context to your IDE after switching chats
186 replies · 7 retweets · 494 likes · 39.3K views
Elizabeth Hutton @ehutt_
@aryanlabde Yes, mostly. I always used both but Claude was my daily driver until I had 3 days of struggling to get it to do simple stuff. Now I go to codex for most things, and use Claude for review (all via the terminal, I don’t like the codex app)
0 replies · 0 retweets · 0 likes · 146 views
Aryan @aryanlabde
Vibe coders, have you really switched from claude to codex?
440 replies · 5 retweets · 466 likes · 106.4K views
Elizabeth Hutton @ehutt_
How many of you actually use your vibe-coded apps every day? Do you still use them 1, 2, 3 months later?
1 reply · 0 retweets · 3 likes · 109 views
arize-phoenix @ArizePhoenix
Quis custodiet ipsos custodes? Your judge is just another LLM app. Phoenix traces evaluators so you can: → audit decisions → refine prompts → build benchmark datasets → curate fine-tuning data for smaller judges The eval loop is the AI eng loop. Treat it like one.
1 reply · 0 retweets · 1 like · 90 views
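Since the judge is just another LLM app, it can be traced like one. A hedged sketch using the plain OpenTelemetry API (which Phoenix can ingest); the span name, attribute keys, and judge stub are assumptions, not Phoenix's instrumentation.

```python
# Trace the evaluator so its decisions can be audited, its prompt
# refined, and its outputs curated into benchmark/fine-tuning datasets.
from opentelemetry import trace

tracer = trace.get_tracer("evals")

def judge(question: str, answer: str) -> int:
    return 5  # placeholder for your actual LLM-as-judge call

def traced_judge(question: str, answer: str) -> int:
    with tracer.start_as_current_span("llm_judge") as span:
        span.set_attribute("input.question", question)
        span.set_attribute("input.answer", answer)
        score = judge(question, answer)
        span.set_attribute("output.score", score)
        return score
```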
Elizabeth Hutton @ehutt_
Build your own, it's not hard. All you need is a CLI tool to link Obsidian + Code + your agents of choice. Track ongoing projects, plan/chat/implement/review all with full context. Each project gets its own sandbox, I have different code review modes, and once a PR is opened a tmux session launches in the background to monitor for comments and CI failures. Now I can work on 4-5 things at a time easily. github.com/ehutt/arc
0 replies · 0 retweets · 2 likes · 562 views
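A loose sketch of the PR-monitoring piece described above, polling the GitHub CLI for CI failures and review comments. It assumes `gh` is installed and authenticated; the polling interval, output handling, and PR number are placeholders (the real tool runs something like this inside a tmux session).

```python
# Poll `gh` for CI status and review comments on an open PR.
import subprocess
import time

def gh(*args: str) -> str:
    # Thin wrapper around the GitHub CLI; returns stdout as text.
    return subprocess.run(["gh", *args], capture_output=True, text=True).stdout

def watch_pr(pr_number: str, interval: int = 300) -> None:
    while True:
        checks = gh("pr", "checks", pr_number)               # CI status per check
        comments = gh("pr", "view", pr_number, "--comments")  # review activity
        if "fail" in checks.lower():
            print(f"PR #{pr_number}: CI failure detected\n{checks}")
        if comments.strip():
            print(f"PR #{pr_number}: review activity\n{comments}")
        time.sleep(interval)

# watch_pr("42")  # hypothetical PR number
```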
Rhys @RhysSullivan
has anyone built "the software factory" thing, misc requirements:
- able to use my subscriptions (codex / claude)
- stacked diffs w/ graphite
- able to go from planning -> lots of small tasks
- closing the review loop
don't love current agent interfaces, want something new
139 replies · 5 retweets · 475 likes · 41K views