Elizabeth Hutton

33 posts

@ehutt_

ML Engineer | Phoenix OSS | always learning unsupervised

Los Angeles · Joined September 2025
56 Following · 14 Followers
Elizabeth Hutton @ehutt_
@theo I like to switch between claude and codex during different stages of development, and have a custom CLI for injecting project context into either, so the CLI agents work great for me.
0 replies · 0 retweets · 1 like · 177 views
elvis @omarsar0
Any good alternatives to the Claude Agent SDK? I have used OpenAI Agent SDK and ADK a bit in the past but not sure about their state today.
44 replies · 2 retweets · 50 likes · 13K views
Elizabeth Hutton retweeted
Arize AI @arizeai
That's what's next for the Arize Phoenix open source project, according to our head of open source @mikeldking and senior AI engineer @ehutt_. Not just observability for humans, but a context platform for humans and agents to build great AI-native software together. Learn more: arize.com/blog/from-obse…
0 replies · 2 retweets · 2 likes · 114 views
Elizabeth Hutton @ehutt_
treat agents the way you want to be treated
0 replies · 0 retweets · 1 like · 9 views
Armin Ronacher ⇌ @mitsuhiko
I'm so in love with @antirez' ds4. Patched some slop on it to get better streaming, but I can just install a pi extension on a 128GB mac and it manages everything for me. No need for mlx-lm, ollama or lm studio or finagling pi configs.
16 replies · 24 retweets · 550 likes · 55.2K views
Sam Altman @sama
people are really starting to use voice to interact with AI, especially when they have a lot of context to dump. GPT-Realtime-2 comes to the API today; it is a pretty big step forward. (we are working on improvements to voice in chat.)
875 replies · 290 retweets · 7.1K likes · 481K views
Elizabeth Hutton @ehutt_
Myth: an LLM judge can only be as good as the model that generated the output. Truth: Generation and evaluation are different tasks. Writing an essay is hard, but identifying a bad one isn't. In reality, a judge isn't bottlenecked by the generator's ceiling. All you need is a good rubric.
Quoting Arize AI @arizeai:

🧠 One AI Question with Ankur Duggal We asked our AI Solutions Architect: Why use an LLM to evaluate another LLM? His answer: It's like human-to-human evaluation. By using specific prompts, an LLM acts as a judge to grade performance—leading to more accurate results and better AI model scaling. It's one of the most powerful patterns in modern AI development. #AI #LLM #MachineLearning #AIEvals #LLMAsAJudge

1 reply · 0 retweets · 1 like · 67 views
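A minimal sketch of the rubric idea above: the judge grades against an explicit rubric rather than regenerating the answer, so it isn't tied to the generator's ceiling. The rubric wording, model name, and function are illustrative assumptions, not Phoenix or Arize code.

```python
# Hypothetical LLM-as-judge: the rubric, not the generator, sets the bar.
# Assumes the `openai` client package; rubric and model name are placeholders.
from openai import OpenAI

RUBRIC = """Grade the ANSWER to the QUESTION against this rubric:
5 = factually correct, directly addresses the question
3 = mostly correct but vague or incomplete
1 = off-topic, contradictory, or fabricated
Reply with only the integer score."""

def judge(question: str, answer: str, model: str = "gpt-4o-mini") -> int:
    client = OpenAI()
    resp = client.chat.completions.create(
        model=model,
        temperature=0,  # deterministic grading
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": f"QUESTION: {question}\nANSWER: {answer}"},
        ],
    )
    return int(resp.choices[0].message.content.strip())
```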
Elizabeth Hutton retweeted
arize-phoenix @ArizePhoenix
Most tool-calling evals treat everything as one check. But a tool call actually has 3 separate steps, and each can break independently 👇

🧠 Tool Selection: Did the model pick the right tool?
⚙️ Tool Parameters: Did it call that tool correctly (valid args, right values)?
🔄 Tool Response Handling: Did it use the tool's result correctly in the final answer?

Why this matters:
• Wrong tool → routing issue
• Bad args → schema issue
• Bad final answer → reasoning issue

If you don't separate these, debugging gets messy.

Repo: github.com/Arize-ai/phoen…
[tweet media]
0 replies · 2 retweets · 5 likes · 149 views
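A rough sketch of what splitting those three checks apart can look like. The record layout, tool name, and values below are hypothetical; adapt them to however your traces store tool calls.

```python
# Three independent tool-call checks, mirroring the steps above.
from typing import Any

def eval_tool_selection(record: dict[str, Any], expected_tool: str) -> bool:
    # Routing: did the model pick the right tool?
    return record["tool_call"]["name"] == expected_tool

def eval_tool_parameters(record: dict[str, Any], schema: dict[str, type]) -> bool:
    # Schema: are all required args present with the right types?
    args = record["tool_call"]["arguments"]
    return all(k in args and isinstance(args[k], t) for k, t in schema.items())

def eval_response_handling(record: dict[str, Any]) -> bool:
    # Reasoning: did the tool's result actually make it into the answer?
    # (Crude containment check; an LLM judge is the usual upgrade here.)
    return str(record["tool_result"]) in record["final_answer"]

record = {  # hypothetical trace record
    "tool_call": {"name": "get_weather", "arguments": {"city": "LA"}},
    "tool_result": "72F",
    "final_answer": "It is currently 72F in LA.",
}
print(eval_tool_selection(record, "get_weather"))   # routing
print(eval_tool_parameters(record, {"city": str}))  # schema
print(eval_response_handling(record))               # reasoning
```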
Elizabeth Hutton @ehutt_
@sama please fix this! I’m trying to avoid building myself an Openclaw and thought the ChatGPT pulse feature was cool but this is unusable. Every day I prompt for ENGLISH and every day it gives me this
[tweet media]
1 reply · 0 retweets · 1 like · 74 views
bashbunni @sudobunni
One of my goals is to support open source maintainers in whatever ways I can. However, I'm currently struggling to keep up with what's happening in the open source space (trending projects, active maintainers, etc). Does anyone have any resource recs to stay up-to-date on that stuff?
27 replies · 8 retweets · 270 likes · 18.7K views
Elizabeth Hutton retweeted
arize-phoenix @ArizePhoenix
The biggest bottleneck in AI application development right now isn't models, it's evaluation.

Every AI engineer faces the same question: Which prompt or model actually gives the best balance of quality, latency, and cost? Without strong evaluation, you're guessing.

Here's a simple way to think about it:
• Datasets = collections of examples (inputs + optional reference outputs)
• Tasks = your LLM / agent generating outputs
• Evaluators = automated "tests" that score quality
• Experiments = running tasks across datasets to compare results

Together, they form an evaluation loop: Run → Score → Compare → Iterate

This is just the scientific method applied to AI.

Control what you can:
• Dataset
• Evaluators
• (sometimes) model settings

Change what you're testing:
• Prompts
• Model versions
• Retrieval strategies
• App logic

Then measure the impact, objectively.

High-quality evaluations unlock:
✔ Confidence in shipping changes
✔ Faster iteration cycles
✔ Clear regression detection
✔ Less guesswork, more engineering

The reality is: AI development speed is gated by how quickly you can trust your changes. That's why evaluation infrastructure matters.

Phoenix is built around this idea: making it easy to run experiments, evaluate outputs, and iterate with confidence. And it's open source. Explore it, adapt it, and build better AI faster.
[tweet media]
0 replies · 2 retweets · 7 likes · 222 views
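A bare-bones version of that Run → Score → Compare → Iterate loop. This is a framework-agnostic sketch, not the Phoenix experiments API; the dataset, task, and evaluator are toy stand-ins.

```python
# Dataset: examples with inputs and reference outputs.
dataset = [
    {"input": "What is 2 + 2?", "reference": "4"},
    {"input": "Capital of France?", "reference": "Paris"},
]

def task(example: dict) -> str:
    # Stand-in for your LLM/agent call; swap prompts or models here.
    return "4" if "2 + 2" in example["input"] else "Paris"

def exact_match(output: str, example: dict) -> float:
    # Evaluator: an automated "test" that scores quality.
    return 1.0 if output.strip() == example["reference"] else 0.0

def run_experiment(dataset, task, evaluator) -> float:
    scores = [evaluator(task(ex), ex) for ex in dataset]  # Run + Score
    return sum(scores) / len(scores)                      # Compare

print(f"accuracy: {run_experiment(dataset, task, exact_match):.2f}")  # Iterate
```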
Elizabeth Hutton @ehutt_
@AishwaryaDevv I built a cli tool to track projects in obsidian so I can chat with Claude or codex whenever I want and it has full context - notes, sandbox, PR status, and it gets updated after every chat. It’s soooo handy github.com/ehutt/arc
0 replies · 0 retweets · 1 like · 74 views
Aish @AishwaryaDevv
Most annoying part of vibe coding? Re-explaining the entire context to your IDE after switching chats
186 replies · 7 retweets · 494 likes · 39.3K views
Elizabeth Hutton @ehutt_
@aryanlabde Yes, mostly. I always used both but Claude was my daily driver until I had 3 days of struggling to get it to do simple stuff. Now I go to codex for most things, and use Claude for review (all via the terminal, I don’t like the codex app)
0 replies · 0 retweets · 0 likes · 146 views
Aryan @aryanlabde
Vibe coders, have you really switched from claude to codex?
440 replies · 5 retweets · 466 likes · 106.4K views
Elizabeth Hutton @ehutt_
How many of you actually use your vibe-coded apps every day? Do you still use them 1, 2, 3 months later?
1 reply · 0 retweets · 3 likes · 109 views
arize-phoenix @ArizePhoenix
Quis custodiet ipsos custodes? Your judge is just another LLM app. Phoenix traces evaluators so you can: → audit decisions → refine prompts → build benchmark datasets → curate fine-tuning data for smaller judges The eval loop is the AI eng loop. Treat it like one.
1 reply · 0 retweets · 1 like · 90 views
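Since the judge is just another LLM app, it can be traced like one. A hedged sketch using the plain OpenTelemetry API (which Phoenix can ingest); the span name, attribute keys, and judge stub are assumptions, not Phoenix's instrumentation.

```python
# Trace the evaluator so its decisions can be audited, its prompt
# refined, and its outputs curated into benchmark/fine-tuning datasets.
from opentelemetry import trace

tracer = trace.get_tracer("evals")

def judge(question: str, answer: str) -> int:
    return 5  # placeholder for your actual LLM-as-judge call

def traced_judge(question: str, answer: str) -> int:
    with tracer.start_as_current_span("llm_judge") as span:
        span.set_attribute("input.question", question)
        span.set_attribute("input.answer", answer)
        score = judge(question, answer)
        span.set_attribute("output.score", score)
        return score
```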
Elizabeth Hutton @ehutt_
Build your own, it's not hard. All you need is a CLI tool to link Obsidian + Code + your agents of choice. Track ongoing projects, plan/chat/implement/review all with full context. Each project gets its own sandbox, I have different code review modes, and once a PR is opened a tmux session launches in the background to monitor for comments and CI failures. Now I can work on 4-5 things at a time easily. github.com/ehutt/arc
0 replies · 0 retweets · 2 likes · 562 views
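A loose sketch of the PR-monitoring piece described above, polling the GitHub CLI for CI failures and review comments. It assumes `gh` is installed and authenticated; the polling interval, output handling, and PR number are placeholders (the real tool runs something like this inside a tmux session).

```python
# Poll `gh` for CI status and review comments on an open PR.
import subprocess
import time

def gh(*args: str) -> str:
    # Thin wrapper around the GitHub CLI; returns stdout as text.
    return subprocess.run(["gh", *args], capture_output=True, text=True).stdout

def watch_pr(pr_number: str, interval: int = 300) -> None:
    while True:
        checks = gh("pr", "checks", pr_number)               # CI status per check
        comments = gh("pr", "view", pr_number, "--comments")  # review activity
        if "fail" in checks.lower():
            print(f"PR #{pr_number}: CI failure detected\n{checks}")
        if comments.strip():
            print(f"PR #{pr_number}: review activity\n{comments}")
        time.sleep(interval)

# watch_pr("42")  # hypothetical PR number
```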
Rhys @RhysSullivan
has anyone built "the software factory" thing, misc requirements:
- able to use my subscriptions (codex / claude)
- stacked diffs w/ graphite
- able to go from planning -> lots of small tasks
- closing the review loop
don't love current agent interfaces, want something new
139 replies · 5 retweets · 475 likes · 41K views