sanjana

147 posts

sanjana

@sanjanayed

Berkeley, CA · Joined March 2025
166 Following · 164 Followers
sanjana retweeted
arize-phoenix @ArizePhoenix
In the coming weeks we'll be enabling basic telemetry by default in Phoenix. Our commitment: privacy and security are foundational. What we're collecting: simple, anonymous usage stats. Opt out: set PHOENIX_TELEMETRY_ENABLED=false, no questions asked. github.com/Arize-ai/phoen…
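The opt-out described above is just an environment variable; a minimal Python sketch of checking it (the variable name comes from the post, the helper function below is illustrative, not part of Phoenix):

```python
import os

# Disable Phoenix telemetry by setting the documented env var
# before the Phoenix process (or import) starts.
os.environ["PHOENIX_TELEMETRY_ENABLED"] = "false"

def telemetry_enabled() -> bool:
    # Treat anything other than an explicit "false" as enabled.
    return os.environ.get("PHOENIX_TELEMETRY_ENABLED", "true").lower() != "false"

print(telemetry_enabled())  # False after the opt-out above
```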
sanjana retweeted
Arize AI @arizeai
In case you missed it: we gathered with builders in NYC last night to dive into what really moves the needle after an agent is deployed. A lot of progress starts in production. This means learning from how your agent behaves in the wild and feeding those findings back into your development pipelines. We walked through how continuous evaluation surfaces failures, closes the feedback loop, and turns real insights into a better-performing agent. With Arize AX, online evals and real-time tracing make every production run part of an iterative cycle, helping teams ship smarter agents that are consistently evolving. Check out the video recording of the session: youtu.be/qF2XQ3WSyrE
sanjana retweeted
Srilakshmi Chavali @schavalii
Spoke at @arizeai’s AI Builder Meetup with @mastra a few weeks back & the talk is now live! I covered the basics of observability + evals, and showed via a Mastra agent how to set up observability, view traces, run evals, and kick-start your iteration cycle. Would love to hear how you're tackling this in your own workflows. If you're looking to add observability + evals, happy to help you get there 🚀 youtu.be/qQGQ9l7ddxE
sanjana retweeted
arize-phoenix @ArizePhoenix
🚀 New Feature Alert: Dataset Splits 🚀 Splits let you define named subsets of your dataset (e.g., train, hard_examples) and filter your experiments to run only on those subsets. Learn more and check out this walkthrough:
⚪️ Create a split directly in the Phoenix UI
⚪️ Fetch that split down into your code
⚪️ Run an experiment scoped to that subset
⚪️ Inspect how metrics shift when isolating difficult examples
👉 Full demo + code in the video below
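The walkthrough steps can be sketched without Phoenix installed; a toy Python model of the same idea, where split membership is just a tag on each example and the experiment runs only over the filtered subset (all names here are illustrative, not the Phoenix client API):

```python
# Toy dataset: each example carries the names of the splits it belongs to.
dataset = [
    {"id": 1, "input": "2+2", "expected": "4", "splits": {"train"}},
    {"id": 2, "input": "17*23", "expected": "391", "splits": {"train", "hard_examples"}},
    {"id": 3, "input": "sqrt(2) to 3 dp", "expected": "1.414", "splits": {"hard_examples"}},
]

def get_split(data, name):
    """Filter the dataset down to one named split, like fetching a split into code."""
    return [ex for ex in data if name in ex["splits"]]

def run_experiment(examples, task, evaluator):
    """Run the task on each example and score it; returns per-example results."""
    return [evaluator(task(ex), ex["expected"]) for ex in examples]

hard = get_split(dataset, "hard_examples")
# The "task" here just echoes the expected answer; a real task would call your agent.
results = run_experiment(hard, task=lambda ex: ex["expected"],
                         evaluator=lambda out, exp: out == exp)
print(len(hard), sum(results))  # experiment scoped to the 2 hard examples
```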
Daksh Gupta @dakshgup
sometimes instead of evals you need to just get the gang together for dinner to read agent traces on the big screen
sanjana retweeted
Aparna Dhinakaran @aparnadhinak
We optimized Claude Code's system prompt - just its prompt - and achieved a +10% boost on SWE-Bench. No architecture changes, tool improvements, or fine-tuning. Last time, we optimized Cline and saw a 15% boost on SWE-Bench, bringing GPT-4.1's accuracy up to Sonnet 4.5's, just through prompts. Now we've applied the same algorithm, Prompt Learning, to Claude Code. Just like we did with Cline, we only touched the CLAUDE.md file, optimizing Claude Code purely through custom instructions that are appended to Claude Code's system prompt. See our detailed blog post: arize.com/docs/phoenix/d… Or just read how we did it below 👇
sanjana retweeted
Arize AI @arizeai
We benchmarked Prompt Learning (our prompt optimizer) against GEPA and saw similar or better results in a fraction of the time. Since we launched Prompt Learning in July, the most common question we get is: “Prompt Learning or GEPA — which should I use?” So we re-created every GEPA benchmark, measured rollout efficiency, and compared the end-to-end user experience. Results: Prompt Learning matches GEPA's accuracy with far fewer rollouts. We break down the results and compare the full end-to-end user experience with both optimizers below 👇
sanjana @sanjanayed
If you are building agents with TypeScript... @mastra now integrates directly with @arizeai AX and @ArizePhoenix! This means your agent traces automatically stream into Arize AX with minimal extra setup, and you can start running evals right away. In the video, I walk through how to:
- Instrument your agents for tracing
- Define trace and span evaluations
- Configure online evals that run automatically as your agent executes
I’ve found this workflow to be a simple but powerful way to close the loop between building, observing, and improving agents. Docs and full video below 👇
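Conceptually, instrumenting an agent for tracing means wrapping each step so it emits a span; a dependency-free Python sketch of that idea (the real integration uses Mastra's Arize/Phoenix exporters, not this toy recorder):

```python
import functools
import time

SPANS = []  # stand-in for a trace backend such as Phoenix

def traced(name):
    """Record a span (name, duration, output) for every call to the wrapped step."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            result = fn(*args, **kwargs)
            SPANS.append({"name": name,
                          "seconds": time.perf_counter() - start,
                          "output": result})
            return result
        return wrapper
    return decorator

@traced("tool:lookup")
def lookup(city):
    return f"weather in {city}: sunny"

@traced("agent:answer")
def answer(question):
    return lookup("Berkeley")

answer("What's the weather?")
# Inner spans finish first, so the tool span precedes the agent span.
print([s["name"] for s in SPANS])  # ['tool:lookup', 'agent:answer']
```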
sanjana retweeted
Mikyo @mikeldking
“The purpose of abstracting is not to be vague, but to create a new semantic level in which one can be absolutely precise.” ― Edsger W. Dijkstra @ArizePhoenix
sanjana retweeted
Aparna Dhinakaran @aparnadhinak
We improved @cline, a popular open-source coding agent, by +15% accuracy on SWE-Bench — without retraining LLMs, changing tools, or modifying Cline's architecture. We achieved this simply by optimizing its ruleset in ./clinerules, a user-defined section where developers add custom instructions to the system prompt, just like .cursor/rules in Cursor or CLAUDE.md in Claude Code. Using our algorithm, Prompt Learning, we automatically refined these rules across a feedback loop powered by GPT-5. Here’s how we brought GPT-4.1’s performance on SWE-Bench Lite to near state-of-the-art levels — matching Claude Sonnet 4.5 — purely through ruleset optimization. See our detailed blog post 👉: arize.com/blog/optimizin…
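The Prompt Learning loop described here (evaluate, collect failures, let an LLM rewrite the rules, repeat) can be sketched with stubs; the agent and critic below stand in for Cline and GPT-5 and are purely illustrative:

```python
def run_agent(rules, case):
    """Stub: the agent succeeds only when the rules mention the case's pitfall."""
    return case["pitfall"] in rules

def critic(rules, failures):
    """Stub for the LLM critic: add one rule per observed failure mode."""
    new_rules = set(rules)
    for case in failures:
        new_rules.add(case["pitfall"])
    return new_rules

cases = [{"pitfall": "run the tests"}, {"pitfall": "read the stack trace"}]
rules = set()  # contents of .clinerules / CLAUDE.md, abstractly

for _ in range(3):  # feedback loop: evaluate, collect failures, rewrite rules
    failures = [c for c in cases if not run_agent(rules, c)]
    if not failures:
        break
    rules = critic(rules, failures)

print(sorted(rules), all(run_agent(rules, c) for c in cases))
```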
sanjana @sanjanayed
Hey Adam! I can help clarify; I did a lot of research for this piece :) Google & Anthropic evals rate themselves the highest in this experiment. The rows represent the agent (and which underlying LLM was used), while the columns represent which LLM was used to evaluate. For example, take the row using a Google model for the agent. The human judges scored its performance at 41%. But when we ask the same Google model to evaluate this agent, it inflates the score up to 72.5%!! This was the highest evaluation score the Google agent got across the board. Similarly, the Anthropic agent scored itself at 82.3%, higher than the 75%, 68.3%, and 76.6% it got from the other models performing the same evaluation task. Hope this makes sense! The actual write-up explains this well too.
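The comparison in this reply is simple arithmetic; using the Anthropic numbers quoted above (82.3 from itself vs. 75, 68.3, and 76.6 from the other evaluators), self-preference is the self-score minus the mean of the others' scores:

```python
def self_preference(self_score, other_scores):
    """How many points above the average outside evaluation the model rates itself."""
    return self_score - sum(other_scores) / len(other_scores)

# Anthropic agent, per the reply: 82.3 from itself vs. 75, 68.3, 76.6 from others.
bias = self_preference(82.3, [75.0, 68.3, 76.6])
print(round(bias, 2))  # 9.0 points of self-preference
```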
Adam Conway @adam__conway
@aparnadhinak Am I reading this correctly that 3 of the 4 rated models other than themselves highest (everyone but Google)?
Aparna Dhinakaran @aparnadhinak
We wanted to know if LLMs show “self-evaluation bias”: do they score their own outputs more favorably when acting as evaluators? We tested four LLMs from OpenAI, Google, Anthropic, and Qwen. Each model generated answers as an agent, and all four models then took turns evaluating those outputs. To ground the results, we also included human annotations as a baseline for comparison.
1️⃣ Hypothesis Test for Self-Evaluation Bias: Do evaluators rate their own outputs higher than others? Key takeaway: yes, all models tend to “like” their own work more. But this test alone can’t separate genuine quality from bias.
2️⃣ Human-Adjusted Bias Test: We aligned model scores against human judges to see if bias persisted after controlling for quality. This revealed that some models were neutral or even harsher on themselves, while others inflated their own outputs.
3️⃣ Agent Model Consistency: How stable were scores across evaluators and trials? Agent outputs that stayed closer to human scores, regardless of which evaluator was used, were more consistent. Anthropic came out as the most reliable here, showing tight agreement across evaluators.
The goal wasn’t to crown winners, but to show how evaluator bias can creep in and what to watch for when choosing a model for evaluation. TL;DR: evaluator bias is real. Sometimes it looks like inflation, sometimes harshness, and consistency varies by model. Regardless of which models you use, without human grounding and robustness checks, evals can be misleading.
All tracing and evaluation was done using @ArizePhoenix. Full write-up below. Thank you to @HamelHusain @sh_reya @eugeneyan for reviewing this! Also tagging people who I think would find this interesting: @PawelHuryn @hsu_steve @jaseweston @omarsar0 @OwainEvans_UK @andrewwhite01 @jmin__cho @lennysan @dcml0714 @HungyiLee2 @ansonwhho @lm_zheng @ChenguangZhu2 @benhylak @Alibaba_Qwen @JustinLin610 @TianbaoX @huybery
sanjana @sanjanayed
can’t improve (and ship) what you can’t see
Quoting Aparna Dhinakaran @aparnadhinak

@OpenAIDevs just dropped Agent Builder, making it easier than ever to spin up and deploy agents. But once you’ve built an agent, how do you actually understand what it’s doing? Because these agents are powered by the @OpenAI Agents SDK, they can be traced with @ArizePhoenix or @arizeai using just a few lines of code. Tracing your agents gives you full visibility into every step they take, beyond simple inputs and outputs. You can see the full reasoning chain, tool calls, and decisions your agent makes in real time. Simply copy and paste the code from the Agent Builder platform and connect to Phoenix to see traces populate for every call to your agent. From here, you can run evals on those traces using any provider to measure performance, reliability, and accuracy. Check out how to trace your agents in the video. More on running evals on these agents below 👇

sanjana retweeted
arize-phoenix @ArizePhoenix
We released repetitions in Phoenix last week to tackle a core challenge with LLMs: variability. A borderline input can flip classifications from run to run, making it hard to tell if a change is real or just noise. This cookbook shows a full use case you can run yourself: evaluating customer reviews with repetitions.
⚪️ Generate & label a dataset
⚪️ Run repeated evals per example
⚪️ See how results stabilize (or wobble) across runs
⚪️ Catch regressions that single-shot evals would miss
It’s a practical way to move from anecdotal “one run” impressions → more reliable model comparisons.
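The repetitions idea is easy to see in miniature: rerun a noisy judge several times per example and look at the vote spread instead of a single run (the coin-flip judge below is purely illustrative, not the cookbook's eval):

```python
import random
from collections import Counter

def flaky_judge(example, rng):
    """Stub eval: borderline inputs flip labels from run to run."""
    p_pass = 0.5 if example["borderline"] else 0.95
    return rng.random() < p_pass

def repeated_eval(example, repetitions, seed=0):
    """Tally pass/fail votes across repeated runs of the same example."""
    rng = random.Random(seed)
    return Counter(flaky_judge(example, rng) for _ in range(repetitions))

stable = repeated_eval({"borderline": False}, repetitions=20)
wobbly = repeated_eval({"borderline": True}, repetitions=20)
print(dict(stable), dict(wobbly))  # the borderline example's votes wobble
```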
sanjana retweeted
Aparna Dhinakaran @aparnadhinak
Coding agents like Cline don’t always need retraining to get smarter — sometimes all it takes is better prompts. We used Prompt Learning to optimize Cline’s Plan Mode on SWE-bench and saw big gains. To stay true to real developer workflows, we left Cline’s base system prompt untouched and focused on updating its rules instead. Just like .cursor/rules or CLAUDE.md, Cline has a user-defined rules section. We applied Prompt Learning to optimize this rules file and tracked the improvements. Here’s the process + results 👇
sanjana @sanjanayed
Sessions already give you the power to group traces together and understand them in context. This is critical for understanding how a user flows through your application. Now the @ArizePhoenix team brings annotations to sessions, paving the way for conversational evals like coherency and tone (or any custom criteria). Annotations made via the Phoenix Client surface directly in the sessions table, so there's no more digging into individual sessions to find them. This gives you better visibility when browsing in bulk: spot trends and issues faster across many sessions. You can also integrate this data into broader workflows that shape how you build your evals. All of this is available in Arize Phoenix 12.0+. Docs below 📖
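The change described above (annotations surfacing at the session level) can be modeled in a few lines of Python: group traces by session id, attach annotations to the session, and build the summary table from that grouping (illustrative only; the real feature lives in the Phoenix client and UI):

```python
from collections import defaultdict

traces = [
    {"session_id": "s1", "input": "hi", "output": "hello!"},
    {"session_id": "s1", "input": "help me", "output": "sure"},
    {"session_id": "s2", "input": "bye", "output": "goodbye"},
]
annotations = [
    {"session_id": "s1", "name": "coherency", "score": 0.9},
    {"session_id": "s2", "name": "tone", "score": 0.4},
]

# Group traces into sessions, then attach session-level annotations,
# so the "sessions table" can show them without opening each session.
sessions = defaultdict(lambda: {"traces": [], "annotations": []})
for t in traces:
    sessions[t["session_id"]]["traces"].append(t)
for a in annotations:
    sessions[a["session_id"]]["annotations"].append(a)

for sid, s in sorted(sessions.items()):
    labels = ", ".join(f'{a["name"]}={a["score"]}' for a in s["annotations"])
    print(sid, len(s["traces"]), "traces |", labels)
```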