Comet
@Cometml

3.5K posts

Comet provides an end-to-end model evaluation platform for AI developers, with best in class LLM evaluations, experiment tracking, and production monitoring

New York, NY · Joined October 2017
879 Following · 15.1K Followers
Comet retweeted
Paul Iusztin @pauliusztin_
I just interviewed the former CTO at IBM and Chairperson of NodeJS. Here's what I learned:

Michael @maximilien spent 12 months shipping production RAG to multiple customers. In our discussion, he told me that nothing on a leaderboard can predict what works until you evaluate on your customers' data. Which I found interesting because...

Most teams treat RAG like a setup task. Pick a vector database. Pick OpenAI embeddings. Ship it. Then spend months "vibe-checking" results.

But production RAG doesn't work like that. It's an iteration loop rather than a setup problem: stitch → evaluate → iterate. A real system has multiple moving parts. You don't pick one... you swap and measure each one.

Here's what that looks like in practice:
1. Build a small eval set from real user questions
2. Build your evaluator (e.g., LLM judge) against that dataset
3. Align your evaluator with human feedback (before trusting scores)
4. Iterate cheapest-first (retrieval → embeddings → infra)

To make this work, you also need visibility across runs. This is where tools like Opik by @Cometml come in... tracking each experiment so you can compare models, configs, and results over time.

But most teams refuse to do this because it's extremely cumbersome:
• Re-ingestion takes time
• Pipelines break
• Comparisons become unreliable

So people default to benchmarks instead. But that doesn't mean benchmarks are better. On a real customer dataset (auction listings), Michael @maximilien swapped only the embedding model. An open-source model ranked #130 on MTEB beat OpenAI:
• +11% quality
• 240x faster re-embedding
• 50% smaller vectors
• $0 cost

Here's the gist... RAG is not about picking the best tools. It's about measuring what works for your data. Until you do that, you're just guessing.

Full interview and breakdown here: decodingai.com/p/ship-rag-wit…
Paul Iusztin tweet media
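The four-step loop in the thread can be sketched in plain Python. This is a minimal illustration, not Opik's actual API: the "judge" is a stubbed keyword-overlap scorer standing in for an LLM judge aligned with human feedback, and the two pipelines are hypothetical configs being swapped and measured.

```python
# Minimal sketch of the stitch -> evaluate -> iterate loop.
# The judge is a stand-in overlap scorer; in practice you would
# use an LLM judge validated against human feedback first.

def judge(answer: str, reference: str) -> float:
    """Score an answer 0..1 by token overlap with a reference (stub)."""
    ans, ref = set(answer.lower().split()), set(reference.lower().split())
    return len(ans & ref) / len(ref) if ref else 0.0

def evaluate(pipeline, eval_set):
    """Run a RAG pipeline over a small eval set and average judge scores."""
    scores = [judge(pipeline(q), ref) for q, ref in eval_set]
    return sum(scores) / len(scores)

# 1. Small eval set built from real user questions
eval_set = [
    ("what lens mount does the M3 use", "the leica m3 uses the m bayonet mount"),
    ("when was the m3 released", "the leica m3 was released in 1954"),
]

# 4. Iterate cheapest-first: swap one component, re-measure, keep the winner
def pipeline_a(q): return "the leica m3 uses the m bayonet mount"
def pipeline_b(q): return "it has a screw mount"

best = max([pipeline_a, pipeline_b], key=lambda p: evaluate(p, eval_set))
print(best.__name__)  # the config that scores highest on YOUR data wins
```

The point of the sketch is the shape of the loop: a fixed eval set, a trusted scorer, and one component swapped at a time so each comparison is attributable.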
Comet @Cometml
"Until you evaluate on your data, nothing else matters."
Paul Iusztin@pauliusztin_

I've spent the last week interviewing @maximilien, former CTO at IBM and Chairperson of the NodeJS Foundation, who has shipped production RAG to multiple customers over the past year. The lesson he kept circling back to is that until you evaluate on your customer's data, nothing else you do matters.

Production RAG is a loop: stitch your embedding model, chunking, retrieval, vector DB, and judge, then evaluate and iterate until you hit your customer's metrics. Public benchmarks and the MTEB leaderboard are signals, not verdicts.

On a real customer dataset of Leica auction listings, an open-source sentence-transformer that ranked around #130 on MTEB still beat OpenAI by 11% in quality. It ran 240x faster, produced 50% smaller vectors, and cost $0.
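Comparing embedding models on your own data, rather than by leaderboard rank, reduces to computing a retrieval metric like recall@1 per model. A hedged sketch with toy bag-of-words "embedders" (real models would be swapped in; all names and the tiny corpus are hypothetical):

```python
# Sketch: compare embedding models on YOUR dataset via recall@1,
# instead of trusting leaderboard rank. The embedder here is a toy
# bag-of-words stub; swap in real models to run this comparison for real.
import math
from collections import Counter

def embed_bow(text: str) -> Counter:
    """Toy embedder: bag-of-words counts stand in for a real model."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def recall_at_1(embed, queries, corpus, relevant):
    """Fraction of queries whose top-1 retrieved doc is the relevant one."""
    doc_vecs = {doc: embed(doc) for doc in corpus}
    hits = 0
    for q in queries:
        qv = embed(q)
        top = max(corpus, key=lambda d: cosine(qv, doc_vecs[d]))
        hits += top == relevant[q]
    return hits / len(queries)

corpus = [
    "leica m3 rangefinder camera 1954",
    "nikon f slr camera 1959",
]
relevant = {"leica m3 price": "leica m3 rangefinder camera 1954"}
r1 = recall_at_1(embed_bow, list(relevant), corpus, relevant)
print(r1)
```

Run the same `recall_at_1` for each candidate model on the same queries, and the winner on your data, not on MTEB, is the one to ship.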

Comet retweeted
Gideon M @gidim
As your agent matures, something shifts. You stop writing code, and start editing prompts, tweaking params, trying new tools, etc. The tooling for this phase sucks. Today, we’re fixing that. Announcing Agent Configuration + Agent Playground in Opik. 🧵
Gideon M tweet media
Comet retweeted
Gideon M @gidim
Shared by a customer: Ollie just made their Slack bot 52% faster and 98% cheaper. With test suites, there were no regressions either.
Gideon M tweet media
Comet @Cometml
We're launching the Agent Playground so you can test your full agent configuration from the UI. Tweak prompts and swap models without touching your code. See how the entire agent responds and only save what works. comet.com/site/blog/end-…
Comet @Cometml
Third and final day of "What we've been building" launch week: Agent Playground. Your agent isn't just one prompt; it's a complex system of models and parameters working together. It's time for a workflow that treats it as such.
Comet tweet media
Python Space @python_spaces
Meet Ollie, Opik's new AI coding assistant that enables self-improving agents by:
- analyzing execution traces
- evaluating performance
- directly editing connected local codebases

Ollie integrates observability data with code access to:
- read files
- propose targeted changes like new functions
- update agent graphs
- generate regression tests within the Opik UI

This launch supports shifting agent development workflows from traditional IDEs to trace-centric environments, for faster iteration and verification of improvements.
Gideon M@gidim

Self-improving agents are going to require a few things:
- A memory of past traces
- A way to evaluate their trajectories
- The ability to edit their own code

So far, Opik has focused on the first two points. Now, we're solving the third. 🧵
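The three requirements can be sketched as a toy loop: record traces, score the trajectories, and flag the worst ones as candidates for a code edit. Every name here is hypothetical; Opik's real SDK works differently — this only illustrates the shape of the idea.

```python
# Toy sketch of the three ingredients: trace memory, trajectory
# evaluation, and surfacing candidates for (self-)editing.
# All names are hypothetical; this is not Opik's API.
import functools

TRACES = []  # 1. memory of past traces

def traced(fn):
    """Record every call's input/output so it can be evaluated later."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        out = fn(*args, **kwargs)
        TRACES.append({"fn": fn.__name__, "args": args, "output": out})
        return out
    return wrapper

def score(trace) -> float:
    """2. Evaluate a trajectory (stub: penalize empty outputs)."""
    return 0.0 if not trace["output"] else 1.0

@traced
def agent_step(query: str) -> str:
    return "" if "unsupported" in query else f"answer to {query}"

agent_step("what is opik")
agent_step("unsupported request")

# 3. Flag the worst-scoring traces as candidates for a code fix + test
failures = [t for t in TRACES if score(t) == 0.0]
print(len(failures))  # traces needing a fix
```

The third ingredient — the coding assistant actually editing the codebase — is what the announcements below describe; here it is represented only by the `failures` list a human or assistant would act on.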

Hasan Toor @hasantoxr
Opik just launched its most ambitious feature yet: Ollie, an AI coding assistant built specifically for agents. Ollie connects directly to your codebase, analyzes massive agent traces, debugs real-world issues, generates tests, and ships improvements automatically.
Gideon M @gidim (quoting the self-improving agents thread above)

Comet @Cometml
@itsjasonai Can't wait to see what you build!
Jason Nguyen @jasonngsx
Comet just shipped Ollie, a coding agent that lives inside your Opik instance. Full access to your agent's trace history, failure patterns, and codebase. Your IDE doesn't have any of that. You have to brief your coding assistant on what broke. Ollie already knows.
Gideon M @gidim (quoting the self-improving agents thread above)

Comet @Cometml
It's his first week in the office, so say hi if you see him around 👋 A research preview is available in Opik Cloud. Sign up for early access: comet.com/site/products/…
Comet @Cometml
Ollie lives in the Opik UI with full context of your agent. When you spot a problem, he diagnoses it, writes the fix, ships it to your IDE, and adds a test case so it doesn't come back.
Comet @Cometml
Second day of "What we've been building" launch week: meet Ollie 🦉 You may have already seen Ollie around as our mascot. Today he's also joining the team as our new coding assistant.
Comet tweet media
Comet retweeted
Gideon M @gidim
The big idea with Test Suites is that agents need comprehensive regression tests, built on nuanced assertions and real production traces. This is how you improve your agent for one user without damaging it for 3 others, as explained by @JacquesVerre youtube.com/watch?v=lt5iQ-…
Comet @Cometml
Your suite grows as you build. Every failure you catch becomes a test case. Each failed test tells you what needs to be fixed. Available in the open-source instance. Take a first look: comet.com/site/blog/ai-a…
Comet @Cometml
Test Suites change that. Describe how your agent should behave using rules written in plain English and get clear pass/fail results when you run tests.
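Conceptually, a plain-English rule compiled into a pass/fail check might look like the stub below. This is a hypothetical sketch, not Opik Test Suites' implementation: real test suites would use an LLM judge for nuanced assertions, while this toy only handles "must mention X" / "must not mention X" rules via regex.

```python
# Hypothetical sketch: turn plain-English behavior rules into
# pass/fail regression checks against recorded agent outputs.
# Real implementations use an LLM judge; this stub only supports
# "must mention X" and "must not mention X".
import re

def compile_rule(rule: str):
    """Compile a tiny rule grammar into a checker function."""
    m = re.fullmatch(r"must (not )?mention (\w+)", rule.lower())
    if not m:
        raise ValueError(f"unsupported rule: {rule}")
    negated, word = bool(m.group(1)), m.group(2)
    def check(output: str) -> bool:
        mentioned = word in output.lower()
        return not mentioned if negated else mentioned
    return check

def run_suite(rules, output):
    """Return {rule: pass/fail} for one agent output."""
    return {r: compile_rule(r)(output) for r in rules}

results = run_suite(
    ["must mention refund", "must not mention discount"],
    "A refund has been issued to your card.",
)
print(results)
```

The workflow the tweets describe then follows naturally: each failure you catch in production becomes another rule in the suite, and a failing check points at exactly what to fix.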
Comet @Cometml
Day 1 of "What we've been building": Test Suites Most agent testing feels like a chore because it starts with a blank CSV. You're forced to invent a dataset before you even know how your agent fails.
Comet tweet media
Comet @Cometml
@DnuLkjkjh Opik is designed to be flexible. Depending on which framework you're using, there are multiple ways you might track state. If you're struggling to debug issues with handoffs, what we're launching tomorrow may be of interest to you. Check out the docs: comet.com/docs/opik/
dnu @DnuLkjkjh
@Cometml the hardest part for me is state handoff, not traces. knowing what changed, what was approved, and what can be retried cleanly is where agent runs still get messy. are you treating that as workflow state or prompt state?
Comet @Cometml
We’ve been a bit quiet lately 👀 Mostly because we’ve been heads-down rethinking what a "2026 agent workflow" actually looks like.
Comet @Cometml
The dark ages of agent development, where you spend more time copy-pasting traces than actually fixing code, are ending. Starting tomorrow, we're sharing what we've been building in our first ever launch week.