Comet
@Cometml

3.5K posts

Comet provides an end-to-end model evaluation platform for AI developers, with best in class LLM evaluations, experiment tracking, and production monitoring

New York, NY · Joined October 2017
879 Following · 15.1K Followers
Comet retweeted
Paul Iusztin @pauliusztin_
I just interviewed the former CTO at IBM and Chairperson of NodeJS. Here's what I learned:

Michael @maximilien spent 12 months shipping production RAG to multiple customers. In our discussion, he told me that nothing on a leaderboard can predict what works until you evaluate on your customers' data. Which I found interesting because...

Most teams treat RAG like a setup task. Pick a vector database. Pick OpenAI embeddings. Ship it. Then spend months "vibe-checking" results.

But production RAG doesn't work like that. It's an iteration loop rather than a setup problem: stitch → evaluate → iterate. A real system has multiple moving parts. You don't pick one... you swap and measure each one.

Here's what that looks like in practice:
1. Build a small eval set from real user questions
2. Build your evaluator (e.g., LLM judge) against that dataset
3. Align your evaluator with human feedback (before trusting scores)
4. Iterate cheapest-first (retrieval → embeddings → infra)

To make this work, you also need visibility across runs. This is where tools like Opik by @Cometml come in... tracking each experiment so you can compare models, configs, and results over time.

But most teams refuse to do this because it's extremely cumbersome:
• Re-ingestion takes time
• Pipelines break
• Comparisons become unreliable

So people default to benchmarks instead. But that doesn't mean benchmarks are better. On a real customer dataset (auction listings), Michael @maximilien swapped only the embedding model. An open-source model ranked #130 on MTEB beat OpenAI:
• +11% quality
• 240x faster re-embedding
• 50% smaller vectors
• $0 cost

Here's the gist... RAG is not about picking the best tools. It's about measuring what works for your data. Until you do that, you're just guessing.

Full interview and breakdown here: decodingai.com/p/ship-rag-wit…
Paul Iusztin tweet media
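The four-step loop in the thread can be sketched in plain Python. This is a minimal illustration, not Opik's actual API: the "judge" is a stubbed keyword-overlap scorer standing in for an LLM judge aligned with human feedback, and the two pipelines are hypothetical configs being swapped and measured.

```python
# Minimal sketch of the stitch -> evaluate -> iterate loop.
# The judge is a stand-in overlap scorer; in practice you would
# use an LLM judge validated against human feedback first.

def judge(answer: str, reference: str) -> float:
    """Score an answer 0..1 by token overlap with a reference (stub)."""
    ans, ref = set(answer.lower().split()), set(reference.lower().split())
    return len(ans & ref) / len(ref) if ref else 0.0

def evaluate(pipeline, eval_set):
    """Run a RAG pipeline over a small eval set and average judge scores."""
    scores = [judge(pipeline(q), ref) for q, ref in eval_set]
    return sum(scores) / len(scores)

# 1. Small eval set built from real user questions
eval_set = [
    ("what lens mount does the M3 use", "the leica m3 uses the m bayonet mount"),
    ("when was the m3 released", "the leica m3 was released in 1954"),
]

# 4. Iterate cheapest-first: swap one component, re-measure, keep the winner
def pipeline_a(q): return "the leica m3 uses the m bayonet mount"
def pipeline_b(q): return "it has a screw mount"

best = max([pipeline_a, pipeline_b], key=lambda p: evaluate(p, eval_set))
print(best.__name__)  # the config that scores highest on YOUR data wins
```

The point of the sketch is the shape of the loop: a fixed eval set, a trusted scorer, and one component swapped at a time so each comparison is attributable.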
Comet @Cometml
"Until you evaluate on your data, nothing else matters."
Paul Iusztin@pauliusztin_

I've spent the last week interviewing @maximilien, former CTO at IBM and Chairperson of the NodeJS Foundation, who has shipped production RAG to multiple customers over the past year. The lesson he kept circling back to is that until you evaluate on your customer's data, nothing else you do matters.

Production RAG is a loop: stitch your embedding model, chunking, retrieval, vector DB, and judge, then evaluate and iterate until you hit your customer's metrics. Public benchmarks and the MTEB leaderboard are signals, not verdicts.

On a real customer dataset of Leica auction listings, an open-source sentence-transformer that ranked around #130 on MTEB still beat OpenAI by 11% in quality. It ran 240x faster, produced 50% smaller vectors, and cost $0.
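Comparing embedding models on your own data, rather than by leaderboard rank, reduces to computing a retrieval metric like recall@1 per model. A hedged sketch with toy bag-of-words "embedders" (real models would be swapped in; all names and the tiny corpus are hypothetical):

```python
# Sketch: compare embedding models on YOUR dataset via recall@1,
# instead of trusting leaderboard rank. The embedder here is a toy
# bag-of-words stub; swap in real models to run this comparison for real.
import math
from collections import Counter

def embed_bow(text: str) -> Counter:
    """Toy embedder: bag-of-words counts stand in for a real model."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def recall_at_1(embed, queries, corpus, relevant):
    """Fraction of queries whose top-1 retrieved doc is the relevant one."""
    doc_vecs = {doc: embed(doc) for doc in corpus}
    hits = 0
    for q in queries:
        qv = embed(q)
        top = max(corpus, key=lambda d: cosine(qv, doc_vecs[d]))
        hits += top == relevant[q]
    return hits / len(queries)

corpus = [
    "leica m3 rangefinder camera 1954",
    "nikon f slr camera 1959",
]
relevant = {"leica m3 price": "leica m3 rangefinder camera 1954"}
r1 = recall_at_1(embed_bow, list(relevant), corpus, relevant)
print(r1)
```

Run the same `recall_at_1` for each candidate model on the same queries, and the winner on your data, not on MTEB, is the one to ship.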

Comet retweeted
Gideon M @gidim
As your agent matures, something shifts. You stop writing code, and start editing prompts, tweaking params, trying new tools, etc. The tooling for this phase sucks. Today, we’re fixing that. Announcing Agent Configuration + Agent Playground in Opik. 🧵
Gideon M tweet media
Comet retweeted
Gideon M @gidim
Shared by a customer: Ollie just made their Slack bot 52% faster and 98% cheaper. With test suites, there were no regressions either.
Gideon M tweet media
Comet @Cometml
We're launching the Agent Playground so you can test your full agent configuration from the UI. Tweak prompts and swap models without touching your code. See how the entire agent responds and only save what works. comet.com/site/blog/end-…
Comet @Cometml
Third and final day of "What we've been building" launch week: Agent Playground. Your agent isn't just one prompt; it's a complex system of models and parameters working together. It's time for a workflow that treats it as such.
Comet tweet media
Python Space @python_spaces
Meet Ollie, Opik's new AI coding assistant that enables self-improving agents by:
- analyzing execution traces
- evaluating performance
- directly editing connected local codebases

Ollie integrates observability data with code access to:
- read files
- propose targeted changes like new functions
- update agent graphs
- generate regression tests within the Opik UI

This launch supports shifting agent development workflows from traditional IDEs to trace-centric environments, for faster iteration and verification of improvements.
Gideon M@gidim

Self-improving agents are going to require a few things:
- A memory of past traces
- A way to evaluate their trajectories
- The ability to edit their own code

So far, Opik has focused on the first two points. Now, we're solving the third. 🧵
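The three requirements can be sketched as a toy loop: record traces, score the trajectories, and flag the worst ones as candidates for a code edit. Every name here is hypothetical; Opik's real SDK works differently — this only illustrates the shape of the idea.

```python
# Toy sketch of the three ingredients: trace memory, trajectory
# evaluation, and surfacing candidates for (self-)editing.
# All names are hypothetical; this is not Opik's API.
import functools

TRACES = []  # 1. memory of past traces

def traced(fn):
    """Record every call's input/output so it can be evaluated later."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        out = fn(*args, **kwargs)
        TRACES.append({"fn": fn.__name__, "args": args, "output": out})
        return out
    return wrapper

def score(trace) -> float:
    """2. Evaluate a trajectory (stub: penalize empty outputs)."""
    return 0.0 if not trace["output"] else 1.0

@traced
def agent_step(query: str) -> str:
    return "" if "unsupported" in query else f"answer to {query}"

agent_step("what is opik")
agent_step("unsupported request")

# 3. Flag the worst-scoring traces as candidates for a code fix + test
failures = [t for t in TRACES if score(t) == 0.0]
print(len(failures))  # traces needing a fix
```

The third ingredient — the coding assistant actually editing the codebase — is what the announcements below describe; here it is represented only by the `failures` list a human or assistant would act on.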

Hasan Toor @hasantoxr
Opik just launched its most ambitious feature yet: Ollie, an AI coding assistant built specifically for agents. Ollie connects directly to your codebase, analyzes massive agent traces, debugs real-world issues, generates tests, and ships improvements automatically.
Gideon M @gidim (quoting the self-improving agents thread above)

Comet @Cometml
@itsjasonai Can't wait to see what you build!
Jason Nguyen @jasonngsx
Comet just shipped Ollie, a coding agent that lives inside your Opik instance. Full access to your agent's trace history, failure patterns, and codebase. Your IDE doesn't have any of that. You have to brief your coding assistant on what broke. Ollie already knows.
Gideon M @gidim (quoting the self-improving agents thread above)

Comet @Cometml
It's his first week in the office, so say hi if you see him around 👋 A research preview is available in Opik Cloud. Sign up for early access: comet.com/site/products/…
Comet @Cometml
Ollie lives in the Opik UI with full context of your agent. When you spot a problem, he diagnoses it, writes the fix, ships it to your IDE, and adds a test case so it doesn't come back.
Comet @Cometml
Second day of "What we've been building" launch week: meet Ollie 🦉 You may have already seen Ollie around as our mascot. Today he's also joining the team as our new coding assistant.
Comet tweet media
Comet retweeted
Gideon M @gidim
The big idea with Test Suites is that agents need comprehensive regression tests, built on nuanced assertions and real production traces. This is how you improve your agent for one user without damaging it for 3 others, as explained by @JacquesVerre youtube.com/watch?v=lt5iQ-…
Comet @Cometml
Your suite grows as you build. Every failure you catch becomes a test case. Each failed test tells you what needs to be fixed. Available in the open-source instance. Take a first look: comet.com/site/blog/ai-a…
Comet @Cometml
Test Suites change that. Describe how your agent should behave using rules written in plain English and get clear pass/fail results when you run tests.
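Conceptually, a plain-English rule compiled into a pass/fail check might look like the stub below. This is a hypothetical sketch, not Opik Test Suites' implementation: real test suites would use an LLM judge for nuanced assertions, while this toy only handles "must mention X" / "must not mention X" rules via regex.

```python
# Hypothetical sketch: turn plain-English behavior rules into
# pass/fail regression checks against recorded agent outputs.
# Real implementations use an LLM judge; this stub only supports
# "must mention X" and "must not mention X".
import re

def compile_rule(rule: str):
    """Compile a tiny rule grammar into a checker function."""
    m = re.fullmatch(r"must (not )?mention (\w+)", rule.lower())
    if not m:
        raise ValueError(f"unsupported rule: {rule}")
    negated, word = bool(m.group(1)), m.group(2)
    def check(output: str) -> bool:
        mentioned = word in output.lower()
        return not mentioned if negated else mentioned
    return check

def run_suite(rules, output):
    """Return {rule: pass/fail} for one agent output."""
    return {r: compile_rule(r)(output) for r in rules}

results = run_suite(
    ["must mention refund", "must not mention discount"],
    "A refund has been issued to your card.",
)
print(results)
```

The workflow the tweets describe then follows naturally: each failure you catch in production becomes another rule in the suite, and a failing check points at exactly what to fix.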
Comet @Cometml
Day 1 of "What we've been building": Test Suites Most agent testing feels like a chore because it starts with a blank CSV. You're forced to invent a dataset before you even know how your agent fails.
Comet tweet media
Comet @Cometml
@DnuLkjkjh Opik is designed to be flexible. Depending on which framework you're using, there are multiple ways you might track state. If you're struggling to debug issues with handoffs, what we're launching tomorrow may be of interest to you. Check out the docs: comet.com/docs/opik/
dnu @DnuLkjkjh
@Cometml the hardest part for me is state handoff, not traces. knowing what changed, what was approved, and what can be retried cleanly is where agent runs still get messy. are you treating that as workflow state or prompt state?
Comet @Cometml
We’ve been a bit quiet lately 👀 Mostly because we’ve been heads-down rethinking what a "2026 agent workflow" actually looks like.
Comet @Cometml
The dark ages of agent development, where you spend more time copy-pasting traces than actually fixing code, are ending. Starting tomorrow, we're sharing what we've been building in our first ever launch week.