Confident AI

185 posts

@confident_ai

AI Observability & Evaluation Platform. Creator of @deepeval, the open-source LLM evaluation framework: https://t.co/STz0iBuGKs

San Francisco · Joined October 2024
31 Following · 364 Followers
Confident AI @confident_ai
We just launched dataset generation on Confident AI — connect Google Drive, SharePoint, Notion, or S3 and generate eval datasets directly from your docs. That's a wrap on Launch Week! 5 days, 5 launches.
Confident AI @confident_ai
We’re seeing a consistent pattern: agents pass evals and still fail in production. The gap is almost always the same → outputs are tested, behavior isn’t. If you want us to take a look at your setup: forms.gle/K8L2cQF2inzCfQ… #AIEvals #LLMEvals #AIEngineering #AgentOps #MLOps
Brian Neville-O'Neill @bnevilleoneill

I’m starting to look at how teams are setting up evals and where they’re missing failure modes like this. If you want me to take a look at yours: forms.gle/K8L2cQF2inzCfQ… #AIEvals #LLMEvals #AIEngineering #AgentOps #MLOps

Confident AI @confident_ai
We're launching Auto-Categorize Traces & Threads — Day 4 of our Launch Week! Every production trace gets categorized automatically, so you can detect response drift and see exactly which areas your agent crushes and which ones need work. Full post: confident-ai.com/blog/launch-we…
Confident AI @confident_ai
Day 3 of launch week: auto trace-to-dataset ingestion. Set a rule once — production traces continuously flow into eval datasets and annotation queues. No scripts, no stale data. Post: confident-ai.com/blog/launch-we…
Confident AI retweeted
Brian Neville-O'Neill @bnevilleoneill
I’m running LLM eval office hours today with @confident_ai 🧪 If you’re building anything with AI, drop a prompt + model output, and I’ll show where it breaks. I’ll look at correctness, completeness, and where it might fail in real use. Just quick, specific feedback. #ai #LLM
Confident AI @confident_ai
Confident AI Launch Week Day 2: Scheduled Evals ⏰ Everyone agrees to run evals every few days. Nobody actually does. Now you set a frequency, configure your mappings, and evals run themselves. Link here: confident-ai.com/blog/launch-we…
Confident AI retweeted
DeepEval @deepeval
My sister just got released: DeepTeam v1.0, 100% open-source, Apache 2.0 red teaming for LLMs. ⭐ Star on GitHub to stay on top of the latest developments in AI security and safety: github.com/confident-ai/d…
Confident AI retweeted
Kritin Vongthongsri @kritinv07
Most people run single-turn evals on chatbots. But that’s not enough. Conversations aren’t Q&A; they happen over multiple turns. This means your chatbot must stay context-aware across the dialogue, not just accurate in isolated responses. At @deepeval, we’ve seen too many teams evaluate chatbots the wrong way. So we wrote a comprehensive guide on how to evaluate chatbots properly, end-to-end.👇 🔗 deepeval.com/docs/getting-s…
Confident AI retweeted
Kritin Vongthongsri @kritinv07
At @confident_ai, we’re focused on making evals great. But since we love our users very much, we’ve also just 5×’d the tracing analytics on our platform. Now you can:
🔍 Trace analytics: follow every request end-to-end
⏱️ Span analytics: see latency and cost per component
📊 Model analytics: compare performance, latency, and cost across models
👥 User analytics: understand usage patterns and behavior
⚠️ Error analytics: track and reduce failures over time
Confident AI retweeted
Avi Chawla @_avichawla
GPT-5 is OpenAI's most recent and powerful reasoning LLM. Today, let's build a pipeline to compare it against Grok 4 using:
- @deepeval to build the eval pipeline (open-source).
- @openrouter API to access both the LLMs.
Let's dive in!
Confident AI retweeted
Avi Chawla @_avichawla
Let's compare GPT-5 and Grok 4 on reasoning tasks:
Confident AI retweeted
Jeffrey 🐬 confident-ai.com
@_avichawla Author of @deepeval here, thanks for the mention! Since these scores are so similar, an even better benchmark is to pick which one sounds less like AI. We have an "ARENA" metric that lets you pick the "winner" of two LLM outputs (in this case Grok and GPT-5).
Confident AI retweeted
Kritin Vongthongsri @kritinv07
🚀 @deepeval just hit 10,000 stars on GitHub. Next stop: 100k ⭐
Confident AI retweeted
Kritin Vongthongsri @kritinv07
We've built a @langchain integration at @confident_ai so you can evaluate your entire agent trace in one extra line of code. ... ok maybe 2 lines of code.