Confident AI

185 posts

@confident_ai

AI Observability & Evaluation Platform. Creator of @deepeval, the open-source LLM evaluation framework: https://t.co/STz0iBuGKs

San Francisco · Joined October 2024

31 Following · 364 Followers

Pinned Tweet

Confident AI@confident_ai·Oct 17

💥 After reaching nearly 200k downloads a month🤯 and 3.3K⭐GitHub stars on DeepEval (github.com/confident-ai/d……), here are the top takeaways and learnings we have to share for evaluating LLM applications... #deepeval #llmevaluation #llms #opensource A thread 🧵👇(1/10) :

9.9K

Confident AI@confident_ai·4d

Launch post link: confident-ai.com/blog/launch-we…

Confident AI@confident_ai·4d

We just launched dataset generation on Confident AI — connect Google Drive, SharePoint, Notion, or S3 and generate eval datasets directly from your docs. That's a wrap on Launch Week! 5 days, 5 launches.

159

Confident AI@confident_ai·4d

We’re seeing a consistent pattern: agents pass evals and still fail in production. The gap is almost always the same → outputs are tested, behavior isn’t. If you want us to take a look at your setup: forms.gle/K8L2cQF2inzCfQ… #AIEvals #LLMEvals #AIEngineering #AgentOps #MLOps

Brian Neville-O'Neill@bnevilleoneill

I’m starting to look at how teams are setting up evals and where they’re missing failure modes like this. If you want me to take a look at yours: forms.gle/K8L2cQF2inzCfQ… #AIEvals #LLMEvals #AIEngineering #AgentOps #MLOps

Confident AI@confident_ai·Apr 2

We're launching Auto-Categorize Traces & Threads — Day 4 of our Launch Week! Every production trace gets categorized automatically, so you can detect response drift and see exactly which areas your agent crushes and which ones need work. Full post: confident-ai.com/blog/launch-we…

117

Confident AI@confident_ai·Apr 1

Day 3 of launch week: auto trace-to-dataset ingestion. Set a rule once — production traces continuously flow into eval datasets and annotation queues. No scripts, no stale data. Post: confident-ai.com/blog/launch-we…

161

Confident AI retweeted

Brian Neville-O'Neill@bnevilleoneill·Apr 1

I’m running LLM eval office hours today with @confident_ai 🧪 If you’re building anything with AI, drop a prompt + model output, and I’ll show where it breaks. I’ll look at: correctness, completeness, and where it might fail in real use. Just quick, specific feedback #ai #LLM

139

Confident AI@confident_ai·Mar 31

Confident AI Launch Week Day 2: Scheduled Evals ⏰ Everyone agrees to run evals every few days. Nobody actually does. Now you set a frequency, configure your mappings, and evals run themselves. Link here: confident-ai.com/blog/launch-we…

460

Confident AI@confident_ai·Mar 30

Announcing our Q1 Launch Week! Day 1: Automated Error Analysis. Link here: confident-ai.com/blog/launch-we…

426

Confident AI retweeted

DeepEval@deepeval·Nov 12

My sister just got released, DeepTeam v1.0, 100% open-source, Apache 2.0 red teaming for LLMs. ⭐ Star on GitHub to stay on top of the latest developments in AI security and safety: github.com/confident-ai/d…

927

Confident AI retweeted

Jeffrey 🐬 confident-ai.com@jeffr_yyy·Sep 8

decision-based LLM-as-a-judge for multi-turn use cases is here @deepeval Docs: deepeval.com/docs/metrics-c…

559

Confident AI retweeted

Kritin Vongthongsri@kritinv07·Aug 25

Making it so easy to view and evaluate threads/conversations on @confident_ai.

636

Confident AI retweeted

Kritin Vongthongsri@kritinv07·Aug 18

We're cooking up 👨‍🍳 something for our @confident_ai users...

431

Confident AI retweeted

Kritin Vongthongsri@kritinv07·Aug 18

Most people run single-turn evals on chatbots. But that’s not enough. Conversations aren’t Q&A; they happen over multiple turns. This means your chatbot must stay context-aware across the dialogue, not just accurate in isolated responses. At @deepeval, we’ve seen too many teams evaluate chatbots the wrong way. So, we wrote a comprehensive guide on how to evaluate all chatbots properly, end-to-end.👇 🔗 deepeval.com/docs/getting-s…

442

Confident AI retweeted

Jeffrey 🐬 confident-ai.com@jeffr_yyy·Aug 12

ten kay stars @deepeval

456

Confident AI retweeted

Kritin Vongthongsri@kritinv07·Aug 12

At @confident_ai, we’re focused on making evals great. But since we love our users very much, we’ve also just 5×’d the tracing analytics on our platform. Now you can:
🔍 Trace analytics: follow every request end-to-end
⏱️ Span analytics: see latency and cost per component
📊 Model analytics: compare performance, latency, and cost across models
👥 User analytics: understand usage patterns and behavior
⚠️ Error analytics: track and reduce failures over time

497

Confident AI retweeted

Avi Chawla@_avichawla·Aug 9

GPT-5 is OpenAI's most recent and powerful reasoning LLM. Today, let's build a pipeline to compare it against Grok 4 using:
- @deepeval to build the eval pipeline (open-source).
- @openrouter API to access both the LLMs.
Let's dive in!

743

Confident AI retweeted

Avi Chawla@_avichawla·Aug 9

Let's compare GPT-5 and Grok 4 on reasoning tasks:

11.2K

Confident AI retweeted

Jeffrey 🐬 confident-ai.com@jeffr_yyy·Aug 10

@_avichawla Author of @deepeval here, thanks for the mention! An even better benchmark, since these scores are so similar, is to actually pick which one sounds less AI. We have an "ARENA" metric that allows you to pick the "winner" of two LLM outputs (in this case Grok and GPT-5)

335

Confident AI retweeted

Kritin Vongthongsri@kritinv07·Aug 8

🚀 @deepeval just hit 10,000 stars on GitHub. Next stop: 100k ⭐

273

Confident AI retweeted

Kritin Vongthongsri@kritinv07·Aug 8

We've built a @langchain integration at @confident_ai so you can evaluate your entire agent trace in one extra line of code. ... ok maybe 2 lines of code.

189
