DeepEval

174 posts

@deepeval

The Open-Source LLM Evaluation Framework – created and maintained by @confident_ai. GitHub: https://t.co/kuvhlNRRdD

San Francisco · Joined February 2025

2 Following · 652 Followers

DeepEval reposted

Confident AI@confident_ai·4d

We just launched dataset generation on Confident AI — connect Google Drive, SharePoint, Notion, or S3 and generate eval datasets directly from your docs. That's a wrap on Launch Week! 5 days, 5 launches.

161

DeepEval reposted

Confident AI@confident_ai·2 Apr

We're launching Auto-Categorize Traces & Threads — Day 4 of our Launch Week! Every production trace gets categorized automatically, so you can detect response drift and see exactly which areas your agent crushes and which ones need work. Full post: confident-ai.com/blog/launch-we…

118

DeepEval reposted

Confident AI@confident_ai·1 Apr

Day 3 of launch week: auto trace-to-dataset ingestion. Set a rule once — production traces continuously flow into eval datasets and annotation queues. No scripts, no stale data. Post: confident-ai.com/blog/launch-we…

162

DeepEval reposted

Confident AI@confident_ai·31 Mar

Confident AI Launch Week Day 2: Scheduled Evals ⏰ Everyone agrees to run evals every few days. Nobody actually does. Now you set a frequency, configure your mappings, and evals run themselves. Link here: confident-ai.com/blog/launch-we…

461

DeepEval reposted

Confident AI@confident_ai·30 Mar

Announcing our Q1 Launch Week! Day 1: Automated Error Analysis. Link here: confident-ai.com/blog/launch-we…

427

DeepEval@deepeval·3 Feb

@hasantoxr Wow 👀

239

DeepEval reposted

Hasan Toor@hasantoxr·3 Feb

🚨BREAKING: Someone just solved LLM testing's biggest problem. It's called DeepEval and it gives you answer relevancy, hallucination detection, and G-Eval metrics that actually work. - Run evaluations 100% locally (no data leaves your machine). - Test agents, RAG systems, and production responses with human-level accuracy. 100% Opensource.

121

783

50K

DeepEval@deepeval·12 Nov

My sister just got released, DeepTeam v1.0, 100% open-source, Apache 2.0 red teaming for LLMs. ⭐ Star on GitHub to stay on top of the latest developments in AI security and safety: github.com/confident-ai/d…

927

DeepEval reposted

Jeffrey 🐬 confident-ai.com@jeffr_yyy·22 Oct

@_avichawla Author of @deepeval here, I'm glad you've found our approach of LLM-Arena-as-a-Judge useful :)

349

DeepEval reposted

Avi Chawla@_avichawla·22 Oct

Most LLM-powered evals are BROKEN! These evals can easily mislead you into believing that one model is better than another, primarily because of the way they are set up. G-Eval is one popular example. Here's the core problem with LLM eval techniques and a better alternative to them:

Typical evals like G-Eval assume you’re scoring one output at a time in isolation, without seeing the alternative. So when prompt A scores 0.72 and prompt B scores 0.74, you still don’t know which one is actually better. This is unlike scoring, say, classical ML models, where metrics like accuracy, F1, or RMSE give a clear and objective measure of performance. There’s no room for subjectivity, and the results are grounded in hard numbers, not opinions.

LLM Arena-as-a-Judge is a new technique that addresses this issue with LLM evals. In short, instead of assigning scores, you run A vs. B comparisons and pick the better output. Just like G-Eval, you can define what “better” means (e.g., more helpful, more concise, more polite), and use any LLM to act as the judge.

LLM Arena-as-a-Judge is implemented in @deepeval (open-source with 12k stars), and you can use it in just three steps:

- Create an ArenaTestCase, with a list of “contestants” and their respective LLM interactions.
- Next, define your criteria for comparison using the Arena G-Eval metric, which adapts the G-Eval algorithm to a comparison use case.
- Finally, run the evaluation and print the scores.

This gives you an accurate head-to-head comparison. Note that LLM Arena-as-a-Judge can be either referenceless (as in the snippet attached to the original post) or reference-based. If needed, you can provide an expected output for the given input test case and include it in the evaluation parameters.

Why DeepEval? It's 100% open-source with 12k+ stars and implements everything you need to define metrics, create test cases, and run evals like:

- component-level evals
- multi-turn evals
- LLM Arena-as-a-judge, etc.

Moreover, tracing LLM apps is as simple as adding one Python decorator. And you can run everything 100% locally. I have shared the repo in the replies.
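
The head-to-head idea described in the post above can be sketched without any library. This is a minimal illustration only, not DeepEval's ArenaTestCase or Arena G-Eval API: `Judge`, `length_judge`, `pick_winner`, and the contestant names are all hypothetical, and the toy judge stands in for what would be an LLM call in practice.

```python
from typing import Callable, Dict

# Hypothetical judge signature: (input, output_a, output_b) -> "A" or "B".
Judge = Callable[[str, str, str], str]


def length_judge(prompt: str, output_a: str, output_b: str) -> str:
    """Toy stand-in for an LLM judge: prefers the more concise output."""
    return "A" if len(output_a) <= len(output_b) else "B"


def pick_winner(prompt: str, contestants: Dict[str, str], judge: Judge) -> str:
    """Judge each pair of contestants in both orders and return the one with the most wins.

    Ties are broken in favor of the contestant listed first.
    """
    wins = {name: 0 for name in contestants}
    names = list(contestants)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            # Present the pair in both orders so the judge's position
            # preference (if any) does not decide the result.
            for first, second in ((a, b), (b, a)):
                verdict = judge(prompt, contestants[first], contestants[second])
                wins[first if verdict == "A" else second] += 1
    return max(wins, key=wins.get)


if __name__ == "__main__":
    outputs = {
        "prompt_a": "Paris is the capital of France.",
        "prompt_b": "Paris.",
    }
    print(pick_winner("What is the capital of France?", outputs, length_judge))
```

The result is a ranking decision rather than two absolute scores, which is the point the post makes about 0.72 vs. 0.74. Swapping `length_judge` for a function that prompts an LLM with the comparison criteria would keep the rest of the loop unchanged.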

21.3K

DeepEval@deepeval·17 Oct

🙌 our favorite VC ❤️‍🔥

Vermilion Cliffs Ventures@vermilionfund

The new Vermilion newsletter is out 🗞️ Inside: 💰 @514hq raises $17m to simplify AI-ready analytics 📈 @deepeval becomes the most adopted LLM eval framework globally 🤝 Google’s Agent Development Kit ships a @CopilotKit integration 👀 Who’s hiring? Check out the new Vermilion Careers page Plus: @ashl3ysm1th's take on startup KPIs, revenge of the acronyms, and more founder lessons from the trail.

231

DeepEval reposted

Vermilion Cliffs Ventures@vermilionfund·2 Oct

428

DeepEval reposted

Avi Chawla@_avichawla·24 Sep

Pytest for LLM Apps is finally here! DeepEval turns LLM evals into a two-line test suite to help you identify the best models, prompts, and architecture for AI workflows (including MCPs). Works with all frameworks like LlamaIndex, CrewAI, etc. 100% open-source with 11k stars!

279

20.3K

DeepEval reposted

Mariano Falcón@falconius·12 Sep

And now an external tool can be useful: langfuse, @braintrustdata @deepeval @langfuse @helicone_ai

197

DeepEval reposted

anshuman@athleticKoder·20 Sep

Companies like @OpenAI, @perplexity_ai and @AnthropicAI already use LLM judges for production evaluation at massive scale. @ragas_io and @deepeval are two evaluation frameworks that I personally find intuitive. [NOT AN AD]

3.1K

DeepEval reposted

Jeffrey 🐬 confident-ai.com@jeffr_yyy·22 Sep

"widespread adoption" @deepeval

298

DeepEval@deepeval·11 Sep

@tricalt I'm looking forward to it 👀

DeepEval reposted

Vasilije@tricalt·8 Sep

Real-world AI memory evaluation needs more than HotPotQA, EM or F1. Our new open dataset with @deepeval will measure cross-context, long-term reasoning. Stay tuned. For the full results & current methodology: cognee.ai/blog/deep-dive…

387

Vasilije@tricalt·8 Sep

@cognee_ hits 92.5% on benchmarks. That’s the percent of answers our LLM-based evaluator marked correct - designed to approximate human evaluation. We ran cognee against LightRAG, Mem0, and Graphiti (prev results) through 45 evaluation cycles on 24 questions from HotPotQA.

139

DeepEval reposted

cognee@cognee_·8 Sep

Although HotPotQA and standard metrics like F1 and EM gave us a baseline, we believe real-world AI memory requires better evaluation systems. That's why we’re building an open dataset with @deepeval to share with the entire ecosystem. Full write-up: cognee.ai/blog/deep-dive…

201

DeepEval reposted

Akshay 🚀@akshay_pachaar·11 Sep

There are primarily 2 factors that determine how well an MCP app works:

- Whether the model selects the right tool.
- Whether it prepares the tool call correctly.

Today, let's learn how to evaluate any MCP workflow using @deepeval's MCP evaluations (open-source). Let's go!
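
The two factors named in the post above can be checked separately. The sketch below is a deterministic, exact-match illustration and not DeepEval's MCP evaluation API: `ToolCall`, `score_tool_call`, and the `get_weather` tool are hypothetical names introduced here, and a real evaluation would likely use an LLM judge rather than strict equality.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class ToolCall:
    """Hypothetical record of one tool invocation made by the model."""
    name: str
    arguments: Dict[str, Any] = field(default_factory=dict)


def score_tool_call(expected: ToolCall, actual: ToolCall) -> Dict[str, float]:
    """Return two scores in [0, 1]: tool selection and argument correctness."""
    if actual.name != expected.name:
        # Arguments passed to the wrong tool are not comparable.
        return {"tool_selection": 0.0, "argument_match": 0.0}
    if not expected.arguments:
        return {"tool_selection": 1.0, "argument_match": 1.0}
    # Fraction of expected arguments reproduced exactly; extra arguments are ignored.
    matched = sum(
        1 for key, value in expected.arguments.items()
        if actual.arguments.get(key) == value
    )
    return {
        "tool_selection": 1.0,
        "argument_match": matched / len(expected.arguments),
    }


if __name__ == "__main__":
    expected = ToolCall("get_weather", {"city": "Paris", "unit": "celsius"})
    actual = ToolCall("get_weather", {"city": "Paris", "unit": "fahrenheit"})
    print(score_tool_call(expected, actual))
```

Keeping the two scores apart shows whether a failure comes from picking the wrong tool or from filling in its parameters incorrectly.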
