Evidently AI

2.3K posts

@EvidentlyAI

Open-source ML and LLM evaluation 📊, testing 🚦, and monitoring 📈. GitHub: https://t.co/37H9bfnYj6 Discord: https://t.co/ElZ9RlroUa

Joined February 2020

211 Following · 2.5K Followers

Pinned Tweet

Evidently AI @EvidentlyAI · Dec 9

3️⃣ 2️⃣ 1️⃣ Our free course on LLM evaluations for AI product teams starts today!
🎥 7 days of byte-sized videos into your inbox
⭐️ Certificate upon completion
👩‍💻 No coding skills required
👩‍🎓 500+ students have signed up
You can still join the course 👇 evidentlyai.com/llm-evaluation…

Evidently AI @EvidentlyAI · 1d

How Netflix evaluates the quality of show synopses using an LLM-as-a-judge approach: netflixtechblog.com/evaluating-net…

Evidently AI @EvidentlyAI · Apr 24

How Zalando builds a search quality assurance framework with LLM-as-a-judge: engineering.zalando.com/posts/2026/03/…

Evidently AI @EvidentlyAI · Apr 4

📌 In case you missed it How to evaluate an AI agent? Follow the tutorial as we: 1️⃣ Build an AI agent, 2️⃣ Create a test dataset, 3️⃣ Assess responses and tool choice, 4️⃣ Track the agent’s behaviour. Follow the tutorial from our LLM evals course: youtube.com/watch?v=9KMmad…

Evidently AI @EvidentlyAI · Apr 3

A Friday ML use case 📕 📚 From the database of 800 ML & LLM systems: cutt.ly/SwrZWL0g How Uber improves driver availability at airports: Estimated time-to-request model, Earnings-per-hour prediction, and Driver-deficit forecasting. uber.com/en-GB/blog/for…

Evidently AI @EvidentlyAI · Apr 1

🦾 More AI agents aren’t always better. Google evaluated 180 agent setups and found multi-agent systems help with parallel tasks but can hurt sequential ones. The work also proposes a model to predict optimal agentic designs. research.google/blog/towards-a…

Evidently AI @EvidentlyAI · Mar 28

📌 In case you missed it Let’s test your RAG system! Follow the tutorial as we: 1️⃣ Build a RAG system, 2️⃣ Generate test data, 3️⃣ Evaluate answers for correctness and faithfulness. Watch the tutorial from our LLM evals course: youtube.com/watch?v=jckp5R…

Evidently AI @EvidentlyAI · Mar 27

A Friday ML use case 📕 📚 From the database of 800 ML & LLM systems: cutt.ly/SwrZWL0g How GoDaddy built Lighthouse, an internal AI analytics platform: prompt engineering framework, model orchestration, solution architecture, and use cases. godaddy.com/resources/news…

Evidently AI retweeted

Nnenna 👩🏽‍💻✨ @nnennahacks · Mar 24

(policyNIM oss tool) preflight command is working. when I provide a coding task, it kicks off a search through indexed policies to determine which rules are relevant for implementation. @nvidia for embedding w/ @OpenAI + @lancedb for vector storage. eval command is also working. using @EvidentlyAI for running eval suite.

Evidently AI @EvidentlyAI · Mar 24

🚦 Meta’s “Agents Rule of Two”
According to Meta, AI agents should satisfy at most two of these conditions per session to reduce prompt-injection risk:
- Handle untrusted inputs
- Access sensitive data
- Change state / act externally
ai.meta.com/blog/practical…
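
The rule quoted above can be sketched as a simple per-session check. This is only an illustration of the "at most two of three" condition as the tweet states it, not Meta's implementation; the `SessionCapabilities` class and the `satisfies_rule_of_two` function are hypothetical names introduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SessionCapabilities:
    """Hypothetical description of what an agent may do in one session."""
    handles_untrusted_inputs: bool
    accesses_sensitive_data: bool
    changes_state_externally: bool


def satisfies_rule_of_two(caps: SessionCapabilities) -> bool:
    """Return True if at most two of the three conditions are enabled."""
    enabled = sum([
        caps.handles_untrusted_inputs,
        caps.accesses_sensitive_data,
        caps.changes_state_externally,
    ])
    return enabled <= 2


# An agent that reads untrusted web pages and private data, but cannot act.
read_only_agent = SessionCapabilities(True, True, False)
# An agent with all three capabilities enabled in the same session.
unrestricted_agent = SessionCapabilities(True, True, True)

print(satisfies_rule_of_two(read_only_agent))
print(satisfies_rule_of_two(unrestricted_agent))
```

A check like this would presumably run when a session is configured, so that a third capability is refused or routed to human approval rather than silently enabled.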

Evidently AI @EvidentlyAI · Mar 21

📌 In case you missed it How do you know if your RAG works? You need to check: ✅ Can it find the right information? ✅ Is the final answer complete, relevant, and free of hallucinations? Watch the intro to RAG evaluation from our LLM evals course: youtube.com/watch?v=qI2qQf…

Evidently AI @EvidentlyAI · Mar 20

A Friday ML use case 📕 📚 From the database of 800 ML & LLM systems: cutt.ly/SwrZWL0g How DoorDash improves its RecSys using LLMs to bridge behavioral silos in multi-vertical recommendations. careersatdoordash.com/blog/doordash-…

Evidently AI @EvidentlyAI · Mar 18

💭 Can AI systems introspect? Anthropic’s new research suggests Claude models can sometimes identify and describe their own internal states. It’s still unreliable, but marks a step toward more transparent AI reasoning. anthropic.com/research/intro…

Evidently AI @EvidentlyAI · Mar 14

📌 In case you missed it Can LLMs write engaging tech tweets? Follow the tutorial as we: 1️⃣ Build a tweet generator, 2️⃣ Score its outputs with custom LLM judges, 3️⃣ Improve the results with prompt iteration. Watch the tutorial from our LLM evals course: youtube.com/watch?v=KhkiM9…

Evidently AI @EvidentlyAI · Mar 13

A Friday ML use case 📕 📚 From the database of 800 ML & LLM systems: cutt.ly/SwrZWL0g How Shopify transformed its product classification system from basic categorization to an AI-driven framework using Vision Language Models. shopify.engineering/evolution-prod…

Evidently AI @EvidentlyAI · Mar 10

📚 Context is everything.
OpenAI shares how it built an in-house data agent that answers complex questions in minutes. It uses 6 layers of context:
- Table metadata
- Human annotations
- Codex enrichment
- Company knowledge
- Memory
- Runtime context
openai.com/index/inside-o…

Evidently AI @EvidentlyAI · Mar 7

📌 In case you missed it Are LLMs good for classification tasks? We built an LLM-based classifier for a travel support chatbot and compared its performance to a classic ML model. Watch the tutorial from our LLM evals course: youtube.com/watch?v=Gl2X_o…

Evidently AI @EvidentlyAI · Mar 6

A Friday ML use case 📕 📚 From the database of 800 ML & LLM systems: cutt.ly/SwrZWL0g How Wayfair built Wilma, a customer service agent copilot: workflow, prompt templates, and the copilot’s evolution. aboutwayfair.com/careers/tech-b…

Evidently AI @EvidentlyAI · Mar 4

🤖 How to develop and deploy chatbots at scale? DoorDash shares how they created a simulation platform and evaluation flywheel, allowing them to test chatbots with fast feedback loops and without production risk. careersatdoordash.com/blog/doordash-…

Evidently AI @EvidentlyAI · Feb 28

📌 In case you missed it
How to create an LLM judge that aligns with human labels:
- Define criteria
- Create test dataset
- Run evaluation prompt to see if the judge aligns with your labels
- Evaluate the judge
Watch the video from our LLM evals course: youtube.com/watch?v=kP_aaF…
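
The "evaluate the judge" step above can be sketched as a comparison between judge verdicts and human labels on the same test dataset. The tweet does not say which metric the course uses, so raw agreement and Cohen's kappa are assumptions chosen here as common options; the label values and function names are hypothetical.

```python
from collections import Counter


def agreement_rate(human: list[str], judge: list[str]) -> float:
    """Fraction of examples where the judge matches the human label."""
    if len(human) != len(judge) or not human:
        raise ValueError("label lists must be non-empty and equal in length")
    return sum(h == j for h, j in zip(human, judge)) / len(human)


def cohens_kappa(human: list[str], judge: list[str]) -> float:
    """Agreement corrected for the agreement expected by chance."""
    n = len(human)
    observed = agreement_rate(human, judge)
    human_counts = Counter(human)
    judge_counts = Counter(judge)
    # Chance agreement: product of the two marginal label frequencies.
    expected = sum(
        human_counts[label] * judge_counts[label] for label in human_counts
    ) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels for four test examples.
human_labels = ["good", "bad", "good", "bad"]
judge_labels = ["good", "bad", "bad", "bad"]

print(agreement_rate(human_labels, judge_labels))  # 0.75
print(cohens_kappa(human_labels, judge_labels))    # 0.5
```

Kappa is reported alongside raw agreement because a judge that always answers with the majority label can score high agreement on an imbalanced dataset while adding no signal.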

Evidently AI @EvidentlyAI · Feb 27

A Friday ML use case 📕 📚 From the database of 800 ML & LLM systems: cutt.ly/SwrZWL0g How Wayfair uses AI agents to automatically triage support tickets: agents vs. workflows and a hybrid approach. aboutwayfair.com/careers/tech-b…
