

Micha Mazaheri
@mittsh
Back to building a SaaS 🚀 • Advisor & co-founder @electricbeastco • Founded Paw API testing tool (exit in 2020) • #Biotech #PrecisionFermentation enthusiast

Unbelievable, they banned Harvard's ability to enroll international students. It's batshit insane, there's no other way to put it. "We have the best university in the world that's been standing for 400 years, let's kill it"

LMFAO

We don’t have an official @Eurovision favorite, but, just saying, only one act has had a dancing MD80 so far 🤷♀️. #Eurovision


🚨 95% of LLM evaluations fail to deliver value—why? 🤔 Because most teams are unknowingly evaluating the wrong thing.

Typical LLM metrics sound great:
- Correctness: "Did the model get the facts right?" ✅
- Answer Relevancy: "Did it directly answer the question?" 🎯
- Faithfulness: "Did it avoid hallucinations?" 🔎
- Tonality: "Did it match the desired voice?" 🗣️

But here's the issue: your LLM doesn't exist simply to be correct, relevant, or faithful. It exists to deliver ROI—reducing customer support costs, saving analyst hours, or increasing customer satisfaction. 📈💸

Metrics must correlate with real-world outcomes. Your test case passing rates should confidently predict tangible business impact: more ticket resolutions, reduced internal workload, increased efficiency. When you build this metric-to-outcome connection, evaluations finally mean something. Improvements in your LLM's performance become improvements in your business metrics.

How do you do this right (@confident_ai)? 👉 Humans-in-the-loop, with metric alignment.
- Curate just 25–50 human-labeled "good" or "bad" real-world OUTCOMES.
- These aren't metric scores—they're OUTCOMES, like a support ticket being resolved or a user closing your LLM app in frustration. Whatever your product KPI is, you know it better than I do.
- Through trial and error, find the set of metrics whose test results correlate with those outcomes.

Forget synthetic data. Forget vanity metrics. If your evaluation data doesn't represent real users and real outcomes, you're evaluating in the dark. 🕶️

We unpacked exactly how to establish these connections clearly, practically, and repeatably in our latest guide:
👉 The Ultimate LLM Evaluation Playbook: confident-ai.com/blog/the-ultim…
