Chris Mauck
@cmauck10
374 posts

Data Scientist @ Cleanlab, Car Enthusiast, and Food Connoisseur
Dallas, TX · Joined December 2021
528 Following · 139 Followers
Chris Mauck @cmauck10
@businessbarista @da_fant You’re exactly describing @glean ! Arvind, our founder, built Google search in the late '90s and early 2000s, then IPO'd Rubrik (security) in 2024. Glean does exactly what you’re mentioning: we index and learn your enterprise data sources and apps to build this agentic brain.
1 reply · 0 reposts · 3 likes · 146 views
Alex Lieberman @businessbarista
Someone is going to build a world-class “Brain” for enterprises & make a stupid amount of money.

Why? As @da_fant said, “coding w ai is solved bc all context is in the git repo. knowledge work is difficult bc context is spread out. an ai system that creates a git repo w all context for a knowledge worker will be able to 100% automate the work.”

When companies talk about being data-ready for AI, this is what they’re implicitly saying. Engineering has been prepared for this moment for a long time because of the deterministic nature of code, the centralization/versioning of data (read: GitHub), and AI tools that are largely built by engineers, for engineers. But for the rest of white-collar work, there’s a TON of catching up to do to properly harness the power of the technology.

The big challenge here, and why no one has truly cracked the code for "an ai system that creates a git repo w all context for a knowledge worker," is that unlike code, most knowledge is 1) distributed, 2) unstructured, and 3) unverifiable.

It's distributed: transcripts live in Granola. Documents in Notion. Customer data in HubSpot. ERP. Emails. Slack messages. Random spreadsheets. SOP docs. Etc. Building an ingestion engine that connects to all of your disparate data sources and auto-updates based on the shelf life of the data is the first, and frankly easiest, step of the process.

Next, it's unstructured: let's say I want to create a proposal for a potential client. To nail the proposal, I want it to pull important information from a variety of sources: the specific asks & background from our initial sales call, previous proposals to anchor ourselves to a proven format, and completed sprint boards from Linear, so the pricing & timeline in the document are grounded in truth. Whether it's a thoughtful filesystem (a la Obsidian) or an OpenClaw-esque memory structure, the brain needs to be great at self-organizing into a thoughtful schema. This is very hard, especially if you want to build a generalizable brain that can be shaped to an array of different enterprises.

And finally, most knowledge is unverifiable: writing a function, running a unit test, and seeing if the code works is easy. It works or it doesn't. Using AI to accelerate your content-creation process is highly subjective. What is a good/bad idea? Is the content in your voice or not? Does it feel like slop, or novel? Answering these questions is both difficult and non-verifiable. That same system described above doesn't just have to be great at organizing & forming coherent relationships; it also has to be great at self-improving based on feedback from the user. Memory systems (like those introduced by OpenClaw) are great to a point, but as you scale the corpus of data within your company's brain, things like compaction and cleaning become wildly important to avoid the needle-in-the-haystack problem.

Someone is going to figure out how to solve this problem, and when they do, not only will they make a shit ton of money, but they'll be Robin Hood for knowledge workers, enabling non-engineers to enjoy the sort of leverage that only technical folks have felt for the last few years.
156 replies · 73 reposts · 916 likes · 203.9K views
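The "ingestion engine that connects to all of your disparate data sources and auto-updates based on the shelf life of the data" from the post above can be sketched as a small registry. Everything here is illustrative, not any real product's API: the `Source` record, the example source names, and the shelf-life values are all hypothetical.

```python
# Hypothetical sketch: a registry of knowledge sources, each with its own
# shelf life, that reports which sources are stale and due for re-ingestion.
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class Source:
    name: str                 # e.g. "notion", "hubspot", "slack" (illustrative)
    shelf_life: timedelta     # how long ingested data stays fresh
    last_ingested: datetime   # when this source was last pulled

def stale_sources(sources, now):
    """Return the names of sources whose last ingestion is older than their shelf life."""
    return [s.name for s in sources if now - s.last_ingested > s.shelf_life]

now = datetime(2025, 1, 10)
sources = [
    Source("slack", timedelta(hours=1), datetime(2025, 1, 9)),   # chat goes stale fast
    Source("notion", timedelta(days=7), datetime(2025, 1, 8)),   # docs stay fresh longer
    Source("hubspot", timedelta(days=1), datetime(2025, 1, 5)),  # CRM, overdue
]
print(stale_sources(sources, now))  # ['slack', 'hubspot']
```

A scheduler polling this registry is one simple way to realize the "auto-update" behavior the post describes; real systems would also handle webhooks and incremental sync.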
Chris Mauck retweeted
LlamaIndex 🦙 @llama_index
New integration: @CleanlabAI + LlamaIndex
LlamaIndex lets you build AI knowledge assistants and production agents that generate insights from enterprise data. Cleanlab makes their responses trustworthy. Add Cleanlab to:
• Score trust for every LLM response
• Catch hallucinated/incorrect responses in real time
• Root-cause why certain responses are untrustworthy (poor retrieval, bad data/context, tricky query, LLM hallucination, ...)
Powerful assistants. Trustworthy outputs.
Get started → docs.llamaindex.ai/en/stable/exam…
5 replies · 13 reposts · 68 likes · 9K views
Chris Mauck retweeted
Cleanlab @CleanlabAI
New: @langtrace_ai now includes native support for Cleanlab! Log trust scores, explanations, and metadata for every LLM response—automatically. Instantly surface risky or low-quality outputs.
📝 Blog: langtrace.ai/blog/langtrace…
💻 Docs: docs.langtrace.ai/supported-inte…
1 reply · 4 reposts · 9 likes · 406 views
Chris Mauck retweeted
Cleanlab @CleanlabAI
Cleanlab now works with @MLflow — making it easier to detect bad LLM responses right in your pipeline. Faster review cycles. Less manual work. We’re joining MLflow’s upcoming meetup to show how it works.
📅 Attend: lu.ma/mlflow423
📝 Blog: mlflow.org/blog/tlm-traci…
1 reply · 1 repost · 4 likes · 225 views
Chris Mauck retweeted
Curtis G. Northcutt @cgnorthcutt
Tomorrow I'm spilling the secrets of how several Fortune 500 @cleanlabai customers are solving the hardest problem in AI -- producing accurate, compliant, safe, fully automated AI agent responses -- at the @aiusergroup Conference in SF. Stop by and get your hands dirty and your AI cleaner.
1 reply · 3 reposts · 6 likes · 175 views
Chris Mauck retweeted
MLflow @MLflow
🚀 New on the MLflow blog: Automatically find the bad LLM responses in your LLM evals with @CleanlabAI!
Cleanlab’s Trustworthy Language Models (TLM) analyze prompts and responses to calculate a trustworthiness_score — flagging potentially incorrect or hallucinated outputs, no ground-truth labels required.
In this guide, learn how to:
✅ Apply TLM to LLM responses captured with MLflow tracing
✅ Log, track, and analyze trustworthiness scores and explanations
✅ Use MLflow Evaluation to compare scores across runs
🔗 Check out the full guide: mlflow.org/blog/tlm-traci…
#opensource #mlflow #cleanlab #oss #llm
1 reply · 4 reposts · 7 likes · 721 views
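The downstream step in the MLflow guide above (once each traced response carries a trustworthiness score, flag the low-scoring ones for review) can be sketched without any MLflow or Cleanlab specifics. The `flag_low_trust` helper, the 0.8 threshold, and the example scores are all hypothetical, not part of either library's API.

```python
# Hypothetical sketch: given (response, trustworthiness_score) pairs, as a
# TLM-style evaluator might produce, surface responses whose score falls
# below a review threshold so they can be routed to human review.

def flag_low_trust(scored_responses, threshold=0.8):
    """Return the (response, score) pairs whose score is below threshold."""
    return [
        (response, score)
        for response, score in scored_responses
        if score < threshold
    ]

scored = [
    ("Paris is the capital of France.", 0.97),
    ("The Eiffel Tower was built in 1850.", 0.41),  # likely hallucination
    ("Cleanlab TLM scores LLM outputs.", 0.88),
]
for response, score in flag_low_trust(scored):
    print(f"REVIEW ({score:.2f}): {response}")  # only the 0.41 response prints
```

In a real pipeline the threshold would be tuned per use case, and flagged items would be logged back to the tracing system rather than printed.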
Chris Mauck retweeted
arize-phoenix @ArizePhoenix
Better LLMs start with better data and observability. We’ve integrated @CleanlabAI's Trustworthy Language Model (TLM) with Phoenix to help teams improve LLM reliability and performance.
🔍 TLM automatically identifies mislabeled, low-quality, or ambiguous training data—ensuring models are built on trustworthy foundations
📊 Phoenix provides deep observability to debug, evaluate, and enhance LLM performance in production
How it works:
1️⃣ Extract LLM traces from Phoenix and structure input-output pairs for evaluation
2️⃣ Use Cleanlab TLM to assign a trustworthiness score and explanation to each response
3️⃣ Log evaluations back to Phoenix for traceability, clustering, and deeper insights into model performance
🔗 Dive into the full implementation in our docs & notebook:
2 replies · 6 reposts · 15 likes · 2.1K views
xlr8harder @xlr8harder
I want to applaud @OpenAI on releasing GPT-4.5. It's not a benchmark beater, and they released it anyway. That takes some courage on their part, because they will get a lot of dumb criticism on eval scores. (If you think it needs to top evals to be valuable, you are wrong.)
46 replies · 20 reposts · 736 likes · 51.4K views
Chris Mauck @cmauck10
@omarsar0 We ran GPT-4.5 through our real-time Eval solution; the findings are mixed...
0 replies · 0 reposts · 0 likes · 10 views
elvis @omarsar0
Pro tip: Use the OpenAI Playground to compare GPT-4.5 and other models. Watch how "thoughtful" the GPT-4.5 response is.
6 replies · 4 reposts · 48 likes · 18.8K views
Chris Mauck @cmauck10
@ai_for_success We ran GPT-4.5 through our real-time Eval solution; the findings are mixed...
0 replies · 0 reposts · 0 likes · 2 views
AshutoshShrivastava @ai_for_success
LMAO, OpenAI GPT-4.5 pricing is insane. What on earth are they even thinking??
351 replies · 116 reposts · 2K likes · 328.9K views
Chris Mauck @cmauck10
@flavioAd We ran GPT-4.5 through our real-time Eval solution; the findings are mixed...
0 replies · 0 reposts · 0 likes · 30 views
Flavio Adamo @flavioAd
🚨 GPT-4.5 is impressive! 🚨 This is the most realistic result so far. "write a Python program that shows a ball bouncing inside a spinning hexagon. The ball should be affected by gravity and friction, and it must bounce off the rotating walls realistically"
77 replies · 69 reposts · 1.1K likes · 851.4K views
Chris Mauck @cmauck10
@dylan522p We ran GPT-4.5 through our real-time Eval solution; the findings are mixed...
0 replies · 0 reposts · 0 likes · 3 views
Dylan Patel @dylan522p
Claude 3.7 beats GPT-4.5 at most tasks. But 4.5 has better vibes... the first non-Anthropic model since 3 Opus.
4.5 is vibey, and legitimately the first time a model made me laugh. Humor is intelligence.
You exist in the context of all in which you live and what came before you meaning
San Francisco, CA 🇺🇸
42 replies · 42 reposts · 807 likes · 151.5K views
Chris Mauck @cmauck10
@benhylak We ran GPT-4.5 through our real-time Eval solution; the findings are mixed...
0 replies · 0 reposts · 0 likes · 10 views
ben hylak @benhylak
i've been testing gpt 4.5 for the past few weeks. it's the first model that can actually write. this is literally the midjourney-moment for writing. (comparison to gpt 4o below)
218 replies · 84 reposts · 2.1K likes · 647.1K views
Chris Mauck @cmauck10
@deedydas We ran GPT-4.5 through our real-time Eval solution; the findings are mixed...
0 replies · 0 reposts · 0 likes · 14 views
Deedy @deedydas
Can confirm. GPT 4.5 is hilarious. "Finish the text: > Be me > L3 at Meta / Google >"
104 replies · 204 reposts · 6.7K likes · 885.1K views
Chris Mauck @cmauck10
@nikunjhanda We ran GPT-4.5 through our real-time Eval solution; the findings are mixed...
0 replies · 0 reposts · 0 likes · 9 views
Nikunj Handa @nikunjhanda
Pro-tip: When using GPT-4.5, prepend your system message with the message below. Based on OpenAI internal evals, it results in better performance!
81 replies · 246 reposts · 3.9K likes · 665.4K views
Chris Mauck @cmauck10
@adonis_singh We ran GPT-4.5 through our real-time Eval solution; the findings are mixed...
0 replies · 0 reposts · 0 likes · 16 views
adi @adonis_singh
GPT-4.5 asked for 1 truly novel human insight (might be my favourite answer on this prompt)
424 replies · 2K reposts · 25.7K likes · 2M views
Chris Mauck retweeted
Akshay 🚀 @akshay_pachaar
Let's build a trustworthy RAG app that provides a confidence score for each response:
11 replies · 42 reposts · 437 likes · 99.6K views
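The idea in the tweet above, a RAG app that attaches a confidence score to each response, can be sketched in miniature. Everything here is a stand-in: the `ScoredAnswer` record, the toy word-overlap heuristic, and the example context are illustrative, not the Cleanlab or any RAG framework's API (real systems use calibrated uncertainty estimation, not word overlap).

```python
# Hypothetical sketch: a RAG answer paired with a confidence score derived
# from how well the retrieved context supports it.
from dataclasses import dataclass

@dataclass
class ScoredAnswer:
    answer: str
    confidence: float  # 0.0 (unsupported) .. 1.0 (fully supported)

def confidence_from_context(answer: str, context: str) -> float:
    """Toy heuristic: fraction of answer words that appear in the retrieved
    context. A stand-in for real uncertainty estimation."""
    answer_words = set(answer.lower().split())
    context_words = set(context.lower().split())
    if not answer_words:
        return 0.0
    return len(answer_words & context_words) / len(answer_words)

context = "cleanlab tlm scores every llm response for trustworthiness"
supported = ScoredAnswer("tlm scores every response",
                         confidence_from_context("tlm scores every response", context))
ungrounded = ScoredAnswer("the moon is made of cheese",
                          confidence_from_context("the moon is made of cheese", context))
print(supported.confidence > ungrounded.confidence)  # True
```

The point of the sketch is the shape of the output, every answer ships with a score, so the app can decline or escalate low-confidence responses instead of returning them blindly.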
Chris Mauck retweeted
LangChain @LangChain
🧹 Hallucination detection from Cleanlab
🧪 The new tlm-langchain package augments your LangChain / LangGraph applications with an LLM trustworthiness score. @CleanlabAI's Trustworthy Language Model detects incorrect LLM outputs in real time via state-of-the-art uncertainty estimation.
Get started here: help.cleanlab.ai/tlm/use-cases/…
3 replies · 40 reposts · 218 likes · 19.9K views
Chris Mauck retweeted
Curtis G. Northcutt @cgnorthcutt
NEWS: @CleanlabAI + @pinecone set a new standard for trustworthy GenAI/RAG! Our latest: AI that’s accurate, curated, and hallucination-free, using Cleanlab's knowledge curation and Pinecone's vector search. Reliable responses and trust scoring. Full blog 👇 pinecone.io/learn/building…
0 replies · 3 reposts · 9 likes · 1.5K views
Chris Mauck retweeted
Cleanlab @CleanlabAI
Want to reduce the error-rate of responses from OpenAI’s o1 LLM by over 20% and also catch incorrect responses in real-time? Just published: 3 benchmarks demonstrating this can be achieved with the Trustworthy Language Model (TLM) framework [...]
1 reply · 4 reposts · 11 likes · 2.6K views