Chris Mauck
@cmauck10
374 posts

Data Scientist @ Cleanlab, Car Enthusiast, and Food Connoisseur
Dallas, TX · Joined December 2021
528 Following · 139 Followers
Chris Mauck @cmauck10
@businessbarista @da_fant You’re exactly describing @glean ! Arvind, our founder, built Google search in the late '90s and early 2000s, then IPO'd Rubrik (security) in 2024. Glean does exactly what you’re mentioning: we index and learn your enterprise data sources and apps to build this agentic brain.
1 reply · 0 reposts · 3 likes · 146 views
Alex Lieberman @businessbarista
Someone is going to build a world-class “Brain” for enterprises & make a stupid amount of money.

Why? As @da_fant said, “coding w ai is solved bc all context is in the git repo. knowledge work is difficult bc context is spread out. an ai system that creates a git repo w all context for a knowledge worker will be able to 100% automate the work.”

When companies talk about being data-ready for AI, this is what they’re implicitly saying. Engineering has been prepared for this moment for a long time because of the deterministic nature of code, the centralization/versioning of data (read: GitHub), and AI tools that are largely built by engineers, for engineers. But for the rest of white-collar work, there’s a TON of catching up to do to properly harness the power of the technology.

The big challenge here, and why no one has truly cracked the code for "an ai system that creates a git repo w all context for a knowledge worker," is that unlike code, most knowledge is 1) distributed, 2) unstructured, and 3) unverifiable.

It's distributed: transcripts live in Granola. Documents in Notion. Customer data in HubSpot. ERP. Emails. Slack messages. Random spreadsheets. SOP docs. Etc. Building an ingestion engine that connects to all of your disparate data sources and auto-updates based on the shelf life of the data is the first, and frankly easiest, step of the process.

Next, it's unstructured: let's say I want to create a proposal for a potential client. To nail the proposal, I want it to pull important information from a variety of sources: the specific asks & background from our initial sales call, previous proposals to anchor ourselves to a proven format, and completed sprint boards from Linear, so the pricing & timeline in the document are grounded in truth. Whether it's a thoughtful filesystem (a la Obsidian) or an OpenClaw-esque memory structure, the brain needs to be great at self-organizing into a thoughtful schema. This is very hard, especially if you want to build a generalizable brain that can be shaped to an array of different enterprises.

And finally, most knowledge is unverifiable: writing a function, running a unit test, and seeing if the code works is easy. It works or it doesn't. Using AI to accelerate your content-creation process is highly subjective. What is a good/bad idea? Is the content in your voice or not? Does it feel like slop, or novel? Answering these questions is both difficult and non-verifiable. That same system described above doesn't just have to be great at organizing & forming coherent relationships; it also has to be great at self-improving based on feedback from the user. Memory systems (like those introduced by OpenClaw) are great to a point, but as you scale the corpus of data within your company's brain, things like compaction and cleaning become wildly important to avoid the needle-in-the-haystack problem.

Someone is going to figure out how to solve this problem, and when they do, not only will they make a shit ton of money, but they'll be Robin Hood for knowledge workers, enabling non-engineers to enjoy the sort of leverage that only technical folks have felt for the last few years.
156 replies · 73 reposts · 916 likes · 203.9K views
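The "ingestion engine that connects to all of your disparate data sources and auto-updates based on the shelf life of the data" from the post above can be sketched as a small registry. Everything here is illustrative, not any real product's API: the `Source` record, the example source names, and the shelf-life values are all hypothetical.

```python
# Hypothetical sketch: a registry of knowledge sources, each with its own
# shelf life, that reports which sources are stale and due for re-ingestion.
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class Source:
    name: str                 # e.g. "notion", "hubspot", "slack" (illustrative)
    shelf_life: timedelta     # how long ingested data stays fresh
    last_ingested: datetime   # when this source was last pulled

def stale_sources(sources, now):
    """Return the names of sources whose last ingestion is older than their shelf life."""
    return [s.name for s in sources if now - s.last_ingested > s.shelf_life]

now = datetime(2025, 1, 10)
sources = [
    Source("slack", timedelta(hours=1), datetime(2025, 1, 9)),   # chat goes stale fast
    Source("notion", timedelta(days=7), datetime(2025, 1, 8)),   # docs stay fresh longer
    Source("hubspot", timedelta(days=1), datetime(2025, 1, 5)),  # CRM, overdue
]
print(stale_sources(sources, now))  # ['slack', 'hubspot']
```

A scheduler polling this registry is one simple way to realize the "auto-update" behavior the post describes; real systems would also handle webhooks and incremental sync.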
Chris Mauck retweeted
LlamaIndex 🦙 @llama_index
New integration: @CleanlabAI + LlamaIndex
LlamaIndex lets you build AI knowledge assistants and production agents that generate insights from enterprise data. Cleanlab makes their responses trustworthy. Add Cleanlab to:
• Score trust for every LLM response
• Catch hallucinated/incorrect responses in real time
• Root-cause why certain responses are untrustworthy (poor retrieval, bad data/context, tricky query, LLM hallucination, ...)
Powerful assistants. Trustworthy outputs.
Get started → docs.llamaindex.ai/en/stable/exam…
5 replies · 13 reposts · 68 likes · 9K views
Chris Mauck retweeted
Cleanlab @CleanlabAI
New: @langtrace_ai now includes native support for Cleanlab! Log trust scores, explanations, and metadata for every LLM response—automatically. Instantly surface risky or low-quality outputs.
📝 Blog: langtrace.ai/blog/langtrace…
💻 Docs: docs.langtrace.ai/supported-inte…
1 reply · 4 reposts · 9 likes · 406 views
Chris Mauck retweeted
Cleanlab @CleanlabAI
Cleanlab now works with @MLflow — making it easier to detect bad LLM responses right in your pipeline. Faster review cycles. Less manual work. We’re joining MLflow’s upcoming meetup to show how it works.
📅 Attend: lu.ma/mlflow423
📝 Blog: mlflow.org/blog/tlm-traci…
1 reply · 1 repost · 4 likes · 225 views
Chris Mauck retweeted
Curtis G. Northcutt @cgnorthcutt
Tomorrow I'm spilling the secrets of how several Fortune 500 @cleanlabai customers are solving the hardest problem in AI -- producing accurate, compliant, safe, fully automated AI agent responses -- at the @aiusergroup Conference in SF. Stop by and get your hands dirty and your AI cleaner.
1 reply · 3 reposts · 6 likes · 175 views
Chris Mauck retweeted
MLflow @MLflow
🚀 New on the MLflow blog: Automatically find the bad LLM responses in your LLM evals with @CleanlabAI!
Cleanlab’s Trustworthy Language Models (TLM) analyze prompts and responses to calculate a trustworthiness_score — flagging potentially incorrect or hallucinated outputs, no ground-truth labels required.
In this guide, learn how to:
✅ Apply TLM to LLM responses captured with MLflow tracing
✅ Log, track, and analyze trustworthiness scores and explanations
✅ Use MLflow Evaluation to compare scores across runs
🔗 Check out the full guide: mlflow.org/blog/tlm-traci…
#opensource #mlflow #cleanlab #oss #llm
1 reply · 4 reposts · 7 likes · 721 views
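The downstream step in the MLflow guide above (once each traced response carries a trustworthiness score, flag the low-scoring ones for review) can be sketched without any MLflow or Cleanlab specifics. The `flag_low_trust` helper, the 0.8 threshold, and the example scores are all hypothetical, not part of either library's API.

```python
# Hypothetical sketch: given (response, trustworthiness_score) pairs, as a
# TLM-style evaluator might produce, surface responses whose score falls
# below a review threshold so they can be routed to human review.

def flag_low_trust(scored_responses, threshold=0.8):
    """Return the (response, score) pairs whose score is below threshold."""
    return [
        (response, score)
        for response, score in scored_responses
        if score < threshold
    ]

scored = [
    ("Paris is the capital of France.", 0.97),
    ("The Eiffel Tower was built in 1850.", 0.41),  # likely hallucination
    ("Cleanlab TLM scores LLM outputs.", 0.88),
]
for response, score in flag_low_trust(scored):
    print(f"REVIEW ({score:.2f}): {response}")  # only the 0.41 response prints
```

In a real pipeline the threshold would be tuned per use case, and flagged items would be logged back to the tracing system rather than printed.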
Chris Mauck retweeted
arize-phoenix @ArizePhoenix
Better LLMs start with better data and observability. We’ve integrated @CleanlabAI's Trustworthy Language Model (TLM) with Phoenix to help teams improve LLM reliability and performance.
🔍 TLM automatically identifies mislabeled, low-quality, or ambiguous training data—ensuring models are built on trustworthy foundations
📊 Phoenix provides deep observability to debug, evaluate, and enhance LLM performance in production
How it works:
1️⃣ Extract LLM traces from Phoenix and structure input-output pairs for evaluation
2️⃣ Use Cleanlab TLM to assign a trustworthiness score and explanation to each response
3️⃣ Log evaluations back to Phoenix for traceability, clustering, and deeper insights into model performance
🔗 Dive into the full implementation in our docs & notebook:
2 replies · 6 reposts · 15 likes · 2.1K views
xlr8harder @xlr8harder
I want to applaud @OpenAI on releasing GPT-4.5. It's not a benchmark beater, and they released it anyway. That takes some courage on their part, because they will get a lot of dumb criticism on eval scores. (If you think it needs to top evals to be valuable, you are wrong.)
46 replies · 20 reposts · 736 likes · 51.4K views
Chris Mauck @cmauck10
@omarsar0 We ran GPT-4.5 through our real-time Eval solution; the findings are mixed...
0 replies · 0 reposts · 0 likes · 10 views
elvis @omarsar0
Pro tip: Use the OpenAI Playground to compare GPT-4.5 and other models. Watch how "thoughtful" the GPT-4.5 response is.
6 replies · 4 reposts · 48 likes · 18.8K views
Chris Mauck @cmauck10
@ai_for_success We ran GPT-4.5 through our real-time Eval solution; the findings are mixed...
0 replies · 0 reposts · 0 likes · 2 views
AshutoshShrivastava @ai_for_success
LMAO, OpenAI GPT-4.5 pricing is insane. What on earth are they even thinking??
351 replies · 116 reposts · 2K likes · 328.9K views
Chris Mauck @cmauck10
@flavioAd We ran GPT-4.5 through our real-time Eval solution; the findings are mixed...
0 replies · 0 reposts · 0 likes · 30 views
Flavio Adamo @flavioAd
🚨 GPT-4.5 is impressive! 🚨 This is the most realistic result so far. "write a Python program that shows a ball bouncing inside a spinning hexagon. The ball should be affected by gravity and friction, and it must bounce off the rotating walls realistically"
77 replies · 69 reposts · 1.1K likes · 851.4K views
Chris Mauck @cmauck10
@dylan522p We ran GPT-4.5 through our real-time Eval solution; the findings are mixed...
0 replies · 0 reposts · 0 likes · 3 views
Dylan Patel @dylan522p
Claude 3.7 beats GPT-4.5 at most tasks. But 4.5 has better vibes... the first non-Anthropic model since 3 Opus.
4.5 is vibey, and legitimately the first time a model made me laugh. Humor is intelligence.
You exist in the context of all in which you live and what came before you meaning
San Francisco, CA 🇺🇸
42 replies · 42 reposts · 807 likes · 151.5K views
Chris Mauck @cmauck10
@benhylak We ran GPT-4.5 through our real-time Eval solution; the findings are mixed...
0 replies · 0 reposts · 0 likes · 10 views
ben hylak @benhylak
i've been testing gpt 4.5 for the past few weeks. it's the first model that can actually write. this is literally the midjourney-moment for writing. (comparison to gpt 4o below)
218 replies · 84 reposts · 2.1K likes · 647.1K views
Chris Mauck @cmauck10
@deedydas We ran GPT-4.5 through our real-time Eval solution; the findings are mixed...
0 replies · 0 reposts · 0 likes · 14 views
Deedy @deedydas
Can confirm. GPT 4.5 is hilarious. "Finish the text: > Be me > L3 at Meta / Google >"
104 replies · 204 reposts · 6.7K likes · 885.1K views
Chris Mauck @cmauck10
@nikunjhanda We ran GPT-4.5 through our real-time Eval solution; the findings are mixed...
0 replies · 0 reposts · 0 likes · 9 views
Nikunj Handa @nikunjhanda
Pro-tip: When using GPT-4.5, prepend your system message with the message below. Based on OpenAI internal evals, it results in better performance!
81 replies · 246 reposts · 3.9K likes · 665.4K views
Chris Mauck @cmauck10
@adonis_singh We ran GPT-4.5 through our real-time Eval solution; the findings are mixed...
0 replies · 0 reposts · 0 likes · 16 views
adi @adonis_singh
GPT-4.5 asked for 1 truly novel human insight (might be my favourite answer on this prompt)
424 replies · 2K reposts · 25.7K likes · 2M views
Chris Mauck retweeted
Akshay 🚀 @akshay_pachaar
Let's build a trustworthy RAG app that provides a confidence score for each response:
11 replies · 42 reposts · 437 likes · 99.6K views
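The idea in the tweet above, a RAG app that attaches a confidence score to each response, can be sketched in miniature. Everything here is a stand-in: the `ScoredAnswer` record, the toy word-overlap heuristic, and the example context are illustrative, not the Cleanlab or any RAG framework's API (real systems use calibrated uncertainty estimation, not word overlap).

```python
# Hypothetical sketch: a RAG answer paired with a confidence score derived
# from how well the retrieved context supports it.
from dataclasses import dataclass

@dataclass
class ScoredAnswer:
    answer: str
    confidence: float  # 0.0 (unsupported) .. 1.0 (fully supported)

def confidence_from_context(answer: str, context: str) -> float:
    """Toy heuristic: fraction of answer words that appear in the retrieved
    context. A stand-in for real uncertainty estimation."""
    answer_words = set(answer.lower().split())
    context_words = set(context.lower().split())
    if not answer_words:
        return 0.0
    return len(answer_words & context_words) / len(answer_words)

context = "cleanlab tlm scores every llm response for trustworthiness"
supported = ScoredAnswer("tlm scores every response",
                         confidence_from_context("tlm scores every response", context))
ungrounded = ScoredAnswer("the moon is made of cheese",
                          confidence_from_context("the moon is made of cheese", context))
print(supported.confidence > ungrounded.confidence)  # True
```

The point of the sketch is the shape of the output, every answer ships with a score, so the app can decline or escalate low-confidence responses instead of returning them blindly.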
Chris Mauck retweeted
LangChain @LangChain
🧹 Hallucination detection from Cleanlab
🧪 The new tlm-langchain package augments your LangChain / LangGraph applications with an LLM trustworthiness score. @CleanlabAI's Trustworthy Language Model detects incorrect LLM outputs in real time via state-of-the-art uncertainty estimation.
Get started here: help.cleanlab.ai/tlm/use-cases/…
3 replies · 40 reposts · 218 likes · 19.9K views
Chris Mauck retweeted
Curtis G. Northcutt @cgnorthcutt
NEWS: @CleanlabAI + @pinecone set a new standard for trustworthy GenAI/RAG! Our latest: AI that’s accurate, curated, and hallucination-free, using Cleanlab's knowledge curation and Pinecone's vector search. Reliable responses and trust scoring. Full blog 👇 pinecone.io/learn/building…
0 replies · 3 reposts · 9 likes · 1.5K views
Chris Mauck retweeted
Cleanlab @CleanlabAI
Want to reduce the error-rate of responses from OpenAI’s o1 LLM by over 20% and also catch incorrect responses in real-time? Just published: 3 benchmarks demonstrating this can be achieved with the Trustworthy Language Model (TLM) framework [...]
1 reply · 4 reposts · 11 likes · 2.6K views