

Mahesh Sathiamoorthy

@madiator
RL Environment Curation. Data Curation (OpenThoughts). Post-training. CEO @bespokelabsai. Ex-GoogleDeepMind.







1 million context window: Now generally available for Claude Opus 4.6 and Claude Sonnet 4.6.

LIVE WORKSHOP: The State of Agentic Evals! Agents are capable of multi-step reasoning, tool use, and real-world task completion, so evaluation needs to keep up. We will discuss questions such as: Where is the state of the art in evaluating agentic systems? Why is agentic systems’ performance on benchmarks not reflected in usage? How can we evaluate the agentic systems and language models that we use? This workshop brings together diverse perspectives from academia, industry, and policy to explore the frontier of agentic evaluation.



gigafucked:
- grammarly
- calendly
- miro
- retool
- webflow
- langchain
- writer
- harvey
- glean
- expedia
- monday

fucked:
- accenture
- intuit
- notion
- jasper
- canva
- alphasense
- postman
- airtable
- talkdesk
- sierra
- zapier
- replit
- solace

probably fucked:
- cursor
- pilot
- clay
- mercor

naively seems fucked but so competent / plugged in they seem to be figuring it out on the fly anyway:
- linear

Terminal-Bench is a leading benchmark for agents. Unfortunately, it’s hard: most small coding agents get very low scores on TB2, so training/system ablations look flat and you can't tell what's working. Announcing OpenThoughts-TBLite: 100 curated TB2-style tasks, difficulty-calibrated so even 8B models can make progress. It's designed to give researchers measurable signal during development, providing faster feedback for experimental iteration while closely tracking true TB2 performance 🧵