Mahesh Sathiamoorthy

4.5K posts

@madiator

RL Environment Curation. Data Curation (OpenThoughts). Post-training. CEO @bespokelabsai. Ex-GoogleDeepMind.

Inside a RL Environment · Joined February 2008

1.4K Following · 14.5K Followers

Pinned Tweet

Mahesh Sathiamoorthy@madiator·28 Jan

We are announcing Open Thoughts, our large-scale open-source effort to curate the best open reasoning datasets! DeepSeek-R1 is amazing but we still don't have access to high-quality open reasoning datasets. These datasets are crucial if you want to build your reasoning models! Bespoke Labs released a 17k reasoning dataset last Wednesday, and the reception has been phenomenal (it's trending on HF). So we are joining forces with the Datacomp community to launch Open Thoughts --- an open data, open model, and open code initiative for creating the best open reasoning datasets and the associated models. Along with this, we release OpenThoughts-114k reasoning dataset and the associated OpenThinker-7B model. Links to the code, model, and data are below in 🧵.

286

1.8K

226.8K

Mahesh Sathiamoorthy@madiator·2d

@mjamei that's just the interface. what about adding helper functions for calling llm-as-a-judge, storing to yaml, reading from yaml etc.

Mehdi Jamei 🗽@mjamei·2d

@madiator This is 20 lines of code. why would you need a lib?

Mahesh Sathiamoorthy@madiator·3d

What's the library people use for defining/loading/processing rubrics?

2.9K

Mahesh Sathiamoorthy@madiator·3d

@AashaySachdeva Standardization, reuse etc. Also, I asked opus and it gave me this. Do you like this way of representing rubrics?

195

aashay sachdeva@AashaySachdeva·3d

@madiator Why do you need a library for this?

343

Mahesh Sathiamoorthy@madiator·4d

Weekend project: getting back to some fixes in Curator..

725

Mahesh Sathiamoorthy@madiator·4d

@NandoDF They make more money since I use Claude periodically to keep my version of Claude code in sync with theirs.

154

Nando de Freitas@NandoDF·5d

What happens to Anthropic when anyone can use Claude Code to generate Claude Code?

102

26.2K

Mahesh Sathiamoorthy@madiator·4d

Bay area is so pretty now

1.7K

Mahesh Sathiamoorthy@madiator·5d

Claude code high-five'ing itself about how well it explored the code :)

2.1K

Mahesh Sathiamoorthy@madiator·6d

I am curious to hear what the community thinks will be the context length of opus in two years!

Claude@claudeai

1 million context window: Now generally available for Claude Opus 4.6 and Claude Sonnet 4.6.

Mahesh Sathiamoorthy@madiator·6d

@BobbySamuels @bespokelabsai, @datologyai and a few others have been doing this for years!

183

Bobby Samuels@BobbySamuels·11 Mar

x.com/i/article/2030…

513

284.8K

Mahesh Sathiamoorthy@madiator·13 Mar

Haha 😂

Saurabh Shah@saurabh_shah2

Ohhh good point!! Since RL mostly works now RL data might be a thing people wanna sell. Is anyone doing this? Selling RL envs? Is there even a single company doing this?

1.6K

Mahesh Sathiamoorthy@madiator·12 Mar

Drop that "in filmmaking" -- it's cleaner!

571

Mahesh Sathiamoorthy@madiator·11 Mar

I will be talking at the Agentic Evals workshop organized by HuggingFace. It's streamed live, so mark your calendars!

Ben Burtenshaw@ben_burtenshaw

LIVE WORKSHOP: The State of Agentic Evals! Agents are capable of multi-step reasoning, tool use, and real-world task completion, so evaluation needs to keep up. We will discuss topics related to questions such as: Where is the state of the art in evaluating agentic systems? Why is agentic systems’ performance on benchmarks not reflected in usage? How can we evaluate the agentic systems and language models that we use? This workshop brings together diverse perspectives from academia, industry, and policy to explore the frontier of agentic evaluation.

702

Mahesh Sathiamoorthy Retweeted

Daanish Khazi@bertgodel·4 Mar

We’re announcing Kos-1 Lite, a medical model that achieves SOTA on HealthBench Hard at 46.6%. As a medium sized language model (~100B), it achieves these results at a fraction of the serving cost of frontier trillion-parameter models.

318

24.6K

Mahesh Sathiamoorthy Retweeted

rohan anil@_arohan_·2 Mar

I feel a bit responsible for hyping agentic coding in December, as I was having, and am still having, too much fun doing my best technical work. However, I heard some gossip about certain big tech companies hiring fewer junior engineers, so I wanted to make a point. If you want your engineering output to actually compound, hire ambitious junior engineers, give them exceptional tools, and pair them tightly with senior engineers who are great communicators and genuinely care about teaching. Juniors move fast and explore multiple approaches, while seniors spend their time framing the hard problems and raising the bar for everyone around them. This will avoid endless debates and death by committees.

472

33.7K

Mahesh Sathiamoorthy Retweeted

alex fazio@alxfazio·28 Feb

you should be headless claude maxxing, so here’s an article that explains it better than the anthropic docs

danialhasan@dhasandev

x.com/i/article/2009…

822

195.6K

Mahesh Sathiamoorthy@madiator·24 Feb

Bought a Tesla Model Y in 2021 for 55k (or actually probably slightly more). I owe 15k now and its value in the market is like 19k. So after all these years, I would get 4k out of it if I had to sell it. I probably went through a time before where I was underwater..

3.9K

Mahesh Sathiamoorthy@madiator·23 Feb

Slack is not here, and it makes sense that it's not here. I have seen various people say that they can just vibe code Slack. But the main selling point of Slack is that I can interact with other organizations via external Slack Connect. So it has a nice moat based on network effects. Now, can someone please vibe code a standard so that vibe-coded Slacks can talk to each other?

Tenobrus@tenobrus

gigafucked:
- grammarly
- calendly
- miro
- retool
- webflow
- langchain
- writer
- harvey
- glean
- expedia
- monday

fucked:
- accenture
- intuit
- notion
- jasper
- canva
- alphasense
- postman
- airtable
- talkdesk
- sierra
- zapier
- replit
- solace

probably fucked:
- cursor
- pilot
- clay
- mercor

naively seems fucked but so competent / plugged in they seem to be figuring it out on the fly anyway:
- linear

3.7K

Mahesh Sathiamoorthy@madiator·22 Feb

Growing spinach in the backyard. Also have radish, cilantro, mint, rosemary. The eggplant plant survived the winter, so it should produce a nice yield this summer..

1.2K

Mahesh Sathiamoorthy@madiator·21 Feb

This benchmark helps make faster progress on RL training of small models. Congrats on the release and happy to have played a role here with Bespoke.

Richard Zhuang@RichardZ412

Terminal-Bench is a leading benchmark for agents. Unfortunately it’s hard: most small coding agents get very low scores on TB2, so training/system ablations look flat - you can't tell what's working. Announcing OpenThoughts-TBLite - 100 curated TB2-style tasks, difficulty-calibrated so even 8B models can make progress. It's designed to give researchers measurable signal during development, providing faster feedback for experimental iteration while closely tracking true TB2 performance🧵
