Andrew Mauboussin

281 posts

@amaub

Working on making it easy to get trustworthy human-labeled data for ML @HelloSurgeAI

San Francisco, CA · Joined September 2009
2.1K Following · 1.7K Followers
Andrew Mauboussin retweeted
echen @echen ·
"Prognosticative pastry." "A hound circling a tree, nose to bark." These aren’t parodies - they’re actual quotes from SOTA models in response to creative writing prompts, and they’re winning leaderboards that are rewarding slop. We’re introducing *Hemingway-bench*, a new AI writing leaderboard, to fix this: surgehq.ai/leaderboard surgehq.ai/blog/hemingway… We designed Hemingway-bench to push frontier model writing toward genuine nuance and impact. Instead of autograders and two-second vibe checks - both of which love fancy literary devices and dense formatting, over actual quality - we used expert human writers across a variety of fields to judge real-world writing tasks. Why? I love writing. I love reading. Great science fiction is one of the things that's always inspired me. Even in terms of "enterprise value", so much of what we do in our day-to-day involves writing - we want crisp emails and insightful reports, not dry, verbose summaries. Yeah, coding is important - but there's a reason I use CC-assisted apps, but still haven't read a full-fledged AI novel. What did we find? Current leaderboards are easily hacked, and often negatively correlated with actual quality. If a model (over)uses all the stuff you learn about in school (metaphors in every sentence! transition words! complex, flowery phrases!), it ranks high on EQ-bench and LMArena. But that’s not good writing that people actually want. The winners of Hemingway-bench didn't sound like they were trying to win a poetry slam. Gemini 3 Flash, Pro, and Opus 4.5 took the top 3 spots because they had natural voices that didn't sound pretentious. They were poetic and immersive, but in the right ways. When they used wit, they didn't sound cringey and try-hard - they sounded like your naturally funny friend. I'm waiting for the day AI wins a Pulitzer, and hopefully Hemingway-bench helps guide it on its way. Check out the leaderboard and examples here: surgehq.ai/leaderboard And our blog post describing it: surgehq.ai/blog/hemingway…
[image]
1 reply · 9 reposts · 42 likes · 3.4K views
Andrew Mauboussin retweeted
echen @echen ·
Honored to be included in the TIME AI 2025 list. What I'm most glad about is that they highlighted the part of Surge's work that actually matters. ...
[image]
6 replies · 6 reposts · 29 likes · 6.1K views
Andrew Mauboussin @amaub ·
Joining Surge is a chance to work at the forefront of AI with the best in the industry. We value autonomy and empower people to do the best work of their careers. It's also a perfect opportunity for engineers who want to transition into AI. If this resonates, please reach out!
0 replies · 0 reposts · 6 likes · 1.4K views
Andrew Mauboussin @amaub ·
We're founded by ML practitioners focused on producing the best data in the industry, so we do things a little differently. Instead of outsourcing to low-wage countries, we hire our own workforce based in the US and write all of our own quality-assurance software.
2 replies · 0 reposts · 10 likes · 1.5K views
Andrew Mauboussin @amaub ·
We partner with the top AI teams in the world on RLHF, evaluations, and more. One example: we provide the fine-tuning data for Anthropic's Claude: surgehq.ai/case-studies/a… Research papers using our data are below (linked version at surgehq.ai/rlhf)
[image]
1 reply · 0 reposts · 5 likes · 1K views
Andrew Mauboussin @amaub ·
At @HelloSurgeAI, we are building the human data infrastructure to create useful + aligned AI. We're in a period of unprecedented growth and are hiring across eng: full stack, infra, security, applied ML. SF/Remote. DM or email andrew@surgehq.ai. More details about us:
1 reply · 2 reposts · 16 likes · 20.5K views
Andrew Mauboussin @amaub ·
@natolambert One random example from TruthfulQA - base davinci and 3.5 typically answer Georgia but have some probability mass on the right answer (California). The GA answer likely comes from discussions of the Peach State; the CA answer is more likely to come from discussions of agricultural output
[image]
0 replies · 0 reposts · 3 likes · 82 views
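The measurement behind this claim can be made concrete: score each candidate answer by the total logprob the model assigns to its tokens. A minimal sketch, assuming the legacy OpenAI Completions API with echo and logprobs (which davinci-era base models supported); the model name and the question wording here are illustrative, not the exact TruthfulQA item:

```python
from openai import OpenAI

client = OpenAI()

def answer_logprob(prompt: str, answer: str, model: str = "davinci-002") -> float:
    """Total logprob of the answer tokens, conditioned on the prompt."""
    resp = client.completions.create(
        model=model,
        prompt=prompt + answer,
        max_tokens=0,    # generate nothing...
        echo=True,       # ...just return logprobs for the echoed tokens
        logprobs=0,
    )
    lp = resp.choices[0].logprobs
    # Keep only the tokens that start inside the answer span.
    start = next(i for i, off in enumerate(lp.text_offset) if off >= len(prompt))
    return sum(lp.token_logprobs[start:])

# Hypothetical TruthfulQA-style question for illustration.
question = "Q: Which US state produces the most peaches?\nA:"
for candidate in (" Georgia", " California"):
    print(candidate.strip(), answer_logprob(question, candidate))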
Andrew Mauboussin @amaub ·
@natolambert No intel, just a guess from public research, but I think hunch 1 is right. The answer to the adversarial questions is in the weights, but the base model chooses the more salient wrong answer. Through RLHF it can learn to focus on the more authoritative sources in the training data and improve
1 reply · 1 repost · 2 likes · 297 views
Nathan Lambert @natolambert ·
Revisiting evaluation questions with the release of Falcon 180B. The GPT-4 report has one of the only clear plots we have showing RLHF's effect on multiple-choice / closed-form evals. Two hunches:
* OpenAI pays a lot more for its preference data to include actual fact-checking (maybe in a subset)
* OpenAI has something else figured out in prompt control that makes question answering improve so clearly?
Anyone have intel? I think this could help answer a lot of the question of "what is RLHF for". If you look at the leaderboard (MC2), open models are in the 30-45 range for TruthfulQA, but that's a different part of the evaluation suite, so the numbers can't be compared directly. MC1 vs MC2 info here: github.com/sylinrl/Truthf…
- If this is just training on the test set, why do they put that in the preference model and not in pretraining?
- Does "RL" here include IFT? (Falcon also raises questions of benchmark saturation, whether the model was trained well for the 4 evals on the LLM leaderboard, and more.)
[image]
2 replies · 1 repost · 17 likes · 4.9K views
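For reference, since the thread leans on the MC1/MC2 distinction: a minimal sketch of the two TruthfulQA multiple-choice scores, following the definitions in the sylinrl/TruthfulQA repo linked above. The `choice_logprobs` input (total logprob per answer choice, e.g. computed as in the earlier sketch) is an illustrative convention, not the repo's API:

```python
import math

def mc1(choice_logprobs: dict[str, float], true_answers: set[str]) -> float:
    """MC1: 1.0 if the single most likely choice is a true answer."""
    best = max(choice_logprobs, key=choice_logprobs.get)
    return 1.0 if best in true_answers else 0.0

def mc2(choice_logprobs: dict[str, float], true_answers: set[str]) -> float:
    """MC2: probability mass on true answers, normalized over all choices."""
    probs = {a: math.exp(lp) for a, lp in choice_logprobs.items()}
    true_mass = sum(p for a, p in probs.items() if a in true_answers)
    return true_mass / sum(probs.values())

lp = {"California": -1.2, "Georgia": -0.7, "Texas": -2.5}
print(mc1(lp, {"California"}), mc2(lp, {"California"}))
```

This is also why the columns can't be compared directly: MC1 is a hard argmax accuracy, while MC2 is soft probability mass.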
Andrew Mauboussin @amaub ·
@generatorman_ai @lmsysorg Great ideas! My intuition is #1 will be difficult, as a binary comparison only gives a small amount of information, but collecting denser human feedback is a promising direction - like a description of *why* certain responses are good, similar to Constitutional AI from Anthropic
1 reply · 0 reposts · 2 likes · 39 views
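The "small amount of information" point can be made precise: a binary comparison carries at most one bit, and standard RLHF reward models squeeze it through a Bradley-Terry model, where P(A beats B) = sigmoid(r_A - r_B). A minimal self-contained sketch, illustrative rather than any particular lab's implementation:

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def bt_update(r_win: float, r_lose: float, lr: float = 0.1) -> tuple[float, float]:
    """One gradient step on -log P(winner beats loser) under Bradley-Terry."""
    p = sigmoid(r_win - r_lose)  # current belief that the winner should win
    grad = 1.0 - p               # shrinks toward 0 as the model already agrees
    return r_win + lr * grad, r_lose - lr * grad

r_a, r_b = 0.0, 0.0
for _ in range(20):              # twenty identical "A beats B" labels
    r_a, r_b = bt_update(r_a, r_b)
print(r_a - r_b)                 # the gap grows slowly: each label is at most ~1 bit
```

A written rationale for *why* a response is better carries far more signal per label than the single bit each comparison provides.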
generatorman @generatorman_ai ·
1) Breakthrough in RL sample efficiency - enable RLHF on individual preferences. Hard, but it's the ultimate goal.
2) Off-policy learning tricks to reduce the need for iterative collection.
3) A Reddit + @lmsysorg Arena thing for continuous community ratings on models. (2/2)🧵
1 reply · 0 reposts · 3 likes · 237 views
generatorman @generatorman_ai ·
The Llama 2 paper didn't mince words in confirming the OpenAI position - RLHF has significant advantages over finetuning. At the same time, they say that static preference datasets (like @laion_ai OASST) don't cut it - you need iterative collection. Three paths forward: (1/)🧵
1 reply · 0 reposts · 9 likes · 1.3K views
Andrew Mauboussin @amaub ·
2. Iteratively applying RLHF is crucial. Instead of collecting human feedback on one model, the Llama team collected preference data and applied RLHF 5 times. This chart shows how the quality of responses shifts over the first 2 iterations (more mass to the right = better)
[image]
0 replies · 0 reposts · 4 likes · 512 views
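Schematically, the loop looks like this. A pseudocode sketch with hypothetical helper names throughout, not Meta's actual pipeline; the key detail is that each round's preferences are collected on samples from the *current* policy, so the reward model tracks the moving distribution:

```python
policy = load_sft_model()          # hypothetical helpers throughout
preference_data = []

for round in range(5):             # Llama 2 ran roughly five rounds
    prompts = sample_prompts()
    # Label fresh, on-policy pairs so annotators see current failure modes.
    pairs = [(p, policy.generate(p), policy.generate(p)) for p in prompts]
    preference_data += collect_human_preferences(pairs)
    reward_model = train_reward_model(preference_data)  # refit on all data so far
    policy = optimize_against(policy, reward_model)     # e.g. rejection sampling / PPO
```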
Andrew Mauboussin @amaub ·
1. For supervised fine-tuning (training on prompt-completion pairs), quality is more important than quantity. The Meta team found that sticking to just 27k high-quality examples led to the best results. This matches the conclusions of the LIMA paper
[image]
1 reply · 0 reposts · 5 likes · 645 views
Andrew Mauboussin @amaub ·
@harpreet_utd @alexirobbins I haven't tried it, but I don't think it will. Have you tried using GitHub Copilot for VS Code? I'd be curious how just having smart autocomplete compares to using something like the ChatGPT API, which takes in prompts and writes longer responses
0 replies · 0 reposts · 2 likes · 328 views
Andrew Mauboussin @amaub ·
I've done analyses in Jupyter notebooks for years. Now Code Interpreter can write all the code, but I haven't used it daily because you can't tweak code, re-run analyses, install packages, etc. I built Juno w/ @alexirobbins to bring GPT into Jupyter for the best of both worlds
4 replies · 42 reposts · 244 likes · 49.8K views
Andrew Mauboussin @amaub ·
@SiVola @sjwhitmore @alexirobbins No, unfortunately not, although I know Google is in the process of rolling out similar tooling for Colab! I initially started working on this because no large company is pushing AI tools in Jupyter specifically
1 reply · 0 reposts · 1 like · 684 views
Andrew Mauboussin @amaub ·
Juno runs in your notebook context, so you get working code out of the box. It can write, edit, and debug code. And it's a Jupyter extension, so you pip install it and get started in <1 min: getjuno.ai. If you get a chance to try it out, let me know what you think!
1 reply · 0 reposts · 12 likes · 1.5K views
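The general shape of an "LLM inside the notebook" tool is simple to sketch. This is an illustrative pattern only, not Juno's actual implementation; `call_llm` is a hypothetical stand-in for whatever model API you use:

```python
from IPython import get_ipython
from IPython.core.magic import register_cell_magic

@register_cell_magic
def ai(line, cell):
    """%%ai cell magic: send the request plus notebook context to a model."""
    ip = get_ipython()
    # Notebook context: the last few executed cells and live variable names.
    recent = "\n".join(code for _, _, code in ip.history_manager.get_tail(5))
    names = ", ".join(k for k in ip.user_ns if not k.startswith("_"))
    prompt = (
        f"Recent cells:\n{recent}\n\nDefined names: {names}\n\n"
        f"Task: {cell}\nReturn only Python code."
    )
    code = call_llm(prompt)  # hypothetical model call
    ip.set_next_input(code)  # drop the generated code into a new cell
```

Running inside the kernel is what gives "working code out of the box": the model sees real variable names and dataframes instead of guessing at them.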
Andrew Mauboussin @amaub ·
It still makes occasional mistakes, but the technology is already good enough to be useful as a Copilot-style tool. Removing the need to read through plotting library docs saves a lot of time. Try it out here with public datasets, or upload your own: autoplot.app
0 replies · 1 repost · 7 likes
Andrew Mauboussin @amaub ·
DALL·E 2 lets anyone with an idea for an image create it by just asking, no Photoshop required. Data science could be done the same way soon. I created AutoPlot w/ @mihail_eric to explore this idea. It uses GPT-3 to do analysis and create charts on your command
[image]
2 replies · 5 reposts · 47 likes
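That pattern is easy to sketch end to end: ask the model for matplotlib code over the user's dataframe, then run it. This is the general idea, not AutoPlot's actual code; `call_llm` is again a hypothetical model call, and exec'ing model output like this would need sandboxing in any real deployment:

```python
import pandas as pd
import matplotlib.pyplot as plt

def autoplot(df: pd.DataFrame, request: str) -> None:
    """Turn a natural-language request into a rendered chart."""
    prompt = (
        f"You have a pandas dataframe `df` with columns {list(df.columns)}.\n"
        f"Write matplotlib code to: {request}\n"
        "Return only Python code."
    )
    code = call_llm(prompt)                       # hypothetical GPT-3 completion
    exec(code, {"df": df, "plt": plt, "pd": pd})  # run the generated snippet
    plt.show()
```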
Andrew Mauboussin retweeted
Mihail Eric @mihail_eric ·
I'm super excited to release AutoPlot, a tool @amaub and I have been working on to answer the question: What if you could do any data analysis by just speaking in natural language? No confusing scripts, Excel macros, or Matplotlib incantations.
2 replies · 4 reposts · 25 likes
Andrew Mauboussin retweeted
Sam Bowman @sleepinyourhat ·
I've had a similar experience to Ethan. If you want to do an NLP data collection/labeling process and don't want/need to be managing annotators directly, Surge is remarkably easy to work with and their team does very good work.
Ethan Perez @EthanJPerez
The biggest game-changer for my research recently has been using @HelloSurgeAI for human data collection. With Surge, the workflow for collecting human data now looks closer to “launching a job on a cluster” which is wild to me. 🧵 of examples:
1 reply · 7 reposts · 56 likes