Theophile Sautory

40 posts

Theophile Sautory

Theophile Sautory

@TSautory

applied x startups @openai views my own

San Francisco, CA Katılım Şubat 2020
204 Takip Edilen152 Takipçiler
Houda Nait El Barj
Houda Nait El Barj@Houda_nait·
This weekend, Codex stunned me with beauty. It solved a problem with striking elegance, taking a path few human minds would have chosen. Yet the result was undeniably correct. I think AI reasoning may be starting to diverge from human reasoning in a deep but beautiful way.
English
3
0
16
932
Theophile Sautory retweetledi
Windsurf
Windsurf@windsurf·
GPT-5.4 mini is now available in Windsurf!
English
18
14
273
34.5K
Theophile Sautory retweetledi
Evan Goldschmidt
Evan Goldschmidt@evangoldschmidt·
@OpenAI's frontier models sit at the core of our approach. recently we sat down with them to document & share some of our learnings: openai.com/index/tolan/
English
1
1
2
199
Theophile Sautory retweetledi
Windsurf
Windsurf@windsurf·
GPT-5.2 is now live in Windsurf! Available for 0x credits for a limited time (paid and trial users). The version bump undersells the jump in intelligence: - Biggest leap for GPT models in agentic coding since GPT-5 - SOTA coding model at its price point - Default in Windsurf
Windsurf tweet media
English
40
40
488
119.6K
Theophile Sautory retweetledi
Noah MacCallum
Noah MacCallum@noahmacca·
GPT-5.1-Codex-Max is our best agentic coding model, and is now available in our API. It's a great model, but it’s also easier than ever to integrate. I’ve spent the last two weeks experimenting with our top customers - here's how to get the most out of it 🧵
English
20
9
294
30.3K
Theophile Sautory retweetledi
Alexander Embiricos
Alexander Embiricos@embirico·
We just shipped GPT-5.1-Codex-Max to API! Codex agents everywhere are now faster & smarter, whether you're using Codex in the CLI via API key, or in other tools like @cursor_ai and @code. More about how Cursor updated their harness to make the most of it cursor.com/blog/codex-mod…
English
29
26
417
69.4K
Theophile Sautory retweetledi
Consensus
Consensus@ConsensusNLP·
We are thrilled to partner with @OpenAI on this launch. Scholar Agent is powered by GPT-5. @sama, @gdb, @kevinweil and team have built an amazing product that powers the future AI-enabled scientific exploration and discovery. More on our partnership⤵️ openai.com/index/consensu…
English
1
6
26
6.5K
Theophile Sautory retweetledi
shyamal
shyamal@shyamalanadkat·
getting started with evals doesn't require too much. the pattern that we've seen work for small teams looks a lot like test‑driven development applied to AI engineering: 1/ anchor evals in user stories, not in abstract benchmarks: sit down with your product/design counterpart and list out the concrete things your model needs to do for users. "answer insurance claim questions accurately", "generate SQL queries from natural language". for each, write 10–20 representative inputs and the desired outputs/behaviors. this is your first eval file. 2/ automate from day one, even if it's brittle. resist the temptation to "just eyeball it". well, ok, vibes doesn't scale for too long. wrap your evals in code. you can write a simple pytest that loops over your examples, calls the model, and asserts that certain substrings appear. it's crude, but it's a start. 3/ use the model to bootstrap harder eval data. manually writing hundreds of edge cases is expensive. you can use reasoning models (o3) to generate synthetic variations ("give me 50 claim questions involving fire damage") and then hand‑filter. this speeds up coverage without sacrificing relevance. 4/ don't chase leaderboards; iterate on what fails. when something fails in production, don't just fix the prompt – add the failing case to your eval set. over time your suite will grow to reflect your real failure modes. periodically slice your evals (by input length, by locale, etc.) to see if you're regressing on particular segments. 5/ evolve your metrics as your product matures. as you scale, you'll want more nuanced scoring (semantic similarity, human ratings, cost/latency tracking). build hooks in your eval harness to log these and trend them over time. instrument your UI to collect implicit feedback (did the user click "thumbs up"?) and feed that back into your offline evals. 6/ make evals visible. put a simple dashboard in front of the team and stakeholders showing eval pass rates, cost, latency. use it in stand‑ups. this creates accountability and helps non‑ML folks participate in the trade‑off discussions. finally, treat evals as a core engineering artifact. assign ownership, review them in code review, celebrate when you add a new tricky case. the discipline will pay compounding dividends as you scale.
English
4
31
220
27.2K
Theophile Sautory retweetledi
shyamal
shyamal@shyamalanadkat·
new guide on “exploring model graders for reinforcement fine-tuning” on openai’s platform. h/t @TSautory final result was a high-signal, domain-sensitive grader that guided the model toward better explanations. cookbook.openai.com/examples/reinf…
English
5
9
72
15.5K
Theophile Sautory retweetledi
Bill Chen
Bill Chen@realchillben·
I will be hosting this week's @OpenAI build hour this Thurs! This one will be on Image Gen! I will talk about: - cool use cases you can build with it - how you can use it together with other hosted tools - best practices - demo you can use! It will be technical. Builders welcome!
Bill Chen tweet media
English
5
11
225
27.2K
Theophile Sautory retweetledi
Brendan Fortuner
Brendan Fortuner@bfortuner·
Ambience's RFT-tuned model boosted ICD-10 coding accuracy 27% over expert clinicians—and it's now spotlighted in @OpenAIDevs' Reinforcement Fine-Tuning use case guide. Excited to see what the community builds with this powerful new customization technique: platform.openai.com/docs/guides/rf…
English
2
19
105
45.8K
Theophile Sautory retweetledi
Noah MacCallum
Noah MacCallum@noahmacca·
I've been loving GPT-4.1, and it's now my daily driver. We've spent hundreds of hours experimenting to get it performing at its best, and I wanted to share some key things we learned to help you get the most out of these models from day 1. Agentic workflows - This is your new starting point. Just add tools and prompt to make it plan, execute, and verify iteratively until producing a correct answer. Only add multi-agent frameworks once you've squeezed as much as you can here. - Coding agents will be particularly good with this model, especially when paired with our recommended diff output format and diff apply tool. Chain of thought - Prompting for chain of thought can help to break down the problem and improve overall intelligence. - Start with a simple instruction at the end of your prompt, as the model is already quite good at thinking through problems. Then, observe correct and incorrect eval samples to direct further changes to the thinking instructions. - Often, asking the model to spend more time understanding user intent and gathering context can help. Long context - The full context window (now 1M token in put) is much more performant/usable than before. As task complexity increases, performance might still improve by decreasing context size. - Check the guide for suggested delimiters. Improving instruction following - Performance is much better with this model, and you can add significant detail for what to do/not do in different circumstances. - Many model performance issues actually come from inconsistent, conflicting, or underspecified instructions (including conflicting information in examples as well). - Be careful with telling the model to "always" do something, you might not actually want that and it can induce hallucinations, for example if it needs to call a tool with insufficient context. - I find it's helpful to ask ChatGPT Canvas or Cursor for help iterating on prompts cohesively (e.g. asking for conflicting instructions, edge cases, or adding an instruction and updating examples at the same time) Prompt structure - tl;dr check out the recommended template (attached) - Context at or near the bottom seems to be best (and repeating instructions both above and below very long context can actually help too). - Markdown and XML both work great
Noah MacCallum tweet media
English
15
77
810
114.7K
Theophile Sautory retweetledi
Houda Nait El Barj
Houda Nait El Barj@Houda_nait·
Nick Romeo's new book starts with the story of @coreeconteam. When I joined CoreEcon in 2016, we believed in “Better Economics for a Better World” & aimed to reform the Economics curriculum. Today, our textbook is used in ~400 colleges, and constantly integrates latest research🎊
PublicAffairs@public_affairs

Capitalism causes environmental destruction and massive income inequality, impoverishing most at the expense of enriching the few. But there is an alternative. Check out Nick Romeo's new book THE ALTERNATIVE, in bookstores today. hachettebookgroup.com/titles/nick-ro…

English
0
3
9
2.9K
Theophile Sautory retweetledi
Daniele Grattarola
Daniele Grattarola@riceasphait·
Fresh off the press by @jacobbamberger @SPLTS2 "A topological characterisation of Weisfeiler-Leman equivalence classes" Introduces a dataset of non-isomorphic graphs that cannot be distinguished by message-passing. arxiv.org/abs/2206.11876
Daniele Grattarola tweet media
English
1
7
42
0
Groupe Galeries Lafayette
Groupe Galeries Lafayette@Galeries_Laf·
La Coupole des Galeries Lafayette Paris Haussmann à 360 degrés, comme vous ne l’avez jamais vue ! L’occasion de (re)découvrir ce joyau de l’Art Nouveau édifié en 1912 qui a achevé l’an dernier une rénovation majeure retrouvant ainsi toute sa splendeur originelle. 📸 @dixhuit_prod
Groupe Galeries Lafayette tweet media
Français
6
4
10
0