Shawn Sullivan

1.8K posts

@shawntsullivan

CTO @ https://t.co/Ejx8na6sMZ: GenAI for Edu. Benchmarking AI in Edu. Early reading with Reading Critters. Co-founder/ex-CTO @ Phase Genomics: AI+Genomics. MIT CS, Ex-MSFT

Seattle · Joined May 2014

193 Following · 380 Followers

Shawn Sullivan@shawntsullivan·14 Nov

MLB is cooked. If your baseball understanding is a spreadsheet, @TheJudge44 is your guy. If you actually watch baseball, you know @Mariners #CalRaleigh turned in a season no catcher has ever come close to. Judge was great. Cal was irreplaceable. "Valuable" is not "cell A1".

Shawn Sullivan@shawntsullivan·10 Oct

@deedydas Society of Mind vibes

412

Deedy@deedydas·9 Oct

The TRM paper feels like a significant AI breakthrough. It destroys the Pareto frontier on the ARC AGI 1 and 2 benchmarks (and Sudoku and Maze solving) with an estimated < $0.01 cost per task and cost < $500 to train the 7M model on 2 H100s for 2 days. [Training and test specifics] For ARC, it trained on 160 examples from ConceptARC. At test time, it uses the most common answer of 1000 augmentations and embeds a fixed shape of the task in the input. [Industry implications] Most AI companies today use general purpose LLMs with prompting for tasks. For specific tasks, smaller models may not just be cheaper, but far higher quality! Startups could (and should) train models for < $1000 for specific "fixed length" subtasks (specific PDF extraction, time series forecasting, etc) and use it as a tool to the general model to not only push performance, but build some meaningful IP at the task they're trying to automate.
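The "most common answer of 1000 augmentations" step is a form of test-time majority voting. A minimal sketch of that idea follows, assuming each augmented prediction has already been mapped back to the original orientation; the function name `majority_vote` and the grid representation are illustrative assumptions, not the paper's code.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the most frequent predicted grid (hypothetical helper).

    `predictions` is a list of grids, each a list of rows of ints.
    Ties are resolved in favour of the grid seen first.
    """
    # Lists are unhashable, so convert each grid to nested tuples for counting.
    keys = [tuple(tuple(row) for row in grid) for grid in predictions]
    winner, _count = Counter(keys).most_common(1)[0]
    # Convert back to the list-of-lists form the caller passed in.
    return [list(row) for row in winner]


# Toy usage: three "augmented" predictions, two of which agree.
preds = [
    [[1, 0], [0, 1]],
    [[1, 1], [0, 1]],
    [[1, 0], [0, 1]],
]
best = majority_vote(preds)
```

Voting only helps when the individual predictions are right more often than they agree on any single wrong answer, which is why it is usually paired with many diverse augmentations.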

206

1.6K

143K

Shawn Sullivan@shawntsullivan·10 Oct

@rryssf Applying evolutionary forces to model refinement is a good idea. This work also has some parallels to how language and an internal monologue (or equivalent for people without one) could have translated into evolutionary advantages.

1.9K

Robert Youssef@rryssf·9 Oct

RIP fine-tuning ☠️ This new Stanford paper just killed it. It’s called 'Agentic Context Engineering (ACE)' and it proves you can make models smarter without touching a single weight. Instead of retraining, ACE evolves the context itself. The model writes, reflects, and edits its own prompt over and over until it becomes a self-improving system. Think of it like the model keeping a growing notebook of what works. Each failure becomes a strategy. Each success becomes a rule. The results are absurd: +10.6% better than GPT-4–powered agents on AppWorld. +8.6% on finance reasoning. 86.9% lower cost and latency. No labels. Just feedback. Everyone’s been obsessed with “short, clean” prompts. ACE flips that. It builds long, detailed evolving playbooks that never forget. And it works because LLMs don’t want simplicity, they want context density. If this scales, the next generation of AI won’t be “fine-tuned.” It’ll be self-tuned. We’re entering the era of living prompts.

236

1.2K

7.8K

715.1K

Shawn Sullivan@shawntsullivan·25 Mar

The probability of the LLM making a bad decision for each action it takes is non-zero. Assuming programmers of equal skill, the difference is one or both of these:
A) Spending time to provide the LLM context, rules, and guidelines before using it
B) Carefully checking the LLM’s output to correct any bad decisions before they snowball
A and B are also non-binary, so one can do varying degrees of a good job on them to get different results.
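The compounding effect described above can be made concrete with a small back-of-the-envelope model. This is only a sketch under simplifying assumptions that are not in the post: each action fails independently with the same probability, option A is modelled as lowering that probability, and option B as a review step that catches a fixed fraction of bad decisions. All function and parameter names are hypothetical.

```python
def p_clean_run(p_bad: float, n_actions: int) -> float:
    """Probability that none of n independent actions is a bad decision."""
    return (1.0 - p_bad) ** n_actions


def p_clean_run_with_review(p_bad: float, catch_rate: float, n_actions: int) -> float:
    """Same as p_clean_run, but a review step (option B) catches a fraction
    `catch_rate` of bad decisions before they can snowball."""
    p_uncaught = p_bad * (1.0 - catch_rate)
    return (1.0 - p_uncaught) ** n_actions


# Illustrative numbers only: 50 actions in a session.
baseline = p_clean_run(p_bad=0.02, n_actions=50)
better_context = p_clean_run(p_bad=0.01, n_actions=50)  # option A
with_review = p_clean_run_with_review(p_bad=0.02, catch_rate=0.9, n_actions=50)  # option B
```

Because the per-action probability is raised to the power of the number of actions, small improvements in either A or B produce large differences over a long session, which is consistent with the "non-binary" point in the post.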

Eliezer Yudkowsky ⏹️@ESYudkowsky·24 Mar

We are underasking the question of why different skilled programmers report fantastically different results from "vibecoding". (Anyone yelling "skill issue" needs to learn what is not an advance-predicting explanation.)

282

62.7K

Shawn Sullivan@shawntsullivan·22 Mar

Some actual data-driven, practical advice for students using AI to get ready for AP test season, including specific recommendations for which models to use for which subjects. Edu AI benchmarks that mean something!

EduBench@AIEduBench

AP exam season is looming! 😰 Feeling the pressure? We tested how well popular AI models can help you get a 5. The results might surprise you 🧵👇 edubench.com/edublog/which-… #APTests #AI #EduTech 1/8

142

Shawn Sullivan@shawntsullivan·12 Mar

@athyuttamre Looks great! previous_response_id is killer. It always felt weird having to send so many strings back and forth. Same for hosted tools. It always felt weird that the model would need me to call basic tools for it. Excited to try it out!

1.4K

Atty Eleti@athyuttamre·11 Mar

Introducing the Responses API: the new primitive of the OpenAI API. It is the culmination of 2 years of learnings designing the OpenAI API, and the foundation of our next chapter of building agents. 🧵Here’s the story of how we designed it:

116

357

2.8K

Shawn Sullivan@shawntsullivan·12 Mar

@tunguz If only some clever sci-fi author like Orwell or Stephenson or Asimov or Dick or Gibson or Bradbury or Vonnegut or Huxley or Vinge or Heinlein had warned us

685

Bojan Tunguz@tunguz·12 Mar

It's so surreal to be living in the world where technologically we are about to enter into the SciFi future that we have been reading about all of our lives, while politically we are on the fast regressive backtrack towards the naked power politics of the nineteenth and earlier centuries. Like we are in some kind of upside-down nightmarish cyberpunk horror story.

595

31.1K

Shawn Sullivan@shawntsullivan·12 Mar

It seems unlikely we'll see true automation of software engineering until AGI. It also seems unlikely AGI doesn't result in automating software engineering. So:
P(engineering_automated | not_AGI_yet) ~= 0
P(engineering_automated | AGI) ~= 1
So is your assessment that P(AGI) in your lifetime is 25%? In the meantime, what's most likely is dramatically higher engineering productivity for the developers who embrace AI tools. Which is what we're seeing.
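The step from the two conditional probabilities to the question about P(AGI) follows from the law of total probability. As a worked sketch, write E for "software engineering is automated in your lifetime" and A for "AGI arrives in your lifetime" (shorthand introduced here, not in the post):

```latex
\begin{align*}
P(E) &= P(E \mid A)\,P(A) + P(E \mid \neg A)\,\bigl(1 - P(A)\bigr) \\
     &\approx 1 \cdot P(A) + 0 \cdot \bigl(1 - P(A)\bigr) \\
     &= P(A)
\end{align*}
```

Under those two approximations, an estimate of P(E) around 25% amounts to an estimate of P(A) around 25%.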

1.3K

Santiago@svpino·12 Mar

Yesterday, I said I was changing my estimate of seeing AI automate Software Engineering *in my lifetime* from 10% to around 25%. Today, I said it's bullshit to think this automation will happen in 3 - 6 months. And now people are melting down in my comments because they can't understand how these two statements are compatible.

181

45.2K

Shawn Sullivan@shawntsullivan·12 Mar

Tbh I was surprised to see these results, given most LLMs did reasonably well creating multiple-choice questions. Something about creating a quiz really seems to throw them for a loop... digging into why soon.

EduBench@AIEduBench

Now live on edubench.com: Quiz Composition benchmarks! Unlike MCQ Generation, frontier models struggle with most aspects of creating good quizzes. More details soon; for now, head over to edubench.com and see for yourself. What stands out to you?

Shawn Sullivan@shawntsullivan·8 Mar

AI-generated MCQs can look good but still fail students. A key issue? Bad distractors—wrong answers that are too obvious to provide effective assessment. Measuring distractor quality is a critical tool for using AI effectively in the classroom

EduBench@AIEduBench

We're benchmarking AI for education. What does that mean? Let's dive into an example. Suppose you're an AP World History teacher who wants to use an LLM to create a quiz question for your students... 🧵 1/7

Shawn Sullivan@shawntsullivan·6 Mar

One takeaway from our recent comparison of Claude 3.5 and 3.7 for edu: Both Claude versions fail because they're optimized for capability demos, not educational outcomes. Simply being able to generate questions isn't enough. We need models that create content that actually helps students learn. #EdTech

EduBench@AIEduBench

We analyzed Claude 3.5 vs 3.7 for creating educational content. The results are surprising... edubench.com/edublog/claude… 1/

100

Shawn Sullivan@shawntsullivan·4 Mar

@Jacobsklug @lovable Build

Jacob Klug@Jacobsklug·3 Mar

After generating $250K (last 2 months) I built a playbook for @lovable apps—and I’m giving it away. In just two months, we cracked the code to building apps with AI. I’ve distilled everything we learned into this single document. Comment "Build" and drop a follow. I’ll DM it to you. P.S. This will likely blow up, so give me some time to reply.

6.3K

194

3.2K

743.7K

Shawn Sullivan@shawntsullivan·27 Feb

Excited to announce a new project! In EduBench, we aim to comprehensively benchmark AI in educational settings. We're starting with multiple-choice questions. Much more to come, including our own efforts to create best-in-class LLMs for edu applications. Follow for more!

EduBench@AIEduBench

🚨 AI in education is at a crossroads. Too many AI tools look helpful but actually mislead students, fail to align with curricula, and give a false sense of competence. We’re launching EduBench to demand real educational impact from AI. 🧵👇

Shawn Sullivan@shawntsullivan·4 Feb

Holy sh*t. Been using @lovable for 2 weeks and it's already a game changer. Looks like it's only going to keep getting better.

emil@emilahlback

Bringing pixel perfect editing to your @lovable apps (feature preview)

155

Shawn Sullivan reposted

Martin Fowler@martinfowler·28 Jan

NEW POST My colleague Bharani Subramaniam has started to write patterns from our recent work building production Gen AI applications. We begin with Evals - ways of assessing if they are working effectively. martinfowler.com/articles/gen-a…

177

19.2K

Shawn Sullivan@shawntsullivan·22 Dec

@goodside Buckle up

243

Riley Goodside@goodside·21 Dec

ARC-AGI scores for past five years of OpenAI models (updated w/ release dates)

Riley Goodside@goodside

Past five years of OpenAI models vs. the ARC-AGI benchmark

100

522

2.9K

1.9M

Shawn Sullivan@shawntsullivan·22 Nov

Father of 2 here (and I couldn't imagine having more).
The right unit for thinking about parenting isn’t dollars. It’s hours.
Child credits fall short because they don’t save time. Free child care does, because it's just there. Like a utility. You don't have to worry about it to use it.
Most people don’t skip having kids due to a lack of money (income and fertility are negatively correlated). They’re time-constrained.
Modern parenting means arranging summer camps, coordinating playdates, filling in for a declining education system, ensuring the “right” trajectory for college. Plus cooking, cleaning, laundry, chauffeuring, etc. Add in always-online work culture and two-income households (a net good, by the way)... it’s a lot.
“Time is money,” sure. But it takes time to convert child credits to childcare. Parents are out of time.
Pronatal policy should focus on giving parents time. Remove entire categories of worry. Don't just make having a kid "worth it" with money.
Incidentally, AI’s promise for fertility isn’t extra productivity to fund retirees. It’s giving parents time: handling scheduling, homework, chores, even babysitting. It's making being a working (and sleeping) parent more possible in a 24-hour day.

124

Noah Smith 🐇🇺🇸🇺🇦🇹🇼@Noahpinion·22 Nov

Population aging and decline constitutes a long-term economic threat to our way of life. And unlike the threat of climate change, this is one that we don't yet have any solution for. noahpinion.blog/p/nobody-knows…

167

751

202.1K

Shawn Sullivan@shawntsullivan·23 Oct

@rowancheung This shows 1 of 2 things re: “it’s autocomplete on steroids”:
- There’s more going on in these models than we can see beyond token prediction
- Predicting the next token is sufficient to create HL intelligence (this isn’t HL yet, but looks like a straightforward path to get there)

Rowan Cheung@rowancheung·22 Oct

Anthropic just announced Computer Use. It allows Claude to control your computer screen based on a prompt and take actions on your behalf. The use cases in agentic coding with automated debugging, customer support, and education are going to be INSANE.

190

8.4K

Shawn Sullivan@shawntsullivan·13 Oct

@Noahpinion And Pride Month is a way of reminding Alabama of the same thing

598

Noah Smith 🐇🇺🇸🇺🇦🇹🇼@Noahpinion·13 Oct

A friend said: "Fleet Week is a way of reminding San Francisco that they're still part of the United States."

2.1K

133.5K

Shawn Sullivan@shawntsullivan·5 Sep

Strong agreement: Ed Dept's AI in education report hits the nail on the head. AI isn't just a cost-cutting tool; it's the key to unlocking personalized learning at scale. We can finally tailor education to each child's needs, not just the average student. tech.ed.gov/ai-future-of-t…
