Shawn Sullivan

1.8K posts

@shawntsullivan

CTO @ https://t.co/Ejx8na6sMZ: GenAI for Edu. Benchmarking AI in Edu. Early reading with Reading Critters. Co-founder/ex-CTO @ Phase Genomics: AI+Genomics. MIT CS, Ex-MSFT

Seattle · Joined May 2014
193 Following · 380 Followers
Shawn Sullivan (@shawntsullivan)
MLB is cooked. If your baseball understanding is a spreadsheet, @TheJudge44 is your guy. If you actually watch baseball, you know @Mariners #CalRaleigh turned in a season no catcher has ever come close to. Judge was great. Cal was irreplaceable. "Valuable" is not "cell A1".
Replies: 0 · Reposts: 0 · Likes: 0 · Views: 43
Deedy (@deedydas)
The TRM paper feels like a significant AI breakthrough. It destroys the Pareto frontier on the ARC AGI 1 and 2 benchmarks (and Sudoku and maze solving) with an estimated cost of < $0.01 per task and a cost of < $500 to train the 7M model on 2 H100s for 2 days.
[Training and test specifics] For ARC, it trained on 160 examples from ConceptARC. At test time, it uses the most common answer across 1000 augmentations and embeds a fixed shape of the task in the input.
[Industry implications] Most AI companies today use general-purpose LLMs with prompting for tasks. For specific tasks, smaller models may be not just cheaper but far higher quality! Startups could (and should) train models for < $1000 for specific "fixed length" subtasks (specific PDF extraction, time series forecasting, etc.) and use them as tools for the general model, not only to push performance but also to build some meaningful IP at the task they're trying to automate.
Replies: 59 · Reposts: 206 · Likes: 1.6K · Views: 143K
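The test-time voting described in the post above (taking the most common answer across many augmentations) can be sketched roughly as follows. This is a minimal illustration, not the TRM paper's code: the `model` callable, the use of rotations as the only augmentation, and the names `rotate90`, `rotate`, and `vote_over_augmentations` are all assumptions made for the example.

```python
from collections import Counter
from typing import Callable, Tuple

Grid = Tuple[Tuple[int, ...], ...]


def rotate90(grid: Grid) -> Grid:
    """Rotate a grid 90 degrees clockwise."""
    return tuple(zip(*grid[::-1]))


def rotate(grid: Grid, k: int) -> Grid:
    """Rotate a grid clockwise by k quarter turns (negative k undoes a rotation)."""
    for _ in range(k % 4):
        grid = rotate90(grid)
    return grid


def vote_over_augmentations(model: Callable[[Grid], Grid], grid: Grid) -> Grid:
    """Run a hypothetical model on each rotation of the input, map every
    prediction back to the original orientation, and return the most
    common answer."""
    votes: Counter = Counter()
    for k in range(4):
        prediction = model(rotate(grid, k))
        votes[rotate(prediction, -k)] += 1  # undo the augmentation before voting
    return votes.most_common(1)[0][0]
```

The point of the design is that an error made on one augmented view gets outvoted as long as most views agree. A real setup would presumably use far more augmentations than the four rotations shown here.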
Shawn Sullivan (@shawntsullivan)
@rryssf Applying evolutionary forces to model refinement is a good idea. This work also has some parallels to how language and an internal monologue (or the equivalent for people without one) could have translated into evolutionary advantages.
Replies: 0 · Reposts: 0 · Likes: 0 · Views: 1.9K
Robert Youssef (@rryssf)
RIP fine-tuning ☠️ This new Stanford paper just killed it.
It’s called 'Agentic Context Engineering (ACE)' and it proves you can make models smarter without touching a single weight.
Instead of retraining, ACE evolves the context itself. The model writes, reflects, and edits its own prompt over and over until it becomes a self-improving system.
Think of it like the model keeping a growing notebook of what works. Each failure becomes a strategy. Each success becomes a rule.
The results are absurd:
+10.6% better than GPT-4-powered agents on AppWorld.
+8.6% on finance reasoning.
86.9% lower cost and latency.
No labels. Just feedback.
Everyone’s been obsessed with “short, clean” prompts. ACE flips that. It builds long, detailed evolving playbooks that never forget.
And it works because LLMs don’t want simplicity, they want context density.
If this scales, the next generation of AI won’t be “fine-tuned.” It’ll be self-tuned.
We’re entering the era of living prompts.
Replies: 236 · Reposts: 1.2K · Likes: 7.8K · Views: 715.1K
Shawn Sullivan (@shawntsullivan)
The probability of the LLM making a bad decision on each action it takes is non-zero. Assuming programmers of equal skill, the difference comes down to one or both of these:
A) Spending time to provide the LLM with context, rules, and guidelines before using it
B) Carefully checking the LLM’s output to correct any bad decisions before they snowball
A and B are also non-binary, so one can do a better or worse job on each and get different results.
Replies: 0 · Reposts: 0 · Likes: 0 · Views: 62
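The snowballing argument in the post above can be made concrete with a small calculation. This is a sketch under assumptions the post does not state: each action is treated as independent with the same per-action error probability, and review is modeled as catching a fixed fraction of bad decisions. The function names and parameters are illustrative only.

```python
def p_clean_run(p_bad: float, n_actions: int) -> float:
    """Probability that none of n_actions independent actions is a bad decision."""
    if not 0.0 <= p_bad <= 1.0:
        raise ValueError("p_bad must be between 0 and 1")
    if n_actions < 0:
        raise ValueError("n_actions must be non-negative")
    return (1.0 - p_bad) ** n_actions


def p_clean_run_with_review(p_bad: float, catch_rate: float, n_actions: int) -> float:
    """Same quantity when a reviewer catches and corrects a fraction
    catch_rate of bad decisions before they propagate (option B)."""
    if not 0.0 <= catch_rate <= 1.0:
        raise ValueError("catch_rate must be between 0 and 1")
    return p_clean_run(p_bad * (1.0 - catch_rate), n_actions)
```

Under these assumptions, a 1% per-action error rate leaves roughly a 37% chance of a clean run over 100 actions. Better context (lowering `p_bad`, option A) and better checking (raising `catch_rate`, option B) both raise that figure, which is one way to read the spread in reported results.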
Eliezer Yudkowsky ⏹️ (@ESYudkowsky)
We are underasking the question of why different skilled programmers report fantastically different results from "vibecoding". (Anyone yelling "skill issue" needs to learn what is not an advance-predicting explanation.)
Replies: 61 · Reposts: 3 · Likes: 282 · Views: 62.7K
Shawn Sullivan (@shawntsullivan)
@athyuttamre Looks great! previous_response_id is killer. It always felt weird having to send so many strings back and forth. Same for hosted tools. It always felt weird that the model would need me to call basic tools for it. Excited to try it out!
Replies: 0 · Reposts: 0 · Likes: 2 · Views: 1.4K
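The `previous_response_id` chaining mentioned above might look roughly like this with the OpenAI Python SDK. This is a hedged sketch: the helper name `ask_with_followup` and its parameters are assumptions for illustration, and `client` is expected to be an `openai.OpenAI()` instance passed in by the caller.

```python
def ask_with_followup(client, model: str, first_prompt: str, followup_prompt: str) -> str:
    """Send two turns through the Responses API, linking the second to the
    first by ID instead of resending the earlier messages."""
    first = client.responses.create(model=model, input=first_prompt)
    second = client.responses.create(
        model=model,
        input=followup_prompt,
        previous_response_id=first.id,  # server-side link to the earlier turn
    )
    return second.output_text
```

With a real client this could be called as `ask_with_followup(OpenAI(), "gpt-4o-mini", "Tell me a joke.", "Explain why it is funny.")`, where the model name and prompts are placeholders only.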
Atty Eleti (@athyuttamre)
Introducing the Responses API: the new primitive of the OpenAI API. It is the culmination of 2 years of learnings designing the OpenAI API, and the foundation of our next chapter of building agents. 🧵Here’s the story of how we designed it:
Replies: 116 · Reposts: 357 · Likes: 2.8K · Views: 2M
Shawn Sullivan (@shawntsullivan)
@tunguz If only some clever sci-fi author like Orwell or Stephenson or Asimov or Dick or Gibson or Bradbury or Vonnegut or Huxley or Vinge or Heinlein had warned us
Replies: 0 · Reposts: 0 · Likes: 23 · Views: 685
Bojan Tunguz (@tunguz)
It's so surreal to be living in the world where technologically we are about to enter into the SciFi future that we have been reading about all of our lives, while politically we are on the fast regressive backtrack towards the naked power politics of the nineteenth and earlier centuries. Like we are in some kind of upside-down nightmarish cyberpunk horror story.
Replies: 41 · Reposts: 38 · Likes: 595 · Views: 31.1K
Shawn Sullivan (@shawntsullivan)
It seems unlikely we'll see true automation of software engineering until AGI. It also seems unlikely that AGI doesn't result in automating software engineering. So:
P(engineering_automated | not_AGI_yet) ~= 0
P(engineering_automated | AGI) ~= 1
So is your assessment that P(AGI) in your lifetime is 25%?
In the meantime, what's most likely is dramatically higher engineering productivity for the developers who embrace AI tools. Which is what we're seeing.
Replies: 0 · Reposts: 0 · Likes: 2 · Views: 1.3K
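The step from the two conditional probabilities above to an implied P(AGI) follows from the law of total probability. A minimal sketch, with the function and parameter names being illustrative assumptions:

```python
def implied_p_agi(
    p_automated: float,
    p_auto_given_agi: float = 1.0,
    p_auto_given_no_agi: float = 0.0,
) -> float:
    """Solve P(E) = P(E|AGI) * P(AGI) + P(E|no AGI) * (1 - P(AGI)) for P(AGI),
    where E is "software engineering gets automated"."""
    spread = p_auto_given_agi - p_auto_given_no_agi
    if spread == 0:
        raise ValueError("the conditionals must differ to identify P(AGI)")
    p_agi = (p_automated - p_auto_given_no_agi) / spread
    if not 0.0 <= p_agi <= 1.0:
        raise ValueError("inputs are inconsistent with a valid probability")
    return p_agi
```

With the post's limiting values (conditionals of 1 and 0), a 25% estimate for automation maps directly to a 25% estimate for AGI. Softer conditionals shift the implied figure, which is why the question is worth asking.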
Santiago (@svpino)
Yesterday, I said I was changing my estimate of seeing AI automate Software Engineering *in my lifetime* from 10% to around 25%. Today, I said it's bullshit to think this automation will happen in 3 - 6 months. And now people are melting down in my comments because they can't understand how these two statements are compatible.
Replies: 23 · Reposts: 4 · Likes: 181 · Views: 45.2K
Shawn Sullivan (@shawntsullivan)
Tbh I was surprised to see these results, given most LLMs did reasonably well creating multiple-choice questions. Something about creating a quiz really seems to throw them for a loop... digging into why soon.
Quoting EduBench (@AIEduBench):
Now live on edubench.com: Quiz Composition benchmarks! Unlike MCQ Generation, frontier models struggle with most aspects of creating good quizzes. More details soon; for now, head over to edubench.com and see for yourself. What stands out to you?
Replies: 0 · Reposts: 0 · Likes: 0 · Views: 93
Shawn Sullivan (@shawntsullivan)
AI-generated MCQs can look good but still fail students. A key issue? Bad distractors: wrong answers that are too obvious to provide effective assessment. Measuring distractor quality is a critical tool for using AI effectively in the classroom.
Quoting EduBench (@AIEduBench):
We're benchmarking AI for education. What does that mean? Let's dive into an example. Suppose you're an AP World History teacher who wants to use an LLM to create a quiz question for your students... 🧵 1/7
Replies: 0 · Reposts: 0 · Likes: 0 · Views: 70
Shawn Sullivan (@shawntsullivan)
One takeaway from our recent comparison of Claude 3.5 and 3.7 for edu: Both Claude versions fail because they're optimized for capability demos, not educational outcomes. Simply being able to generate questions isn't enough. We need models that create content that actually helps students learn. #EdTech
Quoting EduBench (@AIEduBench):
We analyzed Claude 3.5 vs 3.7 for creating educational content. The results are surprising... edubench.com/edublog/claude… 1/
Replies: 0 · Reposts: 0 · Likes: 1 · Views: 100
Jacob Klug (@Jacobsklug)
After generating $250K (last 2 months) I built a playbook for @lovable apps—and I’m giving it away. In just two months, we cracked the code to building apps with AI. I’ve distilled everything we learned into this single document. Comment "Build" and drop a follow. I’ll DM it to you. P.S. This will likely blow up, so give me some time to reply.
Replies: 6.3K · Reposts: 194 · Likes: 3.2K · Views: 743.7K
Shawn Sullivan (@shawntsullivan)
Excited to announce a new project! In EduBench, we aim to comprehensively benchmark AI in educational settings. We're starting with multiple-choice questions. Much more to come, including our own efforts to create best-in-class LLMs for edu applications. Follow for more!
Quoting EduBench (@AIEduBench):
🚨 AI in education is at a crossroads. Too many AI tools look helpful but actually mislead students, fail to align with curricula, and give a false sense of competence. We’re launching EduBench to demand real educational impact from AI. 🧵👇
Replies: 0 · Reposts: 0 · Likes: 0 · Views: 75
Shawn Sullivan reposted
Martin Fowler (@martinfowler)
NEW POST My colleague Bharani Subramaniam has started to write patterns from our recent work building production Gen AI applications. We begin with Evals - ways of assessing if they are working effectively. martinfowler.com/articles/gen-a…
Replies: 2 · Reposts: 37 · Likes: 177 · Views: 19.2K
Shawn Sullivan (@shawntsullivan)
Father of 2 here (and I couldn't imagine having more).
The right unit for thinking about parenting isn’t dollars. It’s hours.
Child credits fall short because they don’t save time. Free child care does, because it's just there. Like a utility. You don't have to worry about it to use it.
Most people don’t skip having kids due to a lack of money (income and fertility are negatively correlated). They’re time-constrained.
Modern parenting means arranging summer camps, coordinating playdates, filling in for a declining education system, ensuring the “right” trajectory for college. Plus cooking, cleaning, laundry, chauffeuring, etc. Add in always-online work culture and two-income households (a net good, by the way)... it’s a lot.
“Time is money,” sure. But it takes time to convert child credits to childcare. Parents are out of time.
Pronatal policy should focus on giving parents time. Remove entire categories of worry. Don't just make having a kid "worth it" with money.
Incidentally, AI’s promise for fertility isn’t extra productivity to fund retirees. It’s giving parents time: handling scheduling, homework, chores, even babysitting. It's making being a working (and sleeping) parent more possible in a 24-hour day.
Replies: 0 · Reposts: 0 · Likes: 4 · Views: 124
Shawn Sullivan (@shawntsullivan)
@rowancheung This shows 1 of 2 things re: “it’s autocomplete on steroids”:
- There’s more going on in these models than we can see beyond token prediction
- Predicting the next token is sufficient to create human-level (HL) intelligence (this isn’t HL yet, but looks like a straightforward path to get there)
Replies: 0 · Reposts: 0 · Likes: 0 · Views: 35
Rowan Cheung (@rowancheung)
Anthropic just announced Computer Use. It allows Claude to control your computer screen based on a prompt and take actions on your behalf. The use cases in agentic coding with automated debugging, customer support, and education are going to be INSANE.
Replies: 190 · Reposts: 1K · Likes: 8.4K · Views: 1M
Shawn Sullivan (@shawntsullivan)
Strong agreement: Ed Dept's AI in education report hits the nail on the head. AI isn't just a cost-cutting tool; it's the key to unlocking personalized learning at scale. We can finally tailor education to each child's needs, not just the average student. tech.ed.gov/ai-future-of-t…
Replies: 0 · Reposts: 0 · Likes: 0 · Views: 69