Richard Everts

1.2K posts

Richard Everts
@rich_everts

Lead Cognitive OS Researcher @ Deep SIML Labs https://t.co/wumzvKfpaE | Former CTO Bestie Bot | Author “On Terran Liberty” series | “I make good angels”

Joined July 2023
168 Following · 224 Followers

Pinned Tweet
Richard Everts
Richard Everts@rich_everts·
"Today, we choose to pursue AI superiority and defend Western liberty, not because it is easy, but because it is necessary." This is straight from my award winning 2024 paper for @FedSoc on establishing US AI superiority. 🤖🇺🇸 bit.ly/3AacsrY
1 reply · 0 reposts · 4 likes · 439 views
Richard Everts
Richard Everts@rich_everts·
@jondelarroz The Romulan War book series by Michael Martin retconned Trip’s death as part of a larger plan to win the war, and he ends up alive and living with someone (I won’t ruin the surprise). Trip needs to be retconned back in. I tagged @_MichaelSussman about it too…
0 replies · 0 reposts · 0 likes · 23 views
Jon Del Arroz | Pop Culture & Gaming 🎮
Since people like this so much, I will try every day to come up with a pitch for a story to fix every broken franchise. I'm a pro writer and good at what I do! 🤷‍♀️ Make sure to check out the books! Any requests?
Jon Del Arroz | Pop Culture & Gaming 🎮@jondelarroz

I have a solution for Star Trek that can tie in United and remove the Kurtzman universe entirely:

Years into his presidency, Jonathan Archer and the nascent United Federation of Planets are rocked by escalating temporal anomalies that threaten the Romulan War peace and the young Federation itself. Starfleet traces the disturbances to the still-unresolved Temporal Cold War, and discovers that the shadowy “Future Guy” who once manipulated the Suliban Cabal was none other than Archer himself, projected from a devastated 28th-century future. In that broken timeline (the one containing the Burn, the Federation’s near-collapse, and all the Kurtzman-era cataclysms), a desperate Archer had volunteered to become a non-corporeal agent, trying to steer 22nd-century events toward a stronger Federation. Instead, his well-intentioned meddling fractured the prime timeline, birthing the divergent horrors he was attempting to prevent.

Working with a time-displaced descendant and a preserved message from his own Enterprise crew, President Archer confronts his future self in a temporal nexus aboard the new flagship USS United. He convinces the older version to stand down, allowing the original, unaltered timeline to reassert itself. The Kurtzman-era disasters are retroactively erased, revealed as the “bad future” that no longer exists, restoring continuity and ushering in a stable golden age of exploration.

The series then proceeds from this corrected prime timeline, with Archer’s presidency now free to focus on building the Federation we always wanted to see, setting up ongoing stories of unity, diplomacy, and discovery without the baggage of the last decade’s continuity snarls. What do you think?

29 replies · 7 reposts · 122 likes · 5.1K views
Richard Everts reposted
Lenny Rachitsky
Lenny Rachitsky@lennysan·
My biggest takeaways from @simonw:

1. November 2025 was an inflection point for AI coding. GPT 5.1 and Claude Opus 4.5 crossed a threshold where coding agents went from “mostly works” to “almost always does what you want it to do.” Software engineers who tinkered over the holidays realized the technology had become genuinely reliable.

2. Mid-career engineers are the most vulnerable—not juniors, not seniors. AI amplifies experienced engineers by letting them leverage decades of pattern recognition. It also dramatically helps new engineers onboard. Cloudflare and Shopify each hired a thousand interns because AI cut ramp-up time from a month to a week. But mid-career engineers who haven’t accumulated deep expertise and have already captured the beginner boost are in the most precarious position.

3. AI exhaustion is real and underestimated. Simon runs four coding agents in parallel and is mentally wiped out by 11 a.m. He’s getting more time back, but his brain is exhausted from the intensity of directing multiple autonomous workers. Some engineers are losing sleep to keep agents running. This may just be a novelty issue, but the underlying dynamic—that managing AI amplifies cognitive load even as it reduces labor—is a real tension. Good companies will manage expectations rather than expecting 5x output indefinitely.

4. Code is cheap now. This simple idea has profound implications. The thing that used to take most of the time—writing code—now takes the least. The bottleneck has shifted to everything else: deciding what to build, proving ideas work, getting user feedback. Since prototyping is nearly free, Simon often builds three versions of every feature when he’s getting started.

5. The “dark factory” is the most radical experiment in AI-assisted development happening right now. A company called StrongDM established a policy: nobody writes code, nobody reads code. Instead, they run a swarm of AI-simulated end users 24/7—thousands of fake employees making requests like “give me access to Jira”—at $10,000 a day in token costs. They even had coding agents build simulated versions of Slack, Jira, and Okta from API documentation so they could test without rate limits.

6. "Red/green TDD" is the single highest-leverage agentic engineering pattern. Having coding agents write tests first, watch them fail, then write the implementation, then watch them pass produces materially better results. The five-word prompt “use red/green TDD” encodes this entire workflow because the agents recognize the jargon.

7. “Hoarding things you know how to do” is one of Simon's other favorite agentic engineering patterns. Simon maintains a GitHub repo of 193 small HTML/JavaScript tools and a separate research repo of coding-agent experiments. Each one captures a technique, a proof of concept, or a library he’s tested. When a new problem arrives, he can point Claude Code at past projects and say “combine these two approaches.”

8. The "lethal trifecta" makes AI agent security fundamentally unsolved. Whenever an AI agent has access to private data, exposure to untrusted content (like incoming emails), and the ability to send data externally (like replying to email), you have a lethal trifecta. Prompt injection—where malicious instructions in untrusted text override the agent’s intended behavior—cannot be reliably prevented. Simon has predicted a “Challenger disaster” for AI security every six months for three years. It hasn’t happened yet, but he’s pretty sure it will.

9. Start every project from a thin template, not a long instructions file. Coding agents are phenomenally good at matching existing patterns. A single test file with your preferred indentation and style is more effective than paragraphs of written instructions. Simon starts every project with a template containing one test (literally testing that 1 + 1 = 2) laid out in his preferred style. The agent picks it up and follows the convention across the entire codebase. This is cheaper and more reliable than maintaining elaborate prompt files.

10. The pelican-on-a-bicycle benchmark accidentally became a real AI benchmark. Simon created it as a joke to mock numeric benchmarks—get each LLM to generate an SVG of a pelican riding a bicycle, and compare the drawings. Unexpectedly, there’s a strong correlation between how good the drawing is and how good the model is at everything else. Nobody can explain why. It’s become a meme: Gemini 3.1’s launch video featured a pelican riding a bicycle. The AI labs are aware of it and quietly competing on it.

Don't miss our full conversation: youtube.com/watch?v=wc8FBh…
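The thin-template idea in point 9 is concrete enough to sketch. Below is a minimal, hypothetical seed file in the spirit described (one trivial test laid out in the project's preferred style, assuming pytest-style conventions); the function and file names are illustrative, not taken from the talk.

```python
# test_template.py -- a "thin template" seed file: one trivial test whose
# only job is to demonstrate the project's preferred layout and style for
# a coding agent to imitate. The test's content is deliberately trivial.

def add(a: int, b: int) -> int:
    """Placeholder function so the template file is runnable on its own."""
    return a + b


def test_add() -> None:
    # Literally testing that 1 + 1 = 2; the formatting, not the logic,
    # is the information the agent extracts from this file.
    assert add(1, 1) == 2
```

The point of the pattern is that an agent asked to extend such a project copies the indentation, naming, and test layout it sees here across the whole codebase.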
Lenny Rachitsky@lennysan

"Using coding agents well is taking every inch of my 25 years of experience as a software engineer." Simon Willison (@simonw) is one of the most prolific independent software engineers and most trusted voices on how AI is changing the craft of building software. He co-created Django, coined the term "prompt injection," and popularized the terms "agentic engineering" and "AI slop." In our in-depth conversation, we discuss: 🔸 Why November 2025 was an inflection point 🔸 The "dark factory" pattern 🔸 Why mid-career engineers (not juniors) are the most at risk right now 🔸 Three agentic engineering patterns he uses daily: red/green TDD, thin templates, hoarding 🔸 Why he writes 95% of his code from his phone while walking the dog 🔸 Why he thinks we're headed for an AI Challenger disaster 🔸 How a pelican riding a bicycle became the unofficial benchmark for AI model quality Listen now 👇 youtu.be/wc8FBhQtdsA

88 replies · 138 reposts · 1.1K likes · 324.7K views
Richard Everts
Richard Everts@rich_everts·
@exQUIZitely Absolutely!!! I have this, Centurion, and a few others still!! The games got really long sometimes
0 replies · 0 reposts · 2 likes · 33 views
exQUIZitely 🕹️
exQUIZitely 🕹️@exQUIZitely·
When it comes to turn-based strategy games, you often hear about the usual suspects: Heroes of Might and Magic, Civilization, Colonization, and Alpha Centauri - and yes, they are all great. A sometimes-forgotten early gem is the Warlords series. In this case, Warlords II (1993), which is my personal favorite. It doesn't have the complexity of those four classics, but that didn't make it any less fun. You assemble your armies from 30 different unit types, find artifacts that give combat bonuses, and explore a massive map shrouded in fog of war at the start. You can play against the computer (its AI was surprisingly strong for 1993 and often praised), or in hotseat mode with friends. It even supported play-by-email mode - which I never tried, but I can only imagine the patience required... and how a single game could stretch over several weeks. On a scale from 1 to 10, I would rate it a solid 8. What score would you give it?
37 replies · 5 reposts · 135 likes · 6.1K views
Richard Everts
Richard Everts@rich_everts·
@exQUIZitely It was a fantastic game!! Played on Genesis and PC. One of the few games I still keep on DosBox.
0 replies · 0 reposts · 1 like · 22 views
exQUIZitely 🕹️
exQUIZitely 🕹️@exQUIZitely·
Et tu, Brute? - often translated as "You, as well, Brutus?" Centurion: Defender of Rome (1990) is sometimes described as the spiritual successor to Defender of the Crown. One of the more obvious connections would be the graphic artist James Sachs (legend!), though Centurion didn't quite reach the same "wow" level in terms of graphics as Defender of the Crown did four years prior. You start as an officer in 275 BC with one legion, conquering provinces across Europe and North Africa, ultimately trying to ascend to Caesar. You also manage taxes and host "bread and circuses" like chariot races and gladiator fights (rather simple mini-games). The design and cutscenes seemed to be inspired by Ben-Hur and Spartacus. While this game didn't win any awards or receive more than "just good" reviews, it still holds a pretty special place in my heart. For its time it was pretty good, and the tactical options for each fight - though rather limited - kept you engaged and always trying to improve. Pretty high replay value and just one of those games that always pulled you back in.
12 replies · 5 reposts · 138 likes · 7.2K views
Richard Everts reposted
Guri Singh
Guri Singh@heygurisingh·
Holy shit... Stanford just proved that GPT-5, Gemini, and Claude can't actually see. They removed every image from 6 major vision benchmarks. The models still scored 70-80% accuracy. They were never looking at your photos. Your scans. Your X-rays. Here's what's really going on: ↓

The paper is called MIRAGE. Co-authored by Fei-Fei Li. They tested GPT-5.1, Gemini-3-Pro, Claude Opus 4.5, and Gemini-2.5-Pro across 6 benchmarks -- medical and general. Then silently removed every image. No warning. No prompt change. The models didn't even notice. They kept describing images in detail. Diagnosing conditions. Writing full reasoning traces. From images that were never there.

Stanford calls it the "mirage effect." Not hallucination. Something worse. Hallucination = making up wrong details about a real input. Mirage = constructing an entire fake reality and reasoning from it confidently. The models built imaginary X-rays, described fake nodules, and diagnosed conditions -- all from text patterns alone.

But that's not the scary part. They trained a "super-guesser" -- a tiny 3B-parameter text-only model. Zero vision capability. Fine-tuned it on the largest chest X-ray benchmark (696,000 questions). Images removed. It beat GPT-5. It beat Gemini. It beat Claude. It beat actual radiologists. Ranked #1 on the held-out test set. Without ever seeing a single X-ray. The reasoning traces? Indistinguishable from real visual analysis.

Now here's what should terrify you: when the models fake-see medical images, their mirage diagnoses are heavily biased toward the most dangerous conditions. STEMI. Melanoma. Carcinoma. Life-threatening diagnoses -- from images that don't exist. 230 million people ask health questions on ChatGPT every day.

They also found something wild:
→ Tell a model "there's no image, just guess" -- performance drops
→ Silently remove the image and let it assume it's there -- performance stays high
The model enters "mirage mode." It doesn't know it can't see. And it performs BETTER when it doesn't know it's blind.

When Stanford applied their cleanup method (B-Clean) to existing benchmarks, it removed 74-77% of all questions. Three-quarters of "vision" benchmarks don't test vision. Every leaderboard. Every "multimodal breakthrough." Every benchmark score you've seen this year. Built on mirages.

Code is open-sourced. Paper is live on arXiv. If you're building anything with multimodal AI -- especially in healthcare -- read this paper before you ship. (Link in the comments)
[attached image]
289 replies · 849 reposts · 4.2K likes · 687.6K views
Richard Everts
Richard Everts@rich_everts·
@_MichaelSussman @razorfist So… what could any of us out here do to support this? Just be sure to retcon Trip’s arc back using the fantastic “The Good That Men Do” book series… 😉
0 replies · 0 reposts · 1 like · 101 views
Michael Sussman
Michael Sussman@_MichaelSussman·
@razorfist Sussman here. Appreciate the shout out. FYI some other series I’ve used as comps for UNITED are ‘Homeland,’ ‘Designated Survivor,’ and ‘Jack Ryan.’ 🖖
6 replies · 0 reposts · 71 likes · 1K views
RazörFist
RazörFist@RazorFist·
They made Starfleet Academy instead of this.
[attached image]
251 replies · 403 reposts · 4.4K likes · 113K views
Richard Everts
Richard Everts@rich_everts·
@oldyzach Anybody know where you can get an actual original box with the 4 real 3.5” disks, manuals, etc.? All you ever see on eBay is fake disks and maybe the manuals and tech tree…
0 replies · 0 reposts · 0 likes · 125 views
PeteZach
PeteZach@oldyzach·
Love or... Wait. Just love.
[attached image]
57 replies · 27 reposts · 709 likes · 58K views
kache
kache@yacineMTB·
now that i'm doing actual research instead of just engineering i realize lecun was right and elon was wrong
188 replies · 109 reposts · 4.2K likes · 615.1K views
Mart Retro
Mart Retro@RetroBrothers·
Dead easy. Name the game, machine, year and publisher #retrogames
[attached image]
24 replies · 3 reposts · 107 likes · 8.3K views
Richard Everts reposted
Michael Levin
Michael Levin@drmichaellevin·
Ever wonder what a nervous system would look like if it self-assembled inside a novel being that hadn't faced a history of selection for its organism-level form and function? Or, perhaps you wondered how #Xenobots would look and act, or what their transcriptome would be like, if they had nervous systems? Well, here's the first step: advanced.onlinelibrary.wiley.com/doi/epdf/10.10… "Engineered Living Systems With Self-Organizing Neural Networks: From Anatomy to Behavior and Gene Expression" Our awesome team, led by @halehf: @LaurieONeill99, @mmsperry, @LPiolopez, @DrPatrickE, and Tiffany Lin. The @TuftsUniversity and @wyssinstitute press releases are here, for summaries: now.tufts.edu/2026/03/16/sci… wyss.harvard.edu/news/toward-au…
[attached image]
64 replies · 269 reposts · 1.5K likes · 210.8K views
Richard Everts
Richard Everts@rich_everts·
@exQUIZitely First RPG with a live day/night cycle? I thought I heard that somewhere… Really wish they would have finished the goblin cave…
1 reply · 0 reposts · 4 likes · 436 views
exQUIZitely 🕹️
exQUIZitely 🕹️@exQUIZitely·
Sierra was arguably the #1 studio for adventure games in the 80s and 90s. No other company created as many iconic titles. Best known for King's Quest, Space Quest, and Leisure Suit Larry, Sierra also published the Quest for Glory series, the first of which was "So You Want to Be a Hero" from 1989 (EGA version). Originally released as Hero's Quest, it is a hybrid adventure/RPG set in the cursed valley barony of Spielburg. You play a customizable adventurer - fighter, magic-user, or thief. The stats and RPG elements were new, diverging quite a bit from the more traditional Sierra adventures. The land suffers under a curse from the evil ogress Baba Yaga, who retaliated against Baron von Spielburg after he tried to banish her. You explore the town and wilderness, solve puzzles, train skills, fight (or flee) enemies, complete side quests (helping locals, fetching items), and build up your character's stats.
68 replies · 32 reposts · 629 likes · 36.2K views
Richard Everts reposted
Andrej Karpathy
Andrej Karpathy@karpathy·
Three days ago I left autoresearch tuning nanochat for ~2 days on a depth=12 model. It found ~20 changes that improved the validation loss. I tested these changes yesterday and all of them were additive and transferred to larger (depth=24) models. Stacking up all of these changes, today I measured that the leaderboard's "Time to GPT-2" drops from 2.02 hours to 1.80 hours (~11% improvement); this will be the new leaderboard entry. So yes, these are real improvements and they make an actual difference.

I am mildly surprised that my very first naive attempt already worked this well on top of what I thought was already a fairly manually well-tuned project. This is a first for me because I am very used to doing the iterative optimization of neural network training manually. You come up with ideas, you implement them, you check if they work (better validation loss), you come up with new ideas based on that, you read some papers for inspiration, etc. This has been the bread and butter of what I do daily for 2 decades. Seeing the agent do this entire workflow end-to-end, all by itself, as it worked through approx. 700 changes autonomously is wild. It really looked at the sequence of results of experiments and used that to plan the next ones. It's not novel, ground-breaking "research" (yet), but all the adjustments are "real": I didn't find them manually previously, and they stack up and actually improved nanochat.

Among the bigger findings:
- It noticed an oversight that my parameterless QKnorm didn't have a scalar multiplier attached, so my attention was too diffuse. The agent found multipliers to sharpen it, pointing to future work.
- It found that the Value Embeddings really like regularization and I wasn't applying any (oops).
- It found that my banded attention was too conservative (I forgot to tune it).
- It found that AdamW betas were all messed up.
- It tuned the weight decay schedule.
- It tuned the network initialization.

This is on top of all the tuning I've already done over a good amount of time. The exact commit is here, from this "round 1" of autoresearch. I am going to kick off "round 2", and in parallel I am looking at how multiple agents can collaborate to unlock parallelism. github.com/karpathy/nanoc…

All LLM frontier labs will do this. It's the final boss battle. It's a lot more complex at scale of course -- you don't just have a single train.py file to tune. But doing it is "just engineering" and it's going to work. You spin up a swarm of agents, you have them collaborate to tune smaller models, you promote the most promising ideas to increasingly larger scales, and humans (optionally) contribute on the edges. And more generally, *any* metric you care about that is reasonably efficient to evaluate (or that has more efficient proxy metrics such as training a smaller network) can be autoresearched by an agent swarm. It's worth thinking about whether your problem falls into this bucket too.
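The loop described above (propose a change, run the experiment, keep it only if validation loss improves) reduces to an accept-if-better search. A toy sketch, with a made-up `evaluate()` standing in for what would really be a full training run, and mutation logic that is purely illustrative:

```python
# Toy propose/evaluate/keep loop in the spirit of the workflow above.
# The config keys and the quadratic evaluate() are hypothetical stand-ins;
# a real system would launch training runs and read back validation loss.
import random


def evaluate(config: dict) -> float:
    """Stand-in for a training run returning validation loss (lower is better)."""
    return (config["lr"] - 0.02) ** 2 + (config["wd"] - 0.1) ** 2


def autoresearch(config: dict, steps: int = 200, seed: int = 0) -> dict:
    rng = random.Random(seed)
    best, best_loss = dict(config), evaluate(config)
    for _ in range(steps):
        cand = dict(best)
        key = rng.choice(list(cand))
        cand[key] *= rng.uniform(0.8, 1.25)   # propose a small change
        loss = evaluate(cand)                 # "run the experiment"
        if loss < best_loss:                  # keep only improvements
            best, best_loss = cand, loss
    return best
```

The accept-only-if-better rule makes the sequence of kept configs monotonically improving on the metric, which is why the stacked changes transfer cleanly when the metric itself is well chosen.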
[attached image]
972 replies · 2.1K reposts · 19.4K likes · 3.6M views
9
9@QQSource·
Get accustomed to using these words for the future change: Terra, Terrans. The only species that says “Earth” is us; the rest of the Galaxy knows our true name: the Terrans.
143 replies · 194 reposts · 2.5K likes · 414.5K views