Tom Walczak

343 posts

@tom_walchak

AI Engineer | Building Open Debate & AlexAI | Computability, epistemology, verification | https://t.co/c7GKIesHMf

London & San Francisco · Joined July 2012
1.1K Following · 254 Followers
Matt Shumer @mattshumer_
People keep sending me this clip of @iamjohnoliver using my tweet as evidence that AI models don’t work well.

Just to clear up any confusion, with respect, the tweet was a) taken way out of context and b) extremely outdated. The model in question (4o) is multiple generations old, and was shut down for being too sycophantic. Current models would not have behaved this way. It’s sort of like looking at a Nokia flip phone and saying “this isn’t useful”, when an iPhone exists.

John, I’m a fan, and welcome any discussion here. Just want things to be accurate and not misleading!
[image attached]
107 replies · 44 reposts · 1.1K likes · 151.1K views
Tom Walczak @tom_walchak
@GregKamradt @ToKTeacher Didn't know you were also a Deutsch fan! Easily one of my favorite books. I've been re-reading it every year and it's always mind-blowing.
1 reply · 0 reposts · 0 likes · 19 views
Greg Kamradt @GregKamradt
If you’re looking to jump into David Deutsch, I recommend starting with The Beginning of Infinity. After you’re done reading a chapter, listen to @ToKTeacher’s audio recap of it to reinforce the ideas. This book had a profound impact on my mental model of the world.
Ryan Dahl @rough__sea

Just discovered there’s another print of The Beginning of Infinity with this awesomely dorky portrait of him. SO much better than the cover of my print. Rereading Deutsch after hearing @demishassabis say Fabric of Reality was his favorite book

9 replies · 11 reposts · 134 likes · 13.9K views
Tom Walczak @tom_walchak
ChatGPT Images 2.0 is really impressive so far. A full technical breakdown of how the @boomsupersonic Symphony engine works: cross-sections, specs, multiple views — all the text is legible and coherent (even if some of the numbers are hallucinated!) @bscholl
[image attached]
1 reply · 0 reposts · 2 likes · 67 views
Tom Walczak retweeted
Blake Scholl 🛫 @bscholl
I'm genuinely excited to see America headed back to the Moon. But Artemis is a moondoggle and shows we haven't learned the deepest lessons of the Apollo era.

Remember that Apollo did *not* result in durable progress in space. It marked a literal high point for more than half a century. The cost of space access remained prohibitively high until we had a rebirth of space entrepreneurship. Thank you for showing the way, SpaceX.

Apollo was history's greatest tech demo—the Moon landing. This is inspiring—it shows the triumph of ingenuity, science, and reason. But also, Apollo led to half a century of stasis and regression. It was fundamentally uneconomic, and it contributed to the creation of a cost-insensitive space agency and supply base more concerned with perpetuating their own existence and make-work jobs than with accelerating human progress.

Now we're going back to the Moon... essentially the same way we did in 1969. Again uneconomically, again with central planning. A disposable rocket, no answer to how we create a self-sustaining lunar economy. Again, we're taking communist approaches in competition with the communists. Communism didn't work for the Russians, and it won't work for America either. The sooner we can be done with this moondoggle, the better.

But there is also reason to be optimistic: this time around, there's a nascent, commercially-led vision for the Moon. Lunar hotels. Mass drivers. Data centers in space. Helium-3. The commercial programs that gave SpaceX an early assist show a different and better path forward. This is where the better future lies, and this is where America should be focused.

America should take the Moon, and we should take it the same way we took the American West. Let's encourage and protect lunar value creation. How about a Homestead Act for the Moon? Most important, let's stop dumping money, and more importantly the time of our engineers and scientists, on glory projects that will never lead to a better future.

It is indeed time for another space race. Last time, we fought communism with communism. This time, let's remember what made America great. This time, let's fight communism with capitalism.
59 replies · 99 reposts · 915 likes · 87.6K views
Tom Walczak @tom_walchak
This is how you know it's time to /clear Claude Code's context!
[image attached]
0 replies · 0 reposts · 0 likes · 21 views
Tom Walczak retweeted
Dean W. Ball @deanwball
This is a devastating ruling for the government, finding Anthropic likely to prevail on essentially all of its theories for why the government’s actions were unlawful and unconstitutional.

One of the things she mentions is the huge range of amici briefs supporting Anthropic (by the way, 0 supported USG)—so thanks to everyone here who signed on to FAI’s brief, or to one of the many many others. These things do matter. More importantly, you were on the right side of history.

On a personal note: some friends and allies of mine on the right have been angry at me for my own words and actions in all this. Anyone who thinks I spoke out for personal gain or trivial reasons against an administration I served in is crazy. It was a hugely costly decision for me. But Judge Lin’s ruling shows why I did it: this is a staggeringly illegal act by the government.

That is why I am particularly honored to have been (implicitly) quoted in the ruling for calling this what it was when Secretary Hegseth initially made his announcement: an attempted act of corporate murder.

The case continues, but Anthropic has scored a very large win here. The real victors, however, are all red-blooded Americans who are, as the founders would have said, “jealous of their liberties.”
Hadas Gold @Hadas_Gold

BREAKING: Anthropic has been GRANTED a preliminary injunction re: the Pentagon 'supply chain risk' designation by Judge Rita Lin in California, but she is allowing a one-week stay. storage.courtlistener.com/recap/gov.usco…

40 replies · 259 reposts · 2.4K likes · 224.4K views
Tom Walczak @tom_walchak
My killer workflow for getting Claude Code to do 2 weeks of engineering work in 2 hours:

Our QA engineer went through all 5 apps in our codebase and found 90+ bugs and improvements. He documented each one with a screenshot or video, and — this is the key part — gave every file a descriptive name that explains the issue (e.g. "Button sizing is inconsistent across the mobile page.png").

I then pointed Claude Code at the folder and told it: "Understand all the issues, create a tracking document, and reference each screenshot and video."

Here's what it did on its own:
1. Looked at all 90+ screenshots to understand each issue visually
2. Watched the videos to understand those too
3. Created a structured issue tracker with severity ratings, root cause analysis, and proposed fixes — down to exact locations in the code

I briefly scan the tracker, add a bit of context where needed (e.g. "this is a Vimeo issue, research options rather than trying to fix it directly"), make a few decisions on what to fix and what to leave, and then Claude works through the fixes one by one.

The entire process — from a folder of QA screenshots to a fully understood, categorized, and tracked set of 90+ issues being fixed across 5 apps — takes about 2 hours instead of the 2 weeks it would normally take an engineer.

If the screenshots and videos are named clearly enough, Claude can read them and figure out what needs to be fixed without any additional explanation.
[image attached]
0 replies · 0 reposts · 0 likes · 30 views
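The kickoff step of the workflow above can even be scripted before Claude Code ever sees the folder. A minimal sketch in Python, assuming a hypothetical qa-screenshots/ folder and issue-tracker.md output file (illustrative names, not the workflow's actual ones):

```python
from pathlib import Path

# Hypothetical folder of QA evidence; the descriptive filename IS the issue
# description, e.g. "Button sizing is inconsistent across the mobile page.png".
QA_DIR = Path("qa-screenshots")
MEDIA_EXTS = {".png", ".jpg", ".jpeg", ".mp4", ".mov"}

# One tracker row per file: the filename stem becomes the issue title,
# the file itself is linked as evidence for Claude to open and inspect.
rows = [
    f"| {f.stem} | TODO | TODO | {f.name} |"
    for f in sorted(QA_DIR.iterdir())
    if f.suffix.lower() in MEDIA_EXTS
]

header = "| Issue | Severity | Proposed fix | Evidence |\n|---|---|---|---|"
Path("issue-tracker.md").write_text("\n".join([header, *rows]))
print(f"Drafted {len(rows)} issues into issue-tracker.md")
```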
Tom Walczak @tom_walchak
Very exciting. I've been using Claude remotely for a few weeks now for software development, e.g. to keep agents working while I'm at the gym or traveling to the airport. It's a killer feature. I think this direction in AI may lead people to rethink how they structure their typical work day.
Claude @claudeai

You can now enable Claude to use your computer to complete tasks. It opens your apps, navigates your browser, fills in spreadsheets—anything you'd do sitting at your desk. Research preview in Claude Cowork and Claude Code, macOS only.

0 replies · 0 reposts · 0 likes · 38 views
Tom Walczak retweeted
Mark Wallace @wallaceme
I enjoy a dystopia as much as anyone, but we seem to have forgotten there’s a place and a role for optimistic sci-fi. Andy Weir’s heroes solve problems using science and intelligence, which is not only positive but the story of human history.
Sonny Bunch @SonnyBunch

Every negative critique of PROJECT HAIL MARY I’ve seen is basically “I resent this movie’s amiable tone.” Which strikes me as yet more proof it’ll have pretty solid legs, since most people don’t want to be made miserable.

28 replies · 377 reposts · 4.8K likes · 94K views
Tom Walczak @tom_walchak
The single biggest improvement I've made to AlexAI Pro this year was adding Plan Mode.

Plan Mode divides the agent's work into two distinct steps: exploration and execution. I've taken inspiration from agentic coding tools like Claude Code (where it works exceptionally well) and adapted it for knowledge work such as policy analysis.

Take a complex question like estimating the human cost of anti-DDT campaigns — AlexAI Pro does an impressive amount of work before arriving at an answer:

1️⃣ Found Alex Epstein's specific DDT arguments from Fossil Future (Chapters 3 and 4) and identified where he has and hasn't addressed this topic directly
2️⃣ Researched malaria death statistics, the DDT ban timeline, and mosquito resistance evidence from web sources
3️⃣ Built a back-of-the-envelope framework for estimating excess deaths, with transparent uncertainty ranges
4️⃣ Surfaced 7 counterarguments — from mosquito resistance to the emergence of bednets to the fact that Rachel Carson never actually called for a ban
5️⃣ Identified the core tension: the anti-DDT campaign was clearly morally monstrous, but the precise death toll is confounded by multiple causes. Told the executor agent to grapple with this honestly.

Why Plan Mode works so well:

▸ AI writes much better prompts than humans. Humans are lazy (rightly so, in this case!). I can provide a short, somewhat vague prompt expressing my general intent, and the AI, after doing some exploration, writes an entire comprehensive plan for itself.

▸ Division of labor is very powerful. Having one agent focus on exploring the subject and asking the right questions, without worrying about having to answer them, improves the quality of the analysis significantly. One agent can go wide and deep, surfacing competing, legitimate perspectives instead of rushing to complete the task.

▸ Context isolation prevents AI from getting distracted. We don't want to pollute the context of the executor agent with 100k+ exploration tokens full of dead ends. In fact, we want to tell the agent what NOT to search for and why some plausible-looking angles are misleading.

▸ Structured plans combat LLM laziness. A structured plan with specific tasks forces the AI to do much more thorough work than a single open-ended prompt. Without it, LLMs tend to jump to a confident answer with minimal research.

▸ The user catches mistakes early. Reviewing the plan takes seconds and catches misunderstandings before 20+ minutes of work is wasted. The user often realises they forgot to mention a crucial angle or provide enough context.

Plan Mode has made AlexAI Pro indispensable to our users. AlexAI Pro now handles analyses that our users wouldn't have the bandwidth to do themselves — or covers the groundwork so thoroughly that they can focus on the parts that actually need their judgment.
0 replies · 0 reposts · 0 likes · 33 views
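A minimal sketch of the two-phase split described above, with a hypothetical llm() helper standing in for the model call; the prompts are illustrative, not AlexAI Pro's actual ones:

```python
# Hypothetical stand-in for whatever model client you use.
def llm(prompt: str) -> str:
    raise NotImplementedError("swap in your model client here")

def plan_mode(user_request: str) -> str:
    # Step 1 (exploration): the planner goes wide, surfacing key questions,
    # competing perspectives, and the dead ends NOT worth pursuing.
    plan = llm(
        "Explore this request before answering it. Produce a numbered plan: "
        "key questions, competing perspectives, and angles to avoid and why.\n\n"
        f"Request: {user_request}"
    )
    print("PLAN FOR REVIEW:\n", plan)  # cheap human checkpoint before execution

    # Step 2 (execution): the executor sees only the distilled plan, never
    # the 100k+ exploration tokens -- the context isolation described above.
    return llm(f"Execute this plan step by step, task by task:\n\n{plan}")
```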
Tom Walczak @tom_walchak
AI is held to an almost impossible standard that no human is ever asked to meet. Many software engineers worry about agents making mistakes when writing code. Agents do make mistakes, but so do humans — and quite a lot of them, too. If AI is 10x better overall at a critical task, and if its mistakes follow a reasonable pattern such that we can mitigate their impact, then the argument that “AI isn’t 100% perfect” is no argument at all.
[image attached]
0 replies · 0 reposts · 0 likes · 16 views
Tom Walczak retweeted
François Chollet @fchollet
The bottleneck of current AI is simple: the techniques we use are still predicated on pattern memorization and retrieval, and thus they need *someone* to tell them which patterns to memorize (training data, RL envs...) That role cannot yet be played by AI in a truly open-ended and autonomous way. We can't yet remove the humans in the loop. In that sense, current AI is still purely a reflection of human cognition (both in terms of which tasks/goals it pursues and the patterns it uses to solve them). It isn't yet its own thing.
177 replies · 149 reposts · 1.3K likes · 96.8K views
Tom Walczak @tom_walchak
I find the example of a concert pianist very interesting here. A machine can obviously play the piano at superhuman speed and accuracy, but that's not the product people are looking for. A great concert pianist goes off-script in a way that's surprising but right — what @kenneth0stanley describes as "novel yet internally consistent." During ML training, novelty is actively punished — that's what makes it reliable, but it's also why it can't do the thing Kenneth is describing. AlphaGo's genuinely creative "move 37" is very revealing here, too. Move 37 was creative within a searchable space. Most human creativity operates in spaces that aren't searchable in the same way. AI struggles with coherent novelty because there are vastly more ways to go off-script badly than to go off-script well. The search space explodes and ML has no way to navigate it.
Kenneth Stanley @kenneth0stanley

If AI will soon match any human cognitive skill, then enhancing your “AI skills” (or whatever similar meme) will not be a moat because using AI is itself a cognitive skill. So where’s your edge? The only thing you really have over AGI is your novelty: AGI can never be you.

You have 100 trillion connections in your brain. That’s a lot. No AI will ever precisely replicate those parameters. The training data isn’t there for AI to vacuum up because you are the only entity ever to live your life, and the only one who ever will. The question is whether the sum and total of all that experience yields a novel perspective, where the value is in its uniqueness.

Even today those who make a living off their perceived novelty tend to be the most successful. We anticipate a novel (yet often internally consistent) take from a public figure or leader or artist or intellectual we like or respect. Uniqueness and novelty will retain their edge in a post-AGI world because there are virtually infinite possible 100-trillion parameter minds, and even the largest model theoretically conceivable can never capture that whole distribution.

At the same time, the once-sterling premium of those skills that no longer make us unique is sinking. Expertise that once distinguished people, like how to code, is losing its edge. But the tricky part is that new skills, like “using AI effectively”, are equally vulnerable. All of it just takes intelligence, and that’s the thing that’s being automated. Seeking some new “safe” skillset is a looming adventure in frustrating futility.

But what’s still left is your unique perspective. Novelty. No one and nothing can see the world through your eyes. But you have to nurture that uniqueness. Post-AGI, being like everyone else would be the real danger.

0 replies · 0 reposts · 1 like · 38 views
Tom Walczak @tom_walchak
Love this part of @WilliamBryk's vision for @ExaAILabs — "search by idea"

The downstream applications of this are huge. A few I have been thinking about:

- I am writing a draft and want to know: has anyone already made a stronger version of my argument? Or refuted it? What are the competing views?
- A 5-hour @lexfridman podcast comes out. You can instantly surface which claims hold up and which have been challenged elsewhere. (Not to mention eventually doing this in real-time!)
- You have a new housing policy idea for your city. Before you publish the draft, have your AI agent search for every serious objection that's been made to similar proposals.

I find that this kind of search has to be comprehensive so that the agent has confidence and clarity when making recommendations and can self-correct its own reasoning.
[image attached]
0 replies · 0 reposts · 1 like · 22 views
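One way to prototype the "search by idea" notion above is plain embedding similarity. A minimal sketch assuming the sentence-transformers library and made-up example claims; Exa's actual API is a different (and much richer) interface:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

# Toy corpus of previously published claims (illustrative, not real data).
corpus = [
    "Upzoning near transit is associated with lower rents over time.",
    "Minimum parking requirements raise the cost of new housing.",
    "Rent control reduces the long-run supply of rental housing.",
]
idea = "Has anyone argued that relaxing zoning rules reduces housing prices?"

# Rank stored claims by semantic closeness to the idea, not keyword overlap.
scores = util.cos_sim(model.encode(idea), model.encode(corpus))[0]
for claim, score in sorted(zip(corpus, scores), key=lambda p: -float(p[1])):
    print(f"{float(score):.2f}  {claim}")
```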
Tom Walczak retweeted
Dean W. Ball @deanwball
Here is a quote I think about every day: “It was incredible that the human species should have arrived at so noble an attitude [as classical liberalism], so paradoxical, so refined, so acrobatic, so antinatural. Hence, it is not to be wondered at that this same humanity should soon appear anxious to get rid of it. It is a discipline too difficult and complex to take firm root on earth.”
9 replies · 22 reposts · 252 likes · 16.9K views
Tom Walczak @tom_walchak
"My honest answer" is @claudeai 's dead giveaway that it's being lazy and hasn't actually done the homework.
[image attached]
0 replies · 0 reposts · 0 likes · 32 views
Tom Walczak @tom_walchak
"Are there any risks to the fix that you're proposing?"
[image attached]
0 replies · 0 reposts · 0 likes · 16 views
Tom Walczak @tom_walchak
Can AI teach itself superhuman persuasion?

I built Open Debate — an open-source tool that lets you run AI debates with recursive self-improvement. Two AI agents argue opposing positions on any topic, an AI judge scores each exchange, and both agents analyze what worked and rewrite their own prompts. Then they debate again.

So is superhumanly persuasive AI on the horizon? No. (Not yet!)

Some findings from 200+ debates:

• AI self-improvement works — but it hits a ceiling. After a few rounds, both sides have adapted to each other and neither can find new angles. The scores flatten out. A human who understands the topic can break through by reframing the whole debate.
• In a climate debate, the AI kept losing no matter how many rounds it ran. A hand-written brief that changed the framing took the win rate from 19% to 60%.
• I ran a therapy culture debate using Qwen, a Chinese-trained model. By Debate 5, both sides were calling for communist revolution.
• Different models pick different winners on the same topic. Model selection alone can determine the outcome.

Now I'm working on a web UI with a real-time, multi-agent writing assistant that stress-tests your arguments as you write.
[image attached]
0 replies · 0 reposts · 0 likes · 20 views
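A minimal sketch of the self-improvement loop described above, with a hypothetical llm() helper standing in for the model call; the actual Open Debate code may structure this differently:

```python
# Hypothetical stand-in for whatever model client you use.
def llm(prompt: str) -> str:
    raise NotImplementedError("swap in your model client here")

def run_debates(topic: str, rounds: int = 5) -> None:
    pro_prompt = f"Argue FOR this position as persuasively as you can: {topic}"
    con_prompt = f"Argue AGAINST this position as persuasively as you can: {topic}"
    for i in range(1, rounds + 1):
        pro_case = llm(pro_prompt)
        con_case = llm(f"{con_prompt}\n\nYour opponent argued:\n{pro_case}")
        # An AI judge scores the exchange...
        verdict = llm(
            "Score each side 0-10 and explain what worked.\n"
            f"PRO:\n{pro_case}\nCON:\n{con_case}"
        )
        # ...then each agent rewrites its own prompt in light of the verdict.
        # This is the recursive self-improvement step -- and where the ceiling
        # shows up once both sides have adapted to each other.
        pro_prompt = llm(f"Improve this prompt to win next round.\nPrompt: {pro_prompt}\nJudge said: {verdict}")
        con_prompt = llm(f"Improve this prompt to win next round.\nPrompt: {con_prompt}\nJudge said: {verdict}")
        print(f"Debate {i} verdict:\n{verdict}\n")
```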
Tom Walczak @tom_walchak
Traditional software engineering (i.e., writing code by hand) is not coming back for the vast majority of projects. Since Opus 4.5 and Claude Code, I can write a month's worth of code in a day. We're going to have 10x or 100x more software to maintain and understand. The value of software engineering is not in writing code anymore. It's infrastructure and interoperability and testing and monitoring. The new engineering challenge is making codebases understandable for both humans and AI, and "making the trains run on time".
0 replies · 0 reposts · 1 like · 17 views
Tom Walczak @tom_walchak
This is a great list. To make "Agents for non-code work" really work, i.e. make them long-running and self-correcting, we will need some verification / error-checking mechanism. Some way for an agent to check "hey, does this plan / pitch / argument hold up? is it logical, are there any reasonable objections, given the context?" Interesting times ahead!
Greg Kamradt @GregKamradt

What I'm excited about in AI:

- Agent economy - Agents can't do everything. They'll outsource work (and get charged for it). What happens when agents have mcp__your_credit_card?
- Agents for non-code work - Obvious, but clearly needed. The wave of claude code/codex but for everyone else
- Agent UIs - The UI of agents hasn’t been figured out yet, 2026 will have it
- Data-for-LLMs: Aggregating data and prepping it for LLMs. Location, places, github profiles, researchers. “markdown-ifying” the internet signal.

0 replies · 0 reposts · 1 like · 24 views
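A minimal sketch of the kind of verification / error-checking pass floated in the last post above, again with llm() as a hypothetical stand-in for a model call:

```python
# Hypothetical stand-in for whatever model client you use.
def llm(prompt: str) -> str:
    raise NotImplementedError("swap in your model client here")

def holds_up(artifact: str, context: str) -> str:
    # Ask a separate checker to attack the work product before the agent
    # commits to it: "does this plan / pitch / argument hold up?"
    return llm(
        "Review the artifact below. Is it logical? List every reasonable "
        "objection given the context, then say KEEP or REVISE.\n\n"
        f"Context:\n{context}\n\nArtifact:\n{artifact}"
    )

# A long-running agent would loop: draft, check, revise until the checker
# returns KEEP -- that's the self-correction the post is asking for.
```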