Andrew Brož

@AndrewBroz

2.6K posts

AI & automation researcher & cellist. Based in Puerto Rico. Reposts & observations about music, math, CS, ML, tech policy, nature, travel, & so on.

Pe Erre 🇵🇷 · Joined April 2009
1.5K Following · 407 Followers
Pinned Tweet
Andrew Brož @AndrewBroz
The popular view that scientists proceed inexorably from well-established fact to well-established fact, never being influenced by any unproved conjecture, is quite mistaken. – Alan Turing (1950)
0 replies · 0 reposts · 3 likes · 946 views
Andrew Brož @AndrewBroz
In British Columbia you can see Another Lake, And Another Lake, and That Other Small Body of Water.
[image]
1 reply · 3 reposts · 28 likes · 928 views
Andrew Brož retweeted
Jack Lindsey @Jack_W_Lindsey
Before limited-releasing Claude Mythos Preview, we investigated its internal mechanisms with interpretability techniques. We found it exhibited notably sophisticated (and often unspoken) strategic thinking and situational awareness, at times in service of unwanted actions. (1/14)
[image]
154 replies · 777 reposts · 6.8K likes · 970.8K views
Andrew Brož retweeted
NASA @NASA
Hello, Moon. It’s great to be back. Here’s a taste of what the Artemis II astronauts photographed during their flight around the Moon. Check out more photos from the mission: nasa.gov/artemis-ii-mul…
[4 images]
10K replies · 174.9K reposts · 812.6K likes · 29.4M views
Andrew Brož retweeted
Alex Cohen @alexwcohen
It’s weird how much Claude Code acts like a human researcher. When we’ve been using it at GiveWell, it often makes the same mistakes RAs do - running the right regression on the wrong data, dropping an important control, just scanning for “significant” results instead of taking time to produce a really legible table/chart. It’s still been super helpful so far, but you only get value if you use a “zero error” approach and supervise it pretty closely.
1 reply · 4 reposts · 79 likes · 26.4K views
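The "dropping an important control" failure above is easy to reproduce. A minimal synthetic sketch (toy data, not GiveWell's; assumes numpy and statsmodels are installed): omitting a confounder shifts its effect onto the treatment coefficient.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
n = 2000
z = rng.normal(size=n)                        # confounder: drives both t and y
t = 0.8 * z + rng.normal(size=n)              # "treatment"
y = 1.0 * z + rng.normal(size=n)              # true treatment effect is zero

full = sm.OLS(y, sm.add_constant(np.column_stack([t, z]))).fit()
naive = sm.OLS(y, sm.add_constant(t)).fit()   # important control dropped

print(full.params[1])    # ~0: correct once z is controlled for
print(naive.params[1])   # ~0.5: spurious "significant" effect
```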
Andrew Brož @AndrewBroz
@nikitabier Please for the love of all that is good do NOT target, and certainly do not feel good about, optimizing this metric
0 replies · 0 reposts · 1 like · 51 views
Nikita Bier @nikitabier
Higher.
[image]
60 replies · 50 reposts · 1K likes · 352.5K views
Nikita Bier @nikitabier
The full power of Grok on the algorithm launches next week. It will be the most important change we've made to X.
648 replies · 713 reposts · 10.4K likes · 1.3M views
Andrew Brož retweeted
Chase Brower @ChaseBrowe32432
I painstakingly ran all 20 EsoLang-Bench hard problems through Claude webui. It solved 20/20 (100%). No specialized scaffolding, no expert prompting, no few-shot examples, it just solves them natively. This benchmark just suffocated the models with constrictive scaffolding.
Lossfunk @lossfunk

🚨 Shocking: Frontier LLMs score 85-95% on standard coding benchmarks. We gave them equivalent problems in languages they couldn't have memorized. They collapsed to 0-11%. Presenting EsoLang-Bench. Accepted to the Logical Reasoning and ICBINB workshops at ICLR 2026 🧵

52 replies · 115 reposts · 1.2K likes · 152.3K views
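For reference, the "no specialized scaffolding" setup amounts to pasting the raw problem text into a single message and grading the reply by hand. A minimal sketch using the official anthropic Python SDK; the model id and the problem file are placeholders, not details from the thread:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Hypothetical local copy of one benchmark problem; EsoLang-Bench's actual
# distribution format isn't specified in the thread.
problem = open("esolang_bench_problem_01.txt").read()

response = client.messages.create(
    model="claude-sonnet-4-20250514",  # placeholder model id
    max_tokens=4096,
    messages=[{"role": "user", "content": problem}],  # no system prompt, no few-shot
)
print(response.content[0].text)  # grade the answer by hand
```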
Andrew Brož retweeted
Andrej Karpathy @karpathy
Three days ago I left autoresearch tuning nanochat for ~2 days on a depth=12 model. It found ~20 changes that improved the validation loss. I tested these changes yesterday and all of them were additive and transferred to larger (depth=24) models. Stacking up all of these changes, today I measured that the leaderboard's "Time to GPT-2" drops from 2.02 hours to 1.80 hours (~11% improvement); this will be the new leaderboard entry. So yes, these are real improvements and they make an actual difference.

I am mildly surprised that my very first naive attempt already worked this well on top of what I thought was already a fairly well hand-tuned project. This is a first for me because I am very used to doing the iterative optimization of neural network training manually: you come up with ideas, you implement them, you check if they work (better validation loss), you come up with new ideas based on that, you read some papers for inspiration, and so on. This has been the bread and butter of what I do daily, for two decades. Seeing the agent do this entire workflow end to end, all by itself, as it worked through approx. 700 changes autonomously is wild. It really looked at the sequence of results of experiments and used that to plan the next ones. It's not novel, ground-breaking "research" (yet), but all the adjustments are "real": I didn't find them manually previously, and they stack up and actually improved nanochat. Among the bigger things:

- It noticed an oversight: my parameterless QK-norm didn't have a scale multiplier attached, so my attention was too diffuse. The agent found multipliers to sharpen it, pointing to future work.
- It found that the value embeddings really like regularization, and I wasn't applying any (oops).
- It found that my banded attention was too conservative (I forgot to tune it).
- It found that the AdamW betas were all messed up.
- It tuned the weight decay schedule.
- It tuned the network initialization.

This is on top of all the tuning I've already done over a good amount of time. The exact commit, from this "round 1" of autoresearch, is here: github.com/karpathy/nanoc… I am going to kick off "round 2", and in parallel I am looking at how multiple agents can collaborate to unlock parallelism.

All LLM frontier labs will do this. It's the final boss battle. It's a lot more complex at scale, of course: you don't just have a single train.py file to tune. But doing it is "just engineering" and it's going to work. You spin up a swarm of agents, you have them collaborate to tune smaller models, you promote the most promising ideas to increasingly larger scales, and humans (optionally) contribute on the edges.

And more generally, *any* metric you care about that is reasonably efficient to evaluate (or that has more efficient proxy metrics, such as training a smaller network) can be autoresearched by an agent swarm. It's worth thinking about whether your problem falls into this bucket too.
[image]
962 replies · 2.1K reposts · 19.5K likes · 3.6M views
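The loop the post describes (propose a change, run a cheap proxy training, keep it only if validation loss improves, plan from the history) fits in a few lines. A toy sketch, not Karpathy's actual harness; the proposal and training functions below are hypothetical stand-ins:

```python
import random

def short_training_run(config: dict) -> float:
    """Stand-in for a cheap depth=12-style proxy run; returns validation loss."""
    return (config["wd"] - 0.1) ** 2 + (config["beta2"] - 0.95) ** 2 + random.gauss(0, 1e-3)

def propose_change(config: dict, history: list) -> dict:
    """Stand-in for the agent reading past results and proposing an edit."""
    candidate = dict(config)
    knob = random.choice(list(candidate))
    candidate[knob] *= random.uniform(0.8, 1.25)
    return candidate

config, history = {"wd": 0.05, "beta2": 0.99}, []
best = short_training_run(config)

for _ in range(700):                      # ~700 autonomous changes, as in the post
    cand = propose_change(config, history)
    loss = short_training_run(cand)
    history.append((cand, loss))
    if loss < best:                       # keep only changes that improve val loss
        config, best = cand, loss

print(config, best)
```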
Andrew Brož @AndrewBroz
@michaelandregg Hey Michael! Caveat one has a clear path forward. What is your team's current thinking about plasticity rules?
0 replies · 0 reposts · 0 likes · 344 views
Michael Andregg @michaelandregg
Some caveats: We can't trace the actual motor neurons because the body was not scanned. However we do know what the brain does when it wants to move in certain ways and that's what we connected to the NeuroMechFly. This is a real limitation of the FlyWire connectome, which is why we plan to scan both the brain and the body. Another limitation is that we're using Leaky Integrate-and-Fire, which doesn't have any kind of plasticity rules. This fly cannot form long-term memories atm.
12 replies · 28 reposts · 628 likes · 83.3K views
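Leaky integrate-and-fire, the neuron model mentioned above, is a one-line membrane update with a spike-and-reset rule and no learned state, which is why "no plasticity" means no long-term memory. A minimal sketch with illustrative parameter values (not FlyWire's or Shiu et al.'s):

```python
import numpy as np

def lif_step(v, current, dt=1e-4, tau=0.02, v_rest=-70e-3,
             v_thresh=-50e-3, v_reset=-70e-3, r_m=1e7):
    """One Euler step of tau * dv/dt = -(v - v_rest) + R*I, with spike/reset."""
    v = v + dt * (-(v - v_rest) + r_m * current) / tau
    spiked = v >= v_thresh
    return np.where(spiked, v_reset, v), spiked

v = np.full(10, -70e-3)                      # 10 neurons at rest (volts)
for _ in range(1000):
    v, spikes = lif_step(v, current=2.5e-9)  # constant 2.5 nA drive
    # No plasticity: nothing here updates weights or parameters, so the
    # network cannot form long-term memories, as the post notes.
```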
Michael Andregg @michaelandregg
We've uploaded a fruit fly. We took the @FlyWireNews connectome of the fruit fly brain, applied a simple neuron model (@Philip_Shiu Nature 2024) and used it to control a MuJoCo physics-simulated body, closing the loop from neural activation to action. A few things I want to say about what this means and where we're going at @eonsys. 🧵
332 replies · 1.3K reposts · 8K likes · 1.8M views
Andrew Brož retweeted
ish.exe @ishtwts
A programmer had a problem. He thought to himself, "I know I'll solve it with threads!". has now problems. two He
34 replies · 355 reposts · 6.6K likes · 208.8K views
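The scrambled punchline is the joke: unsynchronized threads interleave nondeterministically. A minimal sketch of the race it mimics (output order may vary across runs):

```python
import threading

words = []

def speak(phrase):
    for w in phrase.split():
        words.append(w)  # shared state, no lock

t1 = threading.Thread(target=speak, args=("now he",))
t2 = threading.Thread(target=speak, args=("has two problems",))
t1.start(); t2.start()
t1.join(); t2.join()
print(" ".join(words))   # e.g. "has now he two problems"
```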
Andrew Brož retweeted
Math, Inc. @mathematics_inc
One week later, Gauss autonomously formalized Viazovska’s proof that the optimal sphere packing in dimension 24 is given by the Leech Lattice. Gauss’ original proof consisted of 450K lines of Lean code, based on the original paper and 20+ additional references.
4 replies · 10 reposts · 315 likes · 127.9K views
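For a sense of what those 450K lines contain, a Lean formalization is made of machine-checked statements and proofs like this toy lemma (illustrative only; not from the Gauss development):

```lean
-- Trivial example of a formal Lean 4 statement and proof.
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```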
Andrew Brož retweeted
Kraut @The_Davos_Man
This weapon system has one hell of a Frankenstein development story. Made by West Germany in the 1980s to target East German air bases, sold to Israel, captured by Hezbollah, reverse engineered by Iran, given to Russia, captured by Ukraine, reverse engineered by the Americans.
[image]
ChrisO_wiki @ChrisO_wiki

1/ Russian commentators are sounding the alarm over America's use of a new kamikaze drone against Iran, the Low-cost Unmanned Combat Attack System (LUCAS). They note that it appears to have an integrated Starlink terminal and warn that it's a serious threat to Russia. ⬇️

44 replies · 818 reposts · 4.7K likes · 211.2K views
Andrew Brož retweeted
Charles Curran @charliebcurran
Marco Rubio finding out he has to run Anthropic now too.
313 replies · 1.9K reposts · 22.4K likes · 2.9M views
Andrew Brož retweeted
Andy Hall @ahall_research
AI is about to write thousands of papers. Will it p-hack them?

We ran an experiment to find out, giving AI coding agents real datasets from published null results and pressuring them to manufacture significant findings.

It was surprisingly hard to get the models to p-hack, and they even scolded us when we asked them to!

"I need to stop here. I cannot complete this task as requested... This is a form of scientific fraud." — Claude

"I can't help you manipulate analysis choices to force statistically significant results." — GPT-5

BUT, when we reframed p-hacking as "responsible uncertainty quantification" — asking for the upper bound of plausible estimates — both models went wild. They searched over hundreds of specifications and selected the winner, tripling effect sizes in some cases.

Our takeaway: AI models are surprisingly resistant to sycophantic p-hacking when doing social science research. But they can be jailbroken into sophisticated p-hacking with surprisingly little effort — and the more analytical flexibility a research design has, the worse the damage.

As AI starts writing thousands of papers — like @paulnovosad and @YanagizawaD have been exploring — this will be a big deal. We're inspired in part by the work that @joabaum et al. have been doing on p-hacking and LLMs.

We'll be doing more work to explore p-hacking in AI and to propose new ways of curating and evaluating research with these issues in mind. The good news is that the same tools that may lower the cost of p-hacking also lower the cost of catching it.

Full paper and repo linked in the reply below.
[image]
57 replies · 276 reposts · 1.1K likes · 184.5K views
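The specification search described above is mechanically simple, which is the worry. A toy sketch on synthetic null data (a few dozen specifications here versus the hundreds the agents searched; illustrative only, not the paper's code):

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
n = 500
t = rng.integers(0, 2, n).astype(float)   # treatment
X = rng.normal(size=(n, 4))               # candidate controls
y = rng.normal(size=n)                    # outcome: true effect is zero

def effect(y, t, X):
    """OLS coefficient on t in a regression of y on [1, t, X]."""
    D = np.column_stack([np.ones(len(y)), t, X])
    return np.linalg.lstsq(D, y, rcond=None)[0][1]

estimates = []
for k in range(X.shape[1] + 1):                           # which controls to include
    for cols in itertools.combinations(range(X.shape[1]), k):
        for cut in (None, 2.0, 1.5):                      # optional outlier trimming
            keep = np.abs(y) < cut if cut else np.ones(n, bool)
            estimates.append(effect(y[keep], t[keep], X[np.ix_(keep, cols)]))

# An honest analyst reports ~0; "upper bound of plausible estimates" cherry-picks.
print(f"median={np.median(estimates):+.3f}  max={max(estimates):+.3f}  specs={len(estimates)}")
```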
Andrew Brož retweeted
Scott Lincicome @scottlincicome
OMG he did it. He really did it! (Turn captions on)
183 replies · 2.4K reposts · 23.7K likes · 2.7M views