Moisés Arizpe

3.6K posts

Moisés Arizpe

Moisés Arizpe

@_numerico

Mexico Inscrit le Ekim 2014
1.2K Abonnements1.3K Abonnés
Tweet épinglé
Moisés Arizpe
Moisés Arizpe@_numerico·
El día de hoy @svrvhi y yo publicamos en Taller de datos de @nexosmexico un análisis de las causas de muerte y lugares de muerte registradas en las actas de defunción de la CDMX durante la pandemia. 👇🧵
Español
7
88
201
0
Andrej Karpathy
Andrej Karpathy@karpathy·
@LinghuaJ Interesting.. Chain of thought is a reduce (in addition to attention ofc), so I guess this can be seen as a bit more of a directed context compaction mechanism, inheriting structure from the preexisting idea of a wiki.
English
16
14
479
56.3K
Linghua Jin 🥥 🌴
LLM knowledge Base idea open sourced by @karpathy . RAG only has map and no reduce. The essence is to have LLM 𝐢𝐧𝐜𝐫𝐞𝐦𝐞𝐧𝐭𝐚𝐥𝐥𝐲 𝐛𝐮𝐢𝐥𝐝𝐬 𝐚𝐧𝐝 𝐦𝐚𝐢𝐧𝐭𝐚𝐢𝐧𝐬 𝐚 𝐩𝐞𝐫𝐬𝐢𝐬𝐭𝐞𝐧𝐭 𝐰𝐢𝐤𝐢 - So basically at indexing time there must be reduce so you can dig accumulated / compounded synthesized knowledge beyond individual fact. Different data source has different fact. A lot of discovery are like this you derive general fact from single fact. Physics theory you have a lot of independent data and derive more generic theorem. A lot of things coming up all together. Having a solid incremental engine to drive the process is everything we have been building cocoindex for, and keep indexing up to date and organizaed from source of truth. Very much looking forward to what's next!
Andrej Karpathy@karpathy

Wow, this tweet went very viral! I wanted share a possibly slightly improved version of the tweet in an "idea file". The idea of the idea file is that in this era of LLM agents, there is less of a point/need of sharing the specific code/app, you just share the idea, then the other person's agent customizes & builds it for your specific needs. So here's the idea in a gist format: gist.github.com/karpathy/442a6… You can give this to your agent and it can build you your own LLM wiki and guide you on how to use it etc. It's intentionally kept a little bit abstract/vague because there are so many directions to take this in. And ofc, people can adjust the idea or contribute their own in the Discussion which is cool.

English
27
19
416
112.6K
Moisés Arizpe
Moisés Arizpe@_numerico·
Cambio mis dos boletos de The XX para el viernes por dos boletos para el domingo. @The_xx #thexx
Español
2
0
0
263
Andrej Karpathy
Andrej Karpathy@karpathy·
I had the same thought so I've been playing with it in nanochat. E.g. here's 8 agents (4 claude, 4 codex), with 1 GPU each running nanochat experiments (trying to delete logit softcap without regression). The TLDR is that it doesn't work and it's a mess... but it's still very pretty to look at :) I tried a few setups: 8 independent solo researchers, 1 chief scientist giving work to 8 junior researchers, etc. Each research program is a git branch, each scientist forks it into a feature branch, git worktrees for isolation, simple files for comms, skip Docker/VMs for simplicity atm (I find that instructions are enough to prevent interference). Research org runs in tmux window grids of interactive sessions (like Teams) so that it's pretty to look at, see their individual work, and "take over" if needed, i.e. no -p. But ok the reason it doesn't work so far is that the agents' ideas are just pretty bad out of the box, even at highest intelligence. They don't think carefully though experiment design, they run a bit non-sensical variations, they don't create strong baselines and ablate things properly, they don't carefully control for runtime or flops. (just as an example, an agent yesterday "discovered" that increasing the hidden size of the network improves the validation loss, which is a totally spurious result given that a bigger network will have a lower validation loss in the infinite data regime, but then it also trains for a lot longer, it's not clear why I had to come in to point that out). They are very good at implementing any given well-scoped and described idea but they don't creatively generate them. But the goal is that you are now programming an organization (e.g. a "research org") and its individual agents, so the "source code" is the collection of prompts, skills, tools, etc. and processes that make it up. E.g. a daily standup in the morning is now part of the "org code". And optimizing nanochat pretraining is just one of the many tasks (almost like an eval). Then - given an arbitrary task, how quickly does your research org generate progress on it?
Thomas Wolf@Thom_Wolf

How come the NanoGPT speedrun challenge is not fully AI automated research by now?

English
560
801
8.7K
1.6M
Taelin
Taelin@VictorTaelin·
so what's the best way to convert money into intelligence now? if I wanted to improve 5.3 beyond xhigh, what's the right approach? is there any ready to use product that allows me to do just that? I suppose "just spawn 10 instances lol" isn't the optimal answer, right?
Noam Brown@polynoamial

Perhaps a 🌶️ take but I think the criticisms of @GoogleDeepMind's release are missing the point, and the real problem is that AI labs and safety orgs need to adapt to a world where intelligence is a function of inference compute. When Google says that Deep Think poses no new risks beyond Gemini 3 Pro, they probably mean that Deep Think is a scaffold of Gemini 3 Pro that anyone externally could have constructed on their own anyway. In other words, the capabilities of Deep Think have always been available to anyone willing to pay for Deep Think amounts of inference, simply by scaffolding a bunch of Gemini 3 Pro queries together. Deep Think just makes that more convenient for the casual user. The corollary of this is that capabilities far beyond Gemini 3 Deep Think are already available to anyone willing to scaffold a system together that uses even more inference compute. As a trivial example, you could run 10 Deep Think queries and just do consensus over them. That would be 10x the cost but would have higher performance on many benchmarks. Most Preparedness Frameworks were developed in ~2023 before the era of effective test-time scaling. But today, there is a massive difference on the hardest evals between something like GPT-5.2 Low and GPT-5.2 Extra High. Scaffolds are also much more effective. So if you want to evaluate whether Gemini 3 can, for example, help make a bio weapon, the answer may depend on how much inference compute you give it. In my opinion, the proper solution is to account for inference compute when measuring model capabilities. E.g., if one were to spend $1,000 on inference with a really good scaffold, what performance could be expected on a benchmark? ARC-AGI has already adopted this mindset but few other benchmarks have. Of course, serious entities like state actors could spend well beyond $1,000. Accurate benchmark evaluations can require dozens of queries on hundreds of problems. So, if we want to measure a model's capability when using $1 million of inference, we might need to spend billions of dollars for each model release! But in the same way that pretraining scaling laws can predict the capabilities of larger pretrained models, performance also scales somewhat cleanly with additional inference compute. In my opinion, it should become standard practice for all system cards to show plots of benchmark performance as a function of inference compute, and safety thresholds should be based on a projection of what performance would look like at $1 million+ of inference compute. If that were the norm, then indeed releasing Deep Think probably would not result in a meaningful safety change compared to Gemini 3 Pro, other than making good scaffolds more easily available to casual users.

English
31
3
208
33.1K
SIGKITTEN
SIGKITTEN@SIGKITTEN·
codex app-server is legit af i was just looking into it for a project and accidentally ended up making an actual native codex iphone app i can spawn and talk to codexes anywhere on my network and one of the best parts... I built and linked codex into the actual iphone app and it now runs locally on the actual iphone gl doing that, cc
SIGKITTEN tweet mediaSIGKITTEN tweet mediaSIGKITTEN tweet media
English
116
123
2.6K
800.8K
Moisés Arizpe
Moisés Arizpe@_numerico·
@fchollet I think it’s more like going from a manufacturing factory to a chemical processing plant.
English
0
0
0
167
François Chollet
François Chollet@fchollet·
Sufficiently advanced agentic coding is essentially machine learning: the engineer sets up the optimization goal as well as some constraints on the search space (the spec and its tests), then an optimization process (coding agents) iterates until the goal is reached. The result is a blackbox model (the generated codebase): an artifact that performs the task, that you deploy without ever inspecting its internal logic, just as we ignore individual weights in a neural network. This implies that all classic issues encountered in ML will soon become problems for agentic coding: overfitting to the spec, Clever Hans shortcuts that don't generalize outside the tests, data leakage, concept drift, etc. I would also ask: what will be the Keras of agentic coding? What will be the optimal set of high-level abstractions that allow humans to steer codebase 'training' with minimal cognitive overhead?
English
171
381
3.3K
321.6K
Moisés Arizpe retweeté
Matthew Honnibal
Matthew Honnibal@honnibal·
I still see a lot of people discussing LLMs as next-token predictors, which is by now quite a misunderstanding. A related opinion is that LLM progress will probably plateau. This post explains why I don't think the "plateau" argument holds up. honnibal.dev/blog/ai-bubble
English
8
12
135
34.6K
François Chollet
François Chollet@fchollet·
Right now it's still taking me more time to generate medium-complexity diagrams by describing them to Nano Banana than by drawing them manually in Google Slides...
English
53
14
462
42.5K
Irving MA
Irving MA@moaimx·
Andamos cocinando la nueva versión de las herramientas de validación y limpieza para bases de datos a publicarse en la Plataforma Nacional de Datos Abiertos. Estas serán una herramienta fundamental para que las Instituciones publiquen con la mejor calidad...
Irving MA tweet mediaIrving MA tweet media
Español
8
5
115
3.3K
Moisés Arizpe
Moisés Arizpe@_numerico·
@chrisalbon @infinitehumanai Tell it to create reports with charts that let you explore inputs and outputs. Also to include code snippets etc. just as you would do with a direct report. The inspection layer does not need to be code.
English
1
0
2
226
Chris Albon
Chris Albon@chrisalbon·
@infinitehumanai I feel like I miss basic things if I don’t see the output and the code. Like I was processing some Wikipedia articles and Claude Code decided to only look at the first 500 words of each article for some reason
English
2
0
6
1.2K
Chris Albon
Chris Albon@chrisalbon·
Just saw the new codex app. I have just been using vscode with Claude in the in-app terminal. But all these new apps (conductor, codex, cursor) have this new paradigm where you basically aren’t looking at the code at all. Has everyone switched to this new paradigm? Is it hype?
English
54
0
61
29.4K
Prakash
Prakash@8teAPi·
you are watching the final stages of the democratization of information and the initial stages of the democratization of action
English
7
2
67
3.2K
Moisés Arizpe
Moisés Arizpe@_numerico·
Reddit take off was not in bingo card
English
0
0
0
31
Johannes Schickling
Johannes Schickling@schickling·
Who's building an IDE for reviewing code instead of writing code? Don't only show me diffs. Show me before/after UIs, terminal output, benchmarks, historic trends, playgrounds, demos, test results etc. Someone stop me from building this myself.
English
160
47
1.3K
92.3K
Sam Altman
Sam Altman@sama·
Delighted to see Ahmad join Airbnb! Airbnb is a rare combination of world-class design and engineering, and I am excited to see what Brian and Ahmad build together. Companies that are the furthest from AI—like travel and experiences—are quite interesting in a world with lots of AI, although I am also sure bringing AI to Airbnb will make it much better.
Brian Chesky@bchesky

.@Ahmad_Al_Dahle is joining as Airbnb's new CTO. I’m often asked about our AI strategy. We believe pairing great design with frontier technology will help us improve the way people experience travel. Excited to build!

English
485
184
3.5K
975.8K
Simon Willison
Simon Willison@simonw·
@geoffreylitt Yeah this is really hard! I've played with "we built this" but it feels weird and anthropomorphic
English
19
1
152
9.5K
Geoffrey Litt
Geoffrey Litt@geoffreylitt·
We need a shorthand way of saying: "An AI did the work, but I vouch for the result" Saying "I did it" feels slightly sketchy, but saying "Claude did it" feels like avoiding responsibility
English
1.1K
258
7.8K
542.5K
Moisés Arizpe retweeté
Lee Robinson
Lee Robinson@leerob·
I migrated cursor​.com from a CMS to raw code and Markdown. I had estimated it would take a few weeks, but was able to finish the migration in three days with $260 in tokens and hundreds of agents. Here's how I did it + all my my usage stats. leerob.com/agents
English
260
245
4.4K
2.2M
Moisés Arizpe retweeté
Noam Brown
Noam Brown@polynoamial·
Today, we at @OpenAI achieved a milestone that many considered years away: gold medal-level performance on the 2025 IMO with a general reasoning LLM—under the same time limits as humans, without tools. As remarkable as that sounds, it’s even more significant than the headline 🧵
Alexander Wei@alexwei_

1/N I’m excited to share that our latest @OpenAI experimental reasoning LLM has achieved a longstanding grand challenge in AI: gold medal-level performance on the world’s most prestigious math competition—the International Math Olympiad (IMO).

English
142
507
4.7K
1.1M