Carles Navarro

1.5K posts

@11krls

Chemistry & Physics | Machine Learning at @acellera

L'Ampolla, Spain · Joined October 2013
538 Following · 242 Followers
Pinned Tweet
Carles Navarro@11krls·
There is only one basic human right, the right to do as you damn well please. And with it comes the only basic human duty, the duty to take the consequences.
Carles Navarro@11krls·
Guys, all the important stuff (for us) in Claude Code was mostly public; the system prompts and hooks were all leaked. All the rest is just an envelope to put an agent into production and make it work on our computers / permissions / observability.
Cristian Córdova 🐧
👀 I've been digging through all the alleged Claude Code source to better understand how it works under the hood and squeeze more out of it as a user. Some interesting things I'm seeing, several of which I didn't know:
- The system prompt has some curious rules, e.g. it must not use emojis unless asked, it has a cyber-risk rule, and it doesn't give time estimates (in that last respect it resembles any dev xD)
- It has 45 tools, and for some of them it is explicitly told to use them instead of the system's CLI tools (Bash).
- It has 6 built-in agent types for different purposes, for example when it enters "Plan Mode". Each one runs on a specific model.
- It has file-based memory, as we know, but with a taxonomy: by user profile (how it knows us), by feedback (what we keep telling it, corrections, etc.), by project (knowledge of the code, what was done with it, etc.) and by reference (pointers to external systems like Linear, GH, etc.)
- MEMORY.md must be at most 200 lines. Beyond that, Claude Code apparently just ignores it.
- The pattern for implementing permission decisions is done through instructions written in XML files. And the curious part is that it spends between 64 and 4096 (presumably tokens) having the agent decide whether to execute something or not, and whether it has permission or not.
- By default, the context window is compacted at ~13,000 tokens, and after that there is a post-compaction budget of 50K tokens.
- It has an "Undercover" mode for internal Anthropic employees that hides model names and IDs from the prompt, preventing unannounced models from leaking into commits/PRs.
I could go on like this all day because there's a ton of curious stuff. I think I'll instead collect everything I can on a website, as an educational resource for myself and for anyone who's interested. I'll publish it later.
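The XML-driven permission pattern described above can be sketched roughly like this. To be clear, the tag names, attributes, and matching logic below are my own invention for illustration; the thread does not show the actual Claude Code file format:

```python
# Hypothetical sketch of XML-written permission rules for an agent's tool
# calls. The <rule> schema here is invented; only the pattern (decisions
# expressed as XML instructions the agent consults) comes from the thread.
import xml.etree.ElementTree as ET

RULES_XML = """
<permissions>
  <rule tool="Bash" pattern="rm -rf" decision="deny"/>
  <rule tool="Bash" pattern="git status" decision="allow"/>
  <rule tool="Edit" pattern="*" decision="ask"/>
</permissions>
"""

def decide(tool: str, command: str) -> str:
    """Return 'allow', 'deny', or 'ask' for a proposed tool call."""
    root = ET.fromstring(RULES_XML)
    for rule in root.iter("rule"):
        if rule.get("tool") != tool:
            continue
        pattern = rule.get("pattern")
        if pattern == "*" or pattern in command:
            return rule.get("decision")
    return "ask"  # default to asking the user when no rule matches

print(decide("Bash", "git status"))     # allow
print(decide("Bash", "rm -rf /tmp/x"))  # deny
```

In the real system the decision is reportedly made by the model itself within a small token budget; this sketch only shows the rule-file side of the pattern.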
Carles Navarro retweeted
Carles Navarro@11krls·
Ironically, universities have indeed adapted to the law of supply and demand: to the infinite demand from mothers, fathers, and kids who have been led to believe that the only dignified path is getting a degree. Most degree programs shouldn't exist, or should be specializations.
Mihura@XMihura·
Since we've opened the can of worms about civil servants, I'd like to mention something I see a lot in my generation, and which strikes me as a very clear symptom that something is broken: I've already seen many cases, a great many, of people who study a degree (physiotherapy, accounting and finance, veterinary medicine, building engineering, whatever) and, two or three years after crashing into the job market, start preparing a civil-service exam that has NOTHING to do with what they studied. In many cases police officer or administrative assistant. And if I've seen it so many times, it's because it is extremely common.

This model is literally taking a generation's highest-energy years and throwing them in the trash: 4 years of undergrad + 1 of a master's + 2 years of precarious jobs + 2-3 years locked away memorizing a syllabus for a position that doesn't require that university degree. Almost a decade of human capital wasted.

I ask myself: doesn't this system need rethinking from top to bottom? Do we really need to churn out thousands of graduates in physiotherapy, teaching, and tourism every year when we know the market won't absorb them? University in Spain seems to have become a parking lot for the young, a daycare where kids delay their lives 5-6 years; an extremely expensive tool that in many cases only manages to keep people locked out of the labor market during their highest-energy years.

The opportunity cost of this model for the country is very high: kids burning their twenties in exam-prep academies to escape precarity, when they could be building things, innovating, or simply contributing real value from age 20. We have to start aligning education with the material reality of the world, because sustaining this talent sink does no one any favors.
Carles Navarro@11krls·
There’s one thing people who say ‘MCP or CLI’ and people who say ‘Transformers or diffusion’ have in common: they have no idea.
David Gomes@davidgomes·
I'm unshipping a feature from Cursor, and I can tell that all the SOTA models are really bad at deleting code. They will routinely:
- Leave behind `throw Error("Not implemented")` style things
- Want to give users notifications like "Feature has been deprecated"
- Keep tests for features that don't exist, and write stubs for those features in the tests (instead of deleting the tests)
- Fail to find a ton of dead code, and not care much about deleting it
- Leave useless comments about deleted code/functionality
I think we need to greatly improve the RL data with lots of feature deletions, because they have been trained only to generate code, not to delete it.
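The tweet's `throw Error("Not implemented")` example is JavaScript-flavored; the same leftover-stub anti-pattern in Python looks like this (a contrived example of my own, not from the tweet):

```python
# Anti-pattern: what "deleting" a feature often looks like after an AI edit.
# The feature is gone in name only; stubs and dead references survive.

def export_report(data):
    # Leftover stub instead of removing the function and its call sites.
    raise NotImplementedError("Feature has been deprecated")

def test_export_report():
    # Leftover test for a feature that no longer exists: it now only
    # asserts that the stub raises, which tests nothing useful.
    try:
        export_report([])
    except NotImplementedError:
        pass

# A genuine deletion would remove export_report, its tests, and all callers.
```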
Carles Navarro retweeted
Matt Pocock@mattpocockuk·
Good tip for avoiding cognitive debt in codebases where AI has run wild: Design the interface, delegate the implementation
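One way to read "design the interface, delegate the implementation" in code: the human authors and reviews an abstract contract, and the agent fills in the body behind it. A minimal Python sketch (the `RateLimiter` example is my own, not from the tweet):

```python
# "Design the interface, delegate the implementation": the human fixes the
# contract; the agent-written implementation can churn behind it without
# adding cognitive debt to the reviewed surface.
from abc import ABC, abstractmethod

class RateLimiter(ABC):
    """Human-authored contract; names and semantics are what gets reviewed."""

    @abstractmethod
    def allow(self, key: str) -> bool:
        """Return True if `key` may proceed, False if it is throttled."""

class FixedWindowLimiter(RateLimiter):
    """Agent-authored implementation, replaceable as long as it honors
    the RateLimiter contract."""

    def __init__(self, limit: int):
        self.limit = limit
        self.counts: dict[str, int] = {}

    def allow(self, key: str) -> bool:
        self.counts[key] = self.counts.get(key, 0) + 1
        return self.counts[key] <= self.limit

limiter = FixedWindowLimiter(limit=2)
print([limiter.allow("user") for _ in range(3)])  # [True, True, False]
```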
Carles Navarro retweeted
rahul@rahulgs·
seems obvious but:

things that are changing rapidly:
1. context windows
2. intelligence / ability to reason within context
3. performance on any given benchmark
4. cost per token

things that are not changing much:
1. humans
2. human behavior, preferences, affinities
3. tools, integrations, infrastructure
4. single core cpu performance

therefore, ngmi:
1. "i found this method to cut 15% context"
2. "our method improves retrieval performance 10% by using hybrid search"
3. "our finetuned model is cheaper than opus at this benchmark"
4. "our harness does this better because we invented this multi agent system"
5. "we're building a memory system"
6. "context graphs"
7. "we trained an in house specialized rl model to improve task performance in X benchmark at Y% cost reduction"

wagmi:
1. product/ui
3. customer acquisition
4. integrations
5. fast linting, ci, skills, feedback for agents
6. background agent infra to parallelize more work
7. speed up your agent verification loops
8. training your users, connecting to their systems and working with their data, meeting them where they are
Carles Navarro@11krls·
Memory should only be per chat. So I can have a long chat without losing context; if I start a new chat, it's because I want fresh context. When context engineering gets good enough I may never leave the conversation, but for now it just messes up both experiences (not good enough for an infinite conversation, nor for global context).
Andrej Karpathy@karpathy·
One common issue with personalization in all LLMs is how distracting memory seems to be for the models. A single question from 2 months ago about some topic can keep coming up as some kind of a deep interest of mine with undue mentions in perpetuity. Some kind of trying too hard.
Carles Navarro retweeted
Kpaxs@Kpaxs·
This is a man who has been haunted since childhood and built a billion dollar company as a side effect of trying to make the haunting stop.
Carles Navarro@11krls·
With agents getting better and working for longer, it no longer feels productive to wait several minutes for each task to finish. I've resigned myself: I don't want to know what code they produce anymore; I fully embrace multi-agentic coding. This codebase is modular enough that I can coordinate with a simple agent_inbox/ file-based communication protocol.
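A file-per-message inbox like the `agent_inbox/` protocol mentioned here can be very small. The directory layout and message fields below are guesses for illustration, since the tweet doesn't specify the actual format:

```python
# Minimal sketch of a file-based agent inbox: one JSON file per message
# under agent_inbox/<agent>/. Agents coordinate purely through the
# filesystem; consuming a message deletes its file.
import json
import time
import uuid
from pathlib import Path

INBOX_ROOT = Path("agent_inbox")

def send(to_agent: str, sender: str, body: str) -> Path:
    """Drop a message file into the recipient's inbox directory."""
    inbox = INBOX_ROOT / to_agent
    inbox.mkdir(parents=True, exist_ok=True)
    msg = {"id": uuid.uuid4().hex, "from": sender,
           "ts": time.time(), "body": body}
    path = inbox / f"{msg['id']}.json"
    path.write_text(json.dumps(msg))
    return path

def drain(agent: str) -> list:
    """Read and delete all pending messages for an agent, oldest first."""
    inbox = INBOX_ROOT / agent
    messages = []
    for path in sorted(inbox.glob("*.json"), key=lambda p: p.stat().st_mtime):
        messages.append(json.loads(path.read_text()))
        path.unlink()  # consuming a message removes the file
    return messages

send("backend", "coordinator", "implement /health endpoint")
print([m["body"] for m in drain("backend")])
```

The appeal of this design is that it needs no infrastructure: any agent that can read and write files can participate, and the inbox state is inspectable with plain `ls` and `cat`.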
Carles Navarro@11krls·
Cursor's job is not to compete with model providers IMO; creating an agentOS was (and may still be) a better strategy. Easy to say that in hindsight, though.
Carles Navarro@11krls·
@karpathy This is a real ML researcher optimizing even the random seed when there is nothing else to try
Andrej Karpathy@karpathy·
I packaged up the "autoresearch" project into a new self-contained minimal repo if people would like to play over the weekend. It's basically the nanochat LLM training core stripped down to a single-GPU, one-file version of ~630 lines of code, then:
- the human iterates on the prompt (.md)
- the AI agent iterates on the training code (.py)
The goal is to engineer your agents to make the fastest research progress indefinitely and without any of your own involvement. In the image, every dot is a complete LLM training run that lasts exactly 5 minutes. The agent works in an autonomous loop on a git feature branch and accumulates git commits to the training script as it finds better settings (of lower validation loss by the end) of the neural network architecture, the optimizer, all the hyperparameters, etc. You can imagine comparing the research progress of different prompts, different agents, etc. github.com/karpathy/autor… Part code, part sci-fi, and a pinch of psychosis :)
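The outer loop described here (edit the training script, run for a fixed budget, keep the change only if validation loss drops, record it as a commit) is essentially a greedy hill-climb. A toy sketch, with training and the agent replaced by stand-in functions; the real repo shells out to git and a ~630-line trainer, none of which is reproduced here:

```python
# Toy sketch of an autoresearch-style loop: accept an agent's edit to the
# "training script" only if it strictly lowers validation loss.

def autoresearch_loop(train, propose, script, iterations):
    """Greedy hill-climb over script edits; returns best loss and the
    history of accepted (script, loss) pairs (the real loop would make a
    git commit at each acceptance)."""
    history = []
    best_loss = train(script)
    for _ in range(iterations):
        candidate = propose(script)
        loss = train(candidate)
        if loss < best_loss:  # keep only strict improvements
            script, best_loss = candidate, loss
            history.append((script, loss))
    return best_loss, history

# Stand-ins: the "script" is just a learning rate, "training" is a
# quadratic with optimum at lr=0.3, and the "agent" nudges lr upward.
train = lambda lr: (lr - 0.3) ** 2
propose = lambda lr: lr + 0.05
best, history = autoresearch_loop(train, propose, script=0.1, iterations=6)
print(round(best, 4))
```

Once the loss stops improving (past lr=0.3 in this toy), further proposals are rejected and the accepted script stays put, mirroring how the real loop only accumulates commits for genuine gains.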
Carles Navarro retweeted
Ivan Burazin@ivanburazin·
Sandboxes are layer one. As agents take on more complex work, every layer needs rethinking:
- Networking for agent-to-agent communication
- Storage for petabyte-scale snapshots
- Observability for debugging million-path execution trees
- Security for autonomous decision making
The whole stack will be rebuilt from first principles.
Andrej Karpathy@karpathy·
There was a nice time where researchers talked about various ideas quite openly on twitter. (before they disappeared into the gold mines :)). My guess is that you can get quite far even in the current paradigm by introducing a number of memory ops as "tools" and throwing them into the mix in RL. E.g. current compaction and memory implementations are crappy, first, early examples that were somewhat bolted on, but both can be fairly easily generalized and made part of the optimization as just another tool during RL. That said neither of these is fully satisfying because clearly people are capable of some weight-based updates (my personal suspicion - mostly during sleep). So there should be even more room for more exotic approaches for long-term memory that do change the weights, but exactly - the details are not obvious. This is a lot more exciting, but also more into the realm of research outside of the established prod stack.
Awni Hannun@awnihannun

I've been thinking a bit about continual learning recently, especially as it relates to long-running agents (and running a few toy experiments with MLX). The status quo of prompt compaction coupled with recursive sub-agents is actually remarkably effective. Seems like we can go pretty far with this. (Prompt compaction = when the context window gets close to full, the model generates a shorter summary, then starts from scratch using the summary. Recursive sub-agents = decompose tasks into smaller tasks to deal with finite context windows.)

Recursive sub-agents will probably always be useful. But prompt compaction seems like a bit of an inefficient (though highly effective) hack. There are two other alternatives I know of: 1. online fine-tuning and 2. memory-based techniques.

Online fine-tuning: train some LoRA adapters on data the model encounters during deployment. I'm less bullish on this in general. Aside from the engineering challenges of deploying custom models / adapters for each use case / user, there are some fundamental issues:
- Online fine-tuning is inherently unstable. If you train on data in the target domain you can catastrophically destroy capabilities that you don't target. One way around this is to keep a mixed dataset with the new and the old. But this gets pretty complicated pretty quickly.
- What does the data even look like for online fine-tuning? Do you generate Q/A pairs based on the target domain to train the model? You also have the problem of prioritizing information in the data mixture given finite capacity.

Memory-based techniques: basically a policy for keeping useful memory around and discarding what is not needed. This feels much more like how humans retain information: "use it or lose it". You only need a few things for this to work:
- An eviction/retention policy. Something like "keep a memory if it has been accessed at least once in the last 10k tokens".
- The policy needs to be efficiently computable.
- A place for the model to store and access long-term memory. Maybe a sparsely accessed KV cache would be sufficient. But for efficient access to a large memory a hierarchical data structure might be better.
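The eviction/retention policy sketched in those last bullets can be made concrete in a few lines. A toy version, assuming a running token counter stands in for "the last 10k tokens"; all names here are illustrative, not from the thread:

```python
# Toy "use it or lose it" memory: evict any entry not accessed within the
# last `window` tokens of generation. Accessing an entry refreshes it.
from typing import Optional

class MemoryStore:
    def __init__(self, window: int = 10_000):
        self.window = window
        self.tokens_seen = 0                 # running generation clock
        self.last_access: dict = {}          # key -> token count at last access
        self.items: dict = {}                # key -> stored memory

    def put(self, key: str, value: str) -> None:
        self.items[key] = value
        self.last_access[key] = self.tokens_seen

    def get(self, key: str) -> Optional[str]:
        if key in self.items:
            self.last_access[key] = self.tokens_seen  # access refreshes TTL
            return self.items[key]
        return None

    def advance(self, n_tokens: int) -> None:
        """Advance the generation clock and evict stale memories."""
        self.tokens_seen += n_tokens
        cutoff = self.tokens_seen - self.window
        for key in [k for k, t in self.last_access.items() if t < cutoff]:
            del self.items[key], self.last_access[key]  # use it or lose it

mem = MemoryStore(window=10_000)
mem.put("user_editor", "prefers vim")
mem.put("one_off", "asked about pandas once")
mem.advance(9_000)
mem.get("user_editor")     # accessed -> refreshed at t=9000
mem.advance(9_000)         # t=18000: "one_off" is now stale and evicted
print(sorted(mem.items))   # ['user_editor']
```

This satisfies the "efficiently computable" requirement trivially (a counter compare per entry); the harder open question from the thread, where the memory physically lives (KV cache vs. a hierarchical store), is not modeled here.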

Carles Navarro retweeted
gabriel@gabriel1·
i could learn any topic in 5 minutes with the most optimal text & visual explanation, if it's adapting live to what you do and don't understand. fundamentally ai has like another 5 gpt3-to-gpt4 moments ahead