Pinned Tweet
Lute Lillo Portero
581 posts

Lute Lillo Portero
@Lute47Lillo
👨🏼💻🇺🇸 Researching Continual Learning and Loss of Plasticity as a PhD Student at @uvmvermont @OmahaMSOC alumni 📍Burlington, VT - El Altet, Alicante🏡
Burlington, VT · Joined May 2016
763 Following · 199 Followers
Lute Lillo Portero retweeted

Continual learning through automated research is like evolution: every new checkpoint is an evolved version, gated by natural or human selection.
Continual learning through test-time training (TTT) or memory is like living a lifetime for an intelligent creature. The model learns in context, adapts to new situations, and gradually loses its plasticity and capacity to absorb new information as experiences accumulate.
There shouldn’t be any conflict between the two CL approaches, but they do emphasize different aspects. Automated research treats abilities as gifts that models are born with, while TTT treats them as flexible knowledge and skills that can be acquired via experience. One enables adaptation; the other largely does not.
Sure, no fundamental change can be achieved by TTT alone, just as no dinosaur can evolve into a human within its lifetime. But relying entirely on evolution is also quite restrictive. Human beings learn their skills through interaction with the environment: they observe and react to new information, familiarize themselves with their surroundings, and gradually become capable at what they practice. In the current AI landscape, only RL allows a model to truly learn from experience, and it’s almost as costly as pretraining. Most of the time, AI models can only continuously learn through either in-context adaptation or leaving external traces, such as creating a Claude.md, to prompt itself better.
When evolution dominates AI progress, its drawbacks dominate as well. We inevitably encounter poor evolutionary checkpoints (such as the blunder of Opus 4.7 compared to GPT-5.5), while also enduring slow correction cycles. TTT is not a cure for the pain or slowness of evolution, but it may offer a smoother path. We don’t need to train a thousand model checkpoints to realize that certain data, experiences, environments, or rewards should not be included in training. Moreover, some capabilities at the human frontier can be improved through deep exploration alone, without requiring evolutionary change, in regimes where the sparse traces of progress would otherwise be drowned out by masses of ordinary ones.
Life on Earth evolved for half a billion years before producing a species capable of learning from highly abstracted information within a single lifetime. We may still be far from building a lifelong learning agent, but when it arrives, it could represent a different kind of singularity.
Lute Lillo Portero retweeted

Meet KinDER — a stress test for robot physical reasoning. All 13 methods failed 😈
🌎 25 environments
♾️ Infinite tasks
🏋️ Gymnasium API
⚒️ Over 20 parameterized skills
🪧 Human demonstrations
📊 13 baselines (planning and learning)
From @Princeton @CMU_Robotics @ICatGT @CambridgeMLG @nvidia @MIT_CSAIL
🧵 1/n
Lute Lillo Portero retweeted

The smartest system will need to update model behavior along the way. Crucially, you also need to remember what your overall goal is. Hence, continual learning.
I tend to think the efficiency part is a key ingredient here that didn't get as much of a front-row seat in this article. There are several key directions of research in continual learning, but for the goal of enabling new intelligence, the speed at which you can incorporate new information is critical.
If you can learn under low-latency constraints, you can explore your environment much faster. The unit of time that interaction and behavior updates cost determines how quickly you can explore effectively.
Lute Lillo Portero retweeted

Released on the @berkeley_ai blog, recent work by @michaelpsenka M. Rabbat @ask1729 @ylecun @_amirbar — long horizons in visual world models punish naive gradients; GRASP reshapes them (lifted virtual states, noised state iterates, action-friendly descent) so planning stays stable when rollouts get ill-conditioned … 🐻📄
bair.berkeley.edu/blog/2026/04/2…

9/ Takeaway:
Continual RL should move beyond preserving a single successful policy.
What matters is preserving a reusable neighborhood of policies that keeps adaptation possible later.
Check out the current preprint: arxiv.org/abs/2604.15414
&
Come talk to me at ICLR'26 🇧🇷
Lute Lillo Portero retweeted

When simulation becomes the norm, it weakens the human capacity for discernment. As a result, our social bonds close in upon themselves, forming self-referential circuits that no longer expose us to reality. We thus come to live within bubbles, impermeable to one another. Feeling threatened by anyone who is different, we grow unaccustomed to encounter and dialogue. In this way, polarization, conflict, fear and violence spread. What is at stake is not merely the risk of error, but a transformation in our very relationship with truth.
Lute Lillo Portero retweeted

AMI Labs just raised $1.03B. World Labs raised $1B a few weeks earlier. Both are betting on world models.
But almost nobody means the same thing by that term.
Here are, in my view, five categories of world models.
---
1. Joint Embedding Predictive Architecture (JEPA)
Representatives: AMI Labs (@ylecun), V-JEPA 2
The central bet here is that pixel reconstruction alone is an inefficient objective for learning the abstractions needed for physical understanding. LeCun has been saying this for years — predicting every pixel of the future is intractable in any stochastic environment. JEPA sidesteps this by predicting in a learned latent space instead.
Concretely, JEPA trains an encoder that maps video patches to representations, then a predictor that forecasts masked regions in that representation space — not in pixel space.
This is a crucial design choice.
A generative model that reconstructs pixels is forced to commit to low-level details (exact texture, lighting, leaf position) that are inherently unpredictable. By operating on abstract embeddings, JEPA can capture "the ball will fall off the table" without having to hallucinate every frame of it falling.
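To make that design choice concrete, here is a toy sketch of masked prediction in latent space. Everything in it (the shapes, the linear stand-ins for the ViT encoder and predictor, the 50% mask ratio) is my own illustration under those assumptions, not V-JEPA's actual code:

```python
import torch
import torch.nn as nn

D = 256                                   # latent dim (illustrative)
encoder = nn.Linear(768, D)               # stand-in for a ViT patch encoder
predictor = nn.Linear(D, D)               # stand-in for the predictor network
target_encoder = nn.Linear(768, D)        # in practice an EMA copy of the encoder
target_encoder.load_state_dict(encoder.state_dict())

patches = torch.randn(8, 196, 768)        # (batch, patches, patch_dim)
mask = torch.rand(8, 196) < 0.5           # which patches are hidden

# The context encoder only sees unmasked content; the predictor must fill in the rest.
ctx = encoder(patches * (~mask).unsqueeze(-1).float())
with torch.no_grad():
    tgt = target_encoder(patches)         # targets live in latent space, not pixels
pred = predictor(ctx)

loss = ((pred - tgt)[mask] ** 2).mean()   # loss only on masked positions
loss.backward()
```

The point the sketch makes is structural: the target is an embedding, so unpredictable pixel-level detail never enters the objective.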
V-JEPA 2 is the clearest large-scale proof point so far. It's a 1.2B-parameter model pre-trained on 1M+ hours of video via self-supervised masked prediction — no labels, no text. The second training stage is where it gets interesting: just 62 hours of robot data from the DROID dataset is enough to produce an action-conditioned world model that supports zero-shot planning. The robot generates candidate action sequences, rolls them forward through the world model, and picks the one whose predicted outcome best matches a goal image. This works on objects and environments never seen during training.
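The planning loop described here is essentially sampling-based shooting. A hypothetical version, where world_model and encode are assumed callables (encode returning a (1, latent) tensor) and plain Gaussian sampling stands in for whatever refinement the real system uses:

```python
import torch

def plan(world_model, encode, obs, goal_img,
         horizon=10, n_candidates=256, action_dim=7):
    """Score random action sequences by how close their predicted
    final latent lands to the latent of the goal image."""
    z = encode(obs).expand(n_candidates, -1)        # (candidates, latent)
    actions = torch.randn(n_candidates, horizon, action_dim)
    for t in range(horizon):
        z = world_model(z, actions[:, t])           # predicted next latent
    dist = ((z - encode(goal_img)) ** 2).sum(-1)    # distance to goal latent
    return actions[dist.argmin()]                   # best candidate sequence
```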
The data efficiency is the real technical headline. 62 hours is almost nothing. It suggests that self-supervised pre-training on diverse video can bootstrap enough physical prior knowledge that very little domain-specific data is needed downstream. That's a strong argument for the JEPA design — if your representations are good enough, you don't need to brute-force every task from scratch.
AMI Labs is LeCun's effort to push this beyond research. They're targeting healthcare and robotics first, which makes sense given JEPA's strength in physical reasoning with limited data. But this is a long-horizon bet — their CEO has openly said commercial products could be years away.
---
2. Spatial Intelligence (3D World Models)
Representative: World Labs (@drfeifei)
Where JEPA asks "what will happen next," Fei-Fei Li's approach asks "what does the world look like in 3D, and how can I build it?"
The thesis is that true understanding requires explicit spatial structure — geometry, depth, persistence, and the ability to re-observe a scene from novel viewpoints — not just temporal prediction.
This is a different bet from JEPA: rather than learning abstract dynamics, you learn a structured 3D representation of the environment that you can manipulate directly.
Their product Marble generates persistent 3D environments from images, text, video, or 3D layouts. "Persistent" is the key word — unlike a video generation model that produces a linear sequence of frames, Marble's outputs are actual 3D scenes with spatial coherence. You can orbit the camera, edit objects, export meshes. This puts it closer to a 3D creation tool than to a predictive model, which is deliberate.
For context, this builds on a lineage of neural 3D representation work (NeRFs, 3D Gaussian Splatting) but pushes toward generation rather than reconstruction. Instead of capturing a real scene from multi-view photos, Marble synthesizes plausible new scenes from sparse inputs. The challenge is maintaining physical plausibility — consistent geometry, reasonable lighting, sensible occlusion — across a generated world that never existed.
---
3. Learned Simulation (Generative Video + Latent-Space RL)
Representatives: Google DeepMind (Genie 3, Dreamer V3/V4), Runway GWM-1
This category groups two lineages that are rapidly converging: generative video models that learn to simulate interactive worlds, and RL agents that learn world models to train policies in imagination.
The video generation lineage. DeepMind's Genie 3 is the purest version — text prompt in, navigable environment out, 24 fps at 720p, with consistency for a few minutes. Rather than relying on an explicit hand-built simulator, it learns interactive dynamics from data. The key architectural property is autoregressive generation conditioned on user actions: each frame is generated based on all previous frames plus the current input (move left, look up, etc.). This means the model must maintain an implicit spatial memory — turn away from a tree and turn back, and it needs to still be there. DeepMind reports visual memory extending back about a minute, which is impressive but still far from what you'd need for sustained agent training.
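As a loose schematic of that loop (a recurrent cell here compresses "all previous frames" into one running state, which is a big simplification of the real architecture; none of this is Genie 3's API):

```python
import torch
import torch.nn as nn

D, A = 32, 4                         # toy frame-latent and action sizes
step = nn.GRUCell(A, D)              # stand-in for the real model
decode = nn.Linear(D, D)             # running state -> next-frame latent

h = torch.zeros(1, D)                # implicit memory of everything shown so far
frames = []
for t in range(8):
    action = torch.randn(1, A)       # user input at step t (move, look, ...)
    h = step(action, h)              # fold history + action into the state
    frames.append(decode(h))         # next frame depends on all prior frames
```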
Runway's GWM-1 takes a similar foundation — autoregressive frame prediction built on Gen-4.5 — but splits into three products: Worlds, Robotics, and Avatars. That split suggests the practical generality problem is still being decomposed by action space and use case.
The RL lineage. The Dreamer series has the longer intellectual history. The core idea is clean: learn a latent dynamics model from observations, then roll out imagined trajectories in latent space and optimize a policy via backpropagation through the model's predictions. The agent never needs to interact with the real environment during policy learning.
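A rough sketch of that idea, with toy modules standing in for Dreamer's actual components. In the real recipe the dynamics and reward models are learned from collected experience and held fixed during imagination; here they are untrained placeholders:

```python
import torch
import torch.nn as nn

D, A, H = 64, 4, 15                         # latent dim, action dim, horizon
dynamics = nn.Linear(D + A, D)              # learned latent transition (placeholder)
reward_head = nn.Linear(D, 1)               # predicted reward from latent (placeholder)
policy = nn.Sequential(nn.Linear(D, A), nn.Tanh())
opt = torch.optim.Adam(policy.parameters(), lr=3e-4)

z = torch.randn(16, D)                      # latents inferred from real observations
ret = 0.0
for t in range(H):                          # imagined rollout: zero env steps
    a = policy(z)
    z = dynamics(torch.cat([z, a], dim=-1))
    ret = ret + 0.99 ** t * reward_head(z)  # discounted imagined reward

loss = -ret.mean()                          # maximize imagined return
opt.zero_grad()
loss.backward()                             # backprop through the model's predictions
opt.step()                                  # only the policy is updated here
```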
Dreamer V3 was the first AI to get diamonds in Minecraft without human data. Dreamer 4 did the same purely offline — no environment interaction at all. Architecturally, Dreamer 4 moves from Dreamer’s earlier recurrent-style lineage to a more scalable transformer-based world-model recipe, and introduced "shortcut forcing" — a training objective that lets the model jump from noisy to clean predictions in just 4 steps instead of the 64 typical in diffusion models. This is what makes real-time inference on a single H100 possible.
These two sub-lineages used to feel distinct: video generation produces visual environments, while RL world models produce trained policies.
But Dreamer 4 blurred the line — humans can now play inside its world model interactively, and Genie 3 is being used to train DeepMind's SIMA agents.
The convergence point is that both need the same thing: a model that can accurately simulate how actions affect environments over extended horizons.
The open question for this whole category is one LeCun keeps raising: does learning to generate pixels that look physically correct actually mean the model understands physics? Or is it pattern-matching appearance? Dreamer 4's ability to get diamonds in Minecraft from pure imagination is a strong empirical counterpoint, but it's also a game with discrete, learnable mechanics — the real world is messier.
---
4. Physical AI Infrastructure (Simulation Platform)
Representative: NVIDIA Cosmos
NVIDIA's play is: don't build the world model, build the platform everyone else uses to build theirs.
Cosmos launched at CES January 2025 and covers the full stack — data curation pipeline (process 20M hours of video in 14 days on Blackwell, vs. 3+ years on CPU), a visual tokenizer with 8x better compression than prior SOTA, model training via NeMo, and deployment through NIM microservices.
The pre-trained world foundation models are trained on 9,000 trillion tokens from 20M hours of real-world video spanning driving, industrial, robotics, and human activity data.
They come in two architecture families: diffusion-based (operating on continuous latent tokens) and autoregressive transformer-based (next-token prediction on discretized tokens). Both can be fine-tuned for specific domains.
Three model families sit on top of this.
Predict generates future video states from text, image, or video inputs — essentially video forecasting that can be post-trained for specific robot or driving scenarios.
Transfer handles sim-to-real domain adaptation, which is one of the persistent headaches in physical AI — your model works great in simulation but breaks in the real world due to visual and dynamics gaps.
Reason (added at GTC 2025) brings chain-of-thought reasoning over physical scenes — spatiotemporal awareness, causal understanding of interactions, video Q&A.
---
5. Active Inference
Representative: VERSES AI (Karl Friston)
This is the outlier on the list — not from the deep learning tradition at all, but from computational neuroscience.
Karl Friston's Free Energy Principle says intelligent systems continuously generate predictions about their environment and act to minimize surprise (technically: variational free energy, an upper bound on surprise).
Where standard RL is usually framed around reward maximization, active inference frames behavior as minimizing variational / expected free energy, which blends goal-directed preferences with epistemic value. This leads to natural exploration behavior: the agent is drawn to situations where it's uncertain, because resolving uncertainty reduces free energy.
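A toy discrete version of that action-selection rule (my own simplification of active inference, not VERSES' AXIOM): each action is scored by pragmatic value, reaching preferred observations, plus epistemic value, the expected information gain about hidden states.

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nO, nA = 4, 3, 2                                # states, observations, actions
A = rng.dirichlet(np.ones(nO), size=nS).T           # likelihood P(o|s)
T = rng.dirichlet(np.ones(nS), size=(nA, nS))       # T[a][s] = P(s'|s, a)
q = np.full(nS, 1 / nS)                             # current belief over states
log_pref = np.log(np.array([0.80, 0.15, 0.05]))     # log-preferences over outcomes

def expected_free_energy(a):
    qs = q @ T[a]                                   # predicted next-state belief
    qo = A @ qs                                     # predicted observation dist
    pragmatic = qo @ log_pref                       # drive toward preferred outcomes
    epistemic = 0.0                                 # expected information gain
    for o in range(nO):
        post = A[o] * qs / qo[o]                    # posterior after seeing o
        epistemic += qo[o] * np.sum(post * np.log(post / qs + 1e-12))
    return -(pragmatic + epistemic)                 # lower G = better action

best_action = min(range(nA), key=expected_free_energy)
```

The epistemic term is why the agent seeks out uncertainty: observations expected to move the belief a lot lower the expected free energy.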
VERSES built AXIOM (Active eXpanding Inference with Object-centric Models) on this foundation.
The architecture is fundamentally different from neural network world models. Instead of learning a monolithic function approximator, AXIOM maintains a structured generative model where each entity in the environment is a discrete object with typed attributes and relations.
Inference is Bayesian — beliefs are probability distributions that get updated via message passing, not gradient descent. This makes it interpretable (you can inspect what the agent believes about each object), compositional (add a new object type without retraining), and extremely data-efficient.
In their robotics work, they've shown a hierarchical multi-agent setup where each joint of a robot arm is its own active inference agent. The joint-level agents handle local motor control while higher-level agents handle task planning, all coordinating through shared beliefs in a hierarchy. The whole system adapts in real time to unfamiliar environments without retraining — you move the target object and the agent re-plans immediately, because it's doing online inference, not executing a fixed policy.
They shipped a commercial product (Genius) in April 2025, and the AXIOM benchmarks against RL baselines are competitive on standard control tasks while using orders of magnitude less data.
---
imo, these five categories aren't really competing — they're solving different sub-problems.
JEPA compresses physical understanding.
Spatial intelligence reconstructs 3D structure.
Learned simulation trains agents through generated experience.
NVIDIA provides the picks and shovels.
Active inference offers a fundamentally different computational theory of intelligence.
My guess is the lines between them blur fast.
Lute Lillo Portero retweeted

psychology solved the ai memory problem decades ago. we just haven't been reading the right papers.
your identity isn't something you have. it's something you construct. constantly. from autobiographical memory, emotional experience, and narrative coherence.
Martin Conway's Self-Memory System (2000, 2005) showed that memories aren't stored like video recordings.
they're reconstructed every time you access them, assembled from fragments across different neural systems. and the relationship is bidirectional: your memories constrain who you can plausibly be, but your current self-concept also reshapes how you remember. memory is continuously edited to align with your current goals and self-images. this isn't a bug. it's the architecture.
not all memories contribute equally. Rathbone et al. (2008) showed autobiographical memories cluster disproportionately around ages 10-30, the "reminiscence bump," because that's when your core self-images form.
you don't remember your life randomly. you remember the transitions. the moments you became someone new. Madan (2024) takes it further: combined with Episodic Future Thinking, this means identity isn't just backward-looking. it's predictive. you use who you were to project who you might become. memory doesn't just record the past. it generates the future self.
if memory constructs identity, destroying memory should destroy identity. it does. Clive Wearing, a British musicologist who suffered brain damage in 1985, lost the ability to form new memories. his memory resets every 30 seconds. he writes in his diary: "Now I am truly awake for the first time." crosses it out. writes it again minutes later.
but two things survived: his ability to play piano (procedural memory, stored in cerebellum, not the damaged hippocampus) and his emotional bond with his wife. every time she enters the room, he greets her with overwhelming joy. as if reunited after years. every single time. episodic memory is fragile and localized.
emotional memory is distributed widely and survives damage that obliterates everything else.
Antonio Damasio's Somatic Marker Hypothesis destroyed the Western tradition of separating reason from emotion.
emotions aren't obstacles to rational decisions. they're prerequisites.
when you face a decision, your brain reactivates physiological states from past outcomes of similar decisions. gut reactions. subtle shifts in heart rate. these "somatic markers" bias cognition before conscious deliberation begins.
the Iowa Gambling Task proved it: normal participants develop a "hunch" about dangerous card decks 10-15 trials before conscious awareness catches up. their skin conductance spikes before reaching for a bad deck. the body knows before the mind knows. patients with ventromedial prefrontal cortex damage understand the math perfectly when told. but keep choosing the bad decks anyway. their somatic markers are gone. without the emotional signal, raw reasoning isn't enough.
Overskeid (2020) argues Damasio undersold his own theory: emotions may be the substrate upon which all voluntary action is built.
put the threads together. Conway: memory is organized around self-relevant goals. Damasio: emotion makes memories actionable. Rathbone: memories cluster around identity transitions. Bruner: narrative is the glue.
identity = memories organized by emotional significance, structured around self-images, continuously reconstructed to maintain narrative coherence. now look at ai agent memory and tell me what's missing.
current architectures all fail for the same reason: they treat memory as storage, not identity construction. vector databases (RAG) are flat embedding space with no hierarchy, no emotional weighting, no goal-filtering. past 10k documents, semantic search becomes a coin flip. conversation summaries compress your autobiography into a one-paragraph bio. key-value stores reduce identity to a lookup table. episodic buffers give you a 30-second memory span, which as the Wearing case shows, is enough to operate moment-to-moment but not enough to construct identity.
five principles from psychology that ai memory lacks. (a code sketch pulling a few of them together follows the list.)
first, hierarchical temporal organization (Conway): human memory narrows by life period, then event type, then specific details. ai memory is flat, every fragment at the same level, brute-force search across everything. fix: interaction epochs, recurring themes, specific exchanges, retrieval descends the hierarchy.
second, goal-relevant filtering (Conway's "working self"): your brain retrieves memories relevant to current goals, not whatever's closest in embedding space. fix: a dynamic representation of current goals and task context that gates retrieval.
third, emotional weighting (Damasio): emotionally significant experiences encode deeper and retrieve faster. ai agents store frustrated conversations with the same weight as routine queries. fix: sentiment-scored metadata on memory nodes that biases future behavior.
fourth, narrative coherence (Bruner): humans organize memories into a story maintaining consistent self across time. ai agents have zero narrative, each interaction exists independently. fix: a narrative layer synthesizing memories into a relational story that influences responses.
fifth, co-emergent self-model (Klein & Nichols): human identity and memory bootstrap each other through a feedback loop. ai agents have no self-model that evolves. fix: not just "what I know about this user" but "who I am in this relationship."
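a hypothetical scorer combining goal gating, emotional weighting, and recency within a coarse temporal hierarchy. the names, weights, and gating form are all illustrative assumptions, not an existing system:

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class Memory:
    embedding: np.ndarray    # unit-norm content vector
    emotion: float           # sentiment-scored significance in [0, 1]
    epoch: int               # interaction epoch (coarse level of the hierarchy)

def retrieve(memories, query, goal, current_epoch, k=3):
    def score(m):
        semantic = float(query @ m.embedding)            # plain content match
        goal_gate = max(0.0, float(goal @ m.embedding))  # "working self" filter
        recency = 1.0 / (1 + current_epoch - m.epoch)    # hierarchy proxy
        return semantic * (0.5 + 0.5 * goal_gate) + 0.3 * m.emotion + 0.2 * recency
    return sorted(memories, key=score, reverse=True)[:k]
```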
the fundamental problem isn't technical. it's conceptual. we've been modeling agent memory on databases. store, retrieve, done. but human memory is an identity construction system. it builds who you are, weights what matters, forgets what doesn't serve the current self, rewrites the narrative to maintain coherence. the paradigm shift: stop building agent memory as a retrieval system. start building it as an identity system.
every component has engineering analogs that already exist.
hierarchical memory = graph databases with temporal clustering.
emotional weighting = sentiment-scored metadata.
goal-relevant filtering = attention mechanisms conditioned on task state.
narrative coherence = periodic summarization with consistency constraints.
self-model bootstrapping = meta-learning loops on interaction history.
the pieces are there. what's missing is the conceptual framework to assemble them. psychology provides that framework.
the path forward isn't better embeddings or bigger context windows. it's looking inward. Conway showed memory is organized by the self, for the self. Damasio showed emotion is the guidance system. Rathbone showed memories cluster around identity transitions. Bruner showed narrative holds it together.
Klein and Nichols showed self and memory bootstrap each other into existence. if we're serious about building agents with functional memory, we should stop reading database architecture papers and start reading psychology journals.

Lute Lillo Portero retweeted

JUNE 2028.
The S&P is down 38% from its highs. Unemployment just printed 10.2%. Private credit is unraveling. Prime mortgages are cracking. AI didn’t disappoint. It exceeded every expectation.
What happened?
citriniresearch.com/p/2028gic

@GregKamradt Interesting, it may just be that research feels less like scaling for production and more like working with uncertainty: fast iteration, reproducibility, etc.
If you remember it, I’d still take a look, even if it’s not a bedtime read. Curious to see what it’s about.

@Lute47Lillo I haven’t found a great one that I like tbh
If you want a deep technical one, I have one, but it was unapproachable for me as a nighttime read. Gotta dig it up

As models got better at coding I found my appetite for technical textbooks went up, not down.
Since then I’m with Will: I care more about system design and good architecture choices than lower-level details
I imagine this too will change to managing tests and verifications
will brown @willccbb
with the latest models, i am now finding myself thinking about complex system design problems more, not less. the magnitude of what can be reasonably attempted is monumentally larger. you need to make sure you’re asking it to build the right thing.

