

Tomás Puig
@tomascooking
Founder & CEO of Alembic, recovering global CMO, angel at Test Kitchen Capital, 1st gen Cuban. I tweet in long delayed bursts as I’m usually too busy building.

The run on inference capacity is coming. You have been warned.

Introducing Multi-Head LatentMoE 🚀

Turns out, making NVIDIA's LatentMoE [1] multi-head further unlocks O(1), balanced, and deterministic communication.

Our insight: Head Parallel (HP). Move routing from before the all-to-all to after it. Token duplication happens locally. Always uniform, always deterministic.

HP works orthogonally to EP as a new dimension of parallelism. For example, use HP for the intra-cluster all-to-all as a highway, then use EP locally.

To handle the increased number of sub-tokens, we propose FlashAttention-like routing and expert computation, both exact, IO-aware, and constant-memory.

Results:
- We replicate LatentMoE and confirm it is indeed faster than MoE, with matching model performance (see Design Principle IV in [1]).
- Up to 1.61x faster training than MoE+EP with identical model performance.
- Higher model performance while still 1.11x faster with doubled granularity.

📄 Paper: arxiv.org/abs/2602.04870…
💻 Code: github.com/kerner-lab/Spa…

[1] Elango et al., "LatentMoE: Toward Optimal Accuracy per FLOP and Parameter in Mixture of Experts", 2026. arxiv.org/abs/2601.18089
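A rough toy sketch of the communication difference, as I understand the tweet (this is not the authors' code; the shapes, rank count, and the expert-to-rank mapping are my own assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

tokens = 1024
n_ranks, n_experts, top_k = 8, 64, 2
n_heads = 8  # latent heads, sharded across ranks in the head-parallel case

# --- Standard MoE + EP: the router decides before the all-to-all which
# expert (and therefore which rank) each token visits, so the per-rank
# message sizes are data-dependent and can be imbalanced.
router_logits = rng.standard_normal((tokens, n_experts))
topk_experts = np.argsort(router_logits, axis=1)[:, -top_k:]
dest_rank = topk_experts // (n_experts // n_ranks)   # assumed expert-id -> rank-id mapping
ep_msg_sizes = np.array([(dest_rank == r).sum() for r in range(n_ranks)])
print("EP tokens sent per destination rank   :", ep_msg_sizes)

# --- Head Parallel (as described above): each rank exchanges a fixed
# 1/n_ranks slice of the head dimension regardless of the router, so the
# all-to-all volume is constant, balanced, and deterministic; routing and
# token duplication then happen locally on the received sub-tokens.
hp_msg_sizes = np.full(n_ranks, tokens * (n_heads // n_ranks))
print("HP sub-tokens sent per destination rank:", hp_msg_sizes)
```

Running it prints an uneven EP vector that changes with the router next to a constant HP vector, which is the point of doing routing and duplication after the exchange rather than before.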

I don’t know any of the details but I continue to be surprised at how hard it’s been to get B200s working. My naive model of “new gen = magically faster” has failed me.
