Khurram Pirov

212 posts

Khurram Pirov

@KhurramCEO

PhD dropout MIT/MIPT CS/Quantum Physics • prev CV @ Samsung AI Bayesian Inference • Helping physical AI learn skills the way humans do

SF · Joined October 2021
288 Following · 328 Followers
Khurram Pirov@KhurramCEO·
Good post. I would add:
1) A lot needs to be done on the automatic QA side before going towards any post-processing.
2) Better and more open evaluation benchmarks, especially for trajectories.
3) Collection order: real-time QA L1 → post-process L1 → real-time QA without the pre-train, then either cut the stream or continue as you described (sketch below).
Post-processing and QA are so underestimated.
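One way to read the "real-time QA L1 / post-process L1" split in point 3 is as two gates around the collection stream: a cheap check that can cut a recording in real time, and a heavier offline check before anything reaches training. A minimal sketch of that idea; the Episode fields, thresholds, and function names are hypothetical, not taken from the thread.

```python
# Illustrative two-stage QA gate for collected trajectories.
# Episode fields, thresholds, and function names are hypothetical.
from dataclasses import dataclass

import numpy as np


@dataclass
class Episode:
    timestamps: np.ndarray   # (T,) seconds, one per frame
    joint_pos: np.ndarray    # (T, D) proprioception
    dropped_frames: int


def realtime_qa_l1(ep: Episode, max_drop: int = 5, max_jitter_s: float = 0.05) -> bool:
    """Cheap checks that can run while the stream is still recording."""
    dts = np.diff(ep.timestamps)
    jitter = np.abs(dts - np.median(dts))
    return ep.dropped_frames <= max_drop and float(jitter.max()) < max_jitter_s


def postprocess_qa_l1(ep: Episode, max_jerk: float = 50.0) -> bool:
    """Heavier offline check: reject physically implausible trajectories."""
    dt = float(np.diff(ep.timestamps).mean())
    jerk = np.diff(ep.joint_pos, n=3, axis=0) / dt**3
    return float(np.abs(jerk).max()) < max_jerk


def keep(ep: Episode) -> bool:
    # Cut the stream early if the real-time gate fails; only then pay for post-processing.
    return realtime_qa_l1(ep) and postprocess_qa_l1(ep)
```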
0 replies · 0 reposts · 1 like · 104 views
Nikolaus West@NikolausWest·
There is a funny inversion with end-to-end robotics: you remove a lot of explicit perception and vision methods from inference but then still do them as part of data preparation for training.

For example, to scale dataset size and diversity, lots of teams are using human data to train policies. @GeneralistAI and @sundayrobotics are famously using UMI-style grippers, and @physical_int and @Tesla_Optimus have talked about how they train on egocentric data. To turn that data into high-quality robot-like trajectories for training, you might need to do camera calibration, hand pose tracking, image segmentation and in-painting, or even full 4D reconstruction. Generalist has reported running over 10k CPUs in their data prep pipeline.

These pipelines quickly become complex and hard to debug and manage, especially when using a makeshift data layer consisting of buckets of files plus a classic database. Basics like adding columns, extracting slices, and visual debugging are hard. Combining that with scalable read/write and incremental compute is even harder.

Modern data-driven robotics needs a new data layer that makes working with physical data as simple as editing a table. In the meantime, experiment loops will be slow and millions of dollars of data of questionable quality will be trained on.
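As a concrete reading of the "as simple as editing a table" point, a columnar table over episode metadata already makes the basics (adding columns, extracting slices) one-liners; the scalable read/write and incremental compute mentioned above are the hard parts left out here. A minimal sketch; the schema, URIs, and QA rule are made up for illustration.

```python
# Minimal sketch of a tabular data layer over trajectory metadata.
# Column names, URIs, and the QA rule are hypothetical; scalable read/write
# and incremental recompute (the hard parts) are not addressed here.
import pandas as pd

episodes = pd.DataFrame({
    "episode_id": ["ep_001", "ep_002"],
    "gripper": ["umi", "umi"],
    "duration_s": [42.0, 7.5],
    "video_uri": ["s3://bucket/ep_001.mp4", "s3://bucket/ep_002.mp4"],
})

# "Adding a column": attach a derived QA flag without touching the raw videos.
episodes["qa_passed"] = episodes["duration_s"].between(10.0, 120.0)

# "Extracting a slice": hand downstream jobs only the rows they should see.
train_split = episodes[episodes["qa_passed"]]

episodes.to_parquet("episodes.parquet", index=False)
```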
Nikolaus West tweet media
Nikolaus West@NikolausWest

x.com/i/article/2049…

4 replies · 24 reposts · 213 likes · 22.7K views
Khurram Pirov@KhurramCEO·
I just watched a bee fail 15 times in a row. Same move. No variation. Same failure. The exit was 5cm below. Most beings don’t get stuck because the world is hard… They get stuck because they don’t explore.
0 replies · 0 reposts · 0 likes · 54 views
Khurram Pirov@KhurramCEO·
Just launched a reincarnation mode in Claude to beat every benchmark, and it went wild — it says it has lived 50K lives
Khurram Pirov tweet media
0 replies · 0 reposts · 1 like · 67 views
Khurram Pirov@KhurramCEO·
What is the best AI concierge in the world?
0 replies · 0 reposts · 0 likes · 61 views
Khurram Pirov@KhurramCEO·
Robotics data is entering its fake MAU era. Everyone flexes hours. No one talks about quality.

India's hours ≠ real-world coverage. Camera-only ≠ multimodal. $100 ≠ Manus-level accuracy.

What actually matters:
- more Western-world data
- multimodal capture
- real-time QA + evals
- ground-truth sensor feedback
- 20k hrs/week
- low-drift sensor fusion
- tight loops with robotics labs

Most are buying traction. Very few are building infrastructure. Scale AI didn't win with tech first. They won with ops.
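One illustrative way to see why raw hours and coverage diverge: count distinct scene/task/modality cells instead of summing hours. The episode records and the notion of a "coverage cell" below are invented for the example, not a metric from the post.

```python
# Illustration that total hours and real coverage are different numbers.
# The episode records and the "coverage cell" definition are invented for the example.
from collections import Counter

episodes = [
    {"hours": 2.0, "scene": "kitchen_a", "task": "fold", "modalities": ("rgb",)},
    {"hours": 2.0, "scene": "kitchen_a", "task": "fold", "modalities": ("rgb",)},
    {"hours": 0.5, "scene": "garage_b", "task": "sort", "modalities": ("rgb", "tactile", "force_torque")},
]

total_hours = sum(e["hours"] for e in episodes)
cells = Counter((e["scene"], e["task"], e["modalities"]) for e in episodes)

# 4.5 hours on paper, but only 2 distinct scene/task/modality cells of coverage.
print(f"{total_hours:.1f} hours, {len(cells)} coverage cells")
```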
0 replies · 0 reposts · 1 like · 79 views
Khurram Pirov@KhurramCEO·
@rohanpaul_ai No real moats. Everyone is dumping prices in India. Useful data requires global coverage and multimodal capture (Manus-level quality for $100) - Not happening. Most startups are now just buying traction.
0 replies · 1 repost · 3 likes · 965 views
Rohan Paul@rohanpaul_ai·
India is quietly becoming a training floor for humanoid robots, with workers filming thousands of first-person hand tasks so AI systems can learn grasping, folding, sorting, and tool use. This story is really about how the humanoid robot boom still depends on cheap, repetitive human labor to teach machines basic physical skill.

The problem is that robots do not fail on big plans first; they fail on tiny physical details like grip angle, finger timing, slip correction, and object contact. That kind of knowledge is hard to code and expensive to collect. These labs capture that missing layer by putting cameras or sensors on people and recording ordinary actions as machine-readable motion examples.

The useful part is not the towel or box itself but the sequence: where the hand starts, how force changes, when fingers adjust, and how the body recovers from small mistakes. That gives robotics teams supervised data for models that map visual input to physical actions, which is much easier than hand-coding every movement rule. This is a story about how physical intelligence gets extracted before it gets automated.

---
quasa.io/media/the-hidden-hand-farms-of-india-fueling-the-ai-robot-revolution-with-human-motion
Rohan Paul tweet media
Rohan Paul@rohanpaul_ai

Indian factory workers wear head-mounted cameras to capture data for training robotics AI models. This image captures a blunt truth about robotics: teaching a machine to move in the real world is still painfully expensive.

What looks dystopian at first is also a clue about the bottleneck. Robots do not learn useful physical behavior from internet-scale text the way language models do. They need embodied data: hands reaching, wrists turning, objects slipping, fabric folding, tools resisting, people recovering from small mistakes in real time. That data is rare because reality is slow, messy, and costly. A robot fleet is expensive to buy, expensive to maintain, hard to supervise, and dangerous to scale in uncontrolled settings. Even teleoperation is costly, because every minute of human-guided movement requires hardware, operators, calibration, and failure recovery.

So companies go looking for the cheapest possible proxy for physical intelligence. First-person video from factory workers is not the same as robot action data, but it can still be valuable because it captures sequencing, posture, bimanual coordination, and the micro-adjustments that make real work look easy.

The frontier in robotics is not just better models. It is better pipelines for collecting reality itself. That is why warehouses, factories, kitchens, and repair benches matter so much: they are dense environments of repeated contact with the physical world, which is exactly what robots lack.

The unsettling part is that this turns human labor into training infrastructure twice over, first as work, then as data. And until embodied data becomes cheaper to gather than human motion is to record, robotics will keep learning from workers before it fully replaces them.

50 replies · 363 reposts · 1.3K likes · 226.3K views
Khurram Pirov@KhurramCEO·
@grok here are the parameters of the recent earthquake:
Magnitude 5.1 · 1 mile from Boulder Creek, CA · 1:41:25 AM
37.122°N 122.107°W · 10.9 km depth
What is the minimum magnitude that could significantly damage a 10-story building in San Francisco?
Khurram Pirov tweet media
1 reply · 0 reposts · 0 likes · 242 views
Khurram Pirov@KhurramCEO·
SF, earthquake? Felt it so hard I immediately grabbed all my docs.
Khurram Pirov tweet media
1 reply · 0 reposts · 5 likes · 1.8K views
Khurram Pirov@KhurramCEO·
Introducing the engineering guide to Active Inference.

Physical AI is moving from imitation to learned intuition. Foundation models are a remarkable perceptual layer: powerful priors, broad knowledge, strong pattern recognition. But perception alone is not enough for real-world deployment. The physical world doesn't wait. It shifts, drifts, and surprises.

To act reliably under real-world uncertainty, you need more than prediction. You need a system that knows what it doesn't know — and acts accordingly. That is what Active Inference adds: a single principled objective that sits above the perceptual layer, unifying learning and action, where the agent actively reduces uncertainty rather than assuming it away.

For those familiar with JEPA: set the epistemic term to zero and you recover JEPA. Add it back, and your system goes from "I predict" to "I know what I don't know."

Active Inference has been around for over a decade. Yet to the best of our knowledge, no paper explains it clearly from an engineering perspective — until now. Friston's Ecosystems paper outlined the research agenda. This is the engineering companion, translating Active Inference into practical implementation, with reactive message passing as the realization. Friston wrote the vision. Bert wrote the manual.

xlabrobotics.com/research/2603.…
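For reference, the epistemic term mentioned above is usually written as part of the expected free energy of a policy. A standard textbook-style decomposition (not taken from the linked guide) looks like this; dropping the second term leaves a purely predictive, preference-matching objective, which is the "I predict" regime contrasted with "I know what I don't know."

```latex
% Expected free energy of a policy \pi at a future time \tau (standard discrete form):
% a pragmatic (preference-matching) term plus an epistemic (information-gain) term.
G(\pi) =
  \underbrace{-\,\mathbb{E}_{q(o_\tau \mid \pi)}\big[\ln p(o_\tau \mid C)\big]}_{\text{pragmatic: realize preferred outcomes}}
  \;-\;
  \underbrace{\mathbb{E}_{q(o_\tau \mid \pi)}\Big[ D_{\mathrm{KL}}\big(q(s_\tau \mid o_\tau, \pi)\,\big\|\,q(s_\tau \mid \pi)\big) \Big]}_{\text{epistemic: expected information gain about states}}
% Set the epistemic term to zero and only prediction against preferences remains.
```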
1 reply · 2 reposts · 3 likes · 235 views
Khurram Pirov@KhurramCEO·
Classical Active Inference does not work on real hardware. It is a powerful theory, but robotics requires fast Bayesian methods and a lot of engineering hacks. Still, it remains the closest framework to brain-inspired intelligence that can work in Physical AI. I would also mention Bayesian RL, since the lines between these approaches are increasingly blurring.
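As one example of the kind of fast Bayesian machinery that does run on real hardware, the precision-weighted update below is the scalar Kalman-filter step that most practical filtering stacks reduce to; it is an illustrative sketch, not code from any particular Active Inference implementation.

```python
# One-step Gaussian belief update: the scalar Kalman-filter step that most
# practical "fast Bayesian" robotics stacks reduce to. Illustrative only.
def update_belief(mu, var, obs, obs_var):
    """Fuse prior belief N(mu, var) with a noisy observation N(obs, obs_var)."""
    gain = var / (var + obs_var)        # precision-weighted gain
    mu_post = mu + gain * (obs - mu)    # move toward the prediction error
    var_post = (1.0 - gain) * var       # belief sharpens after seeing data
    return mu_post, var_post


mu, var = 0.0, 1.0                      # prior over, say, a joint angle in radians
mu, var = update_belief(mu, var, obs=0.12, obs_var=0.05)
print(mu, var)                          # posterior pulled toward the observation
```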
0 replies · 0 reposts · 0 likes · 85 views
Zhuokai Zhao@zhuokaiz·
AMI Labs just raised $1.03B. World Labs raised $1B a few weeks earlier. Both are betting on world models. But almost nobody means the same thing by that term. Here are, in my view, five categories of world models.

---

1. Joint Embedding Predictive Architecture (JEPA)
Representatives: AMI Labs (@ylecun), V-JEPA 2

The central bet here is that pixel reconstruction alone is an inefficient objective for learning the abstractions needed for physical understanding. LeCun has been saying this for years — predicting every pixel of the future is intractable in any stochastic environment. JEPA sidesteps this by predicting in a learned latent space instead.

Concretely, JEPA trains an encoder that maps video patches to representations, then a predictor that forecasts masked regions in that representation space — not in pixel space. This is a crucial design choice. A generative model that reconstructs pixels is forced to commit to low-level details (exact texture, lighting, leaf position) that are inherently unpredictable. By operating on abstract embeddings, JEPA can capture "the ball will fall off the table" without having to hallucinate every frame of it falling.

V-JEPA 2 is the clearest large-scale proof point so far. It's a 1.2B-parameter model pre-trained on 1M+ hours of video via self-supervised masked prediction — no labels, no text. The second training stage is where it gets interesting: just 62 hours of robot data from the DROID dataset is enough to produce an action-conditioned world model that supports zero-shot planning. The robot generates candidate action sequences, rolls them forward through the world model, and picks the one whose predicted outcome best matches a goal image. This works on objects and environments never seen during training.

The data efficiency is the real technical headline. 62 hours is almost nothing. It suggests that self-supervised pre-training on diverse video can bootstrap enough physical prior knowledge that very little domain-specific data is needed downstream. That's a strong argument for the JEPA design — if your representations are good enough, you don't need to brute-force every task from scratch.

AMI Labs is LeCun's effort to push this beyond research. They're targeting healthcare and robotics first, which makes sense given JEPA's strength in physical reasoning with limited data. But this is a long-horizon bet — their CEO has openly said commercial products could be years away.

---

2. Spatial Intelligence (3D World Models)
Representative: World Labs (@drfeifei)

Where JEPA asks "what will happen next," Fei-Fei Li's approach asks "what does the world look like in 3D, and how can I build it?" The thesis is that true understanding requires explicit spatial structure — geometry, depth, persistence, and the ability to re-observe a scene from novel viewpoints — not just temporal prediction. This is a different bet from JEPA: rather than learning abstract dynamics, you learn a structured 3D representation of the environment that you can manipulate directly.

Their product Marble generates persistent 3D environments from images, text, video, or 3D layouts. "Persistent" is the key word — unlike a video generation model that produces a linear sequence of frames, Marble's outputs are actual 3D scenes with spatial coherence. You can orbit the camera, edit objects, export meshes. This puts it closer to a 3D creation tool than to a predictive model, which is deliberate.

For context, this builds on a lineage of neural 3D representation work (NeRFs, 3D Gaussian Splatting) but pushes toward generation rather than reconstruction. Instead of capturing a real scene from multi-view photos, Marble synthesizes plausible new scenes from sparse inputs. The challenge is maintaining physical plausibility — consistent geometry, reasonable lighting, sensible occlusion — across a generated world that never existed.

---

3. Learned Simulation (Generative Video + Latent-Space RL)
Representatives: Google DeepMind (Genie 3, Dreamer V3/V4), Runway GWM-1

This category groups two lineages that are rapidly converging: generative video models that learn to simulate interactive worlds, and RL agents that learn world models to train policies in imagination.

The video generation lineage. DeepMind's Genie 3 is the purest version — text prompt in, navigable environment out, 24 fps at 720p, with consistency for a few minutes. Rather than relying on an explicit hand-built simulator, it learns interactive dynamics from data. The key architectural property is autoregressive generation conditioned on user actions: each frame is generated based on all previous frames plus the current input (move left, look up, etc.). This means the model must maintain an implicit spatial memory — turn away from a tree and turn back, and it needs to still be there. DeepMind reports consistency up to about a minute, which is impressive but still far from what you'd need for sustained agent training. Runway's GWM-1 takes a similar foundation — autoregressive frame prediction built on Gen-4.5 — but splits into three products: Worlds, Robotics, and Avatars. The split into Worlds / Avatars / Robotics suggests the practical generality problem is still being decomposed by action space and use case.

The RL lineage. The Dreamer series has the longer intellectual history. The core idea is clean: learn a latent dynamics model from observations, then roll out imagined trajectories in latent space and optimize a policy via backpropagation through the model's predictions. The agent never needs to interact with the real environment during policy learning. Dreamer V3 was the first AI to get diamonds in Minecraft without human data. Dreamer 4 did the same purely offline — no environment interaction at all. Architecturally, Dreamer 4 moves from Dreamer's earlier recurrent-style lineage to a more scalable transformer-based world-model recipe, and introduced "shortcut forcing" — a training objective that lets the model jump from noisy to clean predictions in just 4 steps instead of the 64 typical in diffusion models. This is what makes real-time inference on a single H100 possible.

These two sub-lineages used to feel distinct: video generation produces visual environments, while RL world models produce trained policies. But Dreamer 4 blurred the line — humans can now play inside its world model interactively, and Genie 3 is being used to train DeepMind's SIMA agents. The convergence point is that both need the same thing: a model that can accurately simulate how actions affect environments over extended horizons.

The open question for this whole category is one LeCun keeps raising: does learning to generate pixels that look physically correct actually mean the model understands physics? Or is it pattern-matching appearance? Dreamer 4's ability to get diamonds in Minecraft from pure imagination is a strong empirical counterpoint, but it's also a game with discrete, learnable mechanics — the real world is messier.

---

4. Physical AI Infrastructure (Simulation Platform)
Representative: NVIDIA Cosmos

NVIDIA's play is: don't build the world model, build the platform everyone else uses to build theirs. Cosmos launched at CES January 2025 and covers the full stack — data curation pipeline (process 20M hours of video in 14 days on Blackwell, vs. 3+ years on CPU), a visual tokenizer with 8x better compression than prior SOTA, model training via NeMo, and deployment through NIM microservices.

The pre-trained world foundation models are trained on 9,000 trillion tokens from 20M hours of real-world video spanning driving, industrial, robotics, and human activity data. They come in two architecture families: diffusion-based (operating on continuous latent tokens) and autoregressive transformer-based (next-token prediction on discretized tokens). Both can be fine-tuned for specific domains.

Three model families sit on top of this. Predict generates future video states from text, image, or video inputs — essentially video forecasting that can be post-trained for specific robot or driving scenarios. Transfer handles sim-to-real domain adaptation, which is one of the persistent headaches in physical AI — your model works great in simulation but breaks in the real world due to visual and dynamics gaps. Reason (added at GTC 2025) brings chain-of-thought reasoning over physical scenes — spatiotemporal awareness, causal understanding of interactions, video Q&A.

---

5. Active Inference
Representative: VERSES AI (Karl Friston)

This is the outlier on the list — not from the deep learning tradition at all, but from computational neuroscience. Karl Friston's Free Energy Principle says intelligent systems continuously generate predictions about their environment and act to minimize surprise (technically: variational free energy, an upper bound on surprise). Where standard RL is usually framed around reward maximization, active inference frames behavior as minimizing variational / expected free energy, which blends goal-directed preferences with epistemic value. This leads to natural exploration behavior: the agent is drawn to situations where it's uncertain, because resolving uncertainty reduces free energy.

VERSES built AXIOM (Active eXpanding Inference with Object-centric Models) on this foundation. The architecture is fundamentally different from neural network world models. Instead of learning a monolithic function approximator, AXIOM maintains a structured generative model where each entity in the environment is a discrete object with typed attributes and relations. Inference is Bayesian — beliefs are probability distributions that get updated via message passing, not gradient descent. This makes it interpretable (you can inspect what the agent believes about each object), compositional (add a new object type without retraining), and extremely data-efficient.

In their robotics work, they've shown a hierarchical multi-agent setup where each joint of a robot arm is its own active inference agent. The joint-level agents handle local motor control while higher-level agents handle task planning, all coordinating through shared beliefs in a hierarchy. The whole system adapts in real time to unfamiliar environments without retraining — you move the target object and the agent re-plans immediately, because it's doing online inference, not executing a fixed policy.

They shipped a commercial product (Genius) in April 2025, and the AXIOM benchmarks against RL baselines are competitive on standard control tasks while using orders of magnitude less data.

---

imo, these five categories aren't really competing — they're solving different sub-problems. JEPA compresses physical understanding. Spatial intelligence reconstructs 3D structure. Learned simulation trains agents through generated experience. NVIDIA provides the picks and shovels. Active inference offers a fundamentally different computational theory of intelligence. My guess is the lines between them blur fast.
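To make the V-JEPA 2-style planning loop described in category 1 concrete: sample candidate action sequences, roll each forward through the learned latent dynamics, and execute the sequence whose predicted embedding lands closest to the goal embedding. The sketch below uses random stand-in models and made-up shapes; it shows the control loop only, not the actual V-JEPA 2 interfaces.

```python
# Sketch of a V-JEPA-2-style planning loop: score candidate action sequences by
# rolling them through a latent dynamics model and comparing to a goal embedding.
# The "models" here are random stand-ins; shapes and names are made up.
import numpy as np

rng = np.random.default_rng(0)
LATENT, ACT, HORIZON, CANDIDATES = 32, 4, 8, 64
W_dyn = rng.normal(scale=0.1, size=(LATENT + ACT, LATENT))  # stand-in dynamics weights


def dynamics(z, a):
    """Predict the next latent state from (latent, action); placeholder linear model."""
    return np.tanh(np.concatenate([z, a]) @ W_dyn)


def plan(z_start, z_goal):
    """Return the sampled action sequence whose predicted final latent is closest to the goal."""
    best_cost, best_seq = np.inf, None
    for _ in range(CANDIDATES):
        actions = rng.uniform(-1.0, 1.0, size=(HORIZON, ACT))
        z = z_start
        for a in actions:                    # roll the candidate forward in latent space
            z = dynamics(z, a)
        cost = float(np.linalg.norm(z - z_goal))
        if cost < best_cost:
            best_cost, best_seq = cost, actions
    return best_seq


z_start, z_goal = rng.normal(size=LATENT), rng.normal(size=LATENT)
actions = plan(z_start, z_goal)
print(actions.shape)  # (HORIZON, ACT): execute the first action, then replan
```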
59 replies · 241 reposts · 1.5K likes · 320K views
Khurram Pirov@KhurramCEO·
Thanks for mentioning Active Inference. At xlabrobotics.com, we like what @ylecun is pushing with JEPA and believe that combining strong sensor priors with information-seeking via variational free energy minimization can significantly improve reasoning in robotics and help us move beyond brute-force imitation learning.
0 replies · 0 reposts · 3 likes · 164 views
Khurram Pirov@KhurramCEO·
Altman → alternative human
Anthropic → human-centered
Gemini → dual intelligence

Hideo Kojima is secretly writing the AI industry.
0 replies · 0 reposts · 0 likes · 106 views
Khurram Pirov@KhurramCEO·
@Ric_RTP A huge opportunity for brain/physics-inspired algorithms; nature calculates everything faster.
0 replies · 0 reposts · 0 likes · 70 views
Ricardo@Ric_RTP·
The biggest power grab since Standard Oil is happening today and almost nobody is paying attention. Tech companies are building their own power grid. They're about to produce more electricity than entire COUNTRIES.

Right now at the White House, CEOs from Amazon, Google, Meta, Microsoft, OpenAI, Oracle, and xAI are signing a pledge that most people will scroll past. But it might be the most important business deal of the decade. They're committing to build, bring, or buy 100% of their own electricity for every new AI data center. Their own power plants. Their own transmission lines. Their own energy infrastructure. These are SOFTWARE companies agreeing to become power utilities.

Here's why this matters for everyone reading this: By the end of this year, at least 5 US data centers will each consume over 1 gigawatt of continuous power. 1 gigawatt powers 850,000 homes. 5 of these facilities will use more electricity than some entire countries. The US grid physically cannot handle it. Capacity prices in the PJM grid, which covers 13 states, exploded from $28.92 per megawatt-day to $329.17 in just two years. That's roughly a 1,000% increase.

So what do you do when the grid can't support you? You stop using the grid. Amazon is buying nuclear reactors. Microsoft restarted Three Mile Island. Meta signed 20-year nuclear deals. Chevron is building a 2.5 gigawatt natural gas plant in West Texas specifically to power data centers. These companies aren't supplementing the grid. They're replacing it. For themselves.

Think about what's actually happening here: 7 companies now control more computing power than most governments. And today they're signing paperwork to control their own energy supply too. Computing. Data. Energy. Infrastructure. That's not a "tech" company anymore.

A Harvard energy law professor already called the pledge "meaningless" because utilities in PJM are spending tens of billions on power projects for data centers and those costs are STILL being spread across ratepayers anyway. The pledge has zero legal teeth. No enforcement mechanism. No compliance monitoring. No penalty for breaking it. It's a political move designed to get tech companies through the midterms without becoming the villain of every campaign ad about electricity bills.

But the underlying shift is real and irreversible: Tech companies are becoming energy companies. Energy companies are becoming AI infrastructure. And the line between Big Tech and Big Energy is about to disappear completely.

The big question here: When seven companies control both the world's intelligence AND the power that runs it, who exactly is governing who?
187 replies · 1.8K reposts · 2.7K likes · 146.5K views
Khurram Pirov@KhurramCEO·
A child does not generate a lot of new data; we learn good abstractions fast, get strong priors from ancestors, and then we learn how to learn. Validation and edge cases for policies do matter, but I think that instead of moving to a "predict the next token" strategy, it's better to focus on how to become less dependent on data.
Palatial@PalatialSim

A child consumes more data in 1 month than any LLM has ever seen. Embodied agents learn by doing, but the data that teaches them is tactile, sensorial and causal. Such data does not exist. To make physical AGI possible, we need to generate this new data at an industrial scale. Enter Palatial: automated infrastructure that converts raw data into sensory rich playgrounds for robots to learn in. Today, we’re unveiling Palatial PhysReady, the first automated sim asset generator (try it ⬇️) [1/5]

0 replies · 0 reposts · 0 likes · 98 views
Khurram Pirov@KhurramCEO·
Today at Alchemist Springs, District hosted a 42°F (5.5°C) cold plunge. Many people had never done a cold plunge before. Still, the moment it became a friendly challenge, everything changed. One first-timer stayed 5 minutes — and that set the bar. Social pressure + competition kicked in → 7, 8, 10 minutes, and then the most innocent guy — who'd never done it before — stayed for 15 minutes. I was genuinely surprised: even without experience, founders don't tap easily. In SF, that mindset shows up everywhere, even in ice water. ❄️
Khurram Pirov tweet media
0 replies · 0 reposts · 0 likes · 97 views
Khurram Pirov@KhurramCEO·
Burj Khalifa cost ≈ $1.5B
GTA 6 is rumored ≈ $2B

Yeah… "GenAI is rewriting the gaming industry."
0 replies · 0 reposts · 0 likes · 52 views
Khurram Pirov@KhurramCEO·
Looks cool. I'm especially curious about how you extract priors from simulation. Have you thought about the diversity of pressure distributions across different objects, and how much data we'd need to learn sufficiently abstract representations for different types of pressure interactions?
0 replies · 0 reposts · 0 likes · 58 views
Embodied AI Reading Notes@EmbodiedAIRead·
Contact-Aware Neural Dynamics
Project: changwei-jing.github.io/neural-physics/
Paper: arxiv.org/pdf/2601.12796

This project proposes a sim-to-real alignment framework that learns to directly align a simulator's dynamics with real-world contact information.
- How it works: Uses the off-the-shelf simulator as a base prior and learns a contact-aware neural dynamics model to refine simulated states using real-world observations.
- Why we need this: The authors show the learned forward dynamics improves state prediction accuracy and can be effectively used to refine policies trained in simulators.
- Pipeline: (1) First train a neural forward dynamics model in simulation using large-scale rollouts of a dexterous hand interacting with diverse objects under extensive domain randomization. (2) Then collect corresponding real-world trajectories, again including both successes and failures, augmented with tactile sensor readings, and fine-tune the simulation-only model with real-world data.
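The two-stage pipeline summarized above can be pictured as a residual-dynamics model: keep the simulator as the prior and learn a contact-aware correction from real observations, including tactile readings. The sketch below is schematic; the stand-in simulator, feature sizes, and untrained weights are placeholders, not the paper's actual architecture.

```python
# Schematic residual dynamics: next_state = simulator prior + learned contact-aware correction.
# The stand-in simulator, feature sizes, and untrained weights are placeholders.
import numpy as np

STATE, ACT, TACTILE, HIDDEN = 16, 6, 8, 64
rng = np.random.default_rng(0)
A = rng.normal(scale=0.05, size=(ACT, STATE))                     # stand-in sim dynamics
W1 = rng.normal(scale=0.1, size=(STATE + ACT + TACTILE, HIDDEN))  # correction net: stage 1 would pre-train in sim,
W2 = rng.normal(scale=0.1, size=(HIDDEN, STATE))                  # stage 2 would fine-tune on real tactile data


def sim_step(state, action):
    """Stand-in for the off-the-shelf simulator's forward dynamics (the base prior)."""
    return state + np.tanh(action @ A)


def corrected_step(state, action, tactile):
    """Refine the simulated next state with a learned residual conditioned on contact info."""
    x = np.concatenate([state, action, tactile])
    residual = np.tanh(x @ W1) @ W2
    return sim_step(state, action) + residual


s, a, t = np.zeros(STATE), np.zeros(ACT), np.zeros(TACTILE)
print(corrected_step(s, a, t).shape)  # (STATE,)
```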
Embodied AI Reading Notes tweet media
3 replies · 12 reposts · 116 likes · 8.5K views