Dhruva Chakravarthi

3.6K posts

@dhrude

Founder, Fighter, Coder, Writer | #Bitcoin

Bengaluru, India · Joined September 2009
2.9K Following · 2.5K Followers
Pinned Tweet
Dhruva Chakravarthi @dhrude
Today, I feel free. I’m happy to be back home in Bangalore, and what an amazing journey I’ve had. I’m extremely grateful to every single person in my life who’s helped me get here. Thank you, everyone. Thank you, Satoshi Nakamoto.
[image]
Dhruva Chakravarthi @dhrude
AIs are smart, but they miss one key variable: time.

If you've tried coding with AI, or even just had a long conversation with one, I'm sure you've been frustrated by how quickly it forgets things. Even the best AIs have the memory of a goldfish and use memory orchestration as a band-aid to give you the illusion of memory. Claude Code forgets context within 10 minutes of coding, Gemini treats every conversation as a fresh start, and ChatGPT remembers maybe 3 key things from each conversation at best. That's opened up opportunities for MCPs that try to solve this with a directory of note files, but it's just not good enough.

Solving this problem is imperative when it comes to vision. If you want to count objects in an image, no problem. But ask a model to remember what's happening over the course of a video, and it quickly falls apart.

This is where Temporal Context comes in. We architect VLMs so that they keep track of:
> Action recognition across time
> Behavioral pattern detection
> Causal reasoning ("this happened because of that")
> Intent and trajectory prediction

Here's how your brain does it: you walk into a coffee shop. In milliseconds, you're not just detecting objects. You're modeling trajectories, predicting collisions, reading intent, inferring social dynamics.
> "That person is about to bump into you, step aside."
> "The barista is overwhelmed with orders, let's give them a minute."
> "That table is about to free up, slowly approach it."

All of this requires time as an input modality. So time isn't just another dimension. It's THE dimension that separates perception from understanding. And solving it makes AI think in more human terms.

If your AI were capable of temporal reasoning, how would your use cases improve?
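To make "temporal context" concrete, here's a minimal Python sketch of one way to think about it: keep a rolling window of timestamped per-frame events that the model is always conditioned on, so "what led to this?" stays answerable. The class and field names here are mine for illustration, not from any particular VLM stack.

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class FrameEvent:
    """One timestamped observation extracted from a video frame."""
    t: float          # seconds since the start of the stream
    objects: list     # detected objects, e.g. ["person", "table"]
    action: str       # recognized action, e.g. "walking toward table"

class TemporalContext:
    """Rolling window of events a VLM can condition on.

    Instead of reasoning over a single frame, the model always sees
    the last `horizon` seconds of events, so causal questions across
    time remain answerable.
    """
    def __init__(self, horizon: float = 60.0):
        self.horizon = horizon
        self.events = deque()

    def add(self, event: FrameEvent) -> None:
        self.events.append(event)
        # Evict anything older than the horizon.
        while self.events and event.t - self.events[0].t > self.horizon:
            self.events.popleft()

    def as_prompt(self) -> str:
        """Serialize the window into text a VLM prompt can include."""
        return "\n".join(
            f"[t={e.t:.1f}s] {e.action} (objects: {', '.join(e.objects)})"
            for e in self.events
        )
```

Real systems push this into the model's architecture rather than the prompt, but the bookkeeping problem is the same: decide what to keep, for how long, and in what form.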
Dhruva Chakravarthi @dhrude
When people think Video + AI, they think generation, but there's so much more.

"AI video" doesn't mean what you think it means. When most people hear it, they picture Sora generating movie clips from text. But that's just the tip of the VLM iceberg. Let me break down the layers:

1. Text input → Video output (Generation): You type a prompt and it renders a video. These demos are impressive because they can cheat benchmarks and lean into your confirmation bias of "Hey, that's a pretty good video." But they're mostly limited to ~25 seconds max. They can't generate to specific styles and movements. Physics still breaks. And they're not production-ready for most use cases (which means there's still some job security for video editors and creators). Examples you know of are Sora, Runway, Veo.

2. Video input → Text output (Understanding): This is where things get interesting (at least for me). Feed a model hours of footage. Get structured analysis, summaries, action breakdowns. More than a regular human could grasp on their own. Not taking a podcast and giving you a bite-size summary, but taking an MMA fight and understanding body feints, creation of negative space, sequence strategies, and applications of fight IQ that even veterans may miss. This is what I'm building (a rough sketch of the pipeline follows below).

3. Video input → Video output (Transformation): Taking video A and asking what would happen if xyz happened, to generate video B. This combines Layer 1 (Generation) and Layer 2 (Understanding) at a different level. Imagine taking the iconic opening scene from Inglourious Basterds and asking how it would be transformed if Bollywood were making it. Or wondering how Real Madrid and FC Barcelona would have changed their game styles if Messi and Ronaldo had switched sides. This requires deep understanding of concepts paired with highly controllable generation.

4. Video input → Physics output (Superhuman Vision): We take for granted how easily we comprehend what we see. We can gauge depth because we have two eyes a specific distance apart (stereoscopic vision). We have a familiarized understanding of materials, textures, and Newtonian laws, so we can tell when an overdramatic action scene from an anime or a Tamil action movie is physically uncanny. Doing this with a computer takes a lot of deep understanding of the world to supplement and cross-check with vision. It took video game developers (all on NVIDIA graphics cards, too) decades to work on physics simulation for elements we're highly accustomed to, like water, fire, earth, and air, and it's still not perfect. This is a decade-long problem to work on, which makes the VLM horizon vast. When we crack this, we can reconstruct 3D mental models grounded in real-time physics from simple 2D videos.

Currently, we are still in Layer 1 and making some progress on Layer 2. What are you looking forward to from this?
[image]
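As a rough sketch of what a Layer 2 (video → text) pipeline can look like: sample frames at a fixed interval, then hand them to a multimodal model alongside a question. Frame sampling below uses OpenCV; `query_vlm` is a hypothetical stand-in, since the real call depends entirely on whichever provider or SDK you use.

```python
import cv2  # pip install opencv-python

def query_vlm(frames: list, question: str) -> str:
    """Hypothetical stand-in for a multimodal model call; wire this
    to your VLM of choice (the real signature will differ)."""
    raise NotImplementedError("connect to an actual VLM backend")

def sample_frames(path: str, every_n_seconds: float = 2.0) -> list:
    """Pull one frame every `every_n_seconds` from a video file."""
    cap = cv2.VideoCapture(path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
    step = max(1, int(fps * every_n_seconds))
    frames, i = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if i % step == 0:
            frames.append(frame)
        i += 1
    cap.release()
    return frames

def analyze_fight(path: str) -> str:
    frames = sample_frames(path)
    return query_vlm(
        frames,
        question="Break down the feints, use of negative space, "
                 "and sequencing strategy across this footage.",
    )
```

The sampling interval is the first real design decision: sample too sparsely and you lose the temporal signal, too densely and the token costs explode.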
Dhruva Chakravarthi @dhrude
Everyone's talking about an AI bubble. That might apply to LLMs and their wrapper products. Meanwhile, there's silent and slow growth happening in a space most people haven't even heard of: Vision Language Models (VLMs). These aren't just simple image/video generators like Sora or Veo, but models that can watch video, actually understand what's happening, and do a lot more with that visual knowledge.

Here's what most people miss about VLMs:

1. They don't just "see" images. They reason across time, tracking objects, actions, and context frame by frame. Time is the key variable, and we're still working on making models pay attention for a long time.

2. They're not just video generators; that's one side of VLMs that doesn't paint the whole picture. The hard problem isn't making video, it's understanding it, together with real-world elements like mathematics and physics, and knowing what's happening across frames the way our brains instinctively do.

3. They're already outperforming traditional CV (Computer Vision) on complex tasks like action recognition, anomaly detection, and temporal reasoning, but they often work hand-in-hand with those tools. So architectures are complicated, and wrapper products are not so easy to build.

4. Token economics are brutal. A 60-second video can be 1,000x more expensive to process than a text prompt because of how much data we carry across frames. A picture does say a thousand words, and 60 seconds at film's standard 24 fps is about 1,440 frames. Do the math (a rough version is sketched below).

5. Benchmarks are broken. We've mostly evaluated video generation with the 'Will Smith eating spaghetti' test, and we're still in the process of building comprehensive evals for both quantitative and qualitative measurable outputs.

6. Understanding humans is far more complicated than we think. We instinctively know how humans look and move, even in highly occluded scenarios, but VLMs struggle here a lot. Remember all those generated images of people with the wrong number of fingers? Scale that across a long video.

7. Long-horizon understanding is still in the works. Most models lose coherence after 20+ seconds, which is often the length limit of the generated videos we see, and we're currently working on understanding 20+ minute videos with consistently high attention.

8. This is where the moat will be built. Whoever solves temporal reasoning at scale wins the next decade of multimodal AI. And it's not going to be easy.

I've spent most of this year researching and building in this space, and I'll be sharing a lot of my learnings with y'all as I go along. The gap between product demos and production-ready VLMs is massive. But the upside? Even bigger. I've been working on use cases in the realms of sports and public safety, and I can't wait to share them with you!

Now that you know a little bit more about VLMs, what domains do you think will be transformed first by video understanding AI?
[image]
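Here's the back-of-the-envelope math behind point 4. The 24 fps rate matches the 1,440-frames-per-minute figure above; the tokens-per-frame and text-prompt sizes are illustrative assumptions, since real numbers vary by model and tokenizer.

```python
# Rough, illustrative numbers only; real tokenizers and frame rates vary.
FPS = 24                   # standard film frame rate
TOKENS_PER_FRAME = 256     # e.g. a ViT-style encoder emitting a 16x16 patch grid
TEXT_PROMPT_TOKENS = 350   # a generous text-only prompt

seconds = 60
frames = FPS * seconds                     # 24 * 60 = 1,440 frames
video_tokens = frames * TOKENS_PER_FRAME   # 1,440 * 256 = 368,640 tokens

print(f"{frames=} {video_tokens=}")
print(f"video / text cost ratio: {video_tokens / TEXT_PROMPT_TOKENS:.0f}x")
# -> roughly 1,000x the tokens of a text prompt, before any sampling tricks
```

Frame subsampling and token compression pull that ratio down in practice, which is exactly why the architectures in point 3 get complicated.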
Shenkii @Ox_SBM
We teamed up with @base & @BaseIndia to bring some serious ⚽ energy to the field! 
What a game, what a vibe — onchain or off, the BASE spirit always wins. 💙 
And guess what? Team @HeyElsaAI took the win! WE ARE BASED.
Dhruva Chakravarthi @dhrude
So does this mean Indians can work remotely without a visa for the same salary and cost the company $100k less?
Dhruva Chakravarthi @dhrude
Sleep Token is a phenomenon. Girls are dancing and singing along. Guys are headbanging and counting polyrhythms. They have the mystique and metal of Slipknot and the heartthrob appeal and theatrics of The Weeknd. Their concerts are rituals, their stage productions are immaculate, their costumes are ornate, and their performance is hypnotizing. I watched them with 100,000 people and was able to tune out everything and get sucked into their trance. What an experience.
Dhruva Chakravarthi retweeted
Crypto India @CryptooIndia
BREAKING: 🇮🇳 BJP National Spokesperson Pradeep Bhandari proposes an Indian Rupee stablecoin, backed 1:1 by Govt bonds, to:
1. Enable seamless cross-border payments
2. Reduce remittance costs
3. Strengthen INR's global role
[2 images]
Dhruva Chakravarthi @dhrude
Global entropy is at an all-time high and a hockey-stick breakout is imminent. How do I long this?
Bryan Johnson @bryan_johnson
Sitting 4+ hours a day increases your risk of death.
+ sitting 4-8 hours/day: +12% increased risk
+ sitting 8-11 hours/day: +27% increased risk
+ sitting >11 hours/day: +48% increased risk
Here are the study details and how to measure your own sitting habits.
Dhruva Chakravarthi @dhrude
@hasanransari It's an Android with a wallet app that (I guess) has an extra hardware chip to work with. It has a nice dApp store, so maybe you can play around with that and build a super cool Solana mobile app :)
Bryan Johnson @bryan_johnson
I'm going undercover, what should I put in the box? wrong answers only
[image]
Paul Finney @paulfinneyx
Gradients of SF sky
[image]
Sai Lakshmi @DusiSailakshmi
Everyone in India wants to move to Bangalore. Everyone in Bangalore wants to move to SF. Now the real question is… where do people in SF want to move? 👀
Dhruva Chakravarthi @dhrude
Bitchat community popping off on the West Coast 😂
[image]
Dhruva Chakravarthi @dhrude
I’ve driven more in the past 24 hours in the US than I’ve driven in the last 10 years in Bangalore. @TonyCatoff - I wanna see a video of your driving experiences in the uru 🙃