Dhruva Chakravarthi

3.6K posts

@dhrude

Founder, Fighter, Coder, Writer | #Bitcoin

Bengaluru, India · Joined September 2009
2.9K Following · 2.5K Followers
Pinned Tweet
Dhruva Chakravarthi @dhrude
Today, I feel free. I’m happy to be back home in Bangalore, and what an amazing journey I’ve had. I’m extremely grateful to every single person in my life who’s helped me get here. Thank you, everyone. Thank you, Satoshi Nakamoto.
[image]
Dhruva Chakravarthi @dhrude
AIs are smart, but they miss one key variable: time.

If you've tried coding with AI, or even just had a long conversation with one, I'm sure you've been frustrated by how quickly it forgets things. Even the best AIs have the memory of a goldfish and use memory orchestration as a band-aid to give you the illusion of memory. Claude Code forgets context within 10 minutes of coding, Gemini treats every conversation as a fresh start, and ChatGPT remembers maybe 3 key things from each conversation at best. That's opened up opportunities for MCPs that try to solve this with a directory of note files, but it's just not good enough.

Solving this problem is imperative when it comes to vision. If you want to count objects in an image, no problem. But ask a model to remember what's happening over the course of a video, and it quickly falls apart.

This is where Temporal Context comes in. We architect VLMs so that they keep track of:
> Action recognition across time
> Behavioral pattern detection
> Causal reasoning ("this happened because of that")
> Intent and trajectory prediction

Here's how your brain does it: you walk into a coffee shop. In milliseconds, you're not just detecting objects. You're modeling trajectories, predicting collisions, reading intent, inferring social dynamics.
> "That person is about to bump into you, step aside."
> "The barista is overwhelmed with orders, let's give them a minute."
> "That table is about to free up, slowly approach it."

All of this requires time as an input modality. So time isn't just another dimension. It's THE dimension that separates perception from understanding. And solving it makes AI think in more human terms.

If your AI were capable of temporal reasoning, how would your use cases improve?
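To make "temporal context" concrete, here's a minimal Python sketch of one way to think about it: keep a rolling window of timestamped per-frame events that the model is always conditioned on, so "what led to this?" stays answerable. The class and field names here are mine for illustration, not from any particular VLM stack.

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class FrameEvent:
    """One timestamped observation extracted from a video frame."""
    t: float          # seconds since the start of the stream
    objects: list     # detected objects, e.g. ["person", "table"]
    action: str       # recognized action, e.g. "walking toward table"

class TemporalContext:
    """Rolling window of events a VLM can condition on.

    Instead of reasoning over a single frame, the model always sees
    the last `horizon` seconds of events, so causal questions across
    time remain answerable.
    """
    def __init__(self, horizon: float = 60.0):
        self.horizon = horizon
        self.events = deque()

    def add(self, event: FrameEvent) -> None:
        self.events.append(event)
        # Evict anything older than the horizon.
        while self.events and event.t - self.events[0].t > self.horizon:
            self.events.popleft()

    def as_prompt(self) -> str:
        """Serialize the window into text a VLM prompt can include."""
        return "\n".join(
            f"[t={e.t:.1f}s] {e.action} (objects: {', '.join(e.objects)})"
            for e in self.events
        )
```

Real systems push this into the model's architecture rather than the prompt, but the bookkeeping problem is the same: decide what to keep, for how long, and in what form.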
Dhruva Chakravarthi @dhrude
When people think Video + AI, they think generation, but there's so much more.

"AI video" doesn't mean what you think it means. When most people hear it, they picture Sora generating movie clips from text. But that's just the tip of the VLM iceberg. Let me break down the layers:

1. Text input → Video output (Generation): You type a prompt and it renders a video. These demos are impressive because they can cheat benchmarks and lean into your confirmation bias of "Hey, that's a pretty good video." But they're mostly limited to ~25 seconds max. They can't generate to specific styles and movements. Physics still breaks. And they're not production-ready for most use cases (which means there's still some job security for video editors and creators). Examples you know of are Sora, Runway, Veo.

2. Video input → Text output (Understanding): This is where things get interesting (at least for me). Feed a model hours of footage. Get structured analysis, summaries, action breakdowns. More than a regular human could grasp on their own. Not taking a podcast and giving you a bite-size summary, but taking an MMA fight and understanding body feints, creation of negative space, sequence strategies, and applications of fight IQ that even veterans may miss. This is what I'm building (a rough sketch of the pipeline follows below).

3. Video input → Video output (Transformation): Taking video A and asking what would happen if xyz happened, to generate video B. This combines Layer 1 (Generation) and Layer 2 (Understanding) at a different level. Imagine taking the iconic opening scene from Inglourious Basterds and asking how it would be transformed if Bollywood were making it. Or wondering how Real Madrid and FC Barcelona would have changed their game styles if Messi and Ronaldo had switched sides. This requires deep understanding of concepts paired with highly controllable generation.

4. Video input → Physics output (Superhuman Vision): We take for granted how easily we comprehend what we see. We can gauge depth because we have two eyes a specific distance apart (stereoscopic vision). We have a familiarized understanding of materials, textures, and Newtonian laws, so we can tell when an overdramatic action scene from an anime or a Tamil action movie is physically uncanny. Doing this with a computer takes a lot of deep understanding of the world to supplement and cross-check with vision. It took video game developers (all on NVIDIA graphics cards, too) decades to work on physics simulation for elements we're highly accustomed to, like water, fire, earth, and air, and it's still not perfect. This is a decade-long problem to work on, which makes the VLM horizon vast. When we crack this, we can reconstruct 3D mental models grounded in real-time physics from simple 2D videos.

Currently, we are still in Layer 1 and making some progress on Layer 2. What are you looking forward to from this?
[image]
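As a rough sketch of what a Layer 2 (video → text) pipeline can look like: sample frames at a fixed interval, then hand them to a multimodal model alongside a question. Frame sampling below uses OpenCV; `query_vlm` is a hypothetical stand-in, since the real call depends entirely on whichever provider or SDK you use.

```python
import cv2  # pip install opencv-python

def query_vlm(frames: list, question: str) -> str:
    """Hypothetical stand-in for a multimodal model call; wire this
    to your VLM of choice (the real signature will differ)."""
    raise NotImplementedError("connect to an actual VLM backend")

def sample_frames(path: str, every_n_seconds: float = 2.0) -> list:
    """Pull one frame every `every_n_seconds` from a video file."""
    cap = cv2.VideoCapture(path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
    step = max(1, int(fps * every_n_seconds))
    frames, i = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if i % step == 0:
            frames.append(frame)
        i += 1
    cap.release()
    return frames

def analyze_fight(path: str) -> str:
    frames = sample_frames(path)
    return query_vlm(
        frames,
        question="Break down the feints, use of negative space, "
                 "and sequencing strategy across this footage.",
    )
```

The sampling interval is the first real design decision: sample too sparsely and you lose the temporal signal, too densely and the token costs explode.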
Dhruva Chakravarthi @dhrude
Everyone's talking about an AI bubble. That might apply to LLMs and their wrapper products. Meanwhile, there's silent and slow growth happening in a space most people haven't even heard of: Vision Language Models (VLMs). These aren't just simple image/video generators like Sora or Veo, but models that can watch video, actually understand what's happening, and do a lot more with that visual knowledge.

Here's what most people miss about VLMs:

1. They don't just "see" images. They reason across time, tracking objects, actions, and context frame by frame. Time is the key variable, and we're still working on making models pay attention for a long time.

2. They're not just video generators; that's one side of VLMs that doesn't paint the whole picture. The hard problem isn't making video, it's understanding it, together with real-world elements like mathematics and physics, and knowing what's happening across frames the way our brains instinctively do.

3. They're already outperforming traditional CV (Computer Vision) on complex tasks like action recognition, anomaly detection, and temporal reasoning, but they often work hand-in-hand with those tools. So architectures are complicated, and wrapper products are not so easy to build.

4. Token economics are brutal. A 60-second video can be 1,000x more expensive to process than a text prompt because of how much data we carry across frames. A picture does say a thousand words, and 60 seconds at film's standard 24 fps is about 1,440 frames. Do the math (a rough version is sketched below).

5. Benchmarks are broken. We've mostly evaluated video generation with the 'Will Smith eating spaghetti' test, and we're still in the process of building comprehensive evals for both quantitative and qualitative measurable outputs.

6. Understanding humans is far more complicated than we think. We instinctively know how humans look and move, even in highly occluded scenarios, but VLMs struggle here a lot. Remember all those generated images of people with the wrong number of fingers? Scale that across a long video.

7. Long-horizon understanding is still in the works. Most models lose coherence after 20+ seconds, which is often the length limit of the generated videos we see, and we're currently working on understanding 20+ minute videos with consistently high attention.

8. This is where the moat will be built. Whoever solves temporal reasoning at scale wins the next decade of multimodal AI. And it's not going to be easy.

I've spent most of this year researching and building in this space, and I'll be sharing a lot of my learnings with y'all as I go along. The gap between product demos and production-ready VLMs is massive. But the upside? Even bigger. I've been working on use cases in the realms of sports and public safety, and I can't wait to share them with you!

Now that you know a little bit more about VLMs, what domains do you think will be transformed first by video understanding AI?
[image]
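Here's the back-of-the-envelope math behind point 4. The 24 fps rate matches the 1,440-frames-per-minute figure above; the tokens-per-frame and text-prompt sizes are illustrative assumptions, since real numbers vary by model and tokenizer.

```python
# Rough, illustrative numbers only; real tokenizers and frame rates vary.
FPS = 24                   # standard film frame rate
TOKENS_PER_FRAME = 256     # e.g. a ViT-style encoder emitting a 16x16 patch grid
TEXT_PROMPT_TOKENS = 350   # a generous text-only prompt

seconds = 60
frames = FPS * seconds                     # 24 * 60 = 1,440 frames
video_tokens = frames * TOKENS_PER_FRAME   # 1,440 * 256 = 368,640 tokens

print(f"{frames=} {video_tokens=}")
print(f"video / text cost ratio: {video_tokens / TEXT_PROMPT_TOKENS:.0f}x")
# -> roughly 1,000x the tokens of a text prompt, before any sampling tricks
```

Frame subsampling and token compression pull that ratio down in practice, which is exactly why the architectures in point 3 get complicated.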
Shenkii @Ox_SBM
We teamed up with @base & @BaseIndia to bring some serious ⚽ energy to the field! 
What a game, what a vibe — onchain or off, the BASE spirit always wins. 💙 
And guess what? Team @HeyElsaAI took the win! WE ARE BASED.
Dhruva Chakravarthi @dhrude
So does this mean Indians can work remotely without a visa for the same salary and cost the company $100k less?
Dhruva Chakravarthi @dhrude
Sleep Token is a phenomenon. Girls are dancing and singing along. Guys are headbanging and counting polyrhythms. They have the mystique and metal of Slipknot and the heartthrob appeal and theatrics of The Weeknd. Their concerts are rituals, their stage productions are immaculate, their costumes are ornate, and their performance is hypnotizing. I watched them with 100,000 people and was able to tune out everything and get sucked into their trance. What an experience.
Dhruva Chakravarthi retweeted
Crypto India @CryptooIndia
BREAKING: 🇮🇳 BJP National Spokesperson Pradeep Bhandari proposes an Indian Rupee stablecoin, backed 1:1 by Govt bonds, to:
1. Enable seamless cross-border payments
2. Reduce remittance costs
3. Strengthen INR's global role
[2 images]
Dhruva Chakravarthi @dhrude
Global entropy is at an all-time high and a hockey-stick breakout is imminent. How do I long this?
Bryan Johnson @bryan_johnson
Sitting 4+ hours a day increases your risk of death.
+ sitting 4-8 hours/day: +12% increased risk
+ sitting 8-11 hours/day: +27% increased risk
+ sitting >11 hours/day: +48% increased risk
Here are the study details and how to measure your own sitting habits.
Dhruva Chakravarthi @dhrude
@hasanransari It's an Android with a wallet app that (I guess) has an extra hardware chip to work with. It has a nice dApp store, so maybe you can play around with that and build a super cool Solana mobile app :)
Bryan Johnson @bryan_johnson
I'm going undercover, what should I put in the box? wrong answers only
[image]
Paul Finney @paulfinneyx
Gradients of SF sky
[image]
Sai Lakshmi @DusiSailakshmi
Everyone in India wants to move to Bangalore. Everyone in Bangalore wants to move to SF. Now the real question is… where do people in SF want to move? 👀
Dhruva Chakravarthi @dhrude
Bitchat community popping off on the West Coast 😂
[image]
Dhruva Chakravarthi @dhrude
I’ve driven more in the past 24 hours in the US than I’ve driven in the last 10 years in Bangalore. @TonyCatoff - I wanna see a video of your driving experiences in the uru 🙃