Dev Shah@0xDevShah
If I were a16z, yc, or sequoia, I’d be aggressively investing in startups that are building novel ways to collect and annotate real-world data.
> Billions of hours of driving data
> Factory workers interacting with appliances and heavy machinery
> Audio segmentation with deep dialectical and cultural understanding
> Wet-lab experimental data
> Continuous collection and annotation of agent traces at compute scale
When we built LLMs, most of the data already existed on the internet. We just had to scrape, clean, and scale. But as we move toward world foundation models, the bottleneck is high-quality, real-world, well-annotated data.
And annotation quality matters. There’s a massive difference between:
“Apple on a tree”
and
“Ripe apples on a tree. The wind is blowing at 2 miles per hour. The temperature is around 18°C.”
The question is simple. How much of the world can you actually capture?
Today, LLMs know that apples fall because of gravity, not because they understand causality, but because they understand language correlations extremely well. Understanding the causal structure comes next.
If I were building towards that future, I’d anchor data collection in India and other South and Southeast Asian regions. I’d deploy hardware, collect thousands of hours of human activity data, health signals, and vitals, and run annotation pipelines continuously. Day and night.
If I were a16z, I’d fund founders to do this.
I might just have the urge to do it myself.