Dmitry Petrov

1.2K posts

@FullStackML

🛠️ Building data infra for AI/ML. Ex-Data Scientist @Microsoft. Created DVC, now DataChain. PhD in CS. Serious about data. Less serious about everything else.

San Francisco, CA · Joined October 2011
538 Following · 2.2K Followers
Dmitry Petrov@FullStackML·
@atmoio The internet feeds were coffee-addictive. AI conversations are heroin-addictive.
0
0
0
70
Mo@atmoio·
AI is making CEOs delusional
993
2.6K
19K
2.8M
Andrej Karpathy@karpathy·
I packaged up the "autoresearch" project into a new self-contained minimal repo if people would like to play over the weekend. It's basically nanochat LLM training core stripped down to a single-GPU, one file version of ~630 lines of code, then:
- the human iterates on the prompt (.md)
- the AI agent iterates on the training code (.py)
The goal is to engineer your agents to make the fastest research progress indefinitely and without any of your own involvement. In the image, every dot is a complete LLM training run that lasts exactly 5 minutes. The agent works in an autonomous loop on a git feature branch and accumulates git commits to the training script as it finds better settings (of lower validation loss by the end) of the neural network architecture, the optimizer, all the hyperparameters, etc. You can imagine comparing the research progress of different prompts, different agents, etc.
github.com/karpathy/autor…
Part code, part sci-fi, and a pinch of psychosis :)
[image]
1K
3.6K
28.1K
10.7M
Dmitry Petrov@FullStackML·
@jamiequint Love the shift to context management! Works best when the system exposes structure (repos, dbt DAGs). Harder with datasets like images/video/docs where logic lives in code and dependencies are mostly implicit.
0
0
1
438
Dmitry Petrov@FullStackML·
For data projects, CLAUDE.md is a must-have. The meaning of the data needs to be specified: which storage buckets to use (and which not to touch), which files to use and their semantics. Bonus points if it can point to the code that produced the files - then there's no need to describe the semantics manually. Repo structure is the foundation. The data context layer is the part most teams skip.
0
0
1
1.2K
Shraddha Bharuka@BharukaShraddha·
Most people treat CLAUDE.md like a prompt file. That’s the mistake. If you want Claude Code to feel like a senior engineer living inside your repo, your project needs structure. Claude needs 4 things at all times:
• the why → what the system does
• the map → where things live
• the rules → what’s allowed / not allowed
• the workflows → how work gets done
I call this: The Anatomy of a Claude Code Project 👇
━━━━━━━━━━━━━━━
1️⃣ CLAUDE.md = Repo Memory (keep it short)
This is the north star file. Not a knowledge dump. Just:
• Purpose (WHY)
• Repo map (WHAT)
• Rules + commands (HOW)
If it gets too long, the model starts missing important context.
━━━━━━━━━━━━━━━
2️⃣ .claude/skills/ = Reusable Expert Modes
Stop rewriting instructions. Turn common workflows into skills:
• code review checklist
• refactor playbook
• release procedure
• debugging flow
Result: Consistency across sessions and teammates.
━━━━━━━━━━━━━━━
3️⃣ .claude/hooks/ = Guardrails
Models forget. Hooks don’t. Use them for things that must be deterministic:
• run formatter after edits
• run tests on core changes
• block unsafe directories (auth, billing, migrations)
━━━━━━━━━━━━━━━
4️⃣ docs/ = Progressive Context
Don’t bloat prompts. Claude just needs to know where truth lives:
• architecture overview
• ADRs (engineering decisions)
• operational runbooks
━━━━━━━━━━━━━━━
5️⃣ Local CLAUDE.md for risky modules
Put small files near sharp edges:
src/auth/CLAUDE.md
src/persistence/CLAUDE.md
infra/CLAUDE.md
Now Claude sees the gotchas exactly when it works there.
━━━━━━━━━━━━━━━
Prompting is temporary. Structure is permanent. When your repo is organized this way, Claude stops behaving like a chatbot… …and starts acting like a project-native engineer.
[image]
159
985
6.7K
1M
Dmitry Petrov@FullStackML·
Super Bowl in SF and it’s ALL Patriots… 🤨 Go West Coast! 🦅
0
0
0
87
Dmitry Petrov@FullStackML·
OpenAI's data agent - how structured / SQL data is done right: openai.com/index/inside-o… 🎥🔊🖼️ Multimodal data is harder: schemas and lineage aren't explicit - they must be inferred from Python code. The upside: a single language removes an entire layer of context and simplifies reasoning. ✨ True meaning lives in the code ✨
0
1
3
220
Dmitry Petrov@FullStackML·
LLMs broke out once text data hit scale. Neuro is entering its own scaling era - EEG, DICOM/NIfTI imaging, 3D-scans. Guess which part breaks first 👀 The data stack. datachain.ai/blog/neuro-dat…
0
2
5
349
Dmitry Petrov@FullStackML·
DBT + Fivetran 🚀 A huge milestone for the "modern data stack". Consolidation is on - who's next? Snowflake ❄️? Databricks 🔥? But maybe that doesn’t even matter.
The next wave is here: Multimodal data stack
It's not replacing the old one - it's for different users:
🤖 AI, not Analytics
🧠 Unstructured, not tabular
📂 Files, not tables
🐍 Python, not SQL
⚙️ Way more CPU/GPU-hungry 😅
Tabular data is just one modality - and whoever wins multimodality might own tabular too. Such an exciting time to be in the front row of this race 🔥
dbt@getdbt

@dbt_labs and @fivetran are joining forces to define the future of data: open data infrastructure. One foundation for movement, transformation, and AI—built to be open, reliable, and interoperable. Read more about our shared vision getdbt.com/blog/dbt-labs-…

2
2
8
628
Andrew Lee@startupandrew·
Today we're launching Tasklet — an AI agent for automating your business. Unlike ChatGPT, @TaskletAI actually does the work for you: connecting to your tools, triggering automatically, and handling tasks while you sleep.
52
66
295
79.9K
Dmitry Petrov@FullStackML·
AI isn't just about text and code. What about sounds, videos, and sensors? 🎧🎬🔬 I’ll be at @MLOpsWorld Summit (Oct 6-9 in Austin, TX) sharing how to query inside the file ⚡️ Come nerd out with me in Texas 👋🤠 #MLOpsWorld2025
[image]
0
3
7
280
Dmitry Petrov@FullStackML·
"90% of code will be AI-written" 🤖 Sounds insane - until you see the pattern. When the building blocks exist, coding is just connecting the dots 🔗 And nobody connects dots better than AI. That’s why AI crushes boilerplate web apps 🛠️ - the blocks are there. And why it struggles in real projects 🚧 - they aren’t. The trick: focus on first principles, create the blocks. AI covers the other 90%⚡️
0
0
2
209
Nick Davidov@Nick_Davidov·
vibe coding is more evil than Civ in terms of "one more step and I'll go to bed"
5
1
21
1.4K
Nader Khalil🍊@NaderLikeLadder·
Friendly reminder that multiplying revenue by 12 is not ARR
135
99
2.8K
397.8K
Dmitry Petrov@FullStackML·
To stay ahead of the curve in vibe-coding you must rotate IDEs every 2 months Cursor → Claude → Cursor → (???) → repeat. Productivity is temporary, but vibes are forever 😎✨
0
0
4
235
Dmitry Petrov@FullStackML·
Good points! Are you envisioning this LLM data pipeline as more of a batch process, or closer to a dynamic runtime orchestration like LangChain?
1
0
0
15
Nathan Danielsen@Nate_somewhere·
So much raw power available that needs to be bounded by lots of determinism, i.e., guardrails
1
0
0
31
Nathan Danielsen@Nate_somewhere·
Building out an LLM driven data pipeline now. In a traditional data PO pipeline with high levels of deterministic behaviors, I would have clear separations of concerns for each of the ETL steps.
1
0
1
56
Dmitry Petrov@FullStackML·
Totally! WebDataset is awesome - you can grab just the files you need + stream them in parallel. 🚀 No need to pull a whole chunk of records/files. Great for training at scale. The only trade-off: data visibility - since it's all tarballs, you need extra tooling to peek inside 👀 This is how it looks in datachain UI:
[image]
0
0
1
34
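A minimal sketch of the "peek inside the tarball" point from the WebDataset post above, using only Python's standard `tarfile` module. It is not the datachain UI or the WebDataset library. The shard contents, the sample names, and the helpers `build_demo_shard` and `list_samples` are hypothetical and made up for illustration. It assumes the usual WebDataset layout, where files that share a name up to the first dot belong to one sample, and that member names contain no directory components with dots.

```python
from __future__ import annotations

import io
import tarfile
from collections import defaultdict


def build_demo_shard() -> bytes:
    """Build a tiny in-memory tar shard with made-up sample files."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, payload in [
            ("sample0001.jpg", b"fake-jpeg-bytes"),
            ("sample0001.json", b'{"label": "cat"}'),
            ("sample0002.jpg", b"fake-jpeg-bytes-2"),
            ("sample0002.json", b'{"label": "dog"}'),
        ]:
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


def list_samples(shard: bytes) -> dict[str, dict[str, int]]:
    """Group tar members by sample key and report each file's size in bytes.

    Only tar headers are inspected; file payloads are not decoded.
    """
    samples: dict[str, dict[str, int]] = defaultdict(dict)
    with tarfile.open(fileobj=io.BytesIO(shard), mode="r") as tar:
        for member in tar:
            if not member.isfile():
                continue
            # Key is the name up to the first dot; the rest is the extension.
            key, _, ext = member.name.partition(".")
            samples[key][ext] = member.size
    return dict(samples)


if __name__ == "__main__":
    for key, files in list_samples(build_demo_shard()).items():
        print(key, files)
```

For a real shard on disk, `tarfile.open(path)` would take the place of the in-memory buffer. Remote or very large shards would likely need the kind of extra tooling the post mentions.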
Dmitry Petrov@FullStackML·
There's a trap I see AI/ML teams fall into with video, audio, and multimodal data. 🎥🎧👽 I wrote a blog post about it (with memes). In 🧵
[image]
1
1
7
392