george he 🥥🤗

20 posts

george he 🥥🤗

@georgehe0

infra & ai ⭐https://t.co/b8pEINeGSF

San Francisco · Joined April 2026
97 Following · 50 Followers
george he 🥥🤗 retweeted
Linghua Jin 🥥 🌴
Directed context compaction is very apt, and a very interesting problem to solve. Compaction without direction puts everything into one bucket; even map-reduce has a reduce key. What's interesting here is the "directed" part: following what to reduce. The user can supply instructions, e.g. different buckets for technology vs. product, or, for a novel, for person, place, and organization. The reduction can also be adjusted dynamically based on the user's subsequent questions, as feedback for organizing what gets reduced from the source of truth.
Andrej Karpathy@karpathy

@LinghuaJ Interesting.. Chain of thought is a reduce (in addition to attention ofc), so I guess this can be seen as a bit more of a directed context compaction mechanism, inheriting structure from the preexisting idea of a wiki.

0 replies · 1 repost · 20 likes · 2.2K views
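The "reduce key" framing in the exchange above can be sketched as ordinary code. A minimal, purely illustrative Python sketch (all names hypothetical, and a string join stands in for what would be an LLM summarization call) of directed compaction, where a user-chosen key decides which facts are reduced together:

```python
# Hypothetical sketch: "directed" context compaction as a tiny map-reduce
# where the user supplies the reduce key. Names are illustrative.
from collections import defaultdict

def compact(facts, reduce_key):
    """Group raw facts by a user-chosen key, then reduce each bucket.

    facts: list of dicts, e.g. {"kind": "person", "text": "..."}
    reduce_key: the field that directs the reduction (e.g. "kind")
    """
    buckets = defaultdict(list)
    for fact in facts:  # map: route each fact to its bucket
        buckets[fact[reduce_key]].append(fact["text"])
    # reduce: one compacted entry per bucket (an LLM call in practice;
    # a joined string stands in for the summary here)
    return {key: " | ".join(texts) for key, texts in buckets.items()}

facts = [
    {"kind": "person", "text": "Ada wrote the first program"},
    {"kind": "place", "text": "London hosted the exhibition"},
    {"kind": "person", "text": "Babbage designed the engine"},
]
wiki = compact(facts, reduce_key="kind")
```

Changing `reduce_key` (or the bucket vocabulary) is exactly the "direction" being discussed: the same raw facts compact into different structures depending on what the user asks the reduction to follow.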
george he 🥥🤗 retweeted
Linghua Jin 🥥 🌴
Love it - Directed context compaction is very apt, and a very interesting problem to solve. Compaction without direction puts everything into one bucket; even map-reduce has a reduce key. What's interesting here is the "directed" part: following what to reduce. The user can supply instructions, e.g. different buckets for technology vs. product, or, for a novel, for person, place, and organization. The reduction can also be adjusted dynamically based on the user's subsequent questions, as feedback for organizing what gets reduced from the source of truth.
2 replies · 2 reposts · 18 likes · 1.8K views
george he 🥥🤗 retweeted
Linghua Jin 🥥 🌴
LLM knowledge base idea open-sourced by @karpathy. RAG only has a map and no reduce. The essence is to have the LLM incrementally build and maintain a persistent wiki. So, at indexing time there must be a reduce step, so you can surface accumulated, compounded, synthesized knowledge beyond individual facts. Different data sources carry different facts, and a lot of discovery works this way: you derive general facts from single facts. In physics you take many independent observations and derive more general theorems; a lot of things come together at once. Having a solid incremental engine to drive the process, and keeping the index up to date and organized from the source of truth, is everything we have been building CocoIndex for. Very much looking forward to what's next!
Andrej Karpathy@karpathy

Wow, this tweet went very viral! I wanted to share a possibly slightly improved version of the tweet in an "idea file". The idea of the idea file is that in this era of LLM agents, there is less of a point/need to share the specific code/app; you just share the idea, and the other person's agent customizes & builds it for their specific needs. So here's the idea in gist format: gist.github.com/karpathy/442a6… You can give this to your agent and it can build you your own LLM wiki, guide you on how to use it, etc. It's intentionally kept a little bit abstract/vague because there are so many directions to take this in. And ofc, people can adjust the idea or contribute their own in the Discussion, which is cool.

20 replies · 11 reposts · 290 likes · 56.1K views
george he 🥥🤗
george he 🥥🤗@georgehe0·
Persistent, compounding, incremental.
Linghua Jin 🥥 🌴@LinghuaJ

LLM knowledge base idea open-sourced by @karpathy (full text quoted above).

0 replies · 1 repost · 2 likes · 29 views
george he 🥥🤗 retweeted
Linghua Jin 🥥 🌴
what a great time to learn from the best. my idol and GOAT @karpathy liked my thoughts around ai map reduce for knowledge. kinda know what getting a concert ticket feels like now. Back to building with a lot more energy tonight 🔥
Linghua Jin 🥥 🌴@LinghuaJ

Very excited to see @karpathy using this pattern. The old map-reduce processed data mechanically. In the AI era, map-reduce processes document data and aggregates it based on each user's needs, and it becomes much closer to, and possible for, individual developers. This is how it should be. I foresee individual agent builders or vibe coders maintaining their own logic for how they want an agent to see things, aggregating data per their needs from a source of truth. CocoIndex is building a dynamic incremental engine to power flexible AI logic that tailors to everyone's needs while reprocessing data with minimal recomputation to keep the knowledge base fresh. Very much looking forward to what's next!

1 reply · 4 reposts · 38 likes · 3.1K views
george he 🥥🤗 retweeted
Linghua Jin 🥥 🌴
Very excited to see @karpathy is using this pattern (full text quoted above).
Andrej Karpathy@karpathy

LLM Knowledge Bases

Something I'm finding very useful recently: using LLMs to build personal knowledge bases for various topics of research interest. In this way, a large fraction of my recent token throughput is going less into manipulating code, and more into manipulating knowledge (stored as markdown and images). The latest LLMs are quite good at it. So:

Data ingest: I index source documents (articles, papers, repos, datasets, images, etc.) into a raw/ directory, then I use an LLM to incrementally "compile" a wiki, which is just a collection of .md files in a directory structure. The wiki includes summaries of all the data in raw/, backlinks, and then it categorizes data into concepts, writes articles for them, and links them all. To convert web articles into .md files I like to use the Obsidian Web Clipper extension, and then I also use a hotkey to download all the related images to local so that my LLM can easily reference them.

IDE: I use Obsidian as the IDE "frontend" where I can view the raw data, the compiled wiki, and the derived visualizations. Important to note that the LLM writes and maintains all of the data of the wiki, I rarely touch it directly. I've played with a few Obsidian plugins to render and view data in other ways (e.g. Marp for slides).

Q&A: Where things get interesting is that once your wiki is big enough (e.g. mine on some recent research is ~100 articles and ~400K words), you can ask your LLM agent all kinds of complex questions against the wiki, and it will go off, research the answers, etc. I thought I had to reach for fancy RAG, but the LLM has been pretty good about auto-maintaining index files and brief summaries of all the documents and it reads all the important related data fairly easily at this ~small scale.

Output: Instead of getting answers in text/terminal, I like to have it render markdown files for me, or slide shows (Marp format), or matplotlib images, all of which I then view again in Obsidian. You can imagine many other visual output formats depending on the query. Often, I end up "filing" the outputs back into the wiki to enhance it for further queries. So my own explorations and queries always "add up" in the knowledge base.

Linting: I've run some LLM "health checks" over the wiki to e.g. find inconsistent data, impute missing data (with web searches), find interesting connections for new article candidates, etc., to incrementally clean up the wiki and enhance its overall data integrity. The LLMs are quite good at suggesting further questions to ask and look into.

Extra tools: I find myself developing additional tools to process the data, e.g. I vibe coded a small and naive search engine over the wiki, which I both use directly (in a web ui), but more often I want to hand it off to an LLM via CLI as a tool for larger queries.

Further explorations: As the repo grows, the natural desire is to also think about synthetic data generation + finetuning to have your LLM "know" the data in its weights instead of just context windows.

TLDR: raw data from a given number of sources is collected, then compiled by an LLM into a .md wiki, then operated on by various CLIs by the LLM to do Q&A and to incrementally enhance the wiki, and all of it viewable in Obsidian. You rarely ever write or edit the wiki manually, it's the domain of the LLM. I think there is room here for an incredible new product instead of a hacky collection of scripts.

7 replies · 8 reposts · 112 likes · 19.1K views
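The ingest-and-compile loop Karpathy describes (raw/ documents incrementally "compiled" into a wiki of .md files plus an index) can be approximated in a few lines. A minimal sketch, assuming a raw/ directory of .md sources and a wiki/ directory owned by the LLM; `summarize()` is a placeholder for the LLM call and every name here is illustrative, not part of any real tool:

```python
# Sketch of an incremental raw/ -> wiki/ compile step. Illustrative only.
from pathlib import Path

def summarize(text: str) -> str:
    # placeholder for an LLM summarization call
    return text.splitlines()[0][:80] if text else ""

def compile_wiki(raw_dir: Path, wiki_dir: Path) -> list[str]:
    """Compile each new raw document into a wiki article, then rebuild the index."""
    wiki_dir.mkdir(exist_ok=True)
    index = []
    for src in sorted(raw_dir.glob("*.md")):
        dst = wiki_dir / src.name
        if not dst.exists():  # incremental: skip already-compiled documents
            dst.write_text(f"# {src.stem}\n\n{summarize(src.read_text())}\n")
        index.append(f"- [[{src.stem}]]")
    # index.md is the entry point the agent auto-maintains in the workflow above
    (wiki_dir / "index.md").write_text("\n".join(index) + "\n")
    return index
```

The `if not dst.exists()` check is the crude stand-in for incrementality: rerunning the compile only touches new sources, which is the property the surrounding tweets keep emphasizing.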
george he 🥥🤗 retweeted
Linghua Jin 🥥 🌴
A mini map-reduce for a knowledge base. Having a source of truth and keeping derived knowledge in sync with it is everything we do at @cocoindex.io. With a robust incremental engine, CocoIndex maintains up-to-date knowledge for AI agents.
Andrej Karpathy@karpathy

LLM Knowledge Bases (full text quoted above).

2 replies · 4 reposts · 23 likes · 3K views
george he 🥥🤗
george he 🥥🤗@georgehe0·
What excited me most about this project is the kind of questions you can answer once podcast knowledge is structured: What did a specific guest say about AI safety across five different shows? Which experts discussed the same technology but reached opposite conclusions? These are simple questions, but impossible to answer when knowledge is locked in hours of audio.
0 replies · 0 reposts · 2 likes · 10 views
george he 🥥🤗 retweeted
CocoIndex
CocoIndex@cocoindex_io·
We built an #opensource project that continuously turns podcast episodes into a knowledge graph. 🌟 Source code: github.com/cocoindex-io/c… Podcasts are one of the richest sources of expert knowledge on the internet. A single Lex Fridman or Dwarkesh Patel episode can contain dozens of substantive claims about people, technologies, and organizations — but it's all locked inside hours of audio. You can't query any of it. You can't cross-reference what two different guests said about the same topic. In this post, we'll build a @cocoindex_io pipeline that turns YouTube podcast episodes into a queryable knowledge graph. The pipeline downloads audio, transcribes with speaker diarization, uses an LLM to extract structured statements and entities, resolves duplicates across episodes, and stores everything in @SurrealDB as a graph. We use CocoIndex and @pydantic to build the pipeline. CocoIndex is a data indexing framework for building incremental data transformation pipelines — it tracks what's been processed, so re-running the pipeline only processes new or changed episodes. CocoIndex makes it exceptionally easy to build a knowledge graph without writing Cypher, and easy to add/remove any podcast incrementally.
3 replies · 8 reposts · 19 likes · 6.3K views
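The extraction stage of the pipeline described above (diarized transcript → structured statements and entities) can be sketched in plain Python. This is a hypothetical stand-in, not the actual CocoIndex or pydantic API; in the real pipeline this step is an LLM extraction call, and title-case word-picking here merely fakes entity recognition:

```python
# Toy sketch of transcript -> structured statements. Names illustrative.
from dataclasses import dataclass

@dataclass
class Statement:
    speaker: str
    claim: str
    entities: list[str]

def extract_statements(transcript: list[tuple[str, str]]) -> list[Statement]:
    """Turn (speaker, utterance) pairs into structured statements.
    A real pipeline would use an LLM here instead of title-case matching."""
    return [
        Statement(
            speaker=who,
            claim=text,
            entities=[w for w in text.split() if w.istitle()],
        )
        for who, text in transcript
    ]

transcript = [
    ("Guest", "Rust makes incremental pipelines safer"),
    ("Host", "SurrealDB stores the graph"),
]
statements = extract_statements(transcript)
```

Each `Statement` would then become edges in the graph store, which is what makes cross-episode questions ("what did this guest claim about Rust?") queryable.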
george he 🥥🤗
george he 🥥🤗@georgehe0·
What excited me most about this project is the kind of questions you can answer once podcast knowledge is structured: What did a specific guest say about AI safety across five different shows? Which experts discussed the same technology but reached opposite conclusions? These are simple questions, but impossible to answer when knowledge is locked in hours of audio. The pipeline handles the hard parts — identifying unnamed speakers from metadata, resolving "Apple" vs "Apple Inc." across episodes — and CocoIndex keeps it all incremental: when new episodes drop, only the new content is processed; when you refine your ontology or tweak extraction logic, only the affected parts reprocess. No starting from scratch. Full source code in the original post.
CocoIndex@cocoindex_io

We built an #opensource project that continuously turns podcast episodes into a knowledge graph (full text quoted above).

1 reply · 2 reposts · 9 likes · 512 views
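The "Apple" vs "Apple Inc." resolution step mentioned above amounts to canonicalizing surface forms before merging graph nodes. A toy sketch, with a hand-written alias table standing in for whatever the real pipeline derives from its data (all names illustrative):

```python
# Toy entity canonicalization: normalize a surface form, then look it up.
def canonical(name: str, aliases: dict[str, str]) -> str:
    """Map a surface form to one canonical entity name."""
    key = name.strip().lower().removesuffix(" inc.").removesuffix(" inc")
    return aliases.get(key, name.strip())

# hand-written alias table; a real pipeline would build this from the corpus
ALIASES = {"apple": "Apple Inc."}
```

With a step like this in place, statements about "Apple", "apple inc.", and "Apple Inc." all land on the same graph node, which is what makes cross-episode aggregation meaningful.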
Linghua Jin 🥥 🌴
Introducing my cofounder @georgehe0 - the best engineer and coolest coder I've ever worked with. He finally got onto X 🔥 Thrilled to build @cocoindex_io with him. Follow George if you like topics about coding, AI, infra, Rust and all! Let's go!
5 replies · 1 repost · 33 likes · 1.2K views
george he 🥥🤗 retweeted
Linghua Jin 🥥 🌴
Hot 🔥🔥 continuously turn podcasts into knowledge graphs. What if we could equip our agents with clear, structured knowledge from these great minds? 🌟github.com/cocoindex-io/c… Star the repo if you like it! 🔥Full tutorial: cocoindex.io/blogs/podcast-… Podcasts are one of the richest sources of expert knowledge on the internet. A single Lex Fridman or Dwarkesh Patel episode can contain dozens of substantive claims about people, technologies, and organizations — but it's all locked inside hours of audio. Check out this open source project with a detailed walkthrough!
1 reply · 5 reposts · 24 likes · 4.7K views
george he 🥥🤗 retweeted
Linghua Jin 🥥 🌴
Linghua Jin 🥥 🌴@LinghuaJ·
Building an Invisible Daemon🔥🔥 - Everyone is building CLIs for developer tools these days. We just published how we built an invisible daemon with cocoindex-code. The challenge isn't just building the daemon. It's making it invisible: something that starts when needed, upgrades when the tool upgrades, reloads config when settings change, and shuts down cleanly. Link in comment!
3 replies · 6 reposts · 32 likes · 2.3K views
george he 🥥🤗 retweeted
0xMarioNawfal
0xMarioNawfal@RoundtableSpace·
CocoIndex Code gives your coding agent a brain.
- Semantic search across your entire codebase
- AST-based - it actually understands the code
- Only re-indexes changed files
- No API key, runs local
- Saves 70% of tokens
Your agent stops guessing, starts knowing. Is token efficiency the next moat for AI dev tools?
25 replies · 8 reposts · 53 likes · 55.5K views
george he 🥥🤗 retweeted
Linghua Jin 🥥 🌴
Linghua Jin 🥥 🌴@LinghuaJ·
Super excited to share our learnings - From Pickle to Type-Guided Serde 🔥🔥: How We Made Python Serialization Safe and Automatic. @cocoindex_io is a framework for building incremental data pipelines. It has a Rust core 🦀 for performance and exposes a Python SDK for users to define their pipelines. Under the hood, the engine needs to serialize and deserialize Python objects constantly — caching function results, persisting pipeline state, tracking records for change detection — with the serialized data crossing the Rust/Python boundary and stored by the Rust core. Here are the lessons and the things we did to make Python serialization safe and automatic - worth a read if you are into building reliable data infrastructure.
1 reply · 6 reposts · 27 likes · 976 views
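The pickle-vs-type-guided contrast in the post above: with pickle, the stored bytes decide what object gets constructed; with type-guided serde, the declared target type does, so arbitrary objects can never be materialized from stored data. A simplified single-language sketch of the idea (the real engine crosses a Rust/Python boundary; these helper names are illustrative):

```python
# Toy type-guided (de)serialization: the dataclass schema, not the payload,
# drives decoding. Illustrative only.
import json
from dataclasses import dataclass, asdict, fields

@dataclass
class Record:
    id: int
    name: str

def dumps(value) -> str:
    """Serialize a dataclass instance to plain JSON (no code in the payload)."""
    return json.dumps(asdict(value))

def loads(cls, data: str):
    """Deserialize guided by cls: only its declared fields are read back."""
    raw = json.loads(data)
    return cls(**{f.name: raw[f.name] for f in fields(cls)})

rec = Record(id=1, name="episode")
restored = loads(Record, dumps(rec))
```

Because `loads` consults `fields(cls)` rather than trusting the payload, unknown keys are ignored and missing required fields fail loudly, which is the safety property the post is after.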
george he 🥥🤗 retweeted
Linghua Jin 🥥 🌴
AI review can spot maybe 30% of problems, but it lacks coverage and may miss critical things, like a critical config. So think about it in reverse: what are the changes in the release, and make every change explainable. You cannot release an unjustifiable change. Today, changes have no lineage and lack a source of truth. This is something we at @cocoindex_io are thinking about. Lineage plays such an important role in explainable AI and coding, to keep situations like this from happening. Incremental and explainable AI are the future.
Chaofan Shou@Fried_rice

Claude code source code has been leaked via a map file in their npm registry! Code: …a8527898604c1bbb12468b1581d95e.r2.dev/src.zip

2 replies · 2 reposts · 11 likes · 573 views
george he 🥥🤗 retweeted
CocoIndex
CocoIndex@cocoindex_io·
Vulnerability scan & code review - built with @cocoindex & #opensource - LLM-powered Android vulnerability scanner built on CocoIndex semantic search and DeepAgents orchestration. Beatrix indexes Android source code into vector embeddings, then deploys 8 specialized AI sub-agents to systematically scan for 100 vulnerability types across all OWASP Mobile Top 10 categories. Each finding includes source-to-sink data flow analysis, proof-of-concept exploits, Mermaid.js flow diagrams, and remediation steps.
1 reply · 1 repost · 8 likes · 431 views