Komal Mathur

30 posts

@KomalBoyo

Joined July 2023
195 Following · 11 Followers
Komal Mathur retweeted
elvis
elvis@omarsar0·
// Agentic Harness Engineering //

Pay attention to this one, AI devs. (bookmark it)

Most coding-agent harnesses are still tuned by hand or brittle trial-and-error self-evolution. This new work introduces Agentic Harness Engineering, a framework that makes harness evolution observable.

They do this through three layers: components as revertible files, experience as condensed evidence from millions of trajectory tokens, and decisions as falsifiable predictions checked against task outcomes. Each edit becomes a contract you can verify or revert.

Results: pass@1 on Terminal-Bench 2 climbs from 69.7% to 77.0% in ten iterations, beating human-designed Codex-CLI (71.9%) and self-evolving baselines like ACE and TF-GRPO. The evolved harness also transfers across model families with +5.1 to +10.1 point gains, while using 12% fewer tokens than the seed on SWE-bench-verified.

Harness work is the biggest hidden cost in most agent systems. This is the first credible recipe for letting the harness improve itself without drifting into noise.

Paper: arxiv.org/abs/2604.25850

Learn to build effective AI agents in our academy: academy.dair.ai
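The "each edit becomes a contract you can verify or revert" idea can be illustrated with a minimal sketch. This is not the paper's implementation: the `HarnessEdit` record, the `review_edit` rule, and the intermediate scores in the demo are hypothetical, chosen only to show a prediction being checked against a measured outcome. Only the 69.7% and 77.0% figures come from the post.

```python
from dataclasses import dataclass


@dataclass
class HarnessEdit:
    """Hypothetical record of one harness change and the prediction attached to it."""
    description: str
    predicted_min_gain: float  # predicted pass@1 gain, in percentage points


def review_edit(edit: HarnessEdit, pass_before: float, pass_after: float) -> str:
    """Return "keep" if the measured gain meets the prediction, otherwise "revert"."""
    measured_gain = pass_after - pass_before
    return "keep" if measured_gain >= edit.predicted_min_gain else "revert"


def total_gain(pass_before: float, pass_after: float) -> float:
    """Gain in percentage points, rounded to one decimal place."""
    return round(pass_after - pass_before, 1)


if __name__ == "__main__":
    # Illustrative values only: the intermediate scores 71.0 and 70.5 are made up.
    good = HarnessEdit("hypothetical prompt tweak", predicted_min_gain=1.0)
    bad = HarnessEdit("hypothetical tool change", predicted_min_gain=2.0)
    print(review_edit(good, 69.7, 71.0))  # prediction met
    print(review_edit(bad, 71.0, 70.5))   # prediction falsified
    # Seed-to-final gain reported in the post: 69.7% to 77.0%.
    print(total_gain(69.7, 77.0))
```

The point of the sketch is the shape of the check: a prediction that can fail, and a revert path when it does.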
Replies: 69 · Reposts: 234 · Likes: 1.6K · Views: 139.2K
Komal Mathur
Komal Mathur@KomalBoyo·
This challenge was such a breath of fresh air. Thank you, @_saah1l and @ShiprocketIndia! I think I finished my protein intake for the day with all those eggs.
Replies: 0 · Reposts: 0 · Likes: 0 · Views: 46
Komal Mathur retweeted
Nithin Kamath
Nithin Kamath@Nithin0dha·
Asked someone from the industry whether foreign investors are still interested in allocating to India.

The TLDR: Interest has pretty much died out. India is seen as geopolitically exposed, especially to an oil shock. There are no real AI plays. Valuations are rich. And the rupee situation doesn't help.

On top of that, investors who were sitting on gains have taken money off the table and are now looking at markets like Japan, Taiwan, Korea, Europe, etc. instead.

He also pointed out that our LTCG/STCG structure and the increase in STT have made India less attractive compared to other markets that are seeing inflows. If we need to attract FPIs back, and we do, fixing this feels like pretty low-hanging fruit.
Replies: 1.5K · Reposts: 3.1K · Likes: 18.3K · Views: 2.8M
Komal Mathur retweeted
Andrej Karpathy
Andrej Karpathy@karpathy·
Wow, this tweet went very viral! I wanted to share a possibly slightly improved version of the tweet in an "idea file". The idea of the idea file is that in this era of LLM agents, there is less of a point/need of sharing the specific code/app, you just share the idea, then the other person's agent customizes & builds it for your specific needs. So here's the idea in a gist format: gist.github.com/karpathy/442a6… You can give this to your agent and it can build you your own LLM wiki and guide you on how to use it etc. It's intentionally kept a little bit abstract/vague because there are so many directions to take this in. And ofc, people can adjust the idea or contribute their own in the Discussion which is cool.
Replies: 1.1K · Reposts: 2.8K · Likes: 26.7K · Views: 7.1M
Komal Mathur retweeted
Andrej Karpathy
Andrej Karpathy@karpathy·
LLM Knowledge Bases

Something I'm finding very useful recently: using LLMs to build personal knowledge bases for various topics of research interest. In this way, a large fraction of my recent token throughput is going less into manipulating code, and more into manipulating knowledge (stored as markdown and images). The latest LLMs are quite good at it. So:

Data ingest: I index source documents (articles, papers, repos, datasets, images, etc.) into a raw/ directory, then I use an LLM to incrementally "compile" a wiki, which is just a collection of .md files in a directory structure. The wiki includes summaries of all the data in raw/, backlinks, and then it categorizes data into concepts, writes articles for them, and links them all. To convert web articles into .md files I like to use the Obsidian Web Clipper extension, and then I also use a hotkey to download all the related images to local so that my LLM can easily reference them.

IDE: I use Obsidian as the IDE "frontend" where I can view the raw data, the compiled wiki, and the derived visualizations. Important to note that the LLM writes and maintains all of the data of the wiki, I rarely touch it directly. I've played with a few Obsidian plugins to render and view data in other ways (e.g. Marp for slides).

Q&A: Where things get interesting is that once your wiki is big enough (e.g. mine on some recent research is ~100 articles and ~400K words), you can ask your LLM agent all kinds of complex questions against the wiki, and it will go off, research the answers, etc. I thought I had to reach for fancy RAG, but the LLM has been pretty good about auto-maintaining index files and brief summaries of all the documents and it reads all the important related data fairly easily at this ~small scale.

Output: Instead of getting answers in text/terminal, I like to have it render markdown files for me, or slide shows (Marp format), or matplotlib images, all of which I then view again in Obsidian. You can imagine many other visual output formats depending on the query. Often, I end up "filing" the outputs back into the wiki to enhance it for further queries. So my own explorations and queries always "add up" in the knowledge base.

Linting: I've run some LLM "health checks" over the wiki to e.g. find inconsistent data, impute missing data (with web searchers), find interesting connections for new article candidates, etc., to incrementally clean up the wiki and enhance its overall data integrity. The LLMs are quite good at suggesting further questions to ask and look into.

Extra tools: I find myself developing additional tools to process the data, e.g. I vibe coded a small and naive search engine over the wiki, which I both use directly (in a web ui), but more often I want to hand it off to an LLM via CLI as a tool for larger queries.

Further explorations: As the repo grows, the natural desire is to also think about synthetic data generation + finetuning to have your LLM "know" the data in its weights instead of just context windows.

TLDR: raw data from a given number of sources is collected, then compiled by an LLM into a .md wiki, then operated on by various CLIs by the LLM to do Q&A and to incrementally enhance the wiki, and all of it viewable in Obsidian. You rarely ever write or edit the wiki manually, it's the domain of the LLM. I think there is room here for an incredible new product instead of a hacky collection of scripts.
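The "small and naive search engine over the wiki" mentioned above could look something like the following. This is a sketch under assumptions, not the author's tool: the function names, the term-count scoring, and the demo wiki contents are all hypothetical. It only assumes what the post states, namely that the wiki is a directory tree of .md files.

```python
import re
import tempfile
from collections import Counter
from pathlib import Path


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def search_wiki(wiki_dir: Path, query: str, top_k: int = 5) -> list[tuple[str, int]]:
    """Rank .md files under wiki_dir by how often they contain the query terms."""
    terms = set(tokenize(query))
    results = []
    for path in sorted(wiki_dir.rglob("*.md")):
        counts = Counter(tokenize(path.read_text(encoding="utf-8")))
        score = sum(counts[term] for term in terms)
        if score > 0:
            results.append((path.relative_to(wiki_dir).as_posix(), score))
    # Highest score first; ties are broken alphabetically by path.
    results.sort(key=lambda item: (-item[1], item[0]))
    return results[:top_k]


def build_demo_wiki(root: Path) -> None:
    """Write a tiny made-up wiki so the search can be tried end to end."""
    (root / "concepts").mkdir(parents=True, exist_ok=True)
    (root / "concepts" / "rag.md").write_text(
        "# RAG\nRetrieval augmented generation uses retrieval.\n", encoding="utf-8"
    )
    (root / "concepts" / "finetuning.md").write_text(
        "# Finetuning\nSynthetic data generation for finetuning.\n", encoding="utf-8"
    )
    (root / "index.md").write_text("# Index\n- rag\n- finetuning\n", encoding="utf-8")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        wiki = Path(tmp)
        build_demo_wiki(wiki)
        for name, score in search_wiki(wiki, "retrieval"):
            print(f"{score:3d}  {name}")
```

Because it takes a directory and a query and returns ranked paths, a function like this is easy to wrap in a CLI and hand to an agent as a tool, which is the usage the post describes.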
Replies: 2.9K · Reposts: 7.2K · Likes: 59.2K · Views: 21.2M
Komal Mathur retweeted
Animesh Koratana
Animesh Koratana@akoratana·
Context graphs will be to the 2030s what databases were to the 2000s. Within a year, every frontier lab will be building one and here's why:

At 10 people, coordination is free. Everyone knows what everyone else is doing. You never hold a meeting to "align."

At 100 people, you spend maybe 20% of your payroll on coordination. Managers, syncs, standups, planning sessions, status reports.

At 10,000 people, that number approaches 60%. The majority of your headcount exists not to produce anything but to make sure the people who produce things are producing the right things in the right order.

This is the dirty secret of large organizations: output scales linearly with headcount, but coordination cost scales exponentially. Every person you add creates new information pathways that must be maintained. The hierarchy is the protocol that manages this, and it's brutally expensive.

Hierarchy is a compression algorithm for organizational knowledge. At every layer, a manager compresses the reality of their team into a summary that fits in a 30-minute meeting with their boss. Their boss compresses eight of those summaries into one for their boss. By the time information reaches the CEO, it's been lossy-compressed through five or six layers of human interpretation.

This is why CEOs make bad decisions. The information they receive has been compressed, filtered, and distorted at every layer. The hierarchy is high-latency, low-bandwidth, and lossy.

Jack didn't fire 4,000 producers but cut 4,000 compression nodes. Block's "world model" is a replacement algorithm. Zero latency, high bandwidth, lossless. Every person at the edge gets the full picture without waiting for information to travel through human relays.

The infrastructure that makes this possible is the context graph. A living, continuously updated representation of how the organization actually works. Not just data, but decision traces: the reasoning connecting observations to actions. Not what's true now, but why it became true.

The shift from "give agents memory" to "give agents organizational judgment" will define the next platform war.
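The headcount arithmetic above can be made concrete with a small sketch. Two modeling choices here are assumptions, not claims from the post: counting every pair of people as one potential pathway, and treating the hierarchy as a uniform tree with a span of eight (taken from the "eight of those summaries" remark). Under the pairwise model the pathway count grows quadratically with headcount, which is already steep, and a span of eight reaches 10,000 people in five layers, in line with the "five or six layers" mentioned.

```python
def pairwise_channels(headcount: int) -> int:
    """Number of distinct person-to-person pairs among `headcount` people."""
    return headcount * (headcount - 1) // 2


def layers_needed(headcount: int, span: int = 8) -> int:
    """Smallest number of layers such that span ** layers >= headcount."""
    layers, reach = 0, 1
    while reach < headcount:
        reach *= span
        layers += 1
    return layers


if __name__ == "__main__":
    for n in (10, 100, 10_000):
        print(n, pairwise_channels(n), layers_needed(n))
    # 10 -> 45 pairs, 100 -> 4950 pairs, 10000 -> 49995000 pairs and 5 layers
```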
jack@jack

x.com/i/article/2038…

Replies: 96 · Reposts: 197 · Likes: 1.7K · Views: 389.5K
Komal Mathur
Komal Mathur@KomalBoyo·
🦔 goniffler. A 3D knowledge graph for your X feed.

Anyone who hopes to keep up with AI these days can prolly relate: you wake up, scroll X, see 15 new AI tools, a framework you've never heard of, and @bcherny's post on yet another CC feature (goated tbh). Bookmark some. Like some. And then just forget about it until someone brings it up again.

I wanted to actually use what I found: try out the tools, build on top of projects, integrate the ideas people share. But my bookmarks were a graveyard and X's search is... X's search.

So I built goniffler.com. Chrome extension sits on X. See something worth remembering, one click. Auto tagged with AI. Every saved tweet becomes a node in a 3D knowledge graph. Topics cluster together. Tags connect ideas you didn't know were related.

Now when I'm building something and think "someone tweeted about this" I just search. By keyword, author, or tag. Click any node to read the full tweet. Or ask Claude: your whole saved collection is queryable as a Claude Code skill. Not random results from Google. Real stuff, from people you follow, that you already decided was worth saving. Integrates and calls @garrytan's gstack too!

Free, open source, still early. Would genuinely love feedback on what's broken and what would make you use it daily.

Check it out - goniffler.com
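To make the "every saved tweet becomes a node, tags connect ideas" description concrete, here is a minimal sketch of a tag-linked graph with tag search. It is not goniffler's code: the data shape, the function names, and the sample tweet IDs and tags are hypothetical, and the real project may store and link tweets quite differently.

```python
from collections import defaultdict
from itertools import combinations


def build_tag_graph(saved: dict[str, set[str]]) -> dict[str, set[str]]:
    """Connect every pair of saved tweets that share at least one tag."""
    by_tag: dict[str, set[str]] = defaultdict(set)
    for tweet_id, tags in saved.items():
        for tag in tags:
            by_tag[tag].add(tweet_id)
    graph: dict[str, set[str]] = {tweet_id: set() for tweet_id in saved}
    for tweet_ids in by_tag.values():
        for a, b in combinations(sorted(tweet_ids), 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph


def search_by_tag(saved: dict[str, set[str]], tag: str) -> list[str]:
    """Return the IDs of saved tweets carrying the given tag, sorted."""
    return sorted(tweet_id for tweet_id, tags in saved.items() if tag in tags)


if __name__ == "__main__":
    # Made-up sample data: tweet IDs mapped to their auto-assigned tags.
    saved = {
        "t1": {"agents", "harness"},
        "t2": {"knowledge-base", "agents"},
        "t3": {"devops"},
    }
    print(build_tag_graph(saved))
    print(search_by_tag(saved, "agents"))
```

Tweets with no shared tags stay as isolated nodes, which is one simple way clusters by topic can emerge from tagging alone.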
Replies: 0 · Reposts: 0 · Likes: 1 · Views: 65
Komal Mathur retweeted
Andrej Karpathy
Andrej Karpathy@karpathy·
When I built menugen ~1 year ago, I observed that the hardest part by far was not the code itself, it was the plethora of services you have to assemble like IKEA furniture to make it real, the DevOps: services, payments, auth, database, security, domain names, etc... I am really looking forward to a day where I could simply tell my agent: "build menugen" (referencing the post) and it would just work. The whole thing up to the deployed web page. The agent would have to browse a number of services, read the docs, get all the api keys, make everything work, debug it in dev, and deploy to prod. This is the actually hard part, not the code itself. Or rather, the better way to think about it is that the entire DevOps lifecycle has to become code, in addition to the necessary sensors/actuators of the CLIs/APIs with agent-native ergonomics. And there should be no need to visit web pages, click buttons, or anything like that for the human. It's easy to state, it's now just barely technically possible and expected to work maybe, but it definitely requires from-scratch re-design, work and thought. Very exciting direction!
Patrick Collison@patrickc

When @karpathy built MenuGen (karpathy.bearblog.dev/vibe-coding-me…), he said: "Vibe coding menugen was exhilarating and fun escapade as a local demo, but a bit of a painful slog as a deployed, real app. Building a modern app is a bit like assembling IKEA furniture. There are all these services, docs, API keys, configurations, dev/prod deployments, team and security features, rate limits, pricing tiers."

We've all run into this issue when building with agents: you have to scurry off to establish accounts, clicking things in the browser as though it's the antediluvian days of 2023, in order to unblock its superintelligent progress.

So we decided to build Stripe Projects to help agents instantly provision services from the CLI. For example, simply run:

$ stripe projects add posthog/analytics

And it'll create a PostHog account, get an API key, and (as needed) set up billing.

Projects is launching today as a developer preview. You can register for access (we'll make it available to everyone soon) at projects.dev. We're also rolling out support for many new providers over the coming weeks. (Get in touch if you'd like to make your service available.)

projects.dev

Replies: 626 · Reposts: 540 · Likes: 6.4K · Views: 2.5M
Komal Mathur
Komal Mathur@KomalBoyo·
Having quite the week
Replies: 0 · Reposts: 0 · Likes: 0 · Views: 36
Komal Mathur retweeted
Anthony Morris ツ
Anthony Morris ツ@amorriscode·
if you're bored waiting for claude to finish doing work, start another session. life is too short to be bored.
Replies: 70 · Reposts: 31 · Likes: 623 · Views: 97.6K
Komal Mathur retweeted
Zara Zhang
Zara Zhang@zarazhangrui·
Almost every AI power user I know is MORE stressed and busier after using AI, not less.

What people thought AI would do: 10x productivity so that we can finish work earlier & relax more.

What it's actually doing: 10x productivity so that we end up with 20x more things to do cos of the sheer possibilities.
Replies: 460 · Reposts: 241 · Likes: 2.7K · Views: 290K