Adithya Krishnan

604 posts

@krish_adi_

--dangerously-skip-permissions/acc | Quack quack @motherduck

Amsterdam · Joined October 2020
933 Following · 181 Followers
Adithya Krishnan@krish_adi_·
claw mode ...
Ole Lehmann@itsolelehmann

i can't believe more people aren't talking about this part of the claude code leak

there's a hidden feature in the source code called KAIROS, and it basically shows you anthropic's endgame

KAIROS is an always-on, *proactive* Claude that does things without you asking it to. it runs in the background 24/7 while you work (or sleep). anthropic hasn't turned it on for the public yet, but the code is fully built.

here's how it works: every few seconds, KAIROS gets a heartbeat, basically a prompt that says "anything worth doing right now?" it looks at what's happening and makes a call: do something, or stay quiet. if it acts, it can fix errors in your code, respond to messages, update files, run tasks... basically anything claude code can already do, just without you telling it to.

but here's what makes KAIROS different from regular claude code: it has (at least) 3 exclusive tools that regular claude code doesn't get:
1. push notifications, so it can reach you on your phone or desktop even when you're not in the terminal
2. file delivery, so it can send you things it created without you asking for them
3. pull request subscriptions, so it can watch your github and react to code changes on its own

regular claude code can only talk to you when you talk to it. KAIROS can tap you on the shoulder.

and it keeps daily logs of everything:
> what it noticed
> what it decided
> what it did
append-only, meaning it can't erase its own history (you can read everything)

at night it runs something the code literally calls "autoDream", where it consolidates what it learned during the day and reorganizes its memory while you sleep. and it persists across sessions: close your laptop friday, open it monday, it's been working the whole time.

think about what this means in practice:
> you're asleep and your website goes down. KAIROS detects it, restarts the server, and sends you a notification. by the time you see it, it's already back up
> you get a customer complaint email at 2am. KAIROS reads it, sends the reply, and logs what it did. you wake up and it's already resolved
> your stripe subscription page has a typo that's been live for 3 days. KAIROS spots it, fixes it, and logs the change

endless use-cases, it's essentially a co-founder who never sleeps.

the codebase has this fully built and gated behind internal feature flags called PROACTIVE and KAIROS.

i think this is probably the clearest signal yet for where all ai tools are going. we are heading into the "post-prompting" era, where the ai just works for you in the background like an all-knowing teammate who notices and handles everything before you even think to ask.
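None of KAIROS's actual code is public in this thread, but the heartbeat-and-decide pattern described above can be sketched in a few lines of TypeScript. Everything here (the `decide` function, the append-only `log`, the strings) is illustrative, not from the leak:

```typescript
// Hypothetical sketch of the heartbeat pattern: every tick, ask
// "anything worth doing right now?" and either act or stay quiet.
type Decision = { act: boolean; task?: string };

// Stand-in for the model call that inspects current context
function decide(context: string[]): Decision {
  const pending = context.find((e) => e.includes("error"));
  return pending ? { act: true, task: `fix: ${pending}` } : { act: false };
}

const log: string[] = []; // append-only daily log: noticed / decided / did

function heartbeat(context: string[]): void {
  const d = decide(context);
  log.push(
    d.act
      ? `noticed+acted: ${d.task}` // do something
      : "nothing worth doing" // stay quiet
  );
}

heartbeat(["build ok", "error: server down"]);
heartbeat(["all quiet"]);
console.log(log);
```

The key property the thread highlights is that the log is only ever appended to, so the agent cannot rewrite its own history.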

Adithya Krishnan@krish_adi_·
@soraofficialapp Not surprising. What would've been surprising is if they had managed to get it integrated with ChatGPT.
Sora@soraofficialapp·
We’re saying goodbye to the Sora app. To everyone who created with Sora, shared it, and built community around it: thank you. What you made with Sora mattered, and we know this news is disappointing. We’ll share more soon, including timelines for the app and API and details on preserving your work. – The Sora Team
Adithya Krishnan@krish_adi_·
Are RL environments for the sciences different from those for, say, SWE? In image, audio, video, and prose generation, the compounding effects of AI slop are ignorable. But in the sciences they compound until the entire study is useless.
Adithya Krishnan@krish_adi_·
Great read for anyone working in AI for science. While many in this space are building fine-tuned models and harnesses, the verifiability of the intermediate steps is the biggest bottleneck. How would you build an RL environment for physics or chemistry?
Anthropic@AnthropicAI

We’re launching with two new posts. Can AI do theoretical physics? Harvard physicist Matthew Schwartz led Claude Opus 4.5 through a graduate-level calculation. AI can’t yet do original work autonomously, but it can vastly accelerate it. Read more: anthropic.com/research/vibe-…

dax@thdxr·
we've been experimenting with getting rid of the bash tool. agents can write js fine, which can do what bash can (though with some gaps, like git) and is more cross-platform. and then we could run that in this:
Rivet@rivet_dev

Introducing the Secure Exec SDK
Secure Node.js execution without a sandbox
⚡ 17.9 ms coldstart, 3.4 MB mem, 56x cheaper
📦 Just a library – supports Node.js, Bun, & browsers
🔐 Powered by the same tech as Cloudflare Workers
$ npm install secure-exec

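dax's idea of swapping the bash tool for JS is easy to sketch: Node's standard library already covers the common ls/cat/grep moves, cross-platform. A minimal illustration (the file name and contents here are made up):

```typescript
// Sketch: an agent emitting JS instead of bash commands.
import { readdirSync, readFileSync, writeFileSync } from "node:fs";

writeFileSync("note.txt", "TODO: ship it\ndone: tests\n");

// equivalent of `ls`
const files = readdirSync(".");

// equivalent of `cat note.txt | grep TODO`
const todos = readFileSync("note.txt", "utf8")
  .split("\n")
  .filter((line) => line.includes("TODO"));

console.log(files.includes("note.txt"), todos);
```

The trade-off dax flags is real: things like git still want a subprocess, but plain file inspection and manipulation need no shell at all.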
Adithya Krishnan reposted
Mike Freedman@michaelfreedman·
Introducing TigerFS - a filesystem backed by PostgreSQL, and a filesystem interface to PostgreSQL.

Idea is simple: Agents don't need fancy APIs or SDKs, they love the file system. ls, cat, find, grep. Pipelined UNIX tools. So let's make files transactional and concurrent by backing them with a real database.

There are two ways to use it:

File-first: Write markdown, organize into directories. Writes are atomic, everything is auto-versioned. Any tool that works with files -- Claude Code, Cursor, grep, emacs -- just works. Multi-agent task coordination is just mv'ing files between todo/doing/done directories.

Data-first: Mount any Postgres database and explore it with Unix tools. For large databases, chain filters into paths that push down to SQL: .by/customer_id/123/.order/created_at/.last/10/.export/json. Bulk import/export, no SQL needed, and ships with Claude Code skills.

Every file is a real PostgreSQL row. Multiple agents and humans read and write concurrently with full ACID guarantees. The filesystem /is/ the API. Mounts via FUSE on Linux and NFS on macOS, no extra dependencies. Point it at an existing Postgres database, or spin up a free one on Tiger Cloud or Ghost.

I built this mostly for agent workflows, but curious what else people would use it for. It's early but the core is solid. Feedback welcome. tigerfs.io
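The todo/doing/done coordination trick from the File-first mode can be sketched with plain directories and atomic rename. This is a generic illustration of the pattern, not TigerFS code (TigerFS backs these operations with Postgres rows instead of a bare filesystem):

```typescript
// Multi-agent task coordination via mv between todo/doing/done.
import { mkdirSync, writeFileSync, renameSync, readdirSync } from "node:fs";
import { join } from "node:path";

for (const d of ["todo", "doing", "done"]) mkdirSync(d, { recursive: true });
writeFileSync(join("todo", "task-1.md"), "# fix login bug\n");

// An agent claims a task: rename is atomic on one filesystem,
// so if two agents race, exactly one rename succeeds.
function claim(task: string): boolean {
  try {
    renameSync(join("todo", task), join("doing", task));
    return true;
  } catch {
    return false; // another agent got it first
  }
}

claim("task-1.md");
renameSync(join("doing", "task-1.md"), join("done", "task-1.md")); // finish

console.log(readdirSync("todo").length, readdirSync("done"));
```

With a database behind the files, the same moves additionally get versioning and ACID guarantees across concurrent writers.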
Adithya Krishnan reposted
Jesse Peltan@JessePeltan·
We should revive traditional clothing. So many cultures have really cool traditions and manufacturing techniques shaped by local materials and climate. Everything has become so homogenized now that many people don’t even know how their ancestors dressed.
Jason Smith - 上官杰文@ShangguanJiewen

Guangxi gets a 4-day official holiday for the Zhuang ethnic group in March. Because in China, minorities have their own holidays, and these are respected under the law.

Adithya Krishnan reposted
Balaji@balajis·
I'm going to make some obvious points.

(1) Blowing up all the oil infrastructure in the Middle East is an insane idea, and may well result in a global economic crash and a humanitarian crisis unrivaled in the lives of those now living. We're talking about the price of everything everywhere rising, from food to gas, at a moment when inflation was already high. All of that will be laid at the feet of the authors of this war.

(2) The antebellum status quo of Feb 27, 2026 was just not that bad, but we're unlikely to return to it. Expect indefinite, long-term, ongoing disruptions to everything out of the Middle East.

(3) Also assume tech financing crashes for the indefinite future. The genius plan to get the Gulf states caught in the crossfire has incinerated much of the funding for LPs, for datacenters, and for IPOs. Anyone in tech who supported this war may soon learn the meaning of "force majeure" as funding gets yanked.

(4) Many capital allocators will instead be allocating much further down Maslow's hierarchy of needs, towards useful basic things like food and energy.

(5) It's fortunate that all those progressives yelled about the "climate crisis." Yes, their reasoning about timelines was wrong, and much of the money was wasted in graft, but the result was right: we all need energy independence from the Middle East, pronto. It's also fortunate that Elon and China autistically took climate seriously. Now they're going to need to ship a billion solar panels, electric vehicles, batteries, nuclear power plants, and the like to get everyone off oil, immediately.

(6) It's not just an oil and gas problem, of course. It's also a fertilizer problem, and a chemical precursor problem. Maybe some new sources will come online at the new prices, but it takes time to dial stuff up, particularly at this scale, so shortages are almost a certainty. That said, China has actually scaled up coal-to-chemicals[a,c] (C2C), and there's also something more sci-fi called Power-to-X[b], which turns arbitrary power + water + air into hydrocarbons. But all of that will need to get accelerated. I have a background in chemical engineering so may start funding things in this area.

(7) Ultimately, this war is going to result in tremendous blame for anyone associated with it. It's a no-win scenario to blow up this much infrastructure for so many people. Simply not worth it for whatever objective they thought they were going to attain. But unless you're actually in a position to stop the madness, the pragmatic thing to do is: scramble to mitigate the fallout to yourself, your business, and your people.

[a]: reuters.com/business/energ…
[b]: alfalaval.com/industries/ene…
[c]: reuters.com/sustainability…
Adithya Krishnan@krish_adi_·
My version of this is: don't stay in the path of the model; let it decide what it needs to do, just give it the tools. If you're AGI-pilled you'd fall back to letting the model do everything, as opposed to handholding it.
Drew Bent@drew_bent

I see people at Anthropic who didn't necessarily start that way getting better at it. Part of it is being surrounded by others who are AGI-pilled + watching how they push the models. But ultimately...
1. Ask yourself: what if the exponential actually continues
2. Take a task and handhold the AI less, be more ambitious, try to do more of it end-to-end with AI
3. Do #2 enough until you reach the limits of current AI and it fails
4. Wait until the models get better and can successfully complete that task
5. Learn from this. Update your strategy. Rethink what the future looks like.
And practice that over & over

Adithya Krishnan@krish_adi_·
What's the SOTA on agent to agent authentication? Specifically NOT MCP!
Adithya Krishnan@krish_adi_·
all the open source coding agent harnesses are in TypeScript, and I bet it's because of @bunjavascript !
Adithya Krishnan@krish_adi_·
There are so many benchmarks for different aspects of SWE agents!! Perhaps that's why we have so much progress in agents for coding as opposed to other domains? Being formally verifiable is difficult in many domains...
Can Bölük@_can1357

x.com/i/article/2021…

Adithya Krishnan@krish_adi_·
@nikunj Benchmarks capture the attention of the audience, a kind of ticket to enter the playing field.
Nikunj Kothari@nikunj·
Genuinely curious - has any engineer made a decision on a model (or harness) based on a benchmark result? Every researcher / engineer I have talked to routinely dismisses them. They trust their taste, their evals, and how models are performing for their specific use case. It seems like benchmarks are now simply used by the labs to show how they're slightly ahead, but in my head they have no material impact on whether a model is used or not. I'd love to be proven wrong so please push back!
Adithya Krishnan@krish_adi_·
@GergelyOrosz And the not-so-surprising part is that there are so many open source alternatives now that you could just self-host. And, if willing, these people could contribute back with money or code.
Adithya Krishnan@krish_adi_·
@JayScambler I like the idea though, how does it work in practice? Any use case that you applied this to? Anything you learned along the way building this?
Jay Scambler@JayScambler·
@krish_adi_ Well the repo was just made public this morning - idk what to tell you. I don't control the views or the stars
Jay Scambler@JayScambler·
Introducing autocontext: a recursive self-improving harness designed to help your agents (and future iterations of those agents) succeed on any task. I built this for our clients with the intention of commercializing it but the community support around Karpathy's autoresearch convinced me to open source it instead. Our space is on the verge of something big and we want to do our part.
Andrej Karpathy@karpathy

Three days ago I left autoresearch tuning nanochat for ~2 days on a depth=12 model. It found ~20 changes that improved the validation loss. I tested these changes yesterday and all of them were additive and transferred to larger (depth=24) models. Stacking up all of these changes, today I measured that the leaderboard's "Time to GPT-2" drops from 2.02 hours to 1.80 hours (~11% improvement); this will be the new leaderboard entry. So yes, these are real improvements and they make an actual difference.

I am mildly surprised that my very first naive attempt already worked this well on top of what I thought was already a fairly manually well-tuned project. This is a first for me because I am very used to doing the iterative optimization of neural network training manually. You come up with ideas, you implement them, you check if they work (better validation loss), you come up with new ideas based on that, you read some papers for inspiration, etc. This is the bread and butter of what I do daily, for 2 decades now. Seeing the agent do this entire workflow end-to-end and all by itself as it worked through approx. 700 changes autonomously is wild. It really looked at the sequence of results of experiments and used that to plan the next ones.

It's not novel, ground-breaking "research" (yet), but all the adjustments are "real": I didn't find them manually previously, and they stack up and actually improved nanochat. Among the bigger things e.g.:
- It noticed an oversight that my parameterless QKnorm didn't have a scalar multiplier attached, so my attention was too diffuse. The agent found multipliers to sharpen it, pointing to future work.
- It found that the Value Embeddings really like regularization and I wasn't applying any (oops).
- It found that my banded attention was too conservative (I forgot to tune it).
- It found that AdamW betas were all messed up.
- It tuned the weight decay schedule.
- It tuned the network initialization.

This is on top of all the tuning I've already done over a good amount of time. The exact commit is here, from this "round 1" of autoresearch. I am going to kick off "round 2", and in parallel I am looking at how multiple agents can collaborate to unlock parallelism. github.com/karpathy/nanoc…

All LLM frontier labs will do this. It's the final boss battle. It's a lot more complex at scale of course - you don't just have a single train.py file to tune. But doing it is "just engineering" and it's going to work. You spin up a swarm of agents, you have them collaborate to tune smaller models, you promote the most promising ideas to increasingly larger scales, and humans (optionally) contribute on the edges.

And more generally, *any* metric you care about that is reasonably efficient to evaluate (or that has a more efficient proxy metric, such as training a smaller network) can be autoresearched by an agent swarm. It's worth thinking about whether your problem falls into this bucket too.
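The core loop Karpathy describes (propose a change, measure, keep only improvements) can be sketched generically. The `evaluate` function below is a toy proxy metric standing in for validation loss, and `propose` is a naive neighborhood search; nothing here is from nanochat or autoresearch itself:

```typescript
// Propose -> measure -> keep-if-better, the skeleton of the workflow.
type Config = { lr: number; weightDecay: number };

// Toy stand-in for "train and report validation loss": lower is better.
function evaluate(c: Config): number {
  return Math.abs(c.lr - 0.003) + Math.abs(c.weightDecay - 0.1);
}

// Naive proposal: perturb both hyperparameters by a small step.
function propose(base: Config, i: number): Config {
  const step = 0.001 * ((i % 3) - 1);
  return { lr: base.lr + step, weightDecay: base.weightDecay + step * 10 };
}

let best: Config = { lr: 0.001, weightDecay: 0.05 };
let bestLoss = evaluate(best);
for (let i = 0; i < 100; i++) {
  const cand = propose(best, i);
  const loss = evaluate(cand);
  if (loss < bestLoss) {
    best = cand; // keep only changes that improve the metric
    bestLoss = loss;
  }
}
console.log(bestLoss < evaluate({ lr: 0.001, weightDecay: 0.05 }));
```

As the tweet notes, this only pays off when the metric is cheap to evaluate or has a cheap proxy; the agent's contribution is running and interpreting hundreds of such iterations unattended.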

Adithya Krishnan@krish_adi_·
@DhravyaShah Doing all this in RAM is pretty crazy, when most new products in this space are moving to SSD/NVMe-first methods, solely because it's so expensive to do this at scale. Nice balanced write-up btw.
Dhravya Shah@DhravyaShah·
been building in this space for years now, and have followed nishkarsh for years as well - congrats on the launch! since this is in the same space we're building in, i dived deep into it and have thoughts. the launch itself is very hype-y, and is meant to trigger rage bait

1. it's positioned as a database, but is almost a @supermemory-like system
2. their example of "vector dbs" not being able to do this is really a question of "embedding models". and embedding models have superpositions; they are cheap and are easily able to infer differences between them. it's not hard to ask claude to do a mini experiment to prove this (attached below). what does matter is: is it able to track how knowledge evolves? time passes? this made me curious so i read their paper
3. their research paper is hardcoding and gaming the benchmark with a different prompt for every category!!! (see image below). if their benchmarking is fixed, supermemory will remain the SOTA.
4. they reinvented the contextual retrieval paper by Anthropic from 2024 and called it "the orphaned pronoun paradox"
5. they mention they use a custom "in-memory vector store" = at about 500GB, you will have to pay more than $10k for just the RAM.
6. inference is run too many times in the pipeline - which means for every LLM token you ingest, you will end up paying 5x more than token cost for the graph + contextualization + storage.
7. latency and cost numbers were never reported. my hunch is that, because of the architecture, the latency will struggle at scale. but i can't tell - their product is behind a demo gate.
8. the benchmarking code is not OSS (from what i can tell). not replicable + who knows how much context they are injecting into the model? what's the K?
9. inorganic, undisclosed ads (just read the quote tweets). influencer accounts with 400k+ followers all saying the same thing. people keep getting away with this @nikitabier lol

i'm all in for healthy competition and progress in this field, and enjoy seeing good work being done by others. but it's easy to just say things. "no one will check." playing the game the right way is hard, and everyone's just saying whatever they can to impress people.

TLDR is: you should use this if you want to spend 2-5x more for no real marginal improvement and enjoy unhealthy research and business practices.

attached:
1. experiment to disprove the hypothesis of vector dbs not understanding grey vs grey
2. one of their prompts, which just says "say i don't know". they scored 100% :)
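The kind of mini experiment mentioned in point 2 boils down to cosine similarity between embedding vectors. A toy sketch with hand-made 3-dimensional vectors standing in for real embeddings, so it shows the arithmetic rather than an actual model's behavior:

```typescript
// Cosine similarity: dot product normalized by vector magnitudes.
function cosine(a: number[], b: number[]): number {
  const dot = a.reduce((s, x, i) => s + x * b[i], 0);
  const norm = (v: number[]) => Math.sqrt(v.reduce((s, x) => s + x * x, 0));
  return dot / (norm(a) * norm(b));
}

// Hypothetical embeddings for two senses of "grey"-like mentions;
// a real test would get these from an embedding model's API.
const greyColor = [0.9, 0.1, 0.0];
const greyCompany = [0.1, 0.9, 0.2];
const silver = [0.8, 0.2, 0.1]; // near the color, far from the company

console.log(cosine(greyColor, silver) > cosine(greyCompany, silver));
```

If a real embedding model separates the two senses the same way these toy vectors do, the "vector dbs can't tell grey from grey" claim reduces to a model-quality question, which is the point being made.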
Nishkarsh@contextkingceo

We've raised $6.5M to kill vector databases. Every system today retrieves context the same way: vector search that stores everything as flat embeddings and returns whatever "feels" closest. Similar, sure. Relevant? Almost never. Embeddings can’t tell a Q3 renewal clause from a Q1 termination notice if the language is close enough. A friend of mine asked his AI about a contract last week, and it returned a detailed, perfectly crafted answer pulled from a completely different client’s file. Once you’re dealing with 10M+ documents, these mix-ups happen all the time. VectorDB accuracy goes to shit. We built @hydra_db for exactly this. HydraDB builds an ontology-first context graph over your data, maps relationships between entities, understands the 'why' behind documents, and tracks how information evolves over time. So when you ask about 'Apple,' it knows you mean the company you're serving as a customer. Not the fruit. Even when a vector DB's similarity score says 0.94. More below ⬇️
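A context graph of the sort HydraDB describes can be illustrated, very loosely (this is not their design), as entities with typed edges, where retrieval follows relationships instead of raw vector similarity. All names below are invented for the example:

```typescript
// Disambiguating a mention by relationship, not similarity score.
type Edge = { rel: string; to: string };

const graph = new Map<string, Edge[]>([
  ["Apple Inc.", [{ rel: "customer_of", to: "our-company" }]],
  ["apple (fruit)", [{ rel: "instance_of", to: "fruit" }]],
]);

// Resolve an ambiguous mention by requiring a specific relationship,
// e.g. "the Apple that is a customer", regardless of embedding distance.
function resolve(mention: string, rel: string): string | undefined {
  for (const [entity, edges] of graph.entries()) {
    if (
      entity.toLowerCase().includes(mention.toLowerCase()) &&
      edges.some((e) => e.rel === rel)
    ) {
      return entity;
    }
  }
  return undefined;
}

console.log(resolve("Apple", "customer_of")); // "Apple Inc."
```

The contrast with flat embeddings is that the query carries the relationship ("customer_of"), so a high similarity score for the wrong sense never enters the decision.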
