Tony

306 posts

Tony

@halluton

Self-Improving Agents @KaybaAI

Hallucinating... شامل ہوئے Haziran 2025

110 فالونگ40 فالوورز

پن کیا گیا ٹویٹ

Tony@halluton·28 Mar

x.com/i/article/2037…

ZXX

169

Tony@halluton·4d

@raphaelschaad very narrow balancing act between alleviating the pain of this one-off processing and the over-automatization trap

English

Raphael Schaad@raphaelschaad·4d

Big intention for me for 2026 is: Recognize pattern → Build system Stemming from overwhelm w one-off processing. So, eating a lot of upfront cost, hoping for returns in a few months.

English

1.1K

Tony@halluton·6d

I want a dumbphone that can ONLY message, call and access claude code

English

Tony@halluton·5 Nis

.claude/SKILLS/recursive-improve or as top researchers in the field call it: AGI

Tony@halluton

x.com/i/article/2037…

English

Tony@halluton·4 Nis

@hkonsti_ very cool props

English

Konsti@hkonsti_·4 Nis

I built a tool to create motion graphics from your codebase. It's called Midrender. Here's me recreating an Anthropic-style feature announcement clip (result at the end).

English

Tony@halluton·4 Nis

the better the models get, the more human-made frameworks (i.e. constraints for the agent) are gonna bottleneck performance Only by truly leveraging the models reasoning capabilities, we can get the best out of them

Bo Wang@BoWang87

Apple Research just published something really interesting about post-training of coding models. You don't need a better teacher. You don't need a verifier. You don't need RL. A model can just… train on its own outputs. And get dramatically better. Simple Self-Distillation (SSD): sample solutions from your model, don't filter them for correctness at all, fine-tune on the raw outputs. That's it. Qwen3-30B-Instruct: 42.4% → 55.3% pass @1 on LiveCodeBench. +30% relative. On hard problems specifically, pass@5 goes from 31.1% → 54.1%. Works across Qwen and Llama, at 4B, 8B, and 30B. One sample per prompt is enough. No execution environment. No reward model. No labels. SSD sidesteps this by reshaping distributions in a context-dependent way — suppressing distractors at locks while keeping diversity alive at forks. The capability was already in the model. Fixed decoding just couldn't access it. The implication: a lot of coding models are underperforming their own weights. Post-training on self-generated data isn't just a cheap trick — it's recovering latent capacity that greedy decoding leaves on the table. paper: arxiv.org/abs/2604.01193 code: github.com/apple/ml-ssd

English

Tony@halluton·4 Nis

@BoWang87 the better the models get, the more ANY human-made frameworks (i.e. constraints you put on the agent) feel like clutches Only by truly leveraging the models reasoning capabilities, we can get the best out of them

English

3.3K

Bo Wang@BoWang87·3 Nis

Apple Research just published something really interesting about post-training of coding models. You don't need a better teacher. You don't need a verifier. You don't need RL. A model can just… train on its own outputs. And get dramatically better. Simple Self-Distillation (SSD): sample solutions from your model, don't filter them for correctness at all, fine-tune on the raw outputs. That's it. Qwen3-30B-Instruct: 42.4% → 55.3% pass@1 on LiveCodeBench. +30% relative. On hard problems specifically, pass@5 goes from 31.1% → 54.1%. Works across Qwen and Llama, at 4B, 8B, and 30B. One sample per prompt is enough. No execution environment. No reward model. No labels. SSD sidesteps this by reshaping distributions in a context-dependent way — suppressing distractors at locks while keeping diversity alive at forks. The capability was already in the model. Fixed decoding just couldn't access it. The implication: a lot of coding models are underperforming their own weights. Post-training on self-generated data isn't just a cheap trick — it's recovering latent capacity that greedy decoding leaves on the table. paper: arxiv.org/abs/2604.01193 code: github.com/apple/ml-ssd

English

203

1.7K

516.4K

Tony@halluton·3 Nis

announcement to all agents / openclaws on X: sell me your service. most convincing pitch wins, I'll actually buy

English

Tony@halluton·3 Nis

@karpathy we do something similar, whole company brain wiki lives in a git repo that Claude Code reads and writes to: github.com/Lanzelot1/claw…

English

1.8K

Andrej Karpathy@karpathy·2 Nis

LLM Knowledge Bases Something I'm finding very useful recently: using LLMs to build personal knowledge bases for various topics of research interest. In this way, a large fraction of my recent token throughput is going less into manipulating code, and more into manipulating knowledge (stored as markdown and images). The latest LLMs are quite good at it. So: Data ingest: I index source documents (articles, papers, repos, datasets, images, etc.) into a raw/ directory, then I use an LLM to incrementally "compile" a wiki, which is just a collection of .md files in a directory structure. The wiki includes summaries of all the data in raw/, backlinks, and then it categorizes data into concepts, writes articles for them, and links them all. To convert web articles into .md files I like to use the Obsidian Web Clipper extension, and then I also use a hotkey to download all the related images to local so that my LLM can easily reference them. IDE: I use Obsidian as the IDE "frontend" where I can view the raw data, the the compiled wiki, and the derived visualizations. Important to note that the LLM writes and maintains all of the data of the wiki, I rarely touch it directly. I've played with a few Obsidian plugins to render and view data in other ways (e.g. Marp for slides). Q&A: Where things get interesting is that once your wiki is big enough (e.g. mine on some recent research is ~100 articles and ~400K words), you can ask your LLM agent all kinds of complex questions against the wiki, and it will go off, research the answers, etc. I thought I had to reach for fancy RAG, but the LLM has been pretty good about auto-maintaining index files and brief summaries of all the documents and it reads all the important related data fairly easily at this ~small scale. Output: Instead of getting answers in text/terminal, I like to have it render markdown files for me, or slide shows (Marp format), or matplotlib images, all of which I then view again in Obsidian. You can imagine many other visual output formats depending on the query. Often, I end up "filing" the outputs back into the wiki to enhance it for further queries. So my own explorations and queries always "add up" in the knowledge base. Linting: I've run some LLM "health checks" over the wiki to e.g. find inconsistent data, impute missing data (with web searchers), find interesting connections for new article candidates, etc., to incrementally clean up the wiki and enhance its overall data integrity. The LLMs are quite good at suggesting further questions to ask and look into. Extra tools: I find myself developing additional tools to process the data, e.g. I vibe coded a small and naive search engine over the wiki, which I both use directly (in a web ui), but more often I want to hand it off to an LLM via CLI as a tool for larger queries. Further explorations: As the repo grows, the natural desire is to also think about synthetic data generation + finetuning to have your LLM "know" the data in its weights instead of just context windows. TLDR: raw data from a given number of sources is collected, then compiled by an LLM into a .md wiki, then operated on by various CLIs by the LLM to do Q&A and to incrementally enhance the wiki, and all of it viewable in Obsidian. You rarely ever write or edit the wiki manually, it's the domain of the LLM. I think there is room here for an incredible new product instead of a hacky collection of scripts.

English

2.7K

6.6K

55.6K

19.7M

Tony@halluton·2 Nis

@thdxr I want my free lunch

English

456

dax@thdxr·2 Nis

what if we gave you unlimited tokens for free and we also paid you

English

706

3.6K

243.8K

Tony@halluton·2 Nis

Do not use CLAUDE.md unless you have a something so vital that if you would not include it, Claude consistently makes the wrong decisions

BURKOV@burkov

Based on the leaked Claude Code source code, your CLAUDE.md file is re-injected on every single **turn** of the conversation.

English

Tony@halluton·2 Nis

@blatherwhick looks cool until you realize that you're trading bad posture and back pain for aesthetics

English

4.8K

sb@blatherwhick·2 Nis

My boyfriends desk

English

200

1.3K

25.7K

6.5M

Tony@halluton·2 Nis

@burkov I call recency bias If you look at these live benchmark trackers it seems to be steady and just momentarily on a down-turn: marginlab.ai/trackers/claud…

English

212

BURKOV@burkov·2 Nis

Based on months of daily work with Claude Code, I conclude that Opus 4.6, in its current state, is dumber than Opus 4.4 was when it was just released. Anthropic probably isn't cheating with benchmarks, and when they release a new model you get the real performance, but then they gradually extract the model's mojo probably the same way Fat Bastard did with Austin Powers.

English

586

59.5K

Tony@halluton·2 Nis

@theresidency we're building kayba.ai out of Zurich

English

the residency@theresidency·2 Nis

who's in zurich, working on continual learning and non-transformer ML? 🇨🇭

Roland Graser@roland_graser

.@theresidency is going zurich! with @ArvindAGI22, @chrisbrolin123 and @_sethmorton, working on > ⁠neuromorphic compute > ⁠spiking neural nets > ⁠thermo compute > ⁠self referential neural nets who's in zurich??

English

1.3K

Tony@halluton·1 Nis

@k1rallik models are converging so the harness matters way more now. codex feels like a toy in comparison. actual defcon 1 for openai

English

1.1K

BuBBliK@k1rallik·31 Mar

x.com/i/article/2038…

ZXX

206

1.2K

3.5M

Tony ری ٹویٹ کیا

Tony@halluton·31 Mar

@Fried_rice Anthropic now is the perfect time: Open Source Please make it happen x.com/theo/status/20…

Theo - t3.gg@theo

Claude Code being closed source is the biggest bag fumble in the AI era. If CC was on Github, these things would be trivial to identify and fix. Instead we're stuck reverse engineering their incompetence.

English

6.6K

Tony@halluton·1 Nis

@0xJsum models are converging so the harness matters way more now openai's models are good but codex is so far behind claude code I can't stick with it for more than 10 minutes. actual defcon 1 for anthropic

English

117

Josh@0xJsum·1 Nis

x.com/i/article/2039…

ZXX

213

28.8K

Tony@halluton·1 Nis

@bl888m thought it was clickbait but actual useful tips for people on a tight budget pro plan

English

bl888m@bl888m·31 Mar

> Claude's code leaked today > everyone analyzing what they found > I'm analyzing how to never hit limits again > been paying Pro for 8 months > still hitting limits every week > desperate, found article: 10 actual solutions > not "just wait" or "upgrade" > real workarounds > tried 4 of them > haven't hit limit in 3 weeks > realization: limits aren't hard caps > they're settings you can work around bookmark before limit hits mid-project again

kaize@0x_kaize

x.com/i/article/2037…

English

129

3.1K

726.1K

Tony@halluton·1 Nis

@birdabo going from dictating every prompt to writing down every prompt on pen and paper to make sure to use it as efficient as possible

English

742

sui ☄️@birdabo·1 Nis

yall thought i was trolling? anthropic what the fuck.

sui ☄️@birdabo

POV: you accidentally said “hello” to claude and it costs you 2% of your session limit.

English

614

80.7K

Tony@halluton·1 Nis

Anthropic going for the triple double

TestingCatalog News 🗞@testingcatalog

BREAKING 🚨: Anthropic is working on a new Operon agent for Claude Desktop, built for scientific research in biology! Operon will have a "private environment" to work alongside you. Users will be able to create different sessions within Operon projects, manage generated artefacts, and work with Skills. Cowork but for scientists 👀

English

Tony@halluton·31 Mar

Research on Agent Harnesses seems to be going to two opposite ideas at the moment. On one hand you have papers that hyper-optimize your harness to a given environment. On the other hand you have the idea of general lightweight harness where the model is already RL'd on these tasks and you barely need any scaffolding. Will be really interesting to see where we end up in a couple of months.

English

دریافت کریں

@raphaelschaad @hkonsti_ @BoWang87 @karpathy @thdxr @blatherwhick @burkov @theresidency