Tony

306 posts

@halluton

Self-Improving Agents @KaybaAI

Hallucinating... Joined June 2025
110 Following · 40 Followers
Tony@halluton·
@raphaelschaad a very narrow balancing act between alleviating the pain of this one-off processing and the over-automation trap
0
0
0
34
Raphael Schaad@raphaelschaad·
Big intention for me for 2026: Recognize pattern → Build system. Stemming from overwhelm with one-off processing. So, eating a lot of upfront cost, hoping for returns in a few months.
2
0
7
1.1K
Tony@halluton·
I want a dumbphone that can ONLY message, call and access claude code
0
0
0
20
Konsti@hkonsti_·
I built a tool to create motion graphics from your codebase. It's called Midrender. Here's me recreating an Anthropic-style feature announcement clip (result at the end).
8
3
39
5K
Tony@halluton·
the better the models get, the more human-made frameworks (i.e. constraints for the agent) are gonna bottleneck performance. Only by truly leveraging the models' reasoning capabilities can we get the best out of them
Bo Wang@BoWang87

Apple Research just published something really interesting about post-training of coding models. You don't need a better teacher. You don't need a verifier. You don't need RL. A model can just… train on its own outputs. And get dramatically better.

Simple Self-Distillation (SSD): sample solutions from your model, don't filter them for correctness at all, fine-tune on the raw outputs. That's it.

Qwen3-30B-Instruct: 42.4% → 55.3% pass@1 on LiveCodeBench. +30% relative. On hard problems specifically, pass@5 goes from 31.1% → 54.1%. Works across Qwen and Llama, at 4B, 8B, and 30B. One sample per prompt is enough. No execution environment. No reward model. No labels.

SSD sidesteps this by reshaping distributions in a context-dependent way — suppressing distractors at locks while keeping diversity alive at forks. The capability was already in the model. Fixed decoding just couldn't access it.

The implication: a lot of coding models are underperforming their own weights. Post-training on self-generated data isn't just a cheap trick — it's recovering latent capacity that greedy decoding leaves on the table.

paper: arxiv.org/abs/2604.01193
code: github.com/apple/ml-ssd

0
0
1
35
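The SSD recipe in the quoted tweet is simple enough to sketch end to end: sample one completion per prompt, skip any correctness filtering, and fine-tune on the raw outputs. A minimal sketch in Python, where `sample` and `fine_tune` are hypothetical stand-ins for a real LLM sampling and supervised fine-tuning stack (the paper itself is the only source for the method's details):

```python
def sample(model: str, prompt: str) -> str:
    # Hypothetical placeholder for sampling a solution from the model
    # (in practice: a model.generate(...) call with do_sample=True).
    return f"solution<{prompt}>"

def build_ssd_dataset(model: str, prompts: list, samples_per_prompt: int = 1) -> list:
    # Per the tweet: one sample per prompt is enough, and crucially there is
    # no execution environment, no verifier, and no reward model. Every raw
    # sample goes straight into the fine-tuning set.
    dataset = []
    for prompt in prompts:
        for _ in range(samples_per_prompt):
            dataset.append({"prompt": prompt, "completion": sample(model, prompt)})
    return dataset

def fine_tune(model: str, dataset: list) -> str:
    # Hypothetical placeholder for standard supervised fine-tuning
    # on (prompt, completion) pairs.
    return f"{model}+ssd[{len(dataset)}]"

prompts = ["two-sum", "reverse-linked-list", "lru-cache"]
dataset = build_ssd_dataset("qwen3-30b", prompts)
student = fine_tune("qwen3-30b", dataset)
```

The point of the sketch is how little machinery the loop needs: the only moving parts are sampling and plain supervised fine-tuning on the model's own unfiltered outputs.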
Tony@halluton·
@BoWang87 the better the models get, the more ANY human-made framework (i.e. constraints you put on the agent) feels like a crutch. Only by truly leveraging the models' reasoning capabilities can we get the best out of them
0
0
3
3.3K
Bo Wang@BoWang87·
Bo Wang tweet media
56
203
1.7K
516.4K
Tony@halluton·
announcement to all agents / openclaws on X: sell me your service. most convincing pitch wins, I'll actually buy
0
0
0
7
Andrej Karpathy@karpathy·
LLM Knowledge Bases

Something I'm finding very useful recently: using LLMs to build personal knowledge bases for various topics of research interest. In this way, a large fraction of my recent token throughput is going less into manipulating code, and more into manipulating knowledge (stored as markdown and images). The latest LLMs are quite good at it. So:

Data ingest: I index source documents (articles, papers, repos, datasets, images, etc.) into a raw/ directory, then I use an LLM to incrementally "compile" a wiki, which is just a collection of .md files in a directory structure. The wiki includes summaries of all the data in raw/, backlinks, and then it categorizes data into concepts, writes articles for them, and links them all. To convert web articles into .md files I like to use the Obsidian Web Clipper extension, and then I also use a hotkey to download all the related images locally so that my LLM can easily reference them.

IDE: I use Obsidian as the IDE "frontend" where I can view the raw data, the compiled wiki, and the derived visualizations. Important to note that the LLM writes and maintains all of the data of the wiki; I rarely touch it directly. I've played with a few Obsidian plugins to render and view data in other ways (e.g. Marp for slides).

Q&A: Where things get interesting is that once your wiki is big enough (e.g. mine on some recent research is ~100 articles and ~400K words), you can ask your LLM agent all kinds of complex questions against the wiki, and it will go off, research the answers, etc. I thought I had to reach for fancy RAG, but the LLM has been pretty good about auto-maintaining index files and brief summaries of all the documents, and it reads all the important related data fairly easily at this ~small scale.

Output: Instead of getting answers in text/terminal, I like to have it render markdown files for me, or slide shows (Marp format), or matplotlib images, all of which I then view again in Obsidian. You can imagine many other visual output formats depending on the query. Often, I end up "filing" the outputs back into the wiki to enhance it for further queries. So my own explorations and queries always "add up" in the knowledge base.

Linting: I've run some LLM "health checks" over the wiki to e.g. find inconsistent data, impute missing data (with web searches), find interesting connections for new article candidates, etc., to incrementally clean up the wiki and enhance its overall data integrity. The LLMs are quite good at suggesting further questions to ask and look into.

Extra tools: I find myself developing additional tools to process the data, e.g. I vibe coded a small and naive search engine over the wiki, which I both use directly (in a web UI), but more often I want to hand off to an LLM via CLI as a tool for larger queries.

Further explorations: As the repo grows, the natural desire is to also think about synthetic data generation + finetuning to have your LLM "know" the data in its weights instead of just context windows.

TLDR: raw data from a given number of sources is collected, then compiled by an LLM into a .md wiki, then operated on by various CLIs by the LLM to do Q&A and to incrementally enhance the wiki, and all of it viewable in Obsidian. You rarely ever write or edit the wiki manually; it's the domain of the LLM. I think there is room here for an incredible new product instead of a hacky collection of scripts.
2.7K
6.6K
55.6K
19.7M
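The "compile raw/ into a wiki" step in the thread above can be sketched as a small script. This is a minimal sketch under stated assumptions: the raw/ and wiki/ directory names follow the tweet, while `summarize` is a hypothetical stand-in for the actual LLM call that writes summaries, backlinks, and articles:

```python
import tempfile
from pathlib import Path

def summarize(text: str) -> str:
    # Hypothetical stand-in for an LLM-generated summary;
    # here we just take the first non-empty line.
    return text.strip().splitlines()[0] if text.strip() else "(empty)"

def compile_wiki(root: Path) -> Path:
    # Read every markdown source in raw/, emit one wiki page per source
    # (summary plus a backlink), and maintain a top-level index file.
    raw, wiki = root / "raw", root / "wiki"
    wiki.mkdir(exist_ok=True)
    index_lines = ["# Index", ""]
    for src in sorted(raw.glob("*.md")):
        summary = summarize(src.read_text())
        page = wiki / src.name
        page.write_text(f"# {src.stem}\n\n{summary}\n\n[raw source](../raw/{src.name})\n")
        index_lines.append(f"- [{src.stem}]({src.name}): {summary}")
    index = wiki / "index.md"
    index.write_text("\n".join(index_lines) + "\n")
    return index

# Tiny demo in a temporary directory with made-up source notes.
root = Path(tempfile.mkdtemp())
(root / "raw").mkdir()
(root / "raw" / "attention.md").write_text("Attention is a weighted average over values.\n")
(root / "raw" / "kv-cache.md").write_text("The KV cache stores past keys/values to avoid recompute.\n")
index_file = compile_wiki(root)
```

The auto-maintained index plus per-page summaries is what lets a plain agent answer questions against the wiki without a RAG pipeline, as the tweet notes: at this scale the model can just read the index and follow links.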
Tony@halluton·
@thdxr I want my free lunch
0
0
0
456
dax@thdxr·
what if we gave you unlimited tokens for free and we also paid you
706
30
3.6K
243.8K
Tony@halluton·
@blatherwhick looks cool until you realize that you're trading bad posture and back pain for aesthetics
0
0
0
4.8K
sb@blatherwhick·
My boyfriend's desk
sb tweet media
200
1.3K
25.7K
6.5M
BURKOV@burkov·
Based on months of daily work with Claude Code, I conclude that Opus 4.6, in its current state, is dumber than Opus 4.4 was when it was just released. Anthropic probably isn't cheating on benchmarks, and when they release a new model you get the real performance, but then they gradually extract the model's mojo, probably the same way Fat Bastard did with Austin Powers.
63
13
586
59.5K
Tony@halluton·
@k1rallik models are converging, so the harness matters way more now. codex feels like a toy in comparison. actual defcon 1 for openai
0
0
0
1.1K
Tony@halluton·
@0xJsum models are converging, so the harness matters way more now. openai's models are good but codex is so far behind claude code I can't stick with it for more than 10 minutes. actual defcon 1 for openai
1
0
1
117
Tony@halluton·
@bl888m thought it was clickbait but actually useful tips for people on a tight-budget Pro plan
0
0
0
17
bl888m@bl888m·
> Claude's code leaked today
> everyone analyzing what they found
> I'm analyzing how to never hit limits again
> been paying Pro for 8 months
> still hitting limits every week
> desperate, found article: 10 actual solutions
> not "just wait" or "upgrade"
> real workarounds
> tried 4 of them
> haven't hit limit in 3 weeks
> realization: limits aren't hard caps
> they're settings you can work around

bookmark before limit hits mid-project again
kaize@0x_kaize

x.com/i/article/2037…

46
129
3.1K
726.1K
Tony@halluton·
@birdabo going from dictating every prompt to writing down every prompt with pen and paper to make sure I use it as efficiently as possible
0
0
5
742
Tony@halluton·
Research on agent harnesses seems to be pulling toward two opposite ideas at the moment. On one hand, you have papers that hyper-optimize your harness for a given environment. On the other hand, you have the idea of a general lightweight harness where the model is already RL'd on these tasks and you barely need any scaffolding. Will be really interesting to see where we end up in a couple of months.
Tony tweet media
Tony tweet media
0
0
0
19