Zayne Sprague ✈️ ICLR Rio
@ZayneSprague
152 posts

Ph.D. student at NYU. My interest is in NLP, RL, and CogSci research focusing on reasoning in AI models. (he/him)

NYC · Joined November 2022
209 Following · 479 Followers
Zayne Sprague ✈️ ICLR Rio retweeted
Zhiqiang Shen@szq0214·
🎉🎉🎉We have released a technical report, A Systematic and Comprehensive Analysis of Claude Code, in which we present our analysis together with discussions and possible future directions for today's and next-generation AI agent systems. The arXiv submission is still pending, but the paper is currently accessible on our GitHub. GitHub: github.com/VILA-Lab/Dive-… Paper: github.com/VILA-Lab/Dive-… Welcome to check it out and share feedback!
Zayne Sprague ✈️ ICLR Rio
SkillFactory will be at ICLR this year! We study how self-distillation can create synthetic data that primes models for RLVR through SFT. We built a recipe for teaching your model new capabilities it doesn’t have yet, matching and sometimes outperforming teacher distillation.
Zayne Sprague ✈️ ICLR Rio@ZayneSprague

RL amplifies existing behaviors. Let’s prime models w/ good behaviors for better RL! Introducing SkillFactory: ✂️Rearrange model traces on a problem to demo verification + retry ⚙️SFT on those traces 🦾RL Result: Learn robust explicit verification + retry across domains 🧵
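The "rearrange traces" step above can be pictured as a data transform. A toy sketch (all names here are hypothetical illustrations, not the SkillFactory implementation): splice a failed trace and a successful trace for the same problem into one SFT example that demonstrates explicit verification and retry.

```python
# Toy illustration of building a verification + retry trace from two sampled
# traces on the same problem (hypothetical names, not the actual recipe).

def make_verify_retry_trace(problem: str, failed: str, wrong_answer: str,
                            succeeded: str, right_answer: str) -> str:
    """Build a synthetic reasoning trace: attempt -> verify -> retry."""
    return (
        f"Problem: {problem}\n"
        f"Attempt 1: {failed}\nAnswer: {wrong_answer}\n"
        f"Verification: checking the answer... {wrong_answer} does not satisfy "
        f"the problem constraints, so let me retry.\n"
        f"Attempt 2: {succeeded}\nAnswer: {right_answer}\n"
    )

example = make_verify_retry_trace(
    problem="What is 17 * 24?",
    failed="17 * 24 = 17 * 20 + 17 * 3", wrong_answer="391",
    succeeded="17 * 24 = 17 * 20 + 17 * 4 = 340 + 68", right_answer="408",
)
```

SFT on traces shaped like this primes the model with the verify-then-retry behavior that RL can then amplify.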

Zayne Sprague ✈️ ICLR Rio retweeted
Andrej Karpathy@karpathy·
Wow, this tweet went very viral! I wanted to share a possibly slightly improved version of the tweet in an "idea file". The idea of the idea file is that in this era of LLM agents, there is less of a point/need to share the specific code/app; you just share the idea, then the other person's agent customizes & builds it for your specific needs. So here's the idea in a gist format: gist.github.com/karpathy/442a6… You can give this to your agent and it can build you your own LLM wiki and guide you on how to use it etc. It's intentionally kept a little bit abstract/vague because there are so many directions to take this in. And ofc, people can adjust the idea or contribute their own in the Discussion, which is cool.
Andrej Karpathy@karpathy

LLM Knowledge Bases

Something I'm finding very useful recently: using LLMs to build personal knowledge bases for various topics of research interest. In this way, a large fraction of my recent token throughput is going less into manipulating code, and more into manipulating knowledge (stored as markdown and images). The latest LLMs are quite good at it. So:

Data ingest: I index source documents (articles, papers, repos, datasets, images, etc.) into a raw/ directory, then I use an LLM to incrementally "compile" a wiki, which is just a collection of .md files in a directory structure. The wiki includes summaries of all the data in raw/, backlinks, and then it categorizes data into concepts, writes articles for them, and links them all. To convert web articles into .md files I like to use the Obsidian Web Clipper extension, and then I also use a hotkey to download all the related images to local so that my LLM can easily reference them.

IDE: I use Obsidian as the IDE "frontend" where I can view the raw data, the compiled wiki, and the derived visualizations. Important to note that the LLM writes and maintains all of the data of the wiki; I rarely touch it directly. I've played with a few Obsidian plugins to render and view data in other ways (e.g. Marp for slides).

Q&A: Where things get interesting is that once your wiki is big enough (e.g. mine on some recent research is ~100 articles and ~400K words), you can ask your LLM agent all kinds of complex questions against the wiki, and it will go off, research the answers, etc. I thought I had to reach for fancy RAG, but the LLM has been pretty good about auto-maintaining index files and brief summaries of all the documents, and it reads all the important related data fairly easily at this ~small scale.

Output: Instead of getting answers in text/terminal, I like to have it render markdown files for me, or slide shows (Marp format), or matplotlib images, all of which I then view again in Obsidian. You can imagine many other visual output formats depending on the query. Often, I end up "filing" the outputs back into the wiki to enhance it for further queries. So my own explorations and queries always "add up" in the knowledge base.

Linting: I've run some LLM "health checks" over the wiki to e.g. find inconsistent data, impute missing data (with web searches), find interesting connections for new article candidates, etc., to incrementally clean up the wiki and enhance its overall data integrity. The LLMs are quite good at suggesting further questions to ask and look into.

Extra tools: I find myself developing additional tools to process the data, e.g. I vibe coded a small and naive search engine over the wiki, which I sometimes use directly (in a web UI), but more often hand off to an LLM via CLI as a tool for larger queries.

Further explorations: As the repo grows, the natural desire is to also think about synthetic data generation + finetuning to have your LLM "know" the data in its weights instead of just context windows.

TLDR: raw data from a given number of sources is collected, then compiled by an LLM into a .md wiki, then operated on by various CLIs by the LLM to do Q&A and to incrementally enhance the wiki, and all of it viewable in Obsidian. You rarely ever write or edit the wiki manually; it's the domain of the LLM. I think there is room here for an incredible new product instead of a hacky collection of scripts.
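A minimal sketch of the kind of "small and naive search engine over the wiki" the tweet mentions, assuming the wiki is a directory tree of .md files: rank files by simple term-frequency overlap with a query. This is illustrative only, not the tool from the tweet.

```python
# Naive keyword search over a directory of markdown files:
# score = total count of query terms appearing in each file.
import os
import re
from collections import Counter

def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())

def search_wiki(wiki_dir: str, query: str, top_k: int = 5) -> list[tuple[str, int]]:
    """Return (path, score) pairs for the top_k .md files matching the query."""
    query_terms = set(tokenize(query))
    scores = []
    for root, _dirs, files in os.walk(wiki_dir):
        for name in files:
            if not name.endswith(".md"):
                continue
            path = os.path.join(root, name)
            with open(path, encoding="utf-8") as f:
                counts = Counter(tokenize(f.read()))
            score = sum(counts[t] for t in query_terms)
            if score > 0:
                scores.append((path, score))
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:top_k]
```

Wrapped in a small CLI, a function like this is easy to hand to an LLM agent as a tool, which matches the usage pattern described above; at larger scales you would likely swap it for a proper index.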

Zayne Sprague ✈️ ICLR Rio@ZayneSprague·
Neat project + post. This line of work is making me think of harnesses as transient reasoning scaffolds. Reminds me of earlier “XoT as intermediate structure” work (SatLM, Program-of-Thought, Tree-of-Thought, etc.), but now in an agentic regime with much more room for optimization. I keep wondering about this idea of meta-optimization: how broad should we expect that optimization to get? Should a meta-agent mostly do local search over prompts/tools/hyperparameters, or should it sometimes pursue riskier, longer-horizon interventions? Wilder example, but if we pointed something like AutoResearch at "coding", should we expect it to rediscover higher-level workflows or abstractions akin to Claude Code? My guess is that “meta” agents, or even deeper recursive optimizers, will tend to favor local improvements over sweeping pipeline redesigns: very targeted, precise changes, even when broader features might help in the long run. Measuring that “meta-scope” (where agents actually spend their time when optimizing these harnesses) seems worth studying.
Kevin Gu@kevingu

x.com/i/article/2039…

Zayne Sprague ✈️ ICLR Rio@ZayneSprague·
Loved this post. One question: have you thought about making the latent queries partially cache-conditioned? Not fully generated from the current K,V, but also not entirely fixed by the training distribution. Would it help with generalization, or would this hurt speed too much?
Zayne Sprague ✈️ ICLR Rio retweeted
Zayne Sprague ✈️ ICLR Rio@ZayneSprague·
Claude Code with Agent-Deck really helped me manage my sessions: one GUI in one terminal session for all CC sessions. Conductors (Claude running other Claude sessions) are pretty neat too. Communication overhead / reward hacking for long experiments is still challenging, even with all the skills, superclaude/superpowers, and ctx engineering hacks, though they do help in their own way. github.com/asheshgoplani/…
Zayne Sprague ✈️ ICLR Rio@ZayneSprague·
Feels directionally right after running AutoResearch for a bit. I set an ambitious target: 0.75 val_bpb while keeping 50M params. I didn't expect it to hit that, but I did expect some far-out, heavy architecture changes. Across 500+ proposed experiments, it mostly hill-climbed through hyperparameters and familiar tricks. This feels like an optimization vs. discovery gap: a bias of LLMs to exploit what they already know rather than pursue genuinely novel ideas, especially "wild" ones that require long horizons to validate or go against priors (even when ideas like that are warranted/necessary).
François Chollet@fchollet

Current AI is a librarian of existing knowledge. Science requires an explorer of the unknown. You don't win a Nobel Prize by staying in the library.

Zayne Sprague ✈️ ICLR Rio retweeted
Manya Wadhwa@ManyaWadhwa1·
⚛️ Introducing CREATE, a benchmark for creative associative reasoning in LLMs. Making novel, meaningful connections is key for scientific & creative works. We objectively measure how well LLMs can do this. 🧵👇
Zayne Sprague ✈️ ICLR Rio retweeted
Shangyin Tan@ShangyinT·
GEPA for skills is here! Introducing gskill, an automated pipeline to learn agent skills with @gepa_ai. With learned skills, we boost Claude Code’s repository task resolution rate to near-perfect levels, while making it 47% faster. Here's how we did it:
Zayne Sprague ✈️ ICLR Rio retweeted
Wenxuan Ding@Wenxuan_Ding_·
Agents interact with environments to gather information. But exploration can be expensive. Tool use, retrieval, and user interaction carry latency or monetary cost. Calibrate-Then-Act allows LLM agents to balance exploration with cost: 📐 Estimate uncertainty about the environment 💭 Reason about cost-uncertainty tradeoffs ⚙️ Act accordingly
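The three steps above can be sketched as a simple expected-value rule (a toy illustration under my own assumptions, not the paper's method): explore only while the expected value of the information gained exceeds the cost of gathering it.

```python
# Toy calibrate-then-act rule (hypothetical names, not the paper's code):
# explore iff uncertainty * value_at_stake > cost of the exploratory action.

def should_explore(uncertainty: float, value_at_stake: float,
                   explore_cost: float) -> bool:
    """Explore iff the expected value of resolving uncertainty beats its cost."""
    expected_info_value = uncertainty * value_at_stake
    return expected_info_value > explore_cost

# High uncertainty about a high-stakes outcome: a cheap tool call is worth it.
assert should_explore(uncertainty=0.8, value_at_stake=10.0, explore_cost=1.0)
# Nearly certain already: skip the costly retrieval and act directly.
assert not should_explore(uncertainty=0.05, value_at_stake=10.0, explore_cost=1.0)
```

The real method presumably estimates the uncertainty term itself via calibration rather than taking it as given; this sketch only shows the decision rule's shape.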
Zayne Sprague ✈️ ICLR Rio retweeted
Kyle Lo@kylelostat·
olmo 3 paper finally on arxiv 🫡 thx to our teammates, esp folks who chased additional baselines. thx to arxiv-latex-cleaner and the overleaf feature for chasing latex bugs. thx for all the helpful discussions after our Nov release. best part of open science is progressing together!