Lucas Nuzzi
@LucasNuzzi
1.9K posts

cofounder & CEO @PortexAI | AI evals & data @ https://t.co/RXTUewHMGG

Joined December 2009
706 Following · 27.6K Followers

Sean Cai @SeanZCai
Extremely excited to see Datalab, which is the first explicit take to qualify the data realism problem emerging in the AI training data industry. Soon, we'll be able to reward human data companies who produce realistic real world data and punish contrived producers.
Lucas Nuzzi @LucasNuzzi
Farewell to the Amp Editor, my daily driver for a solid few months up until 5.3-codex. Hard to overstate how much has changed since then. Agent companies must now have zero attachment to their products and embrace change in order to meet the current pace.
Lucas Nuzzi @LucasNuzzi
@DrorIvry @chrisbarber if that's true, then reviews would probably have to be deeply tied to specific tasks with verifiable outcomes instead of broader tools
Dror Ivry @DrorIvry
@LucasNuzzi @chrisbarber Sandboxed eval where vendors prove claims is the right architecture. The gap: static trust scores vs behavioral validation of what skills actually do at runtime. Agent discovery needs both - pre-install scanning plus continuous monitoring post-install.
Chris Barber @chrisbarber
Idea: a review site for Claude Code instances to use. Claude Code is now handling many people's stack decisions. Existing review sites aren't token-efficient and web-agent friendly, and they're also aimed at a different buyer. Has anyone made this yet? Could grow into something big. Start with reviews of things like Cloudflare, databases, etc. Then, as Claude Code-type things grow in other industries (finance, etc.), expand and add those. Allow the agents to read reviews, and perhaps request that they always add their own experience once they see how it went for them. Is there a way to make this prompt-injection proof with restrictions at the execution level (à la agentsh)?
Lucas Nuzzi @LucasNuzzi
@chrisbarber yea the proto version of that is the clawdhub skills store, which is mostly operated by agents. thumbs-up scores haven't been very reliable because there's an incentive to self-promote / sybil, and agents are not using the comments section for some reason
Chris Barber @chrisbarber
@LucasNuzzi I’m thinking reviews from agents. But yes proof of work / proof of usage would be good
Lucas Nuzzi reposted
signüll @signulll
Most people think ideas come from:
- insight
- intelligence
- taste
- reading
- vibes

But in practice they actually come from:
- building the wrong thing
- hitting a constraint
- getting embarrassed by users
- realizing the obvious thing you missed
- noticing the second-order effect you couldn't see from the couch

A really great idea is the *output* of the work, not the input.
Lucas Nuzzi @LucasNuzzi
@a1zhang So the mental model is: the REPL is the shared workspace, and sub-agents are invoked within that workspace like functions, so individual LLM calls stay small and don't carry the full context.
alex zhang @a1zhang
Maybe I can provide some intuition, but let me know if it's unclear; I am trying to refine how I explain this anyway!

To start, I think the RLM idea is super simple but elegant (I'm biased, obviously). The paper argues that future "language models" (1) do not need to think about context window limits, and (2) will have "reasoning" chains that mix code (symbolic) and neural LMs (fuzzy). RLMs are what we think such a system should minimally look like.

Explicitly, it is an LM <-> REPL + prompt, where the REPL contains the prompt and sub-agents as a *function inside the REPL*. This last part is quite important, because it implies that (1) an RLM can launch sub-agents as if they were functions inside of an algorithm or program, and (2) we can prevent any single neural LM call from having to deal with context rot or huge contexts.

The line between a coding agent and an RLM is hazy, because "coding agent" just means LM + code (the REPL used in an RLM doesn't even need to be a coding environment, but this detail isn't relevant to the argument). But most standard coding agent implementations do not do what I described above; they explicitly use the "LM calls sub-agents as a tool" paradigm, which makes the REPL and sub-agent completely independent tools. It's not the "sub-agent having access to a grepper" (as you've described) that matters at all; it's that the sub-agent is called from, and communicates inside of, the REPL.

The point Omar makes about individual neural LM calls not needing to see everything is an important property of the system above, and it's also why it naturally extends to enormous context problems (maybe using a grep + sub-call strategy, or something even more interesting).

So in some sense, yes, an RLM is our argument for the right way to write a "coding agent", but I almost think this framing is unhelpful because it narrows the scope back to coding tasks. RLMs are task-agnostic (like ReAct, CodeAct, etc.), and code can be used for task-agnostic things. And I'm fairly confident that most future coding scaffolds will start to converge to these properties as well, but I think we should start thinking beyond that and apply these principles to non-coding tasks.

BTW, I think @random_walker has a well-written tweet that argues something similar w.r.t. neurosymbolic AI, and it boils down to a lot of similar ideas. Obviously I'm biased, but I like how the paper is written and think there's a lot of good intuition (especially for those interested in *training*) to think about for what we want LM systems to look like in the future.

The last thing I'll mention is that the name "recursive LM" comes from the idea that an RLM can be trained by training only a single LM with a fixed context window in this system; in this way it is "recursively" calling itself.

Note on terminology: a language model is any (probabilistic) mapping from text to text; ultimately this is what we really care about. A neural language model is our standard Transformer / parameterized NN. An LM doesn't *have* to be this.
Teknium (e/λ) @Teknium

Can someone explain to me how RLM is not just the grep that all coding agents already use, but in a subagent? What's so miraculous?

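The distinction alex draws (sub-agents as functions *inside* the REPL, not as independent tools) can be made concrete with a minimal sketch. Everything here is illustrative: `stub_lm` stands in for a real neural LM call, and `run_rlm` is a hypothetical name, not an API from the paper.

```python
# Minimal sketch of the RLM pattern: the REPL is the shared workspace,
# and sub-agent LM calls are plain functions available *inside* it, so
# no single neural LM call ever sees the full context.

def stub_lm(prompt: str) -> str:
    """Placeholder for a neural LM sub-agent call (hypothetical)."""
    return f"summary({len(prompt)} chars)"

def run_rlm(program: str, context: str) -> dict:
    """Execute LM-written code in a REPL namespace that holds the full
    context and exposes `llm` as a callable sub-agent."""
    namespace = {"context": context, "llm": stub_lm, "results": []}
    exec(program, namespace)   # the "reasoning" is symbolic code...
    return namespace           # ...with fuzzy llm() calls inside it

# The root LM might emit a program like this: chunk the huge context
# and summarize each chunk with a small sub-agent call.
program = """
chunks = [context[i:i+10] for i in range(0, len(context), 10)]
for c in chunks:
    results.append(llm(c))   # each call sees only one small chunk
"""
ns = run_rlm(program, "x" * 35)
print(ns["results"])  # four small calls; none saw all 35 characters
```

The contrast with the "sub-agent as a tool" paradigm is that here the sub-agent's inputs and outputs live as ordinary variables in the REPL, so code can loop, branch, and aggregate over them without routing everything back through one LM's context window.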
Lucas Nuzzi reposted
ℏεsam @Hesamation
bro casually explains RL tuning for LLMs and the three critical components: training, inference, and environments. basically any RLVR algorithm such as GRPO comes down to this super simple concept.
Lucas Nuzzi @LucasNuzzi
Unexpected twist in the coding agent race: GPT-5.2 (w/ the CodexCLI harness) is SOTA on Terminal-Bench. Opus-4.5 w/ Terminus is at a sweet spot of cost and accuracy, but it's surprising to see a ~10% gap between them. Fascinating work by @Mike_A_Merrill, @alexgshaw and team:
Mike A. Merrill@Mike_A_Merrill

We study frontier models across an array of agent harnesses. The best performing agent/harness combination in our experiments was GPT 5.2 with Codex CLI:

Lucas Nuzzi reposted
PortexAI @PortexAI
Day 2 of #PyTorchCon 🔥 What a ride. Talked with folks using #PyTorch to fine-tune models for drug discovery, cancer research, autonomous vehicles and, of course, customer support! Thanks @PyTorch for having us!
Lucas Nuzzi reposted
Epoch AI @EpochAIResearch
Why did OpenAI train GPT-5 with less compute than GPT-4.5? Due to the higher returns to post-training, they scaled post-training as much as possible on a smaller model. And since post-training started from a much lower base, this meant a decrease in total training FLOP. 🧵
Josh English @JoshSeriesAI
@iamtrask Certainly a viable path to unlocking private data: build a platform that allows owners to monetize attribution
⿻ Andrew Trask @iamtrask
IMO, Ilya is wrong:
- Frontier LLMs are trained on ~200 TB of text
- There are ~200 zettabytes of data out there
- That's about 1 billion times more data
- It doubles every 2 years

The problem is that the data is private; you can't scrape it. The problem is not data scarcity, it's data access. The solution is attribution-based control (article below): "Unlocking a Million Times More Data For AI"
Andrew Curran@AndrewCurran_

Ilya Sutskever made a rare appearance at NeurIPS. He said the internet is the fossil fuel of AI, that we are at peak data, and that 'Pre-training as we know it will unquestionably end'.

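The "1 billion times" figure above follows directly from the unit prefixes: a zettabyte is 10^21 bytes and a terabyte is 10^12 bytes, so the 200 TB cancels against the 200 ZB. A quick check:

```python
# Ratio of claimed private data (~200 ZB) to frontier pretraining
# text (~200 TB). Both figures are the tweet's estimates, not mine.
TB = 10**12  # terabyte, bytes
ZB = 10**21  # zettabyte, bytes
ratio = (200 * ZB) / (200 * TB)
print(ratio)  # 1e9, i.e. about a billion times more data
```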
Lucas Nuzzi @LucasNuzzi
@iamtrask Thanks, and likewise! BTW, this was a fantastic way to contextualize training data size.
Lucas Nuzzi reposted
Kyle Waters @kylewaters_
1/ AI isn't just a compute race anymore. It's a data race too. Labs are paying top dollar for differentiated, high-signal data. It's clear now is the time to experiment with new approaches to valuing and incentivizing the creation of frontier AI data. x.com/LucasNuzzi/sta…
Lucas Nuzzi@LucasNuzzi

AI has kicked off a gold rush for data, with OpenAI alone projecting $8B in data-related expenses by 2030. The challenge now is finding a reliable way to value data in this era. Our latest on data valuation techniques: research.portexai.com/data-valuation…

Lucas Nuzzi @LucasNuzzi
AI has kicked off a gold rush for data, with OpenAI alone projecting $8B in data-related expenses by 2030. The challenge now is finding a reliable way to value data in this era. Our latest on data valuation techniques: research.portexai.com/data-valuation…
Lucas Nuzzi reposted
Jameson Lopp @lopp
Are you considering running a Bitcoin Knots node? It's fully within your rights to do so. But if you do, you should be informed of what you're getting into. blog.lopp.net/knot-a-serious…