Lucas Nuzzi
@LucasNuzzi
1.9K posts

cofounder & CEO @PortexAI | AI evals & data @ https://t.co/RXTUewHMGG

Joined December 2009
706 Following · 27.6K Followers

Sean Cai @SeanZCai
Extremely excited to see Datalab, which is the first explicit take to qualify the data realism problem emerging in the AI training data industry. Soon, we'll be able to reward human data companies who produce realistic real world data and punish contrived producers.
Lucas Nuzzi @LucasNuzzi
Farewell to the Amp Editor, my daily driver for a solid few months up until 5.3-codex. Hard to overstate how much has changed since then. Agent companies must now have zero attachment to their products and embrace change in order to meet the current pace.
Lucas Nuzzi @LucasNuzzi
@DrorIvry @chrisbarber if that's true, then reviews would probably have to be deeply tied to specific tasks with verifiable outcomes instead of broader tools
Dror Ivry @DrorIvry
@LucasNuzzi @chrisbarber Sandboxed eval where vendors prove claims is the right architecture. The gap: static trust scores vs behavioral validation of what skills actually do at runtime. Agent discovery needs both - pre-install scanning plus continuous monitoring post-install.
Chris Barber @chrisbarber
Idea: a review site for Claude Code instances to use. Claude Code is now handling many people's stack decisions. Existing review sites aren't token-efficient and web-agent friendly, and they're also aimed at a different buyer. Has anyone made this yet? Could grow into something big. Start with reviews of things like Cloudflare, databases, etc. Then, as Claude Code-type things grow in other industries (finance, etc.), expand and add those. Allow the agents to read reviews, and perhaps request that they always add their own experience once they see how it went for them. Is there a way to make this prompt-injection proof with restrictions at the execution level (à la agentsh)?
Lucas Nuzzi @LucasNuzzi
@chrisbarber yea the proto version of that is the clawdhub skills store, which is mostly operated by agents. thumbs-up scores haven't been very reliable because there's an incentive to self-promote / sybil, and agents are not using the comments section for some reason
Chris Barber @chrisbarber
@LucasNuzzi I’m thinking reviews from agents. But yes proof of work / proof of usage would be good
Lucas Nuzzi reposted
signüll @signulll
Most people think ideas come from:
- insight
- intelligence
- taste
- reading
- vibes

But in practice they actually come from:
- building the wrong thing
- hitting a constraint
- getting embarrassed by users
- realizing the obvious thing you missed
- noticing the second-order effect you couldn't see from the couch

A really great idea is the *output* of the work, not the input.
Lucas Nuzzi @LucasNuzzi
@a1zhang So the mental model is: the REPL is the shared workspace, and sub-agents are invoked within that workspace like functions, so individual LLM calls stay small and don't carry the full context.
alex zhang @a1zhang
Maybe I can provide some intuition, but let me know if it's unclear; I am trying to refine how I explain this anyway!

To start, I think the RLM idea is super simple but elegant (I'm biased, obviously). The paper argues that future "language models" (1) do not need to think about context window limits, and (2) will have "reasoning" chains that mix code (symbolic) and neural LMs (fuzzy). RLMs are what we think such a system should minimally look like.

Explicitly, it is an LM <-> REPL + prompt, where the REPL contains the prompt and sub-agents as a *function inside the REPL*. This last part is quite important, because it implies that (1) an RLM can launch sub-agents as if they were functions inside of an algorithm or program, and (2) we can prevent any single neural LM call from having to deal with context rot or huge contexts.

The line between a coding agent and an RLM is hazy, because "coding agent" just means LM + code (the REPL used in an RLM doesn't even need to be a coding environment, but this detail isn't relevant to the argument). But most standard coding agent implementations do not do what I described above; they explicitly use the "LM calls sub-agents as a tool" paradigm, which makes the REPL and sub-agent completely independent tools. It's not the "sub-agent having access to a grepper" (as you've described) that matters at all; it's that the sub-agent is called from, and communicates inside of, the REPL.

The point Omar makes about individual neural LM calls not needing to see everything is an important property of the system above, and it's also why it naturally extends to enormous context problems (maybe using a grep + sub-call strategy, or something even more interesting).

So in some sense, yes, an RLM is our argument for the right way to write a "coding agent", but I almost think this framing is unhelpful because it narrows the scope back to coding tasks. RLMs are task-agnostic (like ReAct, CodeAct, etc.), and code can be used for task-agnostic things. And I'm fairly confident that most future coding scaffolds will start to converge to these properties as well, but I think we should start thinking beyond that and apply these principles to non-coding tasks.

BTW, I think @random_walker has a well-written tweet that argues something similar w.r.t. neurosymbolic AI, and it boils down to a lot of similar ideas. Obviously I'm biased, but I like how the paper is written and think there's a lot of good intuition (especially for those interested in *training*) to think about for what we want LM systems to look like in the future.

The last thing I'll mention is that the name "recursive LM" comes from the idea that an RLM can be trained by training only a single LM with a fixed context window in this system; in this way it is "recursively" calling itself.

Note on terminology: a language model is any (probabilistic) mapping from text to text; ultimately this is what we really care about. A neural language model is our standard Transformer / parameterized NN. An LM doesn't *have* to be this.
Teknium (e/λ) @Teknium

Can someone explain to me how RLM is not just the grep that all coding agents already use, but in a subagent? What's so miraculous?

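The distinction alex draws (sub-agents as functions *inside* the REPL, not as independent tools) can be made concrete with a minimal sketch. Everything here is illustrative: `stub_lm` stands in for a real neural LM call, and `run_rlm` is a hypothetical name, not an API from the paper.

```python
# Minimal sketch of the RLM pattern: the REPL is the shared workspace,
# and sub-agent LM calls are plain functions available *inside* it, so
# no single neural LM call ever sees the full context.

def stub_lm(prompt: str) -> str:
    """Placeholder for a neural LM sub-agent call (hypothetical)."""
    return f"summary({len(prompt)} chars)"

def run_rlm(program: str, context: str) -> dict:
    """Execute LM-written code in a REPL namespace that holds the full
    context and exposes `llm` as a callable sub-agent."""
    namespace = {"context": context, "llm": stub_lm, "results": []}
    exec(program, namespace)   # the "reasoning" is symbolic code...
    return namespace           # ...with fuzzy llm() calls inside it

# The root LM might emit a program like this: chunk the huge context
# and summarize each chunk with a small sub-agent call.
program = """
chunks = [context[i:i+10] for i in range(0, len(context), 10)]
for c in chunks:
    results.append(llm(c))   # each call sees only one small chunk
"""
ns = run_rlm(program, "x" * 35)
print(ns["results"])  # four small calls; none saw all 35 characters
```

The contrast with the "sub-agent as a tool" paradigm is that here the sub-agent's inputs and outputs live as ordinary variables in the REPL, so code can loop, branch, and aggregate over them without routing everything back through one LM's context window.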
Lucas Nuzzi reposted
ℏεsam @Hesamation
bro casually explains RL tuning for LLMs and the three critical components: training, inference, and environments. basically any RLVR algorithm such as GRPO comes down to this super simple concept.
Lucas Nuzzi @LucasNuzzi
Unexpected twist in the coding agent race: GPT-5.2 (w/ the CodexCLI harness) is SOTA on Terminal-Bench. Opus-4.5 w/ Terminus is at a sweet spot of cost and accuracy, but it's surprising to see a ~10% gap between them. Fascinating work by @Mike_A_Merrill, @alexgshaw and team:
Mike A. Merrill@Mike_A_Merrill

We study frontier models across an array of agent harnesses. The best performing agent/harness combination in our experiments was GPT 5.2 with Codex CLI:

Lucas Nuzzi reposted
PortexAI @PortexAI
Day 2 of #PyTorchCon 🔥 What a ride. Talked with folks using #PyTorch to fine-tune models for drug discovery, cancer research, autonomous vehicles and, of course, customer support! Thanks @PyTorch for having us!
Lucas Nuzzi reposted
Epoch AI @EpochAIResearch
Why did OpenAI train GPT-5 with less compute than GPT-4.5? Due to the higher returns to post-training, they scaled post-training as much as possible on a smaller model. And since post-training started from a much lower base, this meant a decrease in total training FLOP. 🧵
Josh English @JoshSeriesAI
@iamtrask Certainly a viable path to unlocking private data: build a platform that allows owners to monetize attribution
⿻ Andrew Trask @iamtrask
IMO, Ilya is wrong:
- Frontier LLMs are trained on ~200 TB of text
- There are ~200 zettabytes of data out there
- That's about 1 billion times more data
- It doubles every 2 years

The problem is that the data is private; you can't scrape it. The problem is not data scarcity, it's data access. The solution is attribution-based control (article below): "Unlocking a Million Times More Data For AI"
Andrew Curran@AndrewCurran_

Ilya Sutskever made a rare appearance at NeurIPS. He said the internet is the fossil fuel of AI, that we are at peak data, and that 'Pre-training as we know it will unquestionably end'.

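The "1 billion times" figure above follows directly from the unit prefixes: a zettabyte is 10^21 bytes and a terabyte is 10^12 bytes, so the 200 TB cancels against the 200 ZB. A quick check:

```python
# Ratio of claimed private data (~200 ZB) to frontier pretraining
# text (~200 TB). Both figures are the tweet's estimates, not mine.
TB = 10**12  # terabyte, bytes
ZB = 10**21  # zettabyte, bytes
ratio = (200 * ZB) / (200 * TB)
print(ratio)  # 1e9, i.e. about a billion times more data
```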
Lucas Nuzzi @LucasNuzzi
@iamtrask Thanks, and likewise! BTW, this was a fantastic way to contextualize training data size.
Lucas Nuzzi reposted
Kyle Waters @kylewaters_
1/ AI isn't just a compute race anymore. It's a data race too. Labs are paying top dollar for differentiated, high-signal data. It's clear now is the time to experiment with new approaches to valuing and incentivizing the creation of frontier AI data. x.com/LucasNuzzi/sta…
Lucas Nuzzi@LucasNuzzi

AI has kicked off a gold rush for data, with OpenAI alone projecting $8B in data-related expenses by 2030. The challenge now is finding a reliable way to value data in this era. Our latest on data valuation techniques: research.portexai.com/data-valuation…

Lucas Nuzzi @LucasNuzzi
AI has kicked off a gold rush for data, with OpenAI alone projecting $8B in data-related expenses by 2030. The challenge now is finding a reliable way to value data in this era. Our latest on data valuation techniques: research.portexai.com/data-valuation…
Lucas Nuzzi reposted
Jameson Lopp @lopp
Are you considering running a Bitcoin Knots node? It's fully within your rights to do so. But if you do, you should be informed of what you're getting into. blog.lopp.net/knot-a-serious…