Nuno Campos

1.7K posts


@nfcampos

@witanlabs “that most limited of all specialists, the well-rounded man” https://t.co/sWDUx7YcQa

London · Joined December 2008

1.5K Following · 3K Followers

Nuno Campos@nfcampos·10 Mar

This is what I’m seeing too, great summary of the new status quo

Harrison Chase@hwchase17

x.com/i/article/2031…


Nuno Campos@nfcampos·10 Mar

Really great read!

LangChain@LangChain

x.com/i/article/2030…


Nuno Campos@nfcampos·6 Mar

On the other hand, on a coding task I tried it produced better results than opus on the same earlier today


Nuno Campos@nfcampos·5 Mar

GPT 5.4 Thinking in ChatGPT seems like a step backwards in search, the query “What are the best tea rooms or coffee shops serving loose leaf tea in London, around Clerkenwell, Holborn, Marylebone where I can sit in for 60mins” produces a shortlist of 6 options, where options 1 and 2 have been closed for several years, and option 4 is a blog (lol) chatgpt.com/s/t_69a9e734ff…


Nuno Campos@nfcampos·4 Mar

@zebriez Not sure if it counts as a primer on a single topic, but designing data intensive applications has to be up there


Brie Wolfson@zebriez·4 Mar

what is the best technical primer on a topic you've ever read?


Nuno Campos@nfcampos·4 Mar

@juanstoppa Thanks! Both GPT 5.1 and 5.2 and Opus and Haiku 4.5 and 4.6


Juan Stoppa@juanstoppa·4 Mar

@nfcampos great article Nuno! what model were you using for the research ? (sorry if I missed it in the github repo)


Nuno Campos@nfcampos·3 Mar

I wanted to share what we learned over the past few months, building agents 🧵


Nuno Campos@nfcampos·4 Mar

@FundamentEdge Good questions, but benchmarks are meaningless without context, on an earlier version of our benchmark we hit 90+% accuracy. While that looks nice, it’s useless for continuing to make progress, so we added more complicated tasks.


Brett Caughran@FundamentEdge·3 Mar

This is one of the biggest sticking points on AI Excel that I'm trying to understand. 73% accuracy is progress, but is it useful for anything at all? We were on a vendor call last month and the vendor bragged of hitting 65% accuracy in Excel and Andrew Carr and I texted "an analyst who is 65% accurate in Excel is 100% fired".

Why is AI Excel only 60-70% accurate? Are these issues fundamental or solvable?
> Is MCP fundamentally too brittle to get to 99% accuracy?
> Is the data layer clean enough to hit 99% accuracy (i.e. there's a reason why hedge fund analysts don't start their models with a Bloomberg download)
> Are the foundation models powerful enough to handle the multi-modal (filings, PRs, investor decks, data supplementals), multi-document, "needle in a haystack" issues for LLMs? Context windows have grown, but they are still not large enough to capture all of the documents and files for one ticker (let alone a coverage universe)
> Is the commercial opportunity large enough for foundation labs to build RL environments for public equity modeling, as they are doing on investment banking modeling?

Does the "march of 9s" on AI Excel take 6 months or 6 years? Driverless cars took 13 years from DARPA Urban challenge to first Waymo.

These are legit questions. I don't know. I also don't really trust public evaluation sets (i.e. LLMs win physics competitions...then you learn the LLM trained on the physics competition test bank lol). The real questions in investment research modeling are out of sample questions (i.e. how to model SaaS retention in a Claude-world...there is no prior on which to rely).

So I am building my own evaluation set. 100 use cases ranging from simple (input 3 statements from 10-K to AMZN model) to complex (model GE split/spin). Am I wasting my time? 36 months from now, are we still only at 80% accuracy in AI Excel?

These are questions, not answers - love your takes in replies or DM!

Patrick OShaughnessy@patrick_oshag

For all the spreadsheet people out there …


Nuno Campos reposted

Harrison Chase@hwchase17·3 Mar

to make coding agents good at other tasks, you need good tools for those tasks these are some good tools for working with spreadsheets!

Nuno Campos@nfcampos

Code mode for spreadsheets, compatible with every coding agent (and OpenClaw!). Try it out and let me know what you think!


Nuno Campos@nfcampos·3 Mar

I'll be posting new features and updates over next few weeks, you can always check the changelog to see it ahead of time... github.com/witanlabs/wita…


Nuno Campos@nfcampos·3 Mar

Code mode for spreadsheets, compatible with every coding agent (and OpenClaw!). Try it out and let me know what you think!

Witan Labs@WitanLabs

Coding agents are surprisingly bad at spreadsheets. They'll cobble together a Python script that silently breaks formulas, misreads cached values, or corrupts formatting. The problem isn't the model, it's the tools. We built Witan to fix this. witanlabs.com/agents


Nuno Campos@nfcampos·3 Mar

@swyx @aiDotEngineer Would love to talk through these learnings github.com/witanlabs/rese…


swyx@swyx·2 Mar

we are finalizing @aidotengineer Europe speaker acceptances and realizing that we actually dont have enough good speakers for these tracks: - mechanistic interpretability - generative media (video/image but could also use more voice/music) - Claw track (openclaw etc power users) if you have GOOD work - not an ad for your barely launched startup, actual good work - in these areas please please reply or DM and I’ll skip you to the front of the line. We pay all travel and accommodation for our international speakers, and our talks regularly get hundreds of thousands of views. ai.engineer/europe


Nuno Campos@nfcampos·3 Mar

@nicbstme github.com/witanlabs/rese… no need to reverse engineer ours! Great piece btw


Nicolas Bustamante@nicbstme·24 Feb

x.com/i/article/2026…


Nuno Campos@nfcampos·3 Mar

@0xarch1tect Thanks! Planning is just a prompt (skill to be precise) right now


The Architect@0xarch1tect·3 Mar

this thread is packed. a few things stood out. the "define the end state before you touch a cell" finding is the one id push on most. its basically forcing the agent to front-load its inference before it starts taking irreversible actions. the errors dont disappear, they just surface during planning where they cost nothing. ive seen similar patterns building agentic workflows, the ones that blow up mid execution almost always skipped a planning gate. the domain knowledge > tools point is underrated. most teams obsess over the tool layer and treat domain knowledge as an afterthought. but youre describing four tool backends coming and going while the financial expertise kept compounding. the tools are the interface, the knowledge is the actual product. thats a real reframe for anyone building vertical agents right now. the eval section deserves more attention in this thread. "llm as judge" failing on anything with a correct answer isnt surprising but its still something teams learn the hard way. programmatic comparison is slower to set up but youre flying blind without it. the 50 to 73 to 92 trajectory only becomes legible when you have something deterministic to measure against. the part i found most interesting was the failed test shaping the product direction. youre not just saying "run benchmarks", youre saying a benchmark that contradicted your thesis told you what to build next. thats a different relationship with evaluation than most teams have. whats your current thinking on how to structure the planning gate before execution kicks off? curious if thats a prompt pattern, a separate model call, or something else entirely.
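The "programmatic comparison" point in the reply above can be sketched roughly as follows. This is a minimal illustration, not Witan's actual evaluation harness: the function names, the dictionary-of-cells format, and the tolerance value are all assumptions made for the example.

```python
import math

def cells_match(expected, actual, rel_tol=1e-6):
    """Compare two cell values; numbers use a relative tolerance (assumed)."""
    numeric = (int, float)
    if (isinstance(expected, numeric) and isinstance(actual, numeric)
            and not isinstance(expected, bool) and not isinstance(actual, bool)):
        return math.isclose(expected, actual, rel_tol=rel_tol)
    return expected == actual

def score(expected_cells, actual_cells):
    """Fraction of expected cells that the agent's output reproduces."""
    if not expected_cells:
        raise ValueError("expected_cells must not be empty")
    hits = sum(
        1
        for ref, want in expected_cells.items()
        if ref in actual_cells and cells_match(want, actual_cells[ref])
    )
    return hits / len(expected_cells)

# Hypothetical reference answer and agent output, keyed by cell reference.
expected = {"A1": 100.0, "B2": "Revenue", "C3": 0.25}
actual = {"A1": 100.0000001, "B2": "Revenue", "C3": 0.3}
accuracy = score(expected, actual)  # two of the three cells match
```

Because the check is deterministic, the same agent output always gets the same score, which is what makes a trajectory of accuracy numbers comparable across runs.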


Nuno Campos@nfcampos·3 Mar

Read the full story (and lots of technical details!) in the repo github.com/witanlabs/rese…


Nuno Campos@nfcampos·3 Mar

The test that contradicted our thesis was the one that shaped the product. We expected the Witan CLI verify workflow to beat openpyxl on QnA tasks. It lost, 70% to 85% — not because of capability, but because spawning a separate process per CLI command was the wrong interaction pattern for exploration tasks that need 20+ queries. That failure told us the REPL should be the external product (witan xlsx exec), and the remaining CLI commands (render, calc, lint) should be a lightweight verification add-on for agents that already have their own spreadsheet tools. We wouldn't have found either insight if the test had confirmed what we expected.
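The interaction-pattern difference described above (one process per command versus one long-lived session) can be illustrated with a small sketch. It does not use the Witan CLI or openpyxl: the worker below is a hypothetical stand-in that evaluates one arithmetic expression per line, and all function names are assumptions made for the example.

```python
import subprocess
import sys

# Hypothetical stand-in for a spreadsheet tool (NOT the Witan CLI):
# reads expressions from stdin and prints one result per line.
WORKER = (
    "import sys\n"
    "for line in sys.stdin:\n"
    "    print(eval(line, {'__builtins__': {}}), flush=True)\n"
)

def per_command(queries):
    """CLI-style pattern: spawn a fresh process for every query."""
    results, spawned = [], 0
    for q in queries:
        done = subprocess.run(
            [sys.executable, "-c", WORKER],
            input=q + "\n", capture_output=True, text=True, check=True,
        )
        results.append(done.stdout.strip())
        spawned += 1
    return results, spawned

def single_session(queries):
    """REPL-style pattern: keep one process alive and stream queries to it."""
    proc = subprocess.Popen(
        [sys.executable, "-c", WORKER],
        stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True,
    )
    results = []
    for q in queries:
        proc.stdin.write(q + "\n")
        proc.stdin.flush()
        results.append(proc.stdout.readline().strip())
    proc.stdin.close()   # end of input lets the worker exit
    proc.wait()
    proc.stdout.close()
    return results, 1

queries = ["1+1", "2*3", "10-4"]
```

Both functions return the same answers, but the first pays process startup once per query while the second pays it once in total, a gap that grows with the number of queries in an exploration task.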

