Nuno Campos

1.7K posts

@nfcampos

@witanlabs “that most limited of all specialists, the well-rounded man” https://t.co/sWDUx7YcQa

London · Joined December 2008
1.5K Following · 3K Followers
Nuno Campos@nfcampos·
On the other hand, on a coding task I tried earlier today it produced better results than Opus on the same task
Nuno Campos@nfcampos·
GPT 5.4 Thinking in ChatGPT seems like a step backwards in search: the query “What are the best tea rooms or coffee shops serving loose leaf tea in London, around Clerkenwell, Holborn, Marylebone where I can sit in for 60mins” produces a shortlist of six options, where options 1 and 2 have been closed for several years, and option 4 is a blog (lol) chatgpt.com/s/t_69a9e734ff…
Nuno Campos@nfcampos·
@zebriez Not sure if it counts as a primer on a single topic, but Designing Data-Intensive Applications has to be up there
Brie Wolfson@zebriez·
what is the best technical primer on a topic you've ever read?
Nuno Campos@nfcampos·
@juanstoppa Thanks! GPT 5.1 and 5.2, plus Opus and Haiku 4.5 and 4.6
Juan Stoppa@juanstoppa·
@nfcampos great article Nuno! what model were you using for the research ? (sorry if I missed it in the github repo)
Nuno Campos@nfcampos·
I wanted to share what we learned over the past few months, building agents 🧵
Nuno Campos@nfcampos·
@FundamentEdge Good questions, but benchmarks are meaningless without context: on an earlier version of our benchmark we hit 90+% accuracy. While that looks nice, it’s useless for continuing to make progress, so we added more complicated tasks.
Brett Caughran@FundamentEdge·
This is one of the biggest sticking points on AI Excel that I'm trying to understand. 73% accuracy is progress, but is it useful for anything at all? We were on a vendor call last month and the vendor bragged of hitting 65% accuracy in Excel, and Andrew Carr and I texted "an analyst who is 65% accurate in Excel is 100% fired".

Why is AI Excel only 60-70% accurate? Are these issues fundamental or solvable?

> Is MCP fundamentally too brittle to get to 99% accuracy?
> Is the data layer clean enough to hit 99% accuracy? (i.e. there's a reason why hedge fund analysts don't start their models with a Bloomberg download)
> Are the foundation models powerful enough to handle the multi-modal (filings, PRs, investor decks, data supplementals), multi-document, "needle in a haystack" issues for LLMs? Context windows have grown, but they are still not large enough to capture all of the documents and files for one ticker (let alone a coverage universe)
> Is the commercial opportunity large enough for foundation labs to build RL environments for public equity modeling, as they are doing for investment banking modeling?

Does the "march of 9s" on AI Excel take 6 months or 6 years? Driverless cars took 13 years from the DARPA Urban Challenge to the first Waymo. These are legit questions. I don't know. I also don't really trust public evaluation sets (i.e. LLMs win physics competitions... then you learn the LLM trained on the physics competition test bank lol). The real questions in investment research modeling are out-of-sample questions (i.e. how to model SaaS retention in a Claude world... there is no prior on which to rely).

So I am building my own evaluation set: 100 use cases ranging from simple (input 3 statements from a 10-K into an AMZN model) to complex (model the GE split/spin). Am I wasting my time? 36 months from now, are we still only at 80% accuracy in AI Excel? These are questions, not answers - love your takes in replies or DMs!
Patrick OShaughnessy@patrick_oshag

For all the spreadsheet people out there …

Nuno Campos@nfcampos·
I'll be posting new features and updates over the next few weeks; you can always check the changelog to see them ahead of time... github.com/witanlabs/wita…
swyx@swyx·
we are finalizing @aidotengineer Europe speaker acceptances and realizing that we actually don't have enough good speakers for these tracks:

- mechanistic interpretability
- generative media (video/image, but could also use more voice/music)
- Claw track (openclaw etc power users)

if you have GOOD work - not an ad for your barely launched startup, actual good work - in these areas, please please reply or DM and I'll skip you to the front of the line. We pay all travel and accommodation for our international speakers, and our talks regularly get hundreds of thousands of views. ai.engineer/europe
Nuno Campos@nfcampos·
@0xarch1tect Thanks! Planning is just a prompt (a skill, to be precise) right now
The Architect@0xarch1tect·
this thread is packed. a few things stood out.

the "define the end state before you touch a cell" finding is the one I'd push on most. it's basically forcing the agent to front-load its inference before it starts taking irreversible actions. the errors don't disappear, they just surface during planning where they cost nothing. I've seen similar patterns building agentic workflows: the ones that blow up mid execution almost always skipped a planning gate.

the domain knowledge > tools point is underrated. most teams obsess over the tool layer and treat domain knowledge as an afterthought. but you're describing four tool backends coming and going while the financial expertise kept compounding. the tools are the interface, the knowledge is the actual product. that's a real reframe for anyone building vertical agents right now.

the eval section deserves more attention in this thread. "LLM as judge" failing on anything with a correct answer isn't surprising, but it's still something teams learn the hard way. programmatic comparison is slower to set up, but you're flying blind without it. the 50 to 73 to 92 trajectory only becomes legible when you have something deterministic to measure against.

the part I found most interesting was the failed test shaping the product direction. you're not just saying "run benchmarks", you're saying a benchmark that contradicted your thesis told you what to build next. that's a different relationship with evaluation than most teams have.

what's your current thinking on how to structure the planning gate before execution kicks off? curious if that's a prompt pattern, a separate model call, or something else entirely.
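The "programmatic comparison beats LLM-as-judge" point above can be made concrete with a minimal sketch. Everything here is hypothetical illustration, not the actual benchmark: the `grade` helper, the cell addresses, and the numbers are invented to show deterministic grading of a spreadsheet task against a gold answer.

```python
import math

# Gold answer for one illustrative benchmark task: cell -> expected value.
# Purely made-up numbers; the real benchmark schema is not public.
EXPECTED = {"B2": 1200.0, "B3": 0.15, "B4": 1380.0}

def grade(produced: dict[str, float], expected: dict[str, float],
          rel_tol: float = 1e-6) -> float:
    """Fraction of expected cells matched, via deterministic comparison
    instead of asking an LLM judge to eyeball the output."""
    hits = sum(
        1
        for cell, want in expected.items()
        if cell in produced and math.isclose(produced[cell], want, rel_tol=rel_tol)
    )
    return hits / len(expected)

# An agent run that got revenue and the growth rate right but the total wrong.
score = grade({"B2": 1200.0, "B3": 0.15, "B4": 1379.0}, EXPECTED)
print(f"{score:.0%}")  # two of three cells match
```

Because the check is exact (up to a numeric tolerance), reruns of the same agent output always score the same, which is what makes a 50 → 73 → 92 trajectory measurable at all.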
Nuno Campos@nfcampos·
The test that contradicted our thesis was the one that shaped the product. We expected the Witan CLI verify workflow to beat openpyxl on QnA tasks. It lost, 70% to 85% — not because of capability, but because spawning a separate process per CLI command was the wrong interaction pattern for exploration tasks that need 20+ queries. That failure told us the REPL should be the external product (witan xlsx exec), and the remaining CLI commands (render, calc, lint) should be a lightweight verification add-on for agents that already have their own spreadsheet tools. We wouldn't have found either insight if the test had confirmed what we expected.
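The interaction-pattern point above (process-per-command vs. a persistent REPL) can be sketched without the Witan tools at all. This is not the Witan implementation; Python itself stands in for the spreadsheet interpreter, and the only claim illustrated is that the per-command pattern pays process startup once per query while the long-lived pattern pays it once total.

```python
import subprocess
import sys
import time

COMMANDS = [f"print({i} * {i})" for i in range(20)]  # stand-in for 20+ exploration queries

# Pattern 1: spawn a fresh process per command (the CLI style that lost).
start = time.perf_counter()
cli_results = [
    subprocess.run(
        [sys.executable, "-c", cmd], capture_output=True, text=True
    ).stdout.strip()
    for cmd in COMMANDS
]
cli_elapsed = time.perf_counter() - start

# Pattern 2: one long-lived process fed every command (the REPL style).
start = time.perf_counter()
repl_results = subprocess.run(
    [sys.executable, "-"],
    input="\n".join(COMMANDS),
    capture_output=True,
    text=True,
).stdout.strip().splitlines()
repl_elapsed = time.perf_counter() - start

assert cli_results == repl_results  # identical answers; only the overhead differs
print(f"per-command: {cli_elapsed:.2f}s  persistent: {repl_elapsed:.2f}s")
```

With 20+ queries the fixed startup cost dominates pattern 1, which is the shape of the 70% vs. 85% gap described above: capability was equal, the interaction pattern was not.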