Florian Brand

34.2K posts

@xeophon

evals @PrimeIntellect | open models @interconnectsai

Joined July 2015

723 Following · 13.3K Followers

Florian Brand@xeophon·30m

@1thousandfaces_ always fun to play the "is she subtweeting a tweet or a date" game

Hero Thousandfaces@1thousandfaces_·33m

What is a man? A miserable little collection of hackathon shirts

Florian Brand@xeophon·30m

@maksym_andr i do not expect anything else from the PTB + FutureSim ppl tbh

Maksym Andriushchenko@maksym_andr·32m

@xeophon and yes, there we did use native agent harnesses! and still all agents basically suck. it's gonna be a very interesting benchmark. i know i'm teasing too much...

Maksym Andriushchenko@maksym_andr·41m

💥New paper: LLMs are now used for high-stakes real-world decisions, but can their numerical predictions and uncertainty estimates be trusted? We built QuantSightBench, a benchmark to measure how well frontier models forecast numerical outcomes across business, politics, etc. Why forecasting? Forecasting of world events is a great testbed for general LLM decision-making. The real world produces so many things that can be forecast, and the objective ground truth eventually gets revealed. This is the ultimate benchmark: you want to predict how the real world will unfold. Beyond producing accurate point-wise forecasts, having correct uncertainty estimation is essential. LLMs typically don't produce consequential forecasts autonomously, but rather assist human decision-making. This requires calibrated uncertainty estimation, which is also a necessary skill for *agentic* LLM forecasting: the agent needs to know when to acquire more information and when to stop and commit to an answer. Why *numerical* forecasting? Nearly all prior LLM forecasting work evaluates on binary Polymarket-style questions (which is great, btw). However, most decisions that actually matter (GDP growth, ARR numbers, election margins, infrastructure timelines) are not binary. They're numbers, and the confidence intervals there matter even more than the point estimates. So we built a benchmark to measure this! This is joint work with Jeremy Qin @Jjq2221.

Florian Brand@xeophon·35m

@maksym_andr you are treating me too well...

Maksym Andriushchenko@maksym_andr·38m

@xeophon we are actually gonna post another banger benchmark early next week! stay tuned.

Florian Brand@xeophon·1h

@_ueaj @EpochAIResearch i am arguing we should start using those harnesses in the first place... not a react agent with bash

ueaj@_ueaj·1h

But the ultimate metric is real-world utility, which doesn't really involve benchmaxxing in the specific setting, but how a single harness does across all the tasks it's designed for. cowork and cc cover different tasks, so maybe pure generality isn't *necessary*, but frankly a lot of that is just prompt caching efficiency. Though yes, cybersec and certain tail-value industries might be different; benchmaxxing on those is probably appropriate.

Florian Brand@xeophon·1h

new artifacts! i also comment on the open<>closed model gap, where US CAISI and @EpochAIResearch disagree, arguing that both are incomplete: for an assessment of the very frontier, we must elicit the best performance by tuning prompts and harnesses with the models

Interconnects@interconnectsai

Latest open artifacts (#21): Open model bonanza! Gemma 4, DeepSeek V4, Kimi K2.6, MiMo 2.5, GLM-5.1 & others. On CAISI's V4 assessment. An eventful month with one flagship release after another interconnects.ai/p/latest-open-…

Florian Brand@xeophon·1h

@_ueaj @EpochAIResearch this is not my point, my point is that each model should be set up to benchmaxx in the specific setting to know what the raw capabilities are, esp. when being concerned about safety and open<>closed model gaps. Yet Another Benchmark does not solve this problem

ueaj@_ueaj·1h

@xeophon @EpochAIResearch There should be an ARC-AGI but for general capabilities, a private subset + some non-representative public train & test set (to measure generalization). Kinda like terminalbench but more general and a private subset

Florian Brand@xeophon·1h

@KateandPie claude code didn't cause covid afaik

Eoghan Flanagan@KateandPie·1h

@xeophon Show the full chart

Florian Brand@xeophon·2h

"Claude Code means we don't need software devs anymore" Software dev job postings:

Florian Brand@xeophon·1h

@mitsuhiko it’s annoying that the masses spam, ruining the experience for the few good people :(

Armin Ronacher ⇌@mitsuhiko·2h

Alright. Proposal: the first time one submits a PR they need to jump on a video chat with the maintainer to explain their PR. If they fail they are banned from GitHub. x.com/mitsuhiko/stat…

Armin Ronacher ⇌@mitsuhiko

I think it would be great if people were upfront about declaring their own understanding of a topic / their pull request. Now that everybody can talk confidently with their clanker, it becomes way too hard to understand if they knew what they were doing when they prompted it :(

Florian Brand@xeophon·4h

@giffmana Reverse distillation attack

Lucas Beyer (bl16)@giffmana·4h

ok, sure, google scholar, sounds like you got it all figured out

Florian Brand@xeophon·9h

@fsfarimani @badlogicgames nah, our job will transform but things like agents checking every pr is too important to not do

Foad S. Farimani@fsfarimani·9h

@badlogicgames @xeophon Regardless, we eventually have to. All of us. We have like three years maximum.

Florian Brand@xeophon·13h

I am so tired of ppl dunking on Peter, who basically runs the largest experiment on what the future of work will look like (and similarly, what the future of security looks like)

Peter Steinberger 🦞@steipete

People freaking out over my AI spend. What nobody sees: Part of what excites me so much about working on OpenClaw is that I'm trying to answer the question: How would we build software in the future if tokens don't matter? We constantly run ~100 codex in the cloud, reviewing every PR, every issue. If a fix on main lands, @clawsweeper will eventually find that 6 month old issue and close it with an exact reference. We run codex on every commit to review for security issues (as it's far too easy to miss). We run codex to de-duplicate issues and find clusters and send reports for the most pressing issues. We have agents that can recreate complex setups, spin up ephemeral crabbox.sh machines, log into e.g. Telegram, make a video and post before/after fix on the PR. There's codex that watch new issues and - if it fits our documented vision well, automatically create a PR of it. (that then another codex reviews) We have codex running that scans comments for spam and blocks people. We have codex instances running that verify performance benchmarks and report regressions into Discord. We have agents that listen on our meetings and proactively start work, e.g. create PRs when we discuss new features while we discuss them. We build clawpatch.ai to split all our projects into functional units to review and find bugs and regressions. We do the same split for security with Vercel's deepsec and Codex Security to find regressions and vulnerabilities. All that automation allows us to run this project extremely lean.

Florian Brand@xeophon·10h

@dlhck_ @badlogicgames idk, works on many code bases just fine

David Höck@dlhck_·10h

@badlogicgames @xeophon plumber will be the safest job ever. no AI can get all the "Pfusch" that happened over decades into its context.

Florian Brand@xeophon·10h

@kyliebytes flirting in sf: so I got this SPV…

Kylie Robison@kyliebytes·10h

was enjoying my time after the wrap of our sf shoot in the clurb but ran into a guy who yelled into my ear about anthropic secondaries. His friend DID watch our Alex pod tho

Florian Brand@xeophon·13h

@difficultyang And you can generate a bunch of different implementations at once

difficultyang@difficultyang·14h

the cool thing about ai slop is you can rapidly test hypotheses on what happens if you build the program around a core abstraction built one way, and then just keep tweaking it and seeing the implications until you're happy

Florian Brand@xeophon·13h

@maksym_andr 🫡 if you want to run experiments with prime labs, hit me up

Daniel Auras@rasdani_

Maksym Andriushchenko@maksym_andr·19h

i actually missed this: both OpenAI and Anthropic seem to be winding down their fine-tuning APIs. the most recent models like GPT-5.x and Claude 4.x are not available at all. why? misuse risks? this also seems relevant for continual learning (e.g., RL on user rollouts from Claude Code / Codex CLI) that produces personalized LoRA adapters for different users. probably the main reason it hasn't been deployed is that it's a potential nightmare for safety and alignment.

Florian Brand@xeophon·22h

@shxf0072 🫡

Florian Brand@xeophon·22h

@vincentweisser @arcee_ai @NVIDIAAI important to give the little guy (like nvidia) a shoutout

Vincent Weisser@vincentweisser·1d

@xeophon @arcee_ai @NVIDIAAI + Prime Intellect ;)

Florian Brand@xeophon·1d

amazing post and great timing w.r.t. ant's post yesterday. we must build open ai to not get locked in by the vendors who will decide who gets which capabilities, and the west has to realize that open models are important and support open model efforts (like @arcee_ai, @NVIDIAAI)

Bill Gurley@bgurley

A new @bgurley blog post! I have been thinking about how sophisticated executives are using open source in super creative ways. Started writing this three years ago. Excited to finish it up and publish it! And with the new @p3institute brand. substack.com/home/post/p-19…

Florian Brand@xeophon·23h

@qkvproj I yearn to be there again
