Florian Brand
@xeophon
evals @PrimeIntellect | open models @interconnectsai
34.2K posts · Joined July 2015
723 Following · 13.3K Followers
Hero Thousandfaces @1thousandfaces_:
What is a man? A miserable little collection of hackathon shirts
Maksym Andriushchenko @maksym_andr:
@xeophon and yes, there we did use native agent harnesses! and still all agents basically suck. it's gonna be a very interesting benchmark. i know i'm teasing too much...
Maksym Andriushchenko @maksym_andr:
💥New paper: LLMs are now used for high-stakes real-world decisions, but can their numerical predictions and uncertainty estimates be trusted? We built QuantSightBench, a benchmark to measure how well frontier models forecast numerical outcomes across business, politics, etc.

Why forecasting? Forecasting of world events is a great testbed for general LLM decision-making. The real world produces so many things that can be forecast, and the objective ground truth eventually gets revealed. This is the ultimate benchmark: you want to predict how the real world will unfold.

Beyond producing accurate point-wise forecasts, having correct uncertainty estimation is essential. LLMs typically don't produce consequential forecasts autonomously; rather, they assist human decision-making. This requires calibrated uncertainty estimation, which is also a necessary skill for *agentic* LLM forecasting: the agent needs to know when to acquire more information and when to stop and commit to an answer.

Why *numerical* forecasting? Nearly all prior LLM forecasting work evaluates on binary Polymarket-style questions (which is great, btw). However, most decisions that actually matter (GDP growth, ARR numbers, election margins, infrastructure timelines) are not binary. They're numbers, and the confidence intervals there matter even more than the point estimates. So we built a benchmark to measure this!

This is joint work with Jeremy Qin @Jjq2221.
[GIF]
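To make the calibration point above concrete, here is a minimal sketch. It assumes nothing about QuantSightBench's actual metric; the questions, intervals, and outcomes are invented for illustration. It shows the simplest possible check for numerical forecasts: how often do a model's nominal 90% confidence intervals contain the realized outcome?

```python
# Minimal sketch (NOT QuantSightBench's scoring code): empirical coverage of a
# model's nominal 90% confidence intervals for numerical forecasts.
# All questions, intervals, and outcomes below are made up for illustration.

forecasts = [
    # (question, lower bound of 90% interval, upper bound, realized outcome)
    ("GDP growth (%)",        1.0,  3.5,  2.5),
    ("Company ARR ($M)",     40.0, 80.0, 95.0),
    ("Election margin (pp)",  0.5,  6.0,  1.5),
]

covered = sum(lo <= outcome <= hi for _, lo, hi, outcome in forecasts)
coverage = covered / len(forecasts)

# A well-calibrated forecaster's 90% intervals should contain the outcome
# roughly 90% of the time; much lower suggests overconfidence, much higher
# suggests the intervals are wastefully wide.
print(f"Empirical coverage of nominal 90% intervals: {coverage:.2f}")
```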
ueaj @_ueaj:
But the ultimate metric is real-world utility, which doesn't really involve benchmaxxing in the specific setting, but how a single harness does across all the tasks it's designed for. cowork and cc cover different tasks, so maybe pure generality isn't *necessary*, but frankly a lot of that is just prompt caching efficiency. Though yes, cybersec and certain tail-value industries might be different; benchmaxxing on those is probably appropriate.
Florian Brand @xeophon:
new artifacts! i also comment on the open<>closed model gap, where US CAISI and @EpochAIResearch disagree, arguing that both are incomplete: for an assessment of the very frontier, we must elicit the best performance by tuning prompts and harnesses with the models
[image]
Interconnects @interconnectsai:

Latest open artifacts (#21): Open model bonanza! Gemma 4, DeepSeek V4, Kimi K2.6, MiMo 2.5, GLM-5.1 & others. On CAISI's V4 assessment. An eventful month with one flagship release after another. interconnects.ai/p/latest-open-…

Florian Brand @xeophon:
@_ueaj @EpochAIResearch this is not my point, my point is that each model should be set up to benchmaxx in the specific setting to know what the raw capabilities are, esp. when being concerned about safety and open<>closed model gaps. Yet Another Benchmark does not solve this problem
ueaj @_ueaj:
@xeophon @EpochAIResearch There should be an ARC-AGI but for general capabilities: a private subset + some non-representative public train & test set (to measure generalization). Kinda like terminalbench, but more general and with a private subset
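For what it's worth, the split @_ueaj describes is easy to sketch. A minimal version follows; the function name, the 20% public fraction, and the 50/50 train/test partition are assumptions for illustration, not taken from ARC-AGI or terminalbench.

```python
# Minimal sketch of a benchmark split: most tasks stay in a private,
# server-side-scored set; a small, deliberately non-representative public
# slice is released for harness/prompt tuning and a rough generalization check.
import random

def split_tasks(tasks, public_frac=0.2, seed=0):
    """Partition benchmark tasks into public train/test slices and a private eval set."""
    rng = random.Random(seed)
    shuffled = list(tasks)
    rng.shuffle(shuffled)
    n_public = int(len(shuffled) * public_frac)
    public, private_eval = shuffled[:n_public], shuffled[n_public:]
    public_train = public[: n_public // 2]   # released for harness/prompt tuning
    public_test = public[n_public // 2 :]    # released, intentionally not representative
    return public_train, public_test, private_eval  # private set is never released
```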
Florian Brand @xeophon:
"Claude Code means we don't need software devs anymore" Software dev job postings:
[image]
Florian Brand @xeophon:
@mitsuhiko it’s annoying that the masses spam, ruining the experience for the few good people :(
Lucas Beyer (bl16) @giffmana:
ok, sure, google scholar, sounds like you got it all figured out
[image]
Florian Brand @xeophon:
I am so tired of ppl dunking on Peter, who basically runs the largest experiment on what the future of work will look like (and similarly, what the future of security looks like)
Peter Steinberger 🦞 @steipete:

People freaking out over my AI spend. What nobody sees: part of what excites me so much about working on OpenClaw is that I'm trying to answer the question: how would we build software in the future if tokens don't matter?

We constantly run ~100 codex in the cloud, reviewing every PR, every issue. If a fix on main lands, @clawsweeper will eventually find that 6 month old issue and close it with an exact reference. We run codex on every commit to review for security issues (as it's far too easy to miss). We run codex to de-duplicate issues, find clusters, and send reports for the most pressing issues. We have agents that can recreate complex setups, spin up ephemeral crabbox.sh machines, log into e.g. Telegram, make a video, and post before/after of the fix on the PR. There's codex that watches new issues and, if it fits our documented vision well, automatically creates a PR for it (that then another codex reviews). We have codex running that scans comments for spam and blocks people. We have codex instances running that verify performance benchmarks and report regressions into Discord. We have agents that listen in on our meetings and proactively start work, e.g. create PRs when we discuss new features while we discuss them.

We build clawpatch.ai to split all our projects into functional units to review and find bugs and regressions. We do the same split for security with Vercel's deepsec and Codex Security to find regressions and vulnerabilities.

All that automation allows us to run this project extremely lean.

David Höck @dlhck_:
@badlogicgames @xeophon plumber will be the safest job ever. no AI can get all the "Pfusch" (botched work) that happened over decades into its context.
Kylie Robison @kyliebytes:
was enjoying my time after the wrap of our sf shoot in the clurb but ran into a guy who yelled into my ear about anthropic secondaries. His friend DID watch our Alex pod tho
difficultyang @difficultyang:
the cool thing about ai slop is you can rapidly test hypotheses on what happens if you build the program around a core abstraction built one way, and then just keep tweaking it and seeing the implications until you're happy
Maksym Andriushchenko @maksym_andr:
i actually missed this: both OpenAI and Anthropic seem to be winding down their fine-tuning APIs. the most recent models like GPT-5.x and Claude 4.x are not available at all. why? misuse risks? this also seems relevant for continual learning (e.g., RL on user rollouts from Claude Code / Codex CLI) that produces personalized LoRA adapters for different users. probably the main reason it hasn't been deployed is that it's a potential nightmare for safety and alignment.
[image]
Florian Brand @xeophon:
amazing post and great timing w.r.t. ant's post yesterday. we must build open ai to not get locked in by the vendors who will decide who gets which capabilities, and the west has to realize that open models are important and support open model efforts (like @arcee_ai, @NVIDIAAI)
[image]
Bill Gurley @bgurley:

A new @bgurley blog post! I have been thinking about how sophisticated executives are using open source in super creative ways. Started writing this three years ago. Excited to finish it up and publish it! And with the new @p3institute brand. substack.com/home/post/p-19…
