Brandon Guo

358 posts

@brandonguo

data, markets

San Francisco · Joined June 2019

516 Following · 233 Followers

Brandon Guo@brandonguo·13h

spending an extra 5-10 minutes on a prompt to make it self-sustaining + clear feedback loops can be a magical experience
it goes from a one-off prompt to a loop that turns compute into intelligence
the game is in finding the loops

105

Brandon Guo@brandonguo·5d

oh you're using langchain observability? bro just use raindrop. before that we tried judgement labs, then arize. but we still use helicone. have you tried langfuse? or braintrust. how about maxim? i heard they're better than galileo, but worse than respan, on par with laminar

123

Brandon Guo@brandonguo·19 May

@JohnnyNel_ automation matters even for technical folks when every developer hour unlocks 10-100x more productivity than before - every saved minute from automations also matters 10-100x more

393

Johnny Nel | AI for Founders@JohnnyNel_·19 May

@brandonguo workflow integration matters way more than raw AI power... automation lets non-tech folks actually ship without getting stuck

444

Brandon Guo@brandonguo·19 May

devin is a game changer - hard to imagine going back to raw codex/claude code at this point
even beyond the raw coding capabilities (which are better), devex features like shared secrets, linear integrations, automated testing + demo videos, etc make it hard to switch off

14.9K

Brandon Guo@brandonguo·8 May

trying to dunk on ProgramBench for realism is like criticizing the SAT for not allowing google search
a test doesn't need to perfectly emulate the real world to be useful; its value is whether it reveals underlying capabilities that generalize
@jyangballin is cooking

224

Brandon Guo@brandonguo·5 May

@18jeffreyma @jyangballin @KLieret @OfirPress hotdog-bench with artificial constraints
model has to write 2015-era cv code for each hot dog
binary pass/fail

154

Jeff Ma@18jeffreyma·5 May

unfortunately my pitch to call it Jian-Yang-bench did not succeed
so, so much fun working on this with @jyangballin @KLieret @OfirPress and all the folks at Meta

John Yang@jyangballin

How much of SQLite, FFmpeg, PHP compiler can LMs code from scratch? Given just an executable and no starter code or internet access. Introducing ProgramBench: 200 rigorous, whole-repo generation tasks where models design, build, and ship a working program end to end. 🧵

11.9K

Brandon Guo@brandonguo·3 May

@rkundy make sure to use a wdt tool and leveler before tamping
reduces channeling that causes water flow inconsistencies

RISHI@rkundy·3 May

i struggle so hard to figure out how to pull a shot of espresso correctly. I bought fresh roasted beans, and the bag said 17.5 in, 30 out. I put 17.5 in twice, got 18, and then 12 out 😭😭😭. i would be the worst barista

Brandon Guo@brandonguo·2 May

@himanshustwts the true frontier has mostly moved past this but slop providers can keep repackaging the same 10k PRs to unsophisticated buyers

362

himanshu@himanshustwts·2 May

tldr of coding RL envs:
> incredible PR mining
> reconstruct real bug to fix trajectories
> turn them into executable sandboxes
> add synth bugs to widen coverage
> train on patch / test / fail / recover loops
> reward fixes that generalize better
everything on top of this is just to make your tooling and quality better.

195

9.8K

Brandon Guo@brandonguo·1 May

this rough pattern is what many “human” data vendors have already been doing under the hood
won’t name names. but most of the data from the largest players are made in this fashion - both sides of the deal choose to pretend they don’t know what’s going on

Jason Weston@jaseweston

💎Autodata: an agentic data scientist to create high quality data✨
We introduce a method for building agents that create high-quality training & evaluation data.
Key idea: agentic data creation provides a way to *convert increased inference compute into higher quality model training*.
We show how to train (meta-optimize) such a data scientist agent, so that it can create even stronger data.
Our initial study with a specific practical implementation, Agentic Self-Instruct, shows strong gains on scientific reasoning problems compared to classical synthetic dataset creation methods.
Overall, we believe this direction has the potential to change how we build AI data!
Read more in the blog post: facebookresearch.github.io/RAM/blogs/auto…

476

Brandon Guo@brandonguo·28 Apr

@caelin_sutch just one more shitty vector db pls i promise bro

362

Caelin@caelin_sutch·28 Apr

bro all you need is one more company brain. one more notion automation. one more agent with all the context on your company and slack comms. i promise bro this will be the one that 10x's your company's growth and gets processes out of people's heads

Y Combinator@ycombinator

Company Brain @t_blom Every company has critical know-how scattered across people's heads, old Slack threads, support tickets, and databases, and AI agents can't operate like that. We think every company in the world is going to need a new primitive: a living map of how the company works that turns its own artifacts into an executable skills file for AI.

201

25.5K

Brandon Guo@brandonguo·28 Apr

@caelin_sutch thanks pookie

Caelin@caelin_sutch·28 Apr

@brandonguo Wow so insightful

Brandon Guo@brandonguo·28 Apr

one reason why there’s so many data startups is that evals and RL envs feel unusually open as a research surface
a new attention mechanism has to beat thousands of very sharp people in a narrow technical lane
a new eval or environment can come from noticing that some valuable domain of work still has no good benchmark, simulator, rubric, or feedback loop, and pattern matching from there

206

Brandon Guo@brandonguo·26 Apr

i would sell a kidney to get allocation

Liam Fedus@LiamFedus

Industrial-scale science. Coming to a lab near you this summer.

271

Brandon Guo@brandonguo·24 Apr

@alexgshaw the TAM of harbor is destined to engulf all human labor

170

Alex Shaw@alexgshaw·24 Apr

Can agents build off their prior work? Can they continually learn? Answering these questions requires feeding your agent a sequence of tasks, each building off the prior. Today we're releasing the first major addition to the Harbor task format: multi-step tasks. We've partnered with @GOrlanski to add SlopCodeBench to the Harbor Registry as the first benchmark taking advantage of multi-step tasks.

Gabe Orlanski@GOrlanski

Very excited to announce the v1.0 of SlopCodeBench release:
- Doubling the size of the dataset
- @harborframework support
- scb-check: a CLI that flags slop anti-patterns
- Way more model results
scbench.ai github.com/SprocketLab/sl… 🧵

10.5K

Brandon Guo@brandonguo·21 Apr

credits used to be a footnote and now they're a line item
every 100k in inference credits is 1% less dilution at a $10m post
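The arithmetic behind the dilution claim can be checked with a short sketch. It assumes the simplest reading: the credits replace cash that would otherwise be raised by selling new shares at the stated post-money valuation. The function name `dilution_equivalent` is hypothetical, introduced only for illustration.

```python
def dilution_equivalent(credits_usd: float, post_money_usd: float) -> float:
    """Fraction of the company that would have to be sold to raise
    `credits_usd` in cash at the given post-money valuation.

    Assumption: credits substitute one-for-one for cash raised.
    """
    if post_money_usd <= 0:
        raise ValueError("post-money valuation must be positive")
    return credits_usd / post_money_usd


# $100k of credits against a $10m post-money valuation
print(f"{dilution_equivalent(100_000, 10_000_000):.1%}")  # 1.0%
```

Under that assumption the relationship is linear, so the figure scales directly with the size of the credit grant and inversely with the valuation.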

Turner Novak 🍌🧢@TurnerNovak

TIL you get $5m in credits through a16z Speedrun

186

Brandon Guo@brandonguo·17 Apr

everyone's writing the "ai is killing software" obituary but okta just put up 11% revenue growth with 30% of q4 bookings from auth0 for ai agents and identity governance

125

Brandon Guo@brandonguo·17 Apr

how it feels choosing between claude and codex

1.3K

Brandon Guo@brandonguo·17 Apr

@JulieKallini this is the correct take
the 1.15->3.75MP jump is the more interesting thing to guess at imo

1.6K

Julie Kallini ✨@JulieKallini·17 Apr

1/ "New tokenizer" does not imply "new base model," and "new base model" is not the simplest explanation. There are much simpler explanations that fit Anthropic's public description of Opus 4.7 equally well.

Nathan Lambert@natolambert

Opus 4.7 has a new tokenizer. This means it's also a new base model. Glory days of pretraining still very much going.

1.1K

184.7K

Brandon Guo@brandonguo·17 Apr

used to do a lot of coding benchmarks and been asked what i think, here's my rundown:

1 - proximal uses native scaffolds like claude code/codex which is understandable, but aisi/metr have been flagging for a year that scaffold effects can swamp model effects at the frontier
having neutral scaffold results would be interesting here and i'm surprised they didn't try to include that

2 - sample size of 18 is definitely troubling, although there's obviously some degree of wanting to solicit more demand from labs here. at least 50-100 would've been good

3 - [edit: i got corrected on this one] (imo) central challenge of swe benchmarks is reward shaping and frontier-swe is a bit lacking here. 0.5*correctness + 0.5*speedup is pretty arbitrary and not aligned with "correctness as a gate"
a patch that's 60% correct and 2x faster is def not worth the same as one that's 100% correct and 1.2x faster, but it's graded the same here
@18jeffreyma's swe-fficiency does correctness as a gate well. it's understandable that larger problems like this make the correctness gate harder to implement, but 50/50 does feel a bit lazy

tldr: in a fuller paper i'd love to see larger n, neutral-scaffold results, and some discussion of how rankings change under different scoring rules (or at least justification of 50/50)

still super cool work though and in the right direction
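A rough sketch of why a sample size of 18 is a concern, compared with the suggested 50-100. It is not taken from the benchmark itself and rests on loud assumptions: each task is treated as an independent pass/fail trial, the true pass rate is set to the worst case of 0.5, and the normal approximation to the binomial is used. The helper name `pass_rate_std_error` is hypothetical.

```python
import math


def pass_rate_std_error(p: float, n: int) -> float:
    """Standard error of an observed pass rate over n independent pass/fail tasks."""
    return math.sqrt(p * (1 - p) / n)


# Approximate 95% interval half-width (1.96 standard errors) at p = 0.5
for n in (18, 50, 100):
    half_width = 1.96 * pass_rate_std_error(0.5, n)
    print(f"n={n:>3}: pass rate known to roughly +/- {half_width:.0%}")
```

Under these assumptions the uncertainty at n=18 is about 23 percentage points, which is wide enough to hide sizeable gaps between models. It narrows to about 14 points at n=50 and about 10 points at n=100.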

Justus Mattern@MatternJustus

Introducing FrontierSWE, an ultra-long horizon coding benchmark. We test agents on some of the hardest technical tasks like optimizing a video rendering library or training a model to predict the quantum properties of molecules. Despite having 20 hours, they rarely succeed

429

Brandon Guo@brandonguo·17 Apr

@MatternJustus oops missed the correctness gate. good work!

Justus Mattern@MatternJustus·17 Apr

thanks for the feedback! A few comments:

1) fair and will add! We want to continuously maintain this benchmark and add new models and scaffolds (including parallel scaffolds). We opted for native scaffolds initially because we saw massive degradations in other scaffolds and wanted to be fair to labs

2) agree, and we want to update the benchmark with more over time. Building a single task in a way that it is not cheatable was a massive effort so launched with this

3) there is a correctness gate! the "0.5*speedup" score is only added if we have full correctness!

More thorough analysis coming soon, and will incorporate feedback 🫡
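The difference between the two scoring readings in this exchange can be sketched as follows. This is an illustration, not FrontierSWE's grading code: it assumes correctness is a fraction in [0, 1] and that the speedup has already been normalized to a score in [0, 1] (the thread does not say how speedup is normalized). The function names and the example numbers are hypothetical.

```python
def additive_score(correctness: float, speedup_score: float) -> float:
    """The reading criticized in the original post: a plain 50/50 sum."""
    return 0.5 * correctness + 0.5 * speedup_score


def gated_score(correctness: float, speedup_score: float) -> float:
    """The reading from the reply: the speedup term only counts
    when the patch is fully correct."""
    score = 0.5 * correctness
    if correctness >= 1.0:
        score += 0.5 * speedup_score
    return score


# Hypothetical patches: (correctness, normalized speedup score)
partial_but_fast = (0.6, 0.9)
correct_but_slower = (1.0, 0.5)

for name, rule in (("additive", additive_score), ("gated", gated_score)):
    print(f"{name}: partial={rule(*partial_but_fast):.2f} "
          f"correct={rule(*correct_but_slower):.2f}")
```

With these made-up inputs the additive rule scores both patches 0.75, which is the tie the original post objects to. The gated rule scores them 0.30 and 0.75, so the fully correct patch ranks first.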

Justus Mattern@MatternJustus·16 Apr

141

1.3K

265K
