Brandon Guo

358 posts

@brandonguo

data, markets

San Francisco · Joined June 2019
516 Following · 233 Followers
Brandon Guo @brandonguo
spending an extra 5-10 minutes on a prompt to make it self-sustaining, with clear feedback loops, can be a magical experience
it goes from a one-off prompt to a loop that turns compute into intelligence
the game is in finding the loops
Brandon Guo @brandonguo
oh you're using langchain observability? bro just use raindrop. before that we tried judgement labs, then arize. but we still use helicone. have you tried langfuse? or braintrust. how about maxim? i heard they're better than galileo, but worse than respan, on par with laminar
Brandon Guo @brandonguo
@JohnnyNel_ automation matters even for technical folks
when every developer hour unlocks 10-100x more productivity than before - every saved minute from automations also matters 10-100x more
Johnny Nel | AI for Founders @JohnnyNel_
@brandonguo workflow integration matters way more than raw AI power... automation lets non-tech folks actually ship without getting stuck
Brandon Guo @brandonguo
devin is a game changer - hard to imagine going back to raw codex/claude code at this point
even beyond the raw coding capabilities (which are better), devex features like shared secrets, linear integrations, automated testing + demo videos, etc make it hard to switch off
Brandon Guo @brandonguo
trying to dunk on ProgramBench for realism is like criticizing the SAT for not allowing google search
a test doesn't need to perfectly emulate the real world to be useful; its value is whether it reveals underlying capabilities that generalize
@jyangballin is cooking
Brandon Guo @brandonguo
@rkundy make sure to use a wdt tool and leveler before tamping
reduces channeling that causes water flow inconsistencies
RISHI @rkundy
i struggle so hard to figure out how to pull a shot of espresso correctly. I bought fresh roasted beans, and the bag said 17.5 in, 30 out. I put 17.5 in twice, got 18, and then 12 out 😭😭😭. i would be the worst barista
Brandon Guo @brandonguo
@himanshustwts the true frontier has mostly moved past this
but slop providers can keep repackaging the same 10k PRs to unsophisticated buyers
himanshu @himanshustwts
tldr of coding RL envs:
> incredible PR mining
> reconstruct real bug-to-fix trajectories
> turn them into executable sandboxes
> add synth bugs to widen coverage
> train on patch / test / fail / recover loops
> reward fixes that generalize better
everything on top of this is just to make your tooling and quality better.
Caelin @caelin_sutch
bro all you need is one more company brain. one more notion automation. one more agent with all the context on your company and slack comms. i promise bro this will be the one that 10x's your company's growth and gets processes out of people's heads
Y Combinator @ycombinator

Company Brain
@t_blom
Every company has critical know-how scattered across people's heads, old Slack threads, support tickets, and databases, and AI agents can't operate like that. We think every company in the world is going to need a new primitive: a living map of how the company works that turns its own artifacts into an executable skills file for AI.

Brandon Guo @brandonguo
one reason why there are so many data startups is that evals and RL envs feel unusually open as a research surface
a new attention mechanism has to beat thousands of very sharp people in a narrow technical lane
a new eval or environment can come from noticing that some valuable domain of work still has no good benchmark, simulator, rubric, or feedback loop, and pattern matching from there
Brandon Guo @brandonguo
@alexgshaw the TAM of harbor is destined to engulf all human labor
Alex Shaw @alexgshaw
Can agents build off their prior work? Can they continually learn? Answering these questions requires feeding your agent a sequence of tasks, each building off the prior. Today we're releasing the first major addition to the Harbor task format: multi-step tasks. We've partnered with @GOrlanski to add SlopCodeBench to the Harbor Registry as the first benchmark taking advantage of multi-step tasks.
[image]
Gabe Orlanski @GOrlanski

Very excited to announce the v1.0 release of SlopCodeBench:
- Doubling the size of the dataset
- @harborframework support
- scb-check: a CLI that flags slop anti-patterns
- Way more model results
scbench.ai github.com/SprocketLab/sl… 🧵

Brandon Guo @brandonguo
everyone's writing the "ai is killing software" obituary
but okta just put up 11% revenue growth with 30% of q4 bookings from auth0 for ai agents and identity governance
Brandon Guo @brandonguo
how it feels choosing between claude and codex
[image]
Brandon Guo @brandonguo
@JulieKallini this is the correct take
the 1.15->3.75MP jump is the more interesting thing to guess at imo
Brandon Guo @brandonguo
used to do a lot of coding benchmarks and been asked what i think, here's my rundown:
1 - proximal uses native scaffolds like claude code/codex which is understandable, but aisi/metr have been flagging for a year that scaffold effects can swamp model effects at the frontier
having neutral scaffold results would be interesting here and i'm surprised they didn't try to include that
2 - sample size of 18 is definitely troubling, although there's obviously some degree of wanting to solicit more demand from labs here. at least 50-100 would've been good
3 - [edit: i got corrected on this one] (imo) central challenge of swe benchmarks is reward shaping and frontier-swe is a bit lacking here. 0.5*correctness + 0.5*speedup is pretty arbitrary and not aligned with "correctness as a gate"
a patch that's 60% correct and 2x faster is def not worth the same as one that's 100% correct and 1.2x faster, but it's graded the same here
@18jeffreyma's swe-fficiency does correctness as a gate well. it's understandable that larger problems like this make the correctness gate harder to implement, but 50/50 does feel a bit lazy
tldr: in a fuller paper i'd love to see larger n, neutral-scaffold results, and some discussion of how rankings change under different scoring rules (or at least justification of 50/50)
still super cool work though and in the right direction
Justus Mattern @MatternJustus

Introducing FrontierSWE, an ultra-long horizon coding benchmark. We test agents on some of the hardest technical tasks like optimizing a video rendering library or training a model to predict the quantum properties of molecules. Despite having 20 hours, they rarely succeed

Justus Mattern @MatternJustus
thanks for the feedback! A few comments:
1) fair and will add! We want to continuously maintain this benchmark and add new models and scaffolds (including parallel scaffolds). We opted for native scaffolds initially because we saw massive degradations in other scaffolds and wanted to be fair to labs
2) agree, and we want to update the benchmark with more over time. Building a single task in a way that it is not cheatable was a massive effort so launched with this
3) there is a correctness gate! the "0.5*speedup" score is only added if we have full correctness!
More thorough analysis coming soon, and will incorporate feedback 🫡
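The scoring distinction in this exchange can be sketched in a few lines. This is only an illustration: it assumes correctness and a speedup score are each normalized to [0, 1], and the function names, the normalization, and the two example patches are hypothetical, not taken from FrontierSWE's code. `gated_score` follows one reading of the reply above, where the speed term is added only on full correctness.

```python
def blended_score(correctness: float, speedup_score: float) -> float:
    """Ungated 50/50 blend: a partially correct patch still earns speed credit."""
    return 0.5 * correctness + 0.5 * speedup_score


def gated_score(correctness: float, speedup_score: float) -> float:
    """Correctness as a gate: speed credit is added only for a fully correct patch."""
    speed_credit = 0.5 * speedup_score if correctness >= 1.0 else 0.0
    return 0.5 * correctness + speed_credit


# Hypothetical patches as (correctness, speedup_score), both assumed to lie in [0, 1].
partial_fast = (0.6, 1.0)   # partly correct, large speedup
full_modest = (1.0, 0.2)    # fully correct, small speedup

for name, patch in [("partial_fast", partial_fast), ("full_modest", full_modest)]:
    print(name, "blended:", blended_score(*patch), "gated:", gated_score(*patch))
```

Under the ungated blend the partially correct patch ranks higher; with the gate the ranking flips, which is why the choice of scoring rule can change a leaderboard.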