Santiago

28 posts

@extindar

Joined March 2022
38 Following · 3 Followers
Santiago reposted
Rhys
Rhys@RhysSullivan·
how it feels to open 50 slop PRs on your own repo
Rhys tweet media
16
10
679
19.7K
Blake Emal
Blake Emal@heyblake·
Fork it. Drop your landing page URL. I'll give 1 piece of advice to as many of you as I can.
365
4
139
17.4K
Laura Luna
Laura Luna@LauraLunaTech·
4 days ago I had 0 followers. Yesterday I gained 218 in a single day. I genuinely underestimated how powerful X is for tech networking. 🤯 The best part so far has been meeting smart people building cool things. What are you working on right now? 👇
Laura Luna tweet media
178
3
211
8.8K
Santiago
Santiago@extindar·
Interesting how much better if you directly ask for the specific exercise
Santiago tweet media
0
0
0
12
Santiago
Santiago@extindar·
Using GPT 5.5 Thinking to make a PT plan and got some pretty funky looking proportions
Santiago tweet media (2 images)
1
0
0
10
Santiago
Santiago@extindar·
@LizBeal12 @can Looks like a little under 1% of taxpayers get audited annually, with the likelihood going up the higher your income.
0
0
2
566
can
can@can·
ill regret asking this but what’s the grift behind everything being hsa/fsa eligible now?
37
17
2.5K
417.4K
Santiago
Santiago@extindar·
@SummersJohns69 @Sky1821084 @bryan_king @can You’re gambling that the IRS doesn’t audit you. If they do and you can’t prove you used HSA money for allowable purchases you have to pay it back, pay ordinary income taxes and pay a 20% penalty
1
0
4
237
Santiago
Santiago@extindar·
@paul_cal @KLieret I think the point is to see how the underlying models perform, not how good the harnesses are
2
0
0
62
Paul Calcraft
Paul Calcraft@paul_cal·
@KLieret Thanks Kilian, yes this para was the source. Good agentic benchmarks should use standard sota harnesses to be relevant imo. Overwhelming majority of agentic coding is currently done in Codex & Claude Code, so those are the important results. Look forward to the leaderboard
1
0
3
215
Paul Calcraft
Paul Calcraft@paul_cal·
ProgramBench is an *agentic coding* benchmark where everything scores 0%. But they DIDN'T test Codex or Claude Code! Harness doesn't even have context mngment/compaction. Models apparently never hit ctxt limits, so this obvs isn't a fair test of the sota (cc/cx would run long)
Deedy@deedydas

The creators of SWE-Bench just dropped a really simple new benchmark every LLM gets 0% on. ProgramBench asks: can models recreate real executable programs (ffmpeg, SQLite, ripgrep) from scratch with no internet? We are far from saturated on model quality.

2
0
73
8.5K
Santiago reposted
Ofir Press
Ofir Press@OfirPress·
1) Our team at Meta has a tough new coding benchmark challenging models to code entire programs, including ffmpeg and the PHP compiler, from scratch.
2) Top accuracy is 0%.
3) We will be making the benchmark harder.
John Yang@jyangballin

How much of SQLite, FFmpeg, PHP compiler can LMs code from scratch? Given just an executable and no starter code or internet access. Introducing ProgramBench: 200 rigorous, whole-repo generation tasks where models design, build, and ship a working program end to end. 🧵

36
65
1K
120.5K
Codetard
Codetard@codetaur·
got kinda sniped by seeing interactive bookshelf on twitter and thinking "there's no way that's not automatable"
6
5
81
37.3K
AnechoicMedia
AnechoicMedia@AnechoicMedia_·
@cremieuxrecueil Completely unnecessary imo. More of the internet should be llm-enabled, especially reference material like source code repository and documentation that was difficult for humans, but readily ingested by agents.
5
0
70
2.8K
Crémieux
Crémieux@cremieuxrecueil·
A real downside of the rise of LLMs is that we now have to sit through Cloudflare verification pages all the time.
51
69
3.7K
63.5K
Santiago
Santiago@extindar·
@0ximjosh If you don’t use and live your product you are too far removed
0
0
0
29
Josh
Josh@0ximjosh·
It feels like people are forgetting the best companies are ones born from friction in your own life. I see too many tech startups solving problems in fields not a single employee has worked in
31
8
218
10.2K
Steve Ruiz
Steve Ruiz@steveruizok·
github if peak blizzard made it
Steve Ruiz tweet media
121
164
3.7K
168.4K
Santiago
Santiago@extindar·
@benhylak thanks for the motivation to post - i built this for my friends and figured i'd release it, appreciate any feedback! movienight.gg
0
0
1
1.2K
ben hylak
ben hylak@benhylak·
the most annoying person you've ever met is always a few weeks away from shipping
ben hylak tweet media
41
13
1.1K
77.6K
Santiago
Santiago@extindar·
@favo_rion @divydend yeah they just act dumb / send you another link. it's not worth the time to respond
1
0
3
619
vsp
vsp@favo_rion·
@divydend has anyone ever tried reverse baiting them for the hell of it? act like you already opened the file, and theres an actual game in there, saying its the most dogshit thing ever? or say they sent the wrong zip file bcs its filled with porn or their school work or something
2
0
18
8.4K
div_y
div_y@divydend·
yo i think i'm good
div_y tweet media
638
4.2K
90.3K
5M
Santiago
Santiago@extindar·
It's ok Codex, you can say goblins
Santiago tweet media
0
0
0
26
Santiago
Santiago@extindar·
@maddiedreese Did you track how much % of usage it took up compared to 5.4? Would be interesting to see if it was more efficient as well
0
0
0
24
Maddie D. Reese
Maddie D. Reese@maddiedreese·
GPT-5.5 Extra High’s Codex Computer Use portrait of me! Definitely an improvement over 5.4. I think I’m going to call this “MaddieBench”
17
2
100
6.1K
Spuuunk
Spuuunk@spunkweaver·
@felixrieseberg Have you ever done anything valuable, or is it just this kind of piddly shit for your whole career?
1
0
0
595
Felix Rieseberg
Felix Rieseberg@felixrieseberg·
Today is a big day! We're launching a ~ new ~ version of Claude Code in the desktop app. It's been redesigned from the ground up for parallel work and is a lot faster. It's been my main way to use Claude Code for the last few weeks.
617
461
9.9K
945.6K