Santiago retweeted
Santiago
28 posts


@LauraLunaTech movienight.gg: tracking my movie nights with my friends, and working on a public version!

@LizBeal12 @can Looks like a little under 1% of taxpayers get audited annually, with the likelihood going up the higher your income.

@exigentveracity @can apps.irs.gov/app/vita/conte…
irs.gov/pub/irs-pdf/p5…
You pay a 20% penalty and ordinary income tax on non-qualified expenses if you’re under 65.

@SummersJohns69 @Sky1821084 @bryan_king @can You’re gambling that the IRS doesn’t audit you. If they do and you can’t prove you used HSA money for allowable purchases, you have to pay it back, pay ordinary income taxes, and pay a 20% penalty.

@Sky1821084 @bryan_king @can You most definitely don’t need to provide receipts to the HSA company. You use their card to pay and that’s it.

@KLieret Thanks Kilian, yes this para was the source. Good agentic benchmarks should use standard sota harnesses to be relevant imo. Overwhelming majority of agentic coding is currently done in Codex & Claude Code, so those are the important results. Look forward to the leaderboard

ProgramBench is an *agentic coding* benchmark where everything scores 0%. But they DIDN'T test Codex or Claude Code!
Harness doesn't even have context management/compaction. Models apparently never hit context limits, so this obviously isn't a fair test of the SOTA (Claude Code/Codex would run long).
Deedy @deedydas
The creators of SWE-Bench just dropped a really simple new benchmark every LLM gets 0% on. ProgramBench asks: can models recreate real executable programs (ffmpeg, SQLite, ripgrep) from scratch with no internet? We are far from saturated on model quality.
Santiago retweeted

1) Our team at Meta has a tough new coding benchmark challenging models to code entire programs including ffmpeg and the PHP compiler from scratch.
2) Top accuracy is 0%
3) We will be making the benchmark harder.
John Yang @jyangballin
How much of SQLite, FFmpeg, PHP compiler can LMs code from scratch? Given just an executable and no starter code or internet access. Introducing ProgramBench: 200 rigorous, whole-repo generation tasks where models design, build, and ship a working program end to end. 🧵

@AnechoicMedia_ @cremieuxrecueil Traffic hitting sites isn’t free and agents aren’t paying

@cremieuxrecueil Completely unnecessary imo. More of the internet should be LLM-enabled, especially reference material like source code repositories and documentation that is difficult for humans but readily ingested by agents.

@benhylak thanks for the motivation to post - i built this for my friends and figured i'd release it, appreciate any feedback! movienight.gg

@favo_rion @divydend yeah they just act dumb / send you another link. it's not worth the time to respond

@divydend has anyone ever tried reverse baiting them for the hell of it?
act like you already opened the file, and there's an actual game in there, saying it's the most dogshit thing ever?
or say they sent the wrong zip file because it's filled with porn or their school work or something

@maddiedreese Did you track what % of usage it took up compared to 5.4? Would be interesting to see if it was more efficient as well

@felixrieseberg Have you ever done anything valuable, or is it just this kind of piddly shit for your whole career?