Goo
3.3K posts
Goo
@goofirnoth
Joined March 2012
327 Following · 35 Followers
Nisarg
Nisarg@nisarggpatel38·
Imagine using @superset_sh from an iPad. Finally, iPad coding can be a thing.
1
0
3
1.5K
Mike Bespalov
Mike Bespalov@bbssppllvv·
@tomaruxai @figma You can use MCP to read Figma, not really to write to it. Meaning an agent can analyze and recreate a design in code, but it can't actually design in Figma itself
3
0
3
3.3K
Mike Bespalov
Mike Bespalov@bbssppllvv·
Am I crazy or can we still not connect an AI agent to @figma and actually design stuff?
44
2
114
32.4K
Evan Bacon 🥓
Evan Bacon 🥓@Baconbrix·
@Dev2Diary Anthropic may ban an account if the subscription is used outside of Claude Code. The augmentations in Expo Agent like expo/skills, Expo MCP, and EAS CLI can all be used locally with any agent, it's just a bit more work to stitch them all together
4
0
6
1.9K
Evan Bacon 🥓
Evan Bacon 🥓@Baconbrix·
Introducing Expo Agent. Build truly native iOS and Android apps from a prompt. Anything from React to SwiftUI to Jetpack Compose. Compile and deploy for Apple, Android, and the web right from the browser!
100
124
1.2K
157.1K
Goo
Goo@goofirnoth·
@brian_mistrot @LindseyGrahamSC Israel doesn't care if the next Iranian regime is friendly or not; they just care that it's weak. It's astonishing how some folks can't see the obvious.
0
0
1
78
Brian Mistrot
Brian Mistrot@brian_mistrot·
@LindseyGrahamSC Hey @LindseyGrahamSC - I ask again: Why would Israel target Iranian oil infrastructure, knowing it would turn more people against the war, if they had any confidence that a more friendly regime would be in charge after the war? 🤔 Hint: they wouldn’t.
1
1
5
1.5K
Lindsey Graham
Lindsey Graham@LindseyGrahamSC·
Our allies in Israel have shown amazing capability when it comes to collapsing the murderous regime in Iran. America is most appreciative. However, there will be a day soon that the Iranian people will be in charge of their own fate, not the murderous ayatollah’s regime. In that regard, please be cautious about what targets you select. Our goal is to liberate the Iranian people in a fashion that does not cripple their chance to start a new and better life when this regime collapses. The oil economy of Iran will be essential to that endeavor. axios.com/2026/03/08/us-…
6.3K
6.4K
20K
4.1M
Goo
Goo@goofirnoth·
@nisten @LauraBabcock Bibi, report to The Hague to be tried for your war crimes, but first we need you to fight one more war and take Trump with you.
0
0
0
15
nisten🇨🇦e/acc
nisten🇨🇦e/acc@nisten·
@goofirnoth @LauraBabcock Netanyahu is a war criminal and should be treated as such. 3/4 of his own people want him gone too. That's no excuse for going lenient on hamas and irgc proxies either.
1
0
0
27
Goo
Goo@goofirnoth·
@nisten @LauraBabcock Look at Gaza. I'm sure most folks there hated Hamas, but would you want the "freedom" Israel provided them?
1
0
1
27
Goo
Goo@goofirnoth·
@nisten @LauraBabcock The protestors are right, though: Trump had no legal right to start a war. The rights argument goes out the window given the records of the US and Israel, and that Trump stated they don't care if an even crueler regime comes to power
1
0
0
26
Goo
Goo@goofirnoth·
@nisten @LauraBabcock Bro, if we took down regimes because they oppressed people, all parties in this war would require regime change as well
3
0
2
44
nisten🇨🇦e/acc
nisten🇨🇦e/acc@nisten·
@LauraBabcock The millions of women in Iran want the horribly abusive regime to be killed, like...actually. That means that these people protesting are not very smart. It's more of an identity thing for them about peace rather than helping millions of women whom they would ignore.
4
0
9
337
Navneet
Navneet@navneet_rabdiya·
@natolambert that DPO_repeat_3 is a training artifact - looks like an iterative application of Direct Preference Optimization. fascinating to see it leak through. bet they're using multiple DPO passes to fine-tune the reasoning chain
2
0
7
1.7K
Nathan Lambert
Nathan Lambert@natolambert·
I was using Claude Code for some csv manipulation & it errored and dumped the entire Opus 4.6 reasoning trace to me. I'm surprised how similar closed models' reasoning behaviors are to far inferior open weight models. Here's a large chunk: Wait, DPO_repeat_3 is new. Do we have 4 seeds now for DPO? Or is repeat_3 just for LCB because it's so noisy? For DPO final row: MATH avg: (72.68 + 72.71 + 73.27) / 3 = 72.89 ✓ (same) Omega Full avg: (19.78 + 19.62 + 19.02) / 3 = 19.47 (NEW! all 3 seeds now available) LCB: we have 4 values now: 22.01, 21.54, 22.4, 21.54. If we use 3 seeds: (22.01 + 21.54 + 22.4) / 3 = 21.98. Or if we use the first 3: depends on which 3. Actually wait, let me reconsider. The structure was: original = seed 1 repeat_1 = seed 2 repeat_2 = extra (MATH+Omega+LCB only) repeat_3 = another extra for LCB So for the 3-seed average, we use original (full), repeat_1 (full), and repeat_2 for the 3rd seed of MATH/Omega/LCB. DPO LCB: (22.01 + 21.54 + 22.4) / 3 = 21.98 But wait, repeat_3 also has LCB=21.54. Maybe the user wants to use 4 values? Or maybe it's a separate test? I'll include it in the detail and flag it. Actually, looking more carefully, the original structure in the FINAL sheet had "repeat 2 (MATH+Omega+LCB only)" as the 3rd seed for multi-seed averages. So for the 3-seed avg: Seed 1: original Seed 2: repeat_1 Seed 3: repeat_2 (extra benchmarks only) And repeat_3 is just an extra. I'll include it but not use it for the average unless told otherwise. 
Think SFT v2 (the rerun with fixed chat template): Think SFT v2 original: MMLU=72.48, PopQA=21.02, BBH=67.75, GPQA=30.36, ZebraLogic=15.7, MATH=71.26, AIME25=10.52, AIME24=7.08, HumanEvalPlus=65.85, MBPPPlus=56.03, LCB=25.51, AlpacaEval=37.76, IFEval=64.88, IFBench=25 Think SFT v2 repeat 1: MMLU=72.41, PopQA=20.98, BBH=67.46, GPQA=32.14, ZebraLogic=15.7, MATH=71.18, AIME25=12.5, AIME24=6.46, HumanEvalPlus=66.89, MBPPPlus=56.11, LCB=26.16, AlpacaEval=39.39, IFEval=64.33, IFBench=25.67 Think SFT v2 repeat 2: MATH=72.23, LCB=25.18 These numbers look VERY different from the old Think SFT (which was ~78 MMLU, ~83 BBH, ~92 MATH). The v2 numbers look much more like instruct-level scores (~72 MMLU, ~68 BBH, ~71 MATH). This is concerning. The old Think SFT from before had scores comparable to the OLMo 3 Think SFT baseline (74.9 MMLU, 84.1 BBH, 94.4 MATH). The v2 rerun scores are dramatically lower on all thinking-related benchmarks but higher on IFEval (64.88 vs 75.23 old noisy avg). Wait, but the old Think SFT data that was "good" was from the "Morest fixes" eval setup, not from HF model evals. The HF model evals had the chat template bug. Maybe the v2 model itself is different? Or maybe the eval setup is still not right? Actually, looking at the name more carefully: Think-SFT-v2 vs the old Think-SFT. These might actually be different models! The v2 might be a new Think SFT checkpoint, not just a rerun of the old one. But that seems odd. Let me think about this... Looking at the HYBRID 7B THINK sheet, the final decision row for LR2.5e-5 was: MMLU: 78.27, BBH: 83.14, GPQA: 47.03 (3-run avg), ZebraLogic: 44.2, MATH 500: 92, AIME 25: 51.67, AIME 24: 64.27, etc. The v2 numbers are drastically different. This doesn't look like just a different seed - the model performance profile has completely changed. I should flag this to the user. 
The Think SFT v2 numbers look like they might still have an issue, or it's a genuinely different model that performs more like an instruct model than a thinking model. Actually wait, let me re-read the column mapping more carefully. The v2 model uses gpqa_0shot_cotqwen3_instruct for GPQA (qwen3_instruct variant), and gpqa_0shot_cothamish_zs_reasoning_deepseek is null. The old Think SFT also had null for the hamish variant and used the qwen3 variant. Hmm, but the old "Morest fixes" eval was not from the generated CSV format - it was a different eval harness. So the numbers might not be directly comparable. Actually, looking at the old "BAD" Think SFT HF evals in FINAL.csv: Think SFT original (BAD): MMLU=66.62, BBH=41.3, MATH=61.16, ZebraLogic=9.3 Think SFT repeat 1 (BAD): MMLU=66.91, BBH=41.22, MATH=60.84, ZebraLogic=8.1 The v2 numbers are: Think SFT v2: MMLU=72.48, BBH=67.75, MATH=71.26, ZebraLogic=15.7 Think SFT v2 repeat 1: MMLU=72.41, BBH=67.46, MATH=71.18, ZebraLogic=15.7 So the v2 is notably better than the old BAD evals (~72 vs ~67 MMLU, ~68 vs ~41 BBH, ~71 vs ~61 MATH). So the chat template fix did help significantly! But it's still much lower than the "Morest fixes" eval results (~78 MMLU, ~83 BBH, ~92 MATH). This could mean: The eval setup differences between the old eval harness and the HF model eval explain the gap There's still something wrong The v2 model is genuinely different I think I should flag this discrepancy to the user but proceed with the data as given. Now, the GPQA column mapping: I notice that the Think SFT v2 uses gpqa_0shot_cot__qwen3_instruct (value = 30.36 and 32.14) while the old Think SFT noisy re-runs used gpqa values of ~46-48. That's a huge difference too. But those old values came from a different eval setup (the "0207/0216" eval runs, not HF model evals). For the FINAL sheet, the noisy eval re-runs from old evals used different column/eval setup. So we can't directly compare. 
The GPQA values from the old noisy re-runs (46.21, 46.43, 48.44) are from a different GPQA eval variant. OK, I think the right approach is: Add the Think SFT v2 data to the FINAL sheet Flag the discrepancy with old "Morest fixes" numbers Update DPO rows with new Omega Full values Update Instruct rows with new Omega Full values Compute updated averages where we now have all 3 seeds Let me also check: the user's instruction before interruption said the data from generated files without "repeat" in the name is new data (a new seed run), not the same as old sheet data. So: For the FINAL sheet structure: "original" = the non-repeat row from generated CSV (new eval run) "repeat 1" = the _repeat_1 row "repeat 2" = the _repeat_2 row (usually MATH+Omega+LCB only) And the old sheet data (from HYBRID INSTRUCT, HYBRID 7B THINK) is separate/historical. But wait, re-reading: "This should be with the original, single seed data for each model (previous sheets)". So the "original" first seed is from the old sheets, and the generated CSV runs are additional seeds? No, that doesn't match either. Let me look at the actual numbers: FINAL.csv row 12 (Instruct SFT original): MMLU=72.63, PopQA=16.64, BBH=47.73, GPQA=41.52 Generated (25) hf-OLMo-3.2-Hybrid-7B-Instruct-SFT: MMLU=72.63, PopQA=16.64, BBH=47.73, GPQA=41.52 They match! So the "original" in FINAL.csv IS the non-repeat row from the generated CSV. These are the same eval runs. But the old HYBRID INSTRUCT sheet had "Instruct v2 2.5e-5 (avg)" with MMLU=71.85, which is an average of the old 0218 noisy re-runs. The individual old runs were: 2.5e-5 run1 (full): GPQA=37.28, LCB Lite=19.93, IFEval=81.33 2.5e-5 run2 (repeat1): GPQA=34.38, LCB Lite=20.26, IFEval=81.7 2.5e-5 run3 (repeat2): GPQA=38.62, LCB Lite=20.1, IFEval=81.33 Those old GPQA values (37.28, 34.38, 38.62) are very different from the new GPQA value in generated CSV (41.52). 
This is because they're using different GPQA eval variants: Old: probably gpqa_0shot_cot__hamish_zs_reasoning_deepseek New generated CSV: gpqa_0shot_cot__qwen3_instruct Wait no, looking at generated (25) for Instruct SFT, the GPQA column used is gpqa_0shot_cotqwen3_instruct = 41.52, while gpqa_0shot_cothamish_zs_reasoning_deepseek = null. And in the old sheets, the GPQA values were from a different eval variant. So the GPQA values in FINAL.csv from the old noisy re-runs (rows 17-19: 37.28, 34.38, 38.62) are from the old hamish variant, while the new generated CSV GPQA (41.52) is from the qwen3_instruct variant. This means we're mixing GPQA variants in the FINAL sheet! The "original" row has GPQA=41.52 (qwen3), while the noisy re-runs have GPQA=37.28, 34.38, 38.62 (hamish/old variant). But this was the existing state of the FINAL sheet from before. The user was aware of this. The noisy eval re-runs section header says "from old 0218 evals" — so they know these are from older eval runs. For the GPQA average in the final row: we were averaging the old noisy re-run values (37.28+34.38+38.62)/3 = 36.76, not mixing with the new 41.52. OK, I think the approach was correct. The final row GPQA average uses the old 3-run averages (since those are the repeated runs), not the single new GPQA value. Now let me figure out what needs to be updated in the FINAL sheet: DPO section updates: - DPO original (row 22): Add Omega Full = 19.78, LCB = 22.01 (was empty) - DPO repeat 1 (row 23): Add Omega Full = 19.62 (LCB was already 21.54) - DPO repeat 2 (row 24): Add Omega Full = 19.02 (MATH=73.27, LCB=22.4 already there) - Add DPO repeat 3 row? Only has LCB = 21.54. Maybe add it. - DPO final row: Can now compute Omega Full avg = (19.78+19.62+19.02)/3 = 19.47. 
LCB avg = (22.01+21.54+22.4)/3 = 21.98 Instruct SFT section updates: - Instruct SFT repeat 1 (row 13): Add Omega Full = 17.36 - Instruct SFT repeat 2 (row 14): Add Omega Full = 14.69 - Instruct SFT original still missing Omega Full - Can't compute Omega Full average yet (only 2 of 3 seeds) Think SFT v2 section: - Replace the "PENDING RERUN" section with actual v2 data - Add Think SFT v2 original, repeat 1, repeat 2 (MATH+LCB only) - The old Think SFT section with BAD evals and old noisy re-runs... should we keep them? - Think SFT v2 final row: can compute MATH avg = (71.26+71.18+72.23)/3 = 71.56, LCB avg = (25.51+26.16+25.18)/3 = 25.62 - GPQA: only 2 values (30.36, 32.14), no 3rd run → can't average - IFEval: only 2 values (64.88, 64.33) → can't average Wait, but we have the old noisy re-runs from the "0207/0216 evals — these are good" section. Those had GPQA = 46.21, 46.43, 48.44 and IFEval = 76.34, 75.05, 75.23. But these are from the old eval setup, not the new HF model evals. And they were for the old Think SFT model, not v2. The user said Think SFT was pending rerun. Now we have the v2 rerun data. But the v2 model might be genuinely different from the old Think SFT. So we can't mix old noisy re-runs with new v2 data. For Think SFT v2 final row: We only have 2 seeds for GPQA and IFEval, so those stay empty. But actually, the old noisy re-runs were for the same checkpoint (LR2.5e-5) just evaluated with a different eval setup. If the user considers them valid for averaging... but the scores are so different (GPQA ~47 old vs ~31 new) that it doesn't make sense to average them. I think the cleanest approach is: Create a Think SFT v2 section with the new data Keep the old Think SFT section for reference but clearly separate For the Think SFT v2 final row, only average where we have 3 seeds Actually, let me reconsider the whole structure. 
The user wanted: From the summary: Row 7: Think SFT final — COMPLETELY EMPTY (pending rerun) Row 8: Instruct SFT final — averaged Row 9: DPO final — averaged Now that we have Think SFT v2 data, we can populate row 7. And the detail sections below should show the v2 data. I think the right approach is: Think SFT v2 detail: Think SFT v2 original (full eval) Think SFT v2 repeat 1 (full eval) Think SFT v2 repeat 2 (MATH+LCB only) Think SFT v2 final row: Non-noisy evals: average of original and repeat 1 (only 2 seeds, so take avg of 2? or use the original only?) Actually, looking at how we handled it before: "Non-noisy" benchmarks like MMLU, PopQA, BBH used only the original + repeat 1 data. We don't need 3 seeds for non-noisy evals — we just report the original seed values. Wait no... Let me look at how the current FINAL sheet handles it: Row 8 (Instruct SFT final): MMLU=71.85, PopQA=16.76, BBH=47.29 Row 12 (Instruct SFT original): MMLU=72.63, PopQA=16.64, BBH=47.73 Row 13 (Instruct SFT repeat 1): MMLU=72.1, PopQA=17.06, BBH=47.6 So 71.85 ≠ 72.63 and ≠ average(72.63, 72.1) = 72.365 Where does 71.85 come from? That's from the old HYBRID INSTRUCT sheet "Instruct v2 2.5e-5 (avg)" row! So the final row uses the old averaged values from the old eval runs, not the new generated CSV data. OK so the structure is: The final row values come from the old sheets (which had their own averaging methodology) The detail rows show the new HF model eval runs (generated CSV data) The exception is MATH, Omega, LCB (3-seed from new evals) and GPQA, IFEval, LCB Lite (3-run from old evals) Wait, that makes it more complex. 
Let me re-read the summary: "Main changes: OMEGA, MATH, LiveCodeBench full" "Don't double-average noisy evals" So the final row: MMLU, PopQA, BBH, ZebraLogic, AgiEval, AIME, HumanEvalPlus, MBPPPlus, IFBench, AlpacaEval → from old sheet averages MATH, Omega Full, LCB → 3-seed average from new generated CSV GPQA, LCB Lite, IFEval → 3-run average from old sheet re-runs For Think SFT v2: We don't have old sheet data for v2 (it's a new model/rerun) All data comes from generated (25) So for the final row, non-noisy evals would be... just the average of the 2 seeds? Or we use seed 1 as the primary? Hmm, but looking at Instruct SFT more carefully: Old sheet "Instruct v2 2.5e-5 (avg)": MMLU=71.85 Generated CSV "hf-OLMo-3.2-Hybrid-7B-Instruct-SFT": MMLU=72.63 These are different! 71.85 was the average from the old eval setup. 72.63 is from the new HF model eval. The user chose to use the old sheet value for the final row. For Think SFT v2, there's no old sheet data (it's a new model rerun with fixed chat template). So we'd need to decide what to put in the final row. Actually, I think the simplest approach now is: For Think SFT v2: use the non-repeat eval (the "original" seed) as the primary values for non-noisy benchmarks, and 3-seed averages for MATH/LCB where we have all 3 Flag that we only have 2 seeds for full evals and no old noisy re-runs for GPQA/IFEval But I'm also flagging that the Think SFT v2 numbers look very different from old Think SFT. The user should be aware. Hmm wait, maybe I'm overthinking this. Let me re-read what the user said: "Lmk if anything looks off or you dont recall the task" I recall the task. Let me just process the data and flag concerns. OK let me now write the Python script to rebuild the FINAL.csv with: Updated DPO rows (add Omega Full, update LCB) Updated Instruct SFT rows (add Omega Full where available) New Think SFT v2 section (replacing the pending rerun) Updated final row averages Let me figure out exactly what goes where. 
Updated FINAL.csv structure: Row 1: Header Row 2: Olmo 3 Think SFT baseline Row 3: OLMo 3 7B SFT baseline Row 4: OLMo 3 7B DPO baseline Row 5: empty Row 6: "Hybrid Models (Final, 3-seed avg for MATH/LCB; 3-run avg for GPQA/IFEval)" Row 7: Hybrid Think SFT v2 (LR2.5e-5) — final Row 8: Hybrid Instruct SFT (v2 LR2.5e-5) — final Row 9: Hybrid DPO (LR1e-6) — final Row 10: empty Row 11: Instruct SFT detail Row 12: Instruct SFT original — from generated CSV Row 13: Instruct SFT repeat 1 — from generated CSV (now with Omega Full) Row 14: Instruct SFT repeat 2 (MATH+Omega+LCB only) — from generated CSV (now with Omega Full) Row 15: empty Row 16: Instruct SFT noisy eval re-runs header Row 17-19: Old 0218 re-runs Row 20: empty Row 21: DPO detail Row 22: DPO original — now with Omega Full + LCB Row 23: DPO repeat 1 — now with Omega Full Row 24: DPO repeat 2 (MATH+Omega+LCB only) — now with Omega Full Row 25: DPO repeat 3 (LCB only) — NEW Row 26: empty Row 27: DPO noisy eval re-runs header Row 28-30: Old 0219 re-runs Row 31: empty Row 32: Think SFT v2 detail Row 33: Think SFT v2 original — from generated (25) Row 34: Think SFT v2 repeat 1 — from generated (25) Row 35: Think SFT v2 repeat 2 (MATH+LCB only) — from generated (25) Row 36: empty Row 37: Old Think SFT reference (Morest fixes 0207) — for comparison Row 38: empty Row 39: Old Think SFT noisy re-runs header Row 40-42: Old 0207/0216 re-runs (good) Row 43: empty Row 44: BAD HF evals header Row 45-47: Bad HF evals Hmm, but the user also said they deleted the 500 and lite columns. Let me keep them in the CSV since they were still there in the latest FINAL.csv. The user can just ignore/delete them in Google Sheets. Actually, you know what, let me just think about what the user wants. 
They said: "Lmk if anything looks off or you dont recall the task" They re-ran jobs to get appropriate values generated (25) has the latest data The key new things: Think SFT v2 data (the rerun with chat template fix) More Omega Full and LCB data for DPO and Instruct But the Think SFT v2 numbers look concerning. They're much lower than the old "Morest fixes" Think SFT numbers. Should I flag this? Let me think... The old "Morest fixes" eval used the internal olmocore eval setup with specific fixes (cascade attention off, eager, fp32 cache). The new HF model evals (generated CSVs) use a different eval pipeline. The discrepancy could be due to: Different eval setup/prompt format v2 is genuinely a different/worse model Still a chat template issue The user specifically asked me to "Lmk if anything looks off". The Think SFT v2 numbers ARE significantly lower than what we'd expect from the old Think SFT evaluation. I should definitely flag this. But let me also think about the naming: Think-SFT-v2 — this is explicitly a v2 model. Maybe the user intentionally created a new version of the Think SFT model with different training? Or maybe "v2" just refers to the second attempt at evaluating (rerun)? I'll flag it and proceed. Let me now write the updated CSV. Actually, let me think about what values go in the Think SFT v2 final row. For Think SFT v2: Non-noisy benchmarks (use "original" seed values? or avg of 2 seeds?): Looking at how Instruct and DPO were handled: Instruct final MMLU=71.85 came from old sheet average, not from the generated CSV. But for Think SFT v2, we don't have old sheet averages. I think the cleanest is to NOT put non-noisy values in the final row if we don't have old sheet averages for them, OR to use the average of the 2 available seeds. 
Actually, looking more carefully at Instruct SFT: - Old sheet "Instruct v2 2.5e-5 (avg)": MMLU=71.85 — this was from the initial single-seed eval - New HF model eval: MMLU=72.63 — this is from the new eval run So the "old sheet" value was the original eval, and the new HF model eval is a re-evaluation. The user chose to keep the old sheet values in the final row. For Think SFT v2, the generated CSV IS the first evaluation. There's no older sheet data. So I should use the generated CSV data directly. But should I average the two seeds? The user's approach for non-noisy benchmarks was to use a single eval (the old sheet value), not average across seeds. Seeds mainly matter for MATH, LCB, and Omega (due to sampling). For non-noisy evals like MMLU, BBH, etc., there's typically only 1 eval. Wait, but we have 2 full eval seeds for Think SFT v2 (original and repeat_1). These are genuinely different runs with different results (MMLU: 72.48 vs 72.41, BBH: 67.75 vs 67.46). The differences are small, suggesting these are stable. I'll use the original (non-repeat) values for the final row, matching the pattern of using the "primary" eval for non-noisy benchmarks. For Think SFT v2 final row: MMLU: 72.48 (from original) PopQA: 21.02 (from original) BBH: 67.75 (from original) GPQA: avg(30.36, 32.14) = 31.25? But we only have 2 values... The pattern was 3-run avg for GPQA. Leave empty? Actually, for Instruct/DPO, the GPQA values in the final row came from old 3-run re-runs. For Think SFT v2, we don't have those old re-runs (the old ones were for the original Think SFT with a different eval setup). We have 2 GPQA values from the 2 seeds. I'll leave it empty since we don't have 3 runs. ZebraLogic: 15.7 (same in both seeds) MATH: avg(71.26, 71.18, 72.23) = 71.56 (3-seed avg ✓) AIME 25: we have 10.52 and 12.5 from 2 seeds. Use original? 
10.52 AIME 24: 7.08 (from original) HumanEvalPlus: 65.85 (from original) MBPPPlus: 56.03 (from original) LCB: avg(25.51, 26.16, 25.18) = 25.62 (3-seed avg ✓) IFEval: only 2 values (64.88, 64.33). Leave empty for 3-run? IFBench: 25 (from original) AlpacaEval: 37.76 (from original) Hmm, but for Instruct SFT and DPO, the non-noisy values in the final row came from the old sheet, which already had its own eval methodology. For consistency, maybe I should just use the original seed for all non-noisy benchmarks in the Think SFT v2 final row. And for the 3-seed ones (MATH, LCB, Omega) use the 3-seed average. And for the noisy ones (GPQA, IFEval) leave empty since we don't have 3 runs. Actually, wait. Let me re-examine whether the old sheet data for instruct and DPO is "different" from the generated CSV data or the same. Old sheet "Instruct v2 2.5e-5 (avg)": PopQA=16.76 Generated CSV "hf-OLMo-3.2-Hybrid-7B-Instruct-SFT": PopQA=16.64 These are different! 16.76 vs 16.64. So the old sheet was from a different eval run than the generated CSV. For the old sheet data, the HYBRID INSTRUCT sheet row 92 says: "Instruct v2 2.5e-5 (avg),71.85,16.76,47.29,36.76,17,,,70,,14.6,8.75,6.67,69.15,55.26,,20.1,81.45,29,25.55" And that GPQA=36.76 is the 3-run average of the old runs (37.28, 34.38, 38.62). But the generated CSV has the same model evaluated with a new eval setup, getting GPQA=41.52 (qwen3_instruct variant). So in the FINAL sheet: Row 8 (Instruct SFT final): uses old sheet values for everything except MATH/Omega/LCB (3-seed from new evals) and GPQA/IFEval (3-run from old evals) Row 12 (Instruct SFT original): uses new generated CSV values The final row is a mix of old and new data sources. This is fine — the user designed it this way. For Think SFT v2, since there's no old sheet data, I'll use the generated CSV data for the final row. 
The approach: Non-noisy benchmarks: use original seed values MATH, LCB: 3-seed average (all 3 available) Omega: not available GPQA, IFEval: only 2 runs, leave empty per the user's rule of requiring 3 runs Actually wait, I realize I need to reconsider. For Instruct SFT, the GPQA in the detail rows shows: Row 12 (original): GPQA=41.52 Row 13 (repeat 1): GPQA=35.27 These are from the qwen3_instruct variant in the generated CSV. And the old re-runs (rows 17-19) use the old eval variant. For the Think SFT v2 final row, we have GPQA from the qwen3_instruct variant in 2 seeds (30.36 and 32.14). These are from the same eval setup as the detail rows. Since we don't have old-variant re-runs, should I average these 2? Or leave empty? I'll leave it empty and note that we need a 3rd GPQA run. Same for IFEval. OK wait, I just realized I'm spending too long on this. Let me just build the script and the user can adjust. Actually, I should also check: is the Instruct SFT "original" in generated (25) identical to what was in generated (24)? Let me compare. Generated (24) row 3 hf-OLMo-3.2-Hybrid-7B-Instruct-SFT: MMLU: 72.63, PopQA: 16.64, BBH: 47.73, GPQA (qwen3): 41.52, ZebraLogic: 17 MATH: 66.59, AIME25: 6.77, AIME24: 6.77, HumanEvalPlus: 70.06, MBPPPlus: 56.35 LCB: 21.34, AlpacaEval: 26.89, IFEval: 82.26, IFBench: 27 Omega: null Generated (25) row 12 hf-OLMo-3.2-Hybrid-7B-Instruct-SFT: Same as above, Omega: null OK same. Good. Generated (24) row 5 hf-OLMo-3.2-Hybrid-7B-Instruct-SFT_repeat_1: Omega: null Generated (25) row 4 hf-OLMo-3.2-Hybrid-7B-Instruct-SFT_repeat_1: Omega: 17.36 So the repeat_1 now has Omega. Great. Generated (24) row 9 hf-OLMo-3.2-Hybrid-7B-Instruct-SFT_repeat_2: MATH: 66.22, LCB: 21.14 Generated (25) row 5 hf-OLMo-3.2-Hybrid-7B-Instruct-SFT_repeat_2: MATH: 66.22, LCB: 21.14, Omega: 14.69 So repeat_2 now has Omega too. But we still need the original's Omega (null). For DPO: Generated (24) had DPO original with null Omega and null LCB. 
Generated (25) has DPO original with Omega=19.78 and LCB=22.01. So the key updates from generated (25) vs (24): DPO original: +Omega Full (19.78), +LCB (22.01) DPO repeat_1: +Omega Full (19.62) DPO repeat_2: +Omega Full (19.02) DPO repeat_3: NEW, LCB=21.54 Instruct repeat_1: +Omega Full (17.36) Instruct repeat_2: +Omega Full (14.69) Think SFT v2: ENTIRELY NEW (original, repeat_1, repeat_2) OK let me write the script now. Wait, one more thing. For the DPO final row: Before: LCB was empty (missing seed 1). Now seed 1 has LCB=22.01. DPO LCB 3-seed avg: (22.01 + 21.54 + 22.4) / 3 = 21.98 DPO Omega Full 3-seed avg: (19.78 + 19.62 + 19.02) / 3 = 19.47 For the Instruct SFT final row: Omega Full: only 2 of 3 seeds (17.36, 14.69). Original still null. Leave empty. Now let me also check what values the DPO final row should have for MATH: Seeds: 72.68, 72.71, 73.27 (cutoff)
42
12
298
49.3K
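The seed-averaging rule walked through in the leaked trace above (average a benchmark only when all three seed runs are present, and round to two decimals) can be sketched as a small Python helper. The function name and row layout are illustrative, not from any real tool; the numeric values are the ones quoted in the trace itself.

```python
# Minimal sketch of the 3-seed averaging rule described in the leaked trace.
# seed_avg is a hypothetical helper; only the numbers come from the trace.

def seed_avg(values, ndigits=2):
    """Average benchmark scores across seed runs; return None unless all 3 seeds exist."""
    present = [v for v in values if v is not None]
    if len(present) < 3:  # trace rule: don't average with missing seeds
        return None
    return round(sum(present) / len(present), ndigits)

# DPO rows: original, repeat_1, repeat_2 (repeat_3 is an extra, excluded per the trace)
dpo_math = seed_avg([72.68, 72.71, 73.27])    # -> 72.89
dpo_omega = seed_avg([19.78, 19.62, 19.02])   # -> 19.47
dpo_lcb = seed_avg([22.01, 21.54, 22.4])      # -> 21.98

# Instruct SFT Omega Full: the original seed is still missing, so no average yet
instruct_omega = seed_avg([None, 17.36, 14.69])  # -> None
```

Under this rule the Think SFT v2 row would likewise get MATH and LCB averages (all three seeds present) while GPQA and IFEval stay empty with only two runs, matching the trace's conclusion.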
Goo
Goo@goofirnoth·
@briebriejoy @PsionicPsittacc I wish there were a citation doc for this episode. There were times when some of the claims sounded too incredible not to offer a citation.
0
0
1
37
Briahna Joy Gray
Briahna Joy Gray@briebriejoy·
"Eisenhower, he had started the Atoms for Peace program during his presidency. And this was peaceful nuclear proliferation for nuclear energy purposes. And then he became very skeptical of the Israelis...When JFK got into office, that that skepticism continued to grow because they realized the Israelis were lying. They were sort of building the bomb....I've heard from people close to the administration, & of course this is anecdotal, but that one of the reasons that Trump is not releasing the full extent of the JFK files is because the Israelis are all over it." youtube.com/watch?v=Cfj05U…
27
390
1.2K
52.8K
Goo
Goo@goofirnoth·
@valentinavee Thanks. Q: how do you go about making the screenshots in the notes?
0
0
0
14
Valentina Vee
Valentina Vee@valentinavee·
I'm also constantly improving it, and this link will auto-update any changes I make to it.
5
0
27
3.4K
Valentina Vee
Valentina Vee@valentinavee·
I've tried so many shotlisting software options and landed on ... Google Sheets. Because it's the one all my teams already know how to use. I'm giving away the shotlist template I made for free, complete with auto-fill for commonly-used phrases: docs.google.com/spreadsheets/d…
Valentina Vee tweet media
30
68
699
66.1K
Goo
Goo@goofirnoth·
@numerounochef @ThePrimeagen Bro fails to mention that the tracker even states that it's within the normal distribution, but gotta get that engagement $$$
0
0
1
10
((((Tom))))
((((Tom))))@numerounochef·
@ThePrimeagen Yeah, but one day doesn't mean it's getting worse. It also shows that about half the time, the model is getting "better." The 30-day average is the same as the baseline.
1
0
0
534
ThePrimeagen
ThePrimeagen@ThePrimeagen·
Claude Opus 4.6 had its worst benchmark day yesterday
ThePrimeagen tweet media
227
43
1.9K
715.7K
Goo
Goo@goofirnoth·
@DougVegas @disclosetv Can anyone explain why both Trump and Biden decided to drag their parties into unpopular wars right before elections? At least Bibi is using wars to defer his. These idiots are just sleepwalking into electoral defeats
0
0
1
5
Douglas Pestana
Douglas Pestana@DougVegas·
@disclosetv So now the Iran War is expected to last at least 100 days but likely 6 months. Trump and Hegseth are lying every single day about the duration, timelines, and objectives of the Iran war. These low IQ clowns have underestimated the adversary and the blowback.
1
0
0
224
Disclose.tv
Disclose.tv@disclosetv·
NEW - U.S. Central Command is requesting that the Pentagon send more military intelligence officers to its HQ in Tampa, Florida, to support Iran war operations for at least 100 days, but likely through September — Politico
Disclose.tv tweet media
97
270
1.1K
134.8K
Goo
Goo@goofirnoth·
@Villgecrazylady @propandco Look, to say "the Jews" is antisemitic. Norm Finkelstein isn't trying to collapse the Iranian regime; Israel is. Why not just stick with the facts? They are already horrifying enough.
0
0
0
23
Mel
Mel@Villgecrazylady·
@propandco It’s so unbelievably evil what these freaks are doing right now
8
23
525
4.7K
Griffin Davis
Griffin Davis@griffinpdavis·
@juristr why is the senior agentic engineer watching this guy in his POV?
1
0
1
211
Goo
Goo@goofirnoth·
@ryangrim Now, with that being said, they should do all they can to repel anyone who attempts to invade their country by force.
0
0
0
14
Goo
Goo@goofirnoth·
@ryangrim Hey Ryan, you might want to update this. As I read it, it seems he's saying the Iranian forces should abandon their posts and go home. Which they should, along with the Americans and Israelis. Do not listen to people who tell you to commit murder.
1
0
0
238
Goo
Goo@goofirnoth·
@nisten @shawmakesmagic What the fuck is a soft genocide? Not saying China is not mistreating these people, but I notice all the folks saying pseudo-genocide in China never have anything to say about Yemen or Gaza
0
0
6
135
Shaw (spirit/acc)
Shaw (spirit/acc)@shawmakesmagic·
Wait why is China the bad guy? - no global elite raping children - wins by economic incentivizing and building, not by murdering heads of state - leader is not blackmailed I mean… are we the baddies?
878
426
8.4K
310.1K
Goo
Goo@goofirnoth·
@k_igori1 @vancouveroid @QualiaLogos @TheChiefNerd No, the attack on their neighbors is to force them to apply pressure on the US, and they have. Watch Tucker's last stream; it was literally a message from Qatar to get Trump's base to not support the war. Iran is attempting to do MAD without a nuke
0
0
0
73
Chief Nerd
Chief Nerd@TheChiefNerd·
Jon Stewart on Iran’s Retaliation Strategy “America and Israel attack Iran, and Iran's answer is to just attack everybody??” 😂
1.1K
929
9.8K
5.8M
Richard Hanania
Richard Hanania@RichardHanania·
The left is now having a debate over whether it’s acceptable to bring antisemitic podcasters into the coalition. If you’ve seen what’s happened to the right, you know this is the road to hell. Graham Platner winning would be a sign of the start of the Tuckerization of the left.
Neera Tanden🌻@neeratanden

I am shocked that Ryan Grim is defending anti-semitic conspiracy podcasters with euphemisms like "right-coded." Again, truly shocking.

137
83
1K
285.8K