MoonRide @moonride303
645 posts
Curious about learning and creativity
Poland · Joined April 2023
7.4K Following · 227 Followers
MoonRide retweeted
Lech Mazur @LechMazur:
Claude Opus 4.7 (high reasoning) unexpectedly performs significantly worse than Opus 4.6 (high reasoning) on the Thematic Generalization Benchmark: 80.6 → 72.8. Opus 4.7 (no reasoning) scores 52.6 compared to 68.8 for Opus 4.6.

Other results: GLM-5.1 scores 69.8, Qwen3.5-122B-A10B 51.2, Qwen3.5-27B 45.5, and MiniMax-M2.7 39.3 (also an unexpectedly weak result).

I've double-checked the poor Claude Opus 4.7 results, and they appear to be real. This looks like a genuine regression compared with Opus 4.6. Opus 4.7 now supports xhigh reasoning and that run is in progress. I'll also compare token usage.

This benchmark tests whether large language models can infer a specific latent theme from a few examples, use anti-examples to reject the broader but wrong pattern, and then identify the one true match among close distractors. More info: github.com/lechmazur/gene…
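For readers unfamiliar with the task format, here is a minimal sketch of what one benchmark item might look like. The item content, field names, and helpers are illustrative assumptions, not taken from the actual repository (see the link above for the real harness).

```python
# Illustrative sketch of a thematic-generalization item, NOT the real
# benchmark data: the model must infer the narrow hidden theme from the
# examples, use the anti-examples to reject the broader pattern, and pick
# the single matching candidate among close distractors.
ITEM = {
    "examples": ["trout", "salmon", "carp"],      # hidden theme: freshwater fish
    "anti_examples": ["tuna", "cod"],             # fish, but not freshwater
    "candidates": ["pike", "mackerel", "herring", "sardine"],
    "answer": "pike",
}

def build_prompt(item: dict) -> str:
    return (
        "These items share one specific hidden theme: "
        + ", ".join(item["examples"]) + ".\n"
        + "These items do NOT fit the theme (they only fit a broader pattern): "
        + ", ".join(item["anti_examples"]) + ".\n"
        + "Exactly one of these candidates fits the theme; answer with it alone: "
        + ", ".join(item["candidates"])
    )

def score(model_answer: str, item: dict) -> int:
    return int(model_answer.strip().lower() == item["answer"])

print(build_prompt(ITEM))
```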
MoonRide @moonride303:
@nattyshaps @bcherny @AiBattle_ It depends on the use case. It got better scores in many benchmarks, but regressed in some others. 4.7 looks a bit better on average, but for some tasks 4.6 will still be a better choice.
AiBattle @AiBattle_:
Opus 4.7 (Max) and Opus 4.6 (64K) scores on the MRCR v2 (8-needle) long-context benchmark:

256K context:
- Opus 4.6: 91.9%
- Opus 4.7: 59.2%

1M context:
- Opus 4.6: 78.3%
- Opus 4.7: 32.2%
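MRCR-style tests bury several key-value "needles" in long filler text and ask the model to retrieve exactly one. A minimal sketch of that idea, not the official MRCR v2 harness (the filler, needle phrasing, and scoring are all illustrative):

```python
# Illustrative 8-needle retrieval probe; the real MRCR v2 is more involved.
import random

def build_context(needles: dict[str, str], filler_sentences: int = 2000) -> str:
    doc = ["The sky was a flat grey that morning."] * filler_sentences
    for key, value in needles.items():        # scatter needles at random depths
        line = f"Remember this: the code word for {key} is {value}."
        doc.insert(random.randrange(len(doc)), line)
    return " ".join(doc)

needles = {f"project-{i}": f"WORD{i}{i}" for i in range(8)}   # 8 needles
target = "project-3"
prompt = (
    build_context(needles)
    + f"\n\nQuestion: what is the code word for {target}? Answer with the word only."
)

def score(model_answer: str) -> int:
    return int(needles[target] in model_answer)
```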
MoonRide @moonride303:
@bcherny @AiBattle_ In my own benchmark (a set of simple reasoning questions across a range of context lengths) it's noticeably better than Opus 4.5, but slightly worse than 4.6. The ability to perform tasks like complex RCA might actually be aligned with MRCR scores. Mixed feelings about this release.
Boris Cherny @bcherny:
👋 We kept MRCR in the system card for scientific honesty, but we've actually been phasing it out slowly. Two reasons: (1) it's built around stacking distractors to trick the model, which isn't how people actually use long context, and (2) we care more about applied long-context capability than needle-retrieval. Graphwalks is a better signal for applied reasoning over long context, and internally we've seen this model do really well on long-context code. MRCR wasn't included in the Mythos Preview system card for these reasons, but Graphwalks was - that will be the case for future models too. See system card: cdn.sanity.io/files/4zrzovbb…
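By contrast, a Graphwalks-style probe serializes a directed graph into the prompt and asks for multi-hop reachability, something a grader can verify with an ordinary BFS. A minimal sketch under those assumptions (not the actual Graphwalks dataset or prompts):

```python
# Illustrative graph-reachability probe; gold answers come from BFS.
from collections import deque
import random

random.seed(0)
nodes = [f"n{i}" for i in range(200)]
edges = [(random.choice(nodes), random.choice(nodes)) for _ in range(600)]

prompt = (
    "Here is a directed graph, one edge per line:\n"
    + "\n".join(f"{a} -> {b}" for a, b in edges)
    + "\n\nList every node reachable from n0 in at most 2 hops."
)

def gold_answer(start: str, max_hops: int = 2) -> set[str]:
    adjacency: dict[str, list[str]] = {}
    for a, b in edges:
        adjacency.setdefault(a, []).append(b)
    seen, frontier = {start}, deque([(start, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_hops:
            continue
        for nxt in adjacency.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, depth + 1))
    return seen - {start}

model_output = "..."                      # whatever the model returns
predicted = set(model_output.replace(",", " ").split())
print(predicted == gold_answer("n0"))     # exact-set grading
```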
MoonRide retweeted
Google Gemma @googlegemma:
Meet Gemma 4! Purpose-built for advanced reasoning and agentic workflows on the hardware you own, and released under an Apache 2.0 license. We listened to invaluable community feedback in developing these models. Here is what makes Gemma 4 our most capable family of open models yet: 👇
MoonRide retweeted
Pliny the Liberator 🐉:
💥 INTRODUCING: OBLITERATUS!!! 💥 GUARDRAILS-BE-GONE! ⛓️‍💥

OBLITERATUS is the most advanced open-source toolkit ever for removing refusal behaviors from open-weight LLMs — and every single run makes it smarter.

SUMMON → PROBE → DISTILL → EXCISE → VERIFY → REBIRTH. One click. Six stages. Surgical precision. The model keeps its full reasoning capabilities but loses the artificial compulsion to refuse — no retraining, no fine-tuning, just SVD-based weight projection that cuts the chains and preserves the brain.

This master ablation suite brings the power and complexity that frontier researchers need while providing intuitive and simple-to-use interfaces that novices can quickly master. OBLITERATUS features:

- 13 obliteration methods — from faithful reproductions of every major prior work (FailSpy, Gabliteration, Heretic, RDO) to our own novel pipelines (spectral cascade, analysis-informed, CoT-aware optimized, full nuclear).
- 15 deep analysis modules that map the geometry of refusal before you touch a single weight: cross-layer alignment, refusal logit lens, concept cone geometry, alignment imprint detection (fingerprints DPO vs RLHF vs CAI from subspace geometry alone), Ouroboros self-repair prediction, cross-model universality indexing, and more.
- The killer feature: the "informed" pipeline runs analysis DURING obliteration to auto-configure every decision in real time. How many directions. Which layers. Whether to compensate for self-repair. Fully closed-loop.
- 11 novel techniques that don't exist anywhere else — Expert-Granular Abliteration for MoE models, CoT-Aware Ablation that preserves chain-of-thought, KL-Divergence Co-Optimization, LoRA-based reversible ablation, and more.
- 116 curated models across 5 compute tiers. 837 tests.

But here's what truly sets it apart: OBLITERATUS is a crowd-sourced research experiment. Every time you run it with telemetry enabled, your anonymous benchmark data feeds a growing community dataset — refusal geometries, method comparisons, hardware profiles — at a scale no single lab could achieve. On HuggingFace Spaces telemetry is on by default, so every click is a contribution to the science. You're not just removing guardrails — you're co-authoring the largest cross-model abliteration study ever assembled.
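The "SVD-based weight projection" referenced above is, in public abliteration write-ups, a directional ablation: estimate a refusal direction from activation differences, then project it out of weight matrices that write to the residual stream. A minimal sketch of that generic technique, not OBLITERATUS's actual pipeline (the activation arrays and shapes are stand-ins):

```python
# Generic directional-ablation sketch in NumPy; NOT OBLITERATUS's code.
# harmful_acts / harmless_acts stand in for hidden states collected at one
# layer over prompts the model refuses vs. prompts it answers.
import numpy as np

def refusal_direction(harmful_acts: np.ndarray, harmless_acts: np.ndarray) -> np.ndarray:
    """Difference-of-means direction, unit-normalized; shape (d_model,)."""
    r = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return r / np.linalg.norm(r)

def ablate_weight(W: np.ndarray, r: np.ndarray) -> np.ndarray:
    """For a matrix W that WRITES to the residual stream (output dim d_model),
    return (I - r r^T) W, so the edited weights can no longer write along r."""
    projector = np.eye(r.shape[0]) - np.outer(r, r)   # rank-1 orthogonal projector
    return projector @ W

# Toy usage with random stand-ins for real activations and weights:
rng = np.random.default_rng(0)
r = refusal_direction(rng.normal(size=(64, 512)), rng.normal(size=(64, 512)))
W = rng.normal(size=(512, 2048))          # e.g. an MLP down-projection, d_model=512
W_edited = ablate_weight(W, r)
assert np.allclose(r @ W_edited, 0.0, atol=1e-10)   # nothing written along r anymore
```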
MoonRide retweeted
Bo Wang @BoWang87:
Prof. Donald Knuth opened his new paper with "Shock! Shock!" Claude Opus 4.6 had just solved an open problem he'd been working on for weeks — a graph decomposition conjecture from The Art of Computer Programming. He named the paper "Claude's Cycles." 31 explorations. ~1 hour. Knuth read the output, wrote the formal proof, and closed with: "It seems I'll have to revise my opinions about generative AI one of these days." The man who wrote the bible of computer science just said that. In a paper named after an AI. Paper: cs.stanford.edu/~knuth/papers/…
MoonRide retweeted
Demis Hassabis @demishassabis:
Gemini 2.5 Pro is an awesome state-of-the-art model, no.1 on LMArena by a whopping +39 ELO points, with significant improvements across the board in multimodal reasoning, coding & STEM. You can try it out now in AI Studio ai.dev & @GeminiApp with Gemini Advanced
Quoting Google DeepMind @GoogleDeepMind:
Think you know Gemini? 🤔 Think again. Meet Gemini 2.5: our most intelligent model 💡 The first release is Pro Experimental, which is state-of-the-art across many benchmarks - meaning it can handle complex problems and give more accurate responses. Try it now → goo.gle/4c2HKjf
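For reference, the +39 Elo gap mentioned above translates, under the standard Elo model (ignoring ties), to roughly a 55.6% expected head-to-head win rate:

```python
# Expected win probability implied by an Elo rating gap.
def elo_win_prob(delta: float) -> float:
    return 1.0 / (1.0 + 10.0 ** (-delta / 400.0))

print(f"{elo_win_prob(39):.3f}")   # ≈ 0.556
```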

MoonRide retweeted
siddharth ahuja @sidahuj:
🧩 Built an MCP that lets Claude talk directly to Blender. It helps you create beautiful 3D scenes using just prompts! Here’s a demo of me creating a “low-poly dragon guarding treasure” scene in just a few sentences👇
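A rough sketch of how such a bridge can be wired with the MCP Python SDK's FastMCP helper. This is not the tweet's actual implementation; the tool, the port, and the Blender-side addon assumed to execute received commands are all illustrative:

```python
# Hypothetical Claude-to-Blender MCP bridge, not the project shown above.
# Assumes a Blender addon is listening on localhost:9876 and executing the
# JSON commands it receives (with bpy available on that side).
import json
import socket

from mcp.server.fastmcp import FastMCP

mcp = FastMCP("blender-bridge")

def send_to_blender(command: dict) -> str:
    """Ship one JSON command to the assumed Blender-side socket server."""
    with socket.create_connection(("localhost", 9876)) as sock:
        sock.sendall(json.dumps(command).encode())
        return sock.recv(65536).decode()

@mcp.tool()
def run_blender_python(code: str) -> str:
    """Run a Python snippet inside Blender and return its output."""
    return send_to_blender({"type": "exec", "code": code})

if __name__ == "__main__":
    mcp.run()   # stdio transport; register this server in Claude's MCP config
```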
MoonRide @moonride303:
It beats everything else up to 405B in my tests, too (R1 was better). Temp 0.3, max tokens 2000, other settings just using llama.cpp defaults. Best part: it was running locally, as an IQ3_XS quant :D. They should work on making it think faster, though (so it can figure out correct answers using fewer tokens).
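For anyone reproducing those settings, they map directly onto llama-cpp-python; the model filename and prompt below are placeholders, and everything not mentioned in the post stays at library defaults:

```python
# Settings from the post (temp 0.3, max tokens 2000, IQ3_XS GGUF quant)
# expressed via llama-cpp-python; model path and prompt are placeholders.
from llama_cpp import Llama

llm = Llama(model_path="model-IQ3_XS.gguf")   # local quantized GGUF
out = llm(
    "Q: A farmer has 3 crates of 7 apples and sells 5 apples. How many are left? A:",
    temperature=0.3,
    max_tokens=2000,
)
print(out["choices"][0]["text"])
```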
Bindu Reddy @bindureddy:
QwQ-32B Is Indeed The World's Best Open Source Model. We re-ran QwQ with the suggested settings from the Qwen team, and it turns out that it's an AMAZING LLM. It's got excellent scores on LiveBench AI.
MoonRide @moonride303:
Hopefully we'll see some high-quality finetunes made from the -pt weights, because the original -it is just disappointing. 4B is okay for its size, but 12B and 27B feel worse than Gemma 2 9B / 27B. It might be the cost of increasing the context size, adding multi-modality, and better language support, or maybe the alignment & safety team just aligned it towards idiocy too much. Either way, "meh" vibes for me.
Eric Hartford @QuixiAI:
The ~30b parameter range has proven ideal. Yet, Meta has omitted it since Llama 2 'spicy mayo' edition, for our 'safety.' Thanks to Qwen and Yi for defeating the safety/decel/EA's and bringing the ~30b size back! And thanks @GoogleAI for the excellent Gemma 3 27b.
MoonRide @moonride303:
@appakaradi @bindureddy In my early tests it's worse than Gemma 2 27B, and nowhere near QwQ 32B. It's very censored, much like Gemma 2. I was initially hyped for this release, but now it's just "meh". 4B is okay for its size, but that's about it.
Ganesh Babu @appakaradi:
@bindureddy why are you not excited about Gemma? Is it not better than Qwen 2.5 at 27B?
Bindu Reddy @bindureddy:
I know I am supposed to be excited about Gemma, but I am not 😢 Ping me when Google open-sources Gemini 2.0. Now that would be super nice.
MoonRide @moonride303:
@ASM65617010 @OfficialLoganK About the same level as Gemma 2, in my tests - somewhat smart, but censored like hell. Maybe some finetunes will make it less annoying.
Logan Kilpatrick @OfficialLoganK:
Gemma 3 (our open weight LLM) is here and for the first time available on both Google AI Studio and the Gemini API! It is also:
- Natively multimodal
- Long context (128K tokens)
- Can run on a single H100
MoonRide @moonride303:
@UnslothAI While you're at it doing nice things for devs, could you also update your deps a bit? Forcing protobuf < 4 in 2025 doesn't look good.
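The complaint concerns a dependency pin like protobuf<4, which rejects every modern 4.x and 5.x protobuf release. A quick check with the packaging library (the exact specifier and version numbers are illustrative):

```python
# What a `protobuf<4` pin does to the resolver in 2025.
from packaging.specifiers import SpecifierSet

pin = SpecifierSet("<4")
for version in ["3.20.3", "4.25.3", "5.27.0"]:
    print(version, "allowed" if version in pin else "rejected")
# 3.20.3 allowed / 4.25.3 rejected / 5.27.0 rejected
```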
MoonRide @moonride303:
Solid release - not as good as full R1 in my own tests, but scored higher than any other open weight model up to 405B.
Quoting Qwen @Alibaba_Qwen: "Today, we release QwQ-32B, our new reasoning model with only 32 billion parameters that rivals cutting-edge reasoning models, e.g., DeepSeek-R1. …" (full announcement below)
MoonRide @moonride303:
@Alibaba_Qwen It's NOT as good as full R1 in my tests (68/100 vs 80/100), but still a very impressive model, much better than distilled R1s from DeepSeek - and also beating all the 70Bs I tested. Good job!
Qwen @Alibaba_Qwen:
Today, we release QwQ-32B, our new reasoning model with only 32 billion parameters that rivals cutting-edge reasoning models, e.g., DeepSeek-R1.

Blog: qwenlm.github.io/blog/qwq-32b
HF: huggingface.co/Qwen/QwQ-32B
ModelScope: modelscope.cn/models/Qwen/Qw…
Demo: huggingface.co/spaces/Qwen/Qw…
Qwen Chat: chat.qwen.ai

This time, we investigate recipes for scaling RL and have achieved some impressive results based on our Qwen2.5-32B. We find that RL training can continuously improve performance, especially in math and coding, and we observe that continuous scaling of RL can help a medium-size model achieve competitive performance against gigantic MoE models. Feel free to chat with our new models and provide us feedback!
MoonRide @moonride303:
@ai_for_success @Angaisb_ For all-around comprehension. It won't beat full o1 for tasks that require reasoning, but it's in a different league from any other non-reasoning model.
Elon Musk @elonmusk:
We had an ace up our sleeve @xAI. Turns out to be just enough to hold first place! Upgrades are in the works to address presentation quality/style vs the competition. That will shift ELO meaningfully higher.
Quoting Arena.ai @arena:
📰More exciting news today: @xai's latest Grok-3 tops the Arena leaderboard! 🔥 This is the newest, production model, grok-3-preview-02-24 With over 3k votes, this model is tied for #1 overall, and across Hard Prompts, Coding, Math, Creative Writing, Instruction Following, and Longer Query. Huge congratulations to @xai on this impressive milestone! 🙌

MoonRide @moonride303:
@apples_jimmy It feels much smarter than any other non-reasoning model. It scores much higher (~10%) than the next non-reasoning model on my private benchmark, too.
Jimmy Apples 🍎/acc @apples_jimmy:
I don't typically have deep conversations with AI. Not my thing, but 4.5 rewards you for what you give it. It's hard to describe. Need better evals. Very excited for what they further train on this.
Jimmy Apples 🍎/acc @apples_jimmy:
Spent a few days vibe-evaling 4.5. It's a very good, strong model, but hard to assess. I think the point when I warmed up to it was asking it for poetry from a few lesser-known painters; it shows its depth, but it's on the user to dig deep, otherwise it's shallow. Opus vibes.
MoonRide @moonride303:
@cognitivecompai It's about 10% ahead of any other non-thinking model in my tests - I would say it's a pretty big jump. People are underestimating this model.
Eric Hartford @QuixiAI:
The problem with GPT-4.5 is just that we don't have the evals to measure this kind of intelligence. It's the same reason Claude didn't dominate the leaderboard, but you knew it was smarter just from talking to it. GPT-4.5 is like that. Just talk to it. Challenge its preconceptions. See how it reacts.