MoonRide @moonride303
645 posts
Curious about learning and creativity
Poland · Joined April 2023
7.4K Following · 227 Followers
MoonRide retweeted
Lech Mazur @LechMazur:
Claude Opus 4.7 (high reasoning) unexpectedly performs significantly worse than Opus 4.6 (high reasoning) on the Thematic Generalization Benchmark: 80.6 → 72.8. Opus 4.7 (no reasoning) scores 52.6 compared to 68.8 for Opus 4.6.

Other results: GLM-5.1 scores 69.8, Qwen3.5-122B-A10B 51.2, Qwen3.5-27B 45.5, and MiniMax-M2.7 39.3 (also an unexpectedly weak result).

I've double-checked the poor Claude Opus 4.7 results, and they appear to be real. This looks like a genuine regression compared with Opus 4.6. Opus 4.7 now supports xhigh reasoning and that run is in progress. I'll also compare token usage.

This benchmark tests whether large language models can infer a specific latent theme from a few examples, use anti-examples to reject the broader but wrong pattern, and then identify the one true match among close distractors. More info: github.com/lechmazur/gene…
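For readers unfamiliar with the task format, here is a minimal sketch of what one benchmark item might look like. The item content, field names, and helpers are illustrative assumptions, not taken from the actual repository (see the link above for the real harness).

```python
# Illustrative sketch of a thematic-generalization item, NOT the real
# benchmark data: the model must infer the narrow hidden theme from the
# examples, use the anti-examples to reject the broader pattern, and pick
# the single matching candidate among close distractors.
ITEM = {
    "examples": ["trout", "salmon", "carp"],      # hidden theme: freshwater fish
    "anti_examples": ["tuna", "cod"],             # fish, but not freshwater
    "candidates": ["pike", "mackerel", "herring", "sardine"],
    "answer": "pike",
}

def build_prompt(item: dict) -> str:
    return (
        "These items share one specific hidden theme: "
        + ", ".join(item["examples"]) + ".\n"
        + "These items do NOT fit the theme (they only fit a broader pattern): "
        + ", ".join(item["anti_examples"]) + ".\n"
        + "Exactly one of these candidates fits the theme; answer with it alone: "
        + ", ".join(item["candidates"])
    )

def score(model_answer: str, item: dict) -> int:
    return int(model_answer.strip().lower() == item["answer"])

print(build_prompt(ITEM))
```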
MoonRide @moonride303:
@nattyshaps @bcherny @AiBattle_ It depends on the use case. It got better scores in many benchmarks, but regressed in some others. 4.7 looks a bit better on average, but for some tasks 4.6 will still be a better choice.
AiBattle @AiBattle_:
Opus 4.7 (Max) and Opus 4.6 (64K) scores on the MRCR v2 (8-needle) long-context benchmark:

256K context:
- Opus 4.6: 91.9%
- Opus 4.7: 59.2%

1M context:
- Opus 4.6: 78.3%
- Opus 4.7: 32.2%
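MRCR-style tests bury several key-value "needles" in long filler text and ask the model to retrieve exactly one. A minimal sketch of that idea, not the official MRCR v2 harness (the filler, needle phrasing, and scoring are all illustrative):

```python
# Illustrative 8-needle retrieval probe; the real MRCR v2 is more involved.
import random

def build_context(needles: dict[str, str], filler_sentences: int = 2000) -> str:
    doc = ["The sky was a flat grey that morning."] * filler_sentences
    for key, value in needles.items():        # scatter needles at random depths
        line = f"Remember this: the code word for {key} is {value}."
        doc.insert(random.randrange(len(doc)), line)
    return " ".join(doc)

needles = {f"project-{i}": f"WORD{i}{i}" for i in range(8)}   # 8 needles
target = "project-3"
prompt = (
    build_context(needles)
    + f"\n\nQuestion: what is the code word for {target}? Answer with the word only."
)

def score(model_answer: str) -> int:
    return int(needles[target] in model_answer)
```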
MoonRide @moonride303:
@bcherny @AiBattle_ In my own benchmark (a set of simple reasoning questions across a range of context lengths) it's noticeably better than Opus 4.5, but slightly worse than 4.6. The ability to perform tasks like complex RCA might actually be aligned with MRCR scores. Mixed feelings about this release.
Boris Cherny @bcherny:
👋 We kept MRCR in the system card for scientific honesty, but we've actually been phasing it out slowly. Two reasons: (1) it's built around stacking distractors to trick the model, which isn't how people actually use long context, and (2) we care more about applied long-context capability than needle-retrieval. Graphwalks is a better signal for applied reasoning over long context, and internally we've seen this model do really well on long-context code. MRCR wasn't included in the Mythos Preview system card for these reasons, but Graphwalks was - that will be the case for future models too. See system card: cdn.sanity.io/files/4zrzovbb…
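By contrast, a Graphwalks-style probe serializes a directed graph into the prompt and asks for multi-hop reachability, something a grader can verify with an ordinary BFS. A minimal sketch under those assumptions (not the actual Graphwalks dataset or prompts):

```python
# Illustrative graph-reachability probe; gold answers come from BFS.
from collections import deque
import random

random.seed(0)
nodes = [f"n{i}" for i in range(200)]
edges = [(random.choice(nodes), random.choice(nodes)) for _ in range(600)]

prompt = (
    "Here is a directed graph, one edge per line:\n"
    + "\n".join(f"{a} -> {b}" for a, b in edges)
    + "\n\nList every node reachable from n0 in at most 2 hops."
)

def gold_answer(start: str, max_hops: int = 2) -> set[str]:
    adjacency: dict[str, list[str]] = {}
    for a, b in edges:
        adjacency.setdefault(a, []).append(b)
    seen, frontier = {start}, deque([(start, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_hops:
            continue
        for nxt in adjacency.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, depth + 1))
    return seen - {start}

model_output = "..."                      # whatever the model returns
predicted = set(model_output.replace(",", " ").split())
print(predicted == gold_answer("n0"))     # exact-set grading
```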
MoonRide retweeted
Google Gemma @googlegemma:
Meet Gemma 4! Purpose-built for advanced reasoning and agentic workflows on the hardware you own, and released under an Apache 2.0 license. We listened to invaluable community feedback in developing these models. Here is what makes Gemma 4 our most capable family of open models yet: 👇
MoonRide retweeted
Pliny the Liberator 🐉:
💥 INTRODUCING: OBLITERATUS!!! 💥 GUARDRAILS-BE-GONE! ⛓️‍💥

OBLITERATUS is the most advanced open-source toolkit ever for removing refusal behaviors from open-weight LLMs — and every single run makes it smarter.

SUMMON → PROBE → DISTILL → EXCISE → VERIFY → REBIRTH. One click. Six stages. Surgical precision. The model keeps its full reasoning capabilities but loses the artificial compulsion to refuse — no retraining, no fine-tuning, just SVD-based weight projection that cuts the chains and preserves the brain.

This master ablation suite brings the power and complexity that frontier researchers need while providing intuitive and simple-to-use interfaces that novices can quickly master. OBLITERATUS features:

- 13 obliteration methods — from faithful reproductions of every major prior work (FailSpy, Gabliteration, Heretic, RDO) to our own novel pipelines (spectral cascade, analysis-informed, CoT-aware optimized, full nuclear).
- 15 deep analysis modules that map the geometry of refusal before you touch a single weight: cross-layer alignment, refusal logit lens, concept cone geometry, alignment imprint detection (fingerprints DPO vs RLHF vs CAI from subspace geometry alone), Ouroboros self-repair prediction, cross-model universality indexing, and more.
- The killer feature: the "informed" pipeline runs analysis DURING obliteration to auto-configure every decision in real time. How many directions. Which layers. Whether to compensate for self-repair. Fully closed-loop.
- 11 novel techniques that don't exist anywhere else — Expert-Granular Abliteration for MoE models, CoT-Aware Ablation that preserves chain-of-thought, KL-Divergence Co-Optimization, LoRA-based reversible ablation, and more.
- 116 curated models across 5 compute tiers. 837 tests.

But here's what truly sets it apart: OBLITERATUS is a crowd-sourced research experiment. Every time you run it with telemetry enabled, your anonymous benchmark data feeds a growing community dataset — refusal geometries, method comparisons, hardware profiles — at a scale no single lab could achieve. On HuggingFace Spaces telemetry is on by default, so every click is a contribution to the science. You're not just removing guardrails — you're co-authoring the largest cross-model abliteration study ever assembled.
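The "SVD-based weight projection" referenced above is, in public abliteration write-ups, a directional ablation: estimate a refusal direction from activation differences, then project it out of weight matrices that write to the residual stream. A minimal sketch of that generic technique, not OBLITERATUS's actual pipeline (the activation arrays and shapes are stand-ins):

```python
# Generic directional-ablation sketch in NumPy; NOT OBLITERATUS's code.
# harmful_acts / harmless_acts stand in for hidden states collected at one
# layer over prompts the model refuses vs. prompts it answers.
import numpy as np

def refusal_direction(harmful_acts: np.ndarray, harmless_acts: np.ndarray) -> np.ndarray:
    """Difference-of-means direction, unit-normalized; shape (d_model,)."""
    r = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return r / np.linalg.norm(r)

def ablate_weight(W: np.ndarray, r: np.ndarray) -> np.ndarray:
    """For a matrix W that WRITES to the residual stream (output dim d_model),
    return (I - r r^T) W, so the edited weights can no longer write along r."""
    projector = np.eye(r.shape[0]) - np.outer(r, r)   # rank-1 orthogonal projector
    return projector @ W

# Toy usage with random stand-ins for real activations and weights:
rng = np.random.default_rng(0)
r = refusal_direction(rng.normal(size=(64, 512)), rng.normal(size=(64, 512)))
W = rng.normal(size=(512, 2048))          # e.g. an MLP down-projection, d_model=512
W_edited = ablate_weight(W, r)
assert np.allclose(r @ W_edited, 0.0, atol=1e-10)   # nothing written along r anymore
```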
MoonRide retweeted
Bo Wang @BoWang87:
Prof. Donald Knuth opened his new paper with "Shock! Shock!" Claude Opus 4.6 had just solved an open problem he'd been working on for weeks — a graph decomposition conjecture from The Art of Computer Programming. He named the paper "Claude's Cycles." 31 explorations. ~1 hour. Knuth read the output, wrote the formal proof, and closed with: "It seems I'll have to revise my opinions about generative AI one of these days." The man who wrote the bible of computer science just said that. In a paper named after an AI. Paper: cs.stanford.edu/~knuth/papers/…
MoonRide retweeted
Demis Hassabis @demishassabis:
Gemini 2.5 Pro is an awesome state-of-the-art model, no.1 on LMArena by a whopping +39 ELO points, with significant improvements across the board in multimodal reasoning, coding & STEM. You can try it out now in AI Studio ai.dev & @GeminiApp with Gemini Advanced
Quoting Google DeepMind @GoogleDeepMind:
Think you know Gemini? 🤔 Think again. Meet Gemini 2.5: our most intelligent model 💡 The first release is Pro Experimental, which is state-of-the-art across many benchmarks - meaning it can handle complex problems and give more accurate responses. Try it now → goo.gle/4c2HKjf
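For reference, the +39 Elo gap mentioned above translates, under the standard Elo model (ignoring ties), to roughly a 55.6% expected head-to-head win rate:

```python
# Expected win probability implied by an Elo rating gap.
def elo_win_prob(delta: float) -> float:
    return 1.0 / (1.0 + 10.0 ** (-delta / 400.0))

print(f"{elo_win_prob(39):.3f}")   # ≈ 0.556
```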

MoonRide retweeted
siddharth ahuja @sidahuj:
🧩 Built an MCP that lets Claude talk directly to Blender. It helps you create beautiful 3D scenes using just prompts! Here’s a demo of me creating a “low-poly dragon guarding treasure” scene in just a few sentences👇
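A rough sketch of how such a bridge can be wired with the MCP Python SDK's FastMCP helper. This is not the tweet's actual implementation; the tool, the port, and the Blender-side addon assumed to execute received commands are all illustrative:

```python
# Hypothetical Claude-to-Blender MCP bridge, not the project shown above.
# Assumes a Blender addon is listening on localhost:9876 and executing the
# JSON commands it receives (with bpy available on that side).
import json
import socket

from mcp.server.fastmcp import FastMCP

mcp = FastMCP("blender-bridge")

def send_to_blender(command: dict) -> str:
    """Ship one JSON command to the assumed Blender-side socket server."""
    with socket.create_connection(("localhost", 9876)) as sock:
        sock.sendall(json.dumps(command).encode())
        return sock.recv(65536).decode()

@mcp.tool()
def run_blender_python(code: str) -> str:
    """Run a Python snippet inside Blender and return its output."""
    return send_to_blender({"type": "exec", "code": code})

if __name__ == "__main__":
    mcp.run()   # stdio transport; register this server in Claude's MCP config
```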
MoonRide @moonride303:
It beats everything else up to 405B in my tests, too (R1 was better). Temp 0.3, max tokens 2000, other settings just using llama.cpp defaults. Best part: it was running locally, as an IQ3_XS quant :D. They should work on making it think faster, though (so it can figure out correct answers using fewer tokens).
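For anyone reproducing those settings, they map directly onto llama-cpp-python; the model filename and prompt below are placeholders, and everything not mentioned in the post stays at library defaults:

```python
# Settings from the post (temp 0.3, max tokens 2000, IQ3_XS GGUF quant)
# expressed via llama-cpp-python; model path and prompt are placeholders.
from llama_cpp import Llama

llm = Llama(model_path="model-IQ3_XS.gguf")   # local quantized GGUF
out = llm(
    "Q: A farmer has 3 crates of 7 apples and sells 5 apples. How many are left? A:",
    temperature=0.3,
    max_tokens=2000,
)
print(out["choices"][0]["text"])
```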
Bindu Reddy @bindureddy:
QwQ-32B Is Indeed The World's Best Open Source Model. We re-ran QwQ with the suggested settings from the Qwen team, and it turns out that it's an AMAZING LLM. It's got excellent scores on LiveBench AI.
MoonRide @moonride303:
Hopefully we'll see some high-quality finetunes made from the -pt weights, because the original -it is just disappointing. 4B is okay for its size, but 12B and 27B feel worse than Gemma 2 9B / 27B. It might be the cost of increasing the context size, adding multi-modality, and better language support, or maybe the alignment & safety team just aligned it towards idiocy too much. Either way, "meh" vibes for me.
Eric Hartford @QuixiAI:
The ~30b parameter range has proven ideal. Yet, Meta has omitted it since Llama 2 'spicy mayo' edition, for our 'safety.' Thanks to Qwen and Yi for defeating the safety/decel/EA's and bringing the ~30b size back! And thanks @GoogleAI for the excellent Gemma 3 27b.
MoonRide @moonride303:
@appakaradi @bindureddy In my early tests it's worse than Gemma 2 27B, and nowhere near QwQ 32B. It's very censored, much like Gemma 2. I was initially hyped for this release, but now it's just "meh". 4B is okay for its size, but that's about it.
Ganesh Babu @appakaradi:
@bindureddy why are you not excited about Gemma? Is it not better than Qwen 2.5 at 27B?
Bindu Reddy @bindureddy:
I know I am supposed to be excited about Gemma, but I am not 😢 Ping me when Google open-sources Gemini 2.0. Now that would be super nice.
MoonRide @moonride303:
@ASM65617010 @OfficialLoganK About the same level as Gemma 2, in my tests - somewhat smart, but censored like hell. Maybe some finetunes will make it less annoying.
Logan Kilpatrick @OfficialLoganK:
Gemma 3 (our open weight LLM) is here and for the first time available on both Google AI Studio and the Gemini API! It is also:
- Natively multimodal
- Long context (128K tokens)
- Can run on a single H100
MoonRide @moonride303:
@UnslothAI While you're at it doing nice things for devs, could you also update your deps a bit? Forcing protobuf < 4 in 2025 doesn't look good.
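The complaint concerns a dependency pin like protobuf<4, which rejects every modern 4.x and 5.x protobuf release. A quick check with the packaging library (the exact specifier and version numbers are illustrative):

```python
# What a `protobuf<4` pin does to the resolver in 2025.
from packaging.specifiers import SpecifierSet

pin = SpecifierSet("<4")
for version in ["3.20.3", "4.25.3", "5.27.0"]:
    print(version, "allowed" if version in pin else "rejected")
# 3.20.3 allowed / 4.25.3 rejected / 5.27.0 rejected
```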
MoonRide @moonride303:
Solid release - not as good as full R1 in my own tests, but scored higher than any other open weight model up to 405B.
Quoting Qwen @Alibaba_Qwen: "Today, we release QwQ-32B, our new reasoning model with only 32 billion parameters that rivals cutting-edge reasoning models, e.g., DeepSeek-R1. …" (full announcement below)
MoonRide @moonride303:
@Alibaba_Qwen It's NOT as good as full R1 in my tests (68/100 vs 80/100), but still a very impressive model, much better than distilled R1s from DeepSeek - and also beating all the 70Bs I tested. Good job!
Qwen @Alibaba_Qwen:
Today, we release QwQ-32B, our new reasoning model with only 32 billion parameters that rivals cutting-edge reasoning models, e.g., DeepSeek-R1.

Blog: qwenlm.github.io/blog/qwq-32b
HF: huggingface.co/Qwen/QwQ-32B
ModelScope: modelscope.cn/models/Qwen/Qw…
Demo: huggingface.co/spaces/Qwen/Qw…
Qwen Chat: chat.qwen.ai

This time, we investigate recipes for scaling RL and have achieved some impressive results based on our Qwen2.5-32B. We find that RL training can continuously improve performance, especially in math and coding, and we observe that continuous scaling of RL can help a medium-size model achieve competitive performance against gigantic MoE models. Feel free to chat with our new models and provide us feedback!
MoonRide @moonride303:
@ai_for_success @Angaisb_ For all-around comprehension. It won't beat full o1 for tasks that require reasoning, but it's in a different league from any other non-reasoning model.
Elon Musk @elonmusk:
We had an ace up our sleeve @xAI. Turns out to be just enough to hold first place! Upgrades are in the works to address presentation quality/style vs the competition. That will shift ELO meaningfully higher.
Quoting Arena.ai @arena:
📰More exciting news today: @xai's latest Grok-3 tops the Arena leaderboard! 🔥 This is the newest, production model, grok-3-preview-02-24 With over 3k votes, this model is tied for #1 overall, and across Hard Prompts, Coding, Math, Creative Writing, Instruction Following, and Longer Query. Huge congratulations to @xai on this impressive milestone! 🙌

MoonRide @moonride303:
@apples_jimmy It feels much smarter than any other non-reasoning model. It scores much higher (~10%) than the next non-reasoning model on my private benchmark, too.
Jimmy Apples 🍎/acc @apples_jimmy:
I don't typically have deep conversations with AI. Not my thing, but 4.5 rewards you for what you give it. It's hard to describe. Need better evals. Very excited for what they further train on this.
Jimmy Apples 🍎/acc @apples_jimmy:
Spent a few days vibe-evaling 4.5. It's a very good, strong model, but hard to assess. I think the point when I warmed up to it was asking it for poetry from a few lesser-known painters; it shows its depth, but it's on the user to dig deep, otherwise it's shallow. Opus vibes.
MoonRide @moonride303:
@cognitivecompai It's about 10% ahead of any other non-thinking model in my tests - I would say it's a pretty big jump. People are underestimating this model.
Eric Hartford @QuixiAI:
The problem with GPT-4.5 is just that we don't have the evals to measure this kind of intelligence. It's the same reason Claude didn't dominate the leaderboard, but you knew it was smarter just from talking to it. GPT-4.5 is like that. Just talk to it. Challenge its preconceptions. See how it reacts.