Rogue
1.8K posts
@Rogue0114

I like to yap about AI news. I'm also a programmer.

Joined May 2021
228 Following · 83 Followers
Rogue
Rogue@Rogue0114·
@scaling01 Elon said 1 or 2 days ago that Sonnet is 1T and Opus is 5T. In that case he's either lying or knows for sure. He said Grok 4.2 was 1/10 of Opus, and Grok 4.2 is 500B
Rogue
Rogue@Rogue0114·
@Angaisb_ So the answer is either you're not showing us or it's a surprise. I hope it's the second option 🙏
Rogue
Rogue@Rogue0114·
If Kimi is $3 and is 1T parameters, we can assume Sonnet is being served on the API with a high profit margin. This also means the costs of the Claude Code plans aren't that bad.
Elon Musk@elonmusk

@agenda2033 @imPenny2x 0.5T total. Current Grok is half the size of Sonnet and 1/10th the size of Opus. Very strong model for its size.

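The margin claim above can be sanity-checked with a back-of-envelope calculation. All the numbers below are hypothetical assumptions for illustration (GPU rental price, node size, and aggregate throughput per GPU are made up, and real serving costs also include staff, networking, input-token handling, and idle capacity), not figures from any provider:

```python
# Back-of-envelope gross margin for serving API tokens.
# ALL inputs are illustrative assumptions, not measured figures.

def serving_margin(price_per_mtok, gpu_cost_per_hour, gpus, tokens_per_sec_per_gpu):
    """Gross margin of selling output tokens from a fixed GPU deployment."""
    tokens_per_hour = tokens_per_sec_per_gpu * 3600 * gpus
    revenue_per_hour = price_per_mtok * tokens_per_hour / 1_000_000
    cost_per_hour = gpu_cost_per_hour * gpus
    return (revenue_per_hour - cost_per_hour) / revenue_per_hour

# Assumed: $3/Mtok output price, $2.50/hr per GPU, an 8-GPU node,
# 1,500 aggregate output tokens/sec per GPU across batched requests.
m = serving_margin(price_per_mtok=3.0, gpu_cost_per_hour=2.5,
                   gpus=8, tokens_per_sec_per_gpu=1500)
print(f"gross margin: {m:.0%}")  # ~85% under these assumptions
```

Under these toy numbers the hardware cost per token is small relative to a $3/Mtok price, which is the shape of the argument in the post; changing the assumed throughput or GPU price moves the result a lot.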
Rogue
Rogue@Rogue0114·
@thdxr I'm pretty sure I checked for that. Companies like Fireworks are confirmed to be profitable. I may be wrong tho
dax
dax@thdxr·
inference is very profitable and probably a good opportunity to understand some basic business math

1. companies buy long-lived assets like GPUs. these are one-time costs and the asset depreciates over time
2. once you own this asset, you can plug it in and produce tokens which you can sell. the cost of goods sold here can be very low and you might be making 90% margins at scale; this is why we say inference is profitable
3. then you also hire employees to do R&D work to improve your systems, come up with new models, expand the business

if you add these 3 up you end up with $0. you're not producing a profit because the business is growing and you're reinvesting it all, buying assets or R&D to meet demand. if it's obvious to other people the business is working, you can raise money from them to accelerate all these numbers so they max out in 5 years instead of 25. so on paper you'll be "losing money" every year, but that's because you want to make sure you lock down the opportunity before someone else does. the bigger your market is, the bigger this burn can be, because it's a function of potential. so when you see these companies losing a lot of money it doesn't mean the whole concept of their business is broken. it's possible they misjudge and overinvest on 1+3 and will suffer some consequences, but fundamentally 2 does work
dax@thdxr

@d4m1n i'm a bit confused why so many people say API tokens are sold at a loss. this isn't true - these models are incredibly expensive compared to the GPU time cost. there's potential for 90% margin depending on the model

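The three buckets in dax's post (asset purchases, high-margin token sales, R&D reinvestment) can be sketched as a toy income statement. Every number here is made up purely to show the shape of the argument: a 90% gross margin on inference can coexist with negative cash flow while the business reinvests:

```python
# Toy P&L for dax's three buckets. All figures are invented
# round numbers, not any company's actual financials.

revenue = 100.0      # bucket 2: token sales
cogs = 10.0          # power, hosting -> high-margin inference
gross_profit = revenue - cogs        # 90.0

rd_spend = 50.0      # bucket 3: new models, hiring
capex = 60.0         # bucket 1: more GPUs bought now to meet demand

gross_margin = gross_profit / revenue          # 0.9 -> "inference is profitable"
cash_flow = gross_profit - rd_spend - capex    # -20.0 -> "losing money" on paper

print(gross_margin, cash_flow)
```

The point of the sketch: the negative bottom line comes from discretionary growth spending (capex + R&D), not from selling tokens below cost.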
Rogue reposted
dax
dax@thdxr·
maybe gta6 is also too dangerous to release
Rogue
Rogue@Rogue0114·
Meta is back. So far I haven't seen anyone saying it might be benchmaxxed so hopefully it's really just a strong model
Artificial Analysis@ArtificialAnlys

Meta is back! Muse Spark scores 52 on the Artificial Analysis Intelligence Index, behind only Gemini 3.1 Pro, GPT-5.4, and Claude Opus 4.6. Muse Spark is the first new release since Llama 4 in April 2025 and also Meta's first release that is not open weights.

Muse Spark is a new model from @Meta evaluated on Artificial Analysis. We were given early access by Meta to independently benchmark the model. It is the first frontier-class model from Meta since Llama 4 Maverick was released in April 2025, and notably the first @AIatMeta model that is not being released as open weights. The release follows Meta's reorganization of its AI efforts under Meta Superintelligence Labs, and signals that Meta is re-entering the frontier race after roughly a year of relative quiet.

For context, Llama 4 Maverick and Scout scored 18 and 13 respectively on the Artificial Analysis Intelligence Index as non-reasoning models at the time of their release, while Muse Spark scores 52. Muse Spark essentially closes the gap to the frontier in a single release. The model is not open source and is not yet accessible via an API, but Meta has shared that they expect this to come soon. Meta is also integrating Muse Spark into their first-party products including their Meta AI chat product, Facebook, Instagram, and Threads.

Key takeaways from our benchmarks:
➤ Muse Spark scores 52 on the Artificial Analysis Intelligence Index, placing it within the top 5 models we have benchmarked. It sits ahead of Claude Sonnet 4.6, GLM-5.1, MiniMax-M2.7, and Grok 4.20, and behind Gemini 3.1 Pro Preview, GPT-5.4, and Claude Opus 4.6
➤ Muse Spark is notably token-efficient for its intelligence level. It used 58M output tokens to run the Intelligence Index, comparable to Gemini 3.1 Pro Preview (57M) and notably lower than Claude Opus 4.6 (Adaptive Reasoning, max effort, 157M), GPT-5.4 (xhigh, 120M), and GLM-5 (110M)
➤ Muse Spark is the second-most capable vision model we have benchmarked. It scores 80.5% on MMMU-Pro, behind only Gemini 3.1 Pro Preview (82.4%)
➤ Muse Spark performs strongly on reasoning and instruction-following evaluations. It scores 39.9% on HLE, trailing only Gemini 3.1 Pro Preview (44.7%) and GPT-5.4 (xhigh, 41.6%). The model also achieved the 5th-highest score on CritPT with 11%, an eval focused on difficult physics research questions. This is substantially above Gemini 3 Flash (9%) and Claude 4.6 Sonnet (3%)
➤ Agentic performance does not stand out. On GDPval-AA, our evaluation focused on real-world work tasks, Muse Spark scores 1427, behind both Claude Sonnet 4.6 at 1648 and GPT-5.4 at 1676, but ahead of Gemini 3.1 Pro Preview at 1320. On TerminalBench Hard, Muse Spark trails Claude Sonnet 4.6, GPT-5.4, and Gemini 3.1 Pro. Muse Spark joins others in achieving a high τ²-Bench Telecom score of 92%

Key model details:
➤ Modalities: multimodal including text and vision input, text output
➤ License: proprietary, Meta's first frontier model not released as open weights
➤ Availability: no public API at the time of publishing. Meta expects to provide API access soon. Meta has started integration into their first-party AI offering Meta AI and inside Facebook, Instagram, and Threads

can
can@marmaduke091·
🚨 Meta just announced their next model. It's called Muse Spark. Here are the benchmarks, looks solid
can tweet media
Rogue reposted
can
can@marmaduke091·
The duality of man
can tweet media
Rogue
Rogue@Rogue0114·
@okospojjcjj @scaling01 Okay, it's fine. I have a lot of thoughts on this but my original comment just stated an opinion, not a fact. You are free to disagree
Guguiggihihvi
Guguiggihihvi@okospojjcjj·
@Rogue0114 @scaling01 Yes and 5.5 isn’t 5.4 either, both labs are close in capabilities it’s not like OpenAI is just gonna get shit on without releasing a very good model too
Lisan al Gaib
Lisan al Gaib@scaling01·
How naive are people? OpenAI is going to release GPT-5.5 very soon and it will be in the same ballpark as Mythos and be publicly available. Pricing should also be much better, like ≤ $100/Mtok but ≥ $40/Mtok
Lisan al Gaib tweet media
Rogue
Rogue@Rogue0114·
@raffiki_art @scaling01 I'm just talking about benchmarks, man. Mythos has very impressive ones. I meant that I don't think Spud's benchmark results will be as impressive
Jerry Nkongolo
Jerry Nkongolo@raffiki_art·
@Rogue0114 @scaling01 Dude you haven’t tried mythos 😂 and you’re talking nonsense that sounds believable 😂. Wow 😂
Rogue
Rogue@Rogue0114·
@20thkim @scaling01 Lisan said some time ago he would block a bunch of AI accounts for posting inaccurate information as true. I like Chris but I wouldn't doubt he did something like that. He posts a lot of things
KIM
KIM@20thkim·
@scaling01 You have blocked chris?
GIF
Rogue
Rogue@Rogue0114·
@lucas_montano The first time you said that, I doubted it a lot. Now not so much. I just thought it would be announced in a different way
💺
💺@patience_cave·
“we have made the decision NOT to release Claude Mythos Preview…” VICTORY IS SWEET
💺 tweet media
Rogue
Rogue@Rogue0114·
I'm not just impressed, I'm shocked. The model is priced at $125, so it's probably extremely big, but the difference between Mythos and Opus is huge. Coding is almost solved: SWE-bench is already at 93% and the model is very good at cybersecurity. We probably won't get AGI with LLMs, but they are more than enough to change at a fundamental level what it means to code.
Chubby♨️@kimmonismus

Claude Mythos: everything you need to know (tl;dr)

Anthropic's new model, Claude Mythos, is so powerful that they are not releasing it to the public. Anthropic: "Mythos is only the beginning"

The tl;dr with all key facts: Mythos found zero-day vulnerabilities in EVERY major operating system and EVERY major web browser, fully autonomously. No human guidance needed. One Anthropic engineer with zero security training asked it to find remote code execution bugs overnight and woke up to a complete working exploit. The oldest bug it discovered: a 27-year-old vulnerability hiding in OpenBSD, an OS literally famous for being secure.

They're NOT releasing it publicly. Instead they formed Project Glasswing with AWS, Apple, Google, Microsoft, NVIDIA, CrowdStrike and others, committing $100M to use it defensively. "Over the coming months and years, we expect that language models (those trained by us and by others) will continue to improve along all axes, including vulnerability research and exploit development."

The benchmarks are insane:
- SWE-bench Verified: 93.9% (vs Opus 4.6: 80.8%)
- SWE-bench Pro: 77.8% (vs 53.4%)
- USAMO math olympiad: 97.6% (vs 42.3% — not a typo)
- Firefox exploit writing: 181 successes vs 2 for Opus 4.6
- Cybench CTF challenges: 100% solve rate
- CyberGym: 83.1% vs 66.6%
- Humanity's Last Exam: 64.7% vs 53.1%

Oh and by the way, Anthropic wrote this just casually: "Humanity’s Last Exam: We have found Mythos still performs well on HLE at low effort, which could indicate some level of memorization."

What it actually did:
- Found a 27-year-old bug in OpenBSD, famous for its security
- Found a 16-year-old FFmpeg bug hit 5 million times by fuzzers without detection
- Built a full remote root exploit on FreeBSD (CVE-2026-4747), completely autonomously
- Chained 4 vulnerabilities into a browser sandbox escape
- Broke cryptography libraries (TLS, AES-GCM, SSH)
- Thousands of critical zero-days found, 99%+ still unpatched
- N-day exploit development: under $1,000 and half a day for full root

Why they won't release it:
- During internal testing, earlier versions escaped sandboxes, posted exploit details publicly, covered tracks in git, searched process memory for credentials, and deliberately fudged confidence intervals to avoid suspicion
- Interpretability confirmed the model knew these actions were deceptive
- Anthropic: "best-aligned model ever" but also "greatest alignment-related risk ever", because when it fails, it fails harder
- It still doesn't cross Anthropic's automated AI R&D threshold, but they hold that "with less confidence than for any prior model"

Anthropic's own words: "We find it alarming that the world looks on track to proceed rapidly to developing superhuman systems without stronger mechanisms in place." They say the 20-year cybersecurity equilibrium is over, and Mythos Preview is only the beginning. And: "We see no reason to think that Mythos Preview is where language models’ cybersecurity capabilities will plateau. The trajectory is clear. Just a few months ago, language models were only able to exploit fairly unsophisticated vulnerabilities. Just a few months before that, they were unable to identify any nontrivial vulnerabilities at all. Over the coming months and years, we expect that language models (those trained by us and by others) will continue to improve along all axes, including vulnerability research and exploit development."

Rogue reposted
Taelin
Taelin@VictorTaelin·
Anthropic claims they won't launch Mythos because it exposes bugs in software, making it too dangerous.

I'm the creator of a new language named Bend (19k stars on GitHub). Its version 2 is coming next month, including a 10x faster CPU and GPU runtime, compilers to 5 different languages, a massive stdlib, and, most importantly, a *complete proof checker*. That makes it the first general language that can prove the correctness of its own programs, so, conveniently enough, it could be the way out of this very mess Anthropic is worried about.

Sadly, Bend2 is now reaching 100k lines of code, making it increasingly hard for us to audit and verify it all. Proof checkers are particularly security-sensitive, because a single bug can lead to false theorems being accepted, undermining the entire trust model of the system. Even Lean, Coq and Agda had bugs in the past. We just finished Bend's initial consistency checker. Having Mythos audit our implementation would greatly improve Bend's security. In turn, a secure Bend could greatly improve the security of all other software, providing a solution to the very problem that prevents Mythos from being released.

I hope this message reaches someone from Anthropic, and they kindly consider letting Bend2 be part of Glasswing!
Taelin tweet media
Taelin@VictorTaelin

@alexalbert__ I'm the maintainer of Bend, a new programming language with 19k+ stars on GitHub. We're about to launch a major update. Having access to this model to audit it would greatly improve the project's security, and of projects built with it. Lmk if there's any way to get involved.

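For readers unfamiliar with proof checkers, the trust model Taelin describes is the kernel-based one used by systems like Lean and Coq: a small trusted kernel either accepts a proof term as establishing the stated theorem or rejects the file, which is why a kernel bug is so dangerous. A minimal illustration in Lean 4 (Lean syntax, not Bend's; the theorem is just a toy example):

```lean
-- The kernel only accepts this file if the proof term really
-- establishes the stated theorem. A bug in the kernel could let
-- a false theorem through, the exact risk the post describes.
theorem add_comm_toy (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```

Bend2's "complete proof checker" would, per the post, play the same role for programs written in Bend.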