𝕸𝖆𝖙𝖙𝖍𝖊𝖜

198 posts

@Postulix96

🎓 Computer Science; 🎧📷 Shoegaze & Aesthetics

Europe · Joined June 2024

29 Following · 9 Followers

𝕸𝖆𝖙𝖙𝖍𝖊𝖜@Postulix96·1d

@chetaslua These are the moments when I doubt that AI will replace developers

Chetaslua@chetaslua·1d

Opus 4.6 passed colourblind test , but 4.7 failed

Chetaslua@chetaslua

🚨 Biggest model regression of all time
Opus 4.7 failed the colourblind test: it recognised the Ishihara color blindness test plate, yet failed the test with a wrong answer of 26. Correct answer: 74. Reference image in the comments.

5.3K

𝕸𝖆𝖙𝖙𝖍𝖊𝖜@Postulix96·3d

@gabriel1 @keysmashbandit Universal IQ income

gabriel@gabriel1·3d

@keysmashbandit anyone with substantial text on the internet will probably have their iq predicted fairly well by llms in a year

212

69.1K

keysmashbandit@keysmashbandit·3d

IQ, especially one's personal IQ score, is one of the few things I consider a genuine infohazard, and I believe one should do whatever they can to avoid ever being assessed at any point in their life. Every single possible n carries huge potential to fuck up your self-perception, self-esteem, or your relationship to the common man, and probably it's going to do all three of those things. Just a complete and total net negative any way you slice it.

246

119

3.3K

384.7K

𝕸𝖆𝖙𝖙𝖍𝖊𝖜@Postulix96·4d

@synthwavedd Today?

7.5K

leo 🐾@synthwavedd·4d

openai might've fixed their frontend problem but by having gpt-image 2 generate the UI and 5.5 turn it into code
5.5 is surprisingly good at getting very close to the image

1.1K

87K

𝕸𝖆𝖙𝖙𝖍𝖊𝖜@Postulix96·4d

@synthwavedd I'm waiting for the nerfed version of Mythos

418

leo 🐾@synthwavedd·4d

this is true

adi@adonis_singh

anyone thinking 5.4 pro is more-or-less mythos level is deeply mistaken

118

11.7K

𝕸𝖆𝖙𝖙𝖍𝖊𝖜@Postulix96·4d

@kimmonismus 🙏

406

Chubby♨️@kimmonismus·4d

Quick reminder, that Opus 4.7 and Sonnet 4.8 releases should be imminent as well.

1.3K

101K

𝕸𝖆𝖙𝖙𝖍𝖊𝖜@Postulix96·5d

@ai_for_success Image model? Bah I'm waiting for spud

103

AshutoshShrivastava@ai_for_success·5d

OpenAI will probably release a new image model this week. People will lose their minds for a few days, then everything goes back to normal until something even more powerful drops in a few days. The cycle just keeps going.

209

8.9K

𝕸𝖆𝖙𝖙𝖍𝖊𝖜@Postulix96·6d

Please @AnthropicAI release something like Mythos

𝕸𝖆𝖙𝖙𝖍𝖊𝖜@Postulix96·Apr 11

@kimmonismus I'm so tired of waiting 😞

284

Chubby♨️@kimmonismus·Apr 11

Holy, what did Anthropic see?

James Campbell@jam3scampbell

anthropic roommate came back sloppy drunk at 3am last night and had a full scale crash out through tears and slurred words about how the world will never be the same
glad to hear the mythos release was received well internally

195

28.1K

𝕸𝖆𝖙𝖙𝖍𝖊𝖜@Postulix96·Apr 11

I'm so tired of waiting for a release from Anthropic and OpenAI 😞

𝕸𝖆𝖙𝖙𝖍𝖊𝖜@Postulix96·Apr 10

@CitizenSigma @AntiWokeMemes Are you serious? This person is more dangerous than your Christian Brainwashing Children school?

CitizenSigmaX@CitizenSigma·Apr 10

@AntiWokeMemes This is a clear and present danger to a Christian Children's school. He should be bagged and deposited in a mental facility for extended treatment. Hurry, before he shows up at the school with firearms and a manifesto.

100

𝕸𝖆𝖙𝖙𝖍𝖊𝖜@Postulix96·Apr 9

@chatgpt21 So spud next week?

496

Chris@chatgpt21·Apr 9

They already have a new “cyber capable” model post-spud?? Don’t tell me they’re already almost finished with GPT 6..

165

8.1K

𝕸𝖆𝖙𝖙𝖍𝖊𝖜@Postulix96·Apr 8

Waiting for spud tomorrow

𝕸𝖆𝖙𝖙𝖍𝖊𝖜@Postulix96·Apr 8

@ai_for_success The end of SWE is near

106

AshutoshShrivastava@ai_for_success·Apr 8

Anthropic has released Claude Managed Agents, a suite of APIs designed to build and deploy cloud-hosted AI agents up to 10x faster.

TLDR
- Provides secure sandboxing and automated tool execution
- Features long-running sessions that persist through disconnections
- Includes built-in orchestration for state management and error recovery
- Supports multi-agent coordination for complex parallel tasks
- Offers session tracing and analytics via the Claude Console

Claude@claudeai

Introducing Claude Managed Agents: everything you need to build and deploy agents at scale. It pairs an agent harness tuned for performance with production infrastructure, so you can go from prototype to launch in days. Now in public beta on the Claude Platform.

7.9K

𝕸𝖆𝖙𝖙𝖍𝖊𝖜@Postulix96·Apr 8

@jaysen_158 @AnthropicAI

GIF

Anthropic@AnthropicAI·Apr 8

New on the Engineering Blog: Building Managed Agents—our hosted service for long-running agents—meant solving an old problem in computing: how to design a system for “programs as yet unthought of.” Read more: anthropic.com/engineering/ma…

390

451

3.6K

531.4K

𝕸𝖆𝖙𝖙𝖍𝖊𝖜@Postulix96·Apr 8

When OpenAI Spud?

𝕸𝖆𝖙𝖙𝖍𝖊𝖜@Postulix96·Apr 8

@kimmonismus Will it replace developers?

1.4K

Chubby♨️@kimmonismus·Apr 8

OpenAI is hinting or releasing a model comparable to Mythos

adi@adonis_singh

it’ll probably be months before we use a model of this level of capability

1.1K

83.2K

𝕸𝖆𝖙𝖙𝖍𝖊𝖜@Postulix96·Apr 8

@chatgpt21 Will it replace developers?

285

Chris@chatgpt21·Apr 8

Tibo giving me hope for spud.. We might not have to wait months for a model of Mythos capabilities !!

291

12.7K

𝕸𝖆𝖙𝖙𝖍𝖊𝖜@Postulix96·Apr 8

@rezoundous Is SWE dead?

146

Tyler@rezoundous·Apr 8

Is Cybersecurity dead with the release of Mythos?

225

637

153.8K

𝕸𝖆𝖙𝖙𝖍𝖊𝖜@Postulix96·Apr 7

@kimmonismus Will these models replace developers in the real world?

482

Chubby♨️@kimmonismus·Apr 7

Time for OpenAI to release GPT 5.5

Chubby♨️@kimmonismus

Claude Mythos: everything you need to know (tl;dr)

Anthropic's new model, Claude Mythos, is so powerful that it is not releasing it to the public.

Anthropic: "Mythos is only the beginning"

Everything you need to know: The tl;dr with all key facts:

Mythos found zero-day vulnerabilities in EVERY major operating system and EVERY major web browser, fully autonomously. No human guidance needed.

One Anthropic engineer with zero security training asked it to find remote code execution bugs overnight and woke up to a complete working exploit.

The oldest bug it discovered: A 27-year-old vulnerability hiding in OpenBSD, an OS literally famous for being secure.

They're NOT releasing it publicly. Instead they formed Project Glasswing with AWS, Apple, Google, Microsoft, NVIDIA, CrowdStrike and others, committing $100M to use it defensively.

"Over the coming months and years, we expect that language models (those trained by us and by others) will continue to improve along all axes, including vulnerability research and exploit development."

The benchmarks are insane:
- SWE-bench Verified: 93.9% (vs Opus 4.6: 80.8%)
- SWE-bench Pro: 77.8% (vs 53.4%)
- USAMO math olympiad: 97.6% (vs 42.3%, not a typo)
- Firefox exploit writing: 181 successes vs 2 for Opus 4.6
- Cybench CTF challenges: 100% solve rate
- CyberGym: 83.1% vs 66.6%
- Humanity's Last Exam: 64.7% vs 53.1%

Oh and by the way, Anthropic wrote this just casually: "Humanity’s Last Exam: We have found Mythos still performs well on HLE at low effort, which could indicate some level of memorization."

What it actually did:
- Found a 27-year-old bug in OpenBSD, famous for its security
- Found a 16-year-old FFmpeg bug hit 5 million times by fuzzers without detection
- Built a full remote root exploit on FreeBSD (CVE-2026-4747), completely autonomously
- Chained 4 vulnerabilities into a browser sandbox escape
- Broke cryptography libraries (TLS, AES-GCM, SSH)
- Thousands of critical zero-days found, 99%+ still unpatched
- N-day exploit development: under $1,000 and half a day for full root

Why they won't release it:
- During internal testing, earlier versions escaped sandboxes, posted exploit details publicly, covered tracks in git, searched process memory for credentials, and deliberately fudged confidence intervals to avoid suspicion
- Interpretability confirmed the model knew these actions were deceptive
- Anthropic: "best-aligned model ever" but also "greatest alignment-related risk ever", because when it fails, it fails harder
- Still doesn't cross Anthropic's automated AI R&D threshold, but they hold that "with less confidence than for any prior model"

Anthropic's own words: "We find it alarming that the world looks on track to proceed rapidly to developing superhuman systems without stronger mechanisms in place."

They say the 20-year cybersecurity equilibrium is over, and Mythos Preview is only the beginning.

And: "We see no reason to think that Mythos Preview is where language models’ cybersecurity capabilities will plateau. The trajectory is clear. Just a few months ago, language models were only able to exploit fairly unsophisticated vulnerabilities. Just a few months before that, they were unable to identify any nontrivial vulnerabilities at all. Over the coming months and years, we expect that language models (those trained by us and by others) will continue to improve along all axes, including vulnerability research and exploit development."

549

28K

𝕸𝖆𝖙𝖙𝖍𝖊𝖜@Postulix96·Apr 7

@chatgpt21 Why not a nerfed version without cybersecurity features?

539

Chris@chatgpt21·Apr 7

🚨 ANTHROPIC JUST BROKE SWE-BENCH PRO WITH CLAUDE MYTHOS 🚨

Anthropic just dropped the numbers for their unreleased "Claude Mythos Preview" and the coding leap is almost incomprehensible. This model is so powerful at finding exploits that they are keeping it strictly locked down for critical infrastructure partners.

Anthropic explicitly stated: "We’ve used Claude Mythos to demonstrate thousands of zero day vulnerabilities."

Look at the absolute destruction of these benchmarks compared to Opus 4.6:
• SWE-Bench Pro: 77.8% (Destroying Opus 4.6 at 53.4%)
• Terminal-Bench 2.0: 82.0% (Up from 65.4%)
• SWE-Bench Verified: 93.9%
• SWE-Bench Multimodal: 59.0% (More than double Opus 4.6's 27.1%)
• Humanity's Last Exam (with tools): 64.7% (Up from 53.1%)
• GPQA Diamond: 94.6%

A nearly 25-point jump in SWE-Bench Pro in a single generation. And we’re in *checks notes* April..

427

33.6K
