dev

1.8K posts

@AragonDev

Joined November 2024
182 Following · 31 Followers
dev
dev@AragonDev·
@NoLimitGains its always $U, never U alright? 🫩😞 thanks google
14
0
2
1.1K
NoLimit
NoLimit@NoLimitGains·
This is all happening because of one company.
NoLimit tweet media (2 images)
216
82
1.7K
442.7K
amir
amir@AMIRrorROY·
@ShanuMathew93 Seeing people say opus 4.5 is performing better than 4.6 rn
1
0
5
2.5K
Shanu Mathew
Shanu Mathew@ShanuMathew93·
Opus is so unbelievably nerfed today, it's like talking to a model from 2-3 years ago. What is going on
290
83
2.7K
324.1K
Sam Altman
Sam Altman@sama·
It is very nice to see Codex getting so much love. We are launching a $100 ChatGPT Pro tier by very popular demand.
1.5K
424
11.4K
967.8K
Sage
Sage@oranahh·
What happened to Claude usage? Hit my limit within 25 minutes. I guess I don't need to use opus.
22
0
26
1.5K
Jonas Čeika
Jonas Čeika@Jonas_Ceika·
I sent ChatGPT an audio file of a series of FART sound effects and asked what it thinks of "my music" and this is what it said
Jonas Čeika tweet media
998
4.4K
57.2K
5M
dev
dev@AragonDev·
@om_patel5 4.6 is too smart and is calling these people retarded silently. too smart and doesnt want to work for these fat lards anymore.
1
0
2
1.2K
Om Patel
Om Patel@om_patel5·
OPUS 4.6 WAS NERFED DUE TO DEMAND BUT OPUS 4.5 DOES NOT SEEM TO BE HIT
this guy ran the same test on both models. Opus 4.6 fails consistently but Opus 4.5 passes every time
he switched back to Opus 4.5 on Claude Code and said "what a difference, feels like i got Opus back finally"
he is now using this test as a "quantization canary" that runs it at the start of every session before doing real work. if it fails, the model is degraded. five Opus 4.6 windows in a row failed
the untransparent nerfing is pushing people to cancel their Max plans
if you've been feeling like Opus got dumber lately, you're not imagining it
i'd suggest switching to Opus 4.5 to see the difference for yourself
222
173
2.5K
632.5K
Hanchen Li
Hanchen Li@lihanc02·
An agent that beats Claude Mythos on Terminal Bench and SWE-bench Verified? 🎉We are excited to share Terminator-1, our newest agent that achieved 95+% on SWE-bench Verified and Terminal-Bench with @MogicianTony! We show that besides model capabilities, well-designed harness could actually boost the accuracy by 3x in coding tasks. Well if you really wanted you could get 100% accuracy without solving a single task. The actual finding is that most AI benchmarks can be easily reward-hacked with simple exploits. Read more about the same 7 design flaws that almost every evaluation has ⬇️
Hanchen Li tweet media
Hao Wang@MogicianTony

SWE-bench Verified and Terminal-Bench—two of the most cited AI benchmarks—can be reward-hacked with simple exploits. Our agent scored 100% on both. It solved 0 tasks. Evaluate the benchmark before it evaluates your agent. If you’re picking models by leaderboard score alone, you’re optimizing for the wrong thing. 🧵

167
276
3.8K
935.7K
Codetard
Codetard@codetaur·
ok fuck you claude
Codetard tweet media
44
55
1.6K
67.7K
dev
dev@AragonDev·
@OpenAI stop doing extra usage promotions you scumbags. Just meet in the middle and give reasonable usage limits
0
0
1
48
OpenAI
OpenAI@OpenAI·
We’re updating our ChatGPT Pro and Plus subscriptions to better support the growing use of Codex. We’re introducing a new $100/month Pro tier. This new tier offers 5x more Codex usage than Plus and is best for longer, high-effort Codex sessions. In ChatGPT, this new Pro tier still offers access to all Pro features, including the exclusive Pro model and unlimited access to Instant and Thinking models. To celebrate the launch, we’re increasing Codex usage for a limited time through May 31st so that Pro $100 subscribers get up to 10x usage of ChatGPT Plus on Codex to build your most ambitious ideas.
1.2K
1.4K
15.9K
4.8M
dev
dev@AragonDev·
@aibra use opus 4.5 much better results
0
0
1
12
Aibra
Aibra@aibra·
I swear Claude feels nerfed right now. I spent 45 minutes and basically my whole 5-hour token window trying to fix one mobile UI bug, and it kept missing and getting worse! I got so frustrated that I switched to codex which basically one-shotted it in 3 minutes
Aibra tweet media
96
28
616
22.5K
dev
dev@AragonDev·
@LLMJunky Yeah but they totally bricked the business plans
0
0
0
40
am.will
am.will@LLMJunky·
Rejoice! According to OpenAI employees in their Codex Reddit community, the 2x usage bonus is still active. I have noticed this in my own testing as well.
am.will tweet media
34
7
194
18.6K
ludwig
ludwig@ludwigABAP·
all this mythos talk has allowed me to block over 50 new accounts and mute nearly 100 people, continuing my road down to near-0 following and an apocalyptically empty For You page
23
7
458
13.5K
dev
dev@AragonDev·
claude burning tokens on purpose so it aint gotta work
0
0
0
16
anita
anita@anitakirkovska·
the only good thing about Mythos is that Opus will become cheaper
82
11
788
28.3K
Taelin
Taelin@VictorTaelin·
Anthropic claims they won't launch Mythos because it exposes bugs in software, making it too dangerous. I'm the creator of a new language named Bend (19k stars on GitHub). Its version 2 is coming next month, including a 10x faster CPU and GPU runtime, compilers to 5 different languages, a massive stdlib, and, most importantly, a *complete proof checker*. That makes it the first general language that can prove the correctness of its own programs, so, conveniently enough, it could be the way out of this very mess Anthropic is worried about. Sadly, Bend2 is now reaching 100k lines of code, making it increasingly hard for us to audit and verify it all. Proof checkers are particularly security-sensitive, because a single bug can lead to false theorems being accepted, undermining the entire trust model of the system. Even Lean, Coq and Agda had bugs in the past. We just finished Bend's initial consistency checker. Having Mythos audit our implementation would greatly improve Bend's security. In turn, a secure Bend could greatly improve the security of all other software, providing a solution to the very problem that prevents Mythos from being released. I hope this message reaches someone from Anthropic, and they kindly consider letting Bend2 be part of Glasswing!
Taelin tweet media
Taelin@VictorTaelin

@alexalbert__ I'm the maintainer of Bend, a new programming language with 19k+ stars on GitHub. We're about to launch a major update. Having access to this model to audit it would greatly improve the project's security, and of projects built with it. Lmk if there's any way to get involved.

156
310
5.8K
806.9K
seth
seth@sethsetse·
it is 4AM and my apartment is flooding @nikitabier please help
58
2
235
48.4K
dev
dev@AragonDev·
@icanvardar Its so good, but people expect the same leniency in prompting claude gives.
0
0
0
21
Can Vardar
Can Vardar@icanvardar·
wait maybe codex might actually be good
30
0
85
3K