Prathmesh Pandey

296 posts

@file_mutex

Building Next-gen Coding Platform | Programmer | Xoogler

Mountain View, CA · Joined July 2017
351 Following · 64 Followers
David Cramer @zeeg
codex writes the most disgusting code. idk who's responsible for pre-training over there, but you gotta flip the script
87
14
715
111.7K
Kirill Skrygan @kskrygan
Real assessment of Fable5 from engineers around me:
- somewhat better than Opus on OSS repos
- about the same on closed-source repos
- much more expensive
So for real orgs, the value prop is pretty vague.
But sure, keep believing it was so powerful the government had to ban it
15
5
105
11.7K
Prathmesh Pandey @file_mutex
@mattshumer_ That's the problem with unhardened harnesses if one needs to wait for better models.
0
0
1
157
Prathmesh Pandey @file_mutex
@zeeg hmm there are perhaps five people out there in the world who can beat _the_ sota coding harness. no one will be able to read and understand that much code. dig in when required? sure. but know it in-n-out? most probably not.
13
0
2
3.1K
David Cramer @zeeg
@file_mutex people who don't read the code are not serious people and it takes a serious person to ship production software
22
39
488
37.5K
David Crawshaw @davidcrawshaw
Current status: data analysis and code analysis (and both combined!) with Fable. It appears unmatched at extracting insight from a mountain of code and logs. Then I take the last stanza it creates and hand it to another model for implementation.
3
0
28
2.3K
Prathmesh Pandey @file_mutex
My MCP is saving roughly 50% on blended token consumption in codex and claude. That doesn't mean codex and claude can't build something similar, but as server owners their philosophy will be rooted in "seeing the maximum queries coming through".
Dan Robinson @danrobinson

If you’re proud of your really sophisticated skill or harness, try benchmarking it against a simple one-sentence prompt as a sanity check. Codex, Claude Code, and ChatGPT Pro are really, really good

0
0
0
251
Prathmesh Pandey @file_mutex
@AndrewCurran_ @bayeslord It def has big model smell. I have my reviewers on GPT 5.5 xhigh, and Opus 4.8 used to stumble through 10 revisions before getting past the reviewers. Fable takes 2-3 revisions.
0
0
1
231
Andrew Curran @AndrewCurran_
@bayeslord I've been trying to tell people. And Fable isn't the new Mythos. And the new Mythos isn't what they have internally.
3
0
94
4.4K
bayes @bayeslord
Fable is in fact Built Different
3
1
78
6.4K
Jeffrey Emanuel @doodlestein
@__paleologo This seems to really vary a lot. I’ve been surprised by how much mileage I’ve gotten so far with Fable across a variety of tasks. Granted, I have 22 Max accounts, but it’s not like I’ve blasted through all of them already either. I’m asking them to do really hard stuff, though.
6
0
25
5.1K
Gappy (Giuseppe Paleologo)
Clearly, Fable is doing a lot of work, and unleashing a ton of agents. To review a short technical note, it released 31 agents, coded simulations to verify my results, did "adversarial reviews". Eventually, it only made the assumptions slightly more rigorous. It is all good. For a four-page technical note+a little code, though, it consumed all my Pro session tokens, *plus* $17 worth of credits. It is ridiculously expensive. I have 20-page reports that are way more complex than this. I can see how Anthropic has entered the phase of market-clearing prices, yield management, and pre-IPO. I recall Boris Cherny saying in a podcast, "run Opus [4.6], not Sonnet. It's worth it". I feel comfortable saying that running your top-shelf model is *not* worth it anymore. Decreasing returns, on most tasks. Like in the real world, some people can be real smart, but real expensive.
58
58
1.1K
162.9K
Claude @claudeai
Introducing Claude Fable 5: a Mythos-class model that we’ve made safe for general use. Its capabilities exceed those of any model we’ve ever made generally available.
5K
14.5K
104.6K
55.6M
finbarr @finbarrtimbers
As my entire feed is criticizing Anthropic, I think that the team there genuinely believes what they’re saying. It’s not a marketing/anticompetitive tactic. They genuinely believe these models are dangerous and that AI research should be slowed down.
101
12
406
94.5K
Prathmesh Pandey @file_mutex
@charliermarsh both will eventually end up measuring the same thing; as the latter tends to 100%, the former will tend to zero.
0
0
0
144
Charlie Marsh @charliermarsh
I think "percent of code read/reviewed by a human" is perhaps a more interesting metric than "percent of code authored by an agent"
16
6
161
7.7K
Prathmesh Pandey @file_mutex
In short, Anthropic is asking for an IOCs-like distribution mechanism controlled by the US govt. What tech exists and who is allowed to share what with whom needs to be essentially controlled.
Andrew Curran @AndrewCurran_

Anthropic says Recursive Self Improvement is approaching faster than they expected. Quoting from the blog: 'What should we do? If it were possible to effectively slow the development of this technology to give ourselves more time to deal with its immense implications, we think that would likely be a good thing. But if a slowdown simply lets the least cautious actors catch up technologically, it could leave everyone less safe. Without a global coordination mechanism, companies and governments will have to make difficult decisions about safety while under competitive and geopolitical pressures. We believe it would be good for the world to have the option to slow or temporarily pause frontier AI development to enable societal structures and alignment research to keep up with the advance of the technology. The Anthropic Institute will conduct research—in collaboration with many others—and take actions to help build the systems that a credible slowdown or pause would require. These systems would enable frontier AI developers to verify that others globally have actually stopped or slowed, and that a bad actor could not use the auspices of a coordinated slowdown to jump ahead in secret. If such systems existed, we expect that we would slow down or temporarily pause, if other developers at or near the frontier also did so in a verifiable manner. A meaningful slowdown or pause would require multiple well-resourced labs at or near the frontier, in multiple countries, agreeing to stop under the same conditions. It would also require that each can verify that the others have actually stopped. Due to the unique characteristics of AI systems, the detectability (a lower standard than verifiability) element of this arms control problem is much more challenging than with other technologies. 
Training runs are far easier to conceal than missile silos, their inputs are general-purpose, and the incentive to defect quietly is enormous, because whoever continues while others pause could inherit the lead. A credible pause also has to specify what triggers it, what lifts it, and who adjudicates. None of this is necessarily impossible in principle—the world has built verification regimes for other complex technologies (e.g., the Intermediate-Range Nuclear Forces Treaty)—but those regimes took decades to build both the infrastructure and the trust. We don’t have that long. A unilateral pause by one lab, by contrast, is achievable immediately, but accomplishes much less: it would change who the front-runner is, but it would not create the wider deliberative process that is currently missing. In the coming months, we will organize conversations where policymakers, researchers, civil society, and other AI companies can help answer some of the questions this piece raises, especially around full recursive self-improvement and how to create better options for coordination and deliberation. We’ll publish what comes out of it. The window to investigate the questions together is here, and people outside AI companies should be involved in this deliberation.'

0
0
1
180