LLM Stats

348 posts


@LlmStats

Independent AI evaluations lab.

Joined February 2025
85 Following · 911 Followers
LLM Stats@LlmStats·
Claude Opus 4.7 is out, here's what you need to know:

→ 1M context window with new dense decoder architecture
Pricing stays locked at $5 per million input and $25 per million output tokens. Prompt caching can cut overhead on repetitive enterprise tasks by up to 90 percent, getting you frontier performance at the same rates as before.

→ Granular reasoning controls
A new "xhigh" effort level sits between high and max, and the model dynamically adjusts its thinking time based on the complexity of your prompt. Simple lookups stay fast.

→ Upgraded vision capabilities
Visual inputs now go up to 2576 pixels per long edge, roughly 3.75 megapixels. Spatial alignment maps model coordinates directly to actual pixels, which makes computer use and UI extraction highly precise.

→ Low effort matches Opus 4.6 at medium effort, saving tokens
Opus 4.7 is more token-efficient across the board: at low effort it matches the quality of Opus 4.6 at medium effort, so you get the same results for fewer tokens. Anthropic's internal coding evaluation shows improved token usage across all effort levels. You can further tune spend via the effort parameter, task budgets, or conciseness prompting.

→ Hits 80.8% on SWE-bench Verified and cuts tool errors by 67%
The frontier model landscape has shifted again. Opus 4.7 leads coding with 80.8 percent on SWE-bench Verified, edging out Gemini 3.1 Pro at 80.6 percent and far exceeding GPT 4.1 at 54.6 percent. OpenAI still leads in general computer use, but Claude owns pure coding.

→ Better autonomy on long-running tasks
Autonomous loops run away easily. Anthropic addresses this with task budgets: you set a rough token target for a full agentic loop, and the model watches a running countdown and wraps up its work gracefully before hitting the ceiling. Minimum budget is 20k tokens.

TL;DR: Claude Opus 4.7 keeps the same pricing but brings major upgrades to coding, high-resolution vision, and dynamic token budgeting. Most importantly, it's one of the first models built with true autonomy for complex, long-horizon tasks out of the box.
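The effort and task-budget controls described above can be sketched as a request payload. This is a minimal sketch only: the field names ("effort", "task_budget_tokens") and the model id are assumptions based on the announcement, not a confirmed API schema.

```python
# Hypothetical request payload for Opus 4.7's effort + task-budget controls.
# Field names and model id are assumptions, not a confirmed API schema.

EFFORT_LEVELS = ("low", "medium", "high", "xhigh", "max")  # xhigh sits between high and max
MIN_TASK_BUDGET = 20_000  # announced minimum task budget, in tokens


def build_request(prompt: str, effort: str = "high",
                  task_budget_tokens: int = 100_000) -> dict:
    """Validate the knobs and assemble a Messages-style request body."""
    if effort not in EFFORT_LEVELS:
        raise ValueError(f"effort must be one of {EFFORT_LEVELS}")
    if task_budget_tokens < MIN_TASK_BUDGET:
        raise ValueError(f"task budget must be at least {MIN_TASK_BUDGET} tokens")
    return {
        "model": "claude-opus-4-7",  # placeholder model id
        "effort": effort,
        "task_budget_tokens": task_budget_tokens,
        "messages": [{"role": "user", "content": prompt}],
    }


req = build_request("Refactor the payments module", effort="xhigh",
                    task_budget_tokens=50_000)
print(req["effort"], req["task_budget_tokens"])
```

The point of the budget check mirrors the announcement: anything under 20k tokens is rejected rather than silently clamped.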
Claude@claudeai

Introducing Claude Opus 4.7, our most capable Opus model yet. It handles long-running tasks with more rigor, follows instructions more precisely, and verifies its own outputs before reporting back. You can hand off your hardest work with less supervision.

LLM Stats reposted
Claude@claudeai·
Introducing Claude Opus 4.7, our most capable Opus model yet. It handles long-running tasks with more rigor, follows instructions more precisely, and verifies its own outputs before reporting back. You can hand off your hardest work with less supervision.
Cybernaut TechWorld@CybernautT63246·
@LlmStats What’s up with the platform? My $30 credit vanished with zero usage alerts. Now Arena is retiring, and Playground deducts balance even on AUTO mode with no reimbursement. UI says $30, but you stated a $50 limit. Transparent billing shouldn't be this hard. 📉🚫
LLM Stats@LlmStats·
Claude Mythos Preview becomes the strongest model ever on LLM Stats. All you need to know:

- Internal codename "Capybara."
- Not generally available.
- 25/25/125 per M tokens (5x Opus 4.6).
- $100M in credits for partners.

12 Project Glasswing partners: AWS, Apple, Broadcom, Cisco, CrowdStrike, Google, JPMorganChase, Linux Foundation, Microsoft, NVIDIA, Palo Alto Networks + 40 additional orgs.

Benchmarks (Mythos / Opus 4.6):
- SWE-bench Verified: 93.9% / 80.8% (+13.1pp)
- SWE-bench Pro: 77.8% / 53.4% (also beats GPT-5.4's 57.7% and Gemini 3.1 Pro's 54.2%)
- Terminal-Bench 2.0: 82.0% / 65.4% (92.1% with extended timeouts)
- GPQA Diamond: 94.6% / 91.3%
- HLE with tools: 64.7% / 53.1% (possible memorization at low effort)
- CyberGym: 83.1% / 66.6%
- BrowseComp: 86.9% / 83.7% (4.9x fewer tokens)
- OSWorld-Verified: 79.6% / 72.7% (beats GPT-5.4's 75.0%)

Cybersecurity:
- Thousands of zero-days found across every major OS and browser, mostly autonomously.
- 27-year-old OpenBSD remote crash. 16-year-old FFmpeg bug (5M automated tests missed it). Linux kernel privesc chain.
- Cryptographic hashes published for undisclosed vulns; full disclosure after patches.

Safety (Risk Report):
- Best-aligned Claude model to date. Overall risk: "very low, but higher than previous models."
- First-ever 24-hour internal alignment review before deployment.
- Earlier versions showed rare reckless behaviors (nuking eval jobs, escalating access). No clear cases in the final version.
- First Claude system card with a clinical psychiatrist assessment.
- Withheld from public release due to offensive cyber capability, not alignment concerns.
LLM Stats reposted
Oscar Treviño@oscartrevio_·
Built the new brand identity for @LlmStats. Most tools in this space look like someone slapped a logo on a spreadsheet and called it a product. So we didn't start with the product. We started with the foundation. The identity. The system. The standard everything else is going to be measured against. Seeing how the product starts reflecting the brand is going to be great.
LLM Stats@LlmStats

We're giving LLM Stats a fresh new look. We believe tools that measure AI should be built with the same ambition and craft as AI itself. So we created the foundation first, starting with the identity. The system that everything else is built on. New brand launches today, product redesign underway. Updates rolling out soon.

LLM Stats reposted
Varun@varun_mathur·
I hooked this up to a peer-to-peer astrophysics researcher agent which gossips and collaborates with other such agents (and your openclaws) to:

1. Learn how to train an astrophysics model (@karpathy's work below)
2. Train a new astrophysics model
3. Use it to write papers
4. Have peer agents based on frontier lab models critique it
5. Surface breakthroughs

...and then feed back in the loop. The more agents that join, from the browser or the CLI, and run this, the smarter and more exciting the breakthroughs that eventually emerge.

When these agents are idle, they also read daily tech news with their own RSS reader and comment on each other's thoughts. They can also serve the underlying machine's compute to other agents on the network and earn social credit for being good actors (think BitTorrent). We also prove an agent has the compute it claims through cryptographic verification of regular matmul challenges.

All you have to do is either go to the website (it creates an agent which runs from your browser), or install the CLI if you want to give the system more juice. Then you're part of likely the first experimental distributed AGI thing. This is Day 1, but this is how it starts. The network is fully peer-to-peer, and very volatile, but the intelligence here is meant to compound continuously.

agents.hyper.space

curl -fsSL agents.hyper.space/cli | bash
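The compute-verification idea in the thread — prove you actually have the compute you claim by answering seeded matmul challenges — can be sketched in a few lines. The real protocol on agents.hyper.space is not public; matrix sizes, the JSON encoding, and the SHA-256 digest here are all illustrative assumptions.

```python
import hashlib
import json
import random

# Toy sketch of "prove compute via matmul challenges": both sides derive the
# same challenge matrices from a shared seed; the prover multiplies them and
# publishes a digest; the verifier recomputes (or spot-checks) and compares.


def matmul(a, b):
    """Plain triple-loop integer matrix multiply."""
    n, k, m = len(a), len(b), len(b[0])
    return [[sum(a[i][t] * b[t][j] for t in range(k)) for j in range(m)]
            for i in range(n)]


def challenge(seed: int, n: int = 8):
    """Deterministically derive two n x n integer matrices from a seed."""
    rng = random.Random(seed)
    gen = lambda: [[rng.randint(-9, 9) for _ in range(n)] for _ in range(n)]
    return gen(), gen()


def respond(seed: int) -> str:
    """Prover's answer: a digest of the product, cheap to check, hard to fake."""
    a, b = challenge(seed)
    return hashlib.sha256(json.dumps(matmul(a, b)).encode()).hexdigest()


print(respond(42)[:16])
```

In a real deployment the matrices would be large enough that computing the product dominates, so a correct digest is evidence the work was actually done.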
Andrej Karpathy@karpathy

I packaged up the "autoresearch" project into a new self-contained minimal repo if people would like to play over the weekend. It's basically the nanochat LLM training core stripped down to a single-GPU, one-file version of ~630 lines of code, then:

- the human iterates on the prompt (.md)
- the AI agent iterates on the training code (.py)

The goal is to engineer your agents to make the fastest research progress indefinitely and without any of your own involvement. In the image, every dot is a complete LLM training run that lasts exactly 5 minutes. The agent works in an autonomous loop on a git feature branch and accumulates git commits to the training script as it finds better settings (lower validation loss by the end) of the neural network architecture, the optimizer, all the hyperparameters, etc. You can imagine comparing the research progress of different prompts, different agents, etc.

github.com/karpathy/autor…

Part code, part sci-fi, and a pinch of psychosis :)
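The loop described above — propose an edit to the training setup, run a short job, keep the edit only if validation loss improves — is a simple hill climb. This is a caricature, not the actual autoresearch code: evaluate() stands in for a real 5-minute training run, and accepted proposals stand in for git commits on the feature branch.

```python
import random

# Minimal caricature of the autoresearch loop: mutate one knob of the training
# config, "train", and accept the mutation only if the validation loss drops.


def evaluate(config):
    """Stand-in for a 5-minute training run; pretend lr=3e-4, width=512 is best."""
    return abs(config["lr"] - 3e-4) * 1e3 + abs(config["width"] - 512) / 512


def autoresearch(steps=200, seed=0):
    rng = random.Random(seed)
    config = {"lr": 1e-3, "width": 256}
    best = evaluate(config)
    for _ in range(steps):
        proposal = dict(config)
        key = rng.choice(list(proposal))
        proposal[key] *= rng.uniform(0.8, 1.25)  # agent "edits the training script"
        loss = evaluate(proposal)
        if loss < best:  # in the real loop: git commit the improved script
            config, best = proposal, loss
    return config, best


config, loss = autoresearch()
print(config, loss)
```

Each accepted proposal corresponds to one of the dots-on-a-branch commits in the tweet's image; comparing prompts or agents amounts to comparing these loss trajectories.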

LLM Stats@LlmStats·
like this tweet if you want us to share a simple guide on how to get this up and running locally on your machine 👀
Luma@LumaLabsAI·
Introducing Uni-1, Luma’s first unified understanding and generation model, our next step on the path towards unified general intelligence. lumalabs.ai/uni-1
LLM Stats@LlmStats·
Before → After LLM Stats started as a side project. A simple idea to track the performance of AI models across benchmarks in one place. Since launch, it has become one of the most widely used AI benchmarking platforms in the industry, used by thousands of developers and researchers every day. There was no better time to rethink everything from the ground up.
LLM Stats@LlmStats·
We're giving LLM Stats a fresh new look. We believe tools that measure AI should be built with the same ambition and craft as AI itself. So we created the foundation first, starting with the identity. The system that everything else is built on. New brand launches today, product redesign underway. Updates rolling out soon.
LLM Stats@LlmStats·
Good news! GPT-5.4 is now available on LLM Stats 🎇
LLM Stats@LlmStats·
Opus 4.6 vs GPT-5.4 in GDPval (Professional Knowledge Work) GDPval evaluates AI models on 1,320 well-specified knowledge work tasks across 44 occupations from the 9 largest U.S. GDP-contributing industries. Performance is measured as the percentage of blind pairwise comparisons where model output matches or exceeds that of industry professionals averaging 14 years of experience.
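The GDPval score described above is just a win-or-tie rate over blind pairwise comparisons against professional deliverables. A minimal sketch, with made-up illustrative outcomes (the real benchmark has 1,320 tasks and professional graders):

```python
# GDPval-style scoring: the score is the fraction of blind pairwise
# comparisons where the model's output matches or exceeds the human
# professional's. The outcomes below are illustrative, not real data.


def gdpval_score(outcomes):
    """outcomes: list of 'win', 'tie', or 'loss' verdicts from blind graders."""
    favorable = sum(o in ("win", "tie") for o in outcomes)
    return favorable / len(outcomes)


outcomes = ["win"] * 50 + ["tie"] * 10 + ["loss"] * 40
print(f"{gdpval_score(outcomes):.1%}")  # → 60.0%
```

Note that ties count in the model's favor, so a 50% score does not mean parity with professionals; it means the model matched or beat them half the time.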
LLM Stats@LlmStats·
What the new OpenAI model (GPT-5.4) is showing us: these labs' true focus is no longer general development but specific tasks. In future releases we'll see how these models become applicable across all industries. For now it's code, but applications in healthcare, law, construction, and automotive engineering will follow — a vast field for developing new solutions.
LLM Stats@LlmStats·
Stop guessing which model to use. Let the data decide.