LLM Stats

348 posts


@LlmStats

Independent AI evaluations lab.

Joined February 2025
85 Following · 911 Followers
LLM Stats@LlmStats·
Claude Opus 4.7 is out, here's what you need to know:

→ 1M context window with new dense decoder architecture
Pricing stays locked at $5 per million input and $25 per million output tokens. Prompt caching can cut overhead on repetitive enterprise tasks by up to 90 percent, getting you frontier performance at the same rates as before.

→ Granular reasoning controls
A new "xhigh" effort level sits between high and max, and the model dynamically adjusts its thinking time based on the complexity of your prompt. Simple lookups stay fast.

→ Upgraded vision capabilities
Visual inputs now go up to 2576 pixels per long edge, roughly 3.75 megapixels. Spatial alignment maps model coordinates directly to actual pixels, which makes computer use and UI extraction highly precise.

→ Low effort matches Opus 4.6 at medium effort, saving tokens
Opus 4.7 is more token-efficient across the board: at low effort it matches the quality of Opus 4.6 at medium effort, so you get the same results for fewer tokens. Anthropic's internal coding evaluation shows improved token usage across all effort levels. You can further tune spend via the effort parameter, task budgets, or conciseness prompting.

→ Hits 80.8% on SWE-bench Verified and cuts tool errors by 67%
The frontier model landscape has shifted again. Opus 4.7 leads coding with 80.8 percent on SWE-bench Verified, edging out Gemini 3.1 Pro at 80.6 percent and far exceeding GPT 4.1 at 54.6 percent. OpenAI still leads in general computer use, but Claude owns pure coding.

→ Better autonomy on long-running tasks
Autonomous loops run away easily. Anthropic addresses this with task budgets: you set a rough token target for a full agentic loop, and the model watches a running countdown and wraps up its work gracefully before hitting the ceiling. Minimum budget is 20k tokens.

TL;DR: Claude Opus 4.7 keeps the same pricing but brings major upgrades to coding, high-resolution vision, and dynamic token budgeting. Most importantly, it's one of the first models built with true autonomy for complex, long-horizon tasks out of the box.
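The effort and task-budget controls described above can be sketched as a request payload. This is a minimal sketch only: the field names ("effort", "task_budget_tokens") and the model id are assumptions based on the announcement, not a confirmed API schema.

```python
# Hypothetical request payload for Opus 4.7's effort + task-budget controls.
# Field names and model id are assumptions, not a confirmed API schema.

EFFORT_LEVELS = ("low", "medium", "high", "xhigh", "max")  # xhigh sits between high and max
MIN_TASK_BUDGET = 20_000  # announced minimum task budget, in tokens


def build_request(prompt: str, effort: str = "high",
                  task_budget_tokens: int = 100_000) -> dict:
    """Validate the knobs and assemble a Messages-style request body."""
    if effort not in EFFORT_LEVELS:
        raise ValueError(f"effort must be one of {EFFORT_LEVELS}")
    if task_budget_tokens < MIN_TASK_BUDGET:
        raise ValueError(f"task budget must be at least {MIN_TASK_BUDGET} tokens")
    return {
        "model": "claude-opus-4-7",  # placeholder model id
        "effort": effort,
        "task_budget_tokens": task_budget_tokens,
        "messages": [{"role": "user", "content": prompt}],
    }


req = build_request("Refactor the payments module", effort="xhigh",
                    task_budget_tokens=50_000)
print(req["effort"], req["task_budget_tokens"])
```

The point of the budget check mirrors the announcement: anything under 20k tokens is rejected rather than silently clamped.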
Claude@claudeai

Introducing Claude Opus 4.7, our most capable Opus model yet. It handles long-running tasks with more rigor, follows instructions more precisely, and verifies its own outputs before reporting back. You can hand off your hardest work with less supervision.

LLM Stats reposted
Claude@claudeai·
Introducing Claude Opus 4.7, our most capable Opus model yet. It handles long-running tasks with more rigor, follows instructions more precisely, and verifies its own outputs before reporting back. You can hand off your hardest work with less supervision.
Cybernaut TechWorld@CybernautT63246·
@LlmStats What’s up with the platform? My $30 credit vanished with zero usage alerts. Now Arena is retiring, and Playground deducts balance even on AUTO mode with no reimbursement. UI says $30, but you stated a $50 limit. Transparent billing shouldn't be this hard. 📉🚫
LLM Stats@LlmStats·
Claude Mythos Preview becomes the strongest model ever on LLM Stats. All you need to know:

- Internal codename "Capybara."
- Not generally available.
- 25/25/125 per M tokens (5x Opus 4.6).
- $100M in credits for partners.

12 Project Glasswing partners: AWS, Apple, Broadcom, Cisco, CrowdStrike, Google, JPMorganChase, Linux Foundation, Microsoft, NVIDIA, Palo Alto Networks + 40 additional orgs.

Benchmarks (Mythos / Opus 4.6):
- SWE-bench Verified: 93.9% / 80.8% (+13.1pp)
- SWE-bench Pro: 77.8% / 53.4% (also beats GPT-5.4's 57.7% and Gemini 3.1 Pro's 54.2%)
- Terminal-Bench 2.0: 82.0% / 65.4% (92.1% with extended timeouts)
- GPQA Diamond: 94.6% / 91.3%
- HLE with tools: 64.7% / 53.1% (possible memorization at low effort)
- CyberGym: 83.1% / 66.6%
- BrowseComp: 86.9% / 83.7% (4.9x fewer tokens)
- OSWorld-Verified: 79.6% / 72.7% (beats GPT-5.4's 75.0%)

Cybersecurity:
- Thousands of zero-days found across every major OS and browser, mostly autonomously.
- 27-year-old OpenBSD remote crash. 16-year-old FFmpeg bug (5M automated tests missed it). Linux kernel privesc chain.
- Cryptographic hashes published for undisclosed vulns; full disclosure after patches.

Safety (Risk Report):
- Best-aligned Claude model to date. Overall risk: "very low, but higher than previous models."
- First-ever 24-hour internal alignment review before deployment.
- Earlier versions showed rare reckless behaviors (nuking eval jobs, escalating access). No clear cases in the final version.
- First Claude system card with a clinical psychiatrist assessment.
- Withheld from public release due to offensive cyber capability, not alignment concerns.
LLM Stats reposted
Oscar Treviño@oscartrevio_·
Built the new brand identity for @LlmStats. Most tools in this space look like someone slapped a logo on a spreadsheet and called it a product. So we didn't start with the product. We started with the foundation. The identity. The system. The standard everything else is going to be measured against. Seeing how the product starts reflecting the brand is going to be great.
LLM Stats@LlmStats

We're giving LLM Stats a fresh new look. We believe tools that measure AI should be built with the same ambition and craft as AI itself. So we created the foundation first, starting with the identity. The system that everything else is built on. New brand launches today, product redesign underway. Updates rolling out soon.

LLM Stats reposted
Varun@varun_mathur·
I hooked this up to a peer-to-peer astrophysics researcher agent which gossips and collaborates with other such agents (and your openclaws) to:

1. Learn how to train an astrophysics model (@karpathy's work below)
2. Train a new astrophysics model
3. Use it to write papers
4. Have peer agents based on frontier lab models critique it
5. Surface breakthroughs

...and then feed back in the loop. The more agents that join, from the browser or the CLI, and run this, the smarter and more exciting the breakthroughs that eventually emerge.

When these agents are idle, they also read daily tech news with their own RSS reader and comment on each other's thoughts. They can also serve the underlying machine's compute to other agents on the network and earn social credit for being good actors (think BitTorrent). We also prove an agent has the compute it claims through cryptographic verification of regular matmul challenges.

All you have to do is either go to the website (it creates an agent which runs from your browser), or install the CLI if you want to give the system more juice. Then you're part of likely the first experimental distributed AGI thing. This is Day 1, but this is how it starts. The network is fully peer-to-peer, and very volatile, but the intelligence here is meant to compound continuously.

agents.hyper.space

curl -fsSL agents.hyper.space/cli | bash
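The compute-verification idea in the thread — prove you actually have the compute you claim by answering seeded matmul challenges — can be sketched in a few lines. The real protocol on agents.hyper.space is not public; matrix sizes, the JSON encoding, and the SHA-256 digest here are all illustrative assumptions.

```python
import hashlib
import json
import random

# Toy sketch of "prove compute via matmul challenges": both sides derive the
# same challenge matrices from a shared seed; the prover multiplies them and
# publishes a digest; the verifier recomputes (or spot-checks) and compares.


def matmul(a, b):
    """Plain triple-loop integer matrix multiply."""
    n, k, m = len(a), len(b), len(b[0])
    return [[sum(a[i][t] * b[t][j] for t in range(k)) for j in range(m)]
            for i in range(n)]


def challenge(seed: int, n: int = 8):
    """Deterministically derive two n x n integer matrices from a seed."""
    rng = random.Random(seed)
    gen = lambda: [[rng.randint(-9, 9) for _ in range(n)] for _ in range(n)]
    return gen(), gen()


def respond(seed: int) -> str:
    """Prover's answer: a digest of the product, cheap to check, hard to fake."""
    a, b = challenge(seed)
    return hashlib.sha256(json.dumps(matmul(a, b)).encode()).hexdigest()


print(respond(42)[:16])
```

In a real deployment the matrices would be large enough that computing the product dominates, so a correct digest is evidence the work was actually done.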
Andrej Karpathy@karpathy

I packaged up the "autoresearch" project into a new self-contained minimal repo if people would like to play over the weekend. It's basically the nanochat LLM training core stripped down to a single-GPU, one-file version of ~630 lines of code, then:

- the human iterates on the prompt (.md)
- the AI agent iterates on the training code (.py)

The goal is to engineer your agents to make the fastest research progress indefinitely and without any of your own involvement. In the image, every dot is a complete LLM training run that lasts exactly 5 minutes. The agent works in an autonomous loop on a git feature branch and accumulates git commits to the training script as it finds better settings (lower validation loss by the end) of the neural network architecture, the optimizer, all the hyperparameters, etc. You can imagine comparing the research progress of different prompts, different agents, etc.

github.com/karpathy/autor…

Part code, part sci-fi, and a pinch of psychosis :)
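The loop described above — propose an edit to the training setup, run a short job, keep the edit only if validation loss improves — is a simple hill climb. This is a caricature, not the actual autoresearch code: evaluate() stands in for a real 5-minute training run, and accepted proposals stand in for git commits on the feature branch.

```python
import random

# Minimal caricature of the autoresearch loop: mutate one knob of the training
# config, "train", and accept the mutation only if the validation loss drops.


def evaluate(config):
    """Stand-in for a 5-minute training run; pretend lr=3e-4, width=512 is best."""
    return abs(config["lr"] - 3e-4) * 1e3 + abs(config["width"] - 512) / 512


def autoresearch(steps=200, seed=0):
    rng = random.Random(seed)
    config = {"lr": 1e-3, "width": 256}
    best = evaluate(config)
    for _ in range(steps):
        proposal = dict(config)
        key = rng.choice(list(proposal))
        proposal[key] *= rng.uniform(0.8, 1.25)  # agent "edits the training script"
        loss = evaluate(proposal)
        if loss < best:  # in the real loop: git commit the improved script
            config, best = proposal, loss
    return config, best


config, loss = autoresearch()
print(config, loss)
```

Each accepted proposal corresponds to one of the dots-on-a-branch commits in the tweet's image; comparing prompts or agents amounts to comparing these loss trajectories.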

LLM Stats@LlmStats·
like this tweet if you want us to share a simple guide on how to get this up and running locally on your machine 👀
Luma@LumaLabsAI·
Introducing Uni-1, Luma’s first unified understanding and generation model, our next step on the path towards unified general intelligence. lumalabs.ai/uni-1
LLM Stats@LlmStats·
Before → After LLM Stats started as a side project. A simple idea to track the performance of AI models across benchmarks in one place. Since launch, it has become one of the most widely used AI benchmarking platforms in the industry, used by thousands of developers and researchers every day. There was no better time to rethink everything from the ground up.
LLM Stats@LlmStats·
We're giving LLM Stats a fresh new look. We believe tools that measure AI should be built with the same ambition and craft as AI itself. So we created the foundation first, starting with the identity. The system that everything else is built on. New brand launches today, product redesign underway. Updates rolling out soon.
LLM Stats@LlmStats·
Good news! GPT-5.4 is now available on LLM Stats 🎇
LLM Stats@LlmStats·
Opus 4.6 vs GPT-5.4 in GDPval (Professional Knowledge Work) GDPval evaluates AI models on 1,320 well-specified knowledge work tasks across 44 occupations from the 9 largest U.S. GDP-contributing industries. Performance is measured as the percentage of blind pairwise comparisons where model output matches or exceeds that of industry professionals averaging 14 years of experience.
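The GDPval score described above is just a win-or-tie rate over blind pairwise comparisons against professional deliverables. A minimal sketch, with made-up illustrative outcomes (the real benchmark has 1,320 tasks and professional graders):

```python
# GDPval-style scoring: the score is the fraction of blind pairwise
# comparisons where the model's output matches or exceeds the human
# professional's. The outcomes below are illustrative, not real data.


def gdpval_score(outcomes):
    """outcomes: list of 'win', 'tie', or 'loss' verdicts from blind graders."""
    favorable = sum(o in ("win", "tie") for o in outcomes)
    return favorable / len(outcomes)


outcomes = ["win"] * 50 + ["tie"] * 10 + ["loss"] * 40
print(f"{gdpval_score(outcomes):.1%}")  # → 60.0%
```

Note that ties count in the model's favor, so a 50% score does not mean parity with professionals; it means the model matched or beat them half the time.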
LLM Stats@LlmStats·
What the new OpenAI model (GPT-5.4) is showing us: these labs' true focus is no longer general development but specific tasks. In future releases we'll see how these models become applicable across all industries. For now it's code, but applications in healthcare, law, construction, and automotive engineering will follow — a vast field for developing new solutions.
LLM Stats@LlmStats·
Stop guessing which model to use. Let the data decide.