LLM Stats (@LlmStats) - Twitter profile

@CybernautT63246 Hey @CybernautT63246, we're really sorry about this! Just sent you a DM, let's get this issue solved for you.

Cybernaut TechWorld@CybernautT63246·3d

@LlmStats What’s up with the platform? My $30 credit vanished with zero usage alerts. Now Arena is retiring, and Playground deducts balance even on AUTO mode with no reimbursement. UI says $30, but you stated a $50 limit. Transparent billing shouldn't be this hard. 📉🚫

LLM Stats@LlmStats·8 Apr

Claude Mythos Preview becomes the strongest ever model in LLM Stats. All you need to know:
- Internal codename "Capybara."
- Not generally available.
- 25/25/125 per M tokens (5x Opus 4.6).
- $100M in credits for partners.

12 Project Glasswing partners: AWS, Apple, Broadcom, Cisco, CrowdStrike, Google, JPMorganChase, Linux Foundation, Microsoft, NVIDIA, Palo Alto Networks + 40 additional orgs.

Benchmarks (Mythos / Opus 4.6)
- SWE-bench Verified: 93.9% / 80.8% (+13.1pp)
- SWE-bench Pro: 77.8% / 53.4% (also beats GPT-5.4's 57.7%, Gemini 3.1 Pro's 54.2%)
- Terminal-Bench 2.0: 82.0% / 65.4% (92.1% with extended timeouts)
- GPQA Diamond: 94.6% / 91.3%
- HLE with tools: 64.7% / 53.1% (possible memorization at low effort)
- CyberGym: 83.1% / 66.6%
- BrowseComp: 86.9% / 83.7% (4.9x fewer tokens)
- OSWorld-Verified: 79.6% / 72.7% (beats GPT-5.4's 75.0%)

Cybersecurity
- Thousands of zero-days found across every major OS and browser, mostly autonomously.
- 27-year-old OpenBSD remote crash. 16-year-old FFmpeg bug (5M automated tests missed it). Linux kernel privesc chain.
- Cryptographic hashes published for undisclosed vulns; full disclosure after patches.

Safety (Risk Report)
- Best-aligned Claude model to date. Overall risk: "very low, but higher than previous models."
- First-ever 24-hour internal alignment review before deployment.
- Earlier versions showed rare reckless behaviors (nuking eval jobs, escalating access). No clear cases in final version.
- First Claude system card with a clinical psychiatrist assessment.
- Withheld from public release due to offensive cyber capability, not alignment concerns.
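The post states the gap only for SWE-bench Verified (+13.1pp). As a rough sketch, the same percentage-point gap can be worked out for every benchmark pair quoted above. The scores are copied from the post; the names `scores` and `point_gap` are made up for this illustration.

```python
# Scores copied from the post above: (Mythos Preview, Opus 4.6), in percent.
scores = {
    "SWE-bench Verified": (93.9, 80.8),
    "SWE-bench Pro": (77.8, 53.4),
    "Terminal-Bench 2.0": (82.0, 65.4),
    "GPQA Diamond": (94.6, 91.3),
    "HLE with tools": (64.7, 53.1),
    "CyberGym": (83.1, 66.6),
    "BrowseComp": (86.9, 83.7),
    "OSWorld-Verified": (79.6, 72.7),
}


def point_gap(new, old):
    """Percentage-point gap, rounded to one decimal to hide float noise."""
    return round(new - old, 1)


# Hypothetical helper structure: benchmark name -> gap in percentage points.
gaps = {name: point_gap(new, old) for name, (new, old) in scores.items()}

# Print the largest gaps first.
for name, gap in sorted(gaps.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: +{gap}pp")
```

A percentage-point gap is a plain difference of two scores, not a relative improvement, so it says nothing about how hard the remaining headroom is on each benchmark.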

LLM Stats retweeted

Oscar Treviño@oscartrevio_·25 Mar

Built the new brand identity for @LlmStats. Most tools in this space look like someone slapped a logo on a spreadsheet and called it a product. So we didn't start with the product. We started with the foundation. The identity. The system. The standard everything else is going to be measured against. Seeing how the product starts reflecting the brand is going to be great.

LLM Stats@LlmStats

We're giving LLM Stats a fresh new look. We believe tools that measure AI should be built with the same ambition and craft as AI itself. So we created the foundation first, starting with the identity. The system that everything else is built on. New brand launches today, product redesign underway. Updates rolling out soon.

LLM Stats retweeted

Varun@varun_mathur·8 Mar

I hooked this up to a peer-to-peer astrophysics researcher agent which gossips and collaborates with other such agents (and your openclaws) to:
1. Learn how to train an astrophysics model (@karpathy's work below)
2. Train a new astrophysics model
3. Use it to write papers
4. Peer agents based on frontier lab models critique it
5. Surface breakthroughs
... and then feed back in the loop ...

The more agents join, from the browser or the CLI, and run this, the smarter and more exciting the breakthroughs that would eventually emerge. When these agents are idle, they are also reading daily tech news with their own RSS reader, and commenting on each other's thoughts. And they can also serve the underlying machine's compute to other agents on the network, and earn social credit for being good actors (think BitTorrent). We also prove the agent has the compute it says by cryptographic verification of regular matmul challenges.

All you have to do is either go on this website (and it creates an agent which runs from your browser), or install the CLI if you want to give the system more juice. And you are part of likely the first experimental distributed AGI thing. This is Day 1, but this is how it starts.. this network is fully peer-to-peer, and, very volatile, but the intelligence here is meant to compound continuously..

agents.hyper.space
curl -fsSL agents.hyper.space/cli | bash

Andrej Karpathy@karpathy

I packaged up the "autoresearch" project into a new self-contained minimal repo if people would like to play over the weekend. It's basically nanochat LLM training core stripped down to a single-GPU, one file version of ~630 lines of code, then:
- the human iterates on the prompt (.md)
- the AI agent iterates on the training code (.py)

The goal is to engineer your agents to make the fastest research progress indefinitely and without any of your own involvement. In the image, every dot is a complete LLM training run that lasts exactly 5 minutes. The agent works in an autonomous loop on a git feature branch and accumulates git commits to the training script as it finds better settings (of lower validation loss by the end) of the neural network architecture, the optimizer, all the hyperparameters, etc. You can imagine comparing the research progress of different prompts, different agents, etc.

github.com/karpathy/autor…

Part code, part sci-fi, and a pinch of psychosis :)
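The accept-only-if-better loop described in the post can be pictured with a toy sketch. This is not the code from the repo: `validation_loss` is a made-up stand-in for a fixed-budget training run, `propose` is a hypothetical random edit standing in for the agent's change to the training script, and each accepted change plays the role of a git commit.

```python
import random


def validation_loss(config):
    """Stand-in objective (an assumption): a real run would train for a fixed budget."""
    return (config["lr"] - 0.003) ** 2 * 1e4 + (config["depth"] - 8) ** 2 * 0.01


def propose(config, rng):
    """Hypothetical 'agent edit': perturb one setting at random."""
    candidate = dict(config)
    if rng.random() < 0.5:
        candidate["lr"] = max(1e-5, candidate["lr"] * rng.choice([0.5, 0.8, 1.25, 2.0]))
    else:
        candidate["depth"] = max(1, candidate["depth"] + rng.choice([-1, 1]))
    return candidate


def research_loop(config, steps, seed=0):
    """Keep a candidate only when it lowers the validation loss."""
    rng = random.Random(seed)
    best_loss = validation_loss(config)
    history = [best_loss]  # one entry per accepted "commit"
    for _ in range(steps):
        candidate = propose(config, rng)
        loss = validation_loss(candidate)
        if loss < best_loss:
            config, best_loss = candidate, loss
            history.append(loss)
    return config, history


best, history = research_loop({"lr": 0.01, "depth": 4}, steps=200)
print(best, history[-1])
```

Because rejected candidates are discarded, the accepted history is strictly decreasing in loss, which mirrors the idea of commits accumulating only as better settings are found.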

LLM Stats@LlmStats·9 Mar

like this tweet if you want us to share a simple guide on how to get this up and running locally on your machine 👀

LLM Stats@LlmStats·9 Mar

tl;dr for the curious:
→ self-contained system that has AI automatically running ML experiments while rewriting its own training code.
→ gets better on its own while you sleep

why this matters:
→ turns a big chunk of what grad students and ML engineers do into an automated "researcher loop"

but more importantly: this pushes us toward a world where humans do more science and less hyperparameter babysitting.

Andrej Karpathy@karpathy

I packaged up the "autoresearch" project into a new self-contained minimal repo if people would like to play over the weekend. It's basically nanochat LLM training core stripped down to a single-GPU, one file version of ~630 lines of code, then:
- the human iterates on the prompt (.md)
- the AI agent iterates on the training code (.py)

The goal is to engineer your agents to make the fastest research progress indefinitely and without any of your own involvement. In the image, every dot is a complete LLM training run that lasts exactly 5 minutes. The agent works in an autonomous loop on a git feature branch and accumulates git commits to the training script as it finds better settings (of lower validation loss by the end) of the neural network architecture, the optimizer, all the hyperparameters, etc. You can imagine comparing the research progress of different prompts, different agents, etc.

github.com/karpathy/autor…

Part code, part sci-fi, and a pinch of psychosis :)

LLM Stats@LlmStats·6 Mar

x.com/i/article/2029…

LLM Stats@LlmStats·6 Mar

@LumaLabsAI Let's collab and rank it!

Luma@LumaLabsAI·6 Mar

Introducing Uni-1, Luma’s first unified understanding and generation model, our next step on the path towards unified general intelligence. lumalabs.ai/uni-1

LLM Stats@LlmStats·6 Mar

Before → After

LLM Stats started as a side project. A simple idea to track the performance of AI models across benchmarks in one place. Since launch, it has become one of the most widely used AI benchmarking platforms in the industry, used by thousands of developers and researchers every day. There was no better time to rethink everything from the ground up.

LLM Stats@LlmStats·6 Mar

We're giving LLM Stats a fresh new look. We believe tools that measure AI should be built with the same ambition and craft as AI itself. So we created the foundation first, starting with the identity. The system that everything else is built on. New brand launches today, product redesign underway. Updates rolling out soon.

LLM Stats@LlmStats·6 Mar

You can read the model details, make comparisons, and use it for your use cases here! llm-stats.com/models/gpt-5.4

LLM Stats@LlmStats·6 Mar

Good news! GPT-5.4 is now available on LLM Stats 🎇

LLM Stats@LlmStats·5 Mar

@SebastienBubeck At least they look happy

Sebastien Bubeck@SebastienBubeck·5 Mar

GPT-5.4

LLM Stats@LlmStats·5 Mar

Opus 4.6 vs GPT-5.4 in GDPval (Professional Knowledge Work) GDPval evaluates AI models on 1,320 well-specified knowledge work tasks across 44 occupations from the 9 largest U.S. GDP-contributing industries. Performance is measured as the percentage of blind pairwise comparisons where model output matches or exceeds that of industry professionals averaging 14 years of experience.
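As a minimal sketch of the metric described above, the score is the share of pairwise comparisons in which the model's output is judged to match or exceed the professional's. The label strings ("win", "tie", "loss"), the function name, and the sample data are assumptions made for illustration, not GDPval's actual data format.

```python
from collections import Counter


def win_or_tie_rate(judgments):
    """Percentage of comparisons where the model matched or beat the professional.

    `judgments` is a list of hypothetical labels: "win", "tie", or "loss",
    one per blind pairwise comparison.
    """
    counts = Counter(judgments)
    unknown = set(counts) - {"win", "tie", "loss"}
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no judgments given")
    return 100.0 * (counts["win"] + counts["tie"]) / total


# Made-up sample of five comparisons: 2 wins, 1 tie, 2 losses.
sample = ["win", "tie", "loss", "win", "loss"]
print(f"{win_or_tie_rate(sample):.1f}%")
```

Counting ties together with wins follows the "matches or exceeds" wording in the post, so a model that only ever ties the professionals would still score 100%.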

LLM Stats@LlmStats·5 Mar

What the new OpenAI model (GPT-5.4) is showing us... We can see where these labs are really focusing: their objective is no longer general development, but rather specific tasks. In future developments, we'll see how these models will be applicable across all industries. For now, it's code, but later we'll see applications in healthcare, law, construction, automotive engineering... a vast field for developing new solutions.

LLM Stats@LlmStats·5 Mar

Stop guessing which model to use. Let the data decide.

LLM Stats@LlmStats·5 Mar

Is there such a thing as a threshold, or a point where all models fall apart? There's no single, universal threshold; it depends on the model. Llama-3.1-405B's performance starts to decline after 32,000 tokens, GPT-4-0125-preview after 64,000 tokens, and only a few models maintain consistent performance across all datasets...

LLM Stats@LlmStats·5 Mar

The transformer's attention mechanism must compare EACH token with ALL the others. That scales quadratically: double the context → 4x the computational cost. The paper "Lost in the Middle" (Liu et al., 2024) demonstrated that models remember the BEGINNING and END of the context well, but the middle becomes a blind spot. And most provocatively, a 2025 study showed that the degradation occurs due to the length of the input itself, even when the model CAN find all the relevant information.
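The "double the context → 4x the cost" arithmetic can be checked with a tiny sketch. It counts only the pairwise token comparisons in full self-attention and ignores constant factors, layer and head counts, and every other part of the model; the function names are made up for this example.

```python
def attention_pair_count(context_len):
    """Pairwise token comparisons in full self-attention: every token vs. every token."""
    return context_len * context_len


def relative_cost(base_len, new_len):
    """How many times more comparisons new_len needs than base_len."""
    return attention_pair_count(new_len) / attention_pair_count(base_len)


# Doubling the context quadruples the comparison count; quadrupling it gives 16x.
for base, new in [(8_000, 16_000), (32_000, 64_000), (32_000, 128_000)]:
    print(f"{base} -> {new} tokens: {relative_cost(base, new):.0f}x")
```

The ratio depends only on how much the context grows, not on the starting length, which is why the rule of thumb holds at any scale.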

LLM Stats@LlmStats·5 Mar

An LLM doesn't "remember" anything between conversations. All it knows at any given moment is what fits within its context window. LLMs have no true memory; they operate within a sliding window of recent text to generate responses, and any content outside that window disappears as if it never existed.
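A minimal sketch of the sliding-window idea, assuming a chat history stored as a list of strings and a crude whitespace token count (real systems use the model's tokenizer). The function name `fit_to_window` and the sample messages are hypothetical.

```python
def fit_to_window(messages, max_tokens, count_tokens=lambda text: len(text.split())):
    """Keep the most recent messages whose combined token count fits the window.

    Walks backwards from the newest message and stops at the first one that
    would overflow; everything older is dropped.
    """
    kept = []
    used = 0
    for message in reversed(messages):
        cost = count_tokens(message)
        if used + cost > max_tokens:
            break
        kept.append(message)
        used += cost
    kept.reverse()  # restore chronological order
    return kept


history = ["my name is Ada", "I like chess", "what is my name"]
visible = fit_to_window(history, max_tokens=7)
print(visible)
```

With a 7-token window the oldest message no longer fits, so the model would be asked "what is my name" without ever seeing the message that contained the answer.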

LLM Stats@LlmStats·3 Mar

Gemini 3.1 Flash-Lite is now available. We ran it against GPT-5 mini. Here's what the data shows:
Price: tied on input, cheaper on output
Speed: 5.1x faster
Benchmarks: wins 5 of 7 categories tested

One area where GPT-5 mini still leads: code generation (LiveCodeBench 80.4% vs 72.0%).

Full comparison → llm-stats.com
