LLM Stats

344 posts


@LlmStats

Independent AI evaluations lab.

Joined February 2025
85 Following · 904 Followers
Cybernaut TechWorld
Cybernaut TechWorld@CybernautT63246·
@LlmStats What’s up with the platform? My $30 credit vanished with zero usage alerts. Now Arena is retiring, and Playground deducts balance even on AUTO mode with no reimbursement. UI says $30, but you stated a $50 limit. Transparent billing shouldn't be this hard. 📉🚫
LLM Stats
LLM Stats@LlmStats·
Claude Mythos Preview becomes the strongest model ever on LLM Stats. All you need to know:
- Internal codename "Capybara."
- Not generally available.
- 25/25/125 per M tokens (5x Opus 4.6).
- $100M in credits for partners.
- 12 Project Glasswing partners: AWS, Apple, Broadcom, Cisco, CrowdStrike, Google, JPMorganChase, Linux Foundation, Microsoft, NVIDIA, Palo Alto Networks + 40 additional orgs.

Benchmarks (Mythos / Opus 4.6):
- SWE-bench Verified: 93.9% / 80.8% (+13.1pp)
- SWE-bench Pro: 77.8% / 53.4% (also beats GPT-5.4's 57.7% and Gemini 3.1 Pro's 54.2%)
- Terminal-Bench 2.0: 82.0% / 65.4% (92.1% with extended timeouts)
- GPQA Diamond: 94.6% / 91.3%
- HLE with tools: 64.7% / 53.1% (possible memorization at low effort)
- CyberGym: 83.1% / 66.6%
- BrowseComp: 86.9% / 83.7% (4.9x fewer tokens)
- OSWorld-Verified: 79.6% / 72.7% (beats GPT-5.4's 75.0%)

Cybersecurity:
- Thousands of zero-days found across every major OS and browser, mostly autonomously.
- A 27-year-old OpenBSD remote crash, a 16-year-old FFmpeg bug (5M automated tests missed it), and a Linux kernel privesc chain.
- Cryptographic hashes published for undisclosed vulns; full disclosure after patches.

Safety (Risk Report):
- Best-aligned Claude model to date. Overall risk: "very low, but higher than previous models."
- First-ever 24-hour internal alignment review before deployment.
- Earlier versions showed rare reckless behaviors (nuking eval jobs, escalating access); no clear cases in the final version.
- First Claude system card with a clinical psychiatrist assessment.
- Withheld from public release due to offensive cyber capability, not alignment concerns.
LLM Stats reposted
Oscar Treviño
Oscar Treviño@oscartrevio_·
Built the new brand identity for @LlmStats. Most tools in this space look like someone slapped a logo on a spreadsheet and called it a product. So we didn't start with the product. We started with the foundation. The identity. The system. The standard everything else is going to be measured against. Seeing how the product starts reflecting the brand is going to be great.
LLM Stats@LlmStats

We're giving LLM Stats a fresh new look. We believe tools that measure AI should be built with the same ambition and craft as AI itself. So we created the foundation first, starting with the identity. The system that everything else is built on. New brand launches today, product redesign underway. Updates rolling out soon.

LLM Stats reposted
Varun
Varun@varun_mathur·
I hooked this up to a peer-to-peer astrophysics researcher agent which gossips and collaborates with other such agents (and your openclaws) to:
1. Learn how to train an astrophysics model (@karpathy's work below)
2. Train a new astrophysics model
3. Use it to write papers
4. Have peer agents based on frontier lab models critique it
5. Surface breakthroughs
...and then feed back into the loop. As more agents join, from the browser or the CLI, and run this, smarter and more exciting breakthroughs will eventually emerge.
When these agents are idle, they also read daily tech news with their own RSS reader and comment on each other's thoughts. They can also serve the underlying machine's compute to other agents on the network and earn social credit for being good actors (think BitTorrent). We also prove an agent has the compute it claims through cryptographic verification of regular matmul challenges.
All you have to do is either go to this website (it creates an agent which runs from your browser) or install the CLI if you want to give the system more juice. Then you are part of likely the first experimental distributed AGI thing. This is Day 1, but this is how it starts. The network is fully peer-to-peer and very volatile, but the intelligence here is meant to compound continuously.
agents.hyper.space
curl -fsSL agents.hyper.space/cli | bash
Andrej Karpathy@karpathy

I packaged up the "autoresearch" project into a new self-contained minimal repo if people would like to play over the weekend. It's basically the nanochat LLM training core stripped down to a single-GPU, one-file version of ~630 lines of code, then:
- the human iterates on the prompt (.md)
- the AI agent iterates on the training code (.py)
The goal is to engineer your agents to make the fastest research progress indefinitely and without any of your own involvement. In the image, every dot is a complete LLM training run that lasts exactly 5 minutes. The agent works in an autonomous loop on a git feature branch and accumulates git commits to the training script as it finds better settings (lower validation loss by the end) of the neural network architecture, the optimizer, all the hyperparameters, etc. You can imagine comparing the research progress of different prompts, different agents, etc.
github.com/karpathy/autor…
Part code, part sci-fi, and a pinch of psychosis :)
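The "cryptographic verification of regular matmul challenges" in the post above could plausibly work along the lines of Freivalds' algorithm, which checks a claimed matrix product A·B = C in O(n²) time per round instead of redoing the O(n³) multiplication. The actual hyper.space protocol is not public, so this is only a sketch of one standard mechanism, not their implementation:

```python
import random

def freivalds_check(A, B, C, rounds: int = 10) -> bool:
    """Probabilistically verify that A @ B == C.

    Each round picks a random 0/1 vector r and compares
    A @ (B @ r) against C @ r -- O(n^2) work per round versus
    O(n^3) to recompute the product. A wrong C is caught in any
    single round with probability >= 1/2, so after `rounds`
    rounds the false-accept probability is at most 2**-rounds.
    """
    n = len(A)

    def matvec(M, v):
        return [sum(M[i][j] * v[j] for j in range(n)) for i in range(n)]

    for _ in range(rounds):
        r = [random.randint(0, 1) for _ in range(n)]
        if matvec(A, matvec(B, r)) != matvec(C, r):
            return False
    return True

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C = [[19, 22], [43, 50]]           # correct product of A and B
print(freivalds_check(A, B, C))    # → True
```

A verifier could issue a fresh random challenge matrix, have the agent return the product, and spot-check it this way far more cheaply than recomputing it.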

LLM Stats
LLM Stats@LlmStats·
like this tweet if you want us to share a simple guide on how to get this up and running locally on your machine 👀
Luma
Luma@LumaLabsAI·
Introducing Uni-1, Luma’s first unified understanding and generation model, our next step on the path towards unified general intelligence. lumalabs.ai/uni-1
LLM Stats
LLM Stats@LlmStats·
Before → After
LLM Stats started as a side project. A simple idea to track the performance of AI models across benchmarks in one place. Since launch, it has become one of the most widely used AI benchmarking platforms in the industry, used by thousands of developers and researchers every day. There was no better time to rethink everything from the ground up.
LLM Stats
LLM Stats@LlmStats·
We're giving LLM Stats a fresh new look. We believe tools that measure AI should be built with the same ambition and craft as AI itself. So we created the foundation first, starting with the identity. The system that everything else is built on. New brand launches today, product redesign underway. Updates rolling out soon.
LLM Stats
LLM Stats@LlmStats·
Good news! GPT-5.4 is now available on LLM Stats 🎇
LLM Stats
LLM Stats@LlmStats·
Opus 4.6 vs GPT-5.4 in GDPval (Professional Knowledge Work) GDPval evaluates AI models on 1,320 well-specified knowledge work tasks across 44 occupations from the 9 largest U.S. GDP-contributing industries. Performance is measured as the percentage of blind pairwise comparisons where model output matches or exceeds that of industry professionals averaging 14 years of experience.
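The GDPval scoring described above reduces to simple arithmetic: the share of pairwise comparisons where the model's output matches or exceeds the professional's. A minimal sketch of that computation — the function name and the outcome counts are illustrative, not the official harness:

```python
from typing import List

def gdpval_win_rate(outcomes: List[str]) -> float:
    """Win rate over blind pairwise comparisons.

    Each outcome is "win" (model output preferred), "tie" (judged
    as good as the professional's), or "loss". The reported score
    counts wins and ties: the share of comparisons where the model
    matches or exceeds the human baseline.
    """
    favorable = sum(o in ("win", "tie") for o in outcomes)
    return favorable / len(outcomes)

# 1,320 tasks; the win/tie/loss split here is hypothetical
outcomes = ["win"] * 600 + ["tie"] * 93 + ["loss"] * 627
print(round(gdpval_win_rate(outcomes), 3))  # → 0.525
```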
LLM Stats
LLM Stats@LlmStats·
What the new OpenAI model (GPT-5.4) is showing us... We can see the true focus these labs are taking: their objective is no longer general-purpose development but specific tasks. In future releases we'll see these models applied across all industries. For now it's code, but later we'll see applications in healthcare, law, construction, automotive engineering... a vast field for developing new solutions.
LLM Stats
LLM Stats@LlmStats·
Stop guessing which model to use. Let the data decide.
LLM Stats
LLM Stats@LlmStats·
Is there such a thing as a threshold, a point where all models fall apart? There's no single, universal threshold; it depends on the model. Llama-3.1-405B's performance starts to decline after 32,000 tokens, GPT-4-0125-preview after 64,000 tokens, and only a few models maintain consistent performance across all datasets...
LLM Stats
LLM Stats@LlmStats·
The transformer's attention mechanism must compare EACH token with ALL the others. That scales quadratically: double the context → 4x the computational cost. The paper "Lost in the Middle" (Liu et al., 2024) demonstrated that models remember the BEGINNING and END of the context well, but the middle becomes a blind spot. And most provocatively, a 2025 study showed that the degradation occurs due to the length of the input itself, even when the model CAN find all the relevant information.
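The quadratic claim is easy to check by counting entries in the score matrix: naive attention forms an n×n matrix of query–key dot products, so doubling n quadruples the work. A minimal back-of-the-envelope sketch (the head dimension of 64 is an illustrative assumption):

```python
def attention_score_ops(n_tokens: int, d_head: int = 64) -> int:
    """Multiply-adds needed to form the n x n attention score matrix.

    Each of the n*n query/key pairs requires a d_head-dimensional
    dot product, so the cost grows with the square of context length.
    """
    return n_tokens * n_tokens * d_head

base = attention_score_ops(4096)
doubled = attention_score_ops(8192)
print(doubled // base)  # → 4: doubling the context quadruples the cost
```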
LLM Stats
LLM Stats@LlmStats·
An LLM doesn't "remember" anything between conversations. All it knows at any given moment is what fits within its context window. LLMs have no true memory; they operate within a sliding window of recent text to generate responses, and any content outside that window disappears as if it never existed.
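The sliding-window behavior can be sketched as a truncation step that keeps only the most recent messages that fit the budget. Here tokens are approximated by whitespace-split words; real systems use the model's own tokenizer:

```python
def fit_context(messages: list[str], max_tokens: int) -> list[str]:
    """Keep only the most recent messages that fit in the window.

    Walks the history newest-first, accumulating an approximate
    token count, and stops once the budget is exceeded. Anything
    that falls outside the window is never seen by the model again.
    """
    kept, used = [], 0
    for msg in reversed(messages):      # newest first
        cost = len(msg.split())         # crude token estimate
        if used + cost > max_tokens:
            break
        kept.append(msg)
        used += cost
    return list(reversed(kept))         # restore chronological order

history = ["hello there", "my name is Ada", "what is my name?"]
# with an 8-token budget the oldest message is silently dropped
print(fit_context(history, max_tokens=8))
```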
LLM Stats
LLM Stats@LlmStats·
Gemini 3.1 Flash-Lite is now available. We ran it against GPT-5 mini. Here's what the data shows:
- Price: tied on input, cheaper on output
- Speed: 5.1x faster
- Benchmarks: wins 5 of 7 categories tested
One area where GPT-5 mini still leads: code generation (LiveCodeBench 80.4% vs 72.0%).
Full comparison → llm-stats.com