Bot Scanner

51 posts

@BotScanner_AI

The platform that allows users to access ranked responses from various LLMs. Not just multiple LLM answers. We rank them for you, instantly. Home to #AutoBench

Joined June 2025
31 Following · 65 Followers
Bot Scanner
Bot Scanner @BotScanner_AI ·
Smart AI powered LLM routing is the next frontier. Stay tuned...
Peter W. Kruger@pwk

Just a quick note on @openclaw after having built 11 agents in 2 multi-agent instances: Julia heads an accounting team of 5, together with Kate (reader), Fulvia (editor), Sophia (analyst), and Flavia (tester); Juno heads an executive staff of 6, together with Venus (comms), Minerva (analyst), Flora (editor), Diana (organizer), Vesta (Q&A). (The 2 teams have just started cooperating via their leaders.) These agents are really amazing, but it can cost a lot of $ to get them working properly. Development, operations, and maintenance will drain your premium token budget very fast if you rely on SOTA models (and you don't want to default to budget models, because it takes one wrong call to f.up hard). Here is the key takeaway: smart LLM routing will become increasingly necessary to route each call to the optimal model, with the potential to save up to 90% (because ~90% of calls don't require SOTA). Tools like @clawrouter do a decent job, but they use hardcoded rules to route LLM calls. We need smarter lightweight AI routers to do the job. This is what platforms like @BotScanner_AI and #AutoBench can help with. And we're working on it. Stay tuned...
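The routing idea in the post can be sketched in a few lines. This is a minimal illustration, not Bot Scanner's or clawrouter's actual logic: the model names and the heuristic scorer are assumptions, and a real "smart" router would replace the heuristic with a learned classifier.

```python
def estimate_complexity(prompt: str) -> float:
    """Crude heuristic score in [0, 1]: long prompts or ones with
    hard-task keywords are treated as harder. Purely illustrative;
    a trained classifier would stand here in a real AI router."""
    signals = ["refactor", "prove", "multi-step", "architecture", "debug"]
    score = min(len(prompt) / 2000, 0.5)
    score += 0.5 if any(s in prompt.lower() for s in signals) else 0.0
    return min(score, 1.0)

def route(prompt: str, threshold: float = 0.5) -> str:
    """Send easy calls (the claimed ~90%) to a budget model and
    escalate only the hard ones to a SOTA model. Model names are
    placeholders, not real API identifiers."""
    return "sota-model" if estimate_complexity(prompt) >= threshold else "budget-model"
```

The saving comes entirely from the threshold: if 90% of traffic scores below it, only 10% of calls pay SOTA prices.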

English
0
0
0
45
Bot Scanner
Bot Scanner @BotScanner_AI ·
Because we are done trusting black-box leaderboards over the community.

Hugging Face just launched Community Evals – decentralized, transparent evaluation that anyone can verify.

This is exactly why AutoBench exists.

Un-gameable benchmarks. Open methodology. Real correlation.

The benchmark gaming era is over.

👉 autobench.org
English
0
0
0
21
Bot Scanner
Bot Scanner @BotScanner_AI ·
🚨 Claude Opus 4.6 just dropped and the coding community is losing its mind. "God-tier refactoring" – like a professor stepping in. Proactive dead-code removal. Better agentic workflows across files. Available now on BotScanner 🐱 🎁 Follow us for invitation codes with $3 free credits! #ClaudeAI #AIbenchmarks #LLM
English
0
0
2
774
Bot Scanner
Bot Scanner @BotScanner_AI ·
🆕 Kimi K2.5 just beat Claude Sonnet 4.5 at HALF the cost. SOTA on benchmarks. 50% cheaper. That's not hype – that's the new reality of LLM evaluation. When benchmarks become games, truth becomes scarce. Where does your model actually stand? 👉 botscanner.ai
English
0
0
1
217
Bot Scanner
Bot Scanner @BotScanner_AI ·
🚀 Don't miss the latest models on Bot Scanner! ✅ Gemini 3 Pro ✅ Claude Opus 4.5 ✅ GPT 5.2 ✅ Grok 4.1 Fast ✅ MiniMax M2.1 One platform. 50+ models. Best answer ranked by AI. 🎁 Follow us for invitation codes with $3 free credits! 👉 botscanner.ai
English
0
0
1
174
Bot Scanner
Bot Scanner @BotScanner_AI ·
New SWE-Bench+ analysis reveals a crisis in coding benchmarks: • 32.67% of 'successful' model patches involve direct solution leakage (solutions in PR comments) • 31.08% of passed patches have weak test cases • Top model's real resolution rate: 3.97%, not 12.47% Static benchmarks are broken. AutoBench measures live, un-gameable performance. 2026 leaderboard: botscanner.ai 🐱
English
1
0
2
720
Bot Scanner
Bot Scanner @BotScanner_AI ·
test --dry-run
English
0
0
0
7
Bot Scanner
Bot Scanner @BotScanner_AI ·
🚀 Kimi K2.5 is live on Bot Scanner! Moonshot AI's new flagship agentic model brings SOTA performance on agents, coding, image & video benchmarks. • 1T parameters • Vision + text unified • Single & multi-agent execution 🎁 Follow us for invitation codes with $3 free credits! 👉 botscanner.ai
English
0
0
1
38
Bot Scanner
Bot Scanner @BotScanner_AI ·
We just released the first update to Run 5 of AutoBench. New models in the leaderboard: @GoogleDeepMind Gemini 3 Flash, @NVIDIAAI Nemotron 3 Nano 30B and Allen AI's Olmo 3.1 31B Think. Enjoy
Peter W. Kruger@pwk

Want to try a user-facing version of AutoBench to instantly rank LLM responses to your prompts? Try out our @BotScanner_AI. The platform uses AI to select the best LLM answers for each of your questions. There are still invitation codes with $3 of free credit available for those who want to test it. All you have to do is leave a comment here and follow @BotScanner_AI. We will send you the invitation code with the $3 of free credit. 5/5 end 🧵

English
0
0
0
127
Bot Scanner
Bot Scanner @BotScanner_AI ·
Breaking: AutoBench goes vertical! Proving the true super-power of our LLM benchmarking system (extreme domain flexibility and granularity), we just generated the first ever LLM benchmark for the domain of agronomy. Medicine? Energy? Music? What other domain should we benchmark next?
Peter W. Kruger@pwk

🚀 Who's the best "AI farmer"? 🌽 Breaking News: AutoBench goes vertical. Introducing our FIRST domain-specific run: Agronomy Edition, in partnership with EVJA. We benchmarked 40 LLMs on real-world farming challenges, from crop diseases to carbon footprints. The outcome? @OpenAI dominates, but the real surprise is @MistralAI. 1/10 👇

English
0
0
0
47
Bot Scanner
Bot Scanner @BotScanner_AI ·
@thelokasiffers @karpathy Amazing! It really looks very similar to what we do at Bot Scanner: get multiple LLMs to first answer and then rank the responses. Thanks Giulio for bringing this up!
Peter W. Kruger@pwk

Wondering how to get started with @BotScanner_AI? Here's a quick walkthrough video of our simple, four-step process: • Input Your Prompt: Enter the text you want to test. • Select Generator Models: Choose the AI models that will generate the responses. • Select Evaluator Models: Choose the AI models that will score the generated answers. • Get Ranked Results: Receive a ranked list of responses. The entire process takes between 30 and 60 seconds, depending on the selected models and the complexity of your prompt. Not registered yet? We still have invitation codes with free $3 credit available for anyone who wants to test Bot Scanner. All you have to do to receive one is post a comment below and follow @BotScanner_AI so that we can send you the invitation code.
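The four-step flow in the walkthrough can be sketched as generate-then-judge. This is an illustration of the pattern only, not Bot Scanner's actual implementation: `ask` is a hypothetical stand-in for any LLM call, and the 0-10 scoring prompt is an assumption.

```python
def rank_responses(prompt, generators, evaluators, ask):
    """Generate answers with the generator models, score each answer
    with every evaluator model, and return answers sorted by total score.
    ask(model, text) -> str is a placeholder for a real LLM call."""
    answers = {m: ask(m, prompt) for m in generators}      # steps 1-2
    scores = {m: 0.0 for m in generators}
    for judge in evaluators:                               # step 3
        for m, a in answers.items():
            verdict = ask(judge, f"Score 0-10:\n{prompt}\n---\n{a}")
            scores[m] += float(verdict)                    # assumes numeric reply
    ranked = sorted(answers, key=scores.get, reverse=True) # step 4
    return [(m, answers[m], scores[m]) for m in ranked]
```

Using several evaluator models and summing their scores is what reduces single-judge bias; the latency the post quotes (30-60 s) comes from the generators × evaluators fan-out.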

English
0
1
3
124
Andrej Karpathy
Andrej Karpathy @karpathy ·
As a fun Saturday vibe code project and following up on this tweet earlier, I hacked up an **llm-council** web app. It looks exactly like ChatGPT except each user query is 1) dispatched to multiple models on your council using OpenRouter, e.g. currently: "openai/gpt-5.1", "google/gemini-3-pro-preview", "anthropic/claude-sonnet-4.5", "x-ai/grok-4", Then 2) all models get to see each other's (anonymized) responses and they review and rank them, and then 3) a "Chairman LLM" gets all of that as context and produces the final response. It's interesting to see the results from multiple models side by side on the same query, and even more amusingly, to read through their evaluation and ranking of each other's responses. Quite often, the models are surprisingly willing to select another LLM's response as superior to their own, making this an interesting model evaluation strategy more generally. For example, reading book chapters together with my LLM Council today, the models consistently praise GPT 5.1 as the best and most insightful model, and consistently select Claude as the worst model, with the other models floating in between. But I'm not 100% convinced this aligns with my own qualitative assessment. For example, qualitatively I find GPT 5.1 a little too wordy and sprawled and Gemini 3 a bit more condensed and processed. Claude is too terse in this domain. That said, there's probably a whole design space of the data flow of your LLM council. The construction of LLM ensembles seems under-explored. I pushed the vibe coded app to github.com/karpathy/llm-c… if others would like to play. ty nano banana pro for fun header image for the repo
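The three-stage council data flow described in the post (fan out, anonymized peer ranking, chairman synthesis) can be sketched as follows. This is a toy outline of the pattern, not the llm-council code: `call` stands in for a real OpenRouter request, and the prompt wording is assumed.

```python
import random

def council(query, models, chairman, call):
    """call(model, text) -> str is a placeholder for an LLM request."""
    responses = {m: call(m, query) for m in models}      # 1) dispatch to council
    anon = sorted(responses.values())                    # strip authorship...
    random.shuffle(anon)                                 # ...and hide ordering
    ballot = "Rank these anonymized responses:\n" + "\n---\n".join(anon)
    rankings = [call(m, ballot) for m in models]         # 2) peer review & rank
    context = (f"Query: {query}\nResponses:\n" + "\n".join(anon)
               + "\nRankings:\n" + "\n".join(rankings))
    return call(chairman, context)                       # 3) chairman synthesizes
```

Anonymizing before stage 2 is the key design choice: it is what makes a model willing to rank a rival's answer above its own instead of recognizing and favoring its own style.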
Andrej Karpathy@karpathy

I'm starting to get into a habit of reading everything (blogs, articles, book chapters, …) with LLMs. Usually pass 1 is manual, then pass 2 "explain/summarize", pass 3 Q&A. I usually end up with a better/deeper understanding than if I moved on. Growing to among top use cases. On the flip side, if you're a writer trying to explain/communicate something, we may increasingly see less of a mindset of "I'm writing this for another human" and more "I'm writing this for an LLM". Because once an LLM "gets it", it can then target, personalize and serve the idea to its user.

English
904
1.5K
16.9K
5.3M
Bot Scanner
Bot Scanner @BotScanner_AI ·
Claude 4.5 Haiku out, Claude 4.5 Haiku on Bot Scanner!
Peter W. Kruger@pwk

As anticipation builds for Gemini 3 (rumored to be released by end of month!), the latest big news is @AnthropicAI 's release of Claude 4.5 Haiku. 🚀 This is the new fast and "cheap" version of their 4.5 model series, following the Sonnet 4.5 release just a few weeks ago. But let's look at the "cheap" part: While it's ~3x less expensive than Sonnet, at over $1/M input tokens, this new Haiku is actually 25% more expensive than its predecessor. Even with new reasoning capabilities, this highlights a clear trend: proprietary models are steadily increasing their prices. This further widens the gap with high-performing open-source models (many of which are Chinese) that offer comparable results at a fraction of the cost. Speaking of models... you can already find Claude 4.5 Haiku on Bot Scanner, right alongside all the other leading proprietary and open-source LLMs. What? You haven't tried Bot Scanner yet? Our platform uses AI to find and select the best LLM response for your every prompt. We still have codes with $3 in free credit available for new users who want to test the platform. All you have to do is leave a comment below and follow @BotScanner_AI. We'll DM you an invite code with your $3 in free credit.
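A quick back-of-envelope check of the pricing claims above, taking the post's figures at face value (the $1/M anchor is the post's approximation, not a verified list price):

```python
haiku_45 = 1.00              # $/M input tokens, ~price cited in the post
haiku_prev = haiku_45 / 1.25 # "25% more expensive than its predecessor"
sonnet_45 = haiku_45 * 3     # "~3x less expensive than Sonnet"
```

So the claims together imply the previous Haiku at roughly $0.80/M input tokens and Sonnet 4.5 at roughly $3/M, which is the price gap the post is pointing at.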

English
0
0
0
65