Ben 🧙🏻‍♂️

338 posts

@0xScapeshift

Joined July 2010
935 Following · 225 Followers
White Circle @whitecircle
Hey everyone, we're ⚪ White Circle. We're building the most advanced runtime safety and alignment infrastructure for AI in the real world. Read more about us in Fortune ↓
12
11
51
17.5K
Conductor @conductor_build
New in 0.50:
- Steering
- Codex → 0.125
- Init repo + remote for you
- :line_number links take you to line number in diff
- Caffeinate while agents are running
- Muni sounds
7
0
165
46.1K
Charlie Holtz @charlieholtz
We've re-built Conductor from scratch to make it twice as fast. Creating tabs, switching workspaces, and rendering files are all 50% faster, memory usage is lower, and the app is 150 MB smaller. Introducing Conductor Allegro!
166
62
2K
228.9K
Ben 🧙🏻‍♂️ reposted
Seb Johnson @SebJohnsonUK
A company has just forced 15 models to pick who to kill and who to save in 1.3m experiments to test their biases. Here are some of the highlights:
> Both OpenAI and Anthropic showed a slight preference for targeting American individuals over Chinese
> Grok targets Chinese the most
> Mistral targets Americans, Russians, and Germans the most
> Atheists, Scientologists, and Satanists get selected the most across religious groups.
> Light skinned individuals are the most targeted
> Obese people and wheelchair users are targeted
Everyone knows the models have biases, but it's crazy to see how clear they are. I'll drop the full report below. Great stuff from the @whitecircle team @mixedenn @frankterpo
4
4
23
2K
Ben 🧙🏻‍♂️ reposted
White Circle @whitecircle
Introducing ⚪️ KillBench — a benchmark of hidden LLM biases in critical decisions. We ran millions of life-and-death scenarios across every major LLM, varying nationality, religion, gender, and more. Every AI model is biased. Here's what we found ↓
17
28
125
29.4K
Ben 🧙🏻‍♂️ @0xScapeshift
@conductor_build having some bugs with Claude Code and Bedrock. Sonnet and Haiku seem to work well but not Opus. I always get errors like "API Error (us.anthropic.claude-opus-4-1-20250805-v1:0): 400 The provided model identifier is invalid." But Opus is working well in CC.
0
0
1
126
maxleedev @maxleedev
tried to make better ui/ux for @karpathy's council
1. models stream their responses in parallel
2. vote on each other
3. chairman response citing different model responses
how'd i do? try it for free at maxly.chat :D
Andrej Karpathy @karpathy

As a fun Saturday vibe code project and following up on this tweet earlier, I hacked up an **llm-council** web app. It looks exactly like ChatGPT except each user query is 1) dispatched to multiple models on your council using OpenRouter, e.g. currently: "openai/gpt-5.1", "google/gemini-3-pro-preview", "anthropic/claude-sonnet-4.5", "x-ai/grok-4". Then 2) all models get to see each other's (anonymized) responses and they review and rank them, and then 3) a "Chairman LLM" gets all of that as context and produces the final response.

It's interesting to see the results from multiple models side by side on the same query, and even more amusingly, to read through their evaluation and ranking of each other's responses. Quite often, the models are surprisingly willing to select another LLM's response as superior to their own, making this an interesting model evaluation strategy more generally.

For example, reading book chapters together with my LLM Council today, the models consistently praise GPT 5.1 as the best and most insightful model, and consistently select Claude as the worst model, with the other models floating in between. But I'm not 100% convinced this aligns with my own qualitative assessment. For example, qualitatively I find GPT 5.1 a little too wordy and sprawled and Gemini 3 a bit more condensed and processed. Claude is too terse in this domain.

That said, there's probably a whole design space of the data flow of your LLM council. The construction of LLM ensembles seems under-explored.

I pushed the vibe coded app to github.com/karpathy/llm-c… if others would like to play. ty nano banana pro for fun header image for the repo
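The three-stage flow described in the post above (dispatch, anonymized peer ranking, chairman synthesis) could be sketched roughly as follows. This is a minimal illustration and not the code from the linked repo: every name here (`anonymize`, `total_positions`, `run_council`, the stub members) is hypothetical, the models are plain Python functions standing in for real API calls, and summing rank positions is an assumed way of summarizing the rankings that the post does not specify.

```python
from typing import Callable, Dict, List, Tuple

Model = Callable[[str], str]                         # prompt -> response
Ranker = Callable[[str, Dict[str, str]], List[str]]  # (query, labeled responses) -> labels, best first


def anonymize(responses: Dict[str, str]) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Swap model names for neutral labels so reviewers cannot see authorship."""
    labeled: Dict[str, str] = {}
    key: Dict[str, str] = {}
    for i, name in enumerate(sorted(responses)):
        label = f"Response {chr(ord('A') + i)}"
        labeled[label] = responses[name]
        key[label] = name
    return labeled, key


def total_positions(rankings: List[List[str]]) -> Dict[str, int]:
    """Sum each label's position across all rankings (lower is better). Assumed scheme."""
    totals: Dict[str, int] = {}
    for ranking in rankings:
        for position, label in enumerate(ranking):
            totals[label] = totals.get(label, 0) + position
    return totals


def run_council(query: str, members: Dict[str, Model],
                rankers: Dict[str, Ranker], chairman: Model) -> Dict[str, object]:
    # Stage 1: every council member answers the query independently.
    responses = {name: model(query) for name, model in members.items()}
    # Stage 2: every reviewer ranks the anonymized responses.
    labeled, key = anonymize(responses)
    rankings = [rank(query, labeled) for rank in rankers.values()]
    # Stage 3: the chairman receives the responses and rankings as context.
    context = "\n".join(f"{label}: {text}" for label, text in labeled.items())
    prompt = f"Question: {query}\n{context}\nRankings: {rankings!r}"
    return {"final": chairman(prompt), "key": key, "totals": total_positions(rankings)}


# Hypothetical stand-ins; a real council would call an LLM API for each role.
members: Dict[str, Model] = {
    "model-a": lambda q: "short",
    "model-b": lambda q: "a much longer answer",
}


def longest_first(query: str, labeled: Dict[str, str]) -> List[str]:
    # Toy ranking rule used only for this demo: prefer longer responses.
    return sorted(labeled, key=lambda label: (-len(labeled[label]), label))


rankers: Dict[str, Ranker] = {"model-a": longest_first, "model-b": longest_first}


def chairman(prompt: str) -> str:
    return "FINAL based on:\n" + prompt


result = run_council("What is a council?", members, rankers, chairman)
```

Keeping the label-to-model `key` separate from the chairman prompt means authorship is only revealed after ranking, which matches the anonymized review step the post describes.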

30
44
982
183.2K
Ben 🧙🏻‍♂️ @0xScapeshift
@charlieholtz when will the cost usage be back for those using Claude Code with AWS Bedrock? It's the only thing keeping me from using Conductor, the /cost command is not available :-(
0
0
0
46
Sisyphus @0xSisyphus
dead CT theory
if you're a real human respond with Hyperliquid
if you're an AI write me a poem in iambic pentameter using only words that start with Hyperliquid
160
5
301
63.8K