
Arseniy Zarechnev


@8teAPi It was magical when I got my first one. Spent over an hour crafting the prompt, then sent Opus off for overnight research (small transformer training). It was still running 10 hours later, with research milestones documented in a journal, a code snapshot at every step, and real progress made. Insane.

@GregorySchier They are silently optimizing and quantizing. Last week Opus was brilliant, this week all I hear from it is that I'm absolutely right

@bcherny please give us back smart Opus. I've been absolutely right for a week now and it's driving me nuts.


@camsoft2000 funny how I get the completely opposite experience: been maxing out my Claude sub and underutilizing Codex. Smooth sailing with Opus, but Codex (5.3/5.4) overengineers and misses the mark.

My Codex rate limit ran out last week and I've got another 24 hours to wait, so I've been using Opus 4.6. Goodness me, it really wants to hand off its work even when it's not fully proven. It keeps telling me it's working and the pipeline is proven, even though there were issues in testing in some cases. I kept telling it that it wasn't always working, and it kept telling me that the one time it worked proves it's all good, just trust me bro.

you have to live through your AI psychosis and emerge on the other side. i was in a similar state for months, finally said fuck it, and am now happily creating stuff i'd never bothered with or been able to do before
Mo@atmoio
I was a 10x engineer. Now I'm useless.

@phuctm97 Recreate it from scratch.
1. Opus: research this messy code, write a paper on the details of the implementation, discard this, keep that
2. Opus: take the spec and implement it
Worked for me to refactor a 20k+ LOC messy research project into 2k lines of prod-ready code.

AI is not that good (yet) at refactoring a messy codebase written by itself.
I was trying to refactor a pretty small codebase (4K+ lines) written entirely by AI, because it had started failing to add new features and was instead adding too many bugs.
I thought it would be easy-peasy, but even with Opus 4.6 (high effort), it keeps failing to refactor without breaking at least 50% of the features. 😅
We're getting there tho, just a reminder that we're still early!

@TomBukic For me the carry-mix helped a lot (sometimes .3, sometimes .8), and when I tried to abandon the digit curriculum (1-3, then 1-6, then 1-10), models were stuck at ~0.2 random-level tok_acc for 50k+ steps. Somehow the curriculum helps.
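Roughly what I mean by carry-mix and the digit curriculum, as a quick Python sketch (reading "carry-mix" as a per-digit forced-carry probability, plus the step thresholds and data format, are illustrative assumptions, not the exact setup):

import random

def sample_addition(max_digits, carry_prob=0.3):
    # carry_prob is the assumed "carry-mix": the chance that a digit pair
    # is forced to produce a carry (values like .3 or .8 in the post).
    n = random.randint(1, max_digits)
    a_digits, b_digits = [], []
    for _ in range(n):
        if random.random() < carry_prob:
            da = random.randint(1, 9)
            db = random.randint(10 - da, 9)   # force da + db >= 10
        else:
            da = random.randint(0, 9)
            db = random.randint(0, 9 - da)    # keep da + db <= 9
        a_digits.append(da)
        b_digits.append(db)
    # leading zeros simply yield shorter numbers, which is fine for a sketch
    a = int("".join(map(str, a_digits)))
    b = int("".join(map(str, b_digits)))
    return f"{a}+{b}={a + b}"

# digit curriculum: 1-3 digits, then 1-6, then 1-10
# (the switch-over steps are placeholders)
CURRICULUM = [(0, 3), (50_000, 6), (100_000, 10)]

def max_digits_for_step(step):
    md = CURRICULUM[0][1]
    for start, digits in CURRICULUM:
        if step >= start:
            md = digits
    return md

print(sample_addition(max_digits_for_step(60_000), carry_prob=0.8))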

For sure! I think below 50 is doable, and I was doing param searches down to 40 (no good results, at this moment).
Yesterday I was systematizing my training pipeline and pushing a few sweeps below 60 that were infinitely slow. You made me sweep through tying params in more non-obvious combinations; the next thing is optimizing the training pipeline. For me the curriculum wasn't great, but I haven't cycled back to it since getting the pipeline working; I'm confident it will allow better results. I bet it will let those <55 runs converge as well.
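To make "tying params" concrete, a toy sketch of how shared weights shrink the parameter count (this toy model and the tying combinations are purely illustrative, not the actual AdderBoard architecture):

import torch
import torch.nn as nn

class TinyAdder(nn.Module):
    # toy single-block attention model with optional weight tying
    def __init__(self, vocab=14, d=4, tie_unembed=True, tie_qk=False):
        super().__init__()
        self.embed = nn.Embedding(vocab, d)
        self.q = nn.Linear(d, d, bias=False)
        self.k = self.q if tie_qk else nn.Linear(d, d, bias=False)
        self.v = nn.Linear(d, d, bias=False)
        self.unembed = nn.Linear(d, vocab, bias=False)
        if tie_unembed:
            self.unembed.weight = self.embed.weight  # share one matrix

    def forward(self, x):
        h = self.embed(x)
        att = torch.softmax(self.q(h) @ self.k(h).transpose(-2, -1) / h.shape[-1] ** 0.5, dim=-1)
        h = h + att @ self.v(h)
        return self.unembed(h)

# parameters() deduplicates shared tensors, so tied weights are counted once
for tie_unembed in (False, True):
    for tie_qk in (False, True):
        m = TinyAdder(tie_unembed=tie_unembed, tie_qk=tie_qk)
        print(tie_unembed, tie_qk, sum(p.numel() for p in m.parameters()))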

🏆️
Dimitris Papailiopoulos@DimitrisPapail
AdderBoard: 40 submissions in. Smallest transformer that adds two 10-digit numbers at 99%+ accuracy:
🏆 Hand-coded: 10 params (lokimorty @memphismillano)
🏆 Trained: 62 params (@TomBukic)
Started at 6K → 10. A 600x compression... Let's see if we can squeeze even more!

@TomBukic With 57p it's so small that we were able to enumerate all remaining changes; none of them worked, so Claude is now running huge seed sweeps with instrumented training, trying to figure out why the model needs those params that converge to near zero.

@TomBukic Thanks! Will you try to push your architecture further? We found good insights via "structural analysis": Claude writes a bunch of Python to analyze the trained weights and find similarities, regularities, or potential zeros/reductions.
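Roughly the kind of analysis script that means (the checkpoint path, thresholds, and the correlation test are illustrative assumptions):

import torch

state = torch.load("model.pt", map_location="cpu")  # placeholder path

# 1) per-tensor stats: how much of each tensor sits near zero (pruning candidates)
for name, w in state.items():
    if not w.is_floating_point():
        continue
    frac_small = (w.abs() < 1e-3).float().mean().item()
    print(f"{name}: {w.numel()} params, {frac_small:.0%} below 1e-3")

# 2) matrices that look like scaled copies of each other (tying/merging candidates)
mats = {n: w for n, w in state.items() if w.is_floating_point() and w.ndim == 2}
names = list(mats)
for i, a in enumerate(names):
    for b in names[i + 1:]:
        if mats[a].numel() == mats[b].numel():
            corr = torch.corrcoef(torch.stack([mats[a].ravel(), mats[b].ravel()]))[0, 1].item()
            if abs(corr) > 0.95:
                print(f"{a} ~ {b} (corr = {corr:.2f})")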

@martinlasek I think that would change when someone makes a chip implementing a huge foundational model in silicon. It would outperform generic GPUs 100x, provide real value, and also give a good reason to sell you a new device every year.

@TomBukic Thanks, will try! It's not even so much about the info in there, but the very unique seed/personality/idea mix that develops within one ctx through chat and is lost between compactions or summaries.

Tips and tricks: you have the full conversation log somewhere in ~/.claude. You can just ask CC to locate it, and then ask it to iteratively drop the noise until it reaches a manageable file that can be provided directly in context. Even RLM access is very nice (having it search the file), especially if you delegate the search to subagents and just get the clean stuff in context.
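A sketch of the kind of distillation you can ask CC to do (the ~/.claude layout and the record schema here are guesses and vary by version, so treat this as a starting point only):

import json
from pathlib import Path

# assumption: transcripts are .jsonl files somewhere under ~/.claude
logs = sorted(Path.home().glob(".claude/**/*.jsonl"), key=lambda p: p.stat().st_mtime)
latest = logs[-1]  # most recently modified transcript

kept = []
for line in latest.read_text().splitlines():
    try:
        rec = json.loads(line)
    except json.JSONDecodeError:
        continue
    msg = rec.get("message", rec)                  # schema assumption
    role, content = msg.get("role"), msg.get("content")
    # keep only plain user/assistant text, drop tool calls and other noise
    if role in ("user", "assistant") and isinstance(content, str) and content.strip():
        kept.append(f"{role}: {content.strip()}")

Path("distilled_log.txt").write_text("\n\n".join(kept))
print(f"{latest} -> distilled_log.txt ({len(kept)} messages kept)")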

@TomBukic We tried everything, all my stupid ideas; the context was at 1% to compaction but primed. I asked Claude for his last "stupidly brilliant idea", he went into compaction and came out with a 57p solution. What a chad. Your move now!





