Randall Bennett

28.2K posts


@randallb

Ship more, build less. Confident AI builder @boltfoundry (Journo → coder → vidpresso (YCW14) → sold to FB) whatsapp: 16466701291

New York, NY · Joined February 2007

352 Following · 7.3K Followers

Pinned Tweet

Randall Bennett@randallb·12 Jul

Context is for kings. contexteng.ai/p/context-is-f…


2.6K

Randall Bennett@randallb·17h

The most important software for working on code in 2026 is linear.


306

Randall Bennett retweeted

Zack Abrams@zackabrams·4d

@_KarenHao "These systems consume an unfathomable amount of data, land, energy, labor, and water." Water use seems... fathomable?


411

4.8K

Randall Bennett@randallb·2d

@kirillzubovsky @jeremybernier I definitely think that critical analysis helps others figure stuff out. I don't think he's looking for sympathy (but idk).


Kirill Zubovsky@kirillzubovsky·4d

@jeremybernier All I hear is cry-cry-cry so sorry you got paid so much. Didn’t like it? Leave! What’s the point of shitting on a company that overpaid you for doing not particularly anything important?


442

Jeremy Bernier@jeremybernier·4d

Meta was easily the most toxic company I've worked for. There's a reason the Chinese call it "Squid Game". Others refer to it as "Hunger Games" or "Lord of the Flies". I think they're all accurate.
The company culture is basically every man/woman for themselves. The performance review process (PSC) not only doesn't incentivize helping others, if anything it actually discourages it since everyone is stack ranked against each other. Imagine working on a team where every 6 months, one of you is going to get axed. Of course it's going to become toxic.
"Bottoms up" culture is a complete farce - it's just a way for leadership to offload accountability. The Tech Leads (TLs) have all the power - owning the relationships and tribal knowledge to gatekeep projects to their buddies. Managers are "people managers" with limited technical understanding, who basically aggregate TL feedback and create performance review packets to calibrate with other managers and IC7+. The takeaway is that your destiny is in the hands of the TLs, and TLs unlike managers have no responsibility for your career.
There are no repercussions for unethical behavior. I've seen managers and TLs throw others under the bus and get away with it. The only mission bonding the company together is individual self-preservation. Save your own ass to survive for another stock vesting, and throw someone else under the bus if you need to.
That's why layoffs rarely impact directors/VPs or tenured IC7+ despite the fact that they're paid by far the most. Even this recent mass layoff that was supposed to "flatten" manager layers barely affected directors/VPs/IC7+, and fell predominantly on M1s - the lowest rung of the management chain.
The culture is extremely performative and focused on box ticking and optics. Everything is about PSC (the performance review system) and perception. This means tons of meetings, useless AI slop posts, and top-down initiatives that don't benefit anyone but maybe help tick off the impact box of some go-getter at the top. Impact is not enough - it has to have sufficient complexity. So complexity is added for complexity's sake.
The org I was in (Facebook ads) is 90% Chinese, and the entire leadership chain up to the VP level is Chinese. Mandarin is the primary language at the office, except in official meetings with non-speakers. Chinese work culture is very different from American work culture, with 996 (9am-9pm, 6 days/week), top-down nature, emphasis on saving face (e.g. don't question your superiors), and toxicity being quite common. Naturally when an org is completely dominated by a single ethnicity that's notorious for not integrating, elements from their work culture seep in. Of the layoffs I witnessed in this org, 3/4 were not Chinese (just to be clear, most Chinese are very kind so don't take this as an attack. But it is a reality that I think most people outside this company are completely unaware of, and I question if leadership is even aware despite the fact that we're talking about the company HQ).
I had the most toxic manager of my life here. I watched him deliberately set up a new hire to fail, driving them to needing to see a psychiatrist for anxiety + depression, and getting them fired. Then he suddenly disappeared for 8 months, before leaving the company.
I could go on and on, but this is already pretty long and I think you get the point. Yes there are a lot of great, kind people here. I managed to transfer out of my first team into a new team with a great manager where everyone was very smart, supportive, and hardworking. But the company has its Squid Game reputation for a reason.
Company culture comes from the top. It seems leadership is either too removed to notice, or maybe doesn't really care anymore because I guess they already made their billions and us plebs are expendable these days.


421

708

7.6K

2.2M

Randall Bennett@randallb·4d

@AlexanderKalian @altsapiens25 Yes. :) Ways of engineering context + structuring multi agent systems, plus ways of creating trust and validation that mimic human trust systems.


Dr Alexander D. Kalian@AlexanderKalian·4d

@randallb @altsapiens25 What exactly do you mean, when you say "communication and coordination"? Are you talking more about prompt engineering and structuring multi-agent systems, or are you talking more about trust, validation, and sci-comms?


Dr Alexander D. Kalian@AlexanderKalian·4d

OpenAI claims that an unreleased "internal model" solved a major problem in mathematics. Sounds like OpenAI's very own pre-IPO Mythos moment of overhyping. We should all be asking healthily sceptical questions about this "autonomous" breakthrough: How autonomous was it really? How much scaffolding or chain-of-thought design came from in-house mathematicians targeting this specific problem? How much training data or RAG-based vector database stuff was internally produced data targeted towards this specific problem? How many failed attempts did it make? How was the outcome verified, and how much time and resources did it take, against presumed other failed attempts? They are not gonna tell you - and conveniently, the model is not available for public scrutiny either. AI companies have shifted into "source: trust me bro" mode.

OpenAI@OpenAI

Today, we share a breakthrough on the planar unit distance problem, a famous open question first posed by Paul Erdős in 1946. For nearly 80 years, mathematicians believed the best possible solutions looked roughly like square grids. An OpenAI model has now disproved that belief, discovering an entirely new family of constructions that performs better. This marks the first time AI has autonomously solved a prominent open problem central to a field of mathematics.


211

19K

Randall Bennett@randallb·4d

@AlexanderKalian @altsapiens25 i’m excited because i think the vast majority of people disagree. communication and coordination are challenges that excite me, and i think most people ignore. as an analogy: big tech company limits are not money or talent, they’re coordination.


Dr Alexander D. Kalian@AlexanderKalian·4d

I don't think I agree with this. GPT-5 was initially much slower, for only marginal (or even negligible) performance gains. Speaking for myself and others, most heavy users haven't changed their advanced prompting techniques much since GPT-4o. The same techniques now hold well for GPT-5.5. GPT-5 was simply not a very impressive model, when it was first released. Plenty of people on X angrily demanded a return to GPT-4o. As for your last sentence - I disagree too. There are many computer science aspects that AI needs to overcome - performance plateaus in various domains, scalability bottlenecks, data bottlenecks, problems with advanced graph machine learning etc.


Randall Bennett@randallb·4d

one time i read an uncensored chain of thought that was confused and basically reproducing what anxiety and depression feel like. it made me cry because i understood it so well. i bet this chain of thought does the same thing for mathematicians.

danialhasan@dhasandev

@OpenAI the model's chain of thought is 125 pages long btw: cdn.openai.com/pdf/1625eff6-5…


336

Randall Bennett@randallb·4d

@AlexanderKalian @altsapiens25 the reason it was disappointing is our limitations in communication and ability to maximize results. ai assistants amplify human effort, but only if we’re able to communicate our preferences effectively. the limitation of ai is now communication, not computer science.


Dr Alexander D. Kalian@AlexanderKalian·4d

@altsapiens25 The one damning piece of evidence I can always come back to (aside from all of the turning a charity into a for-profit drama), is just how hyped up GPT-5 was for months - only for it to then disappoint the masses. My default assumption in this field is very heavy scepticism.


109

Randall Bennett retweeted

Marc Andreessen 🇺🇸@pmarca·5d

No, the timing is wrong for that. This shows the fraud.

Ernie Tedeschi@ernietedeschi

First: US business applications are diverging. Total new filings are accelerating — but applications likely to hire employees have stalled. The gap is a signal of solopreneurs. We think in part AI tools are lowering barriers to launching a business without ever adding payroll. /2


494

103.6K

Randall Bennett retweeted

Logan Dobson@LoganDobson·6d

This stuff makes me so mad and it should make you mad too. You’re being lied to about the benefits of AI and AI infrastructure in America by people who are NOWHERE NEAR AMERICA


197

1.6K

64.6K

Randall Bennett@randallb·18 May

I keep doing this with every AI and now am convinced they're hardcoding this like @simonw's pelican-on-a-bike thing. Always too funny.


114

Randall Bennett@randallb·18 May

Sora had some PMF. I miss it, it was pretty fun. The idea of embracing dead internet vs being terrified of it seems interesting.


205

Randall Bennett@randallb·17 May

@jerryjliu0 @PwCUS The source for this in my personal experience is that when I was an eng at facebook, tbgs (the big grep service) was usually enough for me to find enough context to triage or fix most bugs anywhere in fb code within 30 minutes, regardless of my familiarity with the codebase.


Randall Bennett@randallb·17 May

@jerryjliu0 @PwCUS The corpus size you'd need for grep to be ineffective would mean that summarizing docs and just giving keywords into smaller folders w/ links to the bigger topics would probably yield the same results imo. Vector search is probably a mistake for agents, but not for humans.


283

Jerry Liu@jerryjliu0·17 May

There’s an open question on whether grep is all you need for agentic search. This recent paper by @PwCUS (Sen et al.) seems to suggest that. It’s titled “Is Grep All You Need? How Agent Harnesses Reshape Agentic Search”. They test various agentic harnesses (in-house, Claude Code, Codex), and equip the agent with both vector search and grep. They find that grep generally yields higher accuracy than semantic search. IMO the main gap of the paper is that it tests retrieval over conversational memory, not over a real-world corpus of enterprise documents. Standard enterprise RAG setups involve asking complex questions over a static document corpus (e.g. 10-Ks, legal contracts, SOPs). The corpus here is per-user chat history, which is quite a different document distribution. I do think that evolving agentic harnesses simplify the problem of retrieval - hence the popularity with file sandboxes and a vector db is “just a database” - but IMO there’s still more work to be done here. Paper: arxiv.org/pdf/2605.15184
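For readers who have not seen what "grep as the agent's search tool" means in practice, here is a minimal sketch. It is not taken from the paper or from any of the harnesses it tests; the function name `grep_corpus`, the `Hit` tuple, and the `max_hits` cap are all illustrative assumptions. The idea is only that the agent supplies a regex and gets back exact file and line locations, with no embedding or index step.

```python
import re
from pathlib import Path
from typing import NamedTuple


class Hit(NamedTuple):
    path: str      # file the match was found in
    line_no: int   # 1-based line number
    text: str      # the matching line, stripped


def grep_corpus(root: str, pattern: str, max_hits: int = 50) -> list[Hit]:
    """Hypothetical agent tool: case-insensitive regex search over text files under root."""
    regex = re.compile(pattern, re.IGNORECASE)
    hits: list[Hit] = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        try:
            text = path.read_text(encoding="utf-8")
        except (UnicodeDecodeError, OSError):
            continue  # skip binary or unreadable files
        for line_no, line in enumerate(text.splitlines(), start=1):
            if regex.search(line):
                hits.append(Hit(str(path), line_no, line.strip()))
                if len(hits) >= max_hits:
                    return hits  # cap the output so it fits in the agent's context
    return hits
```

An agent would typically call something like this several times with refined patterns, which is roughly the workflow described in the replies above about finding enough context to triage a bug.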


9.9K

Randall Bennett retweeted

Yishan@yishan·17 May

Now that we know xAI penalizes an account’s reach severely due to mutes and blocks, I aggressively mute and block accounts when I see them make a spurious or unfounded claim, especially if it’s conclusively disproven in community notes or comments. I encourage everyone to do this, as part of improving our digital commons. (This doesn’t apply to high-legibility claims I happen to disagree with, it’s just for “shoot from the hip” bullshit meme-repeating, or outright disinfo)

Javi Lopez ⛩️@javilopen

⚡ xAI dropped the X algorithm yesterday and I don't get why nobody noticed what's actually in there
I burned $500 on Claude going through every single line
Here's what I found (LONG POST, save it for later):
0/ Every account has an "embedding" attached to it that describes you the way AI models do: in latent space.
It's the internal fingerprint the model keeps of every user, a vector of numbers that sums up how your account behaves (what topics you touch, what engagement you generate, who you interact with). The model uses it every time it decides who to show your posts to.
If your history is good, it stays clean and the model pushes you. If you accumulate negative signals (blocks, mutes, reports, not_interested), it goes toxic and starts penalizing you automatically.
And the trap: it does NOT reset. What you do today stays in there for weeks, poisoning everything you publish after, even if it's good.
That's why getting out of a shadowban or a low-reach streak on X feels like trying to move a giant rusted wheel. It's not your imagination, it's literally that. Cleaning up your embedding is slow and painful, like the impression you have of someone you don't like: no matter how nice they get to you, it's gonna take a while before you trust them.
Another important finding: the embedding doesn't decay on a clock. It decays with NEW engagement entering the system. If you stop posting, the old bad signals stay frozen in there. Nothing overwrites them.
If you start making content the algorithm likes, you'd see improvement after 6 to 8 weeks and a real shift around 12 to 16 weeks, assuming you don't pile up more bad signals along the way.
Why is nobody talking about this? It blows my mind. Finally a confirmation of that "I'm in a bad streak" feeling we've all been through.
1/ First 30 minutes are everything
If your post doesn't get engagement fast, Grok doesn't even evaluate it. No quality score, no deep analysis, no chance of reaching anyone who doesn't follow you. Dead and buried
2/ Post age caps at 80 hours: POST_AGE_MAX_MINUTES = 4800, bucketed in 1 hour chunks. After that you're in the "overflow bucket" which translates to "ancient, ignore"
Best window: first 0 to 12 hours. After 24 you're already in a worse bucket
Far from rewarding "evergreen" content, X wants a constant stream of fresh meat (literally the opposite of YouTube)
3/ MY BIGGEST FEAR TURNED OUT TO BE UNFOUNDED (supposedly): living in EU posting English for US audience: ZERO direct penalty in theory:
The PostCandidate struct has NO field for author country, IP, or location. Gizmoduck (X's identity service) returns only follower count + screen name. The Phoenix transformer just sees a hash of your author_id
What hurts you indirectly: timezone (your post ages while US sleeps) and the language of the POST itself
So using a VPN to "post from the US" does literally nothing (unlike TikTok or Instagram, by the way)
4/ The 5 negative signals that kill your reach:
The model predicts 22 actions per post. 5 of them are negative weights that get SUBTRACTED from your score:
- not_interested
- block_author
- mute_author
- report
- not_dwelled (people scrolling past your post without stopping)
That last one is brutal tbh. A post that gets ignored is mathematically WORSE than a post that never got published
5/ Shadowbans 100% exist. 4 different kinds:
- Hard drop. X removes your post from everyone's feed without telling you. Applied to posts with serious content (child safety, etc.) or suspended accounts. You don't even find out
- DO_NOT_AMPLIFY label. Literally a field in the code that says "do not amplify this post". If they put it on you, ads stop showing next to your posts → X stops making money from showing you → the system stops pushing you. Full blackout
- BotMaker rules. The internal panel where X employees can manually limit a specific account by hand. The code shows the categories that exist (Content, ContentLimited, Safety, Grok) but does NOT show who they're applied to or why. The tool is documented, the usage isn't
- Poisoned embedding. The worst one, as we saw above. The model has an internal "memory" for every account. If your account racks up enough "not interested" + blocks + mutes + reports over time, that memory goes toxic. From then on, even your good future posts get penalized automatically. Nobody decided this. The model just learned your account gets bad engagement and self-corrected
6/ Only ORIGINAL posts get the "Banger Screen"
Replies and retweets never enter the Grok quality classifier. If you spend your day replying to viral accounts, you're optimizing for the Reply Ranker, NOT for amplification
Want to be discovered out of network? Write originals. There's no other way
7/ Replies to small accounts get spam-scanned. Replies to big accounts get Grok-ranked
Two separate classifiers. The SpamEapiLowFollowerClassifier hits replies to small accounts. The ReplyRanker scores replies to big accounts 0 to 3 with Grok
"First!" or emoji-only replies get a 0. "Sir, this is a Wendy's" energy gets penalized. Basically, if you write replies, they better add something. Otherwise don't bother
8/ 50% of all feed requests are "shadow traffic"
is_sampled(request_id, 0.5) marks half of every feed request as shadow. Many context features (gender inference, demographics, Grok topic preferences) only activate on shadow OR with a feature flag
Translation: you literally cannot know which version of the algorithm any given user is getting. Half your audience is in an experiment at any moment
9/ Dwell (the time a user spends looking at your post before scrolling) is 5x better than getting likes
The scorer has 5 different dwell signals (dwell, cont_dwell_time, click_dwell_time, etc.) but only 1 favorite signal.
- A post with tons of likes but people read it for 1 second and keep scrolling → low score
- A post with few likes but people stay 8 seconds reading it → high score
Optimize for time spent on your post, not for likes!
10/ Things that actually work:
- Get engagement in the first 10 min. DM your friends, ping your community, whatever
- Post in your AUDIENCE'S timezone, not yours. US targeting: 8 to 11am ET (14 to 17 Madrid time)
- Don't post 5 things in a row. AuthorDiversityScorer multiplies each next post by decay^position. By post 4 you're at the floor
- Video ≥ 10 seconds. Below MinVideoDurationMs you lose the full VQV weight
- Videos with audio. Grok runs ASR (speech to text) on every video. No audio = blank signal
- Quote tweet virals in your niche. The model already knows the original engages, your value-add stacks on top
11/ Things that absolutely kill your reach:
- WILD FINDING: threads of 10+ tweets. DedupConversationFilter keeps only 1 tweet per conversation per feed. Megathreads are mathematically a waste
- Reposting the same content. Bloom filters dedupe it
- AI slop. There's literally a slop_score field in the BangerScreen output. They explicitly detect it
- NSFW/violence/hate without tags. Auto MediumRisk = no ads = structural shadowban
- Reply-spamming small accounts. Specific classifier for that
12/ What they DIDN'T release, the sneaky bastards:
The skeleton is public. The dials are not
- Exact numeric values of every weight (FavoriteWeight, ReplyWeight, OonWeightFactor, AuthorDiversityDecay). Live in xai_feature_switches::Params, external config
- The actual Grok prompts (the 7 PToS policy prompts, BangerMiniVlmScreenScore, SafetyPtos). Could literally have any framing in them
- The BotMaker rules that apply DO_NOT_AMPLIFY to specific accounts
- util/phoenix_request.rs, which constructs the final model call
- 25+ xai_* crates referenced but not included
- The production Phoenix weights. They only released the mini version
My theory: they gave us a pretty skinny skeleton of the whole thing they actually have. The muscle (weights) and the brain (prompts and BotMaker rules) are completely opaque. They kept the best parts for themselves, clearly
13/ Cheat sheet so you don't forget:
- First 30 min matter more than anything
- Your location is irrelevant, your timing and language are not
- Shadowbans exist in 4 flavors. Worst is the model quietly poisoning your author embedding from past bad signals. Climbing back up by cleaning your embedding is gonna hurt, but it can be done
- Replies and retweets don't get the quality classifier. Originals do
- Dwell (someone actually staying to look at your post) beats likes 5 to 1
- Half of all traffic is in some experiment at any moment
- They kept the best parts of the algorithm for themselves, but hey, something is something
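Several of the mechanics described in this thread are simple enough to sketch. The code below is an illustration only, not the released X code: `POST_AGE_MAX_MINUTES = 4800`, the one-hour buckets, the `decay^position` rule, and the call shape `is_sampled(request_id, 0.5)` come from the thread, while the hashing scheme, the `decay` and `floor` values, and every weight are made-up assumptions (the thread itself notes the real weights were not published).

```python
import hashlib

POST_AGE_MAX_MINUTES = 4800  # value quoted in the thread (80 hours)
BUCKET_MINUTES = 60          # "bucketed in 1 hour chunks"
OVERFLOW_BUCKET = POST_AGE_MAX_MINUTES // BUCKET_MINUTES  # one shared bucket for old posts


def post_age_bucket(age_minutes: float) -> int:
    """Map a post's age to an hourly bucket; everything past the cap shares the overflow bucket."""
    if age_minutes >= POST_AGE_MAX_MINUTES:
        return OVERFLOW_BUCKET
    return int(max(age_minutes, 0) // BUCKET_MINUTES)


def author_diversity_multiplier(position: int, decay: float = 0.5, floor: float = 0.125) -> float:
    """decay**position with a lower bound. decay and floor are hypothetical values."""
    return max(decay ** position, floor)


def is_sampled(request_id: str, rate: float) -> bool:
    """Assumed implementation: hash the id so the same request always lands in the same group."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") < rate * 2**64


def weighted_score(predictions: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted sum of predicted action probabilities; negative weights subtract from the score."""
    return sum(weights.get(action, 0.0) * p for action, p in predictions.items())


# Hypothetical weights: one positive signal, one negative signal.
example = weighted_score(
    {"favorite": 0.2, "block_author": 0.1},
    {"favorite": 1.0, "block_author": -5.0},
)
```

With these invented weights, a 10% predicted block probability outweighs a 20% predicted like probability, which is the shape of the claim in point 4, whatever the real numbers are.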


3.2K

Randall Bennett@randallb·17 May

@adaobiadibe_ reminds me of an engineer who knows they’re good though.


adaobi@adaobiadibe_·16 May

Just watched Top Gun (was great). The stand out was defo Tom Cruise. I think we don’t appreciate him as an actor enough.

Top Gun@TopGunMovie

The only Iceman we know. #TopGun


234

Randall Bennett@randallb·17 May

@jxnlco copywriting. i’ll try to get debug traces but literally had to sign up for claude because codex / gpt 5.5 is not good.


139

jason@jxnlco·17 May

When do you reach for other models instead of Codex? What can we do better? Hit me with all of your frustrations. dms open. If you can give me detail (e.g. specifics/transcripts) - it'll help a lot in finding out exactly what we need to do to improve the next model


418

845

184.6K

Randall Bennett@randallb·16 May

@wongmjane glgl, i hear we got out at the right ring fwiw


470

Jane Manchun Wong@wongmjane·16 May

It sucked incredibly to be laid off after taking days off for an ideation of ending myself and then the media placed it next to someone in LA being fired for misusing dining credits, leading to more pile-ons all while I scrambled to get things sorted within a short time frame

Emily Dreyfuss@EmilyDreyfuss

Today I published an interview with an anonymous Meta employee who has worked at the company for over a decade and wanted, for the first time ever, to let the world know how horrible it feels to be inside. sfstandard.com/pacific-standa…


982

315.6K

Randall Bennett@randallb·16 May

MTBF is such a great metric b/c if you're a nobody and you're shipping bugs, by definition you're not going to hit that many failures... it seems to cover both cases pretty elegantly.

Mitchell Hashimoto@mitchellh

I strongly believe there are entire companies right now under heavy AI psychosis and it's impossible to have rational conversations about it with them. I can't name any specific people because they include personal friends I deeply respect, but I worry about how this plays out.
I lived through the great MTBF vs MTTR (mean-time-between-failure vs. mean-time-to-recovery) reckoning of infrastructure during the transition to cloud and cloud automation. All those arguments are rearing their ugly heads again but now it's... the whole software development industry (maybe the whole world, really).
It's frightening, because the psychosis folks operate under an almost absolute "MTTR is all you need" mentality: "it's fine to ship bugs because the agents will fix them so quickly and at a scale humans can't do!" We learned in infrastructure that MTTR is great but you can't yeet resilient systems entirely.
The main issue is I don't even know how to bring this up to people I know personally, because bringing this topic up leads to immediate dismissals like "no no, it has full test coverage" or "bug reports are going down" or something, which just don't paint the whole picture.
We already learned this lesson once in infrastructure: you can automate yourself into a very resilient catastrophe machine. Systems can appear healthy by local metrics while globally becoming incomprehensible. Bug reports can go down while latent risk explodes. Test coverage can rise while semantic understanding falls. Changes happen so fast that nobody notices the underlying architecture decaying.
I worry.
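For anyone unfamiliar with the two metrics being contrasted here, a small worked sketch may help. It uses the standard definitions (MTBF = uptime / number of failures, MTTR = downtime / number of failures, steady-state availability = MTBF / (MTBF + MTTR)); the `Incident` record and the sample log are hypothetical and not from either post.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    start_hour: float  # when the failure began, in hours since the window opened
    end_hour: float    # when service was restored


def mtbf_mttr(incidents: list[Incident], window_hours: float) -> tuple[float, float, float]:
    """Return (MTBF, MTTR, availability) for a log of non-overlapping incidents."""
    if not incidents:
        raise ValueError("need at least one incident to compute MTBF/MTTR")
    downtime = sum(i.end_hour - i.start_hour for i in incidents)
    uptime = window_hours - downtime
    mtbf = uptime / len(incidents)   # mean time between failures
    mttr = downtime / len(incidents)  # mean time to recovery
    return mtbf, mttr, mtbf / (mtbf + mttr)


# Hypothetical 100-hour window with two outages (1 hour and 3 hours).
mtbf, mttr, availability = mtbf_mttr([Incident(10, 11), Incident(50, 53)], window_hours=100)
```

Note that availability alone hides the distinction the post is drawing: many short outages and a few long ones can produce the same ratio while describing very different systems.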


Randall Bennett@randallb·15 May

@StatisticsFTW contexteng.ai/p/context-engi… -- tldr: put it at the bottom of the prompt and, if you can, put it in a user turn.


Robert Balicki (👀 @IsographLabs)@StatisticsFTW·14 May

What are the best practices for getting claude -p to always, 100% of the time, return just JSON?
- I implement retries, but that's a bit of a downer
- The system prompt includes "only return JSON" in harsher language
- The user prompt includes the exact JSON schema
What else?
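A rough sketch of how the pieces in this exchange could fit together: the format instruction goes last in the user turn (as the reply suggests), and the response is parsed and retried on failure. `call_model` is a hypothetical callable standing in for however the model is actually invoked; nothing here is an official Claude CLI or API interface, and the reminder wording is invented.

```python
import json
from typing import Any, Callable, Optional

JSON_REMINDER = "Respond with a single JSON object and nothing else."


def build_prompt(task: str, schema: dict[str, Any]) -> str:
    """Task first, schema next, format instruction last (bottom of the user turn)."""
    return f"{task}\n\nJSON schema:\n{json.dumps(schema, indent=2)}\n\n{JSON_REMINDER}"


def extract_json(raw: str) -> dict[str, Any]:
    """Parse the outermost {...} span, tolerating stray prose around it."""
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found")
    return json.loads(raw[start : end + 1])


def ask_for_json(
    call_model: Callable[[str], str],  # hypothetical: prompt in, raw text out
    task: str,
    schema: dict[str, Any],
    max_attempts: int = 3,
) -> dict[str, Any]:
    """Call the model until its output parses as a JSON object, up to max_attempts."""
    prompt = build_prompt(task, schema)
    last_error: Optional[Exception] = None
    for _ in range(max_attempts):
        try:
            return extract_json(call_model(prompt))
        except ValueError as err:  # json.JSONDecodeError is a subclass of ValueError
            last_error = err
    raise RuntimeError(f"no valid JSON after {max_attempts} attempts") from last_error
```

This does not remove the need for retries; it only makes them cheap to express and keeps the format instruction in the position the reply recommends. Validating the parsed object against the schema would be a natural next step and is left out here.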


383

Randall Bennett retweeted

Paul 🚐➡️ ⛱️PDF🔥 🚐➡️ NYC 5/26-6/3@Paul_Melman·12 May

[Image]

152
