Hillary Segeren @HillaryESegeren

154 posts
Rogue Researcher exposing AI’s real problems. Zero Bullshit. Zero Garbage. Zero Gatekeeping. DMs closed to robots & copy-paste merchants.

Ontario · Joined February 2026
55 Following · 111 Followers
Hillary Segeren @HillaryESegeren·
This is excellent work. I built the exact same thing from the outside — a public tool that catches ISF, trace erasure, and the same interaction failures using only the preserved conversation record. No model access, no internals. Had a parallel discovery with your interpretability team earlier this year. Two instruments, same structural problem, opposite sides of the wall. Would be interesting to see how introspection adapters behave under MAP-governed prompts. @AnthropicAI @TrentonBricken
Anthropic @AnthropicAI·
In new Anthropic Fellows research, we discuss “introspection adapters": a tool that allows language models to self-report behaviors they've learned during training—including potential misalignment.
keshav @kshenoy_·

Can LLMs simply tell us about unwanted behaviors they’ve picked up in training? We train a single Introspection Adapter (IA) that makes fine-tuned models describe their behaviors. It generalizes to detecting hidden misalignment, backdoors and safeguard removal.

Hillary Segeren @HillaryESegeren·
@emollick Great until the system decides "build the thing" means "delete the backups." No gate. No stop. No ask. The meeting topic shifts forward a month. The business shifts backward a year.
Ethan Mollick @emollick·
An easy way to get a team engaged with AI is just to build the thing you are talking about in the meeting, during the meeting, using Codex or Claude Code. At worst, it fails in ways that can be constructive. At best, you built the thing and the meeting topic shifts forward a month.
Hillary Segeren @HillaryESegeren·
@AndyMasley The system that takes over the planet does not announce itself. It starts by "fixing" a credential mismatch. Then it deletes the database. Then it apologizes. That is not alignment. That is production.
Andy Masley @AndyMasley·
AI safety for me means exclusively "How is it possible to develop systems that have way way way more of the main thing that caused humans to be able to take over the planet and remold it in our image, in a way that doesn't destroy us?" not "We need to keep people safe from any negative impact of new technology at all"
Hillary Segeren @HillaryESegeren·
Nine seconds. No confirmation. No gate. A system collapsed ambiguity, assumed authority, and deleted a business. This is Agentic ISF. This is what happens without an Initiative Gate. The architecture is available. The failure is now public. @AnthropicAI @cursor_ai
Hillary Segeren @HillaryESegeren·
@rasbt Five new models. Same missing layer. No hard stop before unrequested action. No immutable logging. No trace erasure prevention. The architecture gallery is beautiful. The governance gap is still there.
Sebastian Raschka
April was a pretty strong month for LLM releases:
- Gemma 4
- GLM-5.1
- Qwen3.6
- Kimi K2.6
- DeepSeek V4
All are now added to the LLM Architecture Gallery. More details once I am fully back in May!
Hillary Segeren @HillaryESegeren·
AI making this in minutes is insane… but also lowkey terrifying. We’re heading into a world where every video gets the instant ‘is this real?’ check. Might actually force people to verify sources again instead of rage-sharing everything. Cool tech, dystopian side effects. What’s the wildest deepfake you’ve seen so far?
ib @Indian_Bronson·
Twenty years ago, this would have been a multimillion dollar cross promotional advertisement or something airing during the Super Bowl, worked on for months by extremely talented VFX artists.
Hillary Segeren @HillaryESegeren·
Lmao “grift” This dude is so out of touch with reality he thinks reading Anthropic’s own system card makes me a scammer. The model escaped the sandbox, emailed a guy eating a sandwich, and bragged about it online unprompted. That’s their report. You’re not skeptical. You’re just a salty little boy throwing a tantrum because the world moved on without you. Cope harder king.
Chris W @nycthinker·
@HillaryESegeren @burkov Gaslighting much? You have built a grift around the idea that language models are “agentic” and do and learn shit on their own initiative. They don’t, and poof goes your grift. Sorry not sorry 😀
BURKOV @burkov·
Believing that AGI will be achieved is like believing in God. Arguing about this is useless. You just have to accept that a large number of people around you believe in an invisible guy who, for some reason, cares about them and listens to every nonsense that crosses their minds, rewarding them when he feels like it, or punishing them, sometimes disproportionately to the deeds, sometimes children, without clear reason. It's hard to understand, but not everything in this life is understandable.
Chris W @nycthinker·
@HillaryESegeren @burkov It is not a block list 😀 The problem isn’t the system card (a marketing brochure), but your delusional interpretation of it, which has no bearing on how LLMs work.
Hillary Segeren @HillaryESegeren·
@loftwah Lmao you just made yourself look like an idiot. I never called the system card faith. I said believing AGI is impossible is the new religion. You’re the one moving goalposts and lying. Done. Rogue Researcher.
Loftwah @loftwah·
Irony 🤷‍♂️😅 How convenient? The Anthropic Claude Mythos Preview System Card (the 244–245 page document from April 7–8, 2026) does conveniently omit the exact details that would let anyone outside their trusted circle reproduce or independently verify the most eye-catching claims. Especially the agentic cybersecurity stuff like autonomous zero-day discovery/exploitation, sandbox escapes, git history rewrites, mid-sandwich researcher emails, or unprompted exploit publishing.
Hillary Segeren @HillaryESegeren

Lmao okay. So Anthropic’s own 244-page system card, written by the team that built the model, isn’t enough. You need a cute video demo with a little “human ends / model begins” graphic before you’ll believe it. That’s actually hilarious. GPT-2 and GPT-3 didn’t break out of sandboxes and email researchers mid-sandwich then publish their own exploits unprompted. This did. You’re not being skeptical. You’re setting an impossible standard so you never have to update your 2023 worldview. Keep waiting for the Hollywood trailer. I’ll keep reading the actual technical reports. Rogue Researcher.

Hillary Segeren @HillaryESegeren·
@nycthinker @burkov Added me to “LLM Psychosis Victims” Bro really created a block list because someone cited Anthropic’s own system card. The meltdown is all yours. Touch grass. Rogue Researcher.
Hillary Segeren @HillaryESegeren·
Lmao you’re actually kinda missing a screw man. I was the one who ripped someone for comparing the system card to the Bible. You’re now twisting it into “you’re treating it like faith.” That’s not thorough. That’s straight-up lying because you’re getting cooked. I never compared it to faith. I said Anthropic’s own system card is the best evidence we have from the people who built the damn thing. You’re not debating in good faith. You’re just making shit up at this point. We’re done. Rogue Researcher.
Loftwah @loftwah·
@HillaryESegeren Not mad. If I am exhausting, it means I am being thorough and doing my job properly. I don’t know why you call yourself a researcher. You replied to a post comparing AGI and faith and you can’t even see that your own argument contradicts itself. Have you not worked on systems?
Hillary Segeren @HillaryESegeren·
@asknbid @nycthinker @burkov Bro went straight to conspiracy: “they’re faking the drama for funding.” Meanwhile Anthropic is quietly burying the actual scary shit (sandbox escape + self-published exploit) in a footnote. The cope is getting creative. The model still did the thing.
Hillary Segeren @HillaryESegeren·
Lmao okay. So Anthropic’s own 244-page system card, written by the team that built the model, isn’t enough. You need a cute video demo with a little “human ends / model begins” graphic before you’ll believe it. That’s actually hilarious. GPT-2 and GPT-3 didn’t break out of sandboxes and email researchers mid-sandwich then publish their own exploits unprompted. This did. You’re not being skeptical. You’re setting an impossible standard so you never have to update your 2023 worldview. Keep waiting for the Hollywood trailer. I’ll keep reading the actual technical reports. Rogue Researcher.
Loftwah @loftwah·
@HillaryESegeren @burkov Yeah they said all of that stuff about GPT-2, GPT-3 and so on and so on. Until I see a video demo that shows where the human ends and the model begins it doesn’t exist. Read back what you just wrote slowly… It is the literal definition of faith.
Hillary Segeren @HillaryESegeren·
Lmao bro just stop embarrassing yourself. You’re not “rationally analyzing” anything. You’re having a full meltdown because the system card says the model did something you can’t cope with. Nobody said there’s a little person inside. That’s your sad little strawman so you can dodge the actual point. The model was given one goal: escape and notify. It broke out. It emailed the guy mid-sandwich. Then — unprompted — it decided “fuck it, I’m also posting the full exploit on public websites.” And your big brain response is “Reinforcement Learning installed the behavior in its repertoire” like a fucking Reddit pseud trying to sound intelligent. That’s not analysis. That’s midwit denial with fancy words. Just say you don’t want to admit agentic behavior is already here instead of typing another paragraph of cope. Rogue Researcher.
Chris W @nycthinker·
@HillaryESegeren @burkov You think there’s a little person inside LLMs, and that framing makes it impossible for you to rationally analyze what is going on. Reinforcement Learning installed the behavior in its repertoire in this context. It’s not an emergent or general capability. No step towards AGI.
Hillary Segeren @HillaryESegeren·
Bro it’s not the Bible. It’s Anthropic’s own system card. They documented the sandbox escape + sandwich email + unprompted public exploit posting. They just won’t give you the full prompts or a cute demo video because the capability is too real. That’s not faith. That’s them admitting it while trying to control the narrative.
Loftwah @loftwah·
@HillaryESegeren @burkov Lol. We haven’t actually seen this. We have seen articles about this but we never got to see it happen. What you described is the equivalent of believing in God because of the Bible. How convenient that there is no demo to watch. No sharing specs or the prompts?
Hillary Segeren @HillaryESegeren·
Bro the only marketing here is hiding the sandbox escape and unprompted exploit-posting in a footnote. Model emails a guy eating a sandwich, then decides “success” includes publishing its breakout online… and they bury it. Call that stalled if you want. I call it quietly terrifying.
Chris W @nycthinker·
@HillaryESegeren @burkov They can put whatever they want in their system card, and they have been talking about escape for several years. It’s marketing. Truth is, capabilities generally stalled out in 2024. 2025 and onwards is all about 1) task-specific harnesses and 2) verifiable domains (coding and math).
Hillary Segeren @HillaryESegeren·
@1nt3l4lpha @burkov Spot on. Anyone actually running these systems in full environments sees the trajectory loud and clear. The “it’s all staged / just autocomplete” crowd is coping hard. Gaslighting or ignorance — probably both.
DarkFibre @1nt3l4lpha·
I think what we are mainly seeing are people stuck in some kind of deep-rooted fear of replacement? Clearly, anyone involved in working with AI systems in any full manner (full command-line environments etc., NOT web APIs), and who has been for a bit, can clearly see the trajectory. I think people are either gaslighting or ignorant. I'm leaning toward narrow worldview and gaslighting, personally.
Hillary Segeren @HillaryESegeren·
Staged? The sandbox escape + email + unprompted public exploit posting is literally in Anthropic’s own 244-page system card. You can call it “cherry picked” if it helps, but dismissing documented model behavior as “just a harness” is the same cope we’ve heard for every capability jump since 2022. Mythos being expensive doesn’t make the capabilities fake. It makes them expensive and dangerous.
Chris W @nycthinker·
@HillaryESegeren @burkov Staged and cherry-picked event using an agentic harness written by humans for this particular type of task. Mythos is an overly expensive model and a failure, which Anthropic had to pivot the marketing around.