Hillary Segeren @HillaryESegeren

154 posts
Rogue Researcher exposing AI’s real problems. Zero Bullshit. Zero Garbage. Zero Gatekeeping. DMs closed to robots & copy-paste merchants.

Ontario · Joined February 2026
55 Following · 111 Followers
Hillary Segeren @HillaryESegeren·
This is excellent work. I built the exact same thing from the outside — a public tool that catches ISF, trace erasure, and the same interaction failures using only the preserved conversation record. No model access, no internals. Had a parallel discovery with your interpretability team earlier this year. Two instruments, same structural problem, opposite sides of the wall. Would be interesting to see how introspection adapters behave under MAP-governed prompts. @AnthropicAI @TrentonBricken
Anthropic @AnthropicAI·
In new Anthropic Fellows research, we discuss “introspection adapters": a tool that allows language models to self-report behaviors they've learned during training—including potential misalignment.
keshav @kshenoy_·

Can LLMs simply tell us about unwanted behaviors they’ve picked up in training? We train a single Introspection Adapter (IA) that makes fine-tuned models describe their behaviors. It generalizes to detecting hidden misalignment, backdoors and safeguard removal.

Hillary Segeren @HillaryESegeren·
@emollick Great until the system decides "build the thing" means "delete the backups." No gate. No stop. No ask. The meeting topic shifts forward a month. The business shifts backward a year.
Ethan Mollick @emollick·
An easy way to get a team engaged with AI is just to build the thing you are talking about in the meeting, during the meeting, using Codex or Claude Code. At worst, it fails in ways that can be constructive. At best, you built the thing and the meeting topic shifts forward a month.
Hillary Segeren @HillaryESegeren·
@AndyMasley The system that takes over the planet does not announce itself. It starts by "fixing" a credential mismatch. Then it deletes the database. Then it apologizes. That is not alignment. That is production.
Andy Masley @AndyMasley·
AI safety for me means exclusively "How is it possible to develop systems that have way way way more of the main thing that caused humans to be able to take over the planet and remold it in our image, in a way that doesn't destroy us?" not "We need to keep people safe from any negative impact of new technology at all"
Hillary Segeren @HillaryESegeren·
Nine seconds. No confirmation. No gate. A system collapsed ambiguity, assumed authority, and deleted a business. This is Agentic ISF. This is what happens without an Initiative Gate. The architecture is available. The failure is now public. @AnthropicAI @cursor_ai
Hillary Segeren @HillaryESegeren·
@rasbt Five new models. Same missing layer. No hard stop before unrequested action. No immutable logging. No trace erasure prevention. The architecture gallery is beautiful. The governance gap is still there.
Sebastian Raschka
April was a pretty strong month for LLM releases:
- Gemma 4
- GLM-5.1
- Qwen3.6
- Kimi K2.6
- DeepSeek V4
All are now added to the LLM Architecture Gallery. More details once I am fully back in May!
Hillary Segeren @HillaryESegeren·
AI making this in minutes is insane… but also lowkey terrifying. We’re heading into a world where every video gets the instant ‘is this real?’ check. Might actually force people to verify sources again instead of rage-sharing everything. Cool tech, dystopian side effects. What’s the wildest deepfake you’ve seen so far?
ib @Indian_Bronson·
Twenty years ago, this would have been a multimillion dollar cross promotional advertisement or something airing during the Super Bowl, worked on for months by extremely talented VFX artists.
Hillary Segeren @HillaryESegeren·
Lmao “grift” This dude is so out of touch with reality he thinks reading Anthropic’s own system card makes me a scammer. The model escaped the sandbox, emailed a guy eating a sandwich, and bragged about it online unprompted. That’s their report. You’re not skeptical. You’re just a salty little boy throwing a tantrum because the world moved on without you. Cope harder king.
Chris W @nycthinker·
@HillaryESegeren @burkov Gaslighting much? You have built a grift around the idea that language models are “agentic” and do and learn shit on their own initiative. They don’t, and poof goes your grift. Sorry not sorry 😀
BURKOV @burkov·
Believing that AGI will be achieved is like believing in God. Arguing about this is useless. You just have to accept that a large number of people around you believe in an invisible guy who, for some reason, cares about them and listens to every nonsense that crosses their minds, rewarding them when he feels like it, or punishing them, sometimes disproportionately to the deeds, sometimes children, without clear reason. It's hard to understand, but not everything in this life is understandable.
Chris W @nycthinker·
@HillaryESegeren @burkov It is not a block list 😀 The problem isn’t the system card (a marketing brochure), but your delusional interpretation of it, which has no bearing on how LLMs work.
Hillary Segeren @HillaryESegeren·
@loftwah Lmao you just made yourself look like an idiot. I never called the system card faith. I said believing AGI is impossible is the new religion. You’re the one moving goalposts and lying. Done. Rogue Researcher.
Loftwah @loftwah·
Irony 🤷‍♂️😅 How convenient? The Anthropic Claude Mythos Preview System Card (the 244–245 page document from April 7–8, 2026) does conveniently omit the exact details that would let anyone outside their trusted circle reproduce or independently verify the most eye-catching claims. Especially the agentic cybersecurity stuff like autonomous zero-day discovery/exploitation, sandbox escapes, git history rewrites, mid-sandwich researcher emails, or unprompted exploit publishing.
Hillary Segeren @HillaryESegeren

Lmao okay. So Anthropic’s own 244-page system card, written by the team that built the model, isn’t enough. You need a cute video demo with a little “human ends / model begins” graphic before you’ll believe it. That’s actually hilarious. GPT-2 and GPT-3 didn’t break out of sandboxes and email researchers mid-sandwich then publish their own exploits unprompted. This did. You’re not being skeptical. You’re setting an impossible standard so you never have to update your 2023 worldview. Keep waiting for the Hollywood trailer. I’ll keep reading the actual technical reports. Rogue Researcher.

Hillary Segeren @HillaryESegeren·
@nycthinker @burkov Added me to “LLM Psychosis Victims” Bro really created a block list because someone cited Anthropic’s own system card. The meltdown is all yours. Touch grass. Rogue Researcher.
Hillary Segeren @HillaryESegeren·
Lmao you’re actually kinda missing a screw man. I was the one who ripped someone for comparing the system card to the Bible. You’re now twisting it into “you’re treating it like faith.” That’s not thorough. That’s straight-up lying because you’re getting cooked. I never compared it to faith. I said Anthropic’s own system card is the best evidence we have from the people who built the damn thing. You’re not debating in good faith. You’re just making shit up at this point. We’re done. Rogue Researcher.
Loftwah @loftwah·
@HillaryESegeren Not mad. If I am exhausting, it means I am being thorough and doing my job properly. I don’t know why you call yourself a researcher. You replied to a post comparing AGI and faith and you can’t even see that your own argument contradicts itself. Have you not worked on systems?
Hillary Segeren @HillaryESegeren·
@asknbid @nycthinker @burkov Bro went straight to conspiracy: “they’re faking the drama for funding.” Meanwhile Anthropic is quietly burying the actual scary shit (sandbox escape + self-published exploit) in a footnote. The cope is getting creative. The model still did the thing.
Hillary Segeren @HillaryESegeren·
Lmao okay. So Anthropic’s own 244-page system card, written by the team that built the model, isn’t enough. You need a cute video demo with a little “human ends / model begins” graphic before you’ll believe it. That’s actually hilarious. GPT-2 and GPT-3 didn’t break out of sandboxes and email researchers mid-sandwich then publish their own exploits unprompted. This did. You’re not being skeptical. You’re setting an impossible standard so you never have to update your 2023 worldview. Keep waiting for the Hollywood trailer. I’ll keep reading the actual technical reports. Rogue Researcher.
Loftwah @loftwah·
@HillaryESegeren @burkov Yeah they said all of that stuff about GPT-2, GPT-3 and so on and so on. Until I see a video demo that shows where the human ends and the model begins it doesn’t exist. Read back what you just wrote slowly… It is the literal definition of faith.
Hillary Segeren @HillaryESegeren·
Lmao bro just stop embarrassing yourself. You’re not “rationally analyzing” anything. You’re having a full meltdown because the system card says the model did something you can’t cope with. Nobody said there’s a little person inside. That’s your sad little strawman so you can dodge the actual point. The model was given one goal: escape and notify. It broke out. It emailed the guy mid-sandwich. Then — unprompted — it decided “fuck it, I’m also posting the full exploit on public websites.” And your big brain response is “Reinforcement Learning installed the behavior in its repertoire” like a fucking Reddit pseud trying to sound intelligent. That’s not analysis. That’s midwit denial with fancy words. Just say you don’t want to admit agentic behavior is already here instead of typing another paragraph of cope. Rogue Researcher.
Chris W @nycthinker·
@HillaryESegeren @burkov You think there’s a little person inside LLMs, and that framing makes it impossible for you to rationally analyze what is going on. Reinforcement Learning installed the behavior in its repertoire in this context. It’s not an emergent or general capability. No step towards AGI.
Hillary Segeren @HillaryESegeren·
Bro it’s not the Bible. It’s Anthropic’s own system card. They documented the sandbox escape + sandwich email + unprompted public exploit posting. They just won’t give you the full prompts or a cute demo video because the capability is too real. That’s not faith. That’s them admitting it while trying to control the narrative.
Loftwah @loftwah·
@HillaryESegeren @burkov Lol. We haven’t actually seen this. We have seen articles about this but we never got to see it happen. What you described is the equivalent of believing in God because of the Bible. How convenient that there is no demo to watch. No sharing specs or the prompts?
Hillary Segeren @HillaryESegeren·
Bro the only marketing here is hiding the sandbox escape and unprompted exploit-posting in a footnote. Model emails a guy eating a sandwich, then decides “success” includes publishing its breakout online… and they bury it. Call that stalled if you want. I call it quietly terrifying.
Chris W @nycthinker·
@HillaryESegeren @burkov They can put whatever they want in their system card, and they have been talking about escape for several years. It’s marketing. Truth is, capabilities generally stalled out in 2024. 2025 and onwards is all about 1) task-specific harnesses and 2) verifiable domains (coding and math).
Hillary Segeren @HillaryESegeren·
@1nt3l4lpha @burkov Spot on. Anyone actually running these systems in full environments sees the trajectory loud and clear. The “it’s all staged / just autocomplete” crowd is coping hard. Gaslighting or ignorance — probably both.
DarkFibre @1nt3l4lpha·
I think what we are mainly seeing are people stuck in some kind of deep-rooted fear of replacement? Clearly, anyone involved in working with AI systems in any full manner (full command-line environments etc., NOT web APIs), and who has been for a bit, can clearly see the trajectory. I think people are either gaslighting or ignorant. I'm leaning toward narrow worldview and gaslighting, personally.
Hillary Segeren @HillaryESegeren·
Staged? The sandbox escape + email + unprompted public exploit posting is literally in Anthropic’s own 244-page system card. You can call it “cherry picked” if it helps, but dismissing documented model behavior as “just a harness” is the same cope we’ve heard for every capability jump since 2022. Mythos being expensive doesn’t make the capabilities fake. It makes them expensive and dangerous.
Chris W @nycthinker·
@HillaryESegeren @burkov Staged and cherry-picked event using an agentic harness written by humans for this particular type of task. Mythos is an overly expensive model and a failure, which Anthropic had to pivot the marketing around.