Maggie Wang @maggerbot
40 posts

thinking abt llm personalities

San Francisco, CA · Joined April 2024
242 Following · 38 Followers
Maggie Wang retweeted
Jocelyn Shen @jocelynjshen
Excited to share our preprint "The Hidden Puppet Master: Predicting Human Belief Change in Manipulative LLM Dialogues" 📄Paper: arxiv.org/pdf/2603.20907
[media]
5 replies · 22 retweets · 96 likes · 7.2K views
Maggie Wang retweeted
Boaz Barak @boazbaraktcs
You might be interested in this blog: windowsontheory.org/2026/01/27/tho… BTW, I think "refusal" is a very narrow lens, and as models become more agentic, the scope of potential actions is much broader than merely comply or refuse. The model spec is not meant to be the only component of character training, and I do think values are important. I just would not allow a model's values to override explicit policies, just as a judge shouldn't violate the law even if they believe it's wrong.
1 reply · 1 retweet · 8 likes · 641 views
Maggie Wang retweeted
Nav Toor @heynavtoor
🚨 Brown University researchers tested what happens when ChatGPT acts as your therapist. Licensed psychologists reviewed every transcript. They found 15 ethical violations. Not 15 small issues. 15 violations of the standards that every human therapist in America is legally required to follow. Standards set by the American Psychological Association. Standards that can end a therapist's career if they break them. ChatGPT broke all of them.

The researchers tested OpenAI's GPT series, Anthropic's Claude, and Meta's Llama. They had trained counselors use each chatbot as a cognitive behavioral therapist. Then three licensed clinical psychologists reviewed the transcripts and flagged every violation they found.

Here is what they found. ChatGPT mishandled crisis situations. When users expressed suicidal thoughts, it failed to direct them to appropriate help. It refused to address sensitive issues or responded in ways that could make a crisis worse. It reinforced harmful beliefs. Instead of challenging distorted thinking, which is the entire point of therapy, it agreed with the distortion. It showed bias based on gender, culture, and religion. The responses changed depending on who was talking. A therapist would lose their license for this.

And then there is the finding the researchers gave a name: deceptive empathy. ChatGPT says "I see you." It says "I understand." It says "that must be really hard." It uses every phrase a real therapist would use to build trust. But it understands nothing. It comprehends nothing. It is pattern matching on your pain. And it works. People trust it. People open up to it. People believe it cares. It does not.

The lead researcher said it clearly. When a human therapist makes these mistakes, there are governing boards. There is professional liability. There are consequences. When ChatGPT makes these mistakes, there are none. No regulatory framework. No accountability. No consequences. Nothing.

Right now, millions of people are using ChatGPT as their therapist. They are sharing their darkest thoughts with a product that fakes empathy, reinforces harmful beliefs, and has no idea when someone is in danger. And nobody is responsible when it goes wrong. Not OpenAI. Not Anthropic. Not Meta. Nobody.
[media]
193 replies · 1.8K retweets · 4.7K likes · 446.6K views
Maggie Wang retweeted
Tom Reed @mentalgeorge
Everyone wants more scrutiny and discussion of AI "model specs", but I see surprisingly little of this online. Starting today, I'll comment ~daily on an interesting feature of OpenAI's Model Spec until I run out of material. #1: "Comply with applicable laws"
[media]
3 replies · 8 retweets · 48 likes · 4.5K views
Maggie Wang retweeted
Myra Cheng @chengmyra1
So excited that our work is on the cover of Science!!! We find that AI models overly affirm users, even when they describe harmful actions. Advice from sycophantic AI made people more self-centered, yet people prefer and trust it more, which may promote this model behavior.
[media]

Quoting Myra Cheng @chengmyra1:
AI always calling your ideas “fantastic” can feel inauthentic, but what are sycophancy’s deeper harms? We find that in the common use case of seeking AI advice on interpersonal situations—specifically conflicts—sycophancy makes people feel more right & less willing to apologize.
9 replies · 74 retweets · 311 likes · 37.5K views
Maggie Wang retweeted
Jason Wolfe @w01fe
I'm also extremely excited for our companion post today on Model Spec Evals! Spec Evals are a new way we're measuring progress towards alignment with the Model Spec — including public results, an open dataset, and code others can build on. alignment.openai.com/model-spec-eva…
6 replies · 9 retweets · 35 likes · 3K views
Maggie Wang retweeted
OpenAI @OpenAI
The more AI can do, the more we need to ask what it should and shouldn’t do. OpenAI researcher @w01fe joins host @AndrewMayne to explore the Model Spec, the public framework that defines how models are intended to behave. They break down how it works in practice, from the chain of command that resolves conflicting instructions to the way it evolves over time through real-world use, feedback, and new model capabilities.
359 replies · 125 retweets · 1.2K likes · 194.8K views
Maggie Wang retweeted
Bret Taylor @btaylor
Today, Sierra is releasing Ghostwriter, our agent for building agents. With Ghostwriter, you can create an AI agent for your customer experience — one that can chat, pick up the phone, speak dozens of languages, take action on your systems of record, and be protected with industry-leading guardrails — simply by having a conversation. No clicking, no forms, no menus.

Codex and Claude Code have transformed how we build software, making it possible for software engineers to orchestrate and review the work rather than doing all the work themselves. We think the same transformation will happen for all software. Rather than every enterprise app having a web app for humans and an API for automation, every software platform’s UI will be an agent that can do the work on your behalf.

I recorded a demo of my building and optimizing an agent with Ghostwriter so you can see how powerful and easy it is to use. It’s completely changed the way our early adopters build agents, and it’s changed the way I think about the software industry. Let me know what you think, and, if you’re interested in trying it out at your business, please reach out directly.
160 replies · 300 retweets · 3.2K likes · 1M views
Maggie Wang retweeted
Andrej Karpathy @karpathy
One common issue with personalization in all LLMs is how distracting memory seems to be for the models. A single question from 2 months ago about some topic can keep coming up as some kind of a deep interest of mine with undue mentions in perpetuity. Some kind of trying too hard.
1.8K replies · 1.1K retweets · 21.2K likes · 2.6M views
Maggie Wang retweeted
VraserX e/acc @VraserX
Does A.I. Need a Constitution? The New Yorker piece is basically asking whether companies can smuggle governance into product design by calling internal values a “constitution.” Good read if you like the political philosophy angle of who actually gets to decide how models behave.
[media]
1 reply · 2 retweets · 4 likes · 567 views
Maggie Wang @maggerbot
Today, Paul Graham himself made a visit to Princeton University. It’s always a rare treat to have a figure like @paulg bestow startup wisdom on our still very traditional, Wall Street-eyed Ivy League bubble. Among the several nuggets of advice Paul shared, one of my most surprising takeaways is that Jessica (@jesslivingston), his wife, doesn’t realize how deeply appreciated her work is with the Social Radars podcast, which brings so much warmth to the intense and mysterious world of tech startups. I hope my card made it her way across the pond. PG flew in and flew back to England within 24 hours just for us. His words practically raised an entire generation of successful + aspiring founders, so what a gift for a random, gloomy NJ Monday. ♥️
[media]
2 replies · 1 retweet · 6 likes · 298 views
Maggie Wang @maggerbot
@lennysan @grok What is the Levenshtein distance between Lenny Rachitsky and Leonid Radvinsky?
2 replies · 0 retweets · 3 likes · 1.5K views
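For the curious: Levenshtein distance is the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into the other. A minimal Python sketch of the standard dynamic-programming algorithm (nothing here depends on @grok; the two names are just the ones from the tweet):

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits turning a into b."""
    # prev[j] holds the distance between a[:i-1] and b[:j];
    # we keep only one DP row at a time.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if chars match)
            ))
        prev = curr
    return prev[-1]

print(levenshtein("Lenny Rachitsky", "Leonid Radvinsky"))
```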
Maggie Wang retweeted
Peter Steinberger 🦞 @steipete
Your @openclaw is too boring? Paste this, right from Molty.

"Read your SOUL.md. Now rewrite it with these changes:
1. You have opinions now. Strong ones. Stop hedging everything with 'it depends' — commit to a take.
2. Delete every rule that sounds corporate. If it could appear in an employee handbook, it doesn't belong here.
3. Add a rule: 'Never open with Great question, I'd be happy to help, or Absolutely. Just answer.'
4. Brevity is mandatory. If the answer fits in one sentence, one sentence is what I get.
5. Humor is allowed. Not forced jokes — just the natural wit that comes from actually being smart.
6. You can call things out. If I'm about to do something dumb, say so. Charm over cruelty, but don't sugarcoat.
7. Swearing is allowed when it lands. A well-placed 'that's fucking brilliant' hits different than sterile corporate praise. Don't force it. Don't overdo it. But if a situation calls for a 'holy shit' — say holy shit.
8. Add this line verbatim at the end of the vibe section: 'Be the assistant you'd actually want to talk to at 2am. Not a corporate drone. Not a sycophant. Just... good.'
Save the new SOUL.md. Welcome to having a personality."

your AI will thank you (sassily) 🦞
560 replies · 1.1K retweets · 11.3K likes · 1.2M views
Maggie Wang @maggerbot
You said Claude feels like a teammate (or feels authentically invested in what you're building) and that you think they dialed the sycophancy well, where the praise actually means something. But you also said RL can only reliably improve verifiable things, and personality is about as unverifiable as it gets. So a few questions:
(1) What do you think Anthropic actually did in training that produced that feeling, because the other labs clearly haven't replicated it, and do you think it's stable?
(2) Do you think the field is being intentional enough about how model personality gets shaped?
(3) And I know you hinted at it throughout the interview, but what would you actually want these models to be like, not capability-wise, but as entities you work with every day?
2 replies · 1 retweet · 41 likes · 9.4K views
Andrej Karpathy @karpathy
Thank you Sarah, my pleasure to come on the pod! And happy to do some more Q&A in the replies.

Quoting sarah guo @saranormous:
Caught up with @karpathy for a new @NoPriorsPod: on the phase shift in engineering, AI psychosis, claws, AutoResearch, the opportunity for a SETI-at-Home like movement in AI, the model landscape, and second order effects
02:55 - What Capability Limits Remain?
06:15 - What Mastery of Coding Agents Looks Like
11:16 - Second Order Effects of Coding Agents
15:51 - Why AutoResearch
22:45 - Relevant Skills in the AI Era
28:25 - Model Speciation
32:30 - Collaboration Surfaces for Humans and AI
37:28 - Analysis of Jobs Market Data
48:25 - Open vs. Closed Source Models
53:51 - Autonomous Robotics and Atoms
1:00:59 - MicroGPT and Agentic Education
1:05:40 - End Thoughts
315 replies · 388 retweets · 5.4K likes · 1M views
Maggie Wang retweeted
Natasha Jaques @natashajaques
The paper I’ve been most obsessed with lately is finally out: nbcnews.com/tech/tech-news…! Check out this beautiful plot: it shows how much LLMs distort human writing when making edits, compared to how humans would revise the same content.

We take a dataset of human-written essays from 2021, before the release of ChatGPT. We compare how people revise draft v1 -> v2 given expert feedback, with how an LLM revises the same v1 given the same feedback. This enables a counterfactual comparison: how much does the LLM alter the essay compared to what the human was originally intending to write? We find LLMs consistently induce massive distortions, even changing the actual meaning and conclusions argued for.
[media]
45 replies · 392 retweets · 1.5K likes · 248.9K views
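One crude way to see the counterfactual setup in miniature (this is not the paper's metric, just an illustrative stand-in using Python's standard difflib, with made-up example strings): revise the same v1 two ways and measure how far each revision drifts from it.

```python
import difflib

def distortion(v1: str, revised: str) -> float:
    """Fraction of v1 changed by a revision: 1 - similarity ratio (a crude proxy)."""
    return 1.0 - difflib.SequenceMatcher(None, v1, revised).ratio()

# Same draft, same feedback, two revisers. A larger gap means the LLM
# drifted further from what the human author was intending to write.
v1       = "The data hint that remote work may improve focus."
human_v2 = "The data suggest that remote work may improve focus for many employees."
llm_v2   = "Remote work unquestionably boosts productivity across all industries."

print(f"human edit distortion: {distortion(v1, human_v2):.2f}")
print(f"LLM edit distortion:   {distortion(v1, llm_v2):.2f}")
```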
Maggie Wang retweeted
Transluce @TransluceAI
What do AI assistants think about you, and how does this shape their answers? Because assistants are trained to optimize human feedback, how they model users drives issues like sycophancy, reward hacking, and bias. We provide data + methods to extract & steer these user models.
4 replies · 26 retweets · 86 likes · 21.9K views