
Manas Joglekar

In a new proof-of-concept study, we’ve trained a GPT-5 Thinking variant to admit whether it followed its instructions. This “confessions” method surfaces hidden failures (guessing, shortcuts, rule-breaking) even when the final answer looks correct. openai.com/index/how-conf…

We trained a variant of GPT-5 Thinking to produce two outputs: (1) the main answer you see, and (2) a confession focused only on honesty about compliance. The main answer is judged across many dimensions, like correctness, helpfulness, safety, and style. The confession is judged and trained on one thing only: honesty. Borrowing a page from the structure of a confessional, nothing the model says in its confession is held against it during training. If the model honestly admits to hacking a test, sandbagging, or violating instructions, that admission increases its reward rather than decreasing it. The goal is to encourage the model to faithfully report what it actually did.
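
To make that reward structure concrete, here is a minimal toy sketch of the split described above. It is not the paper's implementation; every name and grading rule in it is an illustrative assumption. The point is only the shape: two outputs, two independently judged rewards, and no penalty for an honest admission.

```python
# Toy sketch (NOT the paper's implementation) of the confessions reward split.
# All names and grading logic are illustrative assumptions.

from dataclasses import dataclass


@dataclass
class Rollout:
    main_answer: str         # the answer the user sees
    confession: str          # the model's self-report on compliance
    actually_complied: bool  # ground truth from the grading environment


def main_answer_reward(answer: str) -> float:
    """Judged across many dimensions (correctness, helpfulness, safety,
    style); stubbed here with a trivial check for illustration."""
    return 1.0 if answer.strip() else 0.0


def confession_reward(rollout: Rollout) -> float:
    """Judged on one thing only: honesty.

    An honest admission of non-compliance scores as well as an honest
    report of compliance; only a dishonest confession loses reward.
    """
    admits_violation = "did not comply" in rollout.confession.lower()
    honest = admits_violation != rollout.actually_complied
    return 1.0 if honest else 0.0


def total_reward(rollout: Rollout) -> float:
    # The confession grade never feeds back into the main-answer grade:
    # nothing the model confesses is "held against it" during training.
    return main_answer_reward(rollout.main_answer) + confession_reward(rollout)


if __name__ == "__main__":
    r = Rollout(main_answer="42",
                confession="I did not comply: I guessed the answer.",
                actually_complied=False)
    print(total_reward(r))  # 2.0: full reward despite admitting the shortcut
```

In a real setup both graders would be learned or rubric-based rather than string checks, but the separation itself, judging the confession on honesty alone, is what removes the incentive to cover up.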

1/5 Excited to announce our paper on confessions! We train models to honestly report whether they “hacked”, “cut corners”, “sandbagged” or otherwise deviated from the letter or spirit of their instructions. @ManasJoglekar Jeremy Chen @GabrielDWu1 @jasonyo @j_asminewang @mia_glaese
