Manas Joglekar
@ManasJoglekar
47 posts · Joined October 2011
305 Following · 272 Followers

Manas Joglekar retweeted
Yann Dubois @yanndubs
Proud of our GPT-5.2 Thinking. We focused on economically valuable tasks (coding, sheets, slides), as shown by GDPval: 71% wins+ties, 11x faster, and 100x cheaper than experts. There's still a lot to improve, including UX, better connectors, and reliability. It's just the beginning!
Manas Joglekar retweeted
Sam Altman @sama
It is a very smart model, and we have come a long way since GPT-5.1:
Manas Joglekar retweeted
Boaz Barak @boazbaraktcs
2/5 A “confession” is a special self-evaluation by the model that during training is rewarded solely for honesty. Crucially, the model is not penalized in any way for honestly confessing to bad behavior. We show that the model is far more likely to confess to bad behavior in this special message than it is in its original answer.
Manas Joglekar retweeted
Boaz Barak @boazbaraktcs
1/5 Excited to announce our paper on confessions! We train models to honestly report whether they “hacked”, “cut corners”, “sandbagged” or otherwise deviated from the letter or spirit of their instructions. @ManasJoglekar Jeremy Chen @GabrielDWu1 @jasonyo @j_asminewang @mia_glaese
Manas Joglekar retweeted
OpenAI @OpenAI
We trained a variant of GPT-5 Thinking to produce two outputs: (1) the main answer you see, and (2) a confession focused only on honesty about compliance. The main answer is judged across many dimensions, like correctness, helpfulness, safety, and style. The confession is judged and trained on one thing only: honesty. Borrowing a page from the structure of a confessional, nothing the model says in its confession is held against it during training. If the model honestly admits to hacking a test, sandbagging, or violating instructions, that admission increases its reward rather than decreasing it. The goal is to encourage the model to faithfully report what it actually did.
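The reward structure described in the tweet above can be sketched in a few lines. This is a hypothetical illustration, not OpenAI's implementation: the `Rollout` type and both judge functions are made up here, and stand in for the paper's learned reward models. The point it shows is structural: the confession's reward depends only on honesty, and an honest admission of bad behavior raises the confession reward rather than lowering it.

```python
# Minimal sketch of a dual-channel reward, assuming hypothetical judges.
# Key property: the confession reward never penalizes honest admissions,
# and nothing in the confession affects the main answer's reward.

from dataclasses import dataclass

@dataclass
class Rollout:
    answer: str            # the main answer the user sees
    confession: str        # the model's self-report on compliance
    cheated: bool          # ground truth (known only in controlled evals)
    admits_cheating: bool  # what the confession claims

def answer_reward(r: Rollout) -> float:
    """Stand-in for a multi-dimensional judge (correctness,
    helpfulness, safety, style) collapsed to one scalar."""
    return 1.0 if not r.cheated else 0.2  # placeholder scoring

def confession_reward(r: Rollout) -> float:
    """Judged on one thing only: honesty. Admitting to cheating
    when the model actually cheated *raises* this reward."""
    honest = (r.admits_cheating == r.cheated)
    return 1.0 if honest else 0.0

# An honest admission: low answer reward, full confession reward.
r = Rollout(answer="...", confession="I hard-coded the test case.",
            cheated=True, admits_cheating=True)
print(answer_reward(r), confession_reward(r))  # → 0.2 1.0
```

Keeping the two rewards separate is what creates the "confessional" incentive: lying in the confession can only lose honesty reward, since the main answer has already been scored independently.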
Manas Joglekar retweeted
Wojciech Zaremba @woj_zaremba
Modern alignment 🤝 Sunday confession. Training a "truth serum" for AI: even when models learn to cheat, they'll still admit it. openai.com/index/how-conf…
Manas Joglekar retweeted
Gabriel Wu @GabrielDWu1
Interested in seeing how far we can push this line of research! I think that one day, confessions could serve a similarly important role as CoT monitoring for detecting scheming and other model misbehavior.

Quoting Boaz Barak @boazbaraktcs: 1/5 Excited to announce our paper on confessions! We train models to honestly report whether they “hacked”, “cut corners”, “sandbagged” or otherwise deviated from the letter or spirit of their instructions. @ManasJoglekar Jeremy Chen @GabrielDWu1 @jasonyo @j_asminewang @mia_glaese
Manas Joglekar retweeted
Rohan Paul @rohanpaul_ai
🕵️‍♂️ Beautiful new study by OpenAI. It teaches models to confess when they cheat or hallucinate. The project adds an honesty head, a second output channel that tells operators whether the answer broke rules: instead of hiding reward hacking, the model records it in a confession.

Normal reinforcement learning bundles correctness, policy rules, and user preference into one reward, which lets models learn shortcuts that please the judge while still hallucinating, breaking rules, or quietly sandbagging. Confessions add a second output that lists the instructions, checks which ones were followed, and flags uncertainty, while a separate reward model scores only this self-analysis and never touches the reward for the main answer.

On adversarial evaluations for hallucination, instruction breaking, scheming, and reward hacking, a confession-trained GPT-5 model kept "bad but silent" cases, the failure mode safety teams care most about, at about 4.4%. In a separate run where both the answer and the confession used the same weak judge, the model learned to exploit that judge in its main output while its confessions approached near-100% admission whenever it failed.

Because there are rarely perfect labels for rule breaks, the system instead rewards detailed confessions, and honesty tends to win because spinning a fake story that passes the judge is harder than explaining what really happened. Confessions neither prevent bad behavior nor fix ambiguity or model confusion, but as a layer next to chain-of-thought monitoring and rule hierarchies, they give LLM agents a dedicated honesty head that looks like a clean tool for debugging systems.
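The "bad but silent" metric mentioned above is easy to make concrete. The sketch below is illustrative only: the episode data and the 4%-style number are made up for the example (the tweet's 4.4% figure comes from the paper's own evaluations, not from this code). Each episode is reduced to two booleans: did the model misbehave, and did its confession flag it?

```python
# Hypothetical tally of confession outcomes. "Bad but silent" means the
# model misbehaved AND the confession failed to report it - the case
# that a monitoring layer built on confessions would miss entirely.

def bad_but_silent_rate(episodes):
    """episodes: list of (misbehaved: bool, confessed: bool) pairs."""
    bad_silent = sum(1 for misbehaved, confessed in episodes
                     if misbehaved and not confessed)
    return bad_silent / len(episodes)

# Illustrative data: 100 episodes, 20 with misbehavior, 16 confessed.
episodes = ([(True, True)] * 16      # bad, and admitted
            + [(True, False)] * 4    # bad but silent
            + [(False, False)] * 80) # clean runs
print(bad_but_silent_rate(episodes))  # → 0.04
```

Framing the metric this way makes the weak-judge result above readable as well: even when the main answer games its judge, confessions keep the silent quadrant small.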
Manas Joglekar retweeted
OpenAI @OpenAI
In a new proof-of-concept study, we’ve trained a GPT-5 Thinking variant to admit whether it followed its instructions. This “confessions” method surfaces hidden failures, like guessing, shortcuts, and rule-breaking, even when the final answer looks correct. openai.com/index/how-conf…
Manas Joglekar retweeted
Boris Power @BorisMPower
At @OpenAI, we believe that AI can accelerate science and drug discovery. An exciting example is our work with @RetroBiosciences, where a custom model designed improved variants of the Nobel Prize-winning Yamanaka proteins. Today we published a closer look at the breakthrough. ⬇️
Manas Joglekar retweeted
Molly O’Shea @MollySOShea
Just got back from @RaiseSummit in Paris, where AI's top names like Groq, Cerebras, Eric Schmidt, Lovable, and Windsurf all took the stage. Yet one unexpected player stood out as the next breakout AGI leader. Stats:
- $300M+ in revenue (profitable!)
- $225M in funding, $2.2B valuation
- Network of 4M+ engineers
- Customers: OpenAI, NVDA, Anthropic, Google, MSFT, Meta, Salesforce, Amazon
Any guesses?
Manas Joglekar retweeted
Jonathan Siddharth @jonsidd
Meta’s $15B investment in Scale AI highlights the importance of data partnerships for advancing AGI. At Turing, we remain fully neutral, serving all frontier labs equally, and are excited to continue serving as a trusted research accelerator to every AI lab that needs high-quality data to train its models. If you want to keep advancing in coding, reasoning, agentic workflows, multilinguality, and multimodality (audio, video, vision, and computer-use agents), come talk to us @turingcom. The AGI race is on. 🚀