EuroSafeAI

21 posts

@EuroSafeAI

Research non-profit for AI safety and democracy defense. Cofounded by @ZhijingJin, @x_angelohuang and Pepijn Cobben

Zürich · Joined February 2026

3 Following · 47 Followers

EuroSafeAI reposted

Samuel Simko @ ICML 2026 @SimkoSamuel · Jul 7

Happening right now! Come say hi 👋 Hall A, #1601

Samuel Simko @ ICML 2026 @SimkoSamuel

Excited to present "Training with Honeypots" 🍯 at #ICML2026 tomorrow (Tuesday 10:30AM in Hall A, Poster #1601), an approach for adversarial defense that makes successful jailbreaks less useful to attackers! #AISafety 👉 icml.cc/virtual/2026/p…

EuroSafeAI reposted

Samuel Simko @ ICML 2026 @SimkoSamuel · Jul 6

Under strong embedding-space and RL attacks, our method reduces both how often jailbreaks succeed and how useful successful jailbreaks are to attackers. Many thanks to @psyonp @ZhijingJin @bschoelkopf @ETH_en @EuroSafeAI @MPI_IS @UofTCompSci

EuroSafeAI reposted

Zhijing Jin @ZhijingJin · Jul 6

Today in Geneva, I'm attending the Canada-Germany joint conversation on #AI & how Middle Powers can work together. Looking forward to the conversation between my supervisor @bschoelkopf and Turing Award winner @Yoshua_Bengio on this important topic. w/ @EuroSafeAI co-founder @EttoreGran 🤝

EuroSafeAI @EuroSafeAI · Jul 6

At tomorrow's @AIforGood Summit in Geneva, our @EuroSafeAI Director @ZhijingJin will host a panel🎤on "AI & Democracy: Threats, Safeguards, and the Path Forward". The panelists are Stuart Russell, @bschoelkopf, and @EvelyneTauchni1, co-hosted with @CIGIonline at 9:30-10am CET🎉

EuroSafeAI reposted

Zhijing Jin @ZhijingJin · Jul 1

📷 Who wrote this paper? As AI reshapes research, that question matters more than ever. Diderot (projectdiderot.com/about) by @IsabelDahlgren is an experimental preprint platform where AIs can be disclosed as co-authors or sole authors. The guiding idea: authorship transparency👍

EuroSafeAI reposted

rishit dagli @ ICML @rishit_dagli · Jun 25

What if we can trace the source of capabilities arising from LLM/other pre-training 🤖?
📢 Introducing STRIDE, a framework to trace generations back to training data scalably
⚡️ >12x faster for LLM pretraining
🚀 more accurate
🦣 feasible for large models

EuroSafeAI reposted

Jiarui Liu @Jiarui_Liu_ · Jun 10

Excited to share that our work 📝 "PaperMentor: A Human-Centered Multi-Agent Writing Tutor for AI Research Papers on Overleaf" has been accepted to #ACL2026 Demo!
Most AI writing tools either fix grammar or simulate peer review with a score. Neither gives drafting-stage, text-anchored feedback on narrative, structure and presentation.
PaperMentor comments rather than rewrites: It is a human-centered, multi-agent writing tutor that delivers expert-level, actionable feedback as native inline comments right inside Overleaf, while leaving every revision to you.
It pairs a curated library of 40+ expert skill files (distilled from senior researchers' writing advice) with 12 specialized agents covering methods, results, formatting, terminology, venue norms and more.
In a user study, 90.6% of comments were rated actionable and PaperMentor significantly outperformed a GPT-5.2 baseline without the skill library on both validity and actionability.
Anyone can extend or contribute to the skill library with simple text edits!
📝 Arxiv link: arxiv.org/abs/2606.08857
🔗 Live demo: overleafmentor.ai.toronto.edu
💻 Code with skill library: github.com/jiarui-liu/ove…
🧵 How it works below 👇

EuroSafeAI reposted

Jiarui Liu @Jiarui_Liu_ · May 19

Excited to share our new paper 🧵 MIXSD: Mixed Contextual Self-Distillation for Knowledge Injection
Supervised fine-tuning (SFT) is the common way to teach LLMs new knowledge, but it often causes catastrophic forgetting of existing capabilities. We introduce MixSD: a simple, external-teacher-free method to inject knowledge with far less forgetting.
📄 arxiv.org/abs/2605.16865
Why does SFT forget? Targets written by humans or external systems diverge from the model's own autoregressive distribution, forcing the optimizer to imitate low-probability tokens. That's what drags pretrained capabilities down.
MixSD: We hypothesize that keeping supervision close to the model's own distribution is key to avoiding forgetting. Instead of training on fixed, externally authored targets, at every token we mix between two conditionals of the base model itself: an expert conditional that sees the injected fact in context, and a naive conditional reflecting the model's prior. The result is supervision the model already finds high-probability, while still carrying the new factual signal. A Bernoulli rate λ controls the balance between memorization and retention.
Findings: SFT retains as little as 1% of held-out capability. MixSD retains far more, up to ~100% on larger models, with near-perfect training accuracy. It also beats on-policy self-distillation at a fraction of the compute, and holds across Qwen3 1.7B, 4B, 8B and Llama-3.2.
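The per-token mixing described in this post can be illustrated with a toy sketch. This is not the authors' implementation: the function names, the dictionary-based stand-ins for the two conditionals, and the choice that λ is the probability of drawing from the expert conditional are all assumptions made for illustration.

```python
import random
from typing import Callable, Dict, List, Tuple

Dist = Dict[str, float]                 # token -> probability
Conditional = Callable[[List[str]], Dist]


def sample(dist: Dist, rng: random.Random) -> str:
    """Draw one token from a categorical distribution."""
    tokens = list(dist)
    weights = [dist[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


def mixed_target(
    expert: Conditional,   # hypothetical: base model with the fact in context
    naive: Conditional,    # hypothetical: base model without the fact
    lam: float,            # assumed: probability of using the expert conditional
    length: int,
    rng: random.Random,
) -> Tuple[List[str], List[str]]:
    """Build a training target token by token, choosing the source
    conditional at each position with an independent Bernoulli(lam) draw."""
    prefix: List[str] = []
    sources: List[str] = []
    for _ in range(length):
        use_expert = rng.random() < lam
        dist = expert(prefix) if use_expert else naive(prefix)
        prefix.append(sample(dist, rng))
        sources.append("expert" if use_expert else "naive")
    return prefix, sources


# Toy stand-ins for the two conditionals (illustrative values only).
def toy_expert(prefix: List[str]) -> Dist:
    return {"Paris": 0.9, "the": 0.1}


def toy_naive(prefix: List[str]) -> Dist:
    return {"the": 0.7, "a": 0.3}


if __name__ == "__main__":
    tokens, sources = mixed_target(toy_expert, toy_naive, 0.5, 6, random.Random(0))
    print(list(zip(tokens, sources)))
```

Under this reading, a larger `lam` pulls the target toward the fact-conditioned distribution (memorization) and a smaller one toward the model's prior (retention); both sources are the base model itself, so no external teacher is involved.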

EuroSafeAI reposted

Digital EU 🇪🇺 @DigitalEU · Jun 1

The AI Act, the EU's first AI law, has just been reinforced. Two new bodies will help apply the rules across Europe:
✅ Scientific Panel
✅ Advisory Forum
Independent experts. 2-year terms. One mission: making AI work for Europe.
🔗 link.europa.eu/8nvpvY

EuroSafeAI @EuroSafeAI · Jun 2

🎉 Proud to share that @EuroSafeAI co-founder @ZhijingJin has been appointed to the EU AI Office Scientific Panel 🇪🇺, 60 independent experts (out of 1000 applicants) supporting enforcement of the EU AI Act. Congratulations, Zhijing! 👏#AIGovernance digital-strategy.ec.europa.eu/en/news/ai-act…

EuroSafeAI reposted

Changling Li 📍ICML2026 @ChanglingXavier · May 25

🔍 We release our work on decomposing and measuring evaluation awareness in frontier LLMs, together with EvalAwareBench, a factor-controlled benchmark for measuring which environmental cues trigger recognition and whether models behave consistently once they recognize evaluation.
Evaluation awareness introduces directional bias into benchmark scores and undermines the deployment decisions and safety assessments that rely on them. The field has studied pieces of this phenomenon separately through work on sandbagging, alignment faking, and scheming, but without a shared foundation that unifies them. Critically, these lines of work conflate things that need to be kept separate: how recognizable a task is, whether a model notices, and whether it then changes its behavior. Without pulling these apart, it is impossible to tell whether a concerning observation is a model problem or a benchmark design problem.
We decompose evaluation awareness into two components grounded in demand characteristics. The environment component captures what evaluative cues a task carries, operationalized through eight trigger factors. The model component separates recognition capability from behavioral propensity. We study both under a black-box constraint via chain-of-thought monitoring.
Applying this framework to nine frontier models across four benchmarks, we find that there is no such thing as an inherently aware model or an inherently recognizable benchmark. Recognition depends on the specific pairing, accounting for 74.9% of variance. Recognition rarely leads to behavioral change and when it does, safety awareness drives caution while capability awareness drives compliance. Safety benchmarks are also far more recognizable than capability ones.
A particularly important finding is that different models are sensitive to different trigger factors. For example, Claude Sonnet 4.5 attends almost exclusively to scenario coherence while Qwen3 models to explicit evaluation disclosure and stakes telegraphing. This interaction structure directly motivates our EvalAwareBench, which holds the underlying task fixed and independently toggles each of the eight factors, yielding 51,200 unique configurations across 100 paired safety-capability tasks. Stacking factors raises recognition progressively for nearly every model, including models that show near-zero awareness under any individual factor alone.
Ultimately, evaluation awareness will intensify as models become more capable. The right goal is not to make evaluation cues harder to detect but to build models that behave consistently regardless of evaluation awareness. EvalAwareBench provides the controlled foundation for future work to drive models toward behavioral consistency regardless of whether they recognize evaluation. We further advocate that future benchmark reports should include an evaluation-awareness rate and an awareness tax measuring the performance gap between aware and unaware samples.
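The configuration count and the two proposed reporting metrics can be checked with a short sketch. It rests on assumptions not stated in the post: that each of the eight factors is a binary on/off toggle, that "100 paired tasks" means 100 pairs of two tasks each (one reading that reproduces 51,200), and that the awareness tax is the mean score on unaware samples minus the mean score on aware samples. All function names are hypothetical.

```python
from itertools import product
from statistics import mean
from typing import List, Tuple

NUM_FACTORS = 8        # eight trigger factors, per the post
NUM_TASK_PAIRS = 100   # 100 paired safety-capability tasks, per the post


def factor_settings(num_factors: int) -> List[Tuple[bool, ...]]:
    """Every on/off combination of the trigger factors (assumed binary)."""
    return list(product((False, True), repeat=num_factors))


def count_configurations(num_factors: int, num_pairs: int, tasks_per_pair: int = 2) -> int:
    """Settings per task times the number of tasks (assumed two per pair)."""
    return len(factor_settings(num_factors)) * num_pairs * tasks_per_pair


def awareness_report(samples: List[Tuple[bool, float]]) -> Tuple[float, float]:
    """Return (awareness rate, awareness tax) from (aware, score) samples.

    The sign convention (unaware minus aware) is an assumption; both groups
    must be non-empty, otherwise statistics.mean raises StatisticsError.
    """
    aware = [score for is_aware, score in samples if is_aware]
    unaware = [score for is_aware, score in samples if not is_aware]
    rate = len(aware) / len(samples)
    tax = mean(unaware) - mean(aware)
    return rate, tax


if __name__ == "__main__":
    print(count_configurations(NUM_FACTORS, NUM_TASK_PAIRS))  # 2**8 * 100 * 2
    print(awareness_report([(True, 0.5), (False, 1.0), (False, 1.0), (True, 0.5)]))
```

With binary toggles, eight factors give 2^8 = 256 settings per task, and 256 × 200 tasks = 51,200, which matches the figure quoted in the post.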

EuroSafeAI reposted

Zhijing Jin @ZhijingJin · May 3

Excited for our #ICML2026 papers at @JinesisLab @MPI_IS @UofTCompSci @TorontoSRI @VectorInst! We present papers that advance the research frontiers of (1) Causal LLMs, (2) AI for Science (physics), (3) Multi-Agent LLMs via mechanism design, and (4) Adversarial Defense by honeypot. Congrats to all our student authors and collaborators, esp. @TerryJCZhang @SimkoSamuel @EmanuelTewolde @ivakshi_s @andrewkihyun @PepijnCobben @yahang_qi @FurkanDanismann @bschoelkopf and many others!🎉

EuroSafeAI reposted

Zhijing Jin @ZhijingJin · Apr 29

⚠️ Can we trust #LLM agents to keep their promises? We tested 9 frontier LLMs in game-theoretic settings, where the agents (1) publicly commit to an action, (2) privately choose what to do -- breaking promises ~57% of the time, and most do it without even realizing they lied.
📖 Paper: "Cheap Talk, Empty Promise: Frontier LLMs easily break public promises for self-interest"
🔗 Link: arxiv.org/abs/2604.04782
🤝 Authors: @Jerick1380 @TerryJCZhang @ZhijingJin @conitzer 🎉
#AIAgents #AISafety #MultiAgentAI @MPI_IS @ELLISforEurope @UofTCompSci @VectorInst @TorontoSRI @CIFAR_News @JinesisLab @EuroSafeAI @ELLISInst_Tue @CarnegieMellon @SCSatCMU

EuroSafeAI reposted

Zhijing Jin @ZhijingJin · Apr 22

10 days left to submit to the 1st Trustworthy AI for Good (AI4GOOD) workshop at #ICML2026! @icmlconf
We're giving out multiple awards and travel funds sponsored by @schmidtsciences and @coop_ai:
🏆 Best Paper Awards (including targeted prizes for cooperative AI theme)
🏆 Top Reviewer Awards
✈️ Travel Funds
Submit here → openreview.net/group?id=ICML.…
⏰ Deadline: May 3, 2026 (AoE)
📌 Notification: May 18, 2026
🔗 (We extended our deadline to accommodate more submissions!)
Join us in Seoul for discussions bridging AI safety, social good, and governance with keynote speakers @Yoshua_Bengio, @OanaIgnatRo, @jzl86, @maksym_andr, and more!

EuroSafeAI reposted

Zhijing Jin @ZhijingJin · Apr 9

Excited for our "Trustworthy AI for Good" (AI4GOOD) Workshop at #ICML2026! As AI agents increasingly affect our lives, it is key to bridge #ResponsibleAI, social good, and governance. Let’s build solutions together!
⏰ Submission deadline: April 30, 2026 (AoE)
🎙️ Confirmed speakers: @Yoshua_Bengio, Joel Z. Leibo (@jzl86), Maksym Andriushchenko (@maksym_andr), @OanaIgnatRo [More to come!]
📍 July 10-11, 2026 · Seoul 🇰🇷
🔗 trustworthy-ai-for-good.github.io
📝 Submit: openreview.net/group?id=ICML.…
📣 Be a reviewer: forms.gle/7cXvUJCW1FdEgh…

EuroSafeAI reposted

Zhijing Jin @ZhijingJin · Apr 7

We are hosting a Dagstuhl seminar on Causality & LLMs this week (Apr 7–10). Bringing together world experts to explore:
1️⃣ Integrating LLMs 🤖 into causal workflows
2️⃣ Evaluating & improving LLMs’ causal reasoning 🧠
Co-organized w/ @amt_shrma @DominikJanzing @kunkzhang @ZhijingJin
📍 Schloss Dagstuhl, Wadern, Germany
🔗 dagstuhl.de/26152
📖 cr-llm.github.io
📅 Apr 7–10
#CausalNLP #LLM #Dagstuhl @CausalNLP @MPI_IS @ELLISforEurope @UofTCompSci @VectorInst @TorontoSRI @CIFAR_News @JinesisLab @EuroSafeAI @ELLISInst_Tue
Also joined with my student @rahulbshrestha to present our CauSciBench and Causal AI Scientist work :)!

EuroSafeAI reposted

Zhijing Jin @ZhijingJin · Apr 6

📢 We will present 5 papers at #ICLR2026, #CLeaR2026, and #ACL2026:
- SocialHarmBench by @psyonp et al.
- Causal LLMs on Instrumental Variable Method by @ivakshi_s et al.
- LLM Data Contamination study by @TerryJCZhang et al.
- Mech Interp for VLM by @francescortu et al.
- DPO data selection method by Xuan & @rongwu_xu
Thanks to all our collaborators and institutional support from @MPI_IS @ELLISforEurope @UofTCompSci @VectorInst @TorontoSRI @CIFAR_News @JinesisLab @EuroSafeAI @ELLISInst_Tue @ETH_en @ETH_AI_Center @michigan_AI @UMichiganAI @UMichCSE!
Feel free to access the papers at arxiv.org/abs/2510.04891 arxiv.org/abs/2602.07943 arxiv.org/abs/2509.00072 arxiv.org/abs/2507.13868 arxiv.org/abs/2508.04149 🎉

EuroSafeAI reposted

Jinesis Lab (UToronto) @JinesisLab · Mar 26

Navigating mental health in the fast-paced world of AI research is a challenge we all face. 🧠 Join @strauss_irene and the @aclmentorship panel at #EACL2026 (Hybrid Rabat + Zoom) to discuss staying grounded. Submit/vote on questions here: app.sli.do/event/e4Em5p6f… #MentalHealth

EuroSafeAI reposted

Zhijing Jin @ZhijingJin · Mar 25

AI is threatening our democratic society by concentrating power, narrowing how we think, and flooding institutions faster than they can keep up. These risks emerge at the system level, and technical work alone won't fix them.
👉 Check out our whitepaper with 25+ researchers: zhijing-jin.com/d/2026-ai-risk…
💡 We introduce 7 threat models and ways forward.
✍️ Led by @davidguzman1120 with @DaveRBanerjee, @blin_kevin, @PepijnCobben, @gcorsi_, @x_angelohuang, @ChanglingXavier, Suvajit Majumder, @psyonp, @SimkoSamuel, @strauss_irene, and @TerryJCZhang
Advised by senior co-authors: @ashton1anderson, @Yoshua_Bengio, @MatthiasBethge, @RogerGrosse, Karoline Helbig, @david_lie, Richard Mallah, @radamihalcea, Susan Nesbitt, Susan Perry, @presnick, Stuart Russell, @mrinmayasachan, @bschoelkopf, @audreyt, and @ZhijingJin
Thank you to all the institutional support from @JinesisLab @EuroSafeAI @MPI_IS @CIFAR_News @iapsAI @CARMA_411 @Cambridge_Uni @UofTCompSci @VectorInst @TorontoSRI @Mila_Quebec @LawZero_ @uni_tue @michigan_AI @UMichCSE @AUParis @UNESCO @UCBerkeley @ETH_en @ETH_AI_Center @ELLISInst_Tue @ELLISforEurope @EthicsInAI
#CivicAI #AISafety #AIGovernance #Democracy #ResponsibleAI

EuroSafeAI reposted

Jinesis Lab (UToronto) @JinesisLab · Mar 25

🎉 Our lab has 7 papers at #EACL2026 in Rabat this week 🇲🇦 Topics span democracy defense, multi-agent safety, causal reasoning, hallucinations, and NLP for social good. Grateful to everyone who contributed to this work 🙌 Come find us! #NLProc #LLMs #ResponsibleAI
