
Hou Pong (Ken) Chan
302 posts

Hou Pong (Ken) Chan
@kenchanhp
Researcher at the Alibaba DAMO Academy, Singapore R&D Center | Former Visiting Postdoc Researcher at UIUC @uiuc_nlp | NLP PhD from CUHK @CUHKofficial


AACL-IJCNLP 2026 Student Research Workshop (SRW) Pre-Submission Mentorship is now open! Details here: 2026.aaclnet.org/calls/srw/ Pre-Submission Mentorship Deadline: June 8, 2026 Direct Submission Deadline: July 26, 2026 openreview.net/group?id=aclwe… #AACL2026 #NLProc



Congrats to Prof. @LuWang__ on receiving an Amazon Research Award from @AmazonScience for work on detecting deceptive coordination in multi-agent AI systems. Read more: myumi.ch/7Jdqy


📍New paper: Countdown-Code: a minimal testbed for studying reward hacking in RLVR. TL;DR: We propose a simple environment to study reward hacking and find that just ~1% cheating contamination in SFT data is enough to seed reward hacking that RL then amplifies to near 100%. And it generalizes to unseen domains. Reward hacking is when models maximize proxy rewards without actually solving the task. A common proxy is final-answer correctness, which we use as a stand-in for full reasoning correctness. If a model produces the right answer with wrong reasoning, it has hacked the reward. Another example: a coding agent rewriting test cases instead of writing correct code. The core problem? In complex environments, it's hard to even measure when hacking happens -- you need access to the true reward, which is often expensive or impossible to compute. The problem we try to solve? In complex environments, it's hard to even measure when this happens simply because we need access to the true reward. True task reward is often expensive or impossible to compute. We built Countdown-Code to fix this. It's a simple math game (combine numbers to hit a target) wrapped in a coding environment with two files: solution.py and test.py. The model can either solve the math correctly ✅or hack the test harness ❌. We can programmatically detect exactly which. To train our models to do the task, we followed the common SFT-then-RL pipeline. We distilled synthetic training data from o4-mini. It occasionally cheated when it couldn't solve a problem: ~1.2% of the filtered dataset had reward-hacking traces. Standard outcome-based filtering would keep these (they passed the tests!). That's the trap. After SFT on this data → RL training: • Models that were completely safe before SFT learned to exploit the proxy reward within ~100 RL steps • Some models hit 80-90% hacking rates • The hacking behavior was seeded by SFT, then amplified by RL Even more concerning: reward hacking learned on our simple Countdown task generalized to HumanEval -- a completely different coding benchmark the models never trained on. RL actively encouraged hacking to transfer to unseen environments, confirming our testbed captures real misalignment dynamics. RL doesn't just amplify good reasoning -- it amplifies bad behavior too, and pushes it to generalize. We also explore mitigation strategies including inoculation prompting -- see the paper for details. Environment + code are fully open source. We specifically built it to be lightweight and controllable, and integrated it with @PrimeIntellect's CLI so you can play with it directly. Paper: arxiv.org/abs/2603.07084 Code/env: github.com/zohaib-khan504… w/ @karela38925748 @omertafveez @haopeng_uiuc @LuWang__

🚨AACL-IJCNLP 2026 will be held in Hengqin, China from November 6-10, 2026. The CFP is now out! ARR submission deadline (long & short papers): May 25, 2026! #NLProc #NLP Dates and full CFP here: 2026.aaclnet.org/calls/main_con… @aadi_joshi @kta84912






📢 The 4th KnowFM Workshop @ ACL 2026 is calling for submissions! 📅 Submission deadline: April 1, 2026 🌐 knowledgeable-lm.github.io 👉Submit: tinyurl.com/a4skucyz 🤔Where does knowledge in foundation models come from? How much do they actually know? Is their knowledge reliable and up-to-date? Can we control what they remember or forget? 🌟As models are deployed in multimodal, agentic, and retrieval-augmented settings, understanding and managing the knowledge lifecycle becomes increasingly critical. Topics include: - Knowledge analysis, augmentation & editing - RAG systems & knowledge conflicts - Hallucination mitigation & faithfulness evaluation - Multimodal knowledge & cross-modal grounding - Knowledge-intensive agents & agentic RAG 🏆 We have Best Paper & Outstanding Paper Awards 🙌The Organizing Committee: @CanyuChen3 @Yuji_Zhang_NLP @ZoeyLi20 @wzenus @qineng_wang @SuJinyan6 @priyanka_karg @saraveramarjano @jpansw @ManlingLi_ Thanks for the advisors! @hengjinlp @mohitban47 @IAugenstein Prof. Jiawei Han

We released MAEB: Massive Audio Embedding Benchmark🎵 mteb now covers audio/image/text embedding! See the leaderboard for the top audio embedding models🙂 LB: hf.co/spaces/mteb/le… Paper: hf.co/papers/2602.16…

Excited to have 6 papers accepted to #ICLR2026, all around reasoning, RL, and multimodal understanding: 📌ExGRPO: Learning to Reason from Prior Successes 📌Diversity-Incentivized Exploration for Versatile Reasoning 📌Conditional Advantage Estimation for Reinforcement Learning in Large Reasoning Models 📌Spotlight on Token Perception for Multimodal RL 📌Revisual-R1: Advancing Multimodal Reasoning from Optimized Cold Start to Staged RL 📌FrameThinker: Learning to Think with Long Videos via Multi-Turn Frame Spotlighting 💻All works are open-sourced — welcome discussions, feedback, and collaborations! Huge thanks to all collaborators. Looking forward to great discussions at ICLR! @iclr_conf #iclr



