Pinjia He

252 posts


@PinjiaHE

Assistant Professor at The Chinese University of Hong Kong, Shenzhen (CUHK-Shenzhen) @cuhksz.

Shenzhen, China · Joined March 2015
586 Following · 1.7K Followers
Pinned Tweet
Pinjia He @PinjiaHE
📢 Can LLMs locate software service failures? 🤔 My student @SiyuexiH's #ICLR2025 paper introduces OpenRCA, the first benchmark dataset for evaluating LLMs' root cause analysis capabilities in software systems. LLMs/Agents need to analyze system telemetry data to infer results for natural language queries. Experiments show current LLMs struggle with OpenRCA tasks without specialized RCA tools. Joint work with Microsoft and Tsinghua University. 🔗 Learn more: 📜 Paper: openreview.net/pdf?id=M4qNIzQ… 💻 Code: github.com/microsoft/Open… 📊 Leaderboard: microsoft.github.io/OpenRCA/ #iclr2025 #AI4SE #LLM #rootcauseanalysis
0 replies · 6 reposts · 15 likes · 3.5K views
Pinjia He reposted
Boxi Yu @BoshCavendish
🔥 SWE-ABS accepted by ICML2026 @icmlconf 🔥 OpenAI @OpenAI showed SWE-Bench @SWEbench tests reject correct patches. We reveal the other side: they also accept wrong ones. SWE-ABS strengthens SWE-Bench (Verified & Pro) via coverage-driven tests + mutation-based attacks. Key results:
• All top-30 rankings shift (#1 → #5)
• 19.78% of “solved” patches are actually wrong
• 50.2% of Verified strengthened
• 64.7% of the Pro subset strengthened
👉 Test quality, not benchmark difficulty, is the real bottleneck. Links 👇
3 replies · 8 reposts · 15 likes · 592 views
Pinjia He reposted
Boxi Yu @BoshCavendish
OpenAI just confirmed (openai.com/index/why-we-n…): SWE-Bench Verified has flawed tests that reject correct solutions -- 59.4% of their audited 27.6% subset. Their recommendation: stop using Verified, switch to Pro. But is Pro safe? We tested it. SWE-ABS strengthens 64.7% of sampled 150 SWE-Bench Pro instances -- weak tests are not a Verified-only problem. Instead of abandoning SWE-Bench Verified, we fix the tests. SWE-ABS rejects 19.78% of "solved" patches from the top-30 agents as semantically wrong, leading to a 14.56% average resolved rate drop -- and all 30 agents' rankings change. Introducing SWE-ABS: adversarial benchmark strengthening for code-agent evaluation. Paper: arxiv.org/abs/2603.00520 Code: github.com/OpenAgentEval/… Data: huggingface.co/datasets/OpenA…
2 replies · 6 reposts · 12 likes · 855 views
Xiang Li | 李想 @XiangHCI
I’m very happy to share that I will be joining @HKUniversity as an Assistant Professor in the Department of Data and Systems Engineering, starting in October 2026. 🥰 1/4
14 replies · 1 repost · 159 likes · 11.2K views
Pinjia He @PinjiaHE
Thrilled to see OpenRCA has been used by @AnthropicAI to evaluate its new @claudeai model's capability on the Root Cause Analysis (RCA) task. 👇Check out the original paper thread below.
Quoted: Pinjia He @PinjiaHE (the OpenRCA announcement, pinned above)
0 replies · 0 reposts · 4 likes · 353 views
Pinjia He reposted
Yichen Li @CSEI4
Couldn't out-code Claude Code, so I decided to work for it instead. We built an MCP server (will release it in 1 month) with program analysis tools. We did a lot of Claude Code-friendly optimization since LLMs can read more analysis results (e.g., trace call chains across 10+ packages) at a glance than humans. Claude Code tried it, was pleased, signaled I should keep working. All I could say is: YES SIR!🤣
2 replies · 2 reposts · 3 likes · 3.8K views
Pinjia He @PinjiaHE
ICLR is a great conference, we just hope the process can be more robust against high variance.
1 reply · 0 reposts · 10 likes · 5.3K views
Pinjia He @PinjiaHE
Heartbroken to receive a Reject for our #ICLR2026 submission (Rating: 8/6/6/6). The hardest part isn't the rejection itself, but the Meta-Review reasoning. The AC dismissed all reviewers' unanimous support, raised two new concerns (with factual errors themselves), and claimed "All reviews were superficial (while being marginally above the minimum bar for reviewers)." We believe in the peer review process, but a "single point of failure" overriding full consensus is tough to swallow.
13 replies · 11 reposts · 383 likes · 67.4K views
Pinjia He @PinjiaHE
@JAldrichPL Can't agree more. The job of a meta-reviewer is to break ties, not to be the decider.
0 replies · 0 reposts · 2 likes · 128 views
Jonathan Aldrich @JAldrichPL
I don't understand how anyone could think it's reasonable to have a reviewing system where meta-reviewers routinely override the clear consensus of reviewers. The job of a meta-reviewer (or PC chair) is to break ties, not to be the decider. I'm glad SIGPLAN does it better.
Quoted: Pinjia He @PinjiaHE (the #ICLR2026 rejection post above)
5 replies · 0 reposts · 13 likes · 2.1K views
Pinjia He reposted
Daniel Kang @ddkang
SWE-bench Verified is the gold standard for evaluating coding agents: 500 real-world issues + tests by OpenAI. Sounds bullet-proof? Not quite. We show passing its unit tests != matching ground truth. In our ACL paper, we fixed buggy evals: 24% of agents moved up or down the leaderboard! 1/7
11 replies · 34 reposts · 200 likes · 29K views
Pinjia He reposted
Chengyu Zhang @chengyuzh
I'm looking for PhD students starting Fall 2026! If you're interested in automated testing and trustworthy program verification, feel free to reach out via email or come chat with me at ISSTA/FSE next week!
Quoted: Chengyu Zhang @chengyuzh

Excited to share that two of our papers will be presented next week: one at SIGMOD (Tuesday), and another at the FUZZING Workshop @ ISSTA (Saturday)! The student collaborators from @ECNUER will present the papers. I’ll be at ISSTA/FSE next week—come say hi! Looking forward to great conversations and feedback. 👋 The SIGMOD work is a collaboration with @RiggerManuel, @DengWenjin48334, and Qiuyang Mang. We propose a geometry-aware test generator for spatial databases and prove metamorphic relations under affine transformations. This helped us uncover 34 previously unknown bugs in mainstream spatial database systems. The FUZZING workshop paper revisits combining static analysis and symbolic execution for precise bug finding. We show that accurate error traces from static analysis can actually help symbolic execution, but inaccurate traces can mislead symbolic execution and potentially human users.

3 replies · 11 reposts · 42 likes · 4.8K views
Pinjia He @PinjiaHE
Work by my student Xiaoyuan Liu (@xyliu_cs) in collaboration with Tencent. #ACL2025NLP
Quoted: Zhaopeng Tu @tuzhaopeng

When eyes and memory clash, who wins? 👁️🧠 Introducing a comprehensive study on vision-knowledge conflicts in MLLMs, where visual input contradicts the model's internal commonsense knowledge—and the results might surprise you. #ACL2025NLP 📈 We developed an automated framework to generate ConflictVis benchmark: 374 original images with 1,122 QA pairs designed to test when MLLMs see one thing but "know" another. 📊 Shocking findings across 9 leading MLLMs: 1⃣ ~20% over-reliance on parametric knowledge over visual evidence 2⃣ Yes-No questions show 43.6% memorization bias (Claude-3.5-Sonnet) 3️⃣ Action-related conflicts are 10.4% more problematic than place conflicts 👀 We propose "Focus-on-Vision" prompting strategy that significantly improves performance by instructing models to prioritize what they see over what they remember. Despite improvements, vision-knowledge conflicts remain a persistent challenge for multimodal AI systems. 📃 Paper: arxiv.org/abs/2410.08145

0 replies · 0 reposts · 4 likes · 723 views
Pinjia He reposted
Dominik Winterer @DominikWinterer
🚀 I'll be launching the Formal Methods Engineering Lab (manchester-fme.github.io) – and I am hiring! If you’re interested in working with me, feel free to reach out.
Quoted: Dominik Winterer @DominikWinterer (the Manchester announcement, shown in full below)
1 reply · 11 reposts · 29 likes · 4.9K views
Dominik Winterer @DominikWinterer
Super excited to share that I will be joining The University of Manchester (@OfficialUoM) as a Lecturer (Assistant Professor) in Cyber Security! The Systems and Software Security group at Manchester is already incredibly impressive, and I’m honored to help further strengthen it.
15 replies · 5 reposts · 68 likes · 8.2K views
Pinjia He reposted
Chao Peng @chao_peng_
We’re proud to bring @Trae_ai to @ICSEconf. Our booth, product showcase, banquet, and workshops were a great success. Huge thanks to everyone who joined our events. Looking forward to deeper collaboration in AI4SE research. See you again at @FSEconf !
2 replies · 2 reposts · 24 likes · 3.1K views
Lin Tan @Lin0Tan
Our SELP paper is an #ICRA25 Best Paper Award Finalist, among a selected few from 4,153 submissions! 🏆 Proud of my PhD student @yiwu5cs & the team! cs.purdue.edu/homes/lintan/p… #robotics #LLM #ConstrainedDecoding #Agent #LLMPlanner @PurdueCS @anikbera @ieee_ras_icra
Quoted: Lin Tan @Lin0Tan

Introducing our first #ICRA2025 paper, SELP (Safe Efficient LLM Planner), a method for generating plans for robot agents that adhere to user constraints while optimizing for time-efficient execution. 🔗 Preprint: arxiv.org/pdf/2409.19471 #LLMs #Robotics #Agent

8 replies · 2 reposts · 39 likes · 7.2K views
Lionel Briand @lionel_c_briand
Good news! I was elected to Academia Europaea, the European Academy of Science. I am very honored. Thank you to my nominator and endorsers! ae-info.org
5 replies · 0 reposts · 47 likes · 1.6K views
Chengyu Zhang @chengyuzh
@PinjiaHE Thank you, Pinjia! I may look for some excellent students from your university😉.
1 reply · 0 reposts · 1 like · 120 views
Chengyu Zhang @chengyuzh
I am pleased to share that I have started a new position as a Lecturer (equivalent to an Assistant Professor in the US) at @lborouniversity. Thanks to @zhendongsu and all my colleagues and friends at @ast_eth and beyond. Your support has meant a lot. I will be working on trustworthy automated reasoning and its applications to software reliability. Feel free to reach out if you are interested in PhD opportunities, visiting, or collaboration.
19 replies · 6 reposts · 59 likes · 4K views