Pinjia He

252 posts


@PinjiaHE

Assistant Professor at The Chinese University of Hong Kong, Shenzhen (CUHK-Shenzhen) @cuhksz.

Shenzhen, China · Joined March 2015
586 Following · 1.7K Followers
Pinned Tweet
Pinjia He @PinjiaHE
📢 Can LLMs locate software service failures? 🤔 My student @SiyuexiH's #ICLR2025 paper introduces OpenRCA, the first benchmark dataset for evaluating LLMs' root cause analysis capabilities in software systems. LLMs/Agents need to analyze system telemetry data to infer results for natural language queries. Experiments show current LLMs struggle with OpenRCA tasks without specialized RCA tools. Joint work with Microsoft and Tsinghua University. 🔗 Learn more: 📜 Paper: openreview.net/pdf?id=M4qNIzQ… 💻 Code: github.com/microsoft/Open… 📊 Leaderboard: microsoft.github.io/OpenRCA/ #iclr2025 #AI4SE #LLM #rootcauseanalysis
0 replies · 6 reposts · 15 likes · 3.5K views
Pinjia He reposted
Boxi Yu @BoshCavendish
🔥 SWE-ABS accepted by ICML2026 @icmlconf 🔥 OpenAI @OpenAI showed SWE-Bench @SWEbench tests reject correct patches. We reveal the other side: they also accept wrong ones. SWE-ABS strengthens SWE-Bench (Verified & Pro) via coverage-driven tests + mutation-based attacks. Key results:
• All top-30 rankings shift (#1 → #5)
• 19.78% of “solved” patches are actually wrong
• 50.2% of Verified strengthened
• 64.7% of the Pro subset strengthened
👉 Test quality, not benchmark difficulty, is the real bottleneck. Links 👇
3 replies · 8 reposts · 15 likes · 592 views
Pinjia He reposted
Boxi Yu @BoshCavendish
OpenAI just confirmed (openai.com/index/why-we-n…): SWE-Bench Verified has flawed tests that reject correct solutions -- 59.4% of their audited 27.6% subset. Their recommendation: stop using Verified, switch to Pro. But is Pro safe? We tested it. SWE-ABS strengthens 64.7% of sampled 150 SWE-Bench Pro instances -- weak tests are not a Verified-only problem. Instead of abandoning SWE-Bench Verified, we fix the tests. SWE-ABS rejects 19.78% of "solved" patches from the top-30 agents as semantically wrong, leading to a 14.56% average resolved rate drop -- and all 30 agents' rankings change. Introducing SWE-ABS: adversarial benchmark strengthening for code-agent evaluation. Paper: arxiv.org/abs/2603.00520 Code: github.com/OpenAgentEval/… Data: huggingface.co/datasets/OpenA…
2 replies · 6 reposts · 12 likes · 855 views
Xiang Li | 李想 @XiangHCI
I’m very happy to share that I will be joining @HKUniversity as an Assistant Professor in the Department of Data and Systems Engineering, starting in October 2026. 🥰 1/4
14 replies · 1 repost · 159 likes · 11.2K views
Pinjia He @PinjiaHE
Thrilled to see OpenRCA has been used by @AnthropicAI to evaluate its new @claudeai model's capability on the Root Cause Analysis (RCA) task. 👇Check out the original paper thread below.
Quoted: Pinjia He @PinjiaHE (the OpenRCA announcement, pinned above)
0 replies · 0 reposts · 4 likes · 353 views
Pinjia He reposted
Yichen Li @CSEI4
Couldn't out-code Claude Code, so I decided to work for it instead. We built an MCP server (will release it in 1 month) with program analysis tools. We did a lot of Claude Code-friendly optimization since LLMs can read more analysis results (e.g., trace call chains across 10+ packages) at a glance than humans. Claude Code tried it, was pleased, signaled I should keep working. All I could say is: YES SIR!🤣
2 replies · 2 reposts · 3 likes · 3.8K views
Pinjia He @PinjiaHE
ICLR is a great conference, we just hope the process can be more robust against high variance.
1 reply · 0 reposts · 10 likes · 5.3K views
Pinjia He @PinjiaHE
Heartbroken to receive a Reject for our #ICLR2026 submission (Rating: 8/6/6/6). The hardest part isn't the rejection itself, but the Meta-Review reasoning. The AC dismissed all reviewers' unanimous support, raised two new concerns (with factual errors themselves), and claimed "All reviews were superficial (while being marginally above the minimum bar for reviewers)." We believe in the peer review process, but a "single point of failure" overriding full consensus is tough to swallow.
13 replies · 11 reposts · 383 likes · 67.4K views
Pinjia He @PinjiaHE
@JAldrichPL Can't agree more. The job of a meta-reviewer is to break ties, not to be the decider.
0 replies · 0 reposts · 2 likes · 128 views
Jonathan Aldrich @JAldrichPL
I don't understand how anyone could think it's reasonable to have a reviewing system where meta-reviewers routinely override the clear consensus of reviewers. The job of a meta-reviewer (or PC chair) is to break ties, not to be the decider. I'm glad SIGPLAN does it better.
Quoted: Pinjia He @PinjiaHE (the #ICLR2026 rejection post above)
5 replies · 0 reposts · 13 likes · 2.1K views
Pinjia He reposted
Daniel Kang @ddkang
SWE-bench Verified is the gold standard for evaluating coding agents: 500 real-world issues + tests by OpenAI. Sounds bullet-proof? Not quite. We show passing its unit tests != matching ground truth. In our ACL paper, we fixed buggy evals: 24% of agents moved up or down the leaderboard! 1/7
11 replies · 34 reposts · 200 likes · 29K views
Pinjia He reposted
Chengyu Zhang @chengyuzh
I'm looking for PhD students starting Fall 2026! If you're interested in automated testing and trustworthy program verification, feel free to reach out via email or come chat with me at ISSTA/FSE next week!
Quoted: Chengyu Zhang @chengyuzh

Excited to share that two of our papers will be presented next week: one at SIGMOD (Tuesday), and another at the FUZZING Workshop @ ISSTA (Saturday)! The student collaborators from @ECNUER will present the papers. I’ll be at ISSTA/FSE next week—come say hi! Looking forward to great conversations and feedback. 👋 The SIGMOD work is a collaboration with @RiggerManuel, @DengWenjin48334, and Qiuyang Mang. We propose a geometry-aware test generator for spatial databases and prove metamorphic relations under affine transformations. This helped us uncover 34 previously unknown bugs in mainstream spatial database systems. The FUZZING workshop paper revisits combining static analysis and symbolic execution for precise bug finding. We show that accurate error traces from static analysis can actually help symbolic execution, but inaccurate traces can mislead symbolic execution and potentially human users.

3 replies · 11 reposts · 42 likes · 4.8K views
Pinjia He @PinjiaHE
Work by my student Xiaoyuan Liu (@xyliu_cs) in collaboration with Tencent. #ACL2025NLP
Quoted: Zhaopeng Tu @tuzhaopeng

When eyes and memory clash, who wins? 👁️🧠 Introducing a comprehensive study on vision-knowledge conflicts in MLLMs, where visual input contradicts the model's internal commonsense knowledge—and the results might surprise you. #ACL2025NLP 📈 We developed an automated framework to generate ConflictVis benchmark: 374 original images with 1,122 QA pairs designed to test when MLLMs see one thing but "know" another. 📊 Shocking findings across 9 leading MLLMs: 1⃣ ~20% over-reliance on parametric knowledge over visual evidence 2⃣ Yes-No questions show 43.6% memorization bias (Claude-3.5-Sonnet) 3️⃣ Action-related conflicts are 10.4% more problematic than place conflicts 👀 We propose "Focus-on-Vision" prompting strategy that significantly improves performance by instructing models to prioritize what they see over what they remember. Despite improvements, vision-knowledge conflicts remain a persistent challenge for multimodal AI systems. 📃 Paper: arxiv.org/abs/2410.08145

0 replies · 0 reposts · 4 likes · 723 views
Pinjia He reposted
Dominik Winterer @DominikWinterer
🚀 I'll be launching the Formal Methods Engineering Lab (manchester-fme.github.io) – and I am hiring! If you’re interested in working with me, feel free to reach out.
Quoted: Dominik Winterer @DominikWinterer (the Manchester announcement, shown in full below)
1 reply · 11 reposts · 29 likes · 4.9K views
Dominik Winterer @DominikWinterer
Super excited to share that I will be joining The University of Manchester (@OfficialUoM) as a Lecturer (Assistant Professor) in Cyber Security! The Systems and Software Security group at Manchester is already incredibly impressive, and I’m honored to help further strengthen it.
15 replies · 5 reposts · 68 likes · 8.2K views
Pinjia He reposted
Chao Peng @chao_peng_
We’re proud to bring @Trae_ai to @ICSEconf. Our booth, product showcase, banquet, and workshops were a great success. Huge thanks to everyone who joined our events. Looking forward to deeper collaboration in AI4SE research. See you again at @FSEconf !
2 replies · 2 reposts · 24 likes · 3.1K views
Lin Tan @Lin0Tan
Our SELP paper is an #ICRA25 Best Paper Award Finalist, among a selected few from 4,153 submissions! 🏆 Proud of my PhD student @yiwu5cs & the team! cs.purdue.edu/homes/lintan/p… #robotics #LLM #ConstrainedDecoding #Agent #LLMPlanner @PurdueCS @anikbera @ieee_ras_icra
Quoted: Lin Tan @Lin0Tan

Introducing our first #ICRA2025 paper, SELP (Safe Efficient LLM Planner), a method for generating plans for robot agents that adhere to user constraints while optimizing for time-efficient execution. 🔗 Preprint: arxiv.org/pdf/2409.19471 #LLMs #Robotics #Agent

8 replies · 2 reposts · 39 likes · 7.2K views
Lionel Briand @lionel_c_briand
Good news! I was elected to Academia Europaea, the European Academy of Science. I am very honored. Thank you to my nominator and endorsers! ae-info.org
5 replies · 0 reposts · 47 likes · 1.6K views
Chengyu Zhang @chengyuzh
@PinjiaHE Thank you, Pinjia! I may look for some excellent students from your university😉.
1 reply · 0 reposts · 1 like · 120 views
Chengyu Zhang @chengyuzh
I am pleased to share that I have started a new position as a Lecturer (equivalent to an Assistant Professor in the US) at @lborouniversity. Thanks to @zhendongsu and all my colleagues and friends at @ast_eth and beyond. Your support has meant a lot. I will be working on trustworthy automated reasoning and its applications to software reliability. Feel free to reach out if you are interested in PhD opportunities, visiting, or collaboration.
19 replies · 6 reposts · 59 likes · 4K views