Steven Xia

63 posts

@steven_xia_

PhD Student @illinoisCDS studying SE \\ Undergrad @eceuoft 2T1

Champaign-Urbana, Illinois · Joined February 2013

182 Following · 385 Followers

Steven Xia retweeted

Hwiwon Lee@hwiwonl·9 May

We’re happy to release 𝐒𝐄𝐂-𝐛𝐞𝐧𝐜𝐡 𝐏𝐫𝐨: a benchmark for measuring the bug-hunting capabilities of AI agents in critical software systems such as Chromium V8, Firefox SpiderMonkey, and more. Explore the details here: sec-bench.github.io

Steven Xia@steven_xia_·22 Nov

@slimshetty_ @LingmingZhang Also feel free to check out the paper for more detailed analysis. We have a whole bunch of examples and results looking at the different tools agents generate!

Steven Xia@steven_xia_·22 Nov

@slimshetty_ @LingmingZhang Cool question! For Gemini-3-Pro, it mainly creates tools that are well-suited for analyzing the repo (e.g., read/search/write files). There are also task-specific ones like creating edit tools that apply targeted patches that are quite difficult to do via bash commands only.

Steven Xia retweeted

Lingming Zhang@LingmingZhang·21 Nov

🤯🤯🤯 Gemini 3 Pro + Live-SWE-agent hits 77.4% on SWE-bench Verified, beating ALL existing models, including Claude 4.5!! 🤖 Live-SWE-agent is the first live software agent that autonomously self-evolves on the fly — and it even outperforms the manually engineered scaffold used by the Gemini 3 Pro team (76.2%)

Steven Xia retweeted

Lingming Zhang@LingmingZhang·18 Nov

🚀 Introducing Live-SWE-agent: 🤖 An autonomous coding agent that self-evolves on the fly while solving real-world issues. ✨ Simplistic design: no offline training, no heavy workflows. 👇 Surprising gains: 📌 75.4% on SWE-bench Verified 📌 45.8% on SWE-Bench Pro (new SOTA!)

Steven Xia retweeted

Toronto Blue Jays@BlueJays·21 Oct

WE'RE GOING TO THE WORLD SERIES!!!!! #WANTITALL

Steven Xia retweeted

OpenAI@OpenAI·13 Aug

We're releasing a new iteration of SWE-bench, in collaboration with the original authors, to more reliably evaluate AI models on their ability to solve real-world software issues. openai.com/index/introduc…

Steven Xia retweeted

Ziqi Zhang@ZiqiCharles·12 Aug

I'm on the way to USENIX Security'24. We will present one paper about privacy-preserving app authentication. I'm also happy to discuss TEE-based AI security and other AI-related security topics. If you're interested, please get in touch with me! #usenix #USESEC2024

Steven Xia retweeted

Lingming Zhang@LingmingZhang·3 Jul

Introducing OpenAutoCoder-Agentless😺: A simple agentless solution solves 27.3% GitHub issues on SWE-bench Lite with ~$0.34 each, outperforming all open-source AI SW agents! It's fully open-source, try it out: 🧑‍💻github.com/OpenAutoCoder/… 📝huggingface.co/papers/2407.01…

Steven Xia retweeted

AK@_akhaliq·5 Dec

Magicoder: Source Code Is All You Need paper page: huggingface.co/papers/2312.02… introduce Magicoder, a series of fully open-source (code, weights, and data) Large Language Models (LLMs) for code that significantly closes the gap with top code models while having no more than 7B parameters. Magicoder models are trained on 75K synthetic instruction data using OSS-Instruct, a novel approach to enlightening LLMs with open-source code snippets to generate high-quality instruction data for code. Our main motivation is to mitigate the inherent bias of the synthetic data generated by LLMs by empowering them with a wealth of open-source references for the production of more diverse, realistic, and controllable data. The orthogonality of OSS-Instruct and other data generation methods like Evol-Instruct further enables us to build an enhanced MagicoderS. Both Magicoder and MagicoderS substantially outperform state-of-the-art code models with similar or even larger sizes on a wide range of coding benchmarks, including Python text-to-code generation, multilingual coding, and data-science program completion. Notably, MagicoderS-CL-7B based on CodeLlama even surpasses the prominent ChatGPT on HumanEval+ (66.5 vs. 65.9 in pass@1). Overall, OSS-Instruct opens a new direction for low-bias and high-quality instruction tuning using abundant open-source references.

Steven Xia retweeted

Jiawei Liu@JiaweiLiu_·28 Nov

In the past 6-mon release of HumanEval+ we have been improving its toolchain usability and dataset quality from v0.1.0 to v0.1.7 releases. 🔥 Now we release MBPP+, a new benchmark in EvalPlus v0.2.0: tinyurl.com/4pw82wb8 🧵

Steven Xia@steven_xia_·28 Nov

@nenrightjerger make it 3 years in a row😅

Natalie Enright Jerger@nenrightjerger·28 Nov

As Purdue claims the top spot in this week's poll, I'm wondering which scrappy team from NJ is going to be our undoing this year?

Steven Xia retweeted

Jiawei Liu@JiaweiLiu_·16 Oct

Introducing the EvalPlus leaderboard! evalplus.github.io/leaderboard.ht… 🔥28 models have been evaluated on coding HumanEval & HumanEval+ 🔥7B CodeLlama outperforms ~16B models e.g. StarCoder&CodeGen 🔥Phind-CodeLlama-34B-v2 and WizardCoder-Python-34B-V1 as open models both beat ChatGPT 🧵

Steven Xia@steven_xia_·14 Aug

@fried_rice @DomSteinhoefel Definitely! We are working on that and hoping to add additional targets

Chaofan Shou@Fried_rice·14 Aug

@DomSteinhoefel Will it be opensourced?

Steven Xia retweeted

Talia Ringer 🕊🪬@TaliaRinger·18 May

OK now we have @steven_xia_ on his work with @YuxiangWei9 and @LingmingZhang, all of @plfmse. Talking about program repair using large pre-trained language models

Steven Xia@steven_xia_·10 May

@saikatdutta2012 @cs_cornell @CornellCIS So happy for you! Updating the ICSE slide to include this rn!

Saikat Dutta@saikatdutta2012·10 May

Super excited to announce that I will be joining Cornell as an Assistant Professor of CS (@cs_cornell, @CornellCIS) starting Fall 2024. I am truly grateful to my advisor, mentors, friends, and family, without whom this journey would not be possible!

Steven Xia retweeted

Jiawei Liu@JiaweiLiu_·6 May

We welcome everyone to try out 📚𝐇𝐮𝐦𝐚𝐧𝐄𝐯𝐚𝐥+! A dataset to reflect the "real" correctness of LLM-generated code. Using📚𝐇𝐮𝐦𝐚𝐧𝐄𝐯𝐚𝐥+ is the same as HumanEval. You can easily pip install it and evaluate in our prepared sandbox (optional). github.com/evalplus/evalp…

AK@_akhaliq

Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation. Extensive evaluation across 14 popular LLMs (including GPT-4 and ChatGPT) demonstrates that HUMANEVAL+ is able to catch significant amounts of previously undetected wrong code synthesized by LLMs, reducing the pass@k by 15.1% on average! For example, the pass@k of widely studied open-source models like CODEGEN-16B can drop by over 18.0%, while the performance of state-of-the-art commercial models like ChatGPT and GPT-4 can also drop by at least 13.0%, largely affecting the result analysis for almost all recent work on LLM-based code generation. abs: arxiv.org/abs/2305.01210 github: github.com/evalplus/evalp…

Steven Xia@steven_xia_·3 May

🚨 Evaluating LLM-generated code on datasets with just "3 test-cases" is NOT enough! 🚨 We built ✨HumanEval+✨: improving HumanEval with up to thousands of new tests to fully evaluate functional correctness of LLM generated code! @JiaweiLiu_ @YuyaoStarling @LingmingZhang

Steven Xia@steven_xia_·3 May

We have released our dataset as well as our code at: github.com/evalplus/evalp… Furthermore, we also included all studied LLM-generated code at: github.com/evalplus/evalp… ✨ x 14 models (10 model types) ✨ x 5 temperature settings ✨ x 200 code samples

Steven Xia@steven_xia_·3 May

Compared to the original HumanEval, our results on our new dataset show that on average, the pass@K performance of current popular state-of-the-art LLMs drops by 📉15% including both ChatGPT and GPT-4.
