Steven Xia

63 posts

@steven_xia_

PhD Student @illinoisCDS studying SE \\ Undergrad @eceuoft 2T1

Champaign-Urbana, Illinois · Joined February 2013
182 Following · 385 Followers
Steven Xia retweeted
Hwiwon Lee
Hwiwon Lee@hwiwonl·
We’re happy to release 𝐒𝐄𝐂-𝐛𝐞𝐧𝐜𝐡 𝐏𝐫𝐨: a benchmark for measuring the bug-hunting capabilities of AI agents in critical software systems such as Chromium V8, Firefox SpiderMonkey, and more. Explore the details here: sec-bench.github.io
Steven Xia
Steven Xia@steven_xia_·
@slimshetty_ @LingmingZhang Also feel free to check out the paper for more detailed analysis. We have a whole bunch of examples and results looking at the different tools agents generate!
Steven Xia
Steven Xia@steven_xia_·
@slimshetty_ @LingmingZhang Cool question! For Gemini-3-Pro, it mainly creates tools that are well-suited for analyzing the repo (e.g., read/search/write files). There are also task-specific ones, like edit tools that apply targeted patches, which are quite difficult to do via bash commands alone.
Steven Xia retweeted
Lingming Zhang
Lingming Zhang@LingmingZhang·
🤯🤯🤯 Gemini 3 Pro + Live-SWE-agent hits 77.4% on SWE-bench Verified, beating ALL existing models, including Claude 4.5!! 🤖 Live-SWE-agent is the first live software agent that autonomously self-evolves on the fly — and it even outperforms the manually engineered scaffold used by the Gemini 3 Pro team (76.2%)
Steven Xia retweeted
Lingming Zhang
Lingming Zhang@LingmingZhang·
🚀 Introducing Live-SWE-agent: 🤖 An autonomous coding agent that self-evolves on the fly while solving real-world issues. ✨ Simplistic design: no offline training, no heavy workflows. 👇 Surprising gains: 📌 75.4% on SWE-bench Verified 📌 45.8% on SWE-Bench Pro (new SOTA!)
Steven Xia retweeted
OpenAI
OpenAI@OpenAI·
We're releasing a new iteration of SWE-bench, in collaboration with the original authors, to more reliably evaluate AI models on their ability to solve real-world software issues. openai.com/index/introduc…
Steven Xia retweeted
Ziqi Zhang
Ziqi Zhang@ZiqiCharles·
I'm on the way to USENIX Security'24. We will present one paper about privacy-preserving app authentication. I'm also happy to discuss TEE-based AI security and other AI-related security topics. If you're interested, please get in touch with me! #usenix #USESEC2024
Steven Xia retweeted
AK
AK@_akhaliq·
Magicoder: Source Code Is All You Need
paper page: huggingface.co/papers/2312.02…

introduce Magicoder, a series of fully open-source (code, weights, and data) Large Language Models (LLMs) for code that significantly closes the gap with top code models while having no more than 7B parameters. Magicoder models are trained on 75K synthetic instruction data using OSS-Instruct, a novel approach to enlightening LLMs with open-source code snippets to generate high-quality instruction data for code. Our main motivation is to mitigate the inherent bias of the synthetic data generated by LLMs by empowering them with a wealth of open-source references for the production of more diverse, realistic, and controllable data.

The orthogonality of OSS-Instruct and other data generation methods like Evol-Instruct further enables us to build an enhanced MagicoderS. Both Magicoder and MagicoderS substantially outperform state-of-the-art code models with similar or even larger sizes on a wide range of coding benchmarks, including Python text-to-code generation, multilingual coding, and data-science program completion. Notably, MagicoderS-CL-7B based on CodeLlama even surpasses the prominent ChatGPT on HumanEval+ (66.5 vs. 65.9 in pass@1). Overall, OSS-Instruct opens a new direction for low-bias and high-quality instruction tuning using abundant open-source references.
Steven Xia retweeted
Jiawei Liu
Jiawei Liu@JiaweiLiu_·
In the 6 months since the release of HumanEval+, we have been improving its toolchain usability and dataset quality across releases v0.1.0 to v0.1.7. 🔥 Now we release MBPP+, a new benchmark in EvalPlus v0.2.0: tinyurl.com/4pw82wb8 🧵
Natalie Enright Jerger
Natalie Enright Jerger@nenrightjerger·
As Purdue claims the top spot in this week's poll, I'm wondering which scrappy team from NJ is going to be our undoing this year?
Steven Xia retweeted
Jiawei Liu
Jiawei Liu@JiaweiLiu_·
Introducing the EvalPlus leaderboard! evalplus.github.io/leaderboard.ht…
🔥 28 models have been evaluated on coding HumanEval & HumanEval+
🔥 7B CodeLlama outperforms ~16B models, e.g., StarCoder & CodeGen
🔥 Phind-CodeLlama-34B-v2 and WizardCoder-Python-34B-V1 as open models both beat ChatGPT
🧵
Saikat Dutta
Saikat Dutta@saikatdutta2012·
Super excited to announce that I will be joining Cornell as an Assistant Professor of CS (@cs_cornell, @CornellCIS) starting Fall 2024. I am truly grateful to my advisor, mentors, friends, and family, without whom this journey would not be possible!
Steven Xia retweeted
Jiawei Liu
Jiawei Liu@JiaweiLiu_·
We welcome everyone to try out 📚𝐇𝐮𝐦𝐚𝐧𝐄𝐯𝐚𝐥+! A dataset to reflect the "real" correctness of LLM-generated code. Using 📚𝐇𝐮𝐦𝐚𝐧𝐄𝐯𝐚𝐥+ is the same as using HumanEval. You can easily pip install it and evaluate in our prepared sandbox (optional). github.com/evalplus/evalp…
AK@_akhaliq

Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation

extensive evaluation across 14 popular LLMs (including GPT-4 and ChatGPT) demonstrates that HUMANEVAL+ is able to catch significant amounts of previously undetected wrong code synthesized by LLMs, reducing the pass@k by 15.1% on average! For example, the pass@k of widely studied open-source models like CODEGEN-16B can drop by over 18.0%, while the performance of state-of-the-art commercial models like ChatGPT and GPT-4 can also drop by at least 13.0%, largely affecting the result analysis for almost all recent work on LLM-based code generation

abs: arxiv.org/abs/2305.01210
github: github.com/evalplus/evalp…

Steven Xia
Steven Xia@steven_xia_·
🚨 Evaluating LLM-generated code on datasets with just "3 test-cases" is NOT enough! 🚨 We built ✨HumanEval+✨: improving HumanEval with up to thousands of new tests to fully evaluate the functional correctness of LLM-generated code! @JiaweiLiu_ @YuyaoStarling @LingmingZhang
Steven Xia
Steven Xia@steven_xia_·
Compared to the original HumanEval, results on our new dataset show that the pass@k performance of popular state-of-the-art LLMs, including both ChatGPT and GPT-4, drops by 📉15% on average.
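For readers unfamiliar with the metric in the tweets above: pass@k is commonly computed with the unbiased estimator introduced alongside the original HumanEval benchmark, 1 - C(n-c, k) / C(n, k), where n samples are generated per problem and c of them pass all tests. The sketch below assumes that estimator; the thread does not say exactly how the 15% figure was computed, and the sample counts here are made-up illustrations, not EvalPlus results.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem: n samples, c of them correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: any draw of k contains a correct one.
        return 1.0
    # Probability that k samples drawn without replacement are all incorrect,
    # subtracted from 1.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical counts: the same 10 samples, judged by a small test suite
# (6 pass) and by a stricter, larger one (only 4 still pass).
base = pass_at_k(n=10, c=6, k=1)
plus = pass_at_k(n=10, c=4, k=1)
print(f"pass@1 with base tests: {base:.2f}, with stricter tests: {plus:.2f}")
```

Adding tests can only keep c the same or lower it for a fixed set of samples, so pass@k on an extended test suite is never higher than on the original one.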