Jielin Qiu

32 posts

@_Jason_Q

Research Scientist @Salesforce AI Research, Ph.D. from @SCSatCMU

Carnegie Mellon University · Joined January 2021
177 Following · 72 Followers
Jielin Qiu reposted
Weiran Yao (@iscreamnearby)
Today I finally get to share something our team has been quietly grinding on for months: we've created an open-sourced version of Cursor Bench (@cursor_ai). If you've been following Cursor's Composer launch and their internal "Cursor Bench" for testing vibe coding models, you can think of our LCBA bench as the open-source, model-agnostic counterpart. Here is what we provide, by @SFResearch.
With LCBA bench we:
• Ship a Cursor-style agent stack: ReAct loop, semantic @ codebase search, grep, file read/write, refactor tools, and a three-tier memory system inspired by production coding assistants like Cursor.
• Take 8,000 real-world vibe coding scenarios and turn them into interactive agent gyms across 10 languages and 10K–1M token codebases.
• Let you plug in any model (GPT-5, Claude Sonnet 4.5, Gemini 2.5 Pro, etc.) and see how it actually behaves on long, messy, multi-turn coding tasks.
A few fun findings: Cursor-style agents with context management are surprisingly robust at 1M-token contexts, but there's a hard trade-off between deep exploration and efficiency: no single frontier model sits in the "perfect" top-right corner yet. Anthropic Claude 4.5 and Google Gemini 2.5 Pro are on the Pareto frontier.
Everything is open source (agent, code, scenarios, traces, metrics) on @huggingface:
📄 Tech Report: arxiv.org/pdf/2509.09614
🤖 GitHub: github.com/SalesforceAIRe…
🤗 Dataset: huggingface.co/datasets/jason…
If you're building coding agents, benchmarking your model against GPT/Claude/Gemini, or want to train your coding agents with RL in real coding environments, we'd love for you to try LCBA bench and tell us your findings!
2
6
7
509
Jielin Qiu reposted
Salesforce AI Research (@SFResearch)
🚨 Introducing LoCoBench-Agent: a comprehensive benchmark for evaluating LLM agents in long-context software engineering
📄 Paper: bit.ly/49mPrBv
🔗 GitHub: bit.ly/3KbpkTN
✨ Key Features:
🤖 8,000 interactive agent scenarios with multi-turn conversations (up to 50 turns)
🔍 Context lengths: 10K-1M tokens across 10 programming languages
⚡ 9 bias-free evaluation metrics (5 comprehension + 4 efficiency)
🛠️ 8 specialized development tools: file operations, semantic search, grep, code analysis
🎯 8 task categories: architectural understanding, cross-file refactoring, multi-session development, bug investigation, feature implementation, code comprehension, integration testing, and security analysis
🔬 Key Findings:
- Fundamental comprehension-efficiency trade-off
- Tool usage patterns matter more than raw capabilities
- Strategic exploration > exhaustive exploration
LoCoBench-Agent assesses agent behavior across extended development sessions, measuring context retention, adaptive strategy refinement, and tool usage efficiency.
Authors: Jielin Qiu @_Jason_Q, Zuxin Liu @LiuZuxin, Zhiwei Liu @JYJimLiu, Rithesh Murthy @rithesh__rn, Jianguo Zhang @JianguoZhang3, Haolin Chen @HaolinChen11, Shiyu Wang @shiyu04490786, Ming Zhu @ming_zhu0527, Liangwei Yang @Liangwei_Yang, Juntao Tan @chrisjtan, Roshan Ram @shoonyaka1, Akshara Prabhakar @aksh_555, Tulika Awalgaonkar @tulika614, Zixiang Chen @_zxchen_, Zhepeng Cen @ZhepengCen, Cheng Qian @qiancheng1231, Shelby Heinecke @shelbyh_ai, Weiran Yao @iscreamnearby, Silvio Savarese @silviocinguetta, Caiming Xiong @CaimingXiong, Huan Wang @huan__wang
#LLM #AIAgents #SoftwareEngineering #MachineLearning #Benchmark #FutureOfAI #EnterpriseAI
4
3
13
2.4K
Jielin Qiu reposted
Salesforce AI Research (@SFResearch)
🚨 Introducing LoCoBench: a comprehensive benchmark for evaluating long-context LLMs in complex software development
📄 Paper: bit.ly/4ponX3P
🔗 GitHub: bit.ly/4pvIfbZ
✨ Key Features:
📊 8,000 evaluation scenarios across 10 programming languages
🔍 Context lengths: 10K-1M tokens (100× variation!)
⚡ 17 evaluation metrics across 4 dimensions (6 newly proposed)
🎯 8 essential task categories: architectural understanding, cross-file refactoring, multi-session development, bug investigation, feature implementation, code comprehension, integration testing, and security analysis
Current SOTA models show dramatic performance drops as context increases, highlighting critical gaps in long-context understanding for real-world software engineering.
Authors: Jielin Qiu @_Jason_Q, Zuxin Liu @LiuZuxin, Zhiwei Liu @JYJimLiu, Rithesh Murthy @rithesh__rn, Jianguo Zhang @JianguoZhang3, Haolin Chen @HaolinChen11, Shiyu Wang @shiyu04490786, Ming Zhu @ming_zhu0527, Liangwei Yang @Liangwei_Yang, Juntao Tan @chrisjtan, Zhepeng Cen @ZhepengCen, Cheng Qian @qiancheng1231, Shelby Heinecke @shelbyh_ai, Weiran Yao @iscreamnearby, Silvio Savarese @silviocinguetta, Caiming Xiong @CaimingXiong, Huan Wang @huan__wang
#LLM #SoftwareEngineering #MachineLearning #Benchmark #FutureOfAI #EnterpriseAI
0
13
19
2.3K
Jielin Qiu reposted
Ce Zhang (@ce_zhang)
Excited to see the first paper accepted at @DMLRJournal. Over the last few months, we have been fascinated by the quality of the reviews and the engaging interactions between authors and reviewers! Thanks, everyone! Please continue to send your best work about Data x ML 😀
Journal of Data-centric Machine Learning Research (@DMLRJournal)
'Benchmarking Robustness of Multimodal Image-Text Models under Distribution Shift' by Jielin Qiu, Yi Zhu, Xingjian Shi, Florian Wenzel, Zhiqiang Tang, Ding Zhao, Bo Li, Mu Li
Action Editor: Hongyang Zhang
openreview.net/forum?id=Vc1fX…
#Multimodal #Robustness #DistributionShift
0
3
15
2.3K
Jielin Qiu reposted
Danqing Wang (@dqwang122)
📚🌟 Evaluate any story to your heart's content with our new personalized story evaluation model, PerSE! No more worries about diverse preferences - get your own story evaluation report now! 📝🎯 arxiv.org/abs/2310.03304 1/5
1
9
30
19.1K
Jielin Qiu reposted
Wenda Xu (@WendaXu2)
What is missing in text generation evaluation with BERTScore, BLEURT, COMET, SEScore & SEScore2? Explanation! Can we build a metric that not only produces a well-correlated quality score but also tells you the rationales, error type, and error location? Check out InstructScore!
7
13
85
15K
Jielin Qiu reposted
Danqing Wang (@dqwang122)
🚀 Excited to share our latest work in the EMNLP main conference: "Learning from Mistakes via Interactive Study Assistant for Large Language Models". We introduce a study assistant (SALAM) that conducts thoughtful analysis of LLMs' mistakes and provides guidelines to avoid past mistakes.
1
5
17
3K
Jielin Qiu reposted
Kexun Zhang (@kexun_zhang)
😭Tired of in-context demos & docs for LLM tool use? 💰Too GPU-poor to tune LLMs for unseen tools? 🤬Frustrated with frequent syntax errors in tool calls? Check out our new preprint 𝐓𝐨𝐨𝐥𝐃𝐞𝐜 that addresses all these issues from the decoding side! arxiv.org/abs/2310.07075 1/5
4
32
99
36.2K
Jielin Qiu reposted
Seungwhan Shane Moon (@shane_moon)
Excited to share our recent work, AnyMAL -- a unified Multimodal LLM built on LLaMA-2 that can reason over various inputs, e.g. images, audio, motion sensors. Check out our paper for more information on the model training, evaluation, safety and more! ➡️ arxiv.org/abs/2309.16058
4
24
122
22.5K
Jielin Qiu reposted
Santiago (@svpino)
A topic that comes up in every interview: Bias, variance, and their relationship with machine learning algorithms. Here is a simple summary that you will easily remember. ↓
23
209
964
0
Jielin Qiu reposted
Jia-Bin Huang (@jbhuang0604)
How to present a line plot? Line plots are effective for describing the relationship between two variables of interest. Unfortunately, most junior students simply copy and paste the figure from the paper into their talk, which causes much confusion. 😕 Let's break it down ... 🧵
6
106
547
0
Jielin Qiu reposted
Jiahui Yu (@jhyuxm)
Our team at Google Brain is looking for outstanding PhD students (expected graduation after 2023) who are interested in student researcher internships this year (2022). careers.google.com/jobs/results/9…
1
28
89
0
Jielin Qiu reposted
Andrew White 🐦‍⬛ (@andrewwhite01)
I've been writing research articles for over 10 years now and one of the hardest parts is writing consistently and efficiently without procrastinating. I'm going to share some of my tips here 🧵 1/10
77
1.4K
11.5K
0
Jielin Qiu reposted
Ai2 (@allen_ai)
AI2's computer vision team PRIOR announced an exciting new release of their #EmbodiedAI platform AI2-THOR – in partnership with @unity, you can now train headlessly on multiple GPUs. 📈 Learn more: medium.com/ai2-blog/ai2-t…
0
13
44
0