Zhiwei-Jim (@JYJimLiu) - Twitter Profile

Zhiwei-Jim@JYJimLiu·Aug 14

Thanks for introducing our work! We are still actively developing this framework to advance the auto deep evaluation of AI agents. @SFResearch

马东锡 NLP@dongxi_nlp

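「 MCP, Agent, Evaluation 」 MCPEval: Automatic MCP-based Deep Evaluation for AI Agent Models
Automatic benchmarking for every MCP agent. The MCPEval framework is designed specifically for evaluating MCP agents. Its highlight is that users do not need to define evaluation tasks: it automates "task generation -> execution -> analysis". The task generation part is especially valuable, because users do not have to design a benchmark by hand:
- It reads the get_tools() metadata and automatically generates composite tasks that call multiple tools.
- A frontier agent revises each task in a loop until it succeeds.
- The successfully generated tasks are then used for further evaluation and reporting.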
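The task-generation loop described in the post above can be sketched roughly as follows. This is a minimal illustration, not MCPEval's actual code: `Tool`, `Task`, `generate_task`, `verify_task`, and the stub agent functions are all hypothetical names, and the LLM calls are replaced by deterministic stand-ins so the example runs on its own.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Tool:
    """Hypothetical stand-in for one entry of the get_tools() metadata."""
    name: str
    description: str


@dataclass
class Task:
    instruction: str
    tools: List[str]
    revisions: int = 0


def generate_task(tools: List[Tool]) -> Task:
    """Draft a composite task that chains every tool (stands in for an LLM call)."""
    steps = ", then ".join(f"use {t.name} to {t.description}" for t in tools)
    return Task(instruction=steps, tools=[t.name for t in tools])


def verify_task(
    task: Task,
    attempt: Callable[[Task], Tuple[bool, str]],
    revise: Callable[[Task, str], Task],
    max_attempts: int = 3,
) -> Optional[Task]:
    """Let an agent attempt the task; revise on failure until it succeeds."""
    for _ in range(max_attempts):
        ok, feedback = attempt(task)
        if ok:
            return task
        task = revise(task, feedback)
    return None  # the task never became solvable, so it is discarded


# Deterministic stubs in place of a frontier agent (assumptions for the demo).
def stub_attempt(task: Task) -> Tuple[bool, str]:
    if "Paris" in task.instruction:
        return True, ""
    return False, "missing a concrete city"


def stub_revise(task: Task, feedback: str) -> Task:
    return Task(task.instruction + " for Paris", task.tools, task.revisions + 1)


tools = [
    Tool("get_weather", "fetch the forecast"),
    Tool("book_flight", "reserve a seat"),
]
draft = generate_task(tools)
verified = verify_task(draft, stub_attempt, stub_revise)
```

Only tasks that come back from `verify_task` as non-`None` would go on to the evaluation and reporting stage; anything that still fails after `max_attempts` is dropped.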


Zhiwei-Jim@JYJimLiu·Jul 30

@dair_ai it seems you tagged a different paper. here is the arxiv paper link for our MCPEval arxiv.org/abs/2507.12806. And also we posted a demo about how to use it. x.com/SFResearch/sta…

Salesforce AI Research@SFResearch

⚡ Introducing MCPEval: the first automated evaluation framework for AI agents built on Model Context Protocol: 🔗 Paper: bit.ly/3TKXpLR 🔗 Code: bit.ly/44ZnUSN ✅ End-to-end task generation & verification ✅ Deep evaluation across 5 real-world domains ✅ Standardized metrics for reproducible research ✅ Open-source & eliminates manual bottlenecks Our evaluation of 10+ models (GPT-4o, O3, Qwen3, etc.) reveals surprising insights: smaller tool-enhanced models can match larger ones in specific domains! Perfect for researchers & developers building reliable AI agents. #AIAgents #FutureOfAI #EnterpriseAI


DAIR.AI@dair_ai·Jul 27

9. MCPEval MCPEval is an open-source framework that automates end-to-end evaluation of LLM agents using a standardized Model Context Protocol, eliminating manual benchmarking. arxiv.org/abs/2507.15015


DAIR.AI@dair_ai·Jul 27

Top AI Papers of The Week (July 21 - 27): - MCPEval - Subliminal Learning - Learning without Training - Alignment Auditing Agents - Structural Planning for LLM Agents - Inverse Scaling in Test-Time Compute - Deep Researcher with Test-Time Diffusion Read on for more:


Zhiwei-Jim@JYJimLiu·Jul 19

Check our latest work about automatic mcp-based evaluation pipeline

Salesforce AI Research@SFResearch

⚡ Introducing MCPEval: the first automated evaluation framework for AI agents built on Model Context Protocol: 🔗 Paper: bit.ly/3TKXpLR 🔗 Code: bit.ly/44ZnUSN ✅ End-to-end task generation & verification ✅ Deep evaluation across 5 real-world domains ✅ Standardized metrics for reproducible research ✅ Open-source & eliminates manual bottlenecks Our evaluation of 10+ models (GPT-4o, O3, Qwen3, etc.) reveals surprising insights: smaller tool-enhanced models can match larger ones in specific domains! Perfect for researchers & developers building reliable AI agents. #AIAgents #FutureOfAI #EnterpriseAI


Zhiwei-Jim retweeted

Sheng Zhang@sheng_zh·Jan 10

📢Our team at Microsoft Research (@MSFTResearch) is hiring summer interns. If you have expertise in building image encoders, reward models, or vision-language models, I'd love to hear from you. Please send me your CV or website via email (zhang.sheng@microsoft.com) or DM!


Zhiwei-Jim@JYJimLiu·Jan 10

@SFResearch Check our research work towards the visual agent model!


Zhiwei-Jim retweeted

Salesforce AI Research@SFResearch·Jan 10

🌮 Introducing 🌮 TACO - our new family of multimodal action models that combine reasoning with real-world actions to solve complex visual tasks! 📊Results: 20% gains on MMVet 3.9% average improvement across 8 benchmarks 1M+ synthetic CoTA traces in training 🔓 🔓🔓Fully open-sourced! 🔓🔓🔓 Get started with: 📄 Paper: bit.ly/3PufThl 💻 Code: bit.ly/3Pw8azw 📱 Demo: bit.ly/3PwrEE2 🤖 Models: bit.ly/4j2ZG0h 📚 Datasets: bit.ly/3Pxtzbv 🧵 ...and our Technical deep-dive starts here ⤵️ (1/4) How does TACO work? 🤔 ⛓️TACO answers complex questions by generating Chains-of-Thought-and-Action (CoTA), executing intermediate actions with external tools such as OCR, calculator, and depth estimation, then integrating both the thoughts and action outputs to produce final responses. We generate the synthetic CoTA data with two approaches: model-based generation (top) and programmatic generation (bottom).
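To make the Chains-of-Thought-and-Action idea in the post above concrete, here is a toy sketch of executing such a chain. It is not TACO's implementation: the `Step` type, the `run_cota` function, the `{prev}` placeholder, and the stubbed OCR tool are all assumptions for illustration, and the chain is hard-coded rather than generated by a model.

```python
import operator
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Step:
    thought: str   # the reasoning text for this step
    action: str    # name of the tool to call
    argument: str  # tool input; "{prev}" is replaced by the previous output


def fake_ocr(image_id: str) -> str:
    """Stand-in for a real OCR tool: returns canned text for a known image."""
    return {"receipt.png": "12 * 3"}.get(image_id, "")


def calculator(expression: str) -> str:
    """Evaluate a simple 'a op b' integer expression without using eval."""
    left, op, right = expression.split()
    ops = {"+": operator.add, "-": operator.sub, "*": operator.mul}
    return str(ops[op](int(left), int(right)))


TOOLS: Dict[str, Callable[[str], str]] = {"OCR": fake_ocr, "Calculate": calculator}


def run_cota(steps: List[Step], tools: Dict[str, Callable[[str], str]]) -> Tuple[str, List[str]]:
    """Execute each action in order and record thoughts alongside tool outputs."""
    trace: List[str] = []
    last = ""
    for step in steps:
        arg = step.argument.replace("{prev}", last)
        last = tools[step.action](arg)
        trace.append(f"{step.thought} -> {step.action}({arg}) = {last}")
    return last, trace


steps = [
    Step("Read the text in the image", "OCR", "receipt.png"),
    Step("Evaluate the expression", "Calculate", "{prev}"),
]
answer, trace = run_cota(steps, TOOLS)
```

The final response here is simply the last tool output; in the approach the post describes, the thoughts and action outputs are integrated into the final response rather than just passed through.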


Zhiwei-Jim retweeted

Salesforce AI Research@SFResearch·Jan 10

Excited to open source TACO and see how the AI research community builds on these multimodal innovations! Together we'll push the boundaries of visual reasoning and agent capabilities. 🌮🚀 📄 Paper: bit.ly/3PufThl 💻 Code: bit.ly/3Pw8azw 📱 Demo: bit.ly/3PwrEE2 🤖 Models: bit.ly/4j2ZG0h 📚 Datasets: bit.ly/3Pxtzbv Huge thanks to our 🌮 research team! @zixianma02 @JianguoZhang3 @JYJimLiu @JieyuZhang20 @chrisjtan @ManliShu @jcniebles @shelbyh_ai @huan__wang @CaimingXiong @RanjayKrishna @silviocinguetta


Zhiwei-Jim retweeted

Silvio Savarese@silviocinguetta·Sep 6

Happy to see our team's hard work come to fruition. The xLAM family of models represents a huge leap in AI capabilities for function calling, planning and reasoning—fit-for-purpose for varied needs of modern business. Eager to see where its application takes us! #AIInnovation

Salesforce AI Research@SFResearch

Introducing the full xLAM family, our groundbreaking suite of Large Action Models! 🚀 From the 'Tiny Giant' to industrial powerhouses, xLAM is revolutionizing AI efficiency! #AIResearch #AIEfficiency 🤗 Hugging Face Collection: bit.ly/4faoYaQ 🤩 Research Blog bit.ly/3MxliCZ 🗞️ Press Release: sforce.co/3XzaOt9 Meet the family: • xLAM-1B / TINY: Our 1B parameter marvel, ideal for on-device AI. Outperforms larger models despite its compact size • xLAM-7B / SMALL: Perfect for swift academic exploration with limited GPU resources. • xLAM-8x7B / MEDIUM: Mixture-of-experts model balancing latency, resources, and performance for industrial applications. • xLAM-8x22B / LARGE: Our large-scale model for optimal performance in high-resource environments. 🎉 Huge congrats to the team of AI scientists who brought xLAM series to life! Zuxin Liu @LiuZuxin Shirley Kokane @KokaneShirley Ming Zhu @ming_zhu0527 Tian Lan @TLan001 Jianguo Zhang @JianguoZhang3 Thai Hoang @TeeH912. Caiming Xiong @CaimingXiong Silvio Savarese @silviocinguetta


Zhiwei-Jim retweeted

Juan Carlos Niebles@jcniebles·Jun 20

The slides for my #CVPR2024 Tutorial on Agents are now available! I’ve also posted an accompanying blog and links to all the @SFResearch Open-Source repos to make it easy for people to get started. Check them out here: niebles.net/blog/2024/agen…

Juan Carlos Niebles@jcniebles

I’m back at #CVPR2024! I’m speaking tomorrow 8:40am at the Generalist Agent AI Tutorial about Language-based AI Agents and Large Action Models (LAMs). #aiagent I’m also a panelist tomorrow 11:30am at the Workshop on What is next in Multimodal Foundation Models? #multimodalai


Zhiwei-Jim@JYJimLiu·Mar 18

check our repos and play with our xLAM model and AgentLite Library!

Caiming Xiong@CaimingXiong

🎉🎉We are excited to release a full package for AI Agent R&D: 1) For Data & Training, 🎙️AgentOhana🎙️: Design Unified Data and Training Pipeline for Effective Agent Learning. 2) For model, 🔥xLAM-v0.1-R🔥: A strong large action model for AI Agent while maintaining abilities on general tasks. 3) For agent inference framework, 🤖AgentLite🤖: a lightweight agent/multi-agent library. AgentOhana aggregated, standardized and unified agent trajectories from distinct environments. xLAM-v0.1-r, fine-tuned on #Mixtral, outperforms #GPT-3.5-Turbo on the benchmarks (WebShop, HotpotQA, ToolBench, and MINT-Bench) and #GPT-4 on several of them. AgentLite is implemented with <1K lines of code, and magically supports quickly building LLM agents, designing new agent reasoning, new agent architectures and multi-agent orchestration. AgentOhana Paper: arxiv.org/abs/2402.15506… xLAM GitHub and Model:github.com/SalesforceAIRe… and huggingface.co/Salesforce/xLA… AgentLite Github: github.com/SalesforceAIRe… AgentLite Paper: arxiv.org/abs/2402.15538


Zhiwei-Jim@JYJimLiu·Mar 14

@cognition_labs Does Cognition train an LLM for Devin or just use gpt-4 as a foundation?


Cognition@cognition·Mar 12

Today we're excited to introduce Devin, the first AI software engineer. Devin is the new state-of-the-art on the SWE-Bench coding benchmark, has successfully passed practical engineering interviews from leading AI companies, and has even completed real jobs on Upwork. Devin is an autonomous agent that solves engineering tasks through the use of its own shell, code editor, and web browser. When evaluated on the SWE-Bench benchmark, which asks an AI to resolve GitHub issues found in real-world open-source projects, Devin correctly resolves 13.86% of the issues unassisted, far exceeding the previous state-of-the-art model performance of 1.96% unassisted and 4.80% assisted. Check out what Devin can do in the thread below.


Zhiwei-Jim retweeted

Salesforce AI Research@SFResearch·Mar 12

Good AI starts with good data. 💡 That's why #DataCloud is the foundation of AI at Salesforce. 💻 Excited to announce our latest AI-powered capabilities for #IdentityResolution in Data Cloud! Check out our blog to learn more. 🔍 #SalesforceAI blog.salesforceairesearch.com/identity-resol…


Zhiwei-Jim retweeted

Shelby Heinecke@shelbyh_ai·Mar 12

Data Cloud powers AI - but there is also AI powering Data Cloud! Unifying data from different sources is challenging, and we build the AI to get it done intelligently. Check out our most recent work, now available in Data Cloud ⬇️

Salesforce AI Research@SFResearch

Good AI starts with good data. 💡 That's why #DataCloud is the foundation of AI at Salesforce. 💻 Excited to announce our latest AI-powered capabilities for #IdentityResolution in Data Cloud! Check out our blog to learn more. 🔍 #SalesforceAI blog.salesforceairesearch.com/identity-resol…


Zhiwei-Jim@JYJimLiu·Sep 12

🔥🔥🔥🔥🔥🔥🔥

Salesforce AI Research@SFResearch

Explore how Salesforce leverages #AI on time series and causality to automate IT Ops and boost reliability. Learn how our #LLMs are helping customers improve code quality to reduce response time. Join us 9/13 at 10:00 am in AI Landing! #DF23 @doyensahoo reg.salesforce.com/flow/plus/df23…


Zhiwei-Jim@JYJimLiu·Sep 12

🔥🔥🔥🔥🔥🔥🔥🔥🔥

Salesforce AI Research@SFResearch

Join us at #DF23 on Wednesday, September 13th at 2:00 PM in Einstein Theater and discover how our in-house Large Action Models will transform Customer Service at Salesforce. @jasonwu0731 reg.salesforce.com/flow/plus/df23…


Zhiwei-Jim retweeted

Caiming Xiong@CaimingXiong·Aug 26

🎉🎉We are excited to release 👉BOLAA👈: Benchmarking and Orchestrating LLM-augmented Autonomous Agents. In this release, we compare 6 different agent arches (including BOLAA) and 15 popular LLMs under web & QA agents tasks. We will keep expanding the benchmark!


Zhiwei-Jim retweeted

Caiming Xiong@CaimingXiong·Jul 25

The collaboration between @JianguoZhang3 @qbetterk @JYJimLiu @shelbyh_ai @huan__wang @memray0 @Liuye918 @Zhou_Yu_AI @silviocinguetta and @CaimingXiong


Zhiwei-Jim retweeted

Caiming Xiong@CaimingXiong·Jun 28

We introduce 🔥XGen-7B 🔥, a new 7B LLM trained on up to 8K sequence length for 1.5T tokens. Achieves better or comparable results with MPT, Falcon, LLaMA, Redpajama, and OpenLLaMA in the text and code tasks. 🔗Blog: blog.salesforceairesearch.com/xgen/ 🔗Code: github.com/salesforce/xgen


Zhiwei-Jim retweeted

Huan Wang@huan__wang·Feb 17

Salesforce has been actively engaged in the integration of cutting-edge artificial intelligence technologies into our product offerings. We are excited to share some significant progress in our efforts to enhance the entity resolution capabilities. #salesforce #AI

Salesforce AI Research@SFResearch

Unify Profiles with Salesforce Data Cloud Identity Resolution Soft-Matching 🪪 @shelbyh_ai @huan__wang @JYJimLiu Read our blog: blog.salesforceairesearch.com/data-cloud-ide… Visit our website: salesforceairesearch.com/projects/data-…


Zhiwei-Jim retweeted

Salesforce AI Research@SFResearch·Feb 17

Unify Profiles with Salesforce Data Cloud Identity Resolution Soft-Matching 🪪 @shelbyh_ai @huan__wang @JYJimLiu Read our blog: blog.salesforceairesearch.com/data-cloud-ide… Visit our website: salesforceairesearch.com/projects/data-…