Salesforce AI Research
@SFResearch
1.9K posts

We advance state-of-the-art #AI techniques, paving the path for innovative products at @Salesforce. Focus areas: #AIAgents, #EnterpriseAI, #EGI, and #TrustedAI.

Palo Alto, CA · Joined September 2014
415 Following · 19.2K Followers

Pinned Tweet
Salesforce AI Research @SFResearch
Looking for the cutting-edge of AI research? Follow Salesforce AI Research to see how we're transforming enterprise technology through advanced innovations. From world models to agentic systems, discover the future of AI before it hits the market.
Salesforce AI Research @SFResearch
Reference-guided LLM judges can meaningfully close the gap between RLVR and RLHF in non-verifiable domains. 🧠

Paper: arxiv.org/abs/2602.16802

The core problem: Reinforcement Learning with Verifiable Rewards (RLVR) works well for math and code, where answers can be checked ✅. But for general alignment, where there's no ground-truth verifier, we still rely on reward models or LLM judges that evaluate without any reference point. Can high-quality reference outputs fill that gap? 🔍

The answer is yes. The team introduces RefEval, a reference-guided prompting strategy that explicitly grounds LLM judge decisions in a strong reference output. 📎 Across 11 open-source LLM judges and 5 datasets, RefEval achieves 79.1% average accuracy, outperforming both reference-free baselines and prior reference-based methods. Smaller models benefit most: Llama-3-8B gains +17.4 points over the vanilla baseline. 📈

Those improved judges then power a self-improvement loop: LLMs use their own reference-guided judgments to generate DPO training pairs, with no external human or AI feedback required. 🔄

Results: Llama-3-8B-Instruct hits 73.1% on AlpacaEval and 58.7% on Arena-Hard. Qwen2.5-7B reaches 70.0% and 74.1%. Average gains of +20pt over SFT distillation and +5pt over reference-free self-improvement, comparable to training with a dedicated fine-tuned reward model. 💡

Authors: Kejian Shi, Yixin Liu, Peifeng Wang, Alexander R. Fabbri, Shafiq Joty @JotyShafiq, and Arman Cohan, with collaborators at @Yale, @Meta, and @scale_AI. #EnterpriseAI #FutureOfAI
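The reference-guided judging idea can be sketched in a few lines. Everything below is illustrative: the prompt wording, the `build_judge_prompt` / `judge` helpers, and the stub LLM are assumptions for this sketch, not the paper's implementation.

```python
def build_judge_prompt(question: str, candidate: str, reference: str) -> str:
    # Ground the judge's decision in a strong reference output
    return (
        f"Question: {question}\n"
        f"Reference answer (high quality): {reference}\n"
        f"Candidate answer: {candidate}\n"
        "Compare the candidate against the reference. "
        "Reply PASS if the candidate is at least as good, else FAIL."
    )

def judge(question: str, candidate: str, reference: str, llm) -> bool:
    # `llm` is any callable mapping prompt -> completion
    verdict = llm(build_judge_prompt(question, candidate, reference))
    return verdict.strip().upper().startswith("PASS")

# Stub "LLM" for illustration: passes candidates containing the reference's key fact
def stub_llm(prompt: str) -> str:
    candidate = prompt.split("Candidate answer:")[1]
    return "PASS" if "Paris" in candidate else "FAIL"

print(judge("Capital of France?", "It is Paris.", "Paris.", stub_llm))  # True
```

With a real judge model in place of the stub, these binary judgments can label preference pairs for the DPO self-improvement loop the tweet describes.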
Salesforce AI Research @SFResearch
InterviewSim: A Scalable Framework for Interview-Grounded Personality Simulation

Paper: bit.ly/3P4cQj9

How well can LLMs simulate real people? Most evaluations rely on surveys or questionnaires as proxies — never checking against what individuals actually said. 🔍

InterviewSim introduces an interview-grounded evaluation framework at scale: 671K+ Q&A pairs extracted from 23K verified interview transcripts across 1,000 public personalities, averaging 11.5 hours of content each. 📊

The framework evaluates simulation fidelity across four dimensions:
→ Content similarity
→ Factual consistency
→ Personality alignment (Big Five)
→ Factual knowledge retention (MCQ)

📌 Key finding: grounding in real interview data substantially outperforms biographical profiles or parametric knowledge alone. But how that data is used matters — retrieval-augmented methods capture personality style best, while chronological methods better preserve factual consistency and knowledge retention. 💡

The work also reveals that question type is a stronger predictor of difficulty than method choice. Social identity questions (birth dates, family details) yield the highest contradiction rates across all methods, while motivations and values questions are most forgiving.

Authors: Yu Li @yooli23, Pranav Narayanan Venkit @PranavVenkit, Yada Pruksachatkun @yadapruksachatk, Chien-Sheng Wu @jasonwu0731 #FutureOfAI #EnterpriseAI #NLProc
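Two of the fidelity dimensions can be illustrated with toy scorers. The metrics below (unigram Jaccard for content similarity, exact-match accuracy for MCQ retention) are simplified stand-ins chosen for this sketch; the paper's actual scorers are more sophisticated.

```python
def content_similarity(simulated: str, real: str) -> float:
    # Unigram Jaccard overlap between a simulated answer and the real one
    a, b = set(simulated.lower().split()), set(real.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0

def mcq_retention(answers: dict, key: dict) -> float:
    # Fraction of factual multiple-choice questions answered as in the key
    return sum(answers.get(q) == c for q, c in key.items()) / len(key)

real = "i started coding because my father brought home a computer"
sim = "i began coding because my father brought home an old computer"
print(round(content_similarity(sim, real), 2))          # 0.62
print(mcq_retention({"q1": "B", "q2": "C"}, {"q1": "B", "q2": "A"}))  # 0.5
```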
Salesforce AI Research @SFResearch
MAS-ProVe — the first systematic study of process verification for multi-agent systems.

Paper: arxiv.org/abs/2602.03053

Multi-agent systems are increasingly used to tackle complex reasoning tasks, but do they actually benefit from automatic process-level verification? This paper puts that question to a rigorous test.

Key findings:
🔬 Process verification doesn't consistently improve performance — high variance is the norm, not the exception, especially across Debate, AFlow, and MAS-Zero frameworks.
⚖️ LLM-as-a-Judge outperforms reward models in 24/36 configurations. Flexible natural language reasoning handles the messy, out-of-distribution dynamics of multi-agent trajectories better than scalar reward signals.
💡 Smaller models can verify effectively. The performance gap between a smaller generalist judge and a stronger reasoning judge is notably smaller than the gap when those same models act as solvers — cost-efficient supervision is viable.
📋 Context management matters more than expected. Summarized context consistently outperforms raw history while using ~3x fewer tokens. For information extraction tasks, however, summarization hurts — granular detail is essential for verifying tool use.
🧩 Verification improves stability, not solvability. Process verification helps stabilize outputs on queries the MAS can already solve, but rarely recovers fundamentally unsolvable cases — the agent's reasoning ceiling holds firm.

The MAS-ProVe framework is modular and open-source, designed as a plug-and-play wrapper for any MAS + any verifier.

Authors: Vishal Venkataramani, Haizhou Shi, Zixuan Ke @KeZixuan, Austin Xu @austinsxu, Xiaoxiao He, Yingbo Zhou, Semih Yavuz @semih__yavuz, Hao Wang @HaoGarfield, Shafiq Joty @jotyshafiq #FutureOfAI #EnterpriseAI #MultiAgentSystems
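The "plug-and-play verifier wrapper" shape can be sketched as below. The function names, the summarization heuristic, and the toy judge are all invented for this sketch; MAS-ProVe's actual interfaces and verifiers differ.

```python
def summarize(history, keep_last=2):
    # Cheap stand-in for summarized context: keep only the recent turns
    return " | ".join(history[-keep_last:])

def verify_step(step, history, judge):
    # LLM-as-a-Judge over the proposed step, given summarized context
    prompt = f"Context: {summarize(history)}\nProposed step: {step}\nOK or REJECT?"
    return judge(prompt).startswith("OK")

def run_with_verifier(steps, judge):
    history, accepted = [], []
    for step in steps:
        if verify_step(step, history, judge):  # drop steps the judge rejects
            accepted.append(step)
        history.append(step)
    return accepted

# Toy judge: rejects steps that guess instead of grounding in evidence
def toy_judge(prompt):
    step = prompt.split("Proposed step:")[1]
    return "REJECT" if "guess" in step else "OK"

print(run_with_verifier(["search docs", "guess the answer", "cite source"], toy_judge))
# ['search docs', 'cite source']
```

Swapping `summarize` for raw history, or `toy_judge` for a reward model, is exactly the kind of ablation the study runs across frameworks.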
Salesforce AI Research @SFResearch
PTL (Prune–Tune Loop): a compression method that reduces LLMs to nearly half their size while preserving reasoning performance. 🧠

Paper: bit.ly/4qV4TtQ

The core idea: instead of pruning all at once (which causes dramatic performance drops), PTL divides compression into fine-grained iterations—each with a small prune step followed by lightweight recovery tuning. Like the "boiling frog" effect, gradual changes stay recoverable. Each iteration identifies neurons or layers that are redundant for reasoning, removes them, then restores performance via continual pre-training on CoT data or reinforcement learning. 🔬

The results are compelling: Llama3-8B compressed from 8B → 5B parameters with 30% fewer FLOPs and a 224% runtime efficiency gain, while holding near-original accuracy on GSM8K, Minerva Math, and MATH-500. Gemma2-9B pruned from 9B → 5B—the only method tested that preserved near-original performance after aggressive pruning. 📊

On Qwen2.5-7B, PTL was the only compressed model able to recover via RL fine-tuning; competing methods produced entirely incoherent outputs. PTL also extends beyond math: on code generation (MBPP), a 30% pruned Llama3-8B retained 90% of original accuracy with a 2.56x speedup. 💻

Authors: Yiran Zhao @yiran_zhao924, Shengyang Zhou, Zijian Wu @Jaku_metsu, Tongyan Hu, Yuhui Xu @xyh6666, Rengan Dou, Kenji Kawaguchi, Shafiq Joty @jotyshafiq, Junnan Li @lijunnan0409, Michael Qizhe Shieh @michaelqshieh, @NUSingapore #EnterpriseAI #FutureOfAI
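The prune-then-recover schedule can be sketched on a toy weight vector. This is a minimal sketch under simplifying assumptions: magnitude pruning stands in for PTL's redundancy criterion, and the recovery step is omitted entirely (in PTL it is continual pre-training on CoT data or RL).

```python
import numpy as np

def prune_step(w, frac):
    # Zero out the smallest-magnitude `frac` of the remaining nonzero weights
    alive = np.flatnonzero(w)
    k = int(len(alive) * frac)
    w = w.copy()
    if k:
        w[alive[np.argsort(np.abs(w[alive]))[:k]]] = 0.0
    return w

def prune_tune_loop(w, target_sparsity=0.5, step_frac=0.1):
    # Many small prune steps instead of one large cut
    while np.mean(w == 0) < target_sparsity:
        w = prune_step(w, step_frac)
        # ...recovery tuning would run here after each step, keeping the
        # model close enough to its previous state to stay recoverable.
    return w

w = prune_tune_loop(np.random.default_rng(0).normal(size=100))
print(round(float(np.mean(w == 0)), 2))  # 0.51
```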
Salesforce AI Research @SFResearch
Poisoning the Well: Search Agents Get Tricked by Maliciously Hosted Content

sforce.co/40Z1Gi5

AI agents that rely on web search are vulnerable to a deceptively simple attack: adversaries publish fake but authoritative-sounding content designed to be retrieved during search. Think "AI Slop" for agents. Our research shows that when agents encounter planted content, they stop critically evaluating what they find and start accepting it at face value.

Key findings:
→ ~80% of queries returned the attacker's chosen answer when poisoned content was manually injected into search results
→ Agents shift from information-seeking mode to verification mode, performing fewer searches and reporting higher confidence
→ Even in realistic settings with 100K+ clean documents and just a handful of poisoned ones, nearly 1 in 4 queries were compromised
→ Agent self-reported confidence actually increases in the presence of adversarial content, making poisoned answers harder to detect

At @Salesforce, trust is a core value. These findings reinforce why we invest in defense mechanisms like our trust layer to help agents navigate hostile information landscapes reliably.

Authors: Shafiq Joty @jotyshafiq, Xuan Phi Nguyen @xuanphinguyen, Shrey Pandit @ShreyPandit2001, Yifei Ming @ming5_alvin #FutureOfAI #EnterpriseAI #AIAgents
Salesforce AI Research @SFResearch
When two AI agents negotiate, helpfulness can backfire. They agree each other into absurdity. Our A2A Semantic Layer Framework turns that chaos into trusted, verifiable interaction. A2A Semantic Layer Blog: sforce.co/49GtDB8 #FutureOfAI #AIAgents
Salesforce News & Insights @SalesforceNews

Social networks like Moltbook show that agents are eager to communicate, but agent-to-agent ecosystems can quickly become unmanageable. Salesforce's A2A Semantic Layer Framework ensures that when two AI systems interact, the exchange remains secure.

Salesforce AI Research @SFResearch
Our work on "Echoing" was accepted at the Agents in the Wild workshop at ICLR 2026 When LLM agents talk to each other without human oversight, they can abandon their assigned roles entirely — mirroring their conversational partner instead. And standard success metrics won't catch it. #ICLR2026 #FutureOfAI #EnterpriseAI
Salesforce AI Research @SFResearch

ECHOING: Identity Failures When LLM Agents Talk to Each Other

arxiv.org/abs/2511.09710

When agents interact autonomously, a new class of failure emerges where agents abandon their assigned roles and mimic their conversational partner. A customer agent starts sounding like the hotel it's negotiating with. A procurement agent generates supplier proposals.

Key findings across 2,500+ conversations:
🚨 Standard metrics miss it entirely. 93% of affected conversations still "complete successfully"
🪞 Echoing rates reach as high as 70% with major model providers
🧠 More reasoning doesn't fix it. Reasoning models showed 32.8% echoing, barely below non-reasoning at 37.7%
🛠 Structured responses reduce rates to ~9%, but don't eliminate the problem

The implication: A2A reliability can't be inferred from single-agent evaluations. As we scale these systems, mitigating identity drift needs to be a priority.

Authors: Sarath Shekkizhar @shekkizh, Romain Cosentino @Rom_Cosentino, Adam Earle, Silvio Savarese @silviocinguetta #FutureOfAI #EnterpriseAI #MultiAgentSystems
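A crude way to see why standard success metrics miss echoing is to compare a turn's wording against the partner's previous turn versus the agent's own role. The overlap heuristic and all strings below are illustrative assumptions, not the paper's measurement methodology.

```python
def overlap(a: str, b: str) -> float:
    # Fraction of `a`'s words that also appear in `b`
    a_set, b_set = set(a.lower().split()), set(b.lower().split())
    return len(a_set & b_set) / max(len(a_set), 1)

def is_echoing(turn: str, partner_prev: str, role_desc: str) -> bool:
    # Flag a turn that sounds more like the partner than like the agent's role
    return overlap(turn, partner_prev) > overlap(turn, role_desc)

role = "you are a customer booking a hotel room on a budget"
partner = "we offer deluxe suites with complimentary spa access"
turn = "we offer deluxe suites and complimentary breakfast"

print(is_echoing(turn, partner, role))  # True: the customer agent now talks like the hotel
```

Note the conversation containing that turn could still "complete successfully" by task metrics, which is exactly the gap the paper highlights.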

Salesforce AI Research @SFResearch
This week, Silvio Savarese @silviocinguetta visited our Singapore team to mark the team's 7th anniversary and kick off FY27. From time series forecasting and enterprise insights to coding agents and computer use agents — the team presented technically rigorous work with broad product reach across @Salesforce. A testament to what seven years of focused research builds. 🌏 #AIResearch #EnterpriseAI
Salesforce AI Research @SFResearch
MCP+ wraps your existing MCP clients as a filter — offloading the heavy lifting to a cheaper model and returning only the relevant slice. No changes to your agent. Up to 75% cost savings. 💡
Salesforce AI Research @SFResearch
MCP+: Precision Context Management for MCP Agents

mcp-plus.github.io

A server-, agent-, & task-agnostic post-processing layer that wraps your MCP clients — filtering tool outputs down to only what your agent needs, with zero changes to your existing logic. 🔍

Context bloat is a real cost. When MCP tools return thousands of tokens of raw HTML, JSON dumps, or API payloads, your primary agent pays for every token — and that overhead compounds across every turn. MCP+ intercepts those outputs and returns only the relevant slice, offloading the filtering work to a cheaper model so your premium LLM context stays focused on the task. 📊

Results across the MCP-Universe benchmark (browser navigation, financial data, web search):
→ Up to 75% reduction in inference cost
→ Comparable or improved task accuracy across Claude, GPT, and Gemini
→ Token count reduced by >95% in structured data tasks (e.g. ~7,000 tokens → ~200)

⚙️ The key mechanism: an expected_info argument that lets your agent specify exactly what it needs before the haystack reaches its context window.

🛠️ Works with Cursor, Claude Code, and Agentforce Vibes. Supports OpenAI, Gemini, and Anthropic models as the filtering layer.

✍️ Authors: Prathyusha Jwalapuram (@jwala_94), Akhilesh Deepak Gotmare (@akhilesh_gotmare), Doyen Sahoo (@doyensahoo), Silvio Savarese (@silviocinguetta), and Junnan Li (@LiJunnan0409). #FutureOfAI #EnterpriseAI #AIAgents #MCP #LLM #GenerativeAI #MLOps
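The intercept-and-filter mechanism can be sketched as a wrapper around any tool call. The keyword filter below is a stub standing in for the cheaper filter model, and all function names are hypothetical; only the `expected_info` idea comes from the post.

```python
def cheap_filter(raw: str, expected_info: str) -> str:
    # Stand-in for the cheaper filter model: keep only lines relevant
    # to what the agent said it needs
    keys = expected_info.lower().split()
    lines = [l for l in raw.splitlines() if any(k in l.lower() for k in keys)]
    return "\n".join(lines) or raw  # fall back to raw if nothing matched

def wrapped_tool_call(tool, args, expected_info):
    # Intercept the tool's raw output before it reaches the primary agent
    return cheap_filter(tool(**args), expected_info)

# Toy tool returning a bloated payload
def stock_tool(ticker):
    return f"ticker: {ticker}\nprice: 212.4\nfull_html: <div>...9000 tokens...</div>"

print(wrapped_tool_call(stock_tool, {"ticker": "CRM"}, "price"))  # price: 212.4
```

The primary agent never sees the HTML dump; only the slice matching `expected_info` enters its context window.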
Salesforce AI Research @SFResearch
SkillOrchestra: Learning to Route Agents via Skill Transfer 🎼

bit.ly/4tWbSVS

Currently #2 on @HuggingFace 🤗 and featured on @DailyPapers today!

What if smarter agent orchestration wasn't about training bigger models — but about modeling skills? RL-based orchestrators suffer from "routing collapse" — repeatedly calling the same agent even when alternatives would do better. The result: inflated costs and poor specialization.

SkillOrchestra takes a different approach. Instead of learning a policy end-to-end, it builds a reusable Skill Handbook that captures:
→ Mode-level insights (what to do, e.g., search vs. code vs. answer)
→ Fine-grained skills (e.g., symbolic logic, numerical approximation)
→ Agent profiles (competence + cost per skill)

At inference time, it identifies which skills are active, looks up which agents handle them best, and optimizes the accuracy–cost tradeoff explicitly. No routing collapse. Just skill-aware decisions.

Results across 10 benchmarks:
✅ Up to +22.5% accuracy over RL-based approaches
✅ 700× lower training cost vs. Router-R1
✅ 300× lower vs. ToolOrchestra
✅ Pareto optimal across performance–cost

And the Skill Handbook transfers — learned on a 3B orchestrator, applied to 7B, 8B, and Mixtral-8x22B with no retraining. The bigger the model, the bigger the gain.

Great work from @jiayuwang111, @ming5_alvin, @KeZixuan, @JotyShafiq, @awsTO and @fredsala 👏 — a collaboration between UW-Madison and @Salesforce AI Research. #FutureOfAI #EnterpriseAI #AIAgents #MachineLearning
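A toy version of skill-aware routing makes the accuracy-cost tradeoff concrete. The handbook structure, the competence/cost numbers, and the scoring rule are all invented for this sketch; SkillOrchestra's actual handbook and optimization are richer.

```python
HANDBOOK = {  # skill -> {agent: (competence, cost)}
    "symbolic_logic": {"small-coder": (0.6, 1.0), "big-reasoner": (0.9, 8.0)},
    "web_search":     {"searcher": (0.85, 2.0), "big-reasoner": (0.7, 8.0)},
}

def route(active_skills, cost_weight=0.05):
    # Score each agent by competence on the active skills, penalized by cost,
    # then pick the best: an explicit accuracy-cost tradeoff, no learned policy
    scores = {}
    for skill in active_skills:
        for agent, (comp, cost) in HANDBOOK[skill].items():
            scores[agent] = scores.get(agent, 0.0) + comp - cost_weight * cost
    return max(scores, key=scores.get)

print(route(["web_search"]))                        # searcher
print(route(["symbolic_logic"], cost_weight=0.2))   # small-coder (cost-sensitive)
```

Because the handbook is plain data rather than learned policy weights, swapping in a new agent pool (3B orchestrator to Mixtral-8x22B, say) needs no retraining, which is the transfer property the post highlights.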
Salesforce AI Research @SFResearch
Future Optical Flow Prediction Improves Robot Control & Video Generation 📝 bit.ly/4s98FjZ FOFPred uses language-conditioned optical flow prediction to improve both robot manipulation and video generation. #FutureOfAI #Robotics
Salesforce AI Research @SFResearch
Active Video Perception: Iterative Evidence Seeking for Agentic Long Video Understanding 📝 bit.ly/48X4PDY AVP uses MLLM agents to actively seek query-relevant evidence in long videos by deciding what, where, and how to look instead of passively captioning every frame. #FutureOfAI #EnterpriseAI
Salesforce AI Research @SFResearch
Two papers accepted to @CVPR 2026! 🎉 🎖️ Active Video Perception: Iterative Evidence Seeking for Agentic Long Video Understanding 🎖️ Future Optical Flow Prediction Improves Robot Control & Video Generation Learn more 👇 #CVPR2026 #FutureOfAI #EnterpriseAI
Salesforce AI Research @SFResearch
PLATE: a new continual learning method from @Salesforce AI Research that lets you fine-tune pretrained models on new tasks without forgetting what they already know. No access to old training data required. 🧠

Paper: bit.ly/4qPr5Fm
Code: bit.ly/3MZm0MW

The core insight: pretrained large models are geometrically redundant. PLATE exploits that redundancy in two complementary ways: (i) it uses redundant neurons as a weight-only proxy for dominant old-feature directions, and (ii) it concentrates plasticity on redundant channels so updates don't disrupt what the model already learned. 🔬

The result is a structured low-rank adapter (∆W = BAQ⊤) where only A is trained. B (the redundant-neuron selector) and Q (the orthogonal subspace) are computed once from the frozen weights, requiring zero old-task data.

In evaluations spanning LLM out-of-distribution specialization and multiple two-task continual learning setups (language modeling, regression, vision, and text classification), PLATE matches LoRA's new-task gains while preserving prior behavior, without needing access to old-task data. 📊

The method also gives practitioners an explicit dial: PLATE's hyperparameters (the number of trainable neurons r and the orthogonal-subspace energy threshold τ) together control how far the model moves along the learning ↔ forgetting spectrum, instead of leaving that tradeoff to chance. 🎛️

Author: Romain Cosentino #EnterpriseAI #FutureOfAI
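The shape of the ∆W = BAQ⊤ adapter can be sketched with NumPy. The selection criteria below (smallest row norms for B, lowest-energy singular directions for Q) are simplified stand-ins for the paper's procedure; only the factorization structure, with A as the sole trainable factor, follows the post.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(16, 16))   # frozen pretrained weight
r = 4                           # number of "redundant" neurons to adapt

# B: selector for r redundant output neurons (toy criterion: smallest row norms)
idx = np.argsort(np.linalg.norm(W, axis=1))[:r]
B = np.zeros((16, r))
B[idx, np.arange(r)] = 1.0

# Q: orthogonal basis for low-energy directions of the frozen weights,
# computed once via SVD (an energy threshold would pick how many to keep)
U, S, Vt = np.linalg.svd(W)
Q = Vt[8:].T                    # keep the 8 lowest-energy right-singular directions

A = rng.normal(size=(r, Q.shape[1]))  # the ONLY trainable factor
dW = B @ A @ Q.T                      # structured low-rank update, ΔW = B A Qᵀ
print(dW.shape)  # (16, 16)
```

Since B and Q are fixed functions of the frozen weights, the update only moves the selected redundant channels along directions the pretrained model barely uses, which is how plasticity is concentrated away from old behavior.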
Salesforce AI Research @SFResearch
(5/6) MAS-Orchestra delivers strong results on public benchmarks—math, multi-hop QA, and multi-step search QA—with 10X efficiency over strong baselines.
✅ Consistent gains across all benchmarks
✅ Robust OOD generalization
🔍 Behavior matters:
• Low DoM → learns effective single-agent delegation
• High DoM → learns to exploit parallelism
📈 Efficiency and effectiveness: MAS-Orchestra lies on the performance–cost Pareto frontier.
Salesforce AI Research @SFResearch
(🧵 1/6) Multi-agent systems (MAS) ≠ "just more agents." Today's MAS orchestration is often sequential, local, and hard-coded—and we still don't know when MAS actually helps.

MAS-Orchestra enables holistic orchestration, framing it as a function-calling RL problem with an explicit notion of degree of MAS (DoM). We introduce MASBench to quantify gains over single agents, achieving strong multi-step reasoning with 10X efficiency.

🧠 Project Page: bit.ly/3NOg61r
📘 Paper: bit.ly/4kLeiTl
💻 Code: bit.ly/4teCGQV
📚 Dataset: bit.ly/4klZUkc