Ran Xu

92 posts


@stanleyran

Research Director @ Salesforce AI Research

Joined August 2010
224 Following · 61 Followers
Ran Xu retweeted
Salesforce AI Research @SFResearch
🎨 Introducing BLIP3o-NEXT: The Next Frontier of Native Image Generation
📄 Paper: arxiv.org/abs/2510.15857
A fully open-science foundation model that unifies text-to-image generation AND image editing in one architecture. 🖼️✨
Key innovations:
➡️ Autoregressive + Diffusion design combining reasoning + fine-detail rendering 🧠
➡️ First successful RL application for native image generation (GRPO on discrete tokens) 🎯
➡️ State-of-the-art performance on GenEval & image editing benchmarks 📊
#FutureOfAI #EnterpriseAI #ImageGeneration #DiffusionModels #ReinforcementLearning #ImageEditing #ComputerVision #MachineLearning #AI
Replies: 0 · Retweets: 4 · Likes: 13 · Views: 2K
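
The concrete technique named above is GRPO applied to discrete image tokens. As a rough illustration of that idea (a minimal sketch, not the BLIP3o-NEXT implementation), GRPO samples a group of images per prompt, normalizes each reward against its own group, and applies a clipped policy-gradient update:

```python
import torch

def grpo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """Group-relative advantages. rewards: (num_prompts, group_size);
    each sample's reward is normalized against its own group."""
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + 1e-8)

def grpo_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """Clipped policy-gradient objective. logp_* are summed log-probs of
    each sampled discrete-token image under the new/old policy."""
    ratio = torch.exp(logp_new - logp_old)
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
    return -torch.min(ratio * advantages, clipped * advantages).mean()

rewards = torch.tensor([[0.0, 1.0, 1.0, 0.0]])  # e.g., pass/fail prompt checks
print(grpo_advantages(rewards))
```

The appeal for native image generation is that the reward only has to score the finished image (for example, a GenEval-style prompt-faithfulness check), not individual tokens.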
Ran Xu retweeted
Caiming Xiong @CaimingXiong
Humans don’t just use tools — we invent them. That’s the next frontier for AI agents.
At @SFResearch, we’re introducing WALT (Web Agents that Learn Tools) — a framework that teaches browser agents to discover and reverse-engineer a website’s hidden functionality into reusable tools.
Through a demonstrate → generate → validate loop, WALT systematically transforms web interactions into structured APIs — moving us closer to truly autonomous web intelligence.
We benchmark WALT on VisualWebArena and WebArena — discovering 50+ reusable tools across search, content management, and communication. WALT hits 52.9% / 50.1% SOTA success, with 10–30% higher accuracy and 1.3–1.4× fewer steps.
Paper: bit.ly/4nhJf0K
Code: bit.ly/47gMAXZ
@virprabh @yutong_dai @jinggu4ai @luo_yanqi @silviocinguetta @LiJunnan0409 @ZeyuanChen @stanleyran
[image attached]
Replies: 6 · Retweets: 20 · Likes: 117 · Views: 10.4K
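
The load-bearing idea in the WALT tweet is the demonstrate → generate → validate loop. Here is a minimal sketch of that control flow, with every name hypothetical (the real pipeline is in the linked code):

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Tool:
    name: str
    description: str
    run: Callable[..., str]  # a parameterized, replayable web interaction

def discover_tool(demonstrate: Callable, generate: Callable,
                  validate: Callable, max_attempts: int = 3) -> Optional[Tool]:
    """Record a demonstration on the site, synthesize a parameterized tool
    from the trace, and keep the tool only if it replays successfully."""
    for _ in range(max_attempts):
        trace = demonstrate()   # e.g., a recorded browser action sequence
        tool = generate(trace)  # abstract the trace into a reusable API
        if validate(tool):      # replay with held-out arguments
            return tool
    return None

# Toy run: validation succeeds on the second attempt.
attempts = iter([False, True])
tool = discover_tool(
    demonstrate=lambda: ["open /search", "type query", "submit"],
    generate=lambda trace: Tool("search", "site search", lambda q: f"results for {q}"),
    validate=lambda t: next(attempts),
)
print(tool.name if tool else "no tool validated")  # search
```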
Ran Xu retweeted
Caiming Xiong @CaimingXiong
🚀 Computer-using agents represent a powerful new paradigm for human-computer interaction. Over the past year, we’ve explored multiple approaches to tackle the key challenges in building robust CUA systems.
12/2024 we released Aguvis (x.com/CaimingXiong/s…)
07/2025 we released GTA1 (x.com/CaimingXiong/s…)
Today, we introduce CoAct-1 — a hybrid agent that elevates coding to a first-class action alongside GUI manipulation. On OSWorld, CoAct-1 achieves a new SOTA score of 60.76%, becoming the first CUA agent to cross the 60-point mark.
Takeaways:
- Treat code as an action, not just a tool call.
- Hybrid action space (code + GUI) reduces error accumulation and boosts reliability.
- New SOTA on OSWorld with better efficiency and broader applicability.
Paper: arxiv.org/abs/2508.03923
Page: linxins.net/coact/
[image attached]
Quoted tweet: Caiming Xiong @CaimingXiong

Meet AGUVIS: A pure vision-based framework for autonomous GUI agents, operating seamlessly across web, desktop, and mobile platforms without UI code.
Key Features & Contributions
🔍 Pure Vision Framework: First fully autonomous pure vision GUI agent capable of performing tasks independently without relying on closed-source models
🔄 Cross-Platform Unification: Unified action space and plugin system that works consistently across different GUI environments
📊 Comprehensive Dataset: Large-scale dataset of GUI agent trajectories with multimodal grounding and reasoning
🧠 Two-Stage Training: Novel training pipeline focusing on GUI grounding followed by planning and reasoning
💭 Inner Monologue: Explicit planning and reasoning capabilities integrated into the model training
Project Page: aguvis-project.github.io
Paper: huggingface.co/papers/2412.04…
GitHub: github.com/xlang-ai/aguvis

Replies: 3 · Retweets: 39 · Likes: 204 · Views: 32.5K
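
"Treat code as an action, not just a tool call" amounts to a hybrid action space: at each step the agent emits either a GUI primitive or a whole script. A toy sketch of that dispatch, with stubbed executors (not CoAct-1's actual code):

```python
from dataclasses import dataclass
from typing import Union

@dataclass
class GuiAction:
    kind: str      # "click", "type", "scroll", ...
    target: str    # element description or screen coordinates
    text: str = ""

@dataclass
class CodeAction:
    language: str  # "python" or "bash"
    source: str    # a script executed directly, bypassing the GUI

Action = Union[GuiAction, CodeAction]

def execute(action: Action) -> str:
    """Stubbed dispatcher: scripts would run in a sandbox, GUI actions
    through a screenshot-grounded controller. Replacing a long GUI
    sequence with one script is where the reduced error accumulation
    claimed in the tweet would come from."""
    if isinstance(action, CodeAction):
        return f"[sandbox:{action.language}] would run: {action.source}"
    return f"[gui] {action.kind} on {action.target}"

print(execute(CodeAction("bash", "mv ~/Downloads/*.pdf ~/Documents/")))
print(execute(GuiAction("click", "Save button")))
```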
Ran Xu retweeted
Salesforce AI Research @SFResearch
From Flow Generalists to Champions: Building #AgenticAI for Salesforce Automation 💻
Introducing Enterprise General Intelligence (#EGI) models for Salesforce Flow automation!
Blog: salesforce.com/blog/agentic-a…
Unlike frontier LLMs that treat this as token generation, our EGI approach:
✅ Encodes enterprise domain knowledge in a custom DSL
✅ Trains in Flow Simulator with continuous self-improvement
✅ Achieves 50% relative improvement with 88% less data
EGI isn't just better AI—it's AI purpose-built for enterprise. 32% → 48% activation rate on complex flows proves it works.
#EnterpriseAI #FutureOfAI
[image attached]
Replies: 2 · Retweets: 2 · Likes: 12 · Views: 1.3K
Ran Xu retweeted
Li Junnan @LiJunnan0409
🚀 We’re open-sourcing Grounding-R1 — a series of SoTA models for GUI Grounding, trained with RL using a simple click-based reward.
🧠 Dive into our blog post, “GRPO for GUI Grounding Done Right,” for the full training recipe: huggingface.co/blog/HelloKKMe…
[image attached]
Replies: 3 · Retweets: 27 · Likes: 111 · Views: 8.9K
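
A "simple click-based reward" for GUI grounding plausibly reduces to a single question: did the predicted click land inside the target element's bounding box? A sketch of that reading (the linked blog post has the actual recipe):

```python
def click_reward(click_xy, target_box):
    """Binary grounding reward: 1.0 if the predicted click (x, y) falls
    inside the target box (x1, y1, x2, y2), else 0.0. Sparse but
    trivially verifiable, which is what makes it convenient for GRPO."""
    x, y = click_xy
    x1, y1, x2, y2 = target_box
    return 1.0 if x1 <= x <= x2 and y1 <= y <= y2 else 0.0

print(click_reward((150, 40), (120, 20, 200, 60)))  # 1.0
print(click_reward((10, 10), (120, 20, 200, 60)))   # 0.0
```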
Ran Xu retweeted
Salesforce AI Research @SFResearch
🚨 NEW MODEL: BLIP3-o 🚨
🔬 Researchers from @SFResearch + @ml_umd introduce BLIP3-o: solving AI's dual challenge of building ONE model that both understands AND generates images at SOTA level.
💡 Key innovation: dual-stage training with frozen autoregressive backbone prevents task interference - the model excels at both understanding and generation simultaneously.
🔓 Open source for the research community: bit.ly/4muUBzm
🤗 Model: bit.ly/4kB9oXK
💻 Demo: bit.ly/4jb0YVD
📎 Blog: salesforce.com/blog/blip3/
🗞️ Feature: bit.ly/4dyTZoz
#FutureOfAI #EnterpriseAI #OpenScience @github @Marktechpost
[image attached]
Replies: 0 · Retweets: 4 · Likes: 15 · Views: 11.5K
Ran Xu retweeted
Salesforce AI Research @SFResearch
We're thrilled to announce BLIP3-o, a breakthrough in unified multimodal models that excels at both image understanding and generation in a single autoregressive architecture! 💫
📊 Paper: bit.ly/3Saybpo
🤗 Models: bit.ly/4jhFaYM
🧠 Code: bit.ly/43id1uB
📽️ Learn on the go (AI Generated): bit.ly/3EWDZQp
Our research reveals that using CLIP features with a diffusion transformer and flow matching creates superior performance while reducing computational complexity.
Most importantly, we're making this model family available to the AI research community:
▶️ Complete model implementations
▶️ Model weights
▶️ 25M+ detailed caption pretrain dataset
▶️ 60K high-quality instruction tuning dataset
Advance your multimodal AI research and share your findings in the comments. (And thanks for the shout, @_akhaliq!)
[image attached]
Replies: 1 · Retweets: 8 · Likes: 37 · Views: 4.6K
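
The recipe named above, CLIP features plus a diffusion transformer trained with flow matching, can be illustrated with a minimal straight-line flow-matching loss in feature space (a sketch under those stated assumptions, not the BLIP3-o code):

```python
import torch

def flow_matching_loss(model, clip_feats, cond):
    """clip_feats: (B, D) target CLIP image features; cond: conditioning
    from the autoregressive backbone. The model learns to predict the
    constant velocity (target - noise) at a random point t on the path."""
    noise = torch.randn_like(clip_feats)
    t = torch.rand(clip_feats.size(0), 1, device=clip_feats.device)
    x_t = (1 - t) * noise + t * clip_feats  # straight-line interpolation
    velocity_pred = model(x_t, t, cond)
    return torch.mean((velocity_pred - (clip_feats - noise)) ** 2)

# Smoke test with a trivial "model" that predicts zeros:
feats = torch.randn(4, 768)
print(flow_matching_loss(lambda x, t, c: torch.zeros_like(x), feats, cond=None))
```

Matching compact semantic features rather than raw pixels is also where the claimed reduction in computational complexity would come from.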
Ran Xu retweeted
AK @_akhaliq
Salesforce just dropped BLIP3-o on Hugging Face
A Family of Fully Open Unified Multimodal Models - Architecture, Training and Dataset
[image attached]
Replies: 2 · Retweets: 49 · Likes: 184 · Views: 30.6K
Ran Xu retweeted
Salesforce AI Research @SFResearch
🚨🎥🚨🎥🚨 xGen-MM-Vid (BLIP-3-Video) is now available on @huggingface! Our compact VLM achieves SOTA performance with just 32 tokens for video understanding. Features explicit temporal encoder + BLIP-3 architecture. Try it out!
🤗 32 Token Model: bit.ly/3PBNBBz
🤗 128 Token Model: bit.ly/4jfOFZA
📄 Paper: bit.ly/4fWJb36
🖥️ Website: bit.ly/3Yvyqiy
🧵 Research Refresher 👇 #ComputerVision #OpenAI #AIResearch #VLM
(1/3) Despite using far fewer tokens and being smaller (4B vs. 34B), xGen-MM-Vid provides comparable video question-answering accuracies to SOTA.
[image attached]
Replies: 1 · Retweets: 5 · Likes: 10 · Views: 1.8K
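
Compressing a whole video into 32 tokens implies a learned resampler between the vision encoder and the LLM. A common way to get a fixed token budget is cross-attention from learned queries; the sketch below shows that generic pattern (BLIP-3-Video's actual temporal encoder may differ):

```python
import torch
import torch.nn as nn

class TemporalResampler(nn.Module):
    """Map an arbitrary number of per-frame visual tokens down to a fixed
    budget of `num_queries` tokens via learned-query cross-attention."""
    def __init__(self, dim=768, num_queries=32, num_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, video_tokens):  # (B, T*N, dim): all frames, flattened
        q = self.queries.unsqueeze(0).expand(video_tokens.size(0), -1, -1)
        out, _ = self.attn(q, video_tokens, video_tokens)
        return out  # (B, num_queries, dim): the only visual tokens the LLM sees

x = torch.randn(2, 8 * 576, 768)      # 8 frames x 576 tokens per frame
print(TemporalResampler()(x).shape)   # torch.Size([2, 32, 768])
```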
Ran Xu retweeted
Salesforce AI Research @SFResearch
🔬🔬🔬 Introducing ProVision: A new system for transforming images into verified instruction data for multimodal language models (MLMs) at massive scale! Scene graphs + programmatic synthesis generate 10M+ diverse, automated Q&A pairs. Fully verifiable. Training MLMs? Dive in:
📰 Blog: sforce.co/3WazqHi
🗞️ Paper: bit.ly/4jkoocL
💻 Dataset: bit.ly/4j2IojR
👇 Researcher’s 🧵 👇
(1/6) Why build ProVision? Training multimodal LMs demands massive instruction datasets - pairing images with Q&As. Manual creation is costly, while using existing models risks hallucinations.
ProVision's novel solution? Scene graphs + human-written programs. We represent images as structured graphs capturing objects, attributes & relationships. Using Python programs and textual templates, our data generators then synthesize instruction data by creating questions and answers from the scene graph.
👇🧵 for more...
[image attached]
Replies: 1 · Retweets: 33 · Likes: 112 · Views: 20.6K
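
The "scene graphs + human-written programs" pipeline is easy to picture with a toy generator: questions and answers are read directly off the graph, so every pair is correct by construction. A deliberately tiny example (ProVision ships many such generators covering far more question types):

```python
def attribute_qas(scene_graph):
    """Emit templated QA pairs from object attributes in a scene graph.
    No model is in the loop, so nothing can be hallucinated."""
    return [(f"What is the {attr['name']} of the {obj['name']}?", attr["value"])
            for obj in scene_graph["objects"]
            for attr in obj.get("attributes", [])]

graph = {"objects": [
    {"name": "car", "attributes": [{"name": "color", "value": "red"}]},
    {"name": "dog", "attributes": [{"name": "size", "value": "small"}]},
]}
print(attribute_qas(graph))
# [('What is the color of the car?', 'red'), ('What is the size of the dog?', 'small')]
```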
Ran Xu retweeted
Salesforce AI Research @SFResearch
🚨🚨🚨 Introducing PROVE: A new programmatic benchmark for evaluating vision-language models (VLMs).
VLMs often provide responses that are unhelpful, contain false claims about the image, or both. However, benchmarking this in the wild can be surprisingly hard! Enter PROVE, which:
💥 Includes challenging visual QA pairs that are *grounded by design*
💥 Provides a programmatic evaluation framework to quantify response *helpfulness* and *truthfulness*
🕹️ Explore: tinyurl.com/sfr-prove
🤗 Data: tinyurl.com/sfr-prove-hf
📎 Paper: arxiv.org/abs/2410.13121
🧵 Details in comments 👇
[GIF attached]
Replies: 2 · Retweets: 2 · Likes: 22 · Views: 5.5K
Ran Xu retweeted
Salesforce AI Research @SFResearch
📢📢📢 Introducing xGen-MM-Vid (BLIP-3-Video)! This highly efficient multimodal language model is laser-focused on video understanding. Compared to other models, xGen-MM-Vid represents a video with a fraction of the visual tokens (e.g., 32 vs. 4608 tokens).
Paper: arxiv.org/abs/2410.16267
Website: bit.ly/3Yvyqiy
Researcher’s 🧵: 👇
Replies: 3 · Retweets: 14 · Likes: 76 · Views: 12.5K
Ran Xu retweeted
Juan Carlos Niebles @jcniebles
🏃🏻 Swing by the ongoing poster session in Amber 5 before it’s over! Our team is here to chat about xGen-MM (BLIP3). #ECCV2024 @eccvconf
[image attached]
Quoted tweet: Salesforce AI Research @SFResearch

🇮🇹🚀💥 Headed to #ECCV2024? Bookmark this for a deep dive into our team’s groundbreaking research across multiple domains. 👇 [full schedule omitted here; the same tweet appears in full below]

Replies: 0 · Retweets: 3 · Likes: 13 · Views: 2.4K
Ran Xu retweeted
Salesforce AI Research @SFResearch
🇮🇹🚀💥 Headed to #ECCV2024? Bookmark this for a deep dive into our team’s groundbreaking research across multiple domains. 👇
SUNDAY 29 SEPT (All times CEST)
xGen-VideoSyn-1: Setting new standards in text-to-video synthesis
10:38 — 11:20am, Room: Amber 7 + 8
📝 AI4VA Workshop: bit.ly/4ej1qQ0
📚 Paper: arxiv.org/pdf/2408.12590
—
MONDAY 30 SEPT
ECCV2024 Workshop on Multimodal Agents
8:30am — 12:30pm, Room: Amber 7 + 8
📝 Workshop: multimodalagents.github.io
BootPIG: Bootstrapping zero-shot personalized image generation
17:40 — 17:55 (5:40pm - 5:55pm), Room: Space 2
📝 Synthetic Data4CV Workshop: bit.ly/47WRidH
📚 Poster / Paper: arxiv.org/pdf/2401.13974
xGen-MM (BLIP-3): A groundbreaking family of multimodal models
16:00 — 20:00 (4-8pm), Room: Amber 5
📝 EVAL-FoMo 24 Workshop: bit.ly/4eksONQ
📚 Poster / Paper: arxiv.org/pdf/2408.08872
—
WEDNESDAY 2 OCT
LayoutDETR: Redefining multimodal layout design
10:30am — 12:30pm
📚 Poster / Paper: arxiv.org/pdf/2212.09877
X-InstructBLIP: Pioneering cross-modal reasoning
16:30 — 18:30 (4:30 - 6:30pm)
📚 Poster / Paper: arxiv.org/pdf/2311.18799
—
FRIDAY 4 OCT
SQ-LLaVA: Self-questioning in vision-language AI
10:30am - 12:30pm
📚 Poster / Paper: arxiv.org/pdf/2403.11299
See you in Milan, @eccvconf 🤖 #AIResearch #ComputerVision
[GIF attached]
Replies: 1 · Retweets: 6 · Likes: 15 · Views: 23.5K
Ran Xu retweeted
Silvio Savarese @silviocinguetta
Happy to see our team's hard work come to fruition. The xLAM family of models represents a huge leap in AI capabilities for function calling, planning and reasoning—fit-for-purpose for varied needs of modern business. Eager to see where its application takes us! #AIInnovation
Quoted tweet: Salesforce AI Research @SFResearch

Introducing the full xLAM family, our groundbreaking suite of Large Action Models! 🚀 From the 'Tiny Giant' to industrial powerhouses, xLAM is revolutionizing AI efficiency! #AIResearch #AIEfficiency
🤗 Hugging Face Collection: bit.ly/4faoYaQ
🤩 Research Blog: bit.ly/3MxliCZ
🗞️ Press Release: sforce.co/3XzaOt9
Meet the family:
• xLAM-1B / TINY: Our 1B-parameter marvel, ideal for on-device AI. Outperforms larger models despite its compact size.
• xLAM-7B / SMALL: Perfect for swift academic exploration with limited GPU resources.
• xLAM-8x7B / MEDIUM: Mixture-of-experts model balancing latency, resources, and performance for industrial applications.
• xLAM-8x22B / LARGE: Our large-scale model for optimal performance in high-resource environments.
🎉 Huge congrats to the team of AI scientists who brought the xLAM series to life: Zuxin Liu @LiuZuxin, Shirley Kokane @KokaneShirley, Ming Zhu @ming_zhu0527, Tian Lan @TLan001, Jianguo Zhang @JianguoZhang3, Thai Hoang @TeeH912, Caiming Xiong @CaimingXiong, Silvio Savarese @silviocinguetta

Replies: 0 · Retweets: 12 · Likes: 18 · Views: 4.1K
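
For readers new to Large Action Models: "function calling" means the model emits a structured call against a declared tool schema rather than prose, and the host application executes it. A generic, hypothetical example of that plumbing (not xLAM's exact prompt or output format):

```python
import json

# A tool schema in the common JSON-Schema style; an action model is shown
# schemas like this and asked to emit matching calls.
TOOLS = [{
    "name": "get_weather",
    "description": "Get current weather for a city.",
    "parameters": {"type": "object",
                   "properties": {"city": {"type": "string"}},
                   "required": ["city"]},
}]

# Pretend this JSON came back from the model.
model_output = '[{"name": "get_weather", "arguments": {"city": "Palo Alto"}}]'

def dispatch(raw_calls: str, registry: dict):
    """Parse the model's structured calls and route them to local functions."""
    for call in json.loads(raw_calls):
        yield registry[call["name"]](**call["arguments"])

print(list(dispatch(model_output, {"get_weather": lambda city: f"72F in {city}"})))
```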
Ran Xu retweeted
AK @_akhaliq
Salesforce presents xGen-VideoSyn-1: High-fidelity Text-to-Video Synthesis with Compressed Representations
Discuss: huggingface.co/papers/2408.12…
We present xGen-VideoSyn-1, a text-to-video (T2V) generation model capable of producing realistic scenes from textual descriptions. Building on recent advancements, such as OpenAI's Sora, we explore the latent diffusion model (LDM) architecture and introduce a video variational autoencoder (VidVAE). VidVAE compresses video data both spatially and temporally, significantly reducing the length of visual tokens and the computational demands associated with generating long-sequence videos. To further address the computational costs, we propose a divide-and-merge strategy that maintains temporal consistency across video segments.
Our Diffusion Transformer (DiT) model incorporates spatial and temporal self-attention layers, enabling robust generalization across different timeframes and aspect ratios. We have devised a data processing pipeline from the very beginning and collected over 13M high-quality video-text pairs. The pipeline includes multiple steps such as clipping, text detection, motion estimation, aesthetics scoring, and dense captioning based on our in-house video-LLM model.
Training the VidVAE and DiT models required approximately 40 and 642 H100 days, respectively. Our model supports over 14-second 720p video generation in an end-to-end way and demonstrates competitive performance against state-of-the-art T2V models.
[image attached]
Replies: 2 · Retweets: 30 · Likes: 130 · Views: 17.9K
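
The abstract's divide-and-merge strategy, generating a long video as overlapping temporal segments and reconciling the overlaps, can be sketched in a few lines (a simple averaging variant; the paper's exact merging rule may differ):

```python
import torch

def divide_and_merge(latents, process, seg_len=16, overlap=4):
    """Run `process` over overlapping temporal segments of a latent video
    (T, C, H, W) and average the overlaps, so adjacent segments agree
    where they meet instead of showing seams."""
    T = latents.shape[0]
    out = torch.zeros_like(latents)
    count = torch.zeros(T, 1, 1, 1, device=latents.device)
    start = 0
    while start < T:
        end = min(start + seg_len, T)
        out[start:end] += process(latents[start:end])
        count[start:end] += 1
        if end == T:
            break
        start = end - overlap
    return out / count

video = torch.randn(40, 4, 32, 32)                     # 40 latent frames
print(divide_and_merge(video, lambda seg: seg).shape)  # torch.Size([40, 4, 32, 32])
```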
Ran Xu @stanleyran
RT @SFResearch: Breaking news! ➡️➡️➡️ We just released the MINT-1T 🍃dataset! One trillion tokens. Multimodal. Interleaved. Open-source. Pe…
Replies: 0 · Retweets: 5 · Likes: 0 · Views: 2
Ran Xu @stanleyran
Releasing the first of a new series of BLIP - #BLIP3, more to come!
Replies: 0 · Retweets: 1 · Likes: 5 · Views: 322
Ran Xu retweeted
Caiming Xiong @CaimingXiong
Excited to share our brand new LLM evaluation benchmark 🐠FoFo🐠 on format-following!
🐠FoFo🐠 is a pioneering benchmark for evaluating large language models’ (LLMs) ability to follow complex, domain-specific formats, a crucial yet under-examined capability for their application as AI agents.
Link: arxiv.org/pdf/2402.18667…
Our evaluation across both open-source (e.g., Llama 2, WizardLM) and closed-source (e.g., GPT-4, PaLM 2, Gemini) LLMs highlights three key findings:
1. Open-source models significantly lag behind closed-source ones in format adherence.
2. LLMs’ format-following performance is independent of their content generation quality.
3. LLMs’ format proficiency varies across different domains.
These observations suggest two key points:
i) The format-following capacity of LLMs appears independent of their content-following capacity shown in AlpacaEval and MT-Bench, and may necessitate specialized alignment fine-tuning beyond the conventional instruction tuning of open-source LLMs.
ii) Format-following capacity is not universally transferable across domains, highlighting the potential utility of our benchmark as a guiding and probing tool for selecting domain-specific AI agent foundation models.
[2 images attached]
Replies: 3 · Retweets: 15 · Likes: 93 · Views: 11.6K
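
Finding 2 above, that format adherence is independent of content quality, is easiest to see by noting a format check can often be fully programmatic. A toy checker illustrating that separation (FoFo itself evaluates far richer, domain-specific formats):

```python
import json
import re

def format_score(response: str, fmt: str) -> float:
    """Score only whether the response parses under the requested format,
    ignoring whether its content is any good: the two axes are separable."""
    if fmt == "json":
        try:
            json.loads(response)
            return 1.0
        except json.JSONDecodeError:
            return 0.0
    if fmt == "markdown_table":
        rows = [ln.strip() for ln in response.splitlines() if ln.strip().startswith("|")]
        has_rule = len(rows) >= 2 and re.fullmatch(r"\|[\s:|-]+\|", rows[1])
        return 1.0 if has_rule else 0.0
    return 0.0

print(format_score('{"answer": "wrong but well-formed"}', "json"))  # 1.0
print(format_score("not json at all", "json"))                      # 0.0
```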