Salman Abdullah

64 posts

@salmanabdullah_

BS/MS @Stanford | RL, Reasoning, Agents @StanfordAILab

Joined January 2021

421 Following · 191 Followers

Pinned Tweet

Salman Abdullah@salmanabdullah_·2 Feb

Excited to introduce RAPTOR 🦖 at #ICLR2024: RAPTOR is a tree-based retrieval approach that navigates between granular details and a holistic understanding of documents. It sets a new SoTA on 3 benchmarks. Read the paper here: arxiv.org/abs/2401.18059

Parth Sarthi@parthsarthi03

Looking for a RAG system that navigates between granular details and the big picture? We’re excited to introduce RAPTOR, a tree-based retrieval approach that sets a new SoTA on 3 benchmarks #ICLR2024 w/@salmanabdullah_ , @aditituli_, @shubhkhanna__, @annadgoldie & @chrmanning at @stanfordNLP arxiv.org/abs/2401.18059 🧵

3.7K

Salman Abdullah retweeted

Jessica Chudnovsky@jchudnov·11 Mar

Your deduplication pipeline was built for small models. At scale, it's broken. New preprint: "Scale Dependent Data Duplication" 1/10

114

25.6K

Salman Abdullah retweeted

Jack Bai@jackbot_cs·2 Mar

We're proud to share that WebGym is now accepted to CVPR 2026. I would be excited to talk to people working in the vision domain about web agents and reinforcement learning. See you in Denver soon. 😈 Code and data are now publicly available at github.com/microsoft/webg….

Jack Bai@jackbot_cs

😈 Today, we introduce WebGym, the largest-to-date open-source RL environment for web agent training that contains 300k tasks and a rollout framework optimized specifically for web environments' rollout speed. We reveal the effects of essential scaling directions we observe with WebGym. 1/n

3.3K

Salman Abdullah retweeted

Jubayer Ibn Hamid@jubayer_hamid·27 Feb

Happy to share that our paper is accepted at ICLR 2026. Since publishing it, we’ve been pushing the ideas much further: scaling set RL to LLMs for reasoning in maths and coding. Excited to share what we have found soon!

Jubayer Ibn Hamid@jubayer_hamid

Exploration is fundamental to RL. Yet policy gradient methods often collapse: during training they fail to explore broadly, and converge into narrow, easily exploitable behaviors. The result is poor generalization, limited gains from test-time scaling, and brittleness on tasks where strategic exploration is necessary. We introduce a framework for training a policy over sets of generations and use it to induce exploration. Work with @ifdita_hasan (co-lead), @ellenjxu_ , @chelseabfinn and @DorsaSadigh at Stanford 🧵

125

14.9K

Salman Abdullah retweeted

Jack Bai@jackbot_cs·25 Feb

😈 Today, Microsoft open-sources WebGym: the task set, code, a bunch of visualization tools, and guiding documentation. WebGym is an RL environment with the *first* open-source implementation of a fully asynchronous rollout system designed for multi-step vision-supported web agentic trajectory collection, which is *4x-5x* faster than existing synchronous implementations. This release comes with *300k* realistic web agentic tasks with comprehensive evaluation rubrics and pipeline, together with annotations on difficulty and domains. 🧵 1/6

3.9K

Salman Abdullah retweeted

Jack Bai@jackbot_cs·9 Jan

378

43.3K

Salman Abdullah retweeted

Parth Sarthi@parthsarthi03·8 Jan

I am incredibly excited to introduce Chariot (@Chariot_in). Suvrat (@TheBhooshan) and I are working on a research lab based in India to research systems that can truly understand, reason, and interact with the world, starting with speech. We are one of the four teams backed by the @OfficialINDIAai Mission and the Government of India. Today, we had the incredible honor of meeting and interacting with the Honorable Prime Minister of India @NarendraModi ji at his residence to explain what we are working on. While I was explaining our model to him, he zeroed in on the core problem and asked me if the model could discern the intent behind the words to determine the correct tone. With his example of "Ram Naam Satya Hai" vs "Ram Ram", he asked whether the model knows when something is solemn vs casual. Does it understand the weight of what's being said, not just the words themselves? It's exactly the kind of problem we're obsessing over at Chariot: speech isn't transcription, it's intent, emotion, cultural context, all encoded in how something is said. Grateful to the India AI Mission for this opportunity. Lots to build. 🇮🇳

Narendra Modi@narendramodi

Talked AI with youngsters from the Indian StartUp world. It was a memorable and insightful interaction, in which they shared their vision and work on how India is transforming the world of AI. It is commendable how these StartUps are working on diverse fields such as e-commerce, marketing, engineering simulations, material research, healthcare, medical research and more. pib.gov.in/PressReleseDet…

3.4K

Salman Abdullah retweeted

Alex Dimakis@AlexGDimakis·7 Dec

Agreed. The frontier is on Continual learning, personalization and memory management. We fundamentally don’t know how to do it and it will have direct and immediate impact on enterprise.

Sarah Catanzaro@sarahcat21

Let’s repeat: continual learning is the next frontier.

274

44.3K

Salman Abdullah retweeted

Ahmed Awadallah@AhmedHAwadallah·24 Nov

Fara-7B is our first agentic small language model for computer use. We learned a lot, and are looking forward to the next steps:
* Agentic models can be small, yet remain capable.
* Unlike solutions that rely on chat model wrappers, even small agentic models can process screenshots and perform direct GUI actions such as scrolling, typing, and clicking.
* Simulation-driven multi-agent synthetic data that automates task generation, trajectory generation, and validation is a way to address the agentic data scarcity gap, and in our case costs < $1 per task.
* Evaluating CUA is hard; we release WebTailBench, a new eval set with diverse tasks not found in other benchmarks, and work with an external party, Browserbase, to independently assess Fara-7B using human annotators.
Model available on Foundry and HuggingFace and can run on device on Copilot+ PC.

129

23.1K

Salman Abdullah retweeted

Nikunj Kothari@nikunj·25 Nov

Computer use agents are SO wildly underhyped.. 2026 is going to be fun 🕺

221

24.3K

Salman Abdullah retweeted

Anikait Singh@Anikait_Singh_·3 Oct

🚨🚨New Paper: Training LLMs to Discover Abstractions for Solving Reasoning Problems Introducing RLAD, a two-player RL framework for LLMs to discover 'reasoning abstractions'—natural language hints that encode procedural knowledge for structured exploration in reasoning.🧵⬇️

116

595

56.2K

Salman Abdullah retweeted

Chelsea Finn@chelseabfinn·4 Oct

Hierarchical RL for LLM reasoning. Paper: arxiv.org/abs/2510.02263

Yoonho Lee@yoonholeee

The standard way to improve reasoning in LLMs is to train on long chains of thought. But these traces are often brute-force and shallow. Introducing RLAD, where models instead learn _reasoning abstractions_: concise textual strategies that guide structured exploration. 1/N🧵

602

63.1K

Salman Abdullah retweeted

Qdrant@qdrant_engine·24 Jul

Researchers at @ETH_en and @Stanford released an open dataset of 5.8M+ long-form medical QA pairs, each grounded in peer-reviewed literature and designed for RAG. 🚀
The pipeline:
▪️ Source: 900K+ full-text medical papers (S2ORC)
▪️ QA generation via GPT-3.5 with a three-stage filtering process (regex, Mistral-7B classifier, human-in-the-loop)
▪️ Embeddings generated and indexed in Qdrant for scalable dense retrieval
The dataset is available on @huggingface🤗 with full code for embedding, indexing, and RAG setup.
👉 Full story: qdrant.tech/blog/miriad-qd…

1.1K

Salman Abdullah retweeted

Parth Sarthi@parthsarthi03·10 Jul

With the move to Compound AI systems (built from components like finetunable/closed-source models, LLM selectors, and more), one big challenge is end-to-end optimization. Optimizing each component individually doesn't guarantee optimization of the full system. Our latest work introduces Optimas, a framework that solves this by learning reward functions for each part that are aligned with final system performance. Each component gets its own Globally Aligned Local Reward Function (LRF), and we use the right optimization method for each (prompt optimization for API models, PPO for open-source models, hyperparameter selection, etc.). Across 5 real-world tasks, Optimas gets an average 11.92% boost over top baselines (LLMSelector, TextGrad, DSPy). Check it out!

Shirley Wu@ShirleyYXWu

Introducing 🔥Optimas🔥: The first unified framework to optimize compound AI systems composed of multiple components like trainable/API-based LLMs, tools, model routers, and traditional ML models! 🌐
👉🏻 optimas.stanford.edu
🌟 Why Optimas? AI systems today combine diverse elements: prompts, model parameters, hyperparameters, and model routers. Optimizing the entire system effectively is tough! Optimas tackles this with an intuitive strategy: Globally Aligned Local Rewards (LRFs), ensuring each component's optimization directly boosts overall system performance!
📈 Impressive Results: Tested rigorously on 5 real-world compound AI tasks: Product Recommendation, Medical QA, Complex Retrieval, Multi-hop QA, and Code Generation.
🤩 Delivers an impressive average boost of 11.92% over top baselines (e.g., LLMSelector, TextGrad, DSPy).
🔧 Here's the magic behind Optimas:
① Assigns each component a Local Reward Function (LRF).
② Aligns these LRFs with global objectives, enabling independent yet coordinated optimizations.
③ Adaptively updates LRFs for efficient, coherent improvements across diverse configurations.
💡 Compatible with popular agentic frameworks: easily optimize your own systems! Integrates with popular agentic frameworks like @DSPyOSS, @crewAIInc, @pyautogen, TextGrad, and OpenAI Agent SDK @OpenAIDevs!
Proudly developed by an outstanding collaboration between @StanfordAILab, @AmazonScience, and more! Grateful to work with team @parthsarthi03, Shiyu, Aaron, @krypticmouse, @Diyi_Yang, @james_y_zou, @jure etc.!
Check out more!
📄 Paper: arxiv.org/abs/2507.03041
💻 Code: github.com/snap-stanford/… (to be open-sourced soon!)
#CompoundAISystem #LLM #Optimization #MachineLearning

3.2K

Salman Abdullah retweeted

Marktechpost AI Dev News ⚡@Marktechpost·25 Jun

Researchers from ETH Zurich, Stanford, Mayo Clinic, and others have developed MIRIAD, a large-scale dataset containing 5.8 million medical instruction-response pairs, each grounded in peer-reviewed literature. Designed to address the factual inconsistencies of large language models (LLMs) in clinical settings, MIRIAD enhances retrieval-augmented generation (RAG) pipelines by replacing noisy, unstructured content with clean, semantically aligned QA data. When integrated with LLMs, MIRIAD improves accuracy by up to 6.7% and significantly enhances hallucination detection by up to 37%. The dataset is supported by MIRIAD-Atlas, an interactive visualization tool spanning 56 medical domains, allowing users to explore content by topic. Built through a semi-automated pipeline involving GPT-4 supervision and human expert validation, MIRIAD serves both as a high-quality retrieval corpus and a training set for specialized medical retrievers. This structured resource sets a new standard for safe and explainable medical AI, facilitating more trustworthy applications in clinical question-answering, digital health interfaces, and research. Read full article: marktechpost.com/2025/06/25/eth… Paper: arxiv.org/abs/2506.06091 Dataset: huggingface.co/miriad Code: github.com/eth-medical-ai… @Michael_D_Moor @QueyJ , @salmanabdullah_ , @samarthrawal , @cyrilzakka , @SophieOstmeier @edreisMD , @EricTopol @jure

802

Salman Abdullah retweeted

Quentin Lhoest 🤗@lhoestq·16 Jun

YEEESSSssss dataset loading with Spark is 🔥 👉It loads ANY dataset on @huggingface in one line of code Using pyspark_huggingface 1.0 released last week e.g. here the latest Medical QA dataset (5M+ rows🤯) by @Michael_D_Moor @salmanabdullah_ and team

6.7K

Salman Abdullah retweeted

Charly Wargnier@DataChaz·10 Jun

🚨 Just released: MIRIAD, a million-scale medical QA dataset to ground LLMs in reliable medical knowledge. 5.8M question-answer pairs, each distilled from peer-reviewed literature! 🔥 That's structured, high-quality data built for medical AI. 🧵 ↓

102

430

48.1K

Salman Abdullah@salmanabdullah_·9 Jun

Great collaboration with @QueyJ, @samarthrawal, @cyrilzakka, @SophieOstmeier, Maximilian Purk, @edreisMD, @EricTopol, @jure & @Michael_D_Moor!

216

Salman Abdullah@salmanabdullah_·9 Jun

We're excited to release MIRIAD - a massive-scale 5.8M+ synthetic dataset for retrieval in medicine. It improves RAG performance, helps LLMs detect medical hallucinations, and enables training of domain-specific retrievers. 🤗 Dataset: huggingface.co/miriad 🖥️ Code: github.com/eth-medical-ai… 📄 Preprint: arxiv.org/abs/2506.06091

Michael Moor@Michael_D_Moor

Excited to announce MIRIAD — a large-scale dataset of 5,821,948 medical question-answer pairs, each rephrased from passages in the medical literature. Great collab with @QueyJ, @salmanabdullah_, @samarthrawal, @cyrilzakka, @SophieOstmeier, Maximilian Purk, @edreisMD, @EricTopol & @jure! Page: med-miriad.github.io Dataset: huggingface.co/miriad Preprint: arxiv.org/abs/2506.06091 Code: github.com/eth-medical-ai… Demo: med-miriad.github.io/demo [1/n]

5.1K

Salman Abdullah retweeted

Avanika Narayan@Avanika15·13 May

can you chat privately with a cloud llm—*without* sacrificing speed? excited to release minions secure chat: an open-source protocol for end-to-end encrypted llm chat with <1% latency overhead (even @ 30B+ params!). cloud providers can’t peek—messages decrypt only inside a secure gpu enclave, where inference stays fully confidential 🤯 links + code in comments👇

243

79.1K

Salman Abdullah@salmanabdullah_·5 May

@jyangballin github.com/volcengine/verl

324

John Yang@jyangballin·5 May

@ weekend warriors - DM me a GitHub repo that you like / maintain, and I'll train you a 7B coding agent that's an expert for that repo. Main constraints - it's predominantly Python, and has a testing suite w/ good coverage. (example of good repo = sympy, pandas, sqlfluff)

115

16.9K
