Ranjay Krishna (@RanjayKrishna) - Twitter Profile

Ranjay Krishna retweeted

VLMs today—including our own Molmo—point via raw text strings (e.g. ""). What if pointing meant directly selecting the visual tokens instead? 🤔 Introducing MolmoPoint: Better Pointing for VLMs with Grounding Tokens 🎯 🔓models, code, data, demo all OPEN 🧵👇 Paper: allenai.org/papers/molmopo…

Ranjay Krishna retweeted

Yue Yang@YueYangAI·2d

🎯 We release MolmoPoint, the best open model in GUI grounding 💻 by training on purely synthetic screenshots. We open-source all our models, data, and generation code. Plug it into your agents! Demo: huggingface.co/spaces/allenai… Model: huggingface.co/allenai/MolmoP… Data: huggingface.co/datasets/allen… Code: github.com/allenai/MolmoP…

Ai2@allen_ai

Grounding lets vision-language models do more than describe—they can point to where a robot should grasp, which button to click, or which object to track across video frames. Today we're releasing MolmoPoint, a better way for models to point. 🧵

Ranjay Krishna retweeted

Zixian Ma@zixianma02·2d

MolmoPoint now points and grounds to visual tokens directly, instead of naively outputting coordinates in text 🎯 It can also do GUI grounding very well, in addition to better image and video pointing 💻 Check out our super neat new release!

Ai2@allen_ai

VLMs already have visual tokens. Letting them point by selecting those tokens turns out to be simpler, faster, & better. 🤖 Models: huggingface.co/collections/al… 📦 Data: huggingface.co/collections/al… 💻 Code: github.com/allenai/molmo2 📖 Blog: allenai.org/blog/molmopoint

Ranjay Krishna retweeted

Ai2@allen_ai·2d

Grounding lets vision-language models do more than describe—they can point to where a robot should grasp, which button to click, or which object to track across video frames. Today we're releasing MolmoPoint, a better way for models to point. 🧵

Ranjay Krishna retweeted

Ai2@allen_ai·3d

"We trained our very first [Molmo] model and were surprised to find that it outperformed GPT. Scale wasn't everything in vision language — clearly there was a key role for data." @RanjayKrishna on today's open model panel at #NVIDIAGTC

Ranjay Krishna retweeted

Jieyu Zhang@JieyuZhang20·4d

Nice to see researchers already finetune Molmo2 (and leverage its grounding capability) to do cool tasks!

AK@_akhaliq

Can Vision-Language Models Solve the Shell Game? paper: huggingface.co/papers/2603.08…

Ranjay Krishna retweeted

Ainaz Eftekhar@ainaz_eftekhar·11 Mar

Excited to share MolmoBot! 🤖 A big milestone for sim-to-real robotics!🚀 We show that training manipulation policies on massive, diverse simulation data can transfer zero-shot to the real world—for both static and mobile manipulation tasks🦾

Ai2@allen_ai

Today, a step forward in open robotics - our results show that sim-to-real zero shot transfer for manipulation is possible. MolmoBot is our open model suite for robotics, trained entirely in simulation on MolmoSpaces.🧵

Ranjay Krishna retweeted

Linxin Song@linxins2·12 Mar

🚀 Introducing ExeVRM — a video-based reward model that judges whether a computer-use agent actually completed your task, just by watching the screen recording. Our 8B model hits 84.7% accuracy & 87.7% recall, outperforming GPT-5.2 and Gemini-3 Pro on execution video assessment across Ubuntu, macOS, Windows & Android. No access to agent internals needed. Just the video. 🎬 📄 Paper: arxiv.org/abs/2603.10178 💻 Code: github.com/limenlp/ExeVRM 🤗 Model: huggingface.co/lime-nlp/ExeVR… 📦 Data: huggingface.co/datasets/lime-…

Ranjay Krishna@RanjayKrishna·12 Mar

We are releasing MolmoBot! We challenge the assumption that sim-to-real requires real-world finetuning. Our robot models beat strong baselines with no real world data. With enough diversity and scale in simulation, zero-shot transfer can actually work—across both static and mobile manipulation. Similar to all our projects, everything is open sourced.

Ai2@allen_ai

Today, a step forward in open robotics - our results show that sim-to-real zero shot transfer for manipulation is possible. MolmoBot is our open model suite for robotics, trained entirely in simulation on MolmoSpaces.🧵

Ranjay Krishna retweeted

Ai2@allen_ai·11 Mar

Today, a step forward in open robotics - our results show that sim-to-real zero shot transfer for manipulation is possible. MolmoBot is our open model suite for robotics, trained entirely in simulation on MolmoSpaces.🧵

Ranjay Krishna retweeted

pfung@philfung·2 Mar

I read this paper and it's awesome - it creates a high-performing, smooth reward function (far superior to GVL) that is SUPER simple to implement with an LLM.

IMPLEMENTATION:
1. SELECT A MODEL: Pick an open-weight, multimodal LLM (e.g. Qwen3-VL).
2. PROMPT THE MODEL: Send the LLM the following prompt: "The above video shows a robot manipulation trajectory that completes the following task: {INSTRUCTION}. Decide whether the above statement is True or not. The answer is: " [where INSTRUCTION is any task like "fold the towel" or "pour coffee into the cup"]
3. EXTRACT THE REWARD: Find the *logit probability* for the specific token "True" and use that as your reward signal. [The logit is the raw, unnormalized score the model assigns to the "True" token before it passes through the softmax layer. Open-weight models expose it directly; some closed-source APIs expose the closely related log probs instead - for example, ChatGPT exposes log probs, whereas Claude does not]

That's it!! Obviously the logit prob and using the term "True" are key insights. It is quite elegant. Congrats to the brilliant authors at @UW and @allen_ai !

Jiafei Duan@DJiafei

Instead of asking a VLM to output progress, it reads the model’s internal belief directly from token logits. No in-context learning. No fine-tuning. No reward training. 📈 We introduce: TOPReward, a zero-shot reward modeling approach for robotics using token probabilities from pretrained video VLMs. The simplest way of doing reward modelling for robotics! Project: topreward.github.io/webpage/ 🧵👇
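
The three steps in the post above can be sketched in a few lines of Python. This is a minimal sketch, not the authors' code: it assumes you have already run a video VLM on the video plus the prompt and obtained the next-token logits as a plain list of floats, and that you know the vocabulary id of the "True" token. The names `build_prompt`, `log_softmax`, `true_token_reward`, and the toy values are hypothetical. The posts here do not settle whether the raw logit or the normalized log-probability is the intended signal, so the sketch exposes both.

```python
import math

# Prompt wording taken from the post; {instruction} is filled in per task.
PROMPT_TEMPLATE = (
    "The above video shows a robot manipulation trajectory that completes "
    "the following task: {instruction}. Decide whether the above statement "
    "is True or not. The answer is: "
)


def build_prompt(instruction: str) -> str:
    """Fill the task description into the prompt template."""
    return PROMPT_TEMPLATE.format(instruction=instruction)


def log_softmax(logits):
    """Numerically stable log-softmax over a list of raw logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def true_token_reward(next_token_logits, true_token_id, use_log_prob=True):
    """Score of the "True" token at the answer position.

    use_log_prob=True  -> normalized log-probability (what log-prob APIs return)
    use_log_prob=False -> raw, unnormalized logit
    """
    if use_log_prob:
        return log_softmax(next_token_logits)[true_token_id]
    return next_token_logits[true_token_id]


# Toy stand-in for a model's output: a 3-token vocabulary where id 0 is "True".
# A real model would return one logit per vocabulary entry.
toy_logits = [2.0, 0.5, -1.0]
TRUE_ID = 0  # hypothetical token id

prompt = build_prompt("fold the towel")
reward = true_token_reward(toy_logits, TRUE_ID)
```

To turn this into a per-frame progress curve, one would call the model on successively longer prefixes of the video and record the score for each prefix; that loop depends on the specific model API and is left out here.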

Ranjay Krishna retweeted

pfung@philfung·4 Mar

Inspired by the TopReward paper, I made a lil web tool to test these robot manipulation rewards on your own videos. Try: philfung.github.io/rewardscope

Record yourself folding a towel, upload it, and compare:
1. TopReward (this paper)
2. GVL (DeepMind)
3. Brute Force (i.e. at each frame, ask the LLM to reply with a probability)

TopReward (Qwen3VL-8B) holds its own surprisingly well against the others, even if those use ChatGPT! Great work @DJiafei, UW, AllenAI, thanks for pushing @VilleKuosmanen.

Ranjay Krishna retweeted

Ai2@allen_ai·3 Mar

📢 Update: the Molmo 2 codebase is now open source. We're releasing the code behind Molmo 2—our open model family for video & image understanding, pointing, tracking, & more. Now you can easily train Molmo 2 on your own data. 🧵

Ranjay Krishna retweeted

Jiafei Duan@DJiafei·26 Feb

Instead of asking a VLM to output progress, it reads the model’s internal belief directly from token logits. No in-context learning. No fine-tuning. No reward training. 📈 We introduce: TOPReward, a zero-shot reward modeling approach for robotics using token probabilities from pretrained video VLMs. The simplest way of doing reward modelling for robotics! Project: topreward.github.io/webpage/ 🧵👇

Ranjay Krishna retweeted

Weikai Huang@weikaih04·24 Feb

Free Jigsaw-like data > massive human-annotated data, on detection / segmentation tasks?

Excited to share our CVPR 2026 paper from @UW + @allen_ai: SOC: Synthetic Object Compositions for Scalable and Accurate Learning in Detection, Segmentation, and Grounding

We generated 20M jigsaw puzzle-like synthetic object segments (47K categories) and composed 2M detection/segmentation/grounding training images, with zero human annotation.

💡Key idea: Diffusion models excel at generating single objects. So we:
1️⃣ Generate individual objects → get perfect masks for free
2️⃣ Compose them like jigsaw puzzles with 3D layout priors
3️⃣ Use generative relighting to harmonize the scene

Result: Training data with pixel-perfect annotations at any scale.

📊 Highlights:
→ LVIS Detection: 50K SOC images → +9.7 AP, rare classes +13.4 AP, outperforming 20M GRIT and rivaling 200K human-annotated V3Det
→ Visual Grounding: gRefCOCO no-target accuracy +8.4, DoD +3.8 mAP, beating both GRIT & V3Det
→ Instance Seg: LVIS rare +3.83 AP; COCO 1% data regime +6.59 AP

Huge thanks to my great mentors @RanjayKrishna, @JieyuZhang20 and all collaborators @TaoyangJia, @Michael3014018, Ziqi Gao, @jjaesungpark, and @WinsonHan

Open-sourcing:
📄 arxiv.org/abs/2510.09110
💻 github.com/weikaih04/Synt…
🤗 huggingface.co/collections/we…
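
The "masks for free" idea in the post above can be illustrated with a toy compositor. This is a sketch under loud assumptions, not the SOC pipeline: objects are tiny binary masks given as nested lists, placement positions are supplied by hand (the 3D layout priors and generative relighting from the post are not modeled), and the function names `compose` and `visible_box` are hypothetical. It shows only why compositing yields exact annotations: since each object is pasted at a known location, the visible mask and box of every instance fall out of the paste order.

```python
from typing import List, Optional, Tuple

Mask = List[List[int]]  # 1 = object pixel, 0 = transparent


def compose(canvas_h: int, canvas_w: int,
            objects: List[Tuple[Mask, int, int]]) -> List[List[int]]:
    """Paste (mask, top, left) objects in order onto an empty canvas.

    Returns an instance-id map: 0 is background, k is the k-th pasted
    object (1-based). Later objects occlude earlier ones.
    """
    id_map = [[0] * canvas_w for _ in range(canvas_h)]
    for idx, (mask, top, left) in enumerate(objects, start=1):
        for r, row in enumerate(mask):
            for c, v in enumerate(row):
                y, x = top + r, left + c
                if v and 0 <= y < canvas_h and 0 <= x < canvas_w:
                    id_map[y][x] = idx
    return id_map


def visible_box(id_map: List[List[int]],
                idx: int) -> Optional[Tuple[int, int, int, int]]:
    """Tight (x_min, y_min, x_max, y_max) box around the visible pixels
    of object idx, or None if the object is fully occluded."""
    ys = [y for y, row in enumerate(id_map) for v in row if v == idx]
    xs = [x for row in id_map for x, v in enumerate(row) if v == idx]
    if not ys:
        return None
    return (min(xs), min(ys), max(xs), max(ys))


# Two 2x2 squares on a 4x4 canvas; the second overlaps the first by one pixel.
square = [[1, 1], [1, 1]]
id_map = compose(4, 4, [(square, 0, 0), (square, 1, 1)])
boxes = {k: visible_box(id_map, k) for k in (1, 2)}
```

The instance-id map doubles as a segmentation label, and the boxes serve as detection labels, with no annotator in the loop.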

Ranjay Krishna retweeted

Xia Su@XiaSu09·23 Feb

🚀 Excited to share that our paper “CapNav: Benchmarking Vision Language Models on Capability-conditioned Indoor Navigation” has been accepted to #CVPR2026 (main track)! 📄: arxiv.org/abs/2602.18424 💻GitHub: github.com/makeabilitylab… 🤗Hugging Face: huggingface.co/datasets/Richa… (1/4)

Ranjay Krishna retweeted

Arijit Ray@ARRay693·18 Feb

"It is by logic that we prove, but by [abstract] intuition that we discover." - Henri Poincaré. When faced with a complex problem, we pause, we think. Not exactly in words, not exactly in images — in something more abstract, something harder to name. So, for truly intelligent agents, should we not ask that they do the same? Introducing Mull-Tokens — a modality-agnostic latent thinking paradigm. Now, the model can think in space, in time, in words, in affordances — in all the things that language alone cannot easily convey. arijitray.com/multimodal_thi…

Ranjay Krishna retweeted

Manling Li@ManlingLi_·16 Feb

📍Theory of Space (accepted at #ICLR2026)

Theory of Mind → hidden mental states
Theory of Space → hidden spatial beliefs

From passive observers asking “What do I know?” to active explorers asking “What don’t I know, and how do I reduce that uncertainty?”

Theory of Space evaluates whether foundation models can actively construct, revise, and exploit internal spatial beliefs. We quantify the Active-Passive Gap: we measure not just task accuracy, but how much uncertainty is reduced per step and how many steps agents need in total to build stable spatial beliefs. Exploration should prioritize information gain and reduce uncertainty per step. Instead, we observe that LLMs/VLMs explore redundantly, with stalled belief updates.

Key findings:
1. Active agents perform worse than rule-based programs
2. Cognitive Map Failures & Belief Drift (beliefs about previously observed objects degrade over time; new updates corrupt earlier correct perceptions)
3. Poor Vision Identification & Belief Inertia in Belief Revision

Website: theory-of-space.github.io
Code: github.com/mll-lab-nu/The…
Data: huggingface.co/datasets/MLL-L…

Theory of Space is a joint effort of @NorthwesternEng, @StanfordAILab, @uwcse, @Cornell_CS. Led by the amazing @WilliamZhangNU, jointly done with @zihanhuang66, @YueYuew8314, @JieyuZhang20, @XLe41402, @wzihanw, @qineng_wang, @keshigeyan, @RuohanZhang76, @YejinChoinka, @RanjayKrishna, @jiajunwu_cs, @drfeifei

Ranjay Krishna@RanjayKrishna·13 Feb

@dragon_khoi Because there were no diverse simulation environments available to train/test with... until now.

khoi@dragon_khoi·13 Feb

@RanjayKrishna how come most of the big labs rely mostly on real world (teleop or video, etc) today? (nvidia, pi, 1x, deepmind, etc)

Ranjay Krishna@RanjayKrishna·12 Feb

The amount and diversity of robot data we need is much higher than what we can scale. We are betting on simulation! MolmoSpaces allows you to generate seemingly unlimited amounts of robot data in large diverse environments across multiple simulators.

Ai2@allen_ai

Introducing MolmoSpaces, a large-scale, fully open platform + benchmark for embodied AI research. 🤖 230k+ indoor scenes, 130k+ object models, & 42M annotated robotic grasps—all in one ecosystem.

Ranjay Krishna retweeted

Luci Pars@parsluci·12 Feb

Robots now rehearse in 230 thousand homes: MolmoSpaces simulates the real world

Allen AI has announced a massive open platform called MolmoSpaces. It is an incredible resource for the AI that will let robots move around in the real world. It is packed with more than 230 thousand different indoor spaces, more than 130 thousand realistic 3D object models, and a full 42 million verified grasps. Everything is simulated according to the laws of physics; the weight and stiffness of objects and even the way doors open are taken into account. Things that used to be glossed over as "it touched the object, so it grasped it" now work realistically.

Thanks to this platform, a robot that enters a new room will be able to learn to grasp and use items it has never seen before. There is also a test system called MolmoSpaces-Bench: change the lighting, increase the object's weight, phrase the command differently. By adding difficulties one at a time, you can clearly see where the AI stumbles. Systematic experiments across thousands of scenes have become possible.

What's more, anyone can use it: the code is open, the data is on Hugging Face, the demo is ready on the site, and it is even compatible with different simulators. You can even control the robot remotely from your phone and collect data, with no extra setup required.

It looks like home assistant robots and factory systems will develop much faster in the future. The release of an open dataset and toolkit at this scale has really excited researchers. This is not an ad.

Ai2@allen_ai

Introducing MolmoSpaces, a large-scale, fully open platform + benchmark for embodied AI research. 🤖 230k+ indoor scenes, 130k+ object models, & 42M annotated robotic grasps—all in one ecosystem.
