Jiawei Gu

131 posts

@Kuvvius

Joined March 2023

286 Following · 325 Followers

Pinned Tweet

Jiawei Gu@Kuvvius·3 Nov

🚨Sensational title alert: we may have cracked the code to true multimodal reasoning. Meet ThinkMorph — thinking in modalities, not just with them. And what we found was... unexpected. 👀 Emergent intelligence, strong gains, and …🫣 🧵 arxiv.org/abs/2510.27492 (1/16)

316

68.6K

Jiawei Gu@Kuvvius·6d

🧐Hot take from building Gym-V: the vision agent community is over-investing in better algorithms and under-investing in better observations. How you frame what the model sees matters more than how you train it.

Fanqing Meng@FanqingMengAI

One finding: observation scaffolding is the most decisive factor for RL training success, more than algorithm choice. ✅ Adding captions to images → consistent improvement across ALL environments ❌ Removing game rules → can kill learning entirely ⚖️ GRPO vs GSPO vs SAPO? All improve, but no single algorithm dominates. HOW you present the task to the agent matters more than HOW you optimize it.

Jiawei Gu@Kuvvius·6d

Me: "I study multimodal reasoning" Gym-V results: "Your models can't even read a chart and click the right button" 🤦‍♂️ Time for VLMs to hit the gym.

Fanqing Meng@FanqingMengAI

Text agents have their Gym. Vision agents? Not until now. Introducing Gym-V — a unified gym-style platform for agentic vision research, with 179 procedurally generated environments across 10 domains. One API to rule them all: 📦 Offline dataset 🤖 Agentic RL training 🔧 Tool-use training 👥 Multi-agent training 📊 VLM & T2I model evaluation All under the same reset/step interface. Key findings: 1. Observation scaffolding matters MORE than RL algorithm choice 2. Broad curricula transfer well; narrow training causes negative transfer 3. Multi-turn interaction amplifies everything 📄 Paper: arxiv.org/abs/2603.15432 💻 Code: github.com/ModalMinds/gym… Open the thread for a deep dive! 🧵

739

Jiawei Gu retweeted

Peter Tong@TongPetersb·4 Mar

Train Beyond Language. We bet on the visual world as the critical next step alongside and beyond language modeling. So, we studied building foundation models from scratch with vision. We share our exploration: visual representations, data, world modeling, architecture, and scaling behavior! [1/9]

222

1.1K

208.1K

Jiawei Gu retweeted

OpenMOSS@Open_MOSS·1 Mar

CVPR2026 🎉 Thinking with Video: Video Generation as a Promising Multimodal Reasoning Paradigm 🌟We use video frames as a unified medium for text and vision reasoning. 🤯 🔥Video model (Sora-2) beats GPT-5 by 10% on Eyeballing Puzzles! 🧵arxiv.org/abs/2511.04570 (1/6) #CVPR2026 #seedance2 #Multimodal #VideoGeneration #Sora2 #Reasoning #LLM #AI

1.5K

Jiawei Gu@Kuvvius·1 Mar

@zst96687522 @aigclink hahaa

Shitian Zhao@zst96687522·1 Mar

@aigclink @Kuvvius 😅

AIGCLINK@aigclink·1 Mar

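Over the past 30 days, 128 startups built on OpenClaw generated a combined $280,000 in real revenue, an average of about $2,200 per company per month; the top earner brought in $50,000 in monthly revenue. TrustMRR currently lists 128 of them, and the number keeps growing. For now, the business models of these 128 products are fairly concentrated: 80% of the companies work on lowering the barrier to using OpenClaw, only 3 to 5 build at the application layer, and commercial use cases have not yet been explored in much depth. #Openclaw #openclaw赚钱 #AIagent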

228

1.1K

212.5K

Jiawei Gu@Kuvvius·28 Feb

Nice work showing interaction collapse is a training problem, not an interaction problem. Excited to see where agentic vision goes from here.

Shitian Zhao@zst96687522

(1/7) We present PyVision-RL, a unified RL framework that stabilizes training and sustains interaction for agentic vision models with Python as the primitive tool. 🧵👇 arxiv.org/abs/2602.20739

384

Jiawei Gu retweeted

Manling Li@ManlingLi_·16 Feb

📍Theory of Space (accepted at #ICLR2026)
Theory of Mind → hidden mental states
Theory of Space → hidden spatial beliefs
From passive observers (“What do I know?”) to active explorers (“What don’t I know, and how do I reduce that uncertainty?”).
Theory of Space evaluates whether foundation models can actively construct, revise, and exploit internal spatial beliefs.
We quantify the Active-Passive Gap: not just task accuracy, but how much uncertainty is reduced per step, and how many steps agents need in total to build stable spatial beliefs.
Exploration should prioritize information gain and reduce uncertainty per step. Instead, we observe that LLMs/VLMs explore redundantly with stalled belief updates.
Key findings:
1. Active agents perform worse than rule-based programs
2. Cognitive Map Failures & Belief Drift (beliefs about previously observed objects degrade over time; new updates corrupt earlier correct perceptions)
3. Poor Vision Identification & Belief Inertia in Belief Revision
Website: theory-of-space.github.io
Code: github.com/mll-lab-nu/The…
Data: huggingface.co/datasets/MLL-L…
Theory of Space is a joint effort of @NorthwesternEng, @StanfordAILab, @uwcse, @Cornell_CS. Led by the amazing @WilliamZhangNU, jointly done with @zihanhuang66, @YueYuew8314, @JieyuZhang20, @XLe41402, @wzihanw, @qineng_wang, @keshigeyan, @RuohanZhang76, @YejinChoinka, @RanjayKrishna, @jiajunwu_cs, @drfeifei

492

51.3K

Jiawei Gu@Kuvvius·14 Feb

💥Exciting to see Seed 2.0 evaluated on our EMMA multimodal reasoning benchmark! Frontier-level results. Congrats to the team. 👏 emma-benchmark.github.io

621

Jiawei Gu retweeted

Zijian Wu@Jaku_metsu·5 Feb

do research and then publish papers / opensource / write blogs, not just publish papers. publishing is not necessarily research.

Fanqing Meng@FanqingMengAI

I am so confused that some people treat research and engineering as separate. First be a good engineer, then learn to become a researcher.

472

Jiawei Gu@Kuvvius·4 Feb

@WeiLiu99 congrats!

Wei Liu@WeiLiu99·3 Feb

Happy to share that LASER has been accepted to ICLR 2026. Also, huge congrats on the success of Kimi 2.5! It’s thrilling to see them achieve such impressive results in efficiency enhancement via RL. Their approach shares a similar philosophy with our LASER-D: using an adaptive, difficulty-aware mechanism. It’s fascinating to see this logic align so well in a more online setting (w/ rollout info). Great validation that this is a promising path for efficient reasoning w/o compromising effectiveness!

Wei Liu@WeiLiu99

“What is the answer of 1 + 1?” Large Reasoning Models (LRMs) may generate 1500+ tokens just to answer this trivial question. Too much thinking 🤯
Can LRMs be both Faster AND Stronger? Yes.
Introducing LASER💥: Learn to Reason Efficiently with Adaptive Length-based Reward Shaping
We propose LASER and its adaptive variants LASER-D / LASER-DE:
→ +6.1 accuracy on AIME24
→ –63% token usage
🔧 What we introduce:
A unified framework that connects truncation and previous length rewards under one view
A novel Length-bAsed StEp Reward (LASER) that softly encourages conciseness
LASER-D: Adapts target lengths based on question difficulty + training dynamics
LASER-DE: Encourages exploration on incorrect attempts
🔥 Unlike prior methods that trade efficiency for accuracy, LASER-D/E achieve Pareto-optimality:
✔️ Higher accuracy
✔️ Shorter outputs
✔️ Robust across model sizes (1.5B → 32B)
✔️ Strong generalization (GPQA, LSAT, MMLU)
Example: The original LRM needs 1490 tokens to answer “1 + 1” (with many self-reflections and finger counting 🤦). The LASER-D model?
✅ Answers directly in 76 tokens
✅ No lost reasoning ability
✅ More concise and intelligent
Check it out:
📄 Paper: huggingface.co/papers/2505.15…
💻 Code & Models: github.com/hkust-nlp/Laser

5.5K

Jiawei Gu@Kuvvius·31 Jan

@zst96687522 Thxx 🙌

Shitian Zhao@zst96687522·31 Jan

@Kuvvius Congrats!

Jiawei Gu@Kuvvius·30 Jan

Also accepted at #ICLR2026!! 🥳 arxiv.org/abs/2601.18631

Jiawei Gu@Kuvvius

⛔️ Can MLLMs truly learn WHEN and HOW to use tools? 🛠 AdaReasoner says: yes!! Like… actually decide: - “Should I call a tool right now?” - “Which one?” - “How many times?” What happened surprised us: a 7B model beats GPT-5 on visual tool-reasoning, and shows adaptive behaviors we never programmed. (1/17)🧵👇 📄 arxiv.org/abs/2601.18631 🌐 adareasoner.github.io

4.9K

AK@_akhaliq·28 Jan

AdaReasoner: Dynamic Tool Orchestration for Iterative Visual Reasoning huggingface.co/papers/2601.18…

9.2K

Jiawei Gu retweeted

Yejin Choi@YejinChoinka·26 Jan

Excited to share TTT-Discover (Test-Time Training for Discovery)—seeking new discoveries on long-standing problems: ✅Erdős min overlap, ✅denoising for single-cell analysis, and ✅GPU kernels! The key insight: Scientific discovery requires learning from a long sequence of trials and errors. Current approaches like AlphaEvolve operate with a frozen policy 🧊—only prompts evolve at test time. TTT-Discover instead lets the policy itself adapt 🚀, laser-focusing on one extremely hard problem for as long as it takes. Test-Time Training (TTT): a new frontier for scaling intelligence 🔥

Mert Yuksekgonul@mertyuksekgonul

How to get AI to make discoveries on open scientific problems? Most methods just improve the prompt with more attempts. But the AI itself doesn't improve. With test-time training, AI can continue to learn on the problem it’s trying to solve: test-time-training.github.io/discover.pdf

288

44.8K

Jiawei Gu@Kuvvius·29 Jan

Love this framing of Agentic Vision: Think → Act → Observe. ⚡ AdaReasoner hits the same loop for multimodal reasoning: actively see/verify/plan with tools, not just “think harder.” arxiv.org/abs/2601.18631 [Small models] can go surprisingly far when tool orchestration is learned.

Google AI@GoogleAI

Introducing Agentic Vision — a new frontier AI capability in Gemini 3 Flash that converts image understanding from a static act into an agentic process. By combining visual reasoning with code execution, one of the first tools supported by Agentic Vision, the model grounds answers in visual evidence and delivers a consistent 5-10% quality boost across most vision benchmarks. Here’s how the agentic ‘Think, Act, Observe’ loop works: — Think: The model analyzes an image query then architects a multi-step plan — Act: The model then generates and executes Python code to actively manipulate or analyze images — Observe: The transformed image is appended to the model's context window, allowing it to inspect the new data before generating a final response to the initial image query Learn more about Agentic Vision and how to access it in our blog ⬇️ blog.google/innovation-and…

1.6K

Jiawei Gu@Kuvvius·29 Jan

@_akhaliq Thanks for sharing our work! 🙌 👇 Full details & breakdown here: x.com/Kuvvius/status…

Jiawei Gu@Kuvvius

193

Jiawei Gu@Kuvvius·28 Jan

Open-sourced 🚀 Code: github.com/ssmisya/AdaRea… Models: huggingface.co/AdaReasoner Try it, break it, build on it—PRs and issues welcome 🛠️

152

Jiawei Gu@Kuvvius·28 Jan

Big credit to the incredible collaborators 🙌 @ssmisya1 Haoyu Sun @LINJIEFUN Luxin Xu @RanjayKrishna @YuCheng348997 Together, we’re pushing multimodal reasoning in MLLMs from “guess” to “see”, “think” and “check” ✅

186

Jiawei Gu@Kuvvius·28 Jan

GIF

6.4K
