Chao Feng

41 posts

Chao Feng
@chaof1234

PhD student @cornell_tech @Cornell_CS | Research Intern @AdobeResearch

Joined July 2024
183 Following · 132 Followers

Chao Feng reposted
Xiyao Wang @XiyaoWang10
Thanks to AK for sharing our paper! 🎉 Training a generative critic model to judge responses makes it BETTER at EVERYTHING. Sometimes the best policy comes from good judgment. Your critic model has been hiding its true potential 🌟 🚀 Introducing LLaVA-Critic-R1, a family of VLMs that serve as both critic and policy in a single model. No policy training. No in-domain task data. Just 40k preference pairs ("Is response A or B better?") for critic RL training! Result: +5.7% across 26 visual benchmarks, including visual understanding, reasoning, and even GUI agents, plus 71.9 on MMMU, a 7B-scale SoTA. Learn to judge, excel at everything 🎭 📄 Paper: huggingface.co/papers/2509.00… 💻 Code: github.com/LLaVA-VL/LLaVA…
AK @_akhaliq

LLaVA-Critic-R1: Your Critic Model is Secretly a Strong Policy Model

Chao Feng reposted
AK @_akhaliq
GPS as a Control Signal for Image Generation
Chao Feng reposted
seunghyun lee @seunghy23235
Please join us at poster #369 tomorrow afternoon @CVPR

@CVPR We introduce Cropper. Image cropping is the task of finding an aesthetically pleasing region within an image. VLMs as generalists often struggle with the precise, continuous coordinate output (as text) required for accurate crop-box prediction without further fine-tuning.

Chao Feng @chaof1234
Beyond 2D, we can lift a 3D model directly from our GPS-conditioned model via score distillation sampling, trained per landmark.
Chao Feng @chaof1234
Sharing our #CVPR2025 paper, "GPS as a Control Signal for Image Generation"! 🛰️+✍️ We turn the GPS tag stored in a photo's EXIF data into a control signal for diffusion models, so they don't just know what you asked for, but also where the image should look like it was taken. Come see our poster on Friday, 13 Jun, 10:30 a.m.–12:30 p.m. (CT) in ExHall D, Poster #250.
Chao Feng reposted
Ayush Shrivastava @ayshrv
Excited to share our CVPR 2025 paper on cross-modal space-time correspondence! We present a method to match pixels across different modalities (RGB-depth, RGB-thermal, photo-sketch, and cross-style images), trained entirely on unpaired data with self-supervision. Our approach learns correspondences through contrastive random walks across visual modalities. #CVPR2025 (1/6)
Chao Feng reposted
Jeongsoo Park @jespark0
Can AI image detectors keep up with new fakes? Mostly, no. Existing detectors are trained on a handful of generators, but there are thousands in the wild! Our work, Community Forensics, uses 4800+ generators to train detectors that generalize to new fakes. #CVPR2025 🧵 (1/5)
Chao Feng reposted
Yiming Dou @_YimingDou
Ever wondered how a scene sounds 👂 when you interact 👋 with it? Introducing our #CVPR2025 work "Hearing Hands: Generating Sounds from Physical Interactions in 3D Scenes" -- we make 3D scene reconstructions audibly interactive! yimingdou.com/hearing_hands/
Chao Feng reposted
Daniel Geng @dangengdg
Hello! If you like pretty images and videos and want a rec for a CVPR oral session, you should definitely go to Image/Video Gen, Friday at 9 a.m.: I'll be presenting "Motion Prompting", @RyanBurgert will be presenting "Go with the Flow", and @ChangPasca1650 will be presenting "LookingGlass".
Chao Feng reposted
tiange @tiangeluo
Will VLMs adhere strictly to their learned priors, unable to perform visual reasoning on content that has never existed on the Internet? We propose ViLP, a benchmark designed to probe the visual-language priors of VLMs by constructing Question-Image-Answer triplets that deliberately deviate from existing data. Check out our gallery at vilp-team.github.io & huggingface.co/datasets/ViLP/… To further enhance VLMs' reliance on visual information, we propose Image-DPO, elaborated in this thread. w/ @AngCao3 @GunheeLee @jcjohnss @honglaklee
Chao Feng reposted
Chris Rockwell @_crockwell
Ever wish YouTube had 3D labels? 🚀 Introducing 🎥 DynPose-100K 🎥, an Internet-scale collection of diverse videos annotated with camera pose! Applications include camera-controlled video generation 🤩 and learned dynamic pose estimation 😯 Download: huggingface.co/datasets/nvidi…
Chao Feng reposted
Furong Huang @furongh
🧠💡 What if your 7B model could beat GPT-4o and Qwen2.5-72B using just 11k training samples? No distillation. No warm start. Just smart data and reinforcement learning. Inspired by Moravec's Paradox, we let the model decide what's actually hard. 🚨 New paper: "SoTA with Less: MCTS-Guided Sample Selection for Data-Efficient Visual Reasoning Self-Improvement". We show how ThinkLite-VL-7B achieves SoTA on MathVista (75.1%), surpassing much larger models. 👇 Here's how we did it: 🔗 arxiv.org/abs/2504.07934 🧠 Code: github.com/si0wang/ThinkL… #AI #VisionLanguageModels #ReinforcementLearning #MachineLearning #LessIsMore