Yining Hong

@yining_hong

Los Angeles, CA Katılım Kasım 2019

181 Takip Edilen3.8K Takipçiler

Sabitlenmiş Tweet

Yining Hong@yining_hong·24 Kas

3D-LLM has reached 200 citations within one year of its acceptance🎉

AK@_akhaliq

3D-LLM: Injecting the 3D World into Large Language Models paper page: huggingface.co/papers/2307.12… Large language models (LLMs) and Vision-Language Models (VLMs) have been proven to excel at multiple tasks, such as commonsense reasoning. Powerful as these models can be, they are not grounded in the 3D physical world, which involves richer concepts such as spatial relationships, affordances, physics, layout, and so on. In this work, we propose to inject the 3D world into large language models and introduce a whole new family of 3D-LLMs. Specifically, 3D-LLMs can take 3D point clouds and their features as input and perform a diverse set of 3D-related tasks, including captioning, dense captioning, 3D question answering, task decomposition, 3D grounding, 3D-assisted dialog, navigation, and so on. Using three types of prompting mechanisms that we design, we are able to collect over 300k 3D-language data covering these tasks. To efficiently train 3D-LLMs, we first utilize a 3D feature extractor that obtains 3D features from rendered multi- view images. Then, we use 2D VLMs as our backbones to train our 3D-LLMs. By introducing a 3D localization mechanism, 3D-LLMs can better capture 3D spatial information. Experiments on ScanQA show that our model outperforms state-of-the-art baselines by a large margin (e.g., the BLEU-1 score surpasses state-of-the-art score by 9%). Furthermore, experiments on our held-in datasets for 3D captioning, task composition, and 3D-assisted dialogue show that our model outperforms 2D VLMs. Qualitative examples also show that our model could perform more tasks beyond the scope of existing LLMs and VLMs.

English

178

33.9K

Yining Hong@yining_hong·4 Nis

I wrote a blog "Three Levels of TTT" — Test-Time Training, Meta Training, World Models, 3D & Self-Supervised Learning: evelinehong.github.io/ttt_three_leve… The three levels are: 🧠 Episode — hippocampus encodes fast, neocortex consolidates slow. No labels needed. 🌱 Individual Lifetime — there is no train/test split. Every minute is testing as well as training. 🌍 Natural Selection & Evolution — continuous adaptation integrates into the species' prior. Each level is the meta-training of the level below. Each level is the test-time training of the level above. Priors flow down to the lower level; consolidated adaptations flow up to the next higher level. 🧠 The self-supervised signal needs no labels — it comes from the structure of experience itself: what did I expect vs. what happened? What follows what? What appears together? 🌱 This consolidates across a lifetime — every minute is testing as well as training. Given the priors of the human species, an infant develops 3D perception, object permanence, intuitive physics — not from instruction, but from reaching, crawling, acting. The world teaches the rest through self-supervised learning. 🌍 But what gives us those priors? Two front-facing eyes, exactly the right distance apart for depth to emerge. Pain and proprioception as free error signals. A face-detection circuit running at birth. Billions of years of test-time feedback from individual lives, accumulated and frozen into hardware. Evolution doesn't optimize behavior — it optimizes the prior you start from.

English

266

18.5K

Yining Hong@yining_hong·25 Şub

Submit papers to our FMEA Workshop!

Qineng Wang@qineng_wang

🚀 Announcing the 2nd Workshop on Foundation Models Meet Embodied Agents (FMEA) @ CVPR 2026! How can we leverage foundation models to help perceive, reason, plan, and act in the physical world? 👉 FMEA brings together researchers across vision, language, robotics, and ML to push the frontier of foundation models for embodied agents. 📣 Call for Papers is now open! We invite submissions on LLMs, VLMs, Video Action (VA), and Vision–Language–Action (VLA) models for embodied agents, including: - Long-horizon reasoning & planning - Spatial intelligence & physical understanding - World models, memory, and interaction - Vision–language–action learning and evaluation - Benchmarks, datasets, and evaluation protocols for embodied agents 🏆 Challenges @ FMEA 2026 🔹 ENACT — evaluating embodied cognition of VLMs with world modeling of egocentric interaction enact-embodied-cognition.github.io 🔹 EmbodiedBench — benchmarking VLM-based embodied agents across perception, reasoning, and action embodiedbench.github.io 🔹 Embodied Agent Interface (EAI) — evaluating LLM-based agents on goal interpretation, subgoal decomposition, action sequencing, and transition modeling embodied-agent-interface.github.io 📝 OpenReview submission portal (deadline: May 1st, 2026): openreview.net/group?id=thecv… 🌐 Workshop website: …models-meet-embodied-agents.github.io/cvpr2026/ 📍 Join us at CVPR 2026 — excited to see what you’ll build and submit!

English

4.2K

Yining Hong@yining_hong·25 Şub

Many thanks to coauthors: Raven Huang @ManlingLi_ , Fei-Fei Li, @jiajunwu_cs @YejinChoinka

English

600

Yining Hong@yining_hong·25 Şub

Does it actually work? Yes — and by a large margin. On our Long-Horizon Household Tasks (adapted from BEHAVIOR-1K): 📦 Fitting: 44.7% vs 10.6% for the strongest baseline 🥗 Selection: 32.4% vs 14.7% 🍳 Preparation: 31.7% vs 15.9% On our controlled MuJoCo Cupboard Fitting benchmark: ✅ 60.2% fit rate vs 43.7% — with LoRA-based test-time training updating less than 5% of parameters Ablations tell the real story: reflection-in-action (RIA) and reflection-on-action (ROA) are mutually dependent. RIA without ROA produces overconfident yet inaccurate scores with no hindsight correction. ROA without RIA wastes learning on poorly chosen actions. Together they form a virtuous cycle.

English

780

Yining Hong@yining_hong·25 Şub

Test-time errors aren't dead ends. They're training data — for both our world model and decision model. Yet embodied systems are stuck with: Fixed world models; Repeated mistakes; Zero growth. We broke the cycle by introducing Reflective Test-Time Planning — embodied agents that reflect like human reflective practitioners. 🧠 Reflection-in-Action: We engage in internal simulation, questioning whether our planned approach will actually work given what we currently understand — before committing to any action. 🔄 Reflection-on-Action: We use actual outcomes to reshape both our beliefs about the environment and our strategies for acting within it — updating the world model and decision policy in real time. ⏪ Retro-Reflection: Rewind past decisions with hindsight — so long-horizon failures get the credit assignment they deserve. 🌐 …flective-test-time-planning.github.io 💻 github.com/Reflective-Tes…

English

134

18.2K

Yining Hong@yining_hong·5 Şub

Multisensory systems at the key to true embodied intelligence. Submit papers to our “Sense of Space” Workshop!

Rao Fu@RaoFu79761158

🚨 Embodied AI still lacks a true “Sense of Space” — 🖐️touch, 💪 force, 🧠 proprioception, 🔊 audio, 👃 smell, ❤️ bio-signals, 🌡️ thermal. Submit & join the 1st Sense of Space Workshop @CVPR 2026 🏔️✨ 🔗 sense-of-space.github.io We tackle it from two directions: ⬆️ Bottom-up: 🧩 emergent new sensors & systems ⬇️ Top-down: 🤖🖐️ dexterous hand–object manipulation from multisensory inputs 📣 Call for Papers: proceedings + non-proceedings Accepted papers can apply for ✈️ travel grants + 🏆 sponsored paper awards. Powered by an amazing organizing team 💥🤝 @RaoFu79761158 @LiGuankfd2 @alex_kai2020 Kun He @tomhodan Ergys Ristani Jessica Yin @AntheaYLi @yining_hong Devin Murphy Ray Song @xyz2maureen @ericyi0124 Qi Ye @YunzhuLiYZ @haoshu_fang @ruoshi_liu Vatsal Mehta @LuoYiyue @MengyuLearner @wojmatusik

English

1.3K

Yining Hong@yining_hong·5 Şub

Best Paper Award: $200 honorarium. Spotlight Paper Awards: $100 honorarium each. Sponsored by: @Figure_robot

English

366

Yining Hong@yining_hong·3 Şub

LLMs are now learning space, geometry, and how to move. 🤖📐 The 2nd CVPR 3D-LLM VLA Workshop brings together language, 3D perception, and action for embodied intelligence. 📢 Call for Papers is OPEN: #tab-your-consoles" target="_blank" rel="nofollow noopener">openreview.net/group?id=thecv… 🌐 Website: 3d-llm-vla.github.io If your research lives at the intersection of words, worlds, and robots—this one’s for you. #CVPR2026 @CVPR