Yining Hong

55 posts

@yining_hong

💻Postdoc in CS AI @stanford | 🤖3D-LLMs | embodied world models | Test-time Training | Musician -🎸Multi-Instrumentalist & Composer | Metalhead 🤘🏼

Los Angeles, CA · Joined November 2019
181 Following · 3.8K Followers
Pinned Tweet
Yining Hong@yining_hong·
3D-LLM has reached 200 citations within one year of its acceptance🎉
AK@_akhaliq

3D-LLM: Injecting the 3D World into Large Language Models
paper page: huggingface.co/papers/2307.12…
Large language models (LLMs) and Vision-Language Models (VLMs) have been proven to excel at multiple tasks, such as commonsense reasoning. Powerful as these models can be, they are not grounded in the 3D physical world, which involves richer concepts such as spatial relationships, affordances, physics, layout, and so on. In this work, we propose to inject the 3D world into large language models and introduce a whole new family of 3D-LLMs. Specifically, 3D-LLMs can take 3D point clouds and their features as input and perform a diverse set of 3D-related tasks, including captioning, dense captioning, 3D question answering, task decomposition, 3D grounding, 3D-assisted dialog, navigation, and so on. Using three types of prompting mechanisms that we design, we are able to collect over 300k 3D-language data covering these tasks. To efficiently train 3D-LLMs, we first utilize a 3D feature extractor that obtains 3D features from rendered multi-view images. Then, we use 2D VLMs as our backbones to train our 3D-LLMs. By introducing a 3D localization mechanism, 3D-LLMs can better capture 3D spatial information. Experiments on ScanQA show that our model outperforms state-of-the-art baselines by a large margin (e.g., the BLEU-1 score surpasses state-of-the-art score by 9%). Furthermore, experiments on our held-in datasets for 3D captioning, task composition, and 3D-assisted dialogue show that our model outperforms 2D VLMs. Qualitative examples also show that our model could perform more tasks beyond the scope of existing LLMs and VLMs.
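Below is a minimal, self-contained sketch of the pipeline the abstract describes: point features lifted from multi-view 2D encoders, a learned embedding of 3D coordinates standing in for the localization mechanism, and a generic transformer standing in for the 2D VLM backbone. All module names, sizes, and the toy forward pass are my own placeholders, not the released 3D-LLM code.

```python
# Toy sketch (assumed structure, not the authors' implementation) of the 3D-LLM idea:
# aggregated per-point features + a 3D position embedding, fed jointly with text tokens
# into a transformer that stands in for a 2D VLM backbone.
import torch
import torch.nn as nn

class Toy3DLLM(nn.Module):
    def __init__(self, feat_dim=256, d_model=512, vocab=1000):
        super().__init__()
        self.point_proj = nn.Linear(feat_dim, d_model)   # project 3D features lifted from multi-view 2D encoders
        self.pos_embed = nn.Linear(3, d_model)           # stand-in for the 3D localization mechanism: embed xyz
        self.text_embed = nn.Embedding(vocab, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)  # stand-in for the VLM backbone
        self.lm_head = nn.Linear(d_model, vocab)

    def forward(self, point_feats, point_xyz, text_ids):
        # point_feats: (B, N, feat_dim), point_xyz: (B, N, 3), text_ids: (B, T)
        pts = self.point_proj(point_feats) + self.pos_embed(point_xyz)
        txt = self.text_embed(text_ids)
        h = self.backbone(torch.cat([pts, txt], dim=1))  # joint 3D + language sequence
        return self.lm_head(h[:, pts.size(1):])          # predictions over the text positions

model = Toy3DLLM()
logits = model(torch.randn(2, 64, 256), torch.rand(2, 64, 3), torch.randint(0, 1000, (2, 16)))
print(logits.shape)  # torch.Size([2, 16, 1000])
```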

7
9
178
33.9K
Yining Hong@yining_hong·
I wrote a blog "Three Levels of TTT" — Test-Time Training, Meta Training, World Models, 3D & Self-Supervised Learning: evelinehong.github.io/ttt_three_leve…

The three levels are:
🧠 Episode — hippocampus encodes fast, neocortex consolidates slow. No labels needed.
🌱 Individual Lifetime — there is no train/test split. Every minute is testing as well as training.
🌍 Natural Selection & Evolution — continuous adaptation integrates into the species' prior.

Each level is the meta-training of the level below. Each level is the test-time training of the level above. Priors flow down to the lower level; consolidated adaptations flow up to the next higher level.

🧠 The self-supervised signal needs no labels — it comes from the structure of experience itself: what did I expect vs. what happened? What follows what? What appears together?

🌱 This consolidates across a lifetime — every minute is testing as well as training. Given the priors of the human species, an infant develops 3D perception, object permanence, intuitive physics — not from instruction, but from reaching, crawling, acting. The world teaches the rest through self-supervised learning.

🌍 But what gives us those priors? Two front-facing eyes, exactly the right distance apart for depth to emerge. Pain and proprioception as free error signals. A face-detection circuit running at birth. Billions of years of test-time feedback from individual lives, accumulated and frozen into hardware. Evolution doesn't optimize behavior — it optimizes the prior you start from.
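As a heavily simplified illustration of the episode-level signal described above, the sketch below runs a tiny test-time training loop: predict what comes next, compare with what actually happened, and take a gradient step on that error, with no labels involved. The predictor, learning rate, and synthetic observation stream are placeholders I invented, not anything from the blog.

```python
# Minimal test-time training sketch (my illustration): the model is updated online
# from its own prediction error on the stream of observations, with no labels.
import torch
import torch.nn as nn

predictor = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 8))  # predicts the next observation
opt = torch.optim.SGD(predictor.parameters(), lr=1e-2)

def observation_stream(steps=100):
    """Stand-in for experience: a drifting signal the agent has never seen before."""
    x = torch.zeros(8)
    for _ in range(steps):
        x = x + 0.1 * torch.randn(8)
        yield x

prev = None
for obs in observation_stream():
    if prev is not None:
        pred = predictor(prev)                 # what did I expect?
        loss = (pred - obs).pow(2).mean()      # vs. what actually happened (no labels needed)
        opt.zero_grad()
        loss.backward()
        opt.step()                             # test-time training step
    prev = obs.detach()
```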
Yining Hong tweet media
7
37
266
18.5K
Yining Hong@yining_hong·
Submit papers to our FMEA Workshop!
Qineng Wang@qineng_wang

🚀 Announcing the 2nd Workshop on Foundation Models Meet Embodied Agents (FMEA) @ CVPR 2026!

How can we leverage foundation models to help perceive, reason, plan, and act in the physical world?
👉 FMEA brings together researchers across vision, language, robotics, and ML to push the frontier of foundation models for embodied agents.

📣 Call for Papers is now open! We invite submissions on LLMs, VLMs, Video Action (VA), and Vision–Language–Action (VLA) models for embodied agents, including:
- Long-horizon reasoning & planning
- Spatial intelligence & physical understanding
- World models, memory, and interaction
- Vision–language–action learning and evaluation
- Benchmarks, datasets, and evaluation protocols for embodied agents

🏆 Challenges @ FMEA 2026
🔹 ENACT — evaluating embodied cognition of VLMs with world modeling of egocentric interaction: enact-embodied-cognition.github.io
🔹 EmbodiedBench — benchmarking VLM-based embodied agents across perception, reasoning, and action: embodiedbench.github.io
🔹 Embodied Agent Interface (EAI) — evaluating LLM-based agents on goal interpretation, subgoal decomposition, action sequencing, and transition modeling: embodied-agent-interface.github.io

📝 OpenReview submission portal (deadline: May 1st, 2026): openreview.net/group?id=thecv…
🌐 Workshop website: …models-meet-embodied-agents.github.io/cvpr2026/
📍 Join us at CVPR 2026 — excited to see what you'll build and submit!

0
0
18
4.2K
Yining Hong@yining_hong·
Does it actually work? Yes — and by a large margin.

On our Long-Horizon Household Tasks (adapted from BEHAVIOR-1K):
📦 Fitting: 44.7% vs 10.6% for the strongest baseline
🥗 Selection: 32.4% vs 14.7%
🍳 Preparation: 31.7% vs 15.9%

On our controlled MuJoCo Cupboard Fitting benchmark:
✅ 60.2% fit rate vs 43.7% — with LoRA-based test-time training updating less than 5% of parameters

Ablations tell the real story: reflection-in-action (RIA) and reflection-on-action (ROA) are mutually dependent. RIA without ROA produces overconfident yet inaccurate scores with no hindsight correction. ROA without RIA wastes learning on poorly chosen actions. Together they form a virtuous cycle.
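To make the LoRA point concrete, here is a hand-rolled sketch of LoRA-style test-time updates: the base weight stays frozen and only the low-rank A/B adapters receive gradients, which is what keeps the trainable fraction small. The layer size, rank, and squared-error objective are illustrative choices of mine, not the paper's actual configuration.

```python
# Hand-rolled LoRA layer (assumed setup, not the paper's code): frozen base weight,
# trainable low-rank adapters, one self-supervised test-time update step.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, in_f, out_f, r=4):
        super().__init__()
        self.base = nn.Linear(in_f, out_f)
        self.base.weight.requires_grad_(False)   # frozen "pretrained" weight
        self.base.bias.requires_grad_(False)
        self.A = nn.Parameter(torch.randn(r, in_f) * 0.01)   # trainable low-rank factors
        self.B = nn.Parameter(torch.zeros(out_f, r))

    def forward(self, x):
        return self.base(x) + x @ self.A.T @ self.B.T        # base output + low-rank update

layer = LoRALinear(512, 512)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable fraction: {trainable / total:.1%}")        # about 1.5% with these sizes

# One test-time step on a placeholder error signal, touching only A and B.
opt = torch.optim.Adam([p for p in layer.parameters() if p.requires_grad], lr=1e-3)
x, target = torch.randn(16, 512), torch.randn(16, 512)
loss = (layer(x) - target).pow(2).mean()
opt.zero_grad(); loss.backward(); opt.step()
```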
1
0
2
780
Yining Hong@yining_hong·
Test-time errors aren't dead ends. They're training data — for both our world model and decision model.

Yet embodied systems are stuck with: Fixed world models; Repeated mistakes; Zero growth.

We broke the cycle by introducing Reflective Test-Time Planning — embodied agents that reflect like human reflective practitioners.

🧠 Reflection-in-Action: We engage in internal simulation, questioning whether our planned approach will actually work given what we currently understand — before committing to any action.

🔄 Reflection-on-Action: We use actual outcomes to reshape both our beliefs about the environment and our strategies for acting within it — updating the world model and decision policy in real time.

⏪ Retro-Reflection: Rewind past decisions with hindsight — so long-horizon failures get the credit assignment they deserve.

🌐 …flective-test-time-planning.github.io
💻 github.com/Reflective-Tes…
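A rough sketch of how the three mechanisms could fit into one loop, based only on my reading of the thread and not the released code: reflection-in-action scores candidate actions with the current world model before committing, reflection-on-action fits the world model to the real outcome, and retro-reflection replays stored transitions as a crude stand-in for hindsight credit assignment. The toy environment, goal, and scoring rule are all hypothetical.

```python
# Toy reflective test-time planning loop (illustrative only, not the paper's code).
import torch
import torch.nn as nn

world_model = nn.Linear(4 + 1, 4)              # predicts next state from (state, action)
opt = torch.optim.Adam(world_model.parameters(), lr=1e-2)
goal = torch.ones(4)
replay = []                                     # memory used by retro-reflection

def env_step(state, action):
    """Toy dynamics the agent has to discover at test time."""
    return state + 0.5 * action * torch.ones(4)

state = torch.zeros(4)
for t in range(20):
    # Reflection-in-action: imagine each candidate action with the world model,
    # pick the one whose predicted outcome looks closest to the goal.
    candidates = [torch.tensor([a]) for a in (-1.0, 0.0, 1.0)]
    scores = [-(world_model(torch.cat([state, a])) - goal).pow(2).sum() for a in candidates]
    action = candidates[int(torch.stack(scores).argmax())]

    next_state = env_step(state, action)
    replay.append((state, action, next_state))

    # Reflection-on-action: fit the world model to what actually happened.
    loss = (world_model(torch.cat([state, action])) - next_state).pow(2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

    # Retro-reflection: periodically revisit old transitions so earlier decisions
    # still shape the model (a crude stand-in for hindsight credit assignment).
    if t % 5 == 4:
        for s, a, ns in replay:
            loss = (world_model(torch.cat([s, a])) - ns).pow(2).mean()
            opt.zero_grad(); loss.backward(); opt.step()

    state = next_state.detach()
```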
Yining Hong tweet media
3
20
134
18.2K
Yining Hong@yining_hong·
Multisensory systems are the key to true embodied intelligence. Submit papers to our “Sense of Space” Workshop!
Rao Fu@RaoFu79761158

🚨 Embodied AI still lacks a true “Sense of Space” — 🖐️ touch, 💪 force, 🧠 proprioception, 🔊 audio, 👃 smell, ❤️ bio-signals, 🌡️ thermal.

Submit & join the 1st Sense of Space Workshop @CVPR 2026 🏔️✨
🔗 sense-of-space.github.io

We tackle it from two directions:
⬆️ Bottom-up: 🧩 emergent new sensors & systems
⬇️ Top-down: 🤖🖐️ dexterous hand–object manipulation from multisensory inputs

📣 Call for Papers: proceedings + non-proceedings
Accepted papers can apply for ✈️ travel grants + 🏆 sponsored paper awards.

Powered by an amazing organizing team 💥🤝 @RaoFu79761158 @LiGuankfd2 @alex_kai2020 Kun He @tomhodan Ergys Ristani Jessica Yin @AntheaYLi @yining_hong Devin Murphy Ray Song @xyz2maureen @ericyi0124 Qi Ye @YunzhuLiYZ @haoshu_fang @ruoshi_liu Vatsal Mehta @LuoYiyue @MengyuLearner @wojmatusik

0
0
6
1.3K
Yining Hong@yining_hong·
Best Paper Award: $200 honorarium. Spotlight Paper Awards: $100 honorarium each. Sponsored by: @Figure_robot
0
0
0
366
Yining Hong@yining_hong·
LLMs are now learning space, geometry, and how to move. 🤖📐

The 2nd CVPR 3D-LLM VLA Workshop brings together language, 3D perception, and action for embodied intelligence.

📢 Call for Papers is OPEN: openreview.net/group?id=thecv…
🌐 Website: 3d-llm-vla.github.io

If your research lives at the intersection of words, worlds, and robots—this one’s for you. #CVPR2026 @CVPR
Yining Hong tweet media
1
20
145
16K
Ahoo@Ahoo_Ahuu·
@yining_hong Just today i had this very idea with somewhat different vision but look at this😍 amazing work.
1
0
0
101
Yining Hong@yining_hong·
Meet Embodied Web Agents that bridge the physical and digital realms. Imagine embodied agents that can search for online recipes, shop for ingredients, and cook for you. Embodied web agents search the internet for the information needed to carry out real-world embodied tasks.

All data, code, and web environments are available at embodied-web-agent.github.io
Paper link: arxiv.org/abs/2506.15677
4
36
168
53.5K
redJ@sudoredj·
@yining_hong you're so cool this is creative af
1
0
1
101
Sarthak@kaytraser·
@yining_hong This paper looks awesome! Would love to know what you used to generate these outdoor environments
1
0
0
917