Ying Shan

1.5K posts

@yshan2u

Distinguished Scientist @TencentGlobal, Founder of PCG ARC Lab. Formerly @Microsoft, @MSFTResearch. Views are my own.

Joined June 2014
799 Following · 2.3K Followers

Pinned Tweet
Ying Shan @yshan2u ·
🚀🚀We’re building a new Applied Research Team in Tencent IEG for Game AI, with a research culture similar to ARC Lab. This newly formed team focuses on research-driven Game AI, operating at the intersection of fundamental research and large-scale game environments. Our goal is to develop principled models that can understand, simulate, and act within complex virtual worlds—while remaining grounded enough to eventually shape real games.

Our research directions include (but are not limited to):
🎮 Interactive & Dynamic World Modeling — learning, simulating, and reasoning about evolving game worlds
🤖 NPC World-to-Action Modeling — connecting world understanding to decision and action, with strong ties to Embodied AI and agent behavior
🌍 Game Scene Generation — generative modeling of diverse, controllable, and scalable game scenes

We are looking for researchers with the following minimum qualifications:
✨ A recent Ph.D. in related fields
✨ 5+ top conference or journal papers
✨ 1000+ GitHub stars
🌟 Evidence of a “make it work” mindset

We are also open to strong graduate students for intern positions. Feel free to DM me or contact: hanswen@tencent.com.
6 replies · 12 reposts · 152 likes · 12.8K views
Ying Shan @yshan2u ·
Socrates famously warned that writing would weaken memory — "people will not exercise their memories." This is likely happening with AI: massive cultural evolution of mankind at the cost of the biological devolution of the individual.
AI Highlight @AIHighlight

🚨BREAKING: Researchers just confirmed something the AI industry does not want you to know. AI is making professionals worse at their jobs when the AI is not available. Not slower. Not less confident. Measurably worse.

A study published in The Lancet Gastroenterology and Hepatology tracked doctors performing colonoscopies across four hospitals in Poland after AI assistance was introduced into the procedure. Then the researchers measured what happened when the doctors performed the same procedure without AI help. Adenoma detection rates dropped from 28.4% to 22.4%. A six-point absolute decline. The AI was not present. The doctors were. But continuous reliance on the AI had eroded the observational skill the procedure requires. Real patients with real polyps were missed because the doctors had stopped practicing the part of their job the AI had been doing.

This is not an isolated finding. Researchers at Microsoft and Carnegie Mellon University surveyed 319 knowledge workers and presented the results at CHI 2025, the premier academic conference on human-computer interaction. Workers with higher confidence in AI tools reported lower confidence in their own critical thinking. The pattern was consistent: the more someone relied on AI to produce outputs, the less cognitive effort they reported applying to the work itself. A separate study from SBS Swiss Business School published in January 2025 surveyed users across age groups and found a statistically significant negative correlation between AI usage frequency and critical thinking scores. Younger users were more affected than older ones. The MIT Media Lab reached the same conclusion in a study on cognitive atrophy. A study published in October 2025 in Computers in Human Behavior found that AI use makes people overestimate their own cognitive performance. They get smarter outputs and dumber self-awareness simultaneously.

The mechanism has a name: cognitive offloading. The brain stops practicing tasks it has delegated to a system. Active skills become passive ones. The AI performs the task. The human approves the output. Over time the human loses the ability to perform the task without the AI.

The Lancet study made this visible because the stakes were measurable. A doctor either finds the polyp or does not. But the same dynamic is happening across every professional field where AI has taken over routine cognitive work. UX designers reported it for prototyping and bias detection. Cybersecurity analysts reported it for threat reasoning. Knowledge workers reported it for analysis and synthesis.

The implication is structural. Entry-level roles historically existed not just to produce output but to develop judgment. The junior analyst ran the numbers because doing so taught them what the numbers meant. The junior associate drafted the brief because doing so taught them how arguments are constructed. AI is absorbing those tasks at exactly the point where the next generation of professionals would normally be building the skills they need at the senior level. There is a direct line between the Lancet study and the Anthropic finding that young worker hiring in AI-exposed fields has dropped 14%. The tasks are not being practiced. The judgment is not being developed.

The researchers are not arguing against AI. They are documenting a specific harm that does not show up in any productivity metric. The output looks better. The human producing it has gotten worse.

If you have been using AI for the work you used to do yourself, the studies suggest you are not just saving time. You are losing the ability to do that work without it.

Sources:
The Lancet Gastroenterology and Hepatology, 2025. PDF: thelancet.com/journals/langa…
Microsoft and Carnegie Mellon University, CHI 2025. PDF: microsoft.com/en-us/research…
SBS Swiss Business School, Societies Journal, January 2025. PDF: mdpi.com/2075-4698/15/1…
Computers in Human Behavior, October 2025. DOI: doi.org/10.1016/j.chb.…

0 replies · 0 reposts · 0 likes · 165 views
Ying Shan @yshan2u ·
There is only one reality to map. If the models do it right, they all converge on the same underlying structure.
How To AI @HowToAI_

MIT proved every major AI model is secretly converging on the same "brain." It’s called the “platonic representation hypothesis,” and it’s one of the most mind-blowing papers you’ll ever read.

You train a vision model purely on images. You train a language model purely on text. They use completely different architectures. They process completely different data. They should have completely different "brains." But as these models scale up, something impossible is happening. When researchers measure how they organize information, the mathematical geometry is identical. A model that only "sees" images and a model that only "reads" text are measuring the distance between concepts in the exact same way. The models are converging.

The researchers named this after Plato’s Allegory of the Cave. Plato believed that everything we experience is just a shadow of a deeper, hidden, perfect reality. The paper argues that AI models are doing the exact same thing. They are looking at the different "shadows" of human data: text, images, audio. And they are independently discovering the exact same underlying structure of the universe to make sense of it.

It doesn't matter what company built the AI. It doesn't matter what data it was trained on. As models get larger, they stop memorizing their specific tasks. They are forced to build a statistical model of reality itself. And there is only one reality to map.

arXiv, 2024
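One way to make "measuring the distance between concepts in the exact same way" concrete is a mutual k-nearest-neighbor alignment score: embed the same paired concepts with two unrelated models and check how much each concept's neighborhoods agree across the two spaces. The snippet below is only an illustration of that kind of metric, not the paper's released code; the array names, sizes, and choice of cosine similarity are assumptions made for the sketch.

import numpy as np

def knn_sets(X, k):
    # indices of each row's k nearest neighbors under cosine similarity, excluding itself
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    sim = Xn @ Xn.T
    np.fill_diagonal(sim, -np.inf)          # never count a point as its own neighbor
    return [set(row) for row in np.argsort(-sim, axis=1)[:, :k]]

def mutual_knn_alignment(A, B, k=10):
    # average neighborhood overlap: ~k/N for unrelated spaces, 1.0 for identical geometry
    return float(np.mean([len(a & b) / k for a, b in zip(knn_sets(A, k), knn_sets(B, k))]))

# hypothetical usage: row i of both arrays embeds the same concept
vision_emb = np.random.randn(500, 768)      # stand-in for image embeddings of 500 concepts
text_emb = np.random.randn(500, 1024)       # stand-in for caption embeddings of the same concepts
print(mutual_knn_alignment(vision_emb, text_emb))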

0 replies · 0 reposts · 1 like · 241 views
Ying Shan retweeted
el.cine @EHuanglu ·
the Earth doesn’t belong only to humans, but humans can be their “gods”
63 replies · 1.1K reposts · 5.5K likes · 855.9K views
Ying Shan retweeted
Tengfei Wang @DylanTFWang ·
⚡️HY-World 1.5 is Faster, Lighter, and more OPEN than ever. ⚡️ We just released *a few major updates*.
✅ Open Training Code
✅ Accelerated Inference
✅ New 5B Model (Low VRAM friendly)
✅ No Waitlist for Online Try
Code: github.com/Tencent-Hunyua…
Online Try: 3d.hunyuan.tencent.com/sceneTo3D
11 replies · 34 reposts · 279 likes · 16K views
Ying Shan retweeted
Xinggang Wang @XinggangWang ·
Thanks to Bo for the great MoDA breakdown! We believe MoDA can help drive next-gen LLM architectures. It’s a general framework with strong potential in multimodal modeling and vision—feel free to explore our Triton code: github.com/hustvl/MoDA
Bo Wang @BoWang87

Sharing another very cool paper from my friend @XinggangWang. It goes after one of the most fundamental assumptions in Transformers: residual connections.

The core issue is simple: as Transformers get deeper, early-layer signals get washed out. Every residual update is added with roughly equal weight, so features formed in shallow layers gradually get diluted. By the time you are 100 layers deep, a lot of that useful early information is barely preserved.

MoDA’s idea is elegant: let attention operate not just across the sequence, but across depth too. So instead of each head only attending over tokens, it also attends to KV pairs from previous layers at the same position. In other words, the model can look back not only across context, but also across its own intermediate representations — all in one unified attention operation.

What makes this even better is that the engineering is serious too:
-- the fused Triton kernel reaches 97.3% of FlashAttention-2 efficiency at 64K context with only 3.7% FLOPs overhead
-- it works even better with post-norm than pre-norm, and also reduces attention-sink behavior as a nice side effect

And the results are strong: at 1.5B scale, MoDA gets a +2.11% average improvement across 10 downstream tasks, and -0.2 perplexity across 10 benchmarks vs OLMo2.

For a long time, depth has been the relatively underused scaling axis. People talk about data scale, model width, and context length. Much less about how to make depth actually compound. MoDA makes a very compelling case that depth still has a lot to give — if the architecture can truly preserve and reuse what earlier layers learned.

Triton code is open: github.com/hustvl/MoDA
Paper: arxiv.org/abs/2603.15619
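To make the "attend across depth" idea concrete, here is a rough sketch of a single unified attention step that also looks at the K/V produced for the same position by earlier layers. This is only my reading of the description above, not the released hustvl/MoDA Triton kernels; it is single-head, omits causal masking, and every name and shape is an assumption for illustration.

import torch
import torch.nn.functional as F

def depth_aware_attention(q, k, v, k_hist, v_hist):
    # q, k, v:        (B, T, D)    current layer's projections
    # k_hist, v_hist: (B, L, T, D) K/V cached from the L previous layers
    B, T, D = q.shape
    seq_scores = q @ k.transpose(-1, -2) / D ** 0.5                # (B, T, T) attention over the sequence
    depth_k = k_hist.permute(0, 2, 1, 3)                           # (B, T, L, D)
    depth_scores = (q.unsqueeze(2) * depth_k).sum(-1) / D ** 0.5   # (B, T, L) same position, earlier layers
    attn = F.softmax(torch.cat([seq_scores, depth_scores], -1), dim=-1)  # one softmax over both key sets
    seq_out = attn[..., :T] @ v                                    # (B, T, D)
    depth_v = v_hist.permute(0, 2, 1, 3)                           # (B, T, L, D)
    depth_out = (attn[..., T:].unsqueeze(-1) * depth_v).sum(2)     # (B, T, D)
    return seq_out + depth_out

# toy shapes: 2 earlier layers cached, 8 tokens, width 64
B, L, T, D = 1, 2, 8, 64
out = depth_aware_attention(torch.randn(B, T, D), torch.randn(B, T, D), torch.randn(B, T, D),
                            torch.randn(B, L, T, D), torch.randn(B, L, T, D))
print(out.shape)  # torch.Size([1, 8, 64])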

0 replies · 7 reposts · 32 likes · 4.1K views
Ying Shan retweeted
Yanpei Cao @yanpei_cao ·
Generative 3D has been stuck in a representational compromise. When you serialize spatial data to fit standard architectures, you force a unidirectional causal bias. It fundamentally breaks the symmetry of 3D geometry and limits how the model understands global context. With Tripo P1.0, we stepped back to fix the representation. Instead of flattening 3D into sequences, P1.0 operates entirely within a native spatial probability space. It reasons about the entire structure at once. That’s why the topology actually makes sense. We aren't running a post-process remesher to clean up artifacts. When you directly model the structured surface manifold, clean edge flows and structural coherence naturally emerge from the noise. This is what happens when the architecture finally respects the medium. And we are doing it in 2 seconds.
11 replies · 9 reposts · 165 likes · 64.7K views
Ying Shan retweeted
AK @_akhaliq ·
CubeComposer: Spatio-Temporal Autoregressive 4K 360° Video Generation from Perspective Video
paper: huggingface.co/papers/2603.04…
1 reply · 6 reposts · 24 likes · 7.9K views
Ying Shan retweeted
Hong-Xing (Koven) Yu @Koven_Yu ·
🤩Video world models are cool, but it is cooler if they can simulate any 3D physical actions in real time! Introducing RealWonder⚡️: Now you can simulate 3D physical action (robot actions, 3D forces, force fields, etc.) consequences from a single image in real time! 🧵1/6
7 replies · 45 reposts · 272 likes · 28.7K views
Ying Shan retweeted
Tencent Hy @TencentHunyuan ·
One static model does not fit all😭 We just dropped our latest work: Functional Neural Memory. Instead of static models, we generate custom "parameters" for every single input.
✅ Prompt your model anytime
✅ Instant personalization
✅ Better instruction following
✅ Flexible & dynamic memory (w/o memory bank✌️)
(🧵1/6)
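Reading "we generate custom 'parameters' for every single input" as a hypernetwork-style layer, a minimal sketch might look like the following. This is purely my illustration of that general idea; the class name, sizes, and the choice of a single generated linear layer are assumptions, not the actual Functional Neural Memory design.

import torch
import torch.nn as nn

class PerInputLinear(nn.Module):
    # a tiny generator emits a fresh weight matrix and bias for each input,
    # then those generated parameters are applied to that same input
    def __init__(self, d_in, d_out):
        super().__init__()
        self.d_in, self.d_out = d_in, d_out
        self.gen = nn.Linear(d_in, d_out * d_in + d_out)

    def forward(self, x):                                    # x: (B, d_in)
        params = self.gen(x)                                 # (B, d_out*d_in + d_out)
        W = params[:, : self.d_out * self.d_in].view(-1, self.d_out, self.d_in)
        b = params[:, self.d_out * self.d_in:]               # (B, d_out)
        return torch.bmm(W, x.unsqueeze(-1)).squeeze(-1) + b

layer = PerInputLinear(d_in=32, d_out=16)
print(layer(torch.randn(4, 32)).shape)                       # torch.Size([4, 16])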
11 replies · 138 reposts · 337 likes · 71.2K views
Ying Shan retweeted
alphaXiv @askalphaxiv ·
Yann LeCun 🤝 Saining Xie. An insane crossover of the 2 biggest visual representation researchers in the AI field: “Beyond Language Modeling: An Exploration of Multimodal Pretraining”

Right now, most multimodal models are basically a language model with a vision adapter bolted on, so they can describe images, but they don’t really think in images or video. This paper shows what happens when you do it the hard way: train one model from scratch on text, images, and video with a unified setup.

The key idea is that if you give the model a good visual internal format, it can use vision for both understanding and generation. Additionally, multimodal data can improve language instead of distracting from it, and mixture-of-experts lets you scale vision’s huge data intake without bloating everything else.

This paves the way toward changing the vision paradigm from a “captioning add-on” model to a native multimodal foundation model.
20 replies · 147 reposts · 941 likes · 79.6K views
Ying Shan retweeted
Ziwei Liu @liuziwei7 ·
🚫 No Vision Encoder (VE)
🚫 No Variational Autoencoder (VAE)
✅ Just one end-to-end model directly engages with native signals, pixels and words, for both understanding and generation.

💊NEO-unify💊 is the first step toward **truly end-to-end unified models**, learning directly from near-lossless inputs via a representation space shaped by the model itself.
11 replies · 84 reposts · 590 likes · 72.1K views
Ying Shan @yshan2u ·
To me, AGI is reached when, for every task a human is willing to delegate, the AI’s outcome is judged by that human to be as good as or better than a human’s. Through the lens of this "outsourcing test", AGI is about the moment humans (including Einstein) stop preferring humans for the tasks that matter.
0 replies · 0 reposts · 0 likes · 279 views
Rohan Paul @rohanpaul_ai ·
Demis Hassabis’s “Einstein test” for defining AGI: Train a model on all human knowledge but cut it off at 1911, then see if it can independently discover general relativity (as Einstein did by 1915); if yes, it’s AGI.
660 replies · 813 reposts · 11.8K likes · 2.2M views
Ying Shan retweeted
Mike Shou @MikeShou1 ·
Thanks for posting! We found that:
1/ “Almost” is a goldmine. A subtle delta in a World Model can lead to divergent outcomes in reality. By training on near-success trajectories, the world model becomes more sensitive to subtle action differences.
2/ By rolling out the VLA policy in the real environment, we create a continuous WM-VLA-Env loop. Real-world feedback doesn't just improve the agent—it builds a more "physically grounded" world model.
Boris Belousov @_bbelousov

"a closed-loop paradigm that jointly optimizes the world model and the VLA policy to iteratively enhance the performance and grounding of both"

1 reply · 10 reposts · 76 likes · 9.7K views
Ying Shan retweeted
Xihui Liu @XihuiLiu ·
🚀 Excited to share that we are organizing the 1st Workshop on Video World Models: Interaction, Memory, and Efficiency at CVPR 2026 in Denver! 🌍

From Sora to Genie 3, video generation has made remarkable progress. But building a true Video World Model — one you can interact with, that maintains long-term memory, and runs efficiently in real time — remains a grand challenge. This workshop brings together researchers from academia and industry to tackle these open problems.

🎤 We have an outstanding speaker lineup:
Andrea Vedaldi (Oxford)
Jack Parker-Holder (Google DeepMind, Genie series)
Yilun Du (Harvard)
Song Han (MIT)
Sherry Yang (NYU / Google DeepMind)
Xingang Pan (NTU)

🎓 Organized by researchers from Stanford, Oxford, HKU, NUS, and NVIDIA.

🔬 Topics of interest include: interactive video generation, long-term memory and consistency, efficient real-time generation, and applications in robotics, autonomous driving, gaming, and more.

📝 Call for papers is now open with two tracks:
Proceedings Track — Deadline: March 1, 2026
Non-Proceedings Track — Deadline: April 14, 2026 (work-in-progress and recently published work welcome!)

🔗 Workshop page: videoworldmodel-workshop.github.io
🙏 Would love to see your submissions! Reposts greatly appreciated.
2 replies · 12 reposts · 79 likes · 9.2K views