Yining Ye(叶奕宁)

165 posts

@Yining_Ye

Working on LLM/VLM Tool Learning and Reasoning at Tsinghua and Bytedance, reading at least one paper a day — The future will not invent itself.

Tsinghua University · Joined July 2022
242 Following · 358 Followers
Pinned Tweet
Yining Ye(叶奕宁) @Yining_Ye ·
Maybe the most powerful GUI agent in the world right now, from 7B dense to A20B MoE models. 💻 After our first demo at ICLR, we've pushed the boundaries with unified RL across diverse devices and scenarios.

Fun facts:
1. Many "accepted" research ideas failed at scale, while unexpected ones shined. ✨
2. Better benchmarks don't always mean better user feedback!

We have too many findings and not enough time to write papers. Come work with us at SEED! P.S. We're passionate about open-sourcing, but this isn't a lab project. 😉
Yujia Qin @TsingYoga

We can finally share UI-TARS-2 🥳🥳 — a native GUI agent trained with multi-turn agent RL ⚡️⚡️ Key highlights (all-in-one model!):
💻 Computer Use: 47.5 OSWorld · 50.6 WindowsAgentArena
📱 Phone Use: 73.3 AndroidWorld
🛜 Browser Use: 88.2% Online-Mind2Web
🎮 Gameplay: ~60% of human on 15 titles · strong on LMGame-Bench
🧑‍💻 Terminal Use: 68.7 SWE-Bench · 45.3 TerminalBench
🔨 Tool Use: 29.6 BrowseComp
Hybrid flows: GUI clicks + terminal cmds + API calls in one trace
Paper: arxiv.org/abs/2509.02544
Demo: seed-tars.com/showcase/ui-ta…

3 replies · 11 retweets · 78 likes · 7.9K views
Yining Ye(叶奕宁) @Yining_Ye ·
能工制人 (roughly: "the skilled craftsman subdues the man")
jack @jack

we're making @blocks smaller today. here's my note to the company.

today we're making one of the hardest decisions in the history of our company: we're reducing our organization by nearly half, from over 10,000 people to just under 6,000. that means over 4,000 of you are being asked to leave or entering into consultation. i'll be straight about what's happening, why, and what it means for everyone.

first off, if you're one of the people affected, you'll receive your salary for 20 weeks + 1 week per year of tenure, equity vested through the end of may, 6 months of health care, your corporate devices, and $5,000 to put toward whatever you need to help you in this transition (if you're outside the U.S. you'll receive similar support but exact details are going to vary based on local requirements). i want you to know that before anything else. everyone will be notified today, whether you're being asked to leave, entering consultation, or asked to stay.

we're not making this decision because we're in trouble. our business is strong. gross profit continues to grow, we continue to serve more and more customers, and profitability is improving. but something has changed. we're already seeing that the intelligence tools we're creating and using, paired with smaller and flatter teams, are enabling a new way of working which fundamentally changes what it means to build and run a company. and that's accelerating rapidly.

i had two options: cut gradually over months or years as this shift plays out, or be honest about where we are and act on it now. i chose the latter. repeated rounds of cuts are destructive to morale, to focus, and to the trust that customers and shareholders place in our ability to lead. i'd rather take a hard, clear action now and build from a position we believe in than manage a slow reduction of people toward the same outcome. a smaller company also gives us the space to grow our business the right way, on our own terms, instead of constantly reacting to market pressures.

a decision at this scale carries risk. but so does standing still. we've done a full review to determine the roles and people we require to reliably grow the business from here, and we've pressure-tested those decisions from multiple angles. i accept that we may have gotten some of them wrong, and we've built in flexibility to account for that, and do the right thing for our customers.

we're not going to just disappear people from slack and email and pretend they were never here. communication channels will stay open through thursday evening (pacific) so everyone can say goodbye properly, and share whatever you wish. i'll also be hosting a live video session to thank everyone at 3:35pm pacific. i know doing it this way might feel awkward. i'd rather it feel awkward and human than efficient and cold.

to those of you leaving… i'm grateful for you, and i'm sorry to put you through this. you built what this company is today. that's a fact that i'll honor forever. this decision is not a reflection of what you contributed. you will be a great contributor to any organization going forward.

to those staying… i made this decision, and i'll own it. what i'm asking of you is to build with me. we're going to build this company with intelligence at the core of everything we do. how we work, how we create, how we serve our customers.

our customers will feel this shift too, and we're going to help them navigate it: towards a future where they can build their own features directly, composed of our capabilities and served through our interfaces. that's what i'm focused on now.

expect a note from me tomorrow.

jack

0 replies · 0 retweets · 0 likes · 159 views
Yining Ye(叶奕宁) retweeted
kache @yacineMTB ·
remember when covid started happening and you were on 4chan like holy shit, and then you told your friends about it IRL and they didn't take you seriously at all and had no idea? and then two months later the entire world shut down. that's where we're at with AI rn
394 replies · 1.1K retweets · 19.3K likes · 1.1M views
Yi Tay @YiTayML ·
Gemini 3 Deep Think is here! 😎 This model is not only super strong in math and coding (IMO gold and a 3455 Codeforces Elo), it is also gold-standard in physics and chemistry olympiads. 😃 It also sets new records on ARC-AGI-2 and HLE. Proud to be a (core) member of the Deep Think team. 🦾😆 Feeling the AGI!
[image]
10 replies · 26 retweets · 333 likes · 16K views
Yining Ye(叶奕宁) retweeted
yi @agihippo ·
ablations are for the weak. just yolo your runs. (ok, do some ablations, but don't overdo it.) instinct is everything in ML and AI.
6 replies · 3 retweets · 146 likes · 94.9K views
Yining Ye(叶奕宁) retweeted
AK @_akhaliq ·
ByteDance presents Game-TARS: Pretrained Foundation Models for Scalable Generalist Multimodal Game Agents
3 replies · 25 retweets · 195 likes · 18.7K views
Yining Ye(叶奕宁) @Yining_Ye ·
Since Google's RICO (2017), GUI agents have relied on predefined action spaces (click, swipe, etc.). But humans don't work that way. We use "raw actions"—a continuous stream (move the mouse X pixels, press 'W'…)—and use real-time feedback (~24 fps) to adjust position. 🉑 We tested whether models can use this raw space. After experiments on 1000s of games/datasets, the verdict: it's weaker at first, but shows significantly better scaling properties as training scale grows. A new path for agent scaling?
[image]
Zihao Wang @RealZihaoWang

🚀 Thrilled to introduce Game-TARS: our next-gen generalist multimodal game agent! Tired of AI that needs custom code for every new game? Game-TARS is a single VLM that learns to master any game just like a human: by watching the screen and using a keyboard & mouse. Read more.

0 replies · 0 retweets · 9 likes · 596 views
Yining Ye(叶奕宁) retweeted
Qwen @Alibaba_Qwen ·
🚀 We're thrilled to unveil Qwen3-VL — the most powerful vision-language model in the Qwen series yet!

🔥 The flagship model Qwen3-VL-235B-A22B is now open-sourced and available in both Instruct and Thinking versions:
✅ Instruct outperforms Gemini 2.5 Pro on key vision benchmarks
✅ Thinking achieves state-of-the-art (SOTA) performance on multimodal reasoning tasks

✨ Key breakthroughs:
🖥️ Visual Agent: operates GUIs on PC/phone — understands buttons, calls tools, and completes real-world tasks (SOTA on OSWorld)
💻 Visual Coding: transforms screenshots into code (HTML/CSS/JS, Draw.io) — true "what you see is what you get" development
📚 256K+ context (scalable to 1M) → supports 2-hour videos and multi-page long PDFs
🌍 32-language OCR with enhanced robustness for blurry, tilted, or rare characters
📐 Advanced spatial reasoning: 2D → relative coordinates, 3D grounding, occlusion handling, and perspective understanding
🧠 Thinking Mode: leading performance in STEM/math — enables deep causal reasoning
🔤 Text capabilities rival top-tier LLMs — a solid language foundation powering its multimodal excellence

From "seeing" to "understanding", from "recognizing" to "reasoning & acting".

Qwen Chat: chat.qwen.ai/?models=qwen3-…
API: alibabacloud.com/help/en/model-…
Blog: qwen.ai/blog?id=99f033…
ModelScope: modelscope.cn/collections/Qw…
HuggingFace: huggingface.co/collections/Qw…
[image]
79 replies · 292 retweets · 1.8K likes · 333.1K views
Yining Ye(叶奕宁) @Yining_Ye ·
I think a critical aspect of agents is overlooked: instruction following & alignment problems.

📊 This paper discovered: when mixing general VLM data with domain-specific training, model performance follows a surprising pattern - it first rises, then falls as we add more general data. And the optimal ratio increases with model size.

🤔 But we want agent training methods that scale infinitely with more data. So why this limitation?

💡 I believe the main culprit is that current agent models either lack or lose instruction-following capabilities during domain training. Good instruction following helps models organize knowledge storage better, enabling them to recall information through reasoning during inference. Without proper instruction following, different data types become conflicting rather than complementary, preventing agents from generalizing abilities learned from non-agent or even text-only data.

🎯 Case in point: Seed 1.5 VL uses a very small ratio of GUI data but outperforms the specifically-trained UI-TARS-1.5 on GUI tasks - precisely because it retains instruction-following ability! However, current agent benchmarks barely test instruction following. They focus on absolute capabilities with straightforward tasks. Our collaboration with xlang.ai on AgentArena revealed that model performance on online benchmarks like OSWorld shows little correlation with user votes (once performance is beyond certain thresholds).

🏭 For real-world deployment, modeling instruction following in agent scenarios might be MORE important than execution ability itself. Unlike chat models, agents have execution permissions - poor instruction following makes them far more dangerous in practice. Users want agents that can SAFELY handle 10-step edge cases before attempting 100-step tasks. This might explain why agents are currently limited to very specific domains like automotive or smart home, with near template-based queries.

🔄 In 2022, OpenAI built ChatGPT on GPT-3.5 by better modeling instruction following - that's when chat models truly started serving the real world.

🚀 We might be at the "agent GPT-3.5 moment" right now. But to get agents into everyone's daily life (imagine 1B people giving at least one agent command daily), we need to solve the agent instruction-following problem. Along the way we might rediscover an "agent alignment tax", or find that "agent inverse scaling is U-shaped"... but this seems like an under-researched area for now!
0 replies · 0 retweets · 7 likes · 403 views
Yining Ye(叶奕宁) retweeted
Nathan Lambert @natolambert ·
I'm going to sound like a shill but I describe paying for better AIs right now as a way that you can "pay to win" in your career. Normally dynamics like this are restricted to video games.
finbarr @finbarrtimbers

I am very impressed by GPT-5 Pro. Had a bug in a script. Claude Code w/ Opus couldn't find it after repeated attempts. Dumped the problem and all relevant code into GPT-5 Pro and it found it first shot. Very impressive.

33 replies · 38 retweets · 816 likes · 111.3K views
Yining Ye(叶奕宁) retweeted
Jason Wei @_jasonwei ·
Becoming an RL diehard in the past year and thinking about RL for most of my waking hours inadvertently taught me an important lesson about how to live my own life.

One of the big concepts in RL is that you always want to be "on-policy": instead of mimicking other people's successful trajectories, you should take your own actions and learn from the reward given by the environment. Obviously imitation learning is useful to bootstrap to nonzero pass rate initially, but once you can take reasonable trajectories, we generally avoid imitation learning because the best way to leverage the model's own strengths (which are different from humans) is to only learn from its own trajectories. A well-accepted instantiation of this is that RL is a better way to train language models to solve math word problems compared to simple supervised finetuning on human-written chains of thought.

Similarly in life, we first bootstrap ourselves via imitation learning (school), which is very reasonable. But even after I graduated school, I had a habit of studying how other people found success and trying to imitate them. Sometimes it worked, but eventually I realized that I would never surpass the full ability of someone else because they were playing to their strengths which I didn't have. It could be anything from a researcher doing yolo runs more successfully than me because they built the codebase themselves and I didn't, or a non-AI example would be a soccer player keeping ball possession by leveraging strength that I didn't have.

The lesson of doing RL on-policy is that beating the teacher requires walking your own path and taking risks and rewards from the environment.

For example, two things I enjoy more than the average researcher are (1) reading a lot of data, and (2) doing ablations to understand the effect of individual components in a system. Once when collecting a dataset, I spent a few days reading data and giving each human annotator personalized feedback, and after that the data turned out great and I gained valuable insight into the task I was trying to solve. Earlier this year I spent a month going back and ablating each of the decisions that I previously yolo'ed while working on deep research. It was a sizable amount of time spent, but through those experiments I learned unique lessons about what type of RL works well.

Not only was leaning into my own passions more fulfilling, but I now feel like I'm on a path to carving a stronger niche for myself and my research.

In short, imitation is good and you have to do it initially. But once you're bootstrapped enough, if you want to beat the teacher you must do on-policy RL and play to your own strengths and weaknesses :)
127 replies · 342 retweets · 3.4K likes · 346K views
Yining Ye(叶奕宁) retweeted
Yu Su @ysu_nlp ·
Computer Use: Modern Moravec's Paradox

A new blog post arguing why computer-use agents may be the biggest opportunity and challenge for AGI. tinyurl.com/computer-use-a…

Table of Contents
> Moravec's Paradox
> Moravec's Paradox in 2025
> Computer use may be the biggest opportunity for AGI
> Chatbots → agents
> Internet-scale learning of human cognition
> Bits > atoms
> Enormous economic value
> Why is computer use hard for AI?
> Computer use ≠ clicks + typing
> Idiosyncratic environments
> Contextual understanding
> Tacit knowledge
> Is RL the panacea?
> Looking forward

If you are also excited about CUAs and want to do some serious work, let's chat!
[image]
10 replies · 66 retweets · 216 likes · 55.5K views
Yining Ye(叶奕宁) retweeted
AK @_akhaliq ·
UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning
[image]
9 replies · 30 retweets · 151 likes · 27.7K views