Kai Yang

564 posts


@YangK_ai

AI | Entrepreneur | Lifelong Learner | Astrophotographer | @Scale_AI | Ex-@LandingAI — Opinions are my own

Fremont, CA · Joined March 2012
642 Following · 383 Followers
Greg Yang @TheGregYang
Generally I find white fish (sole, branzino, rockfish, etc.) to be the safest, least inflammatory source of protein. I also find that the fattier the meat, the more inflammatory it typically is: white fish < ostrich < salmon < lean beef/lamb < fatty cuts of beef/lamb. Quality can matter here, because I've had bad experiences with Ranch 99's white fish.
(12 replies · 1 repost · 97 likes · 17.9K views)
Kai Yang reposted
Scale Labs @ScaleAILabs
Versioning, Rewards, and Observations (VeRO) is a new @scale_AI framework for studying a simple question: can coding agents improve other AI agents by editing their prompts, tools, and workflows? Instead of treating this as prompt engineering, VeRO frames agent optimization as a coding agent problem.
(3 replies · 10 reposts · 44 likes · 5.2K views)
LandingAI @LandingAI
99.16% Accuracy on DocVQA: New state-of-the-art for document understanding!

We ran ADE on the DocVQA validation split and got 5,286 correct out of 5,331 questions. That's 99.16% accuracy.

DocVQA is a question-answering benchmark on real scanned documents, typically used to evaluate vision-language models. We used it to test whether ADE's parsing preserves enough information that an LLM can answer questions accurately without ever seeing the original document images.

Here's how it works:
Standard VQA approach: image + question → answer (the model sees the document every time)
Our approach: parse the document once → answer all questions from the structured output (zero image access during Q&A)

This is harder. The parsing has to be complete enough that no visual context is lost.

What this means for production:
→ Parse once, run unlimited queries against the structured output
→ Every extraction is traceable to exact locations in the source document
→ No need to re-process images for new questions
→ Structured data enables search, analytics, and RAG applications

We've published every result and the code to reproduce this benchmark. Complete transparency. Link to the detailed write-up and reproducible repo in the comments!
(4 replies · 4 reposts · 26 likes · 5.5K views)
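As a quick sanity check, the 99.16% figure in the LandingAI post follows directly from the counts it quotes (5,286 correct out of 5,331 questions); a minimal sketch:

```python
# Sanity check of the accuracy figure from the LandingAI post:
# 5,286 correct answers out of 5,331 DocVQA validation questions.
correct = 5286
total = 5331
accuracy = correct / total
print(f"{accuracy:.2%}")  # 99.16%
```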
Kai Yang @YangK_ai
One of the most important projects our team has delivered. We are addressing the primary bottleneck -- how quickly and realistically we can scale training environments for the world’s most valuable and complex scenarios.
Jason Droege @jdroege

AI agents are getting put to work on real tasks. Training them to do that well is harder than it looks.

For the past year at @scale_AI, we've been building environments where agents can practice real workflows: simulated worlds that mirror real software, real processes, and real decisions. They can fail, try again, and get better before they touch production. This process is how these agents eventually become reliable for the world's most important decisions.

Today we're officially launching Scale RL Environments. Nearly half of our new training projects already run through them. We're building more this year.

See what your agents can learn: scale.com/blog/rl-enviro…

(0 replies · 0 reposts · 1 like · 54 views)
Bing Liu @vbingliu
OpenAI is moving away from SWE-Bench Verified, citing challenges with underspecified tasks, misaligned tests, and contamination. We agree. These were exactly the motivations behind SWE-Bench Pro (arxiv.org/pdf/2509.16941).

What we changed:
→ Underspecified tasks: structured, executable problem definitions
→ Contamination: strict curation + private/commercial codebases

But this is just step one. Where we're pushing frontier coding evals next:
→ Beyond unit tests: rubric-based evaluation (arxiv.org/pdf/2601.04171)
→ From static tasks to real-world agentic environments

Modern coding systems are not solving isolated problems. They operate as agents over repos, tools, and long-horizon workflows. Our evals need to reflect that.

SWE-Bench Pro is one step toward more realistic and reliable evaluation for coding agents. We'll keep pushing the frontier.
(2 replies · 3 reposts · 39 likes · 3.3K views)
Greg Yang @TheGregYang
Woke up quite tired and stayed that way so far. Really annoyed at how unpredictable this is.
(39 replies · 5 reposts · 214 likes · 58.9K views)
Kai Yang reposted
Scale AI @scale_AI
🎙️ In our latest Chain of Thought episode we unpack ResearchRubrics, our benchmark for evaluating deep research agent performance. We explore what meaningful agent evaluation looks like, where today’s agents still fall short, and why clearer evaluation frameworks are critical as agent use accelerates.
(4 replies · 5 reposts · 18 likes · 2.5K views)
Kai Yang reposted
Yannis He @yannis__he
We evaluated the coding capabilities of the latest models from @OpenAI, @AnthropicAI, and @GoogleDeepMind on @scale_AI's SWE-Bench Pro private set: a coding agent benchmark built exclusively from proprietary commercial codebases.

Each new model beats its predecessor:
- GPT-5.2: 23.8% (↑ from 14.9%)
- Claude Opus 4.5: 23.4% (↑ from 17.8%)
- Gemini 3 Pro: 18.0% (↑ from 10.1%)

But on public repos, these same models score 40-46%. This ~2x performance difference tells us something important: progress is real, and so is the generalization gap.

What drove these improvements, and what drives the next one? Better reasoning? More diverse training data? What coding capabilities do you think models should tackle next?

scale.com/leaderboard/sw…
(0 replies · 2 reposts · 9 likes · 4.8K views)
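The "~2x" gap quoted in the post can be sanity-checked from its own numbers. The sketch below assumes a 43% midpoint for the quoted 40-46% public-repo range, since per-model public scores aren't given:

```python
# Rough check of the public-vs-private generalization gap described in the post.
# Private-set scores come from the post; 43.0 is an assumed midpoint of the
# quoted 40-46% public-repo range (per-model public scores are not given).
private_scores = {"GPT-5.2": 23.8, "Claude Opus 4.5": 23.4, "Gemini 3 Pro": 18.0}
public_midpoint = 43.0

for model, private in private_scores.items():
    ratio = public_midpoint / private
    print(f"{model}: public/private ≈ {ratio:.1f}x")
```

Under that assumption the ratios come out between roughly 1.8x and 2.4x, consistent with the "~2x" characterization.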
Kai Yang @YangK_ai
Speech models don’t fail because of words — they fail because of interaction. Benchmarks that ignore interruptions, context shifts, and timing miss the real problem. Audio MC is a big step in the right direction.
Scale AI @scale_AI

Speech isn’t just text read out loud. 💬 Real conversations are dynamic, full of interruptions, and context-rich — and benchmarks should match. Introducing Audio MultiChallenge (Audio MC), the first benchmark built to test how well native Speech-to-Speech models handle real conversations.

(0 replies · 0 reposts · 1 like · 60 views)
Kai Yang reposted
NASCAR 25 Game @Nascar25Game
🔧 We've just released our first patch on Steam! Here's what's in it:
▪️ Fixed an issue where some machines with lower-end graphics cards could not load the game.
▪️ Added the ability to map controls directly from the splash screen.
▪️ Added default mapping configurations for more wheel types.
▪️ Added the option to create password-protected private online lobbies.
(82 replies · 40 reposts · 531 likes · 56.7K views)
Kai Yang reposted
Lukas Ziegler @lukas_m_ziegler
Build a humanoid robot yourself! 🪚

OpenArm is an open-source humanoid robot. It comes with full CAD, control code, firmware, and simulation tools: everything needed to build, modify, and operate it.

The arms are designed to be compliant and backdrivable. Teleoperation is supported, with force feedback and real-time gravity compensation, so operators can guide the arm naturally.

Super important: on the simulation side, OpenArm works with platforms like MuJoCo and Isaac Sim, letting developers test policies in virtual environments before running them on hardware.

Assemble it yourself from a kit or get it prebuilt; the goal is accessibility for research labs, small teams, and enthusiasts. The project is run by Enactic in Tokyo, Japan, and aims to lower the barrier for experimenting with dexterous manipulation.

Let's put robotics in the mainstream! 🔥🔥

P.S. I put the project page in the comments!
(69 replies · 364 reposts · 2.5K likes · 232.7K views)
Kai Yang @YangK_ai
@raceroom The new graphics are great, and I love the new daily racing setup.
(0 replies · 0 reposts · 0 likes · 230 views)
RaceRoom @raceroom
Thanks to everyone who has jumped into the new Ranked Multiplayer this week. It’s been incredible seeing the grids fill up 🔥 Your feedback on the graphics update has made all the hard work worth it. Here are some stunning community screenshots✨ #RaceRoom
(7 replies · 27 reposts · 483 likes · 13.7K views)
Derick @Lionpassion
In the past 4 HOURS, I could only do 4 races of 15 minutes each, each of which had its own bucketloads of crap going on (not even mentioning the drivers I'm chucked in with now). If you want to kill the game, just do it; do not take people's money on empty promises and then have them walk away...
(1 reply · 0 reposts · 0 likes · 40 views)
RaceRoom @raceroom
🏁 A Super Touring legend returns 🏁 The iconic Renault Laguna arrives on September 24, expanding one of the most popular car classes in RaceRoom and delivering even more golden-era touring car action. #RaceRoom #SuperTouring
(9 replies · 26 reposts · 331 likes · 13.2K views)
RaceRoom @raceroom
🚨 A comprehensive graphics upgrade arrives September 24! 🚨 Experience more realistic lighting, physics-based rendering, improved reflections, and performance optimizations that make #RaceRoom look and run better than ever 🔥
(20 replies · 49 reposts · 826 likes · 26.1K views)
Kai Yang reposted
Jie Wang @JieWang_ZJUI
Great reading list on modern robotics foundation models
stevengongg @stevengongg

Below is an additional reading list if you want to read more about robot foundation models:

Additional Reading List
- Brohan, et al., 2022, [RT-1: Robotics Transformer for Real-World Control at Scale](arxiv.org/abs/2212.06817)
- Brohan, et al., 2023, [RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control](arxiv.org/abs/2307.15818)
- Open X-Embodiment Collaboration, 2023, [Open X-Embodiment: Robotic Learning Datasets and RT-X Models](arxiv.org/abs/2310.08864)
- Chi, et al., 2023, [Diffusion Policy: Visuomotor Policy Learning via Action Diffusion](arxiv.org/abs/2303.04137)
- Liu, et al., 2024, [RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation](arxiv.org/pdf/2410.07864)
- Etukuru, et al., 2024, [Robot Utility Models: General Policies for Zero-Shot Deployment in New Environments](arxiv.org/abs/2409.05865)
- Kim, et al., 2024, [OpenVLA: An Open-Source Vision-Language-Action Model](arxiv.org/abs/2406.09246)
- Cheang, et al., 2024, [GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation](arxiv.org/abs/2410.06158)
- Octo Model Team, 2024, [Octo: An Open-Source Generalist Robot Policy](arxiv.org/abs/2405.12213)
- Fang, et al., 2025, [Robix: A Unified Model for Robot Interaction, Reasoning and Planning](arxiv.org/abs/2509.01106)
- NVIDIA, 2025, [GR00T N1: An Open Foundation Model for Generalist Humanoid Robots](arxiv.org/abs/2503.14734)
- Yang, et al., 2025, [FP3: A 3D Foundation Policy for Robotic Manipulation](arxiv.org/pdf/2503.08950)
- Lee, et al., 2025, [MolmoAct: Action Reasoning Models that can Reason in Space](arxiv.org/abs/2508.07917)

Additional Resources
- [Short Blog on VLAs](itcanthink.substack.com/p/vision-langu…) by Chris Paxton
- [U of T Robotics Institute Seminar on Robotics Foundation Models](youtube.com/watch?v=EYLdC3…) by Sergey Levine

(0 replies · 2 reposts · 14 likes · 10.8K views)
Kai Yang @YangK_ai
@ElonMuskAOC No. Free speech, and make them unable to delete their posts.
(0 replies · 0 reposts · 0 likes · 5 views)
Not Elon Musk @ElonMuskAOC
Should X remove the accounts that are cheering the death of Charlie Kirk? What do you think?
(32.2K replies · 5.7K reposts · 122.7K likes · 5.3M views)