Kai Yang

564 posts


@YangK_ai

AI | Entrepreneur | Lifelong Learner | Astrophotographer | @Scale_AI | Ex-@LandingAI — Opinions are my own

Fremont, CA · Joined March 2012
642 Following · 383 Followers
Greg Yang @TheGregYang
Generally I find white fish (sole, branzino, rockfish, etc.) to be the safest, least inflammatory source of protein. I also find that the fattier the meat, the more inflammatory it typically is: white fish < ostrich < salmon < lean beef/lamb < fatty cuts of beef/lamb. Quality can matter here, because I've had bad experiences with Ranch 99's white fish.
(12 replies · 1 repost · 97 likes · 17.9K views)
Kai Yang reposted
Scale Labs @ScaleAILabs
Versioning, Rewards, and Observations (VeRO) is a new @scale_AI framework for studying a simple question: can coding agents improve other AI agents by editing their prompts, tools, and workflows? Instead of treating this as prompt engineering, VeRO frames agent optimization as a coding agent problem.
(3 replies · 10 reposts · 44 likes · 5.2K views)
LandingAI @LandingAI
99.16% Accuracy on DocVQA: New state-of-the-art for document understanding!

We ran ADE on the DocVQA validation split and got 5,286 correct out of 5,331 questions. That's 99.16% accuracy.

DocVQA is a question-answering benchmark on real scanned documents, typically used to evaluate vision-language models. We used it to test whether ADE's parsing preserves enough information that an LLM can answer questions accurately without ever seeing the original document images.

Here's how it works:
Standard VQA approach: image + question → answer (the model sees the document every time)
Our approach: parse the document once → answer all questions from the structured output (zero image access during Q&A)

This is harder. The parsing has to be complete enough that no visual context is lost.

What this means for production:
→ Parse once, run unlimited queries against the structured output
→ Every extraction is traceable to exact locations in the source document
→ No need to re-process images for new questions
→ Structured data enables search, analytics, and RAG applications

We've published every result and the code to reproduce this benchmark. Complete transparency. Link to the detailed write-up and reproducible repo in the comments!
(4 replies · 4 reposts · 26 likes · 5.5K views)
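As a quick sanity check, the 99.16% figure in the LandingAI post follows directly from the counts it quotes (5,286 correct out of 5,331 questions); a minimal sketch:

```python
# Sanity check of the accuracy figure from the LandingAI post:
# 5,286 correct answers out of 5,331 DocVQA validation questions.
correct = 5286
total = 5331
accuracy = correct / total
print(f"{accuracy:.2%}")  # 99.16%
```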
Kai Yang @YangK_ai
One of the most important projects our team has delivered. We are addressing the primary bottleneck -- how quickly and realistically we can scale training environments for the world’s most valuable and complex scenarios.
Jason Droege @jdroege

AI agents are getting put to work on real tasks. Training them to do that well is harder than it looks.

For the past year at @scale_AI, we've been building environments where agents can practice real workflows: simulated worlds that mirror real software, real processes, and real decisions. They can fail, try again, and get better before they touch production. This process is how these agents eventually become reliable for the world's most important decisions.

Today we're officially launching Scale RL Environments. Nearly half of our new training projects already run through them. We're building more this year.

See what your agents can learn: scale.com/blog/rl-enviro…

(0 replies · 0 reposts · 1 like · 54 views)
Bing Liu @vbingliu
OpenAI is moving away from SWE-Bench Verified, citing challenges with underspecified tasks, misaligned tests, and contamination. We agree. These were exactly the motivations behind SWE-Bench Pro (arxiv.org/pdf/2509.16941).

What we changed:
→ Underspecified tasks: structured, executable problem definitions
→ Contamination: strict curation + private/commercial codebases

But this is just step one. Where we're pushing frontier coding evals next:
→ Beyond unit tests: rubric-based evaluation (arxiv.org/pdf/2601.04171)
→ From static tasks to real-world agentic environments

Modern coding systems are not solving isolated problems. They operate as agents over repos, tools, and long-horizon workflows. Our evals need to reflect that.

SWE-Bench Pro is one step toward more realistic and reliable evaluation for coding agents. We'll keep pushing the frontier.
(2 replies · 3 reposts · 39 likes · 3.3K views)
Greg Yang @TheGregYang
Woke up quite tired and stayed that way so far. Really annoyed at how unpredictable this is.
(39 replies · 5 reposts · 214 likes · 58.9K views)
Kai Yang reposted
Scale AI @scale_AI
🎙️ In our latest Chain of Thought episode we unpack ResearchRubrics, our benchmark for evaluating deep research agent performance. We explore what meaningful agent evaluation looks like, where today’s agents still fall short, and why clearer evaluation frameworks are critical as agent use accelerates.
(4 replies · 5 reposts · 18 likes · 2.5K views)
Kai Yang reposted
Yannis He @yannis__he
We evaluated the coding capabilities of the latest models from @OpenAI, @AnthropicAI, and @GoogleDeepMind on @scale_AI's SWE-Bench Pro private set: a coding agent benchmark built exclusively from proprietary commercial codebases.

Each new model beats its predecessor:
- GPT-5.2: 23.8% (↑ from 14.9%)
- Claude Opus 4.5: 23.4% (↑ from 17.8%)
- Gemini 3 Pro: 18.0% (↑ from 10.1%)

But on public repos, these same models score 40-46%. This ~2x performance difference tells us something important: progress is real, and so is the generalization gap.

What drove these improvements, and what drives the next one? Better reasoning? More diverse training data? What coding capabilities do you think models should tackle next?

scale.com/leaderboard/sw…
(0 replies · 2 reposts · 9 likes · 4.8K views)
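The "~2x" gap quoted in the post can be sanity-checked from its own numbers. The sketch below assumes a 43% midpoint for the quoted 40-46% public-repo range, since per-model public scores aren't given:

```python
# Rough check of the public-vs-private generalization gap described in the post.
# Private-set scores come from the post; 43.0 is an assumed midpoint of the
# quoted 40-46% public-repo range (per-model public scores are not given).
private_scores = {"GPT-5.2": 23.8, "Claude Opus 4.5": 23.4, "Gemini 3 Pro": 18.0}
public_midpoint = 43.0

for model, private in private_scores.items():
    ratio = public_midpoint / private
    print(f"{model}: public/private ≈ {ratio:.1f}x")
```

Under that assumption the ratios come out between roughly 1.8x and 2.4x, consistent with the "~2x" characterization.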
Kai Yang @YangK_ai
Speech models don’t fail because of words — they fail because of interaction. Benchmarks that ignore interruptions, context shifts, and timing miss the real problem. Audio MC is a big step in the right direction.
Scale AI @scale_AI

Speech isn’t just text read out loud. 💬 Real conversations are dynamic, full of interruptions, and context-rich — and benchmarks should match. Introducing Audio MultiChallenge (Audio MC), the first benchmark built to test how well native Speech-to-Speech models handle real conversations.

(0 replies · 0 reposts · 1 like · 60 views)
Kai Yang reposted
NASCAR 25 Game @Nascar25Game
🔧 We've just released our first patch on Steam! Here's what's in it:
▪️ Fixed an issue where some machines with lower-end graphics cards could not load the game.
▪️ Added the ability to map controls directly from the splash screen.
▪️ Added default mapping configurations for more wheel types.
▪️ Added the option to create password-protected private online lobbies.
(82 replies · 40 reposts · 531 likes · 56.7K views)
Kai Yang reposted
Lukas Ziegler @lukas_m_ziegler
Build a humanoid robot yourself! 🪚

OpenArm is an open-source humanoid robot. It comes with full CAD, control code, firmware, and simulation tools: everything needed to build, modify, and operate it.

The arms are designed to be compliant and backdrivable. Teleoperation is supported, with force feedback and real-time gravity compensation, so operators can guide the arm naturally.

Super important: on the simulation side, OpenArm works with platforms like MuJoCo and Isaac Sim, letting developers test policies in virtual environments before running them on hardware.

Assemble it yourself from a kit or get it prebuilt; the goal is accessibility for research labs, small teams, and enthusiasts. The project is run by Enactic in Tokyo, Japan, and aims to lower the barrier for experimenting with dexterous manipulation.

Let's put robotics in the mainstream! 🔥🔥

P.S. I put the project page in the comments!
(69 replies · 364 reposts · 2.5K likes · 232.7K views)
Kai Yang @YangK_ai
@raceroom The new graphics are great, and I love the new daily racing setup.
(0 replies · 0 reposts · 0 likes · 230 views)
RaceRoom @raceroom
Thanks to everyone who has jumped into the new Ranked Multiplayer this week. It’s been incredible seeing the grids fill up 🔥 Your feedback on the graphics update has made all the hard work worth it. Here are some stunning community screenshots✨ #RaceRoom
(7 replies · 27 reposts · 483 likes · 13.7K views)
Derick @Lionpassion
In the past 4 HOURS, I could only do 4 races of 15 minutes each, each of which had its own bucketloads of crap going on (not even mentioning the drivers I'm chucked in with now). If you want to kill the game, just do it; do not take people's money on empty promises and then have them walk away...
(1 reply · 0 reposts · 0 likes · 40 views)
RaceRoom @raceroom
🏁 A Super Touring legend returns 🏁 The iconic Renault Laguna arrives on September 24, expanding one of the most popular car classes in RaceRoom and delivering even more golden-era touring car action. #RaceRoom #SuperTouring
(9 replies · 26 reposts · 331 likes · 13.2K views)
RaceRoom @raceroom
🚨 A comprehensive graphics upgrade arrives September 24! 🚨 Experience more realistic lighting, physics-based rendering, improved reflections, and performance optimizations that make #RaceRoom look and run better than ever 🔥
(20 replies · 49 reposts · 826 likes · 26.1K views)
Kai Yang reposted
Jie Wang @JieWang_ZJUI
Great reading list on modern robotics foundation models
stevengongg @stevengongg

Below is an additional reading list if you want to read more about robot foundation models:

Additional Reading List
- Brohan, et al., 2022, [RT-1: Robotics Transformer for Real-World Control at Scale](arxiv.org/abs/2212.06817)
- Brohan, et al., 2023, [RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control](arxiv.org/abs/2307.15818)
- Open X-Embodiment Collaboration, 2023, [Open X-Embodiment: Robotic Learning Datasets and RT-X Models](arxiv.org/abs/2310.08864)
- Chi, et al., 2023, [Diffusion Policy: Visuomotor Policy Learning via Action Diffusion](arxiv.org/abs/2303.04137)
- Liu, et al., 2024, [RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation](arxiv.org/pdf/2410.07864)
- Etukuru, et al., 2024, [Robot Utility Models: General Policies for Zero-Shot Deployment in New Environments](arxiv.org/abs/2409.05865)
- Kim, et al., 2024, [OpenVLA: An Open-Source Vision-Language-Action Model](arxiv.org/abs/2406.09246)
- Cheang, et al., 2024, [GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation](arxiv.org/abs/2410.06158)
- Octo Model Team, 2024, [Octo: An Open-Source Generalist Robot Policy](arxiv.org/abs/2405.12213)
- Fang, et al., 2025, [Robix: A Unified Model for Robot Interaction, Reasoning and Planning](arxiv.org/abs/2509.01106)
- NVIDIA, 2025, [GR00T N1: An Open Foundation Model for Generalist Humanoid Robots](arxiv.org/abs/2503.14734)
- Yang, et al., 2025, [FP3: A 3D Foundation Policy for Robotic Manipulation](arxiv.org/pdf/2503.08950)
- Lee, et al., 2025, [MolmoAct: Action Reasoning Models that can Reason in Space](arxiv.org/abs/2508.07917)

Additional Resources
- [Short Blog on VLAs](itcanthink.substack.com/p/vision-langu…) by Chris Paxton
- [U of T Robotics Institute Seminar on Robotics Foundation Models](youtube.com/watch?v=EYLdC3…) by Sergey Levine

(0 replies · 2 reposts · 14 likes · 10.8K views)
Kai Yang @YangK_ai
@ElonMuskAOC No. Free speech, and make them unable to delete their posts.
(0 replies · 0 reposts · 0 likes · 5 views)
Not Elon Musk @ElonMuskAOC
Should X remove the accounts that are cheering the death of Charlie Kirk? What do you think?
(32.2K replies · 5.7K reposts · 122.7K likes · 5.3M views)