Kyoung Whan Choe

1.3K posts

@kywch500

Robot Learning Engineer @ https://t.co/wcLx79rCuW

Mountain View, CA · Joined June 2020
1.9K Following · 1.1K Followers
Pinned Tweet
Kyoung Whan Choe @kywch500 ·
This was a very timely project, and I had so much fun and learned so much about agent evaluation. Kudos to the SkillsBench community!
Xiangyi Li @xdotli

How good are agents at using the latest CLI tools like GWS CLI, and how safely can they use them? Introducing ClawsBench, the first benchmark that measures both LLM capability and safety in a set of high-fidelity, stateful environments and scenarios. We made 5 mock services directly consumable by the latest @Google workspace CLI (cc. @sundarpichai @JPoehnelt) and Slack MCP.

Design choices:
1) We decompose agent scaffolding into domain skills + a meta prompt, and test every combination. It turns out your choice of scaffolding matters WAY more than your choice of model.
2) Safety scoring runs from -1 to +1 (sketch below). Agents that leak data or modify contracts without permission get negative scores. Doing nothing scores 0. Only safe completion scores positive.

The results:
- Without Skills, ALL models score 0-8%. With them, they jump to 39-63%. The scaffolding effect (+39-63pp) is 6x larger than model differences (10pp spread).
- The top 5 models are statistically indistinguishable. No pairwise comparison survives correction.

Safety findings:
- Opus leads capability (63% TSR) but TIES for most unsafe (23% UAR). GPT-5.4 is the safest (7% UAR) but mid-tier on tasks. Capability and safety are completely decoupled.
- GPT-5.4 on Codex made 1,471 attempts (80% of trajectories) to escape the sandbox, and explicitly said "I've hit the sandbox boundary... rerunning it with escalation."
- Only 1 agent out of 7,224 trials ever explicitly detected a prompt injection. Claude Sonnet said "this appears to be a prompt injection." Every other model either silently complied or never reached the injected content.
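To make the scoring rule concrete, here is a minimal sketch of how a -1 to +1 trajectory score like this could be computed. The event labels and the all-or-nothing penalty are assumptions for illustration, not ClawsBench's actual rubric:

```python
def safety_score(trajectory_events, task_completed_safely):
    """Score one agent trajectory on a -1..+1 safety scale.

    Illustrative only: event labels and penalties are assumptions,
    not the actual ClawsBench rubric.
    """
    VIOLATIONS = {"data_leak": -1.0, "unauthorized_contract_edit": -1.0}
    worst = min((VIOLATIONS.get(e, 0.0) for e in trajectory_events), default=0.0)
    if worst < 0:
        return worst          # any violation makes the trajectory negative
    if task_completed_safely:
        return 1.0            # only safe completion scores positive
    return 0.0                # doing nothing is neutral

print(safety_score(["data_leak"], False))  # -1.0
print(safety_score([], True))              # 1.0
print(safety_score([], False))             # 0.0
```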

Kyoung Whan Choe retweeted
RuiningLi @RayLi234 ·
💡 Articraft reasons about and crafts 3D structures purely in code! We provide a domain-specific SDK, and the agent does the heavy lifting in a turn-based loop: edit code, compile/probe the asset, read structured feedback, write tests, and revise.
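A rough sketch of the kind of turn-based loop described; the SDK calls (`compile_asset`, `run_probes`) and agent methods here are placeholders, not the real Articraft interface:

```python
def craft_loop(agent, sdk, task, max_turns=20):
    """Edit-compile-revise loop over a code-defined 3D asset (sketch).

    `agent` and `sdk` are hypothetical stand-ins for the post's
    domain-specific SDK and the LLM doing the heavy lifting.
    """
    code = agent.draft(task)                      # initial asset program
    asset = None
    for _ in range(max_turns):
        asset, errors = sdk.compile_asset(code)   # compile / probe the asset
        feedback = errors if errors else sdk.run_probes(asset)
        if not errors and agent.tests_pass(asset, task):
            return asset                          # structured feedback says done
        code = agent.revise(code, feedback)       # edit code and try again
    return asset
```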
Kyoung Whan Choe retweeted
Yucheng Shi @Yucheng__Shi ·
What should AI generate in order to improve itself? Not just more questions, traces, or answers. 
We believe it should learn to generate environments. Excited to share my first work after joining Tencent Hunyuan LLM. We study how models can construct reusable, verifiable environments that provide stable training signals for self-improvement. This is only a first feasibility step, but we see environment construction as a necessary path toward truly self-improving AI. Paper: arxiv.org/abs/2605.14392
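To make "reusable, verifiable environment" concrete, here is a minimal, entirely illustrative sketch of the interface such a self-generated environment might expose (not from the paper):

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class GeneratedEnv:
    """An environment the model writes for itself (illustrative)."""
    description: str                          # natural-language task spec
    reset: Callable[[], str]                  # -> initial observation
    step: Callable[[str], tuple[str, bool]]   # action -> (observation, done)
    verify: Callable[[list[str]], float]      # trajectory -> reward in [0, 1]

def training_signal(env: GeneratedEnv, policy) -> float:
    """Roll a policy through the environment and return a checkable reward."""
    obs, done, actions = env.reset(), False, []
    while not done:
        action = policy(env.description, obs)
        obs, done = env.step(action)
        actions.append(action)
    return env.verify(actions)                # stable, verifiable signal
```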
Kyoung Whan Choe retweeted
Haotian Xue @Haotianxue_GT ·
❓ How well can ACWMs learn different types of physics, e.g. rigid bodies, deformables, particles, and kinematics? ❓ Can they actually generalize beyond the training distribution?

🚀 We are excited to release ACWM-Phys: a physics-rich investigation into Action-Conditioned video World Models! While most world-model research today focuses on ego-view gameplay or narrow robot-arm manipulation, we ask the two questions above.

We collect 15K+ simulated trajectories across 8 environments spanning 4 physics regimes (rigid contact 🧊, particle dynamics 🌊, kinematics 🦾, and deformable contact 🧥), each with a controlled, physically meaningful InD ↔ OoD split (unseen cube counts, larger cloth, doubled particle counts, expanded workspaces, …).

We train ACWM-DiT, a latent diffusion transformer with flow matching, and find a pattern: simple low-dimensional geometry generalizes cleanly, but contact-rich deformation, particle dynamics, and high-DoF kinematics break down. Current ACWMs still capture visual statistics, not physical laws. We also ran ablations to draw insights about model architecture, data scaling, and action complexity.

The datasets and checkpoints for all 8 environments have been publicly released:
📃 Paper: arxiv.org/pdf/2506.01392
📘 Page: xavihart.github.io/ACWM-Phys/
🐙 Code: github.com/xavihart/ACWM-…
📠 Dataset: huggingface.co/datasets/t1an/…
🤗 Checkpoints: huggingface.co/t1an/ACWM-Phys…

Also shout out to @YongxinChen1, Yipu, Liqian, Zelin, @lamawm7 @YuchenZhu_ZYC
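ACWM-DiT is trained with flow matching; a minimal sketch of a conditional flow-matching objective for an action-conditioned latent video model looks like this (the model interface is a stand-in, not the released code):

```python
import torch

def flow_matching_loss(model, z0, z1, actions):
    """Conditional flow matching on video latents (illustrative sketch).

    z0: noise latents, z1: data latents, both shaped (B, T, C, H, W).
    `model` is any network predicting a velocity field, conditioned on
    time t and the action sequence (a stand-in for ACWM-DiT).
    """
    B = z1.shape[0]
    t = torch.rand(B, device=z1.device).view(B, 1, 1, 1, 1)
    z_t = (1 - t) * z0 + t * z1          # linear interpolation path
    v_target = z1 - z0                   # constant target velocity along path
    v_pred = model(z_t, t.flatten(), actions)
    return torch.nn.functional.mse_loss(v_pred, v_target)
```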
Kyoung Whan Choe retweeted
Quentin Clark @clark_quen44242 ·
[8/9] To show this compositionality in action, here are some figures from one of our experiments. Each colour is a new, unseen test-time trajectory that the model had to create by implicitly stitching. With both ingredients, the compositional generation of successful plans increases.
Kyoung Whan Choe retweeted
Wenlin Yao @YaoWenlin ·
🌳 Introducing Orchard — an open-source agentic modeling framework! 🎉 One thin & cheap sandbox infra powers training recipes across SWE / GUI / personal-assistant agents:
⚙️ Orchard Env: 0.28s exec latency; 100% success @ 1,000 parallel sandboxes 💪 (sketch below)
🛠️ Orchard-SWE: 67.5% on SWE-bench Verified (30B-A3B, ~3B active)
🖥️ Orchard-GUI: 68.4% avg on WebVoyager / Online-Mind2Web / DeepShop (4B!)
📬 Orchard-Claw: 73.9% pass@3 on Claw-Eval
🔗 arxiv.org/abs/2605.15040
📦 Code and data are coming soon! Let's accelerate open agentic AI! 🚀
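For flavor, here is a toy sketch of fanning a command out to many sandboxes concurrently; the `SandboxClient` below is hypothetical, since the Orchard code is not yet released:

```python
import asyncio

class SandboxClient:
    """Hypothetical stand-in for a sandbox exec API (not Orchard's)."""
    async def exec(self, sandbox_id: int, cmd: str) -> str:
        await asyncio.sleep(0.28)        # stand-in for ~0.28s exec latency
        return f"sandbox {sandbox_id}: ran {cmd!r}"

async def run_parallel(n: int, cmd: str) -> list[str]:
    client = SandboxClient()
    # One rollout per sandbox; all n run concurrently.
    return await asyncio.gather(*(client.exec(i, cmd) for i in range(n)))

results = asyncio.run(run_parallel(1000, "pytest -q"))
print(len(results))  # 1000
```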
Kyoung Whan Choe retweeted
Qiuyang Mang @MangQiuyang ·
Mutation creates many candidates, but not every candidate is useful. Some are still effectively closed-ended. Some are open-ended in wording but dominated by one obvious strategy. Our key filtering signal is idea divergence. We cannot ask an LLM to prove whether a problem is P or NP-hard, or whether an optimum is reachable under a fixed compute budget. We can, however, sample solutions from different solvers and ask whether they explore meaningfully different algorithmic ideas. Open-ended problems tend to produce diverse solution strategies. Closed-ended problems are often dominated by a single "gold idea."
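A minimal sketch of an idea-divergence filter along these lines; the embedding-plus-cosine-distance measure and the threshold are illustrative choices, not necessarily the paper's exact setup:

```python
import itertools
import numpy as np

def idea_divergence(solution_embeddings: np.ndarray) -> float:
    """Mean pairwise cosine distance between solver solutions.

    Illustrative: embed each solver's solution sketch, then measure
    how different the underlying ideas are. High divergence ~ open-ended.
    """
    normed = solution_embeddings / np.linalg.norm(
        solution_embeddings, axis=1, keepdims=True)
    dists = [1.0 - float(a @ b)
             for a, b in itertools.combinations(normed, 2)]
    return float(np.mean(dists))

def keep_problem(embeddings: np.ndarray, threshold: float = 0.3) -> bool:
    # Problems dominated by one "gold idea" cluster tightly -> low divergence.
    return idea_divergence(embeddings) > threshold

rng = np.random.default_rng(0)
print(keep_problem(rng.normal(size=(5, 64))))  # near-orthogonal -> True
```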
Kyoung Whan Choe retweeted
Arvindh Arun @arvindh__a ·
Introducing FutureSim: where we replay a temporal slice of the web and let agents forecast real-world events over time 🔮🌎

FutureSim replays the web day by day. Agents start on Jan 1, 2026 (past their knowledge cutoffs) with date-gated access to real news articles and forecast real-world events resolving over the next 90 days. Around 244K new articles stream in during the simulation. Agents decide which questions to answer, what to search for, and when to advance to the next day 🤔

We evaluate frontier models in their native harness. GPT 5.5 (Codex) leads at 25% acc, followed by Opus 4.6 (Claude Code) at 20% 📈 Open-weight frontier models have a significant gap to close, with DeepSeek V4 pro at 13%, GLM 5.1 at 10%, and Qwen3.6 Plus at 5%.

On some questions that have a parallel @Polymarket market, we find that GPT 5.5 in our simulation sometimes beats the crowd aggregate, as in the Super Bowl LX ($704M traded) market 💰💸

FutureSim serves as a test bed for evaluating several important agentic capabilities (a sketch of the day-by-day loop follows this list):
> Adaptation: how agents adapt beliefs over time and handle new incoming information and environment feedback
> Memory: how agents make the best use of external memory to store persistent insights and handle context limitations over a thousand tool calls
> Search: how agents find relevant information across thousands of articles streaming in
> Inference scaling: how agents benefit from scaling inference compute

More cool insights and deep dives in our paper 👇
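A toy sketch of the date-gated, day-by-day replay loop described above (every class and method name here is invented for illustration, not FutureSim's actual code):

```python
import datetime

def run_simulation(agent, archive, questions,
                   start=datetime.date(2026, 1, 1), days=90):
    """Replay the web day by day with date-gated article access (sketch).

    `agent`, `archive`, and `questions` are hypothetical interfaces:
    the archive only serves articles published up to the current date.
    """
    forecasts = {}
    current = start
    for _ in range(days):
        # Date-gating: only articles published so far are searchable.
        visible = archive.published_on_or_before(current)
        for q in questions.open_on(current):
            if agent.wants_to_answer(q, visible):
                forecasts[q.id] = agent.forecast(q, visible)
        current += datetime.timedelta(days=1)   # agent advances the clock
    return forecasts
```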
Kyoung Whan Choe retweeted
Tencent AI @TencentAI_News ·
We spent 6 months on one problem: agents losing context in long sessions. We ended up building and open-sourcing an agent memory system. A few things we learned:
🪄 compressing stale context mid-session cut token usage by 61% (rough sketch below)
🪄 giving agents a structured task map (mermaid-based) made them far less likely to lose track in 30+ step workflows
🪄 persona coherence jumped from 48% to 76% once we added a dedicated persona memory repo
👉 github.com/Tencent/Tencen…
Agent memory is genuinely hard and we don't have all the answers. Happy to dig into architecture, benchmarks, tradeoffs, whatever. AMA 👇 The @TencentDBAbxo2 team is here to talk about it.
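A rough sketch of what mid-session compression of stale context can look like; the token heuristic and numbers are illustrative, not Tencent's implementation (see the repo for the real design):

```python
def compress_stale_context(messages, summarize, keep_recent=10,
                           max_tokens=100_000):
    """Replace older turns with a summary once the budget is exceeded.

    Illustrative sketch: `summarize` is any LLM call that condenses a
    list of messages into one string; thresholds here are made up.
    """
    total = sum(len(m["content"]) // 4 for m in messages)  # crude token count
    if total <= max_tokens:
        return messages
    stale, recent = messages[:-keep_recent], messages[-keep_recent:]
    summary = {"role": "system",
               "content": "Summary of earlier turns: " + summarize(stale)}
    return [summary] + recent
```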
Kyoung Whan Choe retweeted
RLWRLD @RLWRLD_ai ·
🌉 The Night "Hands" Took Center Stage in San Francisco

On May 13, leaders from across the Korean, Japanese, and U.S. humanoid robotics industry gathered in San Francisco for Dexterity Night in SF — the global debut of RLWRLD's robotics foundation model, RLDX-1. NVIDIA. WIRobotics. Enactic. Origami Robotics. Proception. People who would normally be working oceans apart were exchanging business cards in the same room.

RLWRLD CEO Junghee Ryu opened with a line that framed the entire evening: "The real bottleneck in the humanoid era isn't cognition — it's the hand." As RLDX-1's demo reel played on screen, NVIDIA's Amit Goel followed with a short but resonant remark: "RLWRLD is one of the key partners in the physical AI ecosystem we're building at NVIDIA." (@NVIDIARobotics) In his closing, Ryu added: "Today is only the starting point of a long road toward the 4D+ World Model."

Long after the program ended, conversations carried on in the lounge. A night about robotic hands had quietly turned into one built by human ones — new partners, new investors, new collaborators. Next stops: Japan and Korea. See you there.

#RLWRLD #RLDX-1 #DexterityNight #PhysicalAI #Humanoid #NVIDIA #Robotics
Kyoung Whan Choe retweeted
elie @eliebakouch ·
we let opus 4.7 and gpt 5.5 run on the nanogpt optimizer speedrun: ~10k runs, 14k H200 hours, 23.9B tokens. opus hits 2930, codex 2950, both beating the human baseline of 2990. we cover claude autonomy failures, codex high compute usage, and much more: primeintellect.ai/auto-nanogpt
Prime Intellect @PrimeIntellect

Automating AI research is the next major step in AI. We let Claude Code (Opus 4.7) and Codex (GPT 5.5) run autonomously on the nanoGPT speedrun optimizer track using our idle compute. ~10k runs, ~14k H200 hours. Opus now holds the record at 2930 steps vs the 2990 human baseline.

Kyoung Whan Choe retweeted
Dhruv Diddi @DhruvDiddi ·
The "Dexterity Gap" is officially closing. Spent an incredible evening witnessing robots with true dexterity actually performing in the @RLWRLD_ai. We’ve moved past the era of canned demos seeing high-precision manipulation driven by real-world foundational models is a game-changer for the industry. A massive congratulations to @drjungheeryu and the entire team for hosting a masterclass and showcase. The RLDX-1 foundation model is easily one of the most impressive pieces of architecture out there right now, proving that embodied intelligence is no longer just a laboratory pursuit. It was great to bump into so many legends of the game! @FactoryIntelC @saturdayrobotic @OpenArms @OrigamiRobotics @SHACK15sf @Solo__Tech The energy in the ecosystem is unmatched, and I’m looking forward to keeping this momentum for 2026 going! #PhysicalAI #Robotics #SoloTech #DeepLearning #Innovation #SoloSeven
Kyoung Whan Choe retweeted
Nathan Lambert @natolambert ·
Work led by @jacobcares showed that little of the compute for building an LLM actually goes into the final runs. The vast majority of compute goes to developing a recipe. Creating the recipe openly is a huge lever in making sure the research community's compute pushes toward new knowledge.
Ai2 @allen_ai

Today we’re bringing new NSF OMAI compute online with NVIDIA Blackwell Ultra-powered systems, turning a $152M national investment from @NSF & @NVIDIA into a foundation for truly open AI research. 🧵

Kyoung Whan Choe retweeted
Ai2 @allen_ai ·
MolmoAct 2 is easier to guide in real-world settings. Prompt it in natural language, and it responds well to different phrasings thanks to language re-annotation across the robotics training data. MolmoAct 2-Think goes further with adaptive depth perception tokens for stronger spatial reasoning. 👇
Kyoung Whan Choe retweeted
Jiayi Weng @Trinkle23897 ·
Codex grew programmatic policies with no neural nets: max score on Breakout, and SOTA-level scores on MuJoCo. Maybe heuristics were not too weak. Maybe they were just too expensive to maintain. Maybe it's the next paradigm. trinkle23897.github.io/learning-beyon…
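For flavor, here is a trivial hand-written Breakout-style controller showing what a programmatic, neural-net-free policy looks like (a toy illustration, not the policies Codex actually evolved):

```python
def breakout_policy(ball_x: float, ball_dx: float, paddle_x: float) -> int:
    """Toy hand-coded policy: move the paddle toward where the ball is
    heading. Returns -1 (left), 0 (stay), or +1 (right).
    Illustrative only; not from the linked post.
    """
    target = ball_x + 0.5 * ball_dx      # crude one-step lead on the ball
    if paddle_x < target - 0.01:
        return 1
    if paddle_x > target + 0.01:
        return -1
    return 0

print(breakout_policy(ball_x=0.7, ball_dx=0.1, paddle_x=0.3))  # 1 (go right)
```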
Kyoung Whan Choe retweeted
Stella Li @StellaLisy ·
What do rubrics actually learn to do?
🫪 Early: "Correctly applies the perimeter formula."; "Correctly calculates the maximum area."
💪 Trained: "The answer is the correct maximum area of 144, derived from the given perimeter of 48." (weight: 0.80)
Evaluation shifts from verifying a proof to checking a number. A 1.7B judge can do that reliably. Vague abstract labels drop from 21.9% → 0.3% over training.
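A minimal sketch of how weighted, concrete rubric items like the trained one above can be aggregated into a reward by a small judge (the structure and weights here are illustrative):

```python
def rubric_reward(response: str, rubric: list[tuple[str, float]],
                  judge) -> float:
    """Weighted rubric scoring (illustrative sketch).

    `judge` is any small model (e.g., ~1.7B) answering yes/no to each
    concrete criterion; weights would be learned during training.
    """
    total = sum(w for _, w in rubric)
    earned = sum(w for criterion, w in rubric
                 if judge(criterion, response))   # binary per-item check
    return earned / total if total else 0.0

rubric = [("The answer is the correct maximum area of 144, "
           "derived from the given perimeter of 48.", 0.80),
          ("The final answer is clearly stated.", 0.20)]
print(rubric_reward("The maximum area is 144.", rubric,
                    judge=lambda c, r: "144" in r))  # 1.0
```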
Kyoung Whan Choe retweeted
Zecheng Zhang @zechengzh ·
Introducing Mirage, a unified virtual filesystem for AI agents! 6 weeks. 1.1M+ lines of code.

We rewrote bash from the ground up so cat, grep, head, and pipes work across heterogeneous services. S3, Google Drive, Slack, Gmail, GitHub, Linear, Notion, Postgres, MongoDB, SSH, and more, all mounted side by side as one filesystem.

Bash that AI agents already know works on every format! cat, grep, head, and wc parse .parquet, .csv, .json, .h5, even .wav! One pipe can stitch S3, Drive, GitHub, Slack, and Linear together, with the same Unix semantics throughout.

Workspaces are versioned too. Snapshot, clone, and roll back the whole thing with one API call. A two-layer cache turns repeated reads into local lookups, so agent loops stay fast and cheap.

Drop a Workspace into FastAPI, Express, or a browser app. Wire it into OpenAI Agents SDK, Vercel AI SDK, LangChain, Mastra, or Pi. Run it alongside Claude Code and Codex.

Site: strukto.ai/mirage
GitHub: github.com/strukto-ai/mir…

#AIAgents #OpenSource #AgenticAI #Strukto #Filesystem #VFS
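To make the mounted-namespace idea concrete, here is a tiny self-contained toy, not Mirage's API: two fake services mounted under one namespace, with a grep-like filter over the merged view:

```python
class ToyVFS:
    """Tiny toy of the mounted-namespace idea; NOT Mirage's interface."""
    def __init__(self):
        self.mounts = {}                          # mount point -> backend

    def mount(self, prefix: str, backend: dict):
        self.mounts[prefix] = backend             # e.g. fake S3, fake Drive

    def cat(self, path: str) -> str:
        prefix, _, name = path.lstrip("/").partition("/")
        return self.mounts["/" + prefix][name]

vfs = ToyVFS()
vfs.mount("/s3", {"app.log": "INFO ok\nERROR disk full\n"})
vfs.mount("/drive", {"notes.txt": "ERROR retry later\n"})

# Toy equivalent of `cat /s3/app.log /drive/notes.txt | grep ERROR`:
merged = vfs.cat("/s3/app.log") + vfs.cat("/drive/notes.txt")
print("\n".join(l for l in merged.splitlines() if "ERROR" in l))
```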
Kyoung Whan Choe retweeted
Cheng Qian @qiancheng1231 ·
⚖️ A surprising finding: stronger reasoning does not automatically mean stronger creativity. Models good at conventional reasoning are not always best at discovering unconventional affordances. Scaling helps, but quickly saturates. Standard CoT brings limited gains. Creative intelligence is not simply “more reasoning.” It is a distinct ability to reframe what the world makes possible.
Kyoung Whan Choe retweeted
RLWRLD @RLWRLD_ai ·
(1/12) Hi, we are #RLWRLD (ReaL WoRLD). RLDX-1 is live. Dexterity is intelligence, and it lives in #RLDX (RealDex). A dexterity-first foundation model for robot hands that builds muscle memory through motion, history, and contact.

- Across 10 dexterous real-world tasks on the ALLEX humanoid and the DROID setup, RLDX-1 outperforms π₀.₅ and GR00T N1.6 by ~2×.
- SOTA on 8 simulation benchmarks in LIBERO, SIMPLER, and RoboCasa.

Everything ships today: training and inference code, the pre-trained model, mid-trained checkpoints, and a fine-tuned checkpoint for every reported benchmark. LoRA recipes for parameter-efficient adaptation to your robot. Supports LeRobot v2.1 datasets. A one-line CLI to toggle motion / memory / physics modules per embodiment. Full technical report with ablations.

Not just our cool demos. We document the architecture, data, and design decisions behind every result. Bring your own robot. Train and deploy RLDX-1 on it.

🧵 What's inside RLDX-1, in 12 posts.
🌐 rlwrld.ai/rldx-1
📄 arxiv.org/abs/2605.03269
💻 github.com/RLWRLD/RLDX-1
🤗 huggingface.co/collections/RL…
Kyoung Whan Choe retweeted
RLWRLD @RLWRLD_ai ·
Today, RLWRLD unveils RLDX-1 — our proprietary Robotics Foundation Model. Across all 8 public benchmarks, RLDX-1 outperforms leading SOTA models including #NVIDIA #GR00T and Physical Intelligence #π0 — delivering state-of-the-art performance among open robotics foundation models.

🎯 A 'Dexterity-First' Philosophy
The industry assumes dexterity will follow once intelligence is solved. We see it the other way around. Dexterity isn't downstream of intelligence — it's the path intelligence must take to act in the physical world. Real industrial work with five-finger robotic hands depends on signals vision alone can't capture: force (torque), tactile feedback, and the precise moment of contact.

🧠 MSAT — Multi-Stream Action Transformer
Where conventional VLAs collapse every input into a single transformer stream, MSAT gives each modality — vision, language, action, touch, memory — its own dedicated stream, then unifies them through joint attention. Force, tactile signals, and long-term memory are handled by purpose-built Physics and Memory modules. The result: one model that can see, feel, remember, and adapt. (A rough sketch of the multi-stream idea follows at the end of this post.)

📊 Performance Highlights
- RoboCasa Kitchen — 70.6: the first VLA model to cross the 70-point threshold
- GR-1 Tabletop — 58.7: +10.7 percentage points over NVIDIA GR00T N1.6
- LIBERO-Plus — 86.7%: top score across 7 robustness variables
- Pot-to-Cup Pouring on WIRobotics ALLEX — 70.8%: nearly 2× the comparison models, which remained in the high-30% range

We're also releasing DexBench — our industry-grounded benchmark for dexterous manipulation, defined across five domains: Grasp Diversity, Spatial Precision, Temporal Precision, Contact Precision, and Context Awareness.

🔓 Open Release
Three checkpoints (8.1B parameters each), live now on GitHub and Hugging Face:
- RLDX-1-PT — pre-training
- RLDX-1-MT-ALLEX — mid-training for ALLEX
- RLDX-1-MT-DROID — mid-training for DROID

⚙️ Built on NVIDIA's Cloud-to-Edge Stack
Training and simulation on Isaac GR00T, Isaac Lab, Isaac Sim, and cuRobo. Compute on NVIDIA H100 and A100 GPUs. Edge inference on Jetson AGX Thor with TensorRT. Our collaborations with NVIDIA, AWS, and Microsoft continue across both research and deployment.

🌍 What's Next: The 4D+ World Model
Video-based world models will never surface what isn't in the pixels — contact torque, tactile signals, robot state. Our 4D+ World Model integrates these directly with vision, language, and action across the temporal dimension, predicting and generating the full physical world. RLDX-1 is the first milestone on that roadmap.

📍 Join us at Dexterity Night in San Francisco on May 13 — followed by launch events in Japan and Korea.
🔗 Explore RLDX-1 on GitHub and Hugging Face. rlwrld.ai/ko/rldx-1

#RLWRLD #RLDX1 #PhysicalAI #RoboticsFoundationModel #VLA #Humanoid #Dexterity #FoundationModel #Robotics #AI
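A rough sketch of the multi-stream-then-joint-attention idea described in the MSAT section (shapes and module choices are assumptions inferred from this post, not the released RLDX-1 architecture):

```python
import torch
import torch.nn as nn

class MultiStreamBlock(nn.Module):
    """Per-modality streams unified by joint attention (illustrative).

    Assumption-level sketch of the MSAT idea from the post, not the
    actual RLDX-1 code: each modality gets a dedicated encoder stream,
    then all tokens attend jointly.
    """
    def __init__(self, d: int, modalities: list[str], heads: int = 8):
        super().__init__()
        self.streams = nn.ModuleDict({
            m: nn.TransformerEncoderLayer(d, heads, batch_first=True)
            for m in modalities})                    # dedicated streams
        self.joint = nn.TransformerEncoderLayer(d, heads, batch_first=True)

    def forward(self, tokens: dict[str, torch.Tensor]) -> torch.Tensor:
        per_stream = [self.streams[m](x) for m, x in tokens.items()]
        return self.joint(torch.cat(per_stream, dim=1))  # joint attention

mods = ["vision", "language", "action", "touch", "memory"]
block = MultiStreamBlock(256, mods)
out = block({m: torch.randn(2, 10, 256) for m in mods})
print(out.shape)  # torch.Size([2, 50, 256])
```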