Börje Karlsson

5.1K posts

@tellarin

AI Researcher @BAAIBeijing, ex-@MSFTResearch Asia, @nokia INdT, @inovacao_cesar. Occasional politics, opinions are my own, RTs ≠ endorsements… 🇸🇪🇧🇷🇵🇹🇺🇦

Somewhere · Joined August 2009
257 Following · 442 Followers
Pinned Tweet
Börje Karlsson
Börje Karlsson@tellarin·
What if agents leverage recent LMM capabilities for general computer control, instead of using target-specific APIs? In our recent research, Cradle, we explore foundation agents acing diverse computer tasks, and demo their capabilities by playing RDR2. arxiv.org/abs/2403.03186
English
4
29
103
23.2K
Börje Karlsson retweeted
Jim Fan
Jim Fan@DrJimFan·
I made Physical AutoResearch sound simple (conceptually), but it took a village to pull off and lots of design thinking into the robot /loopcraft. The hardest part is everything we need to set up *before* pressing Enter. Here's a behind-the-scenes tour:
1. Safety harness
Letting 8 robots run unattended overnight means safety has to be more than a hint in the system prompt. ENPIRE hardwires it in 2 layers: (1) a hard kinematic limit that trips an immediate task failure and auto-resets as soon as a robot leaves its safety envelope, and (2) a torque-limited compliant gripper so a bad contact or misaligned insertion ends in a safe stall, instead of crushing the robot or the object at hand. We make safety more conservative than usual so humans can sleep tight. In reality, we still need a few human operators to watch over the "robots of loving grace".
2. Definition of /done
An agent that can edit its own reward will game it for sure. ENPIRE fixes the goalposts before the fleet can move them. Here's the recipe: Collect a few minutes of success & failure demos -> Ask agent to write code using computer vision tools to classify success and measure against ground truth -> Agent hill-climbs on classifier until reliably good -> This classifier becomes the real-time reward function that directly computes on sensor streams -> *Freeze* the reward function before AutoResearch. It's sacred, enshrined in a Gym env that no one can touch.
3. System telemetry design
Robot-seconds is by far the scarcest resource, followed by GPU-seconds, and finally tokens. We instrument all three and surface them to ENPIRE for live resource awareness rather than letting it hill-climb in a vacuum. We define:
- Mean Robot Utilization ("MRU"): the fraction of wall-clock time when the robot is actively executing an experiment. Otherwise the hardware is sitting idle and waiting for the next code commit.
- Mean Token Utilization ("MTU"): tokens consumed per minute, our proxy for how hard the agent is actually thinking. A low MTU means the agent is stalled, waiting on a robot rollout to finish instead of doing research.
- GPU utilization: fraction of wall-clock time when GPU is active.
... and evaluate on two budget-to-outcome metrics:
1. Tokens-to-Success: token budget the fleet burns to complete /goal.
2. Time-to-Success: wall-clock time to /goal
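The three telemetry definitions in the thread above can be sketched in a few lines. This is only an illustration under stated assumptions, not ENPIRE's code: the `Interval` type, the function names, the log layout, and the sample numbers are all hypothetical, busy intervals are assumed not to overlap, and "mean" is assumed to be an average over the robots in the fleet.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Interval:
    """A busy period, in seconds since the start of the run (hypothetical log format)."""
    start: float
    end: float


def busy_fraction(intervals, wall_clock_s):
    """Fraction of wall-clock time covered by non-overlapping busy intervals."""
    if wall_clock_s <= 0:
        raise ValueError("wall_clock_s must be positive")
    return sum(iv.end - iv.start for iv in intervals) / wall_clock_s


def mean_robot_utilization(per_robot_intervals, wall_clock_s):
    """MRU, assumed here to be the busy fraction averaged over all robots."""
    return mean(busy_fraction(ivs, wall_clock_s) for ivs in per_robot_intervals.values())


def mean_token_utilization(total_tokens, wall_clock_s):
    """MTU: tokens consumed per minute of wall-clock time."""
    return total_tokens / (wall_clock_s / 60.0)


# Made-up one-hour run with two robots and one GPU.
wall = 3600.0
robots = {
    "robot_0": [Interval(0, 600), Interval(900, 1800)],  # busy 1500 s
    "robot_1": [Interval(0, 900)],                       # busy 900 s
}
gpu_busy = [Interval(0, 1200)]

mru = mean_robot_utilization(robots, wall)
gpu_util = busy_fraction(gpu_busy, wall)
mtu = mean_token_utilization(90_000, wall)
print(f"MRU={mru:.3f} GPU={gpu_util:.3f} MTU={mtu:.0f} tokens/min")
```

Tokens-to-Success and Time-to-Success would then simply be the cumulative token count and elapsed wall-clock time at the moment the frozen reward function first reports success.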
Jim Fan@DrJimFan

Today, we enable AutoResearch in the physical world for the first time! Introducing ENPIRE: we give 8 Codex agents a fleet of robots, an allocation of GPUs, and a generous token budget. We set them free with a simple goal: solve the task as quickly as possible, keep the robots busy but stay safe, don't waste precious compute. Make no mistake. Then humans step aside and our watch begins. The robot fleet starts to come alive: they learn to look for visual clues, reset the scene, practice novel skills, tinker with control stack, read papers online, debate, reflect, get stuck, and try again directly on the hardware. All we did is to give Codex an API to the world of atoms, and the rest is emergence. ENPIRE is able to solve high-precision tasks like tying zip-ties, organizing fine pins, and installing GPUs all by itself. We also discovered a new type of "physical scaling": 8 robots exploring in parallel improves significantly faster than fewer ones. A part of our NVIDIA GEAR lab now self-improves tirelessly overnight. We just read the reports in the morning. /goal: we all take a holiday and Jensen wouldn't even notice ;) We will be open-sourcing everything, so you can host your self-running robot lab at home too! Deep dive in the thread:

English
39
88
756
92.5K
Börje Karlsson retweeted
DailyPapers
DailyPapers@HuggingPapers·
LectūraAgents brings a professor-student dynamic to AI learning. A multi-agent team researches, plans, and delivers embodied lessons with adaptive handwriting, highlights, and speech-aligned actions for truly personalized education.
English
1
7
22
1.8K
Börje Karlsson retweeted
Qianhui Wu
Qianhui Wu@5000hui·
Excited to introduce WebHarbor! 🌟 ⛑️Mirrors real sites into local Docker environments that are stable and RL-ready. ✅Ship with all 15 WebVoyager sites for reproducible and CAPTCHA-free evaluation. 📢Scaling to 100+ sites next, call for contributors!
Zhaoyang Wang@zhaoywang_CS

Introducing WebHarbor ⚓ — an open community effort to dock real websites into local, deterministic, and evolving environments for web agent research. 🌐 Come help us build it. 🤝 Contribute new web environments or fix existing ones — will be included in the author list! ✍️ 🎉 First release: 15 multimodal, high-fidelity environments covering all 643 WebVoyager tasks — full frontend, backend, database, and auth, all in one lightweight Docker image. Why? Web agent eval today is broken😦: reCAPTCHA, geo-blocks, content drift, flaky networks, and login-gated deep features (e.g., account and checkout) that benchmarks can't touch. Live sites can't be reset either — making online agent RL impractical. Again, the bottleneck isn't the agent. It's the environment. WebHarbor: dock real websites into stable, reproducible local mirrors with sub-second reset. But here's the key 🌱 — you can't clone the entire web upfront, and you don't need to. WebHarbor evolves with the agent: as harder tasks arrive, environments grow to support them. Coding agents (e.g., Claude Code/CodeX) build mirrors fast; human reviewers catch what coding agent hacks (shortcuts, leaks, fake completions). We need you. 🙌 Help us scale to 100+ and beyond: 🔨 Contribute a new web environment 🐛 Fix or improve existing mirrors 🔍 Audit task fidelity & interaction realism See more details and join the effort: - 🏠 Project Page: aiming-lab.github.io/webharbor.gith… - 💻 GitHub Repo: github.com/aiming-lab/Web… - 📝 Contribution Form: forms.gle/ngcD1rzAfUEphN… Let's build the open-source environment infrastructure for GUI web agents! ⚓ Initiating institutions: UNC-Chapel Hill ✖️Microsoft #AIAgents #WebAgents #LLM #OpenSource #AgenticAI

English
0
5
12
729
Börje Karlsson retweeted
Danfei Xu
Danfei Xu@danfei_xu·
The thing about robotics is that you cannot "solve X to solve robotics". You have to solve robotics.
English
8
17
149
16.7K
Börje Karlsson retweeted
张小珺 Xiaojun Zhang
张小珺 Xiaojun Zhang@zhang_benita·
The era of large language models has moved past its first act—the chat era—and entered its second act: the age of Agents. On this show, we’ll dive deep into the core technical principles of Agents and break down the technology for you, offering a clear overview of its evolutionary trajectory. If you enjoy our show, we’d appreciate it if you could leave us a 5‑star rating on Apple Podcasts🤓🤓 podcasts.apple.com/cn/podcast/%E5…
English
14
49
272
77K
Börje Karlsson retweeted
Grady Booch
Grady Booch@Grady_Booch·
I’ve come to the conclusion that those who are pushing agentic systems have at least three glaring holes in their approaches:
Most are entirely ignorant of the existing literature in this space, both from early AI to biological studies of swarms to complex systems theory. This is not a new landscape.
Orchestration among agents is either treated as an afterthought or via extremely naive centralized architectures. This BTW is why I am a fan of blackboard architectures as pioneered in Hearsay years ago and in @BernardJBaars global workspace theory.
There exist many flavors of agents and yet most today are little more than trivial input/output mappings.
This is fertile ground, but most are planting seeds opportunistically in fallow ground, without consideration for where they may fall or how they may be nourished.
English
45
93
706
43.2K
Börje Karlsson retweeted
DailyPapers
DailyPapers@HuggingPapers·
ExoActor: Teaching robots through imagination. A framework that generates third-person videos of task execution and converts them into real humanoid behaviors. Scales to new scenarios without additional real-world data collection.
English
2
15
66
4.2K
Börje Karlsson retweeted
(((ل()(ل() 'yoav))))👾
The big dilemma with teaching an "LLM course" is that it is really easy to get drawn into teaching the various technical things like efficiency tricks, attention variants, PPO vs GRPO, etc etc. But the real "meat" is not there, but in the data: data for pre-training, for mid-training, for SFT, for RL and for "reasoning", synthetic data, curated data, annotated data... cleaning, evaluating, improving, mixing, ... lots of stuff. but "data" is so much harder to teach: it is not "mathematic" or "algorithmic" like the technical things, and it is not clear what is the teachable thing there. it is also a lot less transparent than the technical topics, both because it is semi-secret, and also because it is also not appealing for publishing, for roughly the same reasons it is not appealing for teaching. so, what would you teach about data? what are the key lessons and insights one should know? any good papers or resources? good existing classes? blogs? hit me with what you have
English
54
56
827
59.1K
Börje Karlsson retweeted
Guanya Shi
Guanya Shi@GuanyaShi·
I’m so tired of writing rebuttals to this kind of “lack of novelty” review: “This paper trivially combines A, B, and C, so the algorithmic novelty is limited.” Technically, most (if not all) robotics papers are convex combinations of existing ideas. I still deeply appreciate A+B+C papers, especially when they deliver:
- New capabilities: the “trivial combination” unlocks behaviors we simply couldn’t achieve before
- Sensible & organic design: A+B+C is clearly the right composition, not some arbitrary A′+B+C′
- Nontrivial interactions: careful analysis of the dynamics, coupling, or failure modes between A, B, C
- Rehabilitating old ideas: A was dismissed for years, but paired with modern B/C, it suddenly works, and teaches us why
- System-level & "interface" insight: the contribution is not any single piece, but how the pieces talk to each other
- Scaling laws or regimes: identifying when/why A+B+C works (and when it doesn’t)
- Engineering clarity: making something actually work robustly in the real world is not “trivial”
- New problem formulations: sometimes the real novelty is in the reformulation; only under this view does A+B+C make sense.
Maybe worth keeping these in mind when reviewing the next A+B+C paper : )
English
30
121
985
115.4K
Börje Karlsson retweeted
Gergely Orosz
Gergely Orosz@GergelyOrosz·
@QuinnyPig So AWS stopped releasing public postmortems and now they complain - without sharing any details - that the Financial Times was… correct? Make it make sense
English
4
11
277
33.5K
Börje Karlsson retweeted
Cohere Labs
Cohere Labs@Cohere_Labs·
It's time to build with Tiny Aya! 🌟 We have over 20 teammates across Cohere and Cohere Labs excited to mentor 30+ community research projects from idea to execution! 💡 Have an idea? Come build it with us. Looking for one? We’ve got 20+ research directions suggested by mentors ready to be explored. Learn more and join a team before March 8th: cohere.link/Z0Fximx @weiyinko_ml @tomsherborne @evgenia_rusak @sam_cahyawijaya
English
0
6
22
1.6K
Börje Karlsson retweeted
Qianhui Wu
Qianhui Wu@5000hui·
We've released the full package for GUI-Libra! 🌟 📂 Data/Model: huggingface.co/GUI-Libra 📄 Paper: arxiv.org/abs/2602.22190 🌐 Project: gui-libra.github.io Happy to hear feedback from the community!
Rui Yang@RuiYang70669025

Collecting high-quality GUI trajectories for agent training is expensive. But are we fully leveraging the open-source data we already have? 🤔 ✨Introducing GUI-Libra (gui-libra.github.io): 81K high-quality, action-aligned reasoning dataset curated from open-source corpora, plus a tailored training recipe that combines action-aware SFT with step-wise RLVR-style training (⚠️partially verifiable rather than fully verifiable!). Result: stronger native GUI agents on both offline step-wise evaluation and online environments across mobile and web domains. Take away: With careful data curation + tailored post-training recipe, a small subset of open-source trajectories can still go a long way for training native GUI agents. Check out our paper (arxiv.org/abs/2602.22190) and code/dataset/model (github.com/GUI-Libra/GUI-…) for more details. #GUI #agent #VLM

English
0
7
21
3.7K
Zora Wang
Zora Wang@ZhiruoW·
To track agent progress at real work, we release a database linking benchmarks <-> real occupations & skills: zorazrw.github.io/ai4work/ ‼️We call for new submissions of: - Agent benchmarks: guided by our 3 principles - work coverage, realism, and granular evaluation - Open agent trajectories: to enable large-scale autonomy analysis
English
3
4
26
3K
Zora Wang
Zora Wang@ZhiruoW·
AI agents are tackling more and more "human work" But are they benchmarked on the work people actually do? tl;dr: Not really Most benchmarks focus on math & coding, while most human labor and capital lie elsewhere. 📒 We built a database linking agent benchmarks & real-world work Submit new tasks + agent trajectories today 🧵
English
21
79
403
63.1K
Börje Karlsson
Börje Karlsson@tellarin·
We'll also be presenting this work at the LLA Workshop @ ICLR'26, in Rio. See you there!
English
0
0
0
86
Börje Karlsson
Börje Karlsson@tellarin·
We’re iterating fast! Happy to collaborate with anyone interested in improving both the SWITCH benchmark and model capabilities!
English
2
0
0
53
Börje Karlsson
Börje Karlsson@tellarin·
Generalization in embodied AI will only happen when it can make effective use of existing real-world infrastructure, which is built for humans, not as clean APIs. Success requires more than “commonsense”, but current models can't yet handle this critical real-world scenario.
English
2
2
12
1.7K