Börje Karlsson (@tellarin) - Twitter Profile

Pinned Tweet

What if agents leverage recent LMM capabilities for general computer control, instead of using target-specific APIs? In our recent research, Cradle, we explore foundation agents acing diverse computer tasks, and demo their capabilities by playing RDR2. arxiv.org/abs/2403.03186


4

29

103

23.2K

Börje Karlsson retweeted

Jim Fan@DrJimFan·4d

I made Physical AutoResearch sound simple (conceptually), but it took a village to pull off and lots of design thinking into the robot /loopcraft. The hardest part is everything we need to setup *before* pressing Enter. Here's a behind-the-scene tour:
1. Safety harness
Letting 8 robots run unattended overnight means safety has to be more than a hint in the system prompt. ENPIRE hardwires it in 2 layers: (1) hard kinematic limit that trips an immediate task failure and auto-resets as soon as a robot leaves its safety envelope, and (2) a torque-limited compliant gripper so a bad contact or misaligned insertion ends in a safe stall, instead of crushing the robot or the object at hand. We make safety more conservative than usual so humans can sleep tight. In reality, we still need a few human operators to watch over the "robots of loving grace".
2. Definition of /done
An agent that can edit its own reward will game it for sure. ENPIRE fixes the goalposts before the fleet can move them. Here's the recipe: Collect a few minutes of success & failure demos -> Ask agent to write code using computer vision tools to classify success and measure against groundtruth -> Agent hill-climbs on classifier until reliably good -> This classifier becomes the real-time reward function that directly computes on sensor streams -> *Freeze* the reward function before AutoResearch. It's sacred, enshrined in a Gym env that no one can touch.
3. System telemetry design
Robot-seconds is by far the scarcest resource, followed by GPU-seconds, and finally tokens. We instrument all three and surface them to ENPIRE for live resource awareness rather than letting it hill-climb in a vacuum. We define:
- Mean Robot Utilization ("MRU"): the fraction of wall-clock time when the robot is actively executing an experiment. Otherwise the hardware is sitting idle and waiting for the next code commit.
- Mean Token Utilization ("MTU"): tokens consumed per minute, our proxy for how hard the agent is actually thinking. A low MTU means the agent is stalled, waiting on a robot rollout to finish instead of doing research.
- GPU utilization: fraction of wall-clock time when GPU is active.
... and evaluate on two budget-to-outcome metrics:
1. Tokens-to-Success: token budget the fleet burns to complete /goal.
2. Time-to-Success: wall-clock time to /goal
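
The utilization metrics defined in the thread above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not ENPIRE's implementation: the `Interval` type, the function names, and the choice to average per-robot busy fractions across the fleet are all hypothetical, and the thread does not say how activity is actually logged.

```python
from dataclasses import dataclass
from typing import Iterable, Sequence


@dataclass(frozen=True)
class Interval:
    """A span of activity, in seconds since the run started (hypothetical log format)."""
    start: float
    end: float


def busy_fraction(intervals: Iterable[Interval], wall_clock_s: float) -> float:
    """Fraction of wall-clock time covered by the intervals; overlaps are counted once."""
    if wall_clock_s <= 0:
        raise ValueError("wall_clock_s must be positive")
    covered = 0.0
    cur_start = cur_end = None
    for iv in sorted(intervals, key=lambda i: i.start):
        if cur_end is None or iv.start > cur_end:
            # Close the previous merged span (if any) and open a new one.
            if cur_end is not None:
                covered += cur_end - cur_start
            cur_start, cur_end = iv.start, iv.end
        else:
            # Overlapping or touching span: extend the current merged span.
            cur_end = max(cur_end, iv.end)
    if cur_end is not None:
        covered += cur_end - cur_start
    return covered / wall_clock_s


def mean_robot_utilization(per_robot: Sequence[Sequence[Interval]], wall_clock_s: float) -> float:
    """MRU, assumed here to be the average of each robot's busy fraction."""
    if not per_robot:
        raise ValueError("need at least one robot")
    return sum(busy_fraction(r, wall_clock_s) for r in per_robot) / len(per_robot)


def mean_token_utilization(total_tokens: int, wall_clock_s: float) -> float:
    """MTU: tokens consumed per minute of wall-clock time."""
    if wall_clock_s <= 0:
        raise ValueError("wall_clock_s must be positive")
    return total_tokens / (wall_clock_s / 60.0)
```

GPU utilization as described in the thread has the same shape as a single robot's busy fraction, so `busy_fraction` could be reused on GPU-active intervals.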

Jim Fan@DrJimFan

Today, we enable AutoResearch in the physical world for the first time! Introducing ENPIRE: we give 8 Codex agents a fleet of robots, an allocation of GPUs, and generous token budget. We set them free with a simple goal: solve the task as quickly as possible, keep the robots busy but stay safe, don't waste precious compute. Make no mistake. Then humans step aside and our watch begins. The robot fleet starts to come alive: they learn to look for visual clues, reset the scene, practice novel skills, tinker with control stack, read papers online, debate, reflect, get stuck, and try again directly on the hardware. All we did is to give Codex an API to the world of atoms, and the rest is emergence. ENPIRE is able to solve high-precision tasks like tying zip-ties, organizing fine pins, and installing GPUs all by itself. We also discovered a new type of "physical scaling": 8 robots exploring in parallel improves significantly faster than fewer ones. A part of our NVIDIA GEAR lab now self-improves tirelessly over night. We just read the reports in the morning. /goal: we all take a holiday and Jensen wouldn't even notice ;) We will be open-sourcing everything, so you can host your self-running robot lab at home too! Deep dive in the thread:


39

88

756

92.5K

Börje Karlsson@tellarin·3d

Thread by @JawardSesay_ on the details of LecturaAgents, including a demo video.

Jaward Sesay@JawardSesay_

our preprint is out! we attempt to model human teaching behaviors into agents yielding a unified framework that enables adaptive personalized learning experiences from end-to-end: 🧵


0

1

100

Börje Karlsson retweeted

DailyPapers@HuggingPapers·3d

LectūraAgents
Brings a professor-student dynamic to AI learning. A multi-agent team researches, plans, and delivers embodied lessons with adaptive handwriting, highlights, and speech-aligned actions for truly personalized education.


1

7

22

1.8K

Börje Karlsson@tellarin·4d

Very, very cool!

Jim Fan@DrJimFan

Today, we enable AutoResearch in the physical world for the first time! Introducing ENPIRE: we give 8 Codex agents a fleet of robots, an allocation of GPUs, and generous token budget. We set them free with a simple goal: solve the task as quickly as possible, keep the robots busy but stay safe, don't waste precious compute. Make no mistake. Then humans step aside and our watch begins. The robot fleet starts to come alive: they learn to look for visual clues, reset the scene, practice novel skills, tinker with control stack, read papers online, debate, reflect, get stuck, and try again directly on the hardware. All we did is to give Codex an API to the world of atoms, and the rest is emergence. ENPIRE is able to solve high-precision tasks like tying zip-ties, organizing fine pins, and installing GPUs all by itself. We also discovered a new type of "physical scaling": 8 robots exploring in parallel improves significantly faster than fewer ones. A part of our NVIDIA GEAR lab now self-improves tirelessly over night. We just read the reports in the morning. /goal: we all take a holiday and Jensen wouldn't even notice ;) We will be open-sourcing everything, so you can host your self-running robot lab at home too! Deep dive in the thread:


0

2

110

Börje Karlsson retweeted

Qianhui Wu@5000hui·12 May

Excited to introduce WebHarbor! 🌟 ⛑️Mirrors real sites into local Docker environments that are stable and RL-ready. ✅Ship with all 15 WebVoyager sites for reproducible and CAPTCHA-free evaluation. 📢Scaling to 100+ sites next, call for contributors!

Zhaoyang Wang@zhaoywang_CS

Introducing WebHarbor ⚓, an open community effort to dock real websites into local, deterministic, and evolving environments for web agent research. 🌐
Come help us build it. 🤝 Contribute new web environments or fix existing ones, and you will be included in the author list! ✍️
🎉 First release: 15 multimodal, high-fidelity environments covering all 643 WebVoyager tasks, with full frontend, backend, database, and auth, all in one lightweight Docker image.
Why? Web agent eval today is broken😦: reCAPTCHA, geo-blocks, content drift, flaky networks, and login-gated deep features (e.g., account and checkout) that benchmarks can't touch. Live sites can't be reset either, making online agent RL impractical. Again, the bottleneck isn't the agent. It's the environment.
WebHarbor: dock real websites into stable, reproducible local mirrors with sub-second reset.
But here's the key 🌱: you can't clone the entire web upfront, and you don't need to. WebHarbor evolves with the agent: as harder tasks arrive, environments grow to support them. Coding agents (e.g., Claude Code/CodeX) build mirrors fast; human reviewers catch what coding agents hack (shortcuts, leaks, fake completions).
We need you. 🙌 Help us scale to 100+ and beyond:
🔨 Contribute a new web environment
🐛 Fix or improve existing mirrors
🔍 Audit task fidelity & interaction realism
See more details and join the effort:
- 🏠 Project Page: aiming-lab.github.io/webharbor.gith…
- 💻 GitHub Repo: github.com/aiming-lab/Web…
- 📝 Contribution Form: forms.gle/ngcD1rzAfUEphN…
Let's build the open-source environment infrastructure for GUI web agents! ⚓
Initiating institutions: UNC-Chapel Hill ✖️Microsoft
#AIAgents #WebAgents #LLM #OpenSource #AgenticAI


0

5

12

729

Börje Karlsson retweeted

Danfei Xu@danfei_xu·6 May

The thing about robotics is that you cannot "solve X to solve robotics". You have to solve robotics.


8

17

149

16.7K

Börje Karlsson retweeted

张小珺 Xiaojun Zhang@zhang_benita·3 May

The era of large language models has moved past its first act—the chat era—and entered its second act: the age of Agents. On this show, we’ll dive deep into the core technical principles of Agents and break down the technology for you, offering a clear overview of its evolutionary trajectory. If you enjoy our show, we’d appreciate it if you could leave us a 5‑star rating on Apple Podcasts🤓🤓 podcasts.apple.com/cn/podcast/%E5…


14

49

272

77K

Börje Karlsson retweeted

Grady Booch@Grady_Booch·3 May

I’ve come to the conclusion that those who are pushing agentic systems have at least three glaring holes in their approaches:
Most are entirely ignorant of the existing literature in this space, both from early AI to biological studies of swarms to complex systems theory. This is not a new landscape.
Orchestration among agents is either treated as an afterthought or via extremely naive centralized architectures. This BTW is why I am a fan of blackboard architectures as pioneered in Hearsay years ago and in @BernardJBaars global workspace theory.
There exist many flavors of agents and yet most today are little more than trivial input/output mappings.
This is fertile ground, but most are planting seeds opportunistically in fallow ground, without consideration for where they may fall or how they may be nourished.


45

93

706

43.2K

Börje Karlsson@tellarin·3 May

@JosecarlosRY @HuggingPapers Code coming soon.


0

36

Robot_dark@Robotdarkg1d·3 May

@HuggingPapers Link of github : 404


1

0

64

Börje Karlsson retweeted

DailyPapers@HuggingPapers·2 May

ExoActor: Teaching robots through imagination
A framework that generates third-person videos of task execution and converts them into real humanoid behaviors. Scales to new scenarios without additional real-world data collection.


2

15

66

4.2K

Börje Karlsson retweeted

Niels Rogge@NielsRogge·26 Apr

FYI Claude Code is mostly a vibe-coded product (as they say, 100% written by Claude). It's the worst harness for Opus 4.6 among ANY harness on Terminal-Bench 2

Matt Pocock@mattpocockuk

I feel sorry for Claude Code I know they're not the one. I'm not overcommitting - not investing too hard I wonder if they know I'm pulling away


100

73

2.4K

448.1K

Börje Karlsson retweeted

(((ل()(ل() 'yoav))))👾@yoavgo·27 Apr

The big dilemma with teaching an "LLM course" is that it is really easy to get drawn into teaching the various technical things like efficiency tricks, attention variants, PPO vs GRPO, etc etc. But the real "meat" is not there, but in the data: data for pre-training, for mid-training, for SFT, for RL and for "reasoning", synthetic data, curated data, annotated data... cleaning, evaluating, improving, mixing, ... lots of stuff. but "data" is so much harder to teach: it is not "mathematic" or "algorithmic" like the technical things, and it is not clear what is the teachable thing there. it is also a lot less transparent than the technical topics, both because it is semi-secret, and also because it is also not appealing for publishing, for roughly the same reasons it is not appealing for teaching. so, what would you teach about data? what are the key lessons and insights one should know? any good papers or resources? good existing classes? blogs? hit me with what you have


54

56

827

59.1K

Börje Karlsson retweeted

Guanya Shi@GuanyaShi·25 Mar

I’m so tired of writing rebuttals to this kind of “lack of novelty” review: “This paper trivially combines A, B, and C, so the algorithmic novelty is limited.”
Technically, most (if not all) robotics papers are convex combinations of existing ideas. I still deeply appreciate A+B+C papers, especially when they deliver:
- New capabilities: the “trivial combination” unlocks behaviors we simply couldn’t achieve before
- Sensible & organic design: A+B+C is clearly the right composition, not some arbitrary A′+B+C′
- Nontrivial interactions: careful analysis of the dynamics, coupling, or failure modes between A, B, C
- Rehabilitating old ideas: A was dismissed for years, but paired with modern B/C, it suddenly works, and teaches us why
- System-level & "interface" insight: the contribution is not any single piece, but how the pieces talk to each other
- Scaling laws or regimes: identifying when/why A+B+C works (and when it doesn’t)
- Engineering clarity: making something actually work robustly in the real world is not “trivial”
- New problem formulations: sometimes the real novelty is in the reformulation; only under this view does A+B+C make sense.
Maybe worth keeping these in mind when reviewing the next A+B+C paper : )


30

121

985

115.4K

Börje Karlsson retweeted

Gergely Orosz@GergelyOrosz·15 Mar

@QuinnyPig So AWS stopped releasing public postmortems and now they complain - without sharing any details - that the Financial Times was… correct? Make it make sense


4

11

277

33.5K

Börje Karlsson retweeted

Cohere Labs@Cohere_Labs·5 Mar

It's time to build with Tiny Aya! 🌟 We have over 20 teammates across Cohere and Cohere Labs excited to mentor 30+ community research projects from idea to execution! 💡 Have an idea? Come build it with us. Looking for one? We’ve got 20+ research directions suggested by mentors ready to be explored. Learn more and join a team before March 8th: cohere.link/Z0Fximx @weiyinko_ml @tomsherborne @evgenia_rusak @sam_cahyawijaya


0

6

22

1.6K

Börje Karlsson retweeted

Ge Zhang@GeZhang86038849·4 Mar

I believe that it's World Model Version OSWorld. It is definitely a correct direction to benchmark techs in Daily Real-World Scenarios.

Börje Karlsson@tellarin

Generalization in embodied AI will only happen when it can make effective use of existing real-world infrastructure, which is built for humans, not as clean APIs. Success requires more than “commonsense”, but current models can't yet handle this critical real-world scenario.


1

6

702

Börje Karlsson retweeted

IEEE Conference on Games@ieee_cog·4 Mar

📢 IEEE #CoG2026 is heading to Madrid! 🇪🇸
Join the premier venue for game-driven innovation, AI, human-computer interaction, and game design.
📍 @UCMccinf (@unicomplutense)
📅 Sept 1–4
⏰ Deadline: March 17
🔗 Details & submissions: cog2026.org/nuevaWeb/index…


0

5

12

422

Börje Karlsson retweeted

Qianhui Wu@5000hui·26 Feb

We've released the full package for GUI-Libra! 🌟
📂 Data/Model: huggingface.co/GUI-Libra
📄 Paper: arxiv.org/abs/2602.22190
🌐 Project: gui-libra.github.io
Happy to hear feedback from the community!

Rui Yang@RuiYang70669025

Collecting high-quality GUI trajectories for agent training is expensive. But are we fully leveraging the open-source data we already have? 🤔 ✨Introducing GUI-Libra (gui-libra.github.io): 81K high-quality, action-aligned reasoning dataset curated from open-source corpora, plus a tailored training recipe that combines action-aware SFT with step-wise RLVR-style training (⚠️partially verifiable rather than fully verifiable!). Result: stronger native GUI agents on both offline step-wise evaluation and online environments across mobile and web domains. Take away: With careful data curation + tailored post-training recipe, a small subset of open-source trajectories can still go a long way for training native GUI agents. Check out our paper (arxiv.org/abs/2602.22190) and code/dataset/model (github.com/GUI-Libra/GUI-…) for more details. #GUI #agent #VLM


0

7

21

3.7K

Börje Karlsson@tellarin·4 Mar

@ZhiruoW Great work! How do you see something like SWITCH fitting into this categorisation? We’re working on extending the benchmark further soon.

Börje Karlsson@tellarin

Generalization in embodied AI will only happen when it can make effective use of existing real-world infrastructure, which is built for humans, not as clean APIs. Success requires more than “commonsense”, but current models can't yet handle this critical real-world scenario.


0

2

38

Zora Wang@ZhiruoW·3 Mar

To track agent progress at real work, we release a database linking benchmarks <-> real occupations & skills: zorazrw.github.io/ai4work/
‼️We call for new submissions of:
- Agent benchmarks: guided by our 3 principles - work coverage, realism, and granular evaluation
- Open agent trajectories: to enable large-scale autonomy analysis


3

4

26

3K

Zora Wang@ZhiruoW·3 Mar

AI agents are tackling more and more "human work" But are they benchmarked on the work people actually do? tl;dr: Not really Most benchmarks focus on math & coding, while most human labor and capital lie elsewhere. 📒 We built a database linking agent benchmarks & real-world work Submit new tasks + agent trajectories today 🧵


21

79

403

63.1K

Börje Karlsson@tellarin·3 Mar

We'll also be presenting this work at the LLA Workshop @ ICLR'26, in Rio. See you there!


0

86

Börje Karlsson@tellarin·3 Mar

We’re iterating fast! Happy to collaborate with anyone interested in improving both the SWITCH benchmark and model capabilities!


2

0

53

Börje Karlsson@tellarin·3 Mar

Generalization in embodied AI will only happen when it can make effective use of existing real-world infrastructure, which is built for humans, not as clean APIs. Success requires more than “commonsense”, but current models can't yet handle this critical real-world scenario.


2

12

1.7K
