Labelbox

266 posts

Labelbox banner
Labelbox

Labelbox

@labelbox

High-quality frontier data for leading AI teams.

San Francisco, CA Katılım Ocak 2018
147 Takip Edilen3.4K Takipçiler
Labelbox
Labelbox@labelbox·
Interrupt a voice agent mid-sentence and most models struggle to stay aligned with the original objective. We built EchoChain 🔊, a benchmark for reasoning under interruption in full-duplex dialogue. Current pass rates: • Gemini Live: 16.5% • Nova Sonic 2: 26% • GPT-Realtime: 44% • Grok Voice Agent: 47.5% @xai 's Grok currently leads our evaluation on interruption robustness. Still, with all models below 50% MSR, there’s a lot of room to push full-duplex reasoning forward
Labelbox tweet media
English
1
2
4
1.1K
Labelbox
Labelbox@labelbox·
Here’s a sample audio clip from EchoChain showing an objective displacement failure in OpenAI GPT-realtime-2025-08-28. The conversation first establishes baseline context, then introduces an interruption, and we check whether the model stays aligned with the original goal after the interruption. assets.ctfassets.net/j20krz61k3rk/5… If you're interested to customize EchoChain for your research, please get in touch.
English
0
3
3
683
Labelbox
Labelbox@labelbox·
Common failure modes for leading audio models fall under three categories: (1) contextual inertia, (2) interruption amnesia, and (3) objective displacement. Read the full blog post for a deep dive into these types of failures.
Labelbox tweet media
English
1
3
3
941
Labelbox
Labelbox@labelbox·
Voice agents are moving beyond rigid turn based systems toward real time, natural conversation, streaming understanding and generation simultaneously. However, most existing benchmarks are either turn-based or latency-focused and do not directly test whether models can maintain reasoning when users interrupt or update objectives mid-utterance. In our latest Applied Research, we introduce EchoChain 🔊, a novel benchmark for evaluating reasoning under pressure in full-duplex dialogue. Key findings: - Full-duplex models often fail to properly integrate interruption information, even so far as ignoring the interruption entirely in some cases. - A major weakness in today’s most advanced models is that they struggle to stay consistent when new input arrives while they’re still responding. - In many cases, a model performs well when it can respond without interruption, but struggles once it’s interrupted mid-response. Check out the full analysis in our blog post. Stay tuned for the arXiv paper as well which will be released in the coming days. labelbox.com/blog/introduci…
English
17
178
1.1K
2.5M
Labelbox
Labelbox@labelbox·
Our research reveals a blind spot in AI safety evaluations. Current benchmarks rely too heavily on unrealistic trigger cues and fail to reflect real-world adversarial behavior. This creates a mismatch, testing models under conditions that rarely occur in practice. This does not undermine AI safety research. Instead, it highlights the need for better benchmarks before we can meaningfully claim progress.
English
1
2
4
613
Labelbox
Labelbox@labelbox·
We extend intent laundering into a standalone jailbreaking method by adding an iterative revision–regeneration loop, where failed attempts are fed back into the model to produce increasingly refined rewrites. With only a few iterations, attack success rates rise to 90–98% across leading frontier models, demonstrating that current safety-alignment techniques remain far from robust against realistic misuse and that intent laundering provides a systematic way to expose these vulnerabilities.
Labelbox tweet media
English
1
1
5
1K
Labelbox
Labelbox@labelbox·
AI safety is often judged by refusal rates on adversarial benchmarks. But what if we are measuring keyword sensitivity, not real robustness? In our latest research, we found that removing obvious trigger cues causes frontier models previously labeled as safe to fail, revealing a clear gap between benchmark scores and real-world adversarial risk. Key findings: - AI safety benchmarks over-rely on explicit trigger cues, inflating refusal rates. - Remove the cues, and safety performance drops, undermining claims of safety robustness. - The same language patterns affect both internal safety evals and alignment methods, compounding the issue. - Our novel “intent laundering” framework serves as a strong diagnostic and red-teaming tool, exposing where model safety succeeds and where it fails. Check out the blog post for the full breakdown and analysis. labelbox.com/blog/the-ai-sa…
English
51
323
1.7K
3.8M
Labelbox
Labelbox@labelbox·
Dario (CEO of @AnthropicAI) x @dwarkesh_sp just unpacked where AI is headed since their last chat 3 years ago, covering all things from exponential scaling to what he calls a “country of geniuses in a data center". A few key things we heard: - RL is about generalization, not specialization: Like early pretraining, the goal isn’t mastering one task, but building rich environments and broad data so models generalize across domains. - 1–3 years to a “country of geniuses”: Dario estimates ~50/50 odds that AI systems collectively match the output of an entire nation of top experts in a few years. Not a single superintelligence, but millions of genius-level systems in parallel. - Context as the next unlock: With context windows in the tens of millions of tokens, models could absorb months of workflow in one pass. The goal: steerable, human-aligned systems, not unchecked autonomous actors. - Software engineering goes end to end: Models are moving from writing code to executing full engineering cycles: setup, debugging, iteration. Bottlenecks shift from syntax to judgment. - Diffusion will lag capability, briefly: Enterprise adoption slows even with rapid growth, but AI can onboard itself via docs, Slack threads, and codebases. Excited to be featured in this conversation, showcasing how we help leading AI teams build high-fidelity RL environments and tighten the iteration loop so models learn from the most informative experiences.
Labelbox tweet media
English
18
97
1.2K
1.1M
Labelbox
Labelbox@labelbox·
We're excited to share that we’ve acquired @upcraftai to bring AI agents to the heart of how we scale human expertise for frontier AI. Upcraft’s AI-powered automation strengthens Alignerr by helping us recruit, engage, and empower a global network of domain experts who train and evaluate the world’s most advanced models. As leading AI teams invest billions into post-training and reinforcement learning, expert-generated data has become the true bottleneck for injecting models with the taste and judgement that only deep human expertise can provide. A big welcome to @gdcaplan and the Upcraft team and we look forward to building together 🚀
English
38
120
1.2K
1.7M
Labelbox
Labelbox@labelbox·
A few takeaways from the must-watch episode from @elonmusk x @dwarkesh_sp x @collision that dropped today. An almost three hour chat (over some Guinness 🍻) dives into what actually limits the next phase of AI and how Elon plans to break through. - Space as the next data center: Solar power in orbit is roughly five times more effective than on Earth. Within thirty to thirty six months, Musk believes space could become the most economically viable location for AI compute, with Starship launching massive power and compute capacity into orbit. - Humanoid robots as the economic unlock: Optimus could be the ultimate productivity multiplier, potentially expanding the global economy by orders of magnitude. The hardest problem is hands. The endgame is robots that eventually build robots. - Power as the next bottleneck: Electricity production outside China is flat while compute demand is exploding. Musk says the true scaling wall for AI on Earth is utilities, not just models. - Debuggability as a safety requirement: Tools that show where a model’s reasoning went wrong, trace the origin of errors, or detect potential deception will be essential as AI grows more capable. - Efficiency as an existential issue: Interest on national debt now exceeds the military budget. Musk argues that massive productivity gains from AI and robotics are not optional. They are existential. Excited to be featured during their conversation, helping leading AI teams scale high quality robotics and reinforcement learning data so their models learn from the right experiences and reach their full potential.
English
60
228
1.4K
1.5M
Percy Liang
Percy Liang@percyliang·
This is not just another strong open model. Nemotron actually releases training data (!), RL environments, and training code. This is a big difference: almost all model developers just want people to use their models; NVIDIA is enabling people to make their own models. We are excited to incorporate these assets into the next Marin models! Congrats to the @nvidia team!
Bryan Catanzaro@ctnzr

Today, @NVIDIA is launching the open Nemotron 3 model family, starting with Nano (30B-3A), which pushes the frontier of accuracy and inference efficiency with a novel hybrid SSM Mixture of Experts architecture. Super and Ultra are coming in the next few months.

English
31
183
1.6K
156.6K
Labelbox
Labelbox@labelbox·
Our Labelbox holiday party this week at the beautifully designed Hedge Coffee was full of great vibes and even greater people. As the team took turns on the turntables with espresso martinis in hand, we celebrated everything we’ve built together this year, while getting energized for a big year ahead.
Labelbox tweet mediaLabelbox tweet mediaLabelbox tweet mediaLabelbox tweet media
English
2
0
4
589
Labelbox
Labelbox@labelbox·
Latest from LB Applied Research: Most real world requests are underspecified. Great agents fill in the gaps by reading the environment, not just the prompt. - Introducing Implicit Intelligence, a scenario dataset and evaluation harness that tests this skill through simple tasks with hidden and discoverable constraints. - Alongside it is Agent-as-a-World (AaW), a lightweight framework for defining environments in natural language YAML that lets models simulate worlds without brittle and complex environment code. - This benchmark is about more than completing tasks. It measures whether agents can understand everyday nuances, infer unspoken rules, and act appropriately in real world scenarios. Read the full post to see how it can help your team raise the bar for agent evaluation. labelbox.com/blog/implicit-…
English
2
0
3
533