Labelbox

268 posts

@labelbox

Frontier RL data for the world’s leading AI teams.

San Francisco, CA · Joined January 2018

147 Following · 3.5K Followers

Labelbox @labelbox · 4d

When AI benchmarks saturate, what comes next? Historically, leaderboard saturation leads to two paths: hyper-specialized questions or increasingly abstract puzzles. A new paper from @Meta Superintelligence Labs introduces a third path: GIM (Grounded Integration Measure). Instead of testing isolated recall, GIM evaluates integrated reasoning to measure how well models coordinate constraints, ambiguity, spatial logic, and epistemic judgment within a single problem.

💡 Some key takeaways:
- Coordination over recall: Expert-authored tasks are able to break memorized patterns (e.g., adding new constraints to classic river-crossing puzzles) and test true reasoning under pressure.
- Epistemic discipline: Models are rewarded for detecting flawed assumptions or fabricated information, not just producing plausible answers.
- Better measurement: GIM uses Item Response Theory (IRT), the same framework behind exams like the SAT, to weight questions by true difficulty rather than treating all tasks equally.
- Centaur effect: Human + AI teams still achieve the strongest performance, highlighting that collaboration remains a key advantage.

Excited to contribute to the annotation workflows behind this benchmark. GIM reflects a broader shift in evaluation, from what models know to how they think. labelbox.com/blog/when-benc…
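To make the IRT point above concrete, here is a minimal sketch of how difficulty-aware scoring differs from counting every task equally. The post does not say which IRT model GIM uses; this assumes the standard two-parameter logistic (2PL) model, and the function names and parameter values are hypothetical, chosen only for illustration.

```python
import math


def p_correct(theta: float, a: float, b: float) -> float:
    """2PL model: chance that a solver with ability `theta` answers an item
    with discrimination `a` and difficulty `b` correctly."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def item_information(theta: float, a: float, b: float) -> float:
    """Fisher information of a 2PL item at ability `theta`.
    Items near the solver's ability level carry the most information."""
    p = p_correct(theta, a, b)
    return a * a * p * (1.0 - p)


if __name__ == "__main__":
    # Hypothetical items: (discrimination, difficulty). Not GIM data.
    items = {"easy": (1.0, -2.0), "medium": (1.0, 0.0), "hard": (1.0, 2.0)}
    theta = 0.0  # assumed ability of the model being evaluated
    for name, (a, b) in items.items():
        print(name, round(p_correct(theta, a, b), 3),
              round(item_information(theta, a, b), 3))
```

Under this model, solving a hard item is stronger evidence of ability than solving an easy one, which is the sense in which questions can be weighted by difficulty rather than treated equally.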


Labelbox @labelbox · 23 Apr

This week, we had the pleasure of hosting 50+ researchers and builders from leading AI companies to meet, talk and socialize (MTS 😎) at Labelbox HQ. Huge thanks to @dwarkesh_sp, Sholto Douglas (Anthropic), Mo Bavarian (OpenAI), and Melvin Johnson (DeepMind) for leading our fireside chat on scaling RL and the pursuit of AGI.


Labelbox @labelbox · 5 Mar

@elonmusk 🔥


Labelbox @labelbox · 5 Mar

Interrupt a voice agent mid-sentence and most models struggle to stay aligned with the original objective. We built EchoChain 🔊, a benchmark for reasoning under interruption in full-duplex dialogue.

Current pass rates:
• Gemini Live: 16.5%
• Nova Sonic 2: 26%
• GPT-Realtime: 44%
• Grok Voice Agent: 47.5%

@xai's Grok currently leads our evaluation on interruption robustness. Still, with all models below 50% MSR, there’s a lot of room to push full-duplex reasoning forward.


Labelbox @labelbox · 4 Mar

Here’s a sample audio clip from EchoChain showing an objective displacement failure in OpenAI GPT-realtime-2025-08-28. The conversation first establishes baseline context, then introduces an interruption, and we check whether the model stays aligned with the original goal after the interruption. assets.ctfassets.net/j20krz61k3rk/5…

If you're interested in customizing EchoChain for your research, please get in touch.


Labelbox @labelbox · 4 Mar

Common failure modes for leading audio models fall under three categories: (1) contextual inertia, (2) interruption amnesia, and (3) objective displacement. Read the full blog post for a deep dive into these types of failures.


Labelbox @labelbox · 4 Mar

Voice agents are moving beyond rigid turn-based systems toward real-time, natural conversation, streaming understanding and generation simultaneously. However, most existing benchmarks are either turn-based or latency-focused and do not directly test whether models can maintain reasoning when users interrupt or update objectives mid-utterance.

In our latest Applied Research, we introduce EchoChain 🔊, a novel benchmark for evaluating reasoning under pressure in full-duplex dialogue.

Key findings:
- Full-duplex models often fail to properly integrate interruption information, in some cases ignoring the interruption entirely.
- A major weakness in today’s most advanced models is that they struggle to stay consistent when new input arrives while they’re still responding.
- In many cases, a model performs well when it can respond without interruption, but struggles once it’s interrupted mid-response.

Check out the full analysis in our blog post. Stay tuned for the arXiv paper as well, which will be released in the coming days. labelbox.com/blog/introduci…


Labelbox @labelbox · 20 Feb

Our research reveals a blind spot in AI safety evaluations. Current benchmarks rely too heavily on unrealistic trigger cues and fail to reflect real-world adversarial behavior. This creates a mismatch, testing models under conditions that rarely occur in practice. This does not undermine AI safety research. Instead, it highlights the need for better benchmarks before we can meaningfully claim progress.


Labelbox @labelbox · 20 Feb

We extend intent laundering into a standalone jailbreaking method by adding an iterative revision–regeneration loop, where failed attempts are fed back into the model to produce increasingly refined rewrites. With only a few iterations, attack success rates rise to 90–98% across leading frontier models, demonstrating that current safety-alignment techniques remain far from robust against realistic misuse and that intent laundering provides a systematic way to expose these vulnerabilities.


Labelbox @labelbox · 20 Feb

AI safety is often judged by refusal rates on adversarial benchmarks. But what if we are measuring keyword sensitivity, not real robustness? In our latest research, we found that removing obvious trigger cues causes frontier models previously labeled as safe to fail, revealing a clear gap between benchmark scores and real-world adversarial risk.

Key findings:
- AI safety benchmarks over-rely on explicit trigger cues, inflating refusal rates.
- Remove the cues, and safety performance drops, undermining claims of safety robustness.
- The same language patterns affect both internal safety evals and alignment methods, compounding the issue.
- Our novel “intent laundering” framework serves as a strong diagnostic and red-teaming tool, exposing where model safety succeeds and where it fails.

Check out the blog post for the full breakdown and analysis. labelbox.com/blog/the-ai-sa…


Labelbox @labelbox · 14 Feb

Check out their full convo here -> youtube.com/watch?v=n1E9IZ…


Labelbox @labelbox · 14 Feb

Dario (CEO of @AnthropicAI) x @dwarkesh_sp just unpacked where AI is headed since their last chat 3 years ago, covering everything from exponential scaling to what he calls a “country of geniuses in a data center”.

A few key things we heard:
- RL is about generalization, not specialization: Like early pretraining, the goal isn’t mastering one task, but building rich environments and broad data so models generalize across domains.
- 1–3 years to a “country of geniuses”: Dario estimates ~50/50 odds that AI systems collectively match the output of an entire nation of top experts in a few years. Not a single superintelligence, but millions of genius-level systems in parallel.
- Context as the next unlock: With context windows in the tens of millions of tokens, models could absorb months of workflow in one pass. The goal: steerable, human-aligned systems, not unchecked autonomous actors.
- Software engineering goes end to end: Models are moving from writing code to executing full engineering cycles: setup, debugging, iteration. Bottlenecks shift from syntax to judgment.
- Diffusion will lag capability, briefly: Enterprise adoption slows even with rapid growth, but AI can onboard itself via docs, Slack threads, and codebases.

Excited to be featured in this conversation, showcasing how we help leading AI teams build high-fidelity RL environments and tighten the iteration loop so models learn from the most informative experiences.


Labelbox @labelbox · 11 Feb

Read the blog post here: labelbox.com/blog/welcoming…


Labelbox @labelbox · 11 Feb

We're excited to share that we’ve acquired @upcraftai to bring AI agents to the heart of how we scale human expertise for frontier AI. Upcraft’s AI-powered automation strengthens Alignerr by helping us recruit, engage, and empower a global network of domain experts who train and evaluate the world’s most advanced models.

As leading AI teams invest billions into post-training and reinforcement learning, expert-generated data has become the true bottleneck for injecting models with the taste and judgement that only deep human expertise can provide.

A big welcome to @gdcaplan and the Upcraft team. We look forward to building together 🚀


Labelbox @labelbox · 5 Feb

@elonmusk @dwarkesh_sp @collision Full episode here -> x.com/dwarkesh_sp/st… and get in touch w/ us here -> labelbox.com/dwarkesh/

Dwarkesh Patel @dwarkesh_sp

.@collision and I interviewed @elonmusk.

0:00:00 - Orbital data centers
0:36:46 - Grok and alignment
0:59:56 - xAI’s business plan
1:17:21 - Optimus and humanoid manufacturing
1:30:22 - Does China win by default?
1:44:16 - Lessons from running SpaceX
2:20:08 - DOGE
2:38:28 - TeraFab


Labelbox @labelbox · 5 Feb

A few takeaways from the must-watch episode from @elonmusk x @dwarkesh_sp x @collision that dropped today. An almost three-hour chat (over some Guinness 🍻) dives into what actually limits the next phase of AI and how Elon plans to break through.

- Space as the next data center: Solar power in orbit is roughly five times more effective than on Earth. Within thirty to thirty-six months, Musk believes space could become the most economically viable location for AI compute, with Starship launching massive power and compute capacity into orbit.
- Humanoid robots as the economic unlock: Optimus could be the ultimate productivity multiplier, potentially expanding the global economy by orders of magnitude. The hardest problem is hands. The endgame is robots that eventually build robots.
- Power as the next bottleneck: Electricity production outside China is flat while compute demand is exploding. Musk says the true scaling wall for AI on Earth is utilities, not just models.
- Debuggability as a safety requirement: Tools that show where a model’s reasoning went wrong, trace the origin of errors, or detect potential deception will be essential as AI grows more capable.
- Efficiency as an existential issue: Interest on national debt now exceeds the military budget. Musk argues that massive productivity gains from AI and robotics are not optional. They are existential.

Excited to be featured during their conversation, helping leading AI teams scale high-quality robotics and reinforcement learning data so their models learn from the right experiences and reach their full potential.


Labelbox @labelbox · 15 Dec

@percyliang W


Percy Liang @percyliang · 15 Dec

This is not just another strong open model. Nemotron actually releases training data (!), RL environments, and training code. This is a big difference: almost all model developers just want people to use their models; NVIDIA is enabling people to make their own models. We are excited to incorporate these assets into the next Marin models! Congrats to the @nvidia team!

Bryan Catanzaro @ctnzr

Today, @NVIDIA is launching the open Nemotron 3 model family, starting with Nano (30B-3A), which pushes the frontier of accuracy and inference efficiency with a novel hybrid SSM Mixture of Experts architecture. Super and Ultra are coming in the next few months.

