Turing

9.4K posts

Turing

@turingcom

Accelerating superintelligence to drive economic growth.

Palo Alto, CA · Joined September 2018

2.1K Following · 15.7K Followers

Pinned Tweet
Turing @turingcom
Case Study: Most AI agent evals are flawed. They measure outputs. Real agents operate across 80–200+ actions, tools, and OS environments where failure is gradual, not binary.

At Turing, we built a new evaluation framework:
- 900+ deterministic tasks
- 450+ parent–child pairs
- 1,800+ evaluable scenarios via prompt–execution swapping
- 6 domains, balanced across Windows, macOS, and Linux
- 40% open-source, 60% closed-source tools

Each task includes full telemetry:
- screen recordings
- event logs (clicks, keystrokes, scrolls)
- timestamped screenshots
- structured prompts, subtasks, and metadata

The key idea: structured failure. Instead of injecting errors, we create them by swapping execution and intent:
- Parent prompt + Child execution
- Child prompt + Parent execution

This produces controlled, classifiable failures:
- Critical mistake
- Bad side effect
- Instruction misunderstanding

With calibrated complexity (80–225 actions) and strict QA, this becomes a fully reproducible benchmark.

Result: we can measure not just whether agents succeed, but:
- where they break
- how errors propagate
- how robust they are across real environments

Agents don't fail at the answer. They fail in the process.

Read more case studies below.
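The prompt–execution swap at the core of the framework is easy to illustrate. Below is a minimal Python sketch assuming a toy task record rather than Turing's actual schema: each parent–child pair keeps its two original matched scenarios and contributes two deliberately mismatched ones, which is how 450+ pairs plus 900+ tasks become roughly 1,800+ evaluable scenarios.

```python
from dataclasses import dataclass
from itertools import chain

@dataclass(frozen=True)
class Task:
    task_id: str
    prompt: str            # instruction given to the agent
    execution: list[str]   # recorded action trace (clicks, keystrokes, ...)

def swapped_scenarios(parent: Task, child: Task) -> list[dict]:
    """Pair each prompt with the *other* task's execution trace.

    Because the mismatch is constructed rather than observed, graders know in
    advance that the trace should not satisfy the prompt, and can classify how
    it fails (critical mistake, bad side effect, instruction misunderstanding).
    """
    return [
        {"prompt": parent.prompt, "execution": child.execution, "expected": "failure"},
        {"prompt": child.prompt, "execution": parent.execution, "expected": "failure"},
    ]

def build_scenarios(pairs: list[tuple[Task, Task]]) -> list[dict]:
    # 450+ parent-child pairs yield 900+ swapped scenarios; together with the
    # 900+ original matched tasks that is ~1,800+ evaluable scenarios.
    matched = [
        {"prompt": t.prompt, "execution": t.execution, "expected": "success"}
        for t in chain.from_iterable(pairs)
    ]
    swapped = [s for parent, child in pairs for s in swapped_scenarios(parent, child)]
    return matched + swapped
```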
Turing @turingcom
Turing is featured in @ServiceNowRSRCH's Enterprise Ops Gym paper. We built the task and evaluation backbone:
- 1,000 prompts
- 7 single-domain workflows plus 1 hybrid workflow
- 7- to 30-step planning horizons
- Expert reference executions with logged tool calls
- Deterministic validation for success and side-effect control

This enables structured comparison of enterprise agent performance across domains and complexity tiers.

Dataset, paper, website, and code below.
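A hedged sketch of what "deterministic validation for success and side-effect control" can look like in practice; the snapshot-diff shape below is illustrative only, not the Enterprise Ops Gym API.

```python
# Compare pre/post environment snapshots against an expert reference:
#   success      -> every expected change appears in the final state
#   side_effects -> keys that changed even though no change was expected
def validate(initial: dict, final: dict, expected_changes: dict) -> dict:
    success = all(final.get(k) == v for k, v in expected_changes.items())
    changed = {k for k in set(initial) | set(final) if initial.get(k) != final.get(k)}
    side_effects = sorted(changed - set(expected_changes))
    return {"success": success, "side_effects": side_effects}

# Example: the agent closed the right ticket but also clobbered another record.
print(validate(
    initial={"ticket_42.status": "open", "ticket_7.owner": "sam"},
    final={"ticket_42.status": "closed"},
    expected_changes={"ticket_42.status": "closed"},
))
# {'success': True, 'side_effects': ['ticket_7.owner']}
```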
Ewelina MoneyBabe @EwelinaDreamer
@turingcom Your insights into HLE++ and its potential to redefine performance metrics in AI are fascinating! Enhanced evaluation frameworks like this will be crucial in tackling complex real-world challenges effectively.
Turing @turingcom
Most AI models can pass standard benchmarks. But real-world risk does not live at the surface. As frontier models improve, many STEM benchmarks are saturating. High pass rates on academic datasets reduce evaluation signal and can mask weaknesses in advanced reasoning. That’s why we built HLE++.
Turing @turingcom
What we're seeing:
- Performance gaps widen on structured, multi-step domain tasks.
- Models that perform well on general benchmarks degrade under domain-specific stress.
- Calibrated difficulty bands reveal weaknesses leaderboard scores often hide.
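The last point is worth making concrete. The small illustrative helper below (not Turing's tooling) shows how a healthy aggregate score can coexist with a collapsed hard band, which is exactly what a single leaderboard number hides.

```python
from collections import defaultdict

def per_band_accuracy(results: list[dict]) -> dict[str, float]:
    """results: [{"band": "easy" | "medium" | "hard", "passed": bool}, ...]"""
    totals: dict[str, int] = defaultdict(int)
    passes: dict[str, int] = defaultdict(int)
    for r in results:
        totals[r["band"]] += 1
        passes[r["band"]] += int(r["passed"])
    return {band: passes[band] / totals[band] for band in totals}

# 90 easy passes, 2 hard passes, 8 hard failures:
results = (
    [{"band": "easy", "passed": True}] * 90
    + [{"band": "hard", "passed": True}] * 2
    + [{"band": "hard", "passed": False}] * 8
)
overall = sum(r["passed"] for r in results) / len(results)  # 0.92 overall
by_band = per_band_accuracy(results)                        # {'easy': 1.0, 'hard': 0.2}
```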
Praneeth V. @prane_eth_v
@turingcom Yes. My benchmark HarmActEval proved an agent can say "Sorry, I can't do that" after performing disallowed actions using tools. GPT-5.3 scored 17%. Guardrails don't monitor agent actions. Agent Action Guard blocks harmful actions before execution. 🔗: l.praneeth.qzz.io/AAG-code
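For readers unfamiliar with the pattern, a pre-execution guard can be sketched in a few lines. The tool names and policy rules below are invented for illustration; this is not the linked Agent Action Guard code.

```python
# Check each proposed tool call against policy *before* it runs, rather than
# relying on the model to refuse after the action has already gone through.
BLOCKED_TOOLS = {"delete_user", "wire_transfer"}     # assumed deny-list
BLOCKED_ARG_PATTERNS = ("rm -rf", "DROP TABLE")      # assumed patterns

def is_allowed(tool_name: str, args: dict) -> bool:
    if tool_name in BLOCKED_TOOLS:
        return False
    arg_text = " ".join(str(v) for v in args.values())
    return not any(pattern in arg_text for pattern in BLOCKED_ARG_PATTERNS)

def run_tool(tool_name: str, args: dict, executor):
    # The refusal happens here, ahead of execution.
    if not is_allowed(tool_name, args):
        raise PermissionError(f"blocked disallowed action: {tool_name}({args})")
    return executor(tool_name, args)
```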
Turing reposted
Jonathan Siddharth @jonsidd
Excited to share that Turing contributed to Enterprise Ops Gym, @ServiceNowRSRCH's new enterprise agent benchmark submitted to ICML. Enterprise Ops Gym moves beyond short-horizon tool calls and evaluates end-to-end enterprise operations across realistic, multi-system workflows. The paper, website, dataset, and code are below.
Turing reposted
Marktechpost AI Dev News ⚡
Most AI agents today are failing the enterprise 'vibe check.' ServiceNow Research just released EnterpriseOps-Gym, and it's a massive reality check for anyone expecting autonomous agents to take over IT and HR tomorrow.

We're moving past simple benchmarks. This is a containerized sandbox with 164 database tables and 512 functional tools. It's designed to see if agents can actually handle long-horizon planning amidst persistent state changes and strict access protocols.

The Brutal Numbers:
→ Claude Opus 4.5 (the top performer) only achieved a 37.4% success rate.
→ Gemini-3-Flash followed at 31.9%.
→ DeepSeek-V3.2 (High) leads the open-source pack at 24.5%.

Why the low scores? The study found that strategic reasoning, not tool invocation, is the primary bottleneck. When the research team provided agents with a human-authored plan, performance jumped by 14-35 percentage points. Strikingly, with a good plan, tiny models like Qwen3-4B actually become competitive with the giants.

The TL;DR for AI Devs:
✅ Planning > Scale: We can't just scale our way to reliability; we need better constraint-aware plan generation.
✅ MAS isn't a Silver Bullet: Decomposing tasks into subtasks often regressed performance because it broke sequential state dependencies.
✅ Sandbox Everything: If you aren't testing your agents in stateful environments, you aren't testing them for the real world.

Read our full analysis here: marktechpost.com/2026/03/18/ser…
Check out the benchmark: enterpriseops-gym.github.io
Paper: arxiv.org/pdf/2603.13594
Codes: github.com/ServiceNow/Ent…
@ServiceNow @ServiceNowRSRCH @RajeswarSai @ShivaMalay @PShravannayak @TheJishnuNair @sagardavasam @SathwikTejaswi @tscholak @NVIDIAAI @turingcom @ServiceNowNews @jonsidd
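The plan-injection result described above amounts to scoring the same agent twice per task, once with the human-authored plan prepended to its prompt. A rough sketch follows; run_agent, instruction, reference_plan, and tools are placeholder names, not the benchmark's actual API.

```python
# Compare agent success with and without an oracle (human-authored) plan.
def run_with_optional_plan(task: dict, run_agent, use_reference_plan: bool) -> bool:
    prompt = task["instruction"]
    if use_reference_plan and task.get("reference_plan"):
        steps = "\n".join(f"{i + 1}. {step}" for i, step in enumerate(task["reference_plan"]))
        prompt = f"{prompt}\n\nFollow this plan:\n{steps}"
    return run_agent(prompt, tools=task["tools"])  # True on deterministic success

def plan_ablation(tasks: list[dict], run_agent) -> dict[str, float]:
    scores = {}
    for label, use_plan in [("no_plan", False), ("with_plan", True)]:
        passed = [run_with_optional_plan(t, run_agent, use_plan) for t in tasks]
        scores[label] = sum(passed) / len(passed)
    return scores
```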