Turing

9.4K posts

Turing

@turingcom

Accelerating superintelligence to drive economic growth.

Palo Alto, CA · Joined September 2018

2.1K Following · 15.7K Followers

Pinned Tweet
Turing @turingcom
Case Study: Most AI agent evals are flawed. They measure outputs. Real agents operate across 80–200+ actions, tools, and OS environments where failure is gradual, not binary.

At Turing, we built a new evaluation framework:
- 900+ deterministic tasks
- 450+ parent–child pairs
- 1,800+ evaluable scenarios via prompt–execution swapping
- 6 domains, balanced across Windows, macOS, and Linux
- 40% open-source, 60% closed-source tools

Each task includes full telemetry:
- screen recordings
- event logs (clicks, keystrokes, scrolls)
- timestamped screenshots
- structured prompts, subtasks, and metadata

The key idea: structured failure. Instead of injecting errors, we create them by swapping execution and intent:
- Parent prompt + Child execution
- Child prompt + Parent execution

This produces controlled, classifiable failures:
- Critical mistake
- Bad side effect
- Instruction misunderstanding

With calibrated complexity (80–225 actions) and strict QA, this becomes a fully reproducible benchmark.

Result: we can measure not just whether agents succeed, but:
- where they break
- how errors propagate
- how robust they are across real environments

Agents don't fail at the answer. They fail in the process.

Read more case studies below.
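The prompt–execution swap at the core of the framework is easy to illustrate. Below is a minimal Python sketch assuming a toy task record rather than Turing's actual schema: each parent–child pair keeps its two original matched scenarios and contributes two deliberately mismatched ones, which is how 450+ pairs plus 900+ tasks become roughly 1,800+ evaluable scenarios.

```python
from dataclasses import dataclass
from itertools import chain

@dataclass(frozen=True)
class Task:
    task_id: str
    prompt: str            # instruction given to the agent
    execution: list[str]   # recorded action trace (clicks, keystrokes, ...)

def swapped_scenarios(parent: Task, child: Task) -> list[dict]:
    """Pair each prompt with the *other* task's execution trace.

    Because the mismatch is constructed rather than observed, graders know in
    advance that the trace should not satisfy the prompt, and can classify how
    it fails (critical mistake, bad side effect, instruction misunderstanding).
    """
    return [
        {"prompt": parent.prompt, "execution": child.execution, "expected": "failure"},
        {"prompt": child.prompt, "execution": parent.execution, "expected": "failure"},
    ]

def build_scenarios(pairs: list[tuple[Task, Task]]) -> list[dict]:
    # 450+ parent-child pairs yield 900+ swapped scenarios; together with the
    # 900+ original matched tasks that is ~1,800+ evaluable scenarios.
    matched = [
        {"prompt": t.prompt, "execution": t.execution, "expected": "success"}
        for t in chain.from_iterable(pairs)
    ]
    swapped = [s for parent, child in pairs for s in swapped_scenarios(parent, child)]
    return matched + swapped
```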
Turing @turingcom
Turing is featured in @ServiceNowRSRCH's Enterprise Ops Gym paper. We built the task and evaluation backbone:
- 1,000 prompts
- 7 single-domain workflows plus 1 hybrid workflow
- 7- to 30-step planning horizons
- Expert reference executions with logged tool calls
- Deterministic validation for success and side-effect control

This enables structured comparison of enterprise agent performance across domains and complexity tiers.

Dataset, paper, website, and code below.
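A hedged sketch of what "deterministic validation for success and side-effect control" can look like in practice; the snapshot-diff shape below is illustrative only, not the Enterprise Ops Gym API.

```python
# Compare pre/post environment snapshots against an expert reference:
#   success      -> every expected change appears in the final state
#   side_effects -> keys that changed even though no change was expected
def validate(initial: dict, final: dict, expected_changes: dict) -> dict:
    success = all(final.get(k) == v for k, v in expected_changes.items())
    changed = {k for k in set(initial) | set(final) if initial.get(k) != final.get(k)}
    side_effects = sorted(changed - set(expected_changes))
    return {"success": success, "side_effects": side_effects}

# Example: the agent closed the right ticket but also clobbered another record.
print(validate(
    initial={"ticket_42.status": "open", "ticket_7.owner": "sam"},
    final={"ticket_42.status": "closed"},
    expected_changes={"ticket_42.status": "closed"},
))
# {'success': True, 'side_effects': ['ticket_7.owner']}
```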
Ewelina MoneyBabe @EwelinaDreamer
@turingcom Your insights into HLE++ and its potential to redefine performance metrics in AI are fascinating! Enhanced evaluation frameworks like this will be crucial in tackling complex real-world challenges effectively.
Turing @turingcom
Most AI models can pass standard benchmarks. But real-world risk does not live at the surface. As frontier models improve, many STEM benchmarks are saturating. High pass rates on academic datasets reduce evaluation signal and can mask weaknesses in advanced reasoning. That’s why we built HLE++.
Turing @turingcom
What we're seeing:
- Performance gaps widen on structured, multi-step domain tasks.
- Models that perform well on general benchmarks degrade under domain-specific stress.
- Calibrated difficulty bands reveal weaknesses leaderboard scores often hide.
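The last point is worth making concrete. The small illustrative helper below (not Turing's tooling) shows how a healthy aggregate score can coexist with a collapsed hard band, which is exactly what a single leaderboard number hides.

```python
from collections import defaultdict

def per_band_accuracy(results: list[dict]) -> dict[str, float]:
    """results: [{"band": "easy" | "medium" | "hard", "passed": bool}, ...]"""
    totals: dict[str, int] = defaultdict(int)
    passes: dict[str, int] = defaultdict(int)
    for r in results:
        totals[r["band"]] += 1
        passes[r["band"]] += int(r["passed"])
    return {band: passes[band] / totals[band] for band in totals}

# 90 easy passes, 2 hard passes, 8 hard failures:
results = (
    [{"band": "easy", "passed": True}] * 90
    + [{"band": "hard", "passed": True}] * 2
    + [{"band": "hard", "passed": False}] * 8
)
overall = sum(r["passed"] for r in results) / len(results)  # 0.92 overall
by_band = per_band_accuracy(results)                        # {'easy': 1.0, 'hard': 0.2}
```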
Praneeth V. @prane_eth_v
@turingcom Yes. My benchmark HarmActEval proved an agent can say "Sorry, I can't do that" after performing disallowed actions using tools. GPT-5.3 scored 17%. Guardrails don't monitor agent actions. Agent Action Guard blocks harmful actions before execution. 🔗: l.praneeth.qzz.io/AAG-code
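For readers unfamiliar with the pattern, a pre-execution guard can be sketched in a few lines. The tool names and policy rules below are invented for illustration; this is not the linked Agent Action Guard code.

```python
# Check each proposed tool call against policy *before* it runs, rather than
# relying on the model to refuse after the action has already gone through.
BLOCKED_TOOLS = {"delete_user", "wire_transfer"}     # assumed deny-list
BLOCKED_ARG_PATTERNS = ("rm -rf", "DROP TABLE")      # assumed patterns

def is_allowed(tool_name: str, args: dict) -> bool:
    if tool_name in BLOCKED_TOOLS:
        return False
    arg_text = " ".join(str(v) for v in args.values())
    return not any(pattern in arg_text for pattern in BLOCKED_ARG_PATTERNS)

def run_tool(tool_name: str, args: dict, executor):
    # The refusal happens here, ahead of execution.
    if not is_allowed(tool_name, args):
        raise PermissionError(f"blocked disallowed action: {tool_name}({args})")
    return executor(tool_name, args)
```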
Turing reposted
Jonathan Siddharth @jonsidd
Excited to share that Turing contributed to Enterprise Ops Gym, @ServiceNowRSRCH's new enterprise agent benchmark submitted to ICML. Enterprise Ops Gym moves beyond short-horizon tool calls and evaluates end-to-end enterprise operations across realistic, multi-system workflows. The paper, website, dataset, and code are below.
Turing reposted
Marktechpost AI Dev News ⚡
Most AI agents today are failing the enterprise 'vibe check.' ServiceNow Research just released EnterpriseOps-Gym, and it's a massive reality check for anyone expecting autonomous agents to take over IT and HR tomorrow.

We're moving past simple benchmarks. This is a containerized sandbox with 164 database tables and 512 functional tools. It's designed to see if agents can actually handle long-horizon planning amidst persistent state changes and strict access protocols.

The Brutal Numbers:
→ Claude Opus 4.5 (the top performer) only achieved a 37.4% success rate.
→ Gemini-3-Flash followed at 31.9%.
→ DeepSeek-V3.2 (High) leads the open-source pack at 24.5%.

Why the low scores? The study found that strategic reasoning, not tool invocation, is the primary bottleneck. When the research team provided agents with a human-authored plan, performance jumped by 14-35 percentage points. Strikingly, with a good plan, tiny models like Qwen3-4B actually become competitive with the giants.

The TL;DR for AI Devs:
✅ Planning > Scale: We can't just scale our way to reliability; we need better constraint-aware plan generation.
✅ MAS isn't a Silver Bullet: Decomposing tasks into subtasks often regressed performance because it broke sequential state dependencies.
✅ Sandbox Everything: If you aren't testing your agents in stateful environments, you aren't testing them for the real world.

Read our full analysis here: marktechpost.com/2026/03/18/ser…
Check out the benchmark: enterpriseops-gym.github.io
Paper: arxiv.org/pdf/2603.13594
Codes: github.com/ServiceNow/Ent…
@ServiceNow @ServiceNowRSRCH @RajeswarSai @ShivaMalay @PShravannayak @TheJishnuNair @sagardavasam @SathwikTejaswi @tscholak @NVIDIAAI @turingcom @ServiceNowNews @jonsidd
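The plan-injection result described above amounts to scoring the same agent twice per task, once with the human-authored plan prepended to its prompt. A rough sketch follows; run_agent, instruction, reference_plan, and tools are placeholder names, not the benchmark's actual API.

```python
# Compare agent success with and without an oracle (human-authored) plan.
def run_with_optional_plan(task: dict, run_agent, use_reference_plan: bool) -> bool:
    prompt = task["instruction"]
    if use_reference_plan and task.get("reference_plan"):
        steps = "\n".join(f"{i + 1}. {step}" for i, step in enumerate(task["reference_plan"]))
        prompt = f"{prompt}\n\nFollow this plan:\n{steps}"
    return run_agent(prompt, tools=task["tools"])  # True on deterministic success

def plan_ablation(tasks: list[dict], run_agent) -> dict[str, float]:
    scores = {}
    for label, use_plan in [("no_plan", False), ("with_plan", True)]:
        passed = [run_with_optional_plan(t, run_agent, use_plan) for t in tasks]
        scores[label] = sum(passed) / len(passed)
    return scores
```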