
Case Study: Most AI agent evals are flawed.
They measure only final outputs. Real agents operate across 80–200+ actions, tools, and OS environments, where failure is gradual, not binary.
At Turing, we built a new evaluation framework:
- 900+ deterministic tasks
- 450+ parent–child pairs
- 1,800+ evaluable scenarios via prompt–execution swapping
- 6 domains, balanced across Windows, macOS, and Linux
- 40% open-source, 60% closed-source tools
Each task includes full telemetry (sketched below):
- screen recordings
- event logs (clicks, keystrokes, scrolls)
- timestamped screenshots
- structured prompts, subtasks, and metadata
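
For concreteness, here is a minimal sketch of what one task's telemetry bundle could look like as a record. Every field name and type here is an illustrative assumption, not Turing's actual schema:

```python
from dataclasses import dataclass, field

# Illustrative sketch only: field names and types are assumptions,
# not Turing's actual telemetry schema.

@dataclass
class UIEvent:
    timestamp_ms: int   # offset from task start
    kind: str           # "click" | "keystroke" | "scroll"
    payload: dict       # e.g. {"x": 412, "y": 88} or {"key": "Enter"}

@dataclass
class TaskTelemetry:
    task_id: str
    prompt: str                 # structured top-level instruction
    subtasks: list[str]         # ordered subtask descriptions
    os: str                     # "windows" | "macos" | "linux"
    recording_path: str         # full screen recording of the run
    screenshots: list[tuple[int, str]] = field(default_factory=list)  # (timestamp_ms, path)
    events: list[UIEvent] = field(default_factory=list)
    metadata: dict = field(default_factory=dict)
```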
The key idea: structured failure.
Instead of injecting errors, we create them by swapping execution and intent (see the sketch after this list):
- Parent prompt + Child execution
- Child prompt + Parent execution
This produces controlled, classifiable failures:
- Critical mistake
- Bad side effect
- Instruction misunderstanding
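
A minimal sketch of how the swap could generate labeled scenarios. The pair structure, function name, and grading flow are assumptions, not Turing's code:

```python
from enum import Enum

# Illustrative sketch of prompt–execution swapping; all names here are
# assumptions, not Turing's implementation.

class FailureClass(Enum):
    CRITICAL_MISTAKE = "critical_mistake"
    BAD_SIDE_EFFECT = "bad_side_effect"
    INSTRUCTION_MISUNDERSTANDING = "instruction_misunderstanding"

def swapped_scenarios(pairs):
    """Yield controlled-failure scenarios from parent–child task pairs.

    Each pair contributes two mismatched scenarios: the parent's prompt
    paired with the child's execution trace, and vice versa. Together
    with the two matched originals, 450+ pairs would plausibly account
    for the 1,800+ evaluable scenarios cited above.
    """
    for parent, child in pairs:
        yield {"prompt": parent["prompt"], "execution": child["events"],
               "swap": "parent_prompt_child_execution"}
        yield {"prompt": child["prompt"], "execution": parent["events"],
               "swap": "child_prompt_parent_execution"}
```

In this framing, each swapped scenario would then be graded into one of the failure classes above, since the mismatch between intent and execution is known by construction.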
With calibrated complexity (80–225 actions) and strict QA, this becomes a fully reproducible benchmark.
Result:
We can measure not just whether agents succeed, but:
- where they break
- how errors propagate
- how robust they are across real environments
Agents don’t fail at the answer. They fail in the process.
Read more case studies below.