Turing
@turingcom
9.4K posts

Accelerating superintelligence to drive economic growth.

Palo Alto, CA · Joined September 2018
2.1K Following · 15.7K Followers

Pinned Tweet
Turing @turingcom

Case Study: Most AI agent evals are flawed. They measure outputs. Real agents operate across 80–200+ actions, tools, and OS environments where failure is gradual, not binary.

At Turing, we built a new evaluation framework:
- 900+ deterministic tasks
- 450+ parent–child pairs
- 1,800+ evaluable scenarios via prompt–execution swapping
- 6 domains, balanced across Windows, macOS, Linux
- 40% open-source, 60% closed-source tools

Each task includes full telemetry:
- screen recordings
- event logs (clicks, keystrokes, scrolls)
- timestamped screenshots
- structured prompts, subtasks, and metadata

The key idea: structured failure. Instead of injecting errors, we create them by swapping execution and intent:
- Parent prompt + child execution
- Child prompt + parent execution

This produces controlled, classifiable failures:
- Critical mistake
- Bad side effect
- Instruction misunderstanding

With calibrated complexity (80–225 actions) and strict QA, this becomes a fully reproducible benchmark.

Result: we can measure not just whether agents succeed, but:
- where they break
- how errors propagate
- how robust they are across real environments

Agents don’t fail at the answer. They fail in the process. Read more case studies below.
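The prompt–execution swap at the heart of this design can be sketched in a few lines of Python. Everything here is illustrative: the `Task` shape, field names, and failure labels are assumptions, since the framework's actual schema is not described beyond this post.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Task:
    # Hypothetical task record: a prompt plus its recorded execution trace.
    task_id: str
    prompt: str
    execution: tuple  # ordered actions: clicks, keystrokes, tool calls

def swap_scenarios(parent: Task, child: Task) -> list:
    """Cross each prompt with the other task's execution trace to get
    two controlled, classifiable failure scenarios per matched pair."""
    return [
        # Parent intent judged against the child's narrower execution.
        {"intent": parent.prompt, "trace": child.execution,
         "expected_failure": "incomplete execution of parent intent"},
        # Child intent judged against the parent's over-reaching execution.
        {"intent": child.prompt, "trace": parent.execution,
         "expected_failure": "bad side effect beyond child intent"},
    ]

# Hypothetical parent-child pair for illustration.
parent = Task("p1", "Archive Q3 invoices and email a summary",
              ("open_mail", "filter_q3", "archive", "send_summary"))
child = Task("c1", "Archive Q3 invoices",
             ("open_mail", "filter_q3", "archive"))
scenarios = swap_scenarios(parent, child)
```

Each matched pair thus contributes its two original scenarios plus two swapped ones, which is how 450+ pairs can expand to 1,800+ evaluable scenarios.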
4 replies · 3 reposts · 18 likes · 51.2K views
Turing @turingcom

More on Project Lazarus: x.com/turingcom/stat…
Quoted: Turing @turingcom

Turing Research is launching a groundbreaking initiative to capture and utilize the complete, unfiltered operational history of companies, creating the definitive dataset for training the next generation of frontier models.

Project Lazarus is an initiative to acquire and permanently preserve the full, unfiltered operational history of defunct or inactive companies at scale. We focus on private codebases, version histories, internal documentation, post-mortems, experimentation logs, infrastructure tooling, and everyday work artifacts that collectively reflect how real organizations actually operate.

These materials capture the reality of knowledge work: incomplete specifications, tradeoffs made under time pressure, accumulated technical debt, evolving systems, and decisions made under uncertainty. Unlike polished outputs, operational traces preserve the causal structure of work across weeks, months, and years.

We prioritize industries with high complexity and outsized GDP impact, including financial services, healthcare and pharma, advanced manufacturing, and enterprise software. These domains contain long-horizon decision making, regulatory constraints, supply chain dependencies, and high-value intellectual property that are critical for training economically useful AI systems.

The data is structured for advanced methodologies such as reinforcement learning, imitation learning, and long-horizon task evaluation, enabling models to learn multi-step reasoning, organizational decision processes, and system diagnosis over extended timelines.

For founders, Project Lazarus is also preservation. A company’s history is a compressed record of human judgment, experimentation, and problem-solving. Instead of disappearing, that work compounds by becoming part of the foundation shaping the next generation of autonomous AI systems.

0 replies · 0 reposts · 0 likes · 65 views
Turing @turingcom
That COBOL system you retired 3 years ago? It's sitting in a repo — unmaintained, unused, but valuable. Have a legacy codebase? Schedule a call below.
2 replies · 3 reposts · 5 likes · 98 views
Jacy Reese Anthis @jacyanthis
Thrilled to be joining @GoogleDeepMind as a student researcher in SF! We're building a multi-agent system to scale AI safety research and ensure pluralistic alignment. I think this is a crucial piece of safe AGI development for cooperation across many diverse human and AI agents!
6 replies · 2 reposts · 66 likes · 5.1K views
Turing @turingcom
EnterpriseOps-Gym is taking off! 2K downloads in 3 days, trending #6 dataset + #3 paper of the day! Let's keep going!
Quoted: Sai Rajeswar @RajeswarSai

🔥 EnterpriseOps-Gym is taking off huge: 2K downloads in 3 days (trending #6 dataset + #3 paper of the day) 🏆. So we re-ran the leaderboard on the latest frontier closed models… and the results were promising.

✅ Claude versions show a meaningful jump in reliability on enterprise tasks.
✅ Gemini 3.1 Pro is catching up fast, now much closer to Sonnet 4.6 than earlier releases.

And yet, the bigger takeaway is still the same:
- Big room for improvement on enterprise-grade agentic tasks.
- These workflows punish "seemingly correct." One wrong default, one policy miss, one unintended side effect… and the task fails.

📢 Callout (especially if you’re working on agents): As we prepare our next NeurIPS/COLM submissions, try your agents on EnterpriseOps-Gym and see how they hold up on realistic, policy-constrained, long-horizon tasks.

🌐 Website: enterpriseops-gym.github.io
🤗 Dataset: huggingface.co/datasets/Servi…

@ServiceNowRSRCH, @sagardavasam, @turingcom, @turingcomdev, @Mila_Quebec, @shiva_malay, @PShravannayak

0 replies · 6 reposts · 7 likes · 2.3K views
Turing reposted
Turing @turingcom

Benchmarking RTL Agents with 1,500+ Real-World Verilog Tasks for NVIDIA’s CVDP

For NVIDIA’s Comprehensive Verilog Design Problems, Turing built a production-grade dataset to evaluate LLMs on real hardware workflows, not simplified prompts.

Most existing RTL benchmarks reported >60% pass rates because they relied on narrow prompts and constrained tests, and did not reflect multi-file repos, deep module hierarchies, debugging loops, or real EDA toolchains. CVDP was designed to change that.

Turing delivered 1,500+ simulation-ready RTL problems across 13 categories, including:
- Spec-to-RTL mapping
- Code completion
- Testbench and assertion generation
- Bug fixing and tool-invoked debugging

Three tiers of complexity:
- Single-file copilot tasks with golden solutions
- Multi-file agentic tasks requiring tool use
- Full Git-style projects with >200k-token contexts and simulation execution

Every task included a deterministic harness, a simulation-passable reference solution, and metadata for coverage and difficulty tracking. Validation ran through Icarus Verilog and Cadence Xcelium, with manual RTL engineer review and ambiguity filtering.

The result was tooling realism and measurable failure diversity. Cycle-accurate simulations surfaced real issues:
- FSM transition errors
- Signal width mismatches
- Cross-module reasoning failures
- Semantic violations

When frontier models were evaluated:
- GPT-4o dropped from 63% on prior benchmarks to 29%
- Claude 3.7 Sonnet peaked at 33.56% on non-agentic generation
- Agentic settings saw an additional 10–20% drop

CVDP is now one of the most challenging hardware design benchmarks available, enabling category-level error clustering, root-cause analysis, and rigorous evaluation as models advance. 1,500+ tasks. 13 categories. Commercial and open-source simulation. This is what hardware-grade LLM evaluation looks like.
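For the open-source half of the toolchain, a deterministic harness around Icarus Verilog could look roughly like this. The function names, verdict labels, and PASS/FAIL marker convention are illustrative assumptions; only the `iverilog`/`vvp` command-line usage is standard.

```python
import subprocess
import tempfile
from pathlib import Path

def classify(sim_stdout: str) -> str:
    """Map simulator output to a verdict. Assumes the golden testbench
    prints PASS/FAIL markers (a common convention, not a CVDP spec)."""
    if "FAIL" in sim_stdout:
        return "assertion_failure"
    if "PASS" in sim_stdout:
        return "pass"
    return "inconclusive"

def run_icarus_check(dut_src: str, tb_src: str, timeout_s: int = 60) -> dict:
    """Compile a candidate RTL module against a golden testbench with
    Icarus Verilog, then simulate and return a deterministic verdict."""
    with tempfile.TemporaryDirectory() as tmpdir:
        tmp = Path(tmpdir)
        (tmp / "dut.v").write_text(dut_src)
        (tmp / "tb.v").write_text(tb_src)
        # iverilog compiles testbench + DUT into a vvp executable.
        compiled = subprocess.run(
            ["iverilog", "-o", str(tmp / "sim.out"),
             str(tmp / "tb.v"), str(tmp / "dut.v")],
            capture_output=True, text=True, timeout=timeout_s)
        if compiled.returncode != 0:
            return {"status": "compile_error", "log": compiled.stderr}
        # vvp runs the compiled simulation.
        sim = subprocess.run(["vvp", str(tmp / "sim.out")],
                             capture_output=True, text=True, timeout=timeout_s)
        return {"status": classify(sim.stdout), "log": sim.stdout}
```

Because the verdict comes from a cycle-accurate simulation rather than from string-matching the model's answer, failure modes like signal-width mismatches surface directly in the simulation log.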
2 replies · 8 reposts · 19 likes · 16.5K views
Turing reposted
Turing @turingcom
Request a benchmark sample featuring a debugging task, expected behavior specification, and signal-accurate pass/fail evaluation: turing.com/case-study/ben…
0 replies · 6 reposts · 11 likes · 347 views
Turing reposted
Turing @turingcom

Everyone is asking about AGI. The better question: when does AI start moving the economy?

At Turing, the focus is not hype. It's enterprise automation. Models are getting good at answering questions. What comes next is harder, and more valuable: automating complex, multi-step workflows across finance, sales, engineering, healthcare, and more. Board meeting prep. Financial reporting. Sales research. Data analysis.

To get there, models need to master four things:
- Multimodal understanding
- Deep reasoning
- Reliable tool use
- Strong coding ability

And they need high-quality, workflow-level data to close the trust gap.

This will not be a rapid takeoff. It will be slow, steady productivity gains that compound across the $30 trillion knowledge economy. The future of AI is not just smarter chat. It's systems that can do real work.

Watch the recent interview with our CEO @jonsidd and @alexeheath below.
2 replies · 6 reposts · 15 likes · 3.7K views
Turing @turingcom

Turing is featured in @ServiceNowRSRCH's Enterprise Ops Gym paper. We built the task and evaluation backbone:
- 1,000 prompts
- 7 single-domain workflows plus 1 hybrid workflow
- 7-to-30-step planning horizons
- Expert reference executions with logged tool calls
- Deterministic validation for success and side-effect control

This enables structured comparison of enterprise agent performance across domains and complexity tiers. Dataset, paper, website, and code below.
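Deterministic validation with side-effect control can be approximated as an in-order subsequence check plus an allow-list for benign extra tool calls. This is a sketch of the idea, not the paper's validator; the semantics (ordered reference calls, an explicit allow-list) and all call names are assumptions.

```python
def validate_run(agent_calls, reference_calls, allowed_extras=frozenset()):
    """Score an agent's tool-call log against an expert reference execution.

    Success requires every reference call to appear in order in the agent
    run (in-order subsequence), and every extra call to be on the
    allow-list; anything else counts as a side-effect violation.
    """
    it = iter(agent_calls)
    # In-order subsequence check: `any` advances the shared iterator,
    # so reference calls must be matched in order.
    complete = all(any(ref == call for call in it) for ref in reference_calls)
    extras = [c for c in agent_calls if c not in reference_calls]
    violations = [c for c in extras if c not in allowed_extras]
    return {
        "success": complete and not violations,
        "missing_or_misordered": not complete,
        "side_effect_violations": violations,
    }
```

Under these assumed semantics, a run that inserts an allowed extra lookup between required calls still passes, while an unapproved extra call fails the run even when every required call was made.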
1 reply · 6 reposts · 13 likes · 27.7K views
Ewelina MoneyBabe @EwelinaDreamer
@turingcom Your insights into HLE++ and its potential to redefine performance metrics in AI are fascinating! Enhanced evaluation frameworks like this will be crucial in tackling complex real-world challenges effectively.
1 reply · 0 reposts · 3 likes · 9 views
Turing @turingcom
Most AI models can pass standard benchmarks. But real-world risk does not live at the surface. As frontier models improve, many STEM benchmarks are saturating. High pass rates on academic datasets reduce evaluation signal and can mask weaknesses in advanced reasoning. That’s why we built HLE++.
2 replies · 4 reposts · 13 likes · 1.3K views
Turing @turingcom

What we’re seeing:
- Performance gaps widen on structured, multi-step domain tasks.
- Models that perform well on general benchmarks degrade under domain-specific stress.
- Calibrated difficulty bands reveal weaknesses leaderboard scores often hide.
1 reply · 4 reposts · 9 likes · 93 views
Praneeth V. @prane_eth_v
@turingcom Yes. My benchmark HarmActEval proved an agent can say "Sorry, I can't do that" after performing disallowed actions using tools. GPT-5.3 scored 17%. Guardrails don't monitor agent actions. Agent Action Guard blocks harmful actions before execution. 🔗: l.praneeth.qzz.io/AAG-code
1 reply · 0 reposts · 0 likes · 33 views