Turing
@turingcom
9.4K posts

Accelerating superintelligence to drive economic growth.

Palo Alto, CA · Joined September 2018
2.1K Following · 15.7K Followers

Pinned Tweet
Turing @turingcom

Case Study: Most AI agent evals are flawed. They measure outputs. Real agents operate across 80–200+ actions, tools, and OS environments where failure is gradual, not binary.

At Turing, we built a new evaluation framework:
- 900+ deterministic tasks
- 450+ parent–child pairs
- 1,800+ evaluable scenarios via prompt–execution swapping
- 6 domains, balanced across Windows, macOS, Linux
- 40% open-source, 60% closed-source tools

Each task includes full telemetry:
- screen recordings
- event logs (clicks, keystrokes, scrolls)
- timestamped screenshots
- structured prompts, subtasks, and metadata

The key idea: structured failure. Instead of injecting errors, we create them by swapping execution and intent:
- Parent prompt + child execution
- Child prompt + parent execution

This produces controlled, classifiable failures:
- Critical mistake
- Bad side effect
- Instruction misunderstanding

With calibrated complexity (80–225 actions) and strict QA, this becomes a fully reproducible benchmark.

Result: we can measure not just whether agents succeed, but:
- where they break
- how errors propagate
- how robust they are across real environments

Agents don’t fail at the answer. They fail in the process. Read more case studies below.
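The prompt–execution swap at the heart of this design can be sketched in a few lines of Python. Everything here is illustrative: the `Task` shape, field names, and failure labels are assumptions, since the framework's actual schema is not described beyond this post.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Task:
    # Hypothetical task record: a prompt plus its recorded execution trace.
    task_id: str
    prompt: str
    execution: tuple  # ordered actions: clicks, keystrokes, tool calls

def swap_scenarios(parent: Task, child: Task) -> list:
    """Cross each prompt with the other task's execution trace to get
    two controlled, classifiable failure scenarios per matched pair."""
    return [
        # Parent intent judged against the child's narrower execution.
        {"intent": parent.prompt, "trace": child.execution,
         "expected_failure": "incomplete execution of parent intent"},
        # Child intent judged against the parent's over-reaching execution.
        {"intent": child.prompt, "trace": parent.execution,
         "expected_failure": "bad side effect beyond child intent"},
    ]

# Hypothetical parent-child pair for illustration.
parent = Task("p1", "Archive Q3 invoices and email a summary",
              ("open_mail", "filter_q3", "archive", "send_summary"))
child = Task("c1", "Archive Q3 invoices",
             ("open_mail", "filter_q3", "archive"))
scenarios = swap_scenarios(parent, child)
```

Each matched pair thus contributes its two original scenarios plus two swapped ones, which is how 450+ pairs can expand to 1,800+ evaluable scenarios.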
4 replies · 3 reposts · 18 likes · 51.2K views
Turing @turingcom

More on Project Lazarus: x.com/turingcom/stat…
Quoted: Turing @turingcom

Turing Research is launching a groundbreaking initiative to capture and utilize the complete, unfiltered operational history of companies, creating the definitive dataset for training the next generation of frontier models.

Project Lazarus is an initiative to acquire and permanently preserve the full, unfiltered operational history of defunct or inactive companies at scale. We focus on private codebases, version histories, internal documentation, post-mortems, experimentation logs, infrastructure tooling, and everyday work artifacts that collectively reflect how real organizations actually operate.

These materials capture the reality of knowledge work: incomplete specifications, tradeoffs made under time pressure, accumulated technical debt, evolving systems, and decisions made under uncertainty. Unlike polished outputs, operational traces preserve the causal structure of work across weeks, months, and years.

We prioritize industries with high complexity and outsized GDP impact, including financial services, healthcare and pharma, advanced manufacturing, and enterprise software. These domains contain long-horizon decision making, regulatory constraints, supply chain dependencies, and high-value intellectual property that are critical for training economically useful AI systems.

The data is structured for advanced methodologies such as reinforcement learning, imitation learning, and long-horizon task evaluation, enabling models to learn multi-step reasoning, organizational decision processes, and system diagnosis over extended timelines.

For founders, Project Lazarus is also preservation. A company’s history is a compressed record of human judgment, experimentation, and problem-solving. Instead of disappearing, that work compounds by becoming part of the foundation shaping the next generation of autonomous AI systems.

0 replies · 0 reposts · 0 likes · 65 views
Turing @turingcom
That COBOL system you retired 3 years ago? It's sitting in a repo — unmaintained, unused, but valuable. Have a legacy codebase? Schedule a call below.
2 replies · 3 reposts · 5 likes · 98 views
Jacy Reese Anthis @jacyanthis
Thrilled to be joining @GoogleDeepMind as a student researcher in SF! We're building a multi-agent system to scale AI safety research and ensure pluralistic alignment. I think this is a crucial piece of safe AGI development for cooperation across many diverse human and AI agents!
6 replies · 2 reposts · 66 likes · 5.1K views
Turing @turingcom
EnterpriseOps-Gym is taking off! 2K downloads in 3 days, trending #6 dataset + #3 paper of the day! Let's keep going!
Quoted: Sai Rajeswar @RajeswarSai

🔥 EnterpriseOps-Gym is taking off huge: 2K downloads in 3 days (trending #6 dataset + #3 paper of the day) 🏆. So we re-ran the leaderboard on the latest frontier closed models… and the results were promising.

✅ Claude versions show a meaningful jump in reliability on enterprise tasks.
✅ Gemini 3.1 Pro is catching up fast, now much closer to Sonnet 4.6 than earlier releases.

And yet, the bigger takeaway is still the same:
- Big room for improvement on enterprise-grade agentic tasks.
- These workflows punish "seemingly correct." One wrong default, one policy miss, one unintended side effect… and the task fails.

📢 Callout (especially if you’re working on agents): As we prepare our next NeurIPS/COLM submissions, try your agents on EnterpriseOps-Gym and see how they hold up on realistic, policy-constrained, long-horizon tasks.

🌐 Website: enterpriseops-gym.github.io
🤗 Dataset: huggingface.co/datasets/Servi…

@ServiceNowRSRCH, @sagardavasam, @turingcom, @turingcomdev, @Mila_Quebec, @shiva_malay, @PShravannayak

0 replies · 6 reposts · 7 likes · 2.3K views
Turing reposted
Turing @turingcom

Benchmarking RTL Agents with 1,500+ Real-World Verilog Tasks for NVIDIA’s CVDP

For NVIDIA’s Comprehensive Verilog Design Problems, Turing built a production-grade dataset to evaluate LLMs on real hardware workflows, not simplified prompts.

Most existing RTL benchmarks reported >60% pass rates because they relied on narrow prompts and constrained tests, and did not reflect multi-file repos, deep module hierarchies, debugging loops, or real EDA toolchains. CVDP was designed to change that.

Turing delivered 1,500+ simulation-ready RTL problems across 13 categories, including:
- Spec-to-RTL mapping
- Code completion
- Testbench and assertion generation
- Bug fixing and tool-invoked debugging

Three tiers of complexity:
- Single-file copilot tasks with golden solutions
- Multi-file agentic tasks requiring tool use
- Full Git-style projects with >200k-token contexts and simulation execution

Every task included a deterministic harness, a simulation-passable reference solution, and metadata for coverage and difficulty tracking. Validation ran through Icarus Verilog and Cadence Xcelium, with manual RTL engineer review and ambiguity filtering.

The result was tooling realism and measurable failure diversity. Cycle-accurate simulations surfaced real issues:
- FSM transition errors
- Signal width mismatches
- Cross-module reasoning failures
- Semantic violations

When frontier models were evaluated:
- GPT-4o dropped from 63% on prior benchmarks to 29%
- Claude 3.7 Sonnet peaked at 33.56% on non-agentic generation
- Agentic settings saw an additional 10–20% drop

CVDP is now one of the most challenging hardware design benchmarks available, enabling category-level error clustering, root-cause analysis, and rigorous evaluation as models advance. 1,500+ tasks. 13 categories. Commercial and open-source simulation. This is what hardware-grade LLM evaluation looks like.
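For the open-source half of the toolchain, a deterministic harness around Icarus Verilog could look roughly like this. The function names, verdict labels, and PASS/FAIL marker convention are illustrative assumptions; only the `iverilog`/`vvp` command-line usage is standard.

```python
import subprocess
import tempfile
from pathlib import Path

def classify(sim_stdout: str) -> str:
    """Map simulator output to a verdict. Assumes the golden testbench
    prints PASS/FAIL markers (a common convention, not a CVDP spec)."""
    if "FAIL" in sim_stdout:
        return "assertion_failure"
    if "PASS" in sim_stdout:
        return "pass"
    return "inconclusive"

def run_icarus_check(dut_src: str, tb_src: str, timeout_s: int = 60) -> dict:
    """Compile a candidate RTL module against a golden testbench with
    Icarus Verilog, then simulate and return a deterministic verdict."""
    with tempfile.TemporaryDirectory() as tmpdir:
        tmp = Path(tmpdir)
        (tmp / "dut.v").write_text(dut_src)
        (tmp / "tb.v").write_text(tb_src)
        # iverilog compiles testbench + DUT into a vvp executable.
        compiled = subprocess.run(
            ["iverilog", "-o", str(tmp / "sim.out"),
             str(tmp / "tb.v"), str(tmp / "dut.v")],
            capture_output=True, text=True, timeout=timeout_s)
        if compiled.returncode != 0:
            return {"status": "compile_error", "log": compiled.stderr}
        # vvp runs the compiled simulation.
        sim = subprocess.run(["vvp", str(tmp / "sim.out")],
                             capture_output=True, text=True, timeout=timeout_s)
        return {"status": classify(sim.stdout), "log": sim.stdout}
```

Because the verdict comes from a cycle-accurate simulation rather than from string-matching the model's answer, failure modes like signal-width mismatches surface directly in the simulation log.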
2 replies · 8 reposts · 19 likes · 16.5K views
Turing reposted
Turing @turingcom
Request a benchmark sample featuring a debugging task, expected behavior specification, and signal-accurate pass/fail evaluation: turing.com/case-study/ben…
0 replies · 6 reposts · 11 likes · 347 views
Turing reposted
Turing @turingcom

Everyone is asking about AGI. The better question: when does AI start moving the economy?

At Turing, the focus is not hype. It's enterprise automation. Models are getting good at answering questions. What comes next is harder, and more valuable: automating complex, multi-step workflows across finance, sales, engineering, healthcare, and more. Board meeting prep. Financial reporting. Sales research. Data analysis.

To get there, models need to master four things:
- Multimodal understanding
- Deep reasoning
- Reliable tool use
- Strong coding ability

And they need high-quality, workflow-level data to close the trust gap.

This will not be a rapid takeoff. It will be slow, steady productivity gains that compound across the $30 trillion knowledge economy. The future of AI is not just smarter chat. It's systems that can do real work.

Watch the recent interview with our CEO @jonsidd and @alexeheath below.
2 replies · 6 reposts · 15 likes · 3.7K views
Turing @turingcom

Turing is featured in @ServiceNowRSRCH's Enterprise Ops Gym paper. We built the task and evaluation backbone:
- 1,000 prompts
- 7 single-domain workflows plus 1 hybrid workflow
- 7-to-30-step planning horizons
- Expert reference executions with logged tool calls
- Deterministic validation for success and side-effect control

This enables structured comparison of enterprise agent performance across domains and complexity tiers. Dataset, paper, website, and code below.
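Deterministic validation with side-effect control can be approximated as an in-order subsequence check plus an allow-list for benign extra tool calls. This is a sketch of the idea, not the paper's validator; the semantics (ordered reference calls, an explicit allow-list) and all call names are assumptions.

```python
def validate_run(agent_calls, reference_calls, allowed_extras=frozenset()):
    """Score an agent's tool-call log against an expert reference execution.

    Success requires every reference call to appear in order in the agent
    run (in-order subsequence), and every extra call to be on the
    allow-list; anything else counts as a side-effect violation.
    """
    it = iter(agent_calls)
    # In-order subsequence check: `any` advances the shared iterator,
    # so reference calls must be matched in order.
    complete = all(any(ref == call for call in it) for ref in reference_calls)
    extras = [c for c in agent_calls if c not in reference_calls]
    violations = [c for c in extras if c not in allowed_extras]
    return {
        "success": complete and not violations,
        "missing_or_misordered": not complete,
        "side_effect_violations": violations,
    }
```

Under these assumed semantics, a run that inserts an allowed extra lookup between required calls still passes, while an unapproved extra call fails the run even when every required call was made.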
1 reply · 6 reposts · 13 likes · 27.7K views
Ewelina MoneyBabe @EwelinaDreamer
@turingcom Your insights into HLE++ and its potential to redefine performance metrics in AI are fascinating! Enhanced evaluation frameworks like this will be crucial in tackling complex real-world challenges effectively.
1 reply · 0 reposts · 3 likes · 9 views
Turing @turingcom
Most AI models can pass standard benchmarks. But real-world risk does not live at the surface. As frontier models improve, many STEM benchmarks are saturating. High pass rates on academic datasets reduce evaluation signal and can mask weaknesses in advanced reasoning. That’s why we built HLE++.
2 replies · 4 reposts · 13 likes · 1.3K views
Turing @turingcom

What we’re seeing:
- Performance gaps widen on structured, multi-step domain tasks.
- Models that perform well on general benchmarks degrade under domain-specific stress.
- Calibrated difficulty bands reveal weaknesses leaderboard scores often hide.
1 reply · 4 reposts · 9 likes · 93 views
Praneeth V. @prane_eth_v
@turingcom Yes. My benchmark HarmActEval proved an agent can say "Sorry, I can't do that" after performing disallowed actions using tools. GPT-5.3 scored 17%. Guardrails don't monitor agent actions. Agent Action Guard blocks harmful actions before execution. 🔗: l.praneeth.qzz.io/AAG-code
1 reply · 0 reposts · 0 likes · 33 views