Turing Community

2.5K posts


@turingcomdev

Accelerating frontier AI models + systems. Powered by ALAN and a 4M+ global talent cloud. AGI at scale.

Joined February 2022
111 Following · 1.4K Followers
Turing Community reposted
Turing @turingcom
EnterpriseOps-Gym is taking off! 2K downloads in 3 days, trending #6 dataset + #3 paper of the day! Let's keep going!
Sai Rajeswar @RajeswarSai

🔥 EnterpriseOps-Gym is taking off huge: 2K downloads in 3 days (trending #6 dataset + #3 paper of the day) 🏆. So we re-ran the leaderboard on the latest frontier closed models… and the results were promising.
✅ Claude versions show a meaningful jump in reliability on enterprise tasks.
✅ Gemini 3.1 Pro is catching up fast, now much closer to Sonnet 4.6 than earlier releases.
And yet, the bigger takeaway is still the same:
- Big room for improvement on enterprise-grade agentic tasks.
- These workflows punish "seemingly correct." One wrong default, one policy miss, one unintended side effect… and the task fails.
📢 Callout (especially if you’re working on agents): as we prepare our next NeurIPS/COLM submissions, try your agents on EnterpriseOps-Gym and see how they hold up on realistic, policy-constrained, long-horizon tasks.
🌐 Website: enterpriseops-gym.github.io
🤗 Dataset: huggingface.co/datasets/Servi…
@ServiceNowRSRCH @sagardavasam @turingcom @turingcomdev @Mila_Quebec @shiva_malay @PShravannayak

Turing Community reposted
Turing @turingcom
Turing is featured in @ServiceNowRSRCH's Enterprise Ops Gym paper. We built the task and evaluation backbone:
- 1,000 prompts
- 7 single-domain plus 1 hybrid workflow
- 7-to-30-step planning horizons
- Expert reference executions with logged tool calls
- Deterministic validation for success and side-effect control
This enables structured comparison of enterprise agent performance across domains and complexity tiers. Dataset, paper, website, and code below.
Turing Community reposted
Turing @turingcom
Most AI models can pass standard benchmarks. But real-world risk does not live at the surface. As frontier models improve, many STEM benchmarks are saturating. High pass rates on academic datasets reduce evaluation signal and can mask weaknesses in advanced reasoning. That’s why we built HLE++.
Turing Community reposted
Turing @turingcom
HLE++ is a calibrated STEM evaluation framework designed to preserve measurable pass@k separation beyond baseline benchmarks like Humanity’s Last Exam (HLE). It measures how leading large language models perform on graduate-to-PhD-level, multi-step math and science tasks under strict structural constraints.
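For context, pass@k is usually reported with the unbiased estimator popularized by the HumanEval paper; HLE++'s exact scoring pipeline isn't shown in this thread, so this is only a minimal sketch of the standard metric:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples,
    drawn without replacement from n attempts with c correct, passes."""
    if n - c < k:  # fewer failures than draws -> a pass is guaranteed
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Preserving measurable pass@k separation then means choosing problems hard enough that c/n stays well below 1 for frontier models, so the metric can still rank them.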
Turing Community reposted
Turing @turingcom
Why this matters:
- Advanced STEM reasoning underpins high-impact AI use cases across finance, life sciences, manufacturing, energy, and defense.
- Small reasoning errors compound at scale in automated decision systems.
- Evaluation design, ambiguity control, and rubric structure materially affect reported model performance.
Turing Community reposted
Turing @turingcom
What we’re seeing:
- Performance gaps widen on structured, multi-step domain tasks.
- Models that perform well on general benchmarks degrade under domain-specific stress.
- Calibrated difficulty bands reveal weaknesses leaderboard scores often hide.
Turing Community reposted
Marktechpost AI Dev News ⚡
NVIDIA just open-sourced OpenShell (Apache 2.0), a dedicated runtime environment designed to address the security risks associated with autonomous AI agents. As agents move from simple chat interfaces to executing code and accessing local/remote tools, they require a secure execution layer that prevents unauthorized system access or data exfiltration. OpenShell provides this infrastructure through three primary technical pillars:
1️⃣ Sandboxed Execution
Using kernel-level isolation (Landlock LSM), OpenShell creates an ephemeral environment for agent tasks. This ensures that any shell commands or scripts generated by an LLM are contained, protecting the host system from unintended modifications or destructive commands.
2️⃣ Policy-Enforced Access Control
Rather than providing broad permissions, OpenShell utilizes a granular policy engine. Developers can define restrictions at multiple levels:
→ Per-binary: Explicitly allow or deny specific executables (e.g., git, python).
→ Per-endpoint: Restrict network traffic to authorized domains or IP addresses.
→ Per-method: Control specific API calls or L7 protocols.
→ Audit Logging: Every action is recorded for debugging and compliance.
3️⃣ Private Inference Routing
To manage privacy and costs, OpenShell includes a routing layer that intercepts model traffic. This allows organizations to enforce data-handling rules and route inference requests between local and cloud models without changing the agent's code.
OpenShell is currently in alpha.
Read our full analysis on OpenShell: marktechpost.com/2026/03/18/nvi…
GitHub: github.com/NVIDIA/OpenShe…
Docs: docs.nvidia.com/openshell/late…
Technical details: developer.nvidia.com/blog/run-auton…
@nvidia @NVIDIAAI @NVIDIAAIDev
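The per-binary/per-endpoint model in pillar 2 can be pictured as a pair of allow-lists consulted before each agent action. A minimal Python sketch of that idea — the policy structure and names here are hypothetical illustrations, not the real OpenShell API or config format:

```python
# Hypothetical illustration of allow-list policy checks in the spirit of
# pillar 2 above; OpenShell's actual configuration format is not shown here.
from urllib.parse import urlparse

POLICY = {
    "binaries": {"git", "python"},     # per-binary: executables the agent may run
    "endpoints": {"api.github.com"},   # per-endpoint: hosts the agent may reach
}

def allowed_exec(binary: str) -> bool:
    """Deny any executable not explicitly allow-listed."""
    return binary in POLICY["binaries"]

def allowed_request(url: str) -> bool:
    """Deny network traffic to any host not explicitly allow-listed."""
    return urlparse(url).hostname in POLICY["endpoints"]
```

The point of the allow-list (rather than deny-list) design is that anything the developer did not anticipate is blocked by default, which is what contains an agent's unforeseen actions.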
Turing Community reposted
Jonathan Siddharth @jonsidd
Excited to share that Turing contributed to Enterprise Ops Gym, @ServiceNowRSRCH's new enterprise agent benchmark submitted to ICML. Enterprise Ops Gym moves beyond short-horizon tool calls and evaluates end-to-end enterprise operations across realistic, multi-system workflows. The paper, website, dataset, and code are below.
Turing Community reposted
Turing @turingcom
Benchmarking frontier models takes more than bigger datasets. Turing built 5,000+ HLE-grade STEM problems to stress-test deep reasoning across physics, chemistry, biology, and math. 100% accepted. 40+ subdomains. Built for real model differentiation. This is what next-gen AI evaluation looks like. Review the case study below.
Turing Community reposted
Marktechpost AI Dev News ⚡
Most AI agents today are failing the enterprise "vibe check." ServiceNow Research just released EnterpriseOps-Gym, and it’s a massive reality check for anyone expecting autonomous agents to take over IT and HR tomorrow.
We’re moving past simple benchmarks. This is a containerized sandbox with 164 database tables and 512 functional tools. It’s designed to see if agents can actually handle long-horizon planning amidst persistent state changes and strict access protocols.
The brutal numbers:
→ Claude Opus 4.5 (the top performer) only achieved a 37.4% success rate.
→ Gemini-3-Flash followed at 31.9%.
→ DeepSeek-V3.2 (High) leads the open-source pack at 24.5%.
Why the low scores? The study found that strategic reasoning, not tool invocation, is the primary bottleneck. When the research team provided agents with a human-authored plan, performance jumped by 14-35 percentage points. Strikingly, with a good plan, tiny models like Qwen3-4B actually become competitive with the giants.
The TL;DR for AI devs:
✅ Planning > scale: we can’t just scale our way to reliability; we need better constraint-aware plan generation.
✅ MAS isn’t a silver bullet: decomposing tasks into subtasks often regressed performance because it broke sequential state dependencies.
✅ Sandbox everything: if you aren’t testing your agents in stateful environments, you aren’t testing them for the real world.
Read our full analysis here: marktechpost.com/2026/03/18/ser…
Check out the benchmark: enterpriseops-gym.github.io
Paper: arxiv.org/pdf/2603.13594
Code: github.com/ServiceNow/Ent…
@ServiceNow @ServiceNowRSRCH @RajeswarSai @ShivaMalay @PShravannayak @TheJishnuNair @sagardavasam @SathwikTejaswi @tscholak @NVIDIAAI @turingcom @ServiceNowNews @jonsidd
Turing Community reposted
Turing @turingcom
Turing contributed to Enterprise Ops Gym, ServiceNow’s new enterprise agent benchmark. We designed 1,000 prompts across 8 enterprise scenarios spanning HR, CSM, ITSM, Email, Calendar, Drive, Teams, and hybrid cross-domain workflows. Tasks range from 7 to 30 steps with stateful system updates, expert reference traces, and deterministic verification scripts to evaluate correctness and policy compliance. A rigorous step forward for long-horizon enterprise agent evaluation.
Learn more:
Paper: arxiv.org/abs/2603.13594
Dataset: huggingface.co/datasets/Servi…
Website: enterpriseops-gym.github.io
Code: github.com/ServiceNow/Ent…
Sai Rajeswar @RajeswarSai

🧵 Introducing EnterpriseOps-Gym 🚀: a rigorous new benchmark for stateful agentic planning and tool use in real enterprise environments. 1,150 expert-curated tasks · 512 tools · 164 DB tables · 8 domains. Every task verified by hand-written SQL, checking goal completion, state integrity, and policy compliance 🔥
The headline: Claude Opus 4.5, our best-performing model, succeeds on just 37.4% of tasks. With oracle tool access. No tool discovery required.
📄 arxiv.org/abs/2603.13594 (trending #4 on daily-papers)
🌐 enterpriseops-gym.github.io
🤗 huggingface.co/datasets/Servi…
💻 github.com/ServiceNow/Ent…
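A flavor of what "verified by hand-written SQL" can look like in practice: one query checks goal completion, another checks for unintended side effects in the final database state. This is a hypothetical sketch with made-up table and column names; the benchmark's real verifiers live in the linked repo:

```python
# Hypothetical sketch of deterministic verification: one SQL check for goal
# completion, one for side-effect control. Schema is illustrative only.
import sqlite3

def verify_task(conn: sqlite3.Connection) -> bool:
    # Goal check: the requested ticket (id 42) ended up closed.
    goal_met = conn.execute(
        "SELECT COUNT(*) FROM tickets WHERE id = 42 AND status = 'closed'"
    ).fetchone()[0] == 1
    # Side-effect check: no other ticket was touched (all still 'open').
    no_side_effects = conn.execute(
        "SELECT COUNT(*) FROM tickets WHERE id != 42 AND status != 'open'"
    ).fetchone()[0] == 0
    return goal_met and no_side_effects
```

Because the checks run on final state rather than on the agent's transcript, a "seemingly correct" trajectory that mutated unrelated rows still fails, which is exactly the side-effect control the thread describes.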
