BlockchainGirl

9.3K posts

@BlockchainGirll

Obsessed with all things #Blockchain ✨ #Bitcoin and #AI. Are you IN or OUT? 🇬🇷 #Web3

Los Angeles, CA · Joined September 2017
4.7K Following · 6.6K Followers
BlockchainGirl reposted
Turing @turingcom
Human-guided AI is how AI works in regulated environments. In compliance, fraud, and audit workflows, speed is not enough. Systems must be explainable, auditable, and defensible.

Autonomous-first AI fails where accountability matters:
- Hallucinations
- Silent drift
- Unclear decisions
- Weak audit trails

“The model said so” does not hold up.

The shift is architectural:
-> Confidence-based routing
-> Deterministic validation
-> Human gating before execution
-> End-to-end traceability

This is partial autonomy:
- Routine work scales
- Edge cases get expert review
- Every decision is reconstructable

Governance is not a layer. It is the system.
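The routing-and-gating architecture the thread describes can be sketched in a few lines. This is a hypothetical illustration, not Turing's actual system: `Decision`, `route`, and the 0.95 threshold are all assumed names and values.

```python
# Sketch of confidence-based routing with human gating before execution.
# Everything here is illustrative; thresholds would be tuned per workflow.
from dataclasses import dataclass

@dataclass
class Decision:
    label: str          # model output, e.g. "approve" / "flag"
    confidence: float   # model confidence in [0, 1]
    trace_id: str       # links the decision back to its full audit trail

AUTO_THRESHOLD = 0.95   # assumed cutoff for autonomous execution

def route(decision: Decision) -> str:
    """Auto-execute only above the threshold; otherwise gate behind
    human review. Both paths keep the trace id, so every decision
    stays reconstructable."""
    if decision.confidence >= AUTO_THRESHOLD:
        return f"auto:{decision.label}"          # routine work scales
    return f"human-review:{decision.trace_id}"   # edge case gets expert review
```

The point of the sketch is that the gate is structural, not a post-hoc check: low-confidence decisions never reach execution.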
BlockchainGirl reposted
Turing @turingcom
Turing operates at the intersection of frontier research and enterprise deployment. Our experience with leading AI labs informs what’s realistic, reliable, and ready for production. That perspective helps enterprises move faster, avoid costly missteps, and deploy AI systems that scale within real regulatory and operational constraints. turing.com/blog/frontier-…
BlockchainGirl reposted
Turing @turingcom
CASE STUDY: Better code models need better benchmarks.

We partnered with a client to build a dataset that shows where models actually break, not just where they succeed.
- 200+ SWE-bench-style Java tasks
- 20+ real GitHub repositories
- Each task includes a validated patch, reproducible tests, and a trainer-authored issue prompt

The goal was simple: reflect how bugs are found and fixed in the real world.

The problem: Most benchmarks rely on clean, solvable examples. Real pull requests are not like that. They are messy, uneven, and often hard to resolve. Our client needed to understand:
- Where their model succeeds
- Where it fails
- How well it generalizes across real codebases

The approach: We curated tasks from high-quality Java repositories with strict criteria:
- Reproducible test failures before the patch
- Clean passes after the patch
- Meaningful logic changes only
- Stable compilation throughout

Each repo was containerized in Docker to ensure consistent, isolated test execution.

When issues were missing, Turing trainers wrote them. Every prompt was:
- Problem-focused
- Neutral and solution-agnostic
- Aligned with test behavior for clean evaluation

We also balanced difficulty:
- About 30 percent solvable
- About 70 percent designed to expose failure modes

The outcome: A benchmark that does more than measure accuracy. It reveals capability. Teams can now:
- Test model performance on real bugs
- Identify breakpoints across complexity and context
- Analyze failure patterns with precision

This is how you move from optimistic benchmarks to real insight.
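The "reproducible failure before the patch, clean pass after" criterion amounts to a simple acceptance gate. A minimal sketch, with `run_tests` and `apply_patch` as illustrative stand-ins for the Dockerized test run and the validated patch:

```python
# Hypothetical sketch of the task-acceptance gate: a task is kept only
# if its tests reproducibly fail before the validated patch and pass
# cleanly after it. The callables stand in for the containerized flow.
def accept_task(run_tests, apply_patch) -> bool:
    failed_before = not run_tests()   # must reproduce the bug first
    apply_patch()                     # apply the validated patch
    passed_after = run_tests()        # must pass cleanly afterwards
    return failed_before and passed_after
```

A repo whose tests already pass before the patch is rejected outright, which filters out unreproducible or trivial examples.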
BlockchainGirl reposted
Turing @turingcom
Request a sample task featuring a curated issue prompt, validated patch, pass/fail test states & metadata on difficulty, solvability, and repository source: turing.com/case-study/cur…
BlockchainGirl reposted
Jonathan Siddharth @jonsidd
Important Project Lazarus update: Back in December, @Turingcom pioneered acquiring real-world startup/enterprise codebases and operational data to train frontier AI models. @steph_palazzolo at @theinformation broke the story on day one. Incredible reporting that helped define a new category.

The companies might be dead, but the human intelligence that built them can live on, powering the next generation of frontier models. Lazarus from Turing resurrects the spirits of dead companies.

Now we're scaling massively. Buying all data assets from active and inactive companies. Founder, investor, or operator with data to monetize? Hit me up. DM or email jonsid@turing.com
Jonathan Siddharth tweet media
Stephanie Palazzolo @steph_palazzolo

Can't go public or sell yourself? Try selling your codebase to an AI lab as training data! In this morning's AI Agenda, we get into this growing trend, as data curation firms like Turing and AfterQuery pick up failed startups' codebases. theinformation.com/articles/turin…

BlockchainGirl reposted
Turing @turingcom
Case Study: Most AI agent evals are flawed. They measure outputs. Real agents operate across 80–200+ actions, tools, and OS environments where failure is gradual, not binary.

At Turing, we built a new evaluation framework:
- 900+ deterministic tasks
- 450+ parent–child pairs
- 1,800+ evaluable scenarios via prompt–execution swapping
- 6 domains, balanced across Windows, macOS, Linux
- 40% open-source, 60% closed-source tools

Each task includes full telemetry:
- screen recordings
- event logs (clicks, keystrokes, scrolls)
- timestamped screenshots
- structured prompts, subtasks, and metadata

The key idea: structured failure. Instead of injecting errors, we create them by swapping execution and intent:
- Parent prompt + child execution
- Child prompt + parent execution

This produces controlled, classifiable failures:
- Critical mistake
- Bad side effect
- Instruction misunderstanding

With calibrated complexity (80–225 actions) and strict QA, this becomes a fully reproducible benchmark.

Result: We can measure not just if agents succeed, but:
- where they break
- how errors propagate
- how robust they are across real environments

Agents don’t fail at the answer. They fail in the process.

Read more case studies below.
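The prompt–execution swap can be sketched as a simple pairing step. A hypothetical illustration (the dict keys and helper name are assumptions; in the real framework each scenario carries full telemetry):

```python
# Sketch of structured failure via prompt–execution swapping: each
# parent–child task pair yields two deliberately mismatched scenarios,
# so failures are controlled and classifiable rather than injected.
def swap_scenarios(pairs):
    scenarios = []
    for parent, child in pairs:
        # parent prompt judged against the child's execution trace
        scenarios.append({"prompt": parent["prompt"], "execution": child["trace"]})
        # child prompt judged against the parent's execution trace
        scenarios.append({"prompt": child["prompt"], "execution": parent["trace"]})
    return scenarios
```

Counting the two matched originals alongside the two swaps, each pair contributes four evaluable scenarios, which is one way 450+ pairs can expand into 1,800+ scenarios.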
BlockchainGirl reposted
Turing @turingcom
That COBOL system you retired 3 years ago? It's sitting in a repo — unmaintained, unused, but valuable. Have a legacy codebase? Schedule a call below.
BlockchainGirl reposted
Turing @turingcom
Benchmarking RTL Agents with 1,500+ Real-World Verilog Tasks for NVIDIA’s CVDP

For NVIDIA’s Comprehensive Verilog Design Problems, Turing built a production-grade dataset to evaluate LLMs on real hardware workflows, not simplified prompts.

Most existing RTL benchmarks reported >60% pass rates because they relied on narrow prompts and constrained tests and did not reflect multi-file repos, deep module hierarchies, debugging loops, or real EDA toolchains. CVDP was designed to change that.

Turing delivered 1,500+ simulation-ready RTL problems across 13 categories, including:
- Spec-to-RTL mapping
- Code completion
- Testbench and assertion generation
- Bug fixing and tool-invoked debugging

Three tiers of complexity:
- Single-file copilot tasks with golden solutions
- Multi-file agentic tasks requiring tool use
- Full Git-style projects with >200k-token contexts and simulation execution

Every task included a deterministic harness, a simulation-passable reference solution, and metadata for coverage and difficulty tracking. Validation ran through Icarus Verilog and Cadence Xcelium, with manual RTL engineer review and ambiguity filtering.

The result was tooling realism and measurable failure diversity. Cycle-accurate simulations surfaced real issues:
- FSM transition errors
- Signal width mismatches
- Cross-module reasoning failures
- Semantic violations

When frontier models were evaluated:
- GPT-4o dropped from 63% on prior benchmarks to 29%
- Claude 3.7 Sonnet peaked at 33.56% on non-agentic generation
- Agentic settings saw an additional 10 to 20% drop

CVDP is now one of the most challenging hardware design benchmarks available, enabling category-level error clustering, root-cause analysis, and rigorous evaluation as models advance.
- 1,500+ tasks
- 13 categories
- Commercial and open-source simulation

This is what hardware-grade LLM evaluation looks like.
Turing tweet media
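The "deterministic harness plus simulation-passable reference" requirement implies a determinism check before a task ships. A minimal sketch; `simulate` stands in for running the task's testbench under Icarus Verilog or Xcelium, and the function name and run count are assumptions:

```python
# Hypothetical acceptance gate for a CVDP-style task: the reference RTL
# must pass its simulation harness on every run, a cheap determinism
# check ahead of manual RTL engineer review.
def validate_task(simulate, reference_rtl, runs=3):
    """Accept only if the reference passes the harness on all runs."""
    return all(simulate(reference_rtl) for _ in range(runs))
```

A flaky harness, one that passes on some runs and fails on others, is caught here before any model is ever scored against the task.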
BlockchainGirl reposted
Turing @turingcom
Request a benchmark sample featuring a debugging task, expected behavior specification, and signal-accurate pass/fail evaluation: turing.com/case-study/ben…
BlockchainGirl reposted
Turing @turingcom
EnterpriseOps-Gym is taking off! 2K downloads in 3 days, trending #6 dataset + #3 paper of the day! Let's keep going!
Sai Rajeswar @RajeswarSai

🔥 EnterpriseOps-Gym is taking off huge: 2K downloads in 3 days (trending #6 dataset + #3 paper of the day) 🏆.

So we re-ran the leaderboard on the latest frontier closed models… and the results were promising.
✅ Claude versions show a meaningful jump in reliability on enterprise tasks.
✅ Gemini 3.1 Pro is catching up fast, now much closer to Sonnet 4.6 than earlier releases.

And yet, the bigger takeaway is still the same:
- Big room for improvement on enterprise-grade agentic tasks.
- These workflows punish "seemingly correct." One wrong default, one policy miss, one unintended side effect… and the task fails.

📢 Callout (especially if you’re working on agents): As we prepare our next NeurIPS/COLM submissions, try your agents on EnterpriseOps-Gym and see how they hold up on realistic, policy-constrained, long-horizon tasks.

🌐 Website: enterpriseops-gym.github.io
🤗 Dataset: huggingface.co/datasets/Servi…
@ServiceNowRSRCH, @sagardavasam, @turingcom, @turingcomdev, @Mila_Quebec, @shiva_malay, @PShravannayak

BlockchainGirl reposted
Turing @turingcom
Turing is featured in @ServiceNowRSRCH's Enterprise Ops Gym paper. We built the task and evaluation backbone:
- 1,000 prompts
- 7 single-domain plus 1 hybrid workflow
- 7-to-30-step planning horizons
- Expert reference executions with logged tool calls
- Deterministic validation for success and side-effect control

Enabling structured comparison of enterprise agent performance across domains and complexity tiers.

Dataset -> Paper -> Website -> Code. Below.
Turing tweet media
BlockchainGirl reposted
a16z crypto @a16zcrypto
"When we made our acceptance speech on stage, we did give a shout out to Ethereum. White Rabbit wouldn't have existed without this technology." - @pplpleasr1
a16z crypto @a16zcrypto

Oscars week. Great time to talk about the first crypto project to win an Emmy. White Rabbit was crowdfunded on Ethereum, community-directed, and never pitched to a studio. @pplpleasr1 on building Shibuya.

0:00 Starting with a Fortune cover
1:49 White Rabbit and winning an Emmy
5:32 From Dickens to onchain storytelling
6:52 Why crypto unlocks capital formation
8:42 Lightning Round

BlockchainGirl reposted
Turing @turingcom
Turing contributed to Enterprise Ops Gym, ServiceNow’s new enterprise agent benchmark. We designed 1,000 prompts across 8 enterprise scenarios spanning HR, CSM, ITSM, Email, Calendar, Drive, Teams, and hybrid cross-domain workflows.

Tasks range from 7 to 30 steps with stateful system updates, expert reference traces, and deterministic verification scripts to evaluate correctness and policy compliance. A rigorous step forward for long-horizon enterprise agent evaluation.

Learn more:
Paper: arxiv.org/abs/2603.13594
Dataset: huggingface.co/datasets/Servi…
Website: enterpriseops-gym.github.io
Code: github.com/ServiceNow/Ent…
Sai Rajeswar @RajeswarSai

🧵 Introducing EnterpriseOps-Gym 🚀: a rigorous new benchmark for stateful agentic planning and tool use in real enterprise environments.

1,150 expert-curated tasks · 512 tools · 164 DB tables · 8 domains. Every task verified by hand-written SQL, checking goal completion, state integrity, and policy compliance 🔥

The headline: Claude Opus 4.5, our best-performing model, succeeds on just 37.4% of tasks. With oracle tool access. No tool discovery required.

📄 arxiv.org/abs/2603.13594 (trending #4 on daily-papers)
🌐 enterpriseops-gym.github.io
🤗 huggingface.co/datasets/Servi…
💻 github.com/ServiceNow/Ent…
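The hand-written-SQL verification these posts describe can be sketched with sqlite3 standing in for the benchmark's database. The table, the checks, and the `verify` helper are all illustrative assumptions, not the benchmark's actual code:

```python
# Sketch of deterministic, state-based verification: after an agent run,
# SQL checks assert both goal completion and the absence of unintended
# side effects. sqlite3 is a stand-in; names are hypothetical.
import sqlite3

def verify(conn, checks):
    """checks: list of (description, sql, expected scalar).
    The task passes only if every query returns its expected value."""
    results = {desc: conn.execute(sql).fetchone()[0] == expected
               for desc, sql, expected in checks}
    return all(results.values()), results

# Demo: a toy post-run database state.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tickets (id INTEGER, status TEXT)")
conn.execute("INSERT INTO tickets VALUES (1, 'closed')")

ok, detail = verify(conn, [
    ("goal: ticket 1 is closed",
     "SELECT COUNT(*) FROM tickets WHERE id = 1 AND status = 'closed'", 1),
    ("side effect: no extra tickets created",
     "SELECT COUNT(*) FROM tickets", 1),
])
```

Because checks run against the resulting system state rather than the agent's transcript, a "seemingly correct" run that quietly created an extra record still fails.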
