Yaser Martinez

482 posts

@elyase

Ai Shepardeur 🤖🤝📈🤝🪙

Berlin, Germany · Joined June 2009
1K Following · 145 Followers

Yaser Martinez retweeted
chiefofautism (@chiefofautism)
openai built a model that HIDES personal data in text so nothing leaks

i flipped it INSIDE OUT

same 1.5B weights, same label taxonomy, but instead of masks you get structured spans, name, email, phone, bank account, address, secrets, char offsets and all

point it at logs, dumps, stolen inboxes and it just... returns every private thing in the pile
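
To make the "structured spans with char offsets" idea concrete, here is a minimal Python sketch. It does not use the model from the post: the two regex patterns, the `Span` type, and the `find_spans` and `mask` helpers are all hypothetical stand-ins, chosen only to show how the same detections can be returned either as masks or as labelled spans.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    label: str  # category name, e.g. "email"
    start: int  # inclusive character offset into the input text
    end: int    # exclusive character offset into the input text
    text: str   # the matched substring


# Hypothetical, deliberately simple patterns. The post describes a trained
# 1.5B-parameter model; these regexes only stand in for its detections.
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "phone": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}


def find_spans(text):
    """Return every detection as a labelled span, ordered by start offset."""
    spans = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            spans.append(Span(label, match.start(), match.end(), match.group()))
    return sorted(spans, key=lambda s: s.start)


def mask(text, spans):
    """Redact the same detections instead of returning them."""
    out = text
    # Replace right to left so earlier offsets stay valid.
    for s in sorted(spans, key=lambda s: s.start, reverse=True):
        out = out[:s.start] + "[" + s.label.upper() + "]" + out[s.end:]
    return out


if __name__ == "__main__":
    sample = "Contact alice@example.com or +49 30 1234567."
    for span in find_spans(sample):
        print(span)
    print(mask(sample, find_spans(sample)))
```

Masking and extraction share the detection step and differ only in what is done with the offsets, which is the inversion the post describes.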
52 replies · 104 reposts · 2K likes · 132.5K views

Yaser Martinez retweeted
Cua (@trycua)
We're open-sourcing Cua Driver - our new macOS driver that lets any agent (Claude Code, Codex, your own loop) drive any app in the background, with true multi-player and multi-cursor built-in. 1/8
55 replies · 168 reposts · 1.6K likes · 204.4K views

Yaser Martinez retweeted
Artificial Analysis (@ArtificialAnlys)
GPT-5.5 takes OpenAI back to the clear number one in AI. OpenAI’s new model tops the Artificial Analysis Intelligence Index by 3 points, breaking a three-way tie with Anthropic and Google.

OpenAI gave us pre-release access to test all five reasoning effort levels: xhigh, high, medium, low and non-reasoning.

➤ OpenAI topping five headline evaluations: GPT-5.5 (xhigh) leads Terminal-Bench Hard, GDPval-AA and our newly hosted APEX-Agents-AA. The model trails only other OpenAI models in CritPt and AA-LCR, and comes second to Gemini 3.1 Pro Preview on three additional evaluations. The largest gains are on AA-Omniscience (+14 pts), our knowledge and hallucination benchmark, and τ²-Bench Telecom (+7 pts), a customer service agent benchmark.

➤ 20% more expensive to run our Intelligence Index: Per-token pricing has doubled from GPT-5.4 to $5/$30 per 1M input/output tokens. However, a ~40% token use reduction largely absorbs the hike - resulting in a net ~+20% cost to run our Intelligence Index.

➤ Effort a clear ladder for balancing intelligence and cost: GPT-5.5 (medium) scores the same as Claude Opus 4.7 (max) on our Intelligence Index at one quarter of the cost (~$1,200 vs $4,800) - although Gemini 3.1 Pro Preview scores the same at a cost of ~$900. GPT-5.5 (low) approximates Claude Opus 4.7 (Non-reasoning, high) on our Intelligence Index at half the cost to run (~$500 vs ~$1,000).

➤ Number one in GDPval-AA with an Elo of 1785: GPT-5.5 (xhigh) leads Claude Opus 4.7 (max) by ~30 pts and Gemini 3.1 Pro Preview by ~470 pts. GDPval-AA is Artificial Analysis’ benchmark that leverages OpenAI’s GDPval dataset to evaluate models on real-world economically valuable tasks.

➤ Top AA-Omniscience accuracy, but trailing the frontier on hallucination: Our private AA-Omniscience benchmark rewards factual knowledge across diverse topics, but punishes hallucination. GPT-5.5 (xhigh) has the highest accuracy at 57% - meaning the model can recall facts in the Omniscience corpus more effectively than any other model. However, it has a hallucination rate of 86% - vs Opus 4.7 (max) at 36%, and Gemini 3.1 Pro Preview at 50%. This makes it more likely to answer a question when it does not ‘know’ the answer. The 14 pt gain in AA-Omniscience from GPT-5.4 (xhigh) was largely driven by knowledge, with a modest improvement in hallucination.

Congratulations to the team at @OpenAI and @sama on the launch
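
The "net ~+20%" figure follows from multiplying the two changes quoted in the post. The short Python sketch below reproduces that arithmetic under a simplifying assumption that is not stated in the post: the price doubling and the ~40% token reduction apply uniformly across input and output tokens. The function name is hypothetical.

```python
def net_cost_multiplier(price_multiplier, token_reduction):
    """Cost relative to the previous model: price change times remaining tokens."""
    return price_multiplier * (1.0 - token_reduction)


# Figures quoted in the post: per-token pricing doubled, ~40% fewer tokens used.
net = net_cost_multiplier(price_multiplier=2.0, token_reduction=0.40)
print(f"net cost change: {(net - 1.0) * 100:+.0f}%")
```

Doubling the price while using 60% of the tokens gives 2.0 × 0.6 = 1.2, which is the roughly 20% increase reported.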
57 replies · 207 reposts · 1.7K likes · 259.1K views

Yaser Martinez retweeted
OpenAI (@OpenAI)
Introducing GPT-5.5

A new class of intelligence for real work and powering agents, built to understand complex goals, use tools, check its work, and carry more tasks through to completion. It marks a new way of getting computer work done.

Now available in ChatGPT and Codex.
2.4K replies · 7K reposts · 51.7K likes · 12.3M views

Yaser Martinez retweeted
Yoonho Lee (@yoonholeee)
We just released code for Meta-Harness! github.com/stanford-iris-… Aside from replicating paper experiments, the repo is designed to help users implement good Meta-Harnesses in completely new domains! Just point your agent at ONBOARDING.md and have a conversation
Quoted post by Yoonho Lee (@yoonholeee):

How can we autonomously improve LLM harnesses on problems humans are actively working on? Doing so requires solving a hard, long-horizon credit-assignment problem over all prior code, traces, and scores. Announcing Meta-Harness: a method for optimizing harnesses end-to-end

27 replies · 165 reposts · 1.1K likes · 122.5K views

Yaser Martinez retweeted
Rivet (@rivet_dev)
Say hello to agentOS (beta)

A portable open-source OS built just for agents. Powered by WASM & V8 isolates.

🔗 Embedded in your backend
⚡ ~6ms coldstarts, 32x cheaper than sbxs
📁 Mount anything as a file system (S3, SQLite, …)
🥧 Use Pi, Claude Code/Codex/Amp/OpenCode soon
59 replies · 76 reposts · 1.1K likes · 254.9K views

Yaser Martinez retweeted
Varun (@varun_mathur)
Introducing the Agent Virtual Machine (AVM)

Think V8 for agents.

AI agents are currently running on your computer with no unified security, no resource limits, and no visibility into what data they're sending out. Every agent framework builds its own security model, its own sandboxing, its own permission system. You configure each one separately. You audit each one separately. You hope you didn't miss anything in any of them.

The AVM changes this. It's a single runtime daemon (avmd) that sits between every agent framework and your operating system. Install it once, configure one policy file, and every agent on your machine runs inside it - regardless of which framework built it.

The AVM enforces security (91-pattern injection scanner, tool/file/network ACLs, approval prompts), protects your privacy (classifies every outbound byte for PII, credentials, and financial data - blocks or alerts in real-time), and governs resources (you say "50% CPU, 4GB RAM" and the AVM fair-shares it across all agents, halting any that exceed their budget). One config. One audit command. One kill switch.

The architectural model is V8 for agents. Chrome, Node.js, and Deno are different products but they share V8 as their execution engine. Agent frameworks bring the UX. The AVM brings the trust.

Where needed, AVM can also generate zero-knowledge proofs of agent execution via 25 purpose-built opcodes and 6 proof systems, providing the foundational pillar for the agent-to-agent economy.

AVM v0.1.0 - Changelog

- Security gate: 5-layer injection scanner with 91 compiled regex patterns. Every input and output scanned. Fail-closed - nothing passes without clearing the gate.
- Privacy layer: Classifies all outbound data for PII, credentials, and financial info (27 detection patterns + Luhn validation). Block, ask, warn, or allow per category. Tamper-evident hash-chained log of every egress event.
- Resource governor: User sets system-wide caps (CPU/memory/disk/network). AVM fair-shares across all agents. Gas budget per agent - when gas runs out, execution halts. No agent starves your machine.
- Sandbox execution: Real code execution in isolated process sandboxes (rlimits, env sanitization) or Docker containers (--cap-drop ALL, --network none, --read-only). AVM auto-selects the tier - agents never choose their own sandbox.
- Approval flow: Dangerous operations (file writes, shell commands, network requests) trigger interactive approval prompts. 5-minute timeout auto-denies. Every decision logged.
- CLI dashboard: hyperspace-avm top shows all running agents, resource usage, gas budgets, security events, and privacy stats in one live-updating screen.
- Node.js SDK: Zero-dependency hyperspace/avm package. AVM.tryConnect() for graceful fallback - if avmd isn't running, the agent framework uses its own execution path. OpenClaw adapter example included.
- One config for all agents: ~/.hyperspace/avm-policy.json governs every agent framework on your machine. One file. One audit. One kill switch.
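
Two techniques named in the privacy-layer changelog entry, Luhn validation and a tamper-evident hash-chained log, are standard and can be sketched in a few lines of Python. This is not AVM code and does not reflect its actual API or log format: the function names, the event fields, and the choice of SHA-256 over a JSON payload are all assumptions made for illustration.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed placeholder "previous hash" for the first entry


def luhn_valid(number):
    """Luhn checksum, commonly used to filter candidate card numbers."""
    digits = [int(c) for c in number if c.isdigit()]
    if len(digits) < 2:
        return False
    total = 0
    # Walk from the rightmost digit, doubling every second one.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def _digest(prev_hash, event):
    payload = json.dumps(event, sort_keys=True)
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def append_event(log, event):
    """Append an event whose hash covers the previous entry's hash."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    log.append({"event": event, "prev": prev_hash, "hash": _digest(prev_hash, event)})


def verify_chain(log):
    """Recompute every hash; any edited or reordered entry breaks the chain."""
    prev_hash = GENESIS
    for entry in log:
        if entry["prev"] != prev_hash or entry["hash"] != _digest(prev_hash, entry["event"]):
            return False
        prev_hash = entry["hash"]
    return True


if __name__ == "__main__":
    log = []
    # Hypothetical egress events; the field names are illustrative only.
    append_event(log, {"agent": "a1", "category": "financial", "action": "block"})
    append_event(log, {"agent": "a2", "category": "pii", "action": "warn"})
    print(luhn_valid("4111 1111 1111 1111"), verify_chain(log))
```

Because each hash includes the previous one, changing an earlier event invalidates every later entry, which is what makes such a log tamper-evident rather than tamper-proof.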
138 replies · 180 reposts · 1.3K likes · 138.1K views

Yaser Martinez retweeted
Claude (@claudeai)
You can now enable Claude to use your computer to complete tasks. It opens your apps, navigates your browser, fills in spreadsheets—anything you'd do sitting at your desk. Research preview in Claude Cowork and Claude Code, macOS only.
5K replies · 14.5K reposts · 139.6K likes · 77.7M views

Yaser Martinez retweeted
Haocheng Xi (@HaochengXiUCB)
K-means is simple. Making it fast on GPUs isn’t.

That’s why we built Flash-KMeans, an IO-aware implementation of exact k-means that rethinks the algorithm around modern GPU bottlenecks.

By attacking the memory bottlenecks directly, Flash-KMeans achieves 30x speedup over cuML and 200x speedup over FAISS, with the same exact algorithm, just engineered for today’s hardware.

At the million-scale, Flash-KMeans can complete a k-means iteration in milliseconds.

A classic algorithm, redesigned for modern GPUs.

Paper: arxiv.org/abs/2603.09229
Code: github.com/svg-project/fl…
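
For reference, one iteration of exact k-means (Lloyd's algorithm) consists of an assignment step and an update step. The pure-Python sketch below shows only that textbook iteration. It is not the Flash-KMeans implementation and has none of its IO-aware GPU engineering; the function names are hypothetical, and keeping the old centroid for an empty cluster is one common convention among several.

```python
def assign(points, centroids):
    """Assignment step: index of the nearest centroid for each point."""
    labels = []
    for p in points:
        # Squared Euclidean distance is enough for choosing the minimum.
        dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
        labels.append(dists.index(min(dists)))
    return labels


def update(points, labels, centroids):
    """Update step: move each centroid to the mean of its assigned points."""
    dim = len(centroids[0])
    new_centroids = []
    for k, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == k]
        if not members:
            new_centroids.append(old)  # empty cluster: keep the old centroid
        else:
            new_centroids.append(
                tuple(sum(p[d] for p in members) / len(members) for d in range(dim))
            )
    return new_centroids


def kmeans_iteration(points, centroids):
    """One full Lloyd iteration: assign, then update."""
    labels = assign(points, centroids)
    return labels, update(points, labels, centroids)


if __name__ == "__main__":
    pts = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]
    print(kmeans_iteration(pts, [(1.0, 1.0), (9.0, 9.0)]))
```

The assignment step touches every point-centroid pair, so at scale its cost is dominated by memory traffic, which is the bottleneck the post says Flash-KMeans targets.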
36 replies · 202 reposts · 1.8K likes · 305.8K views

Yaser Martinez retweeted
Chris Tate (@ctatedev)
The "holy shit" moment when I realized agent-browser can control Slack

npx skills add vercel-labs/agent-browser --skill slack
81 replies · 80 reposts · 1.4K likes · 550.7K views

Yaser Martinez retweeted
wevm (@wevm_dev)
Introducing incur – the CLI framework built for agents and humans.

Automatic discovery for agents enabling a guided experience for humans, without compromising tokens & context windows.

» npx incur skills add
15 replies · 46 reposts · 680 likes · 162.3K views

Yaser Martinez retweeted
Ray (@raysan5)
After some months of intensive work, #raylib has finally reached ZERO open issues and ZERO open PRs!!! 🚀

Definitely, it's time to...

Source: github.com/raysan5/raylib
60 replies · 66 reposts · 1.7K likes · 79.7K views

Yaser Martinez retweeted
Graham Neubig (@gneubig)
MiniMax-M2.5 is a surprising new step in open coding models. The first model where I've been able to independently confirm that it's better than the most recent Claude Sonnet. It showed up in our benchmarks below, and in my vibe checks it felt strong and diverse.
Quoted post by OpenHands (@OpenHandsDev):

Big news for open models: @MiniMax_AI M2.5 is out and it’s an excellent+affordable coding model. It ranks 4th in our benchmarks, the first open model to beat Claude Sonnet. Only Claude Opus and GPT-5.2 Codex score higher. Details on scores and limited-time free access below 🧵

3 replies · 6 reposts · 79 likes · 9K views

Yaser Martinez retweeted
Samuel Colvin (@samuelcolvin)
Fuck it, a bit early but here goes: Monty: a new python implementation, from scratch, in rust, for LLMs to run code without host access. Startup time measured in single digit microseconds, not seconds. @mitsuhiko here's another sandbox/not-sandbox to be snarky about 😜 Thanks @threepointone @dsp_ (inadvertently) for the idea. github.com/pydantic/monty
92 replies · 164 reposts · 1.8K likes · 319.5K views

Yaser Martinez retweeted
Claude (@claudeai)
Introducing Claude Opus 4.6. Our smartest model got an upgrade. Opus 4.6 plans more carefully, sustains agentic tasks for longer, operates reliably in massive codebases, and catches its own mistakes. It’s also our first Opus-class model with 1M token context in beta.
1.7K replies · 4.8K reposts · 39.4K likes · 10.6M views