Yaser Martinez

482 posts

Yaser Martinez

@elyase

Ai Shepardeur 🤖🤝📈🤝🪙

Berlin, Germany انضم Haziran 2009

1K يتبع145 المتابعون

Yaser Martinez أُعيد تغريده

chiefofautism@chiefofautism·3d

openai built a model that HIDES personal data in text so nothing leaks i flipped it INSIDE OUT same 1.5B weights, same label taxonomy, but instead of masks you get structured spans, name, email, phone, bank account, address, secrets, char offsets and all point it at logs, dumps, stolen inboxes and it just... returns every private thing in the pile

English

104

132.5K

Yaser Martinez أُعيد تغريده

Cua@trycua·4d

We're open-sourcing Cua Driver - our new macOS driver that lets any agent (Claude Code, Codex, your own loop) drive any app in the background, with true multi-player and multi-cursor built-in. 1/8

English

168

1.6K

204.4K

Yaser Martinez أُعيد تغريده

Artificial Analysis@ArtificialAnlys·4d

GPT-5.5 takes OpenAI back to the clear number one in AI. OpenAI’s new model tops the Artificial Analysis Intelligence Index by 3 points, breaking a three-way tie with Anthropic and Google OpenAI gave us pre-release access to test all five reasoning effort levels: xhigh, high, medium, low and non-reasoning. ➤ OpenAI topping five headline evaluations: GPT-5.5 (xhigh) leads Terminal-Bench Hard, GDPval-AA and our newly hosted APEX-Agents-AA. The model trails only other OpenAI models in CritPt and AA-LCR, and comes second to Gemini 3.1 Pro Preview on three additional evaluations. The largest gains are on AA-Omniscience (+14 pts), our knowledge and hallucination benchmark, and τ²-Bench Telecom (+7 pts), a customer service agent benchmark. ➤ 20% more expensive to run our Intelligence Index: Per-token pricing has doubled from GPT-5.4 to $5/$30 per 1M input/output tokens. However, a ~40% token use reduction largely absorbs the hike - resulting in a net ~+20% cost to run our Intelligence Index. ➤ Effort a clear ladder for balancing intelligence and cost: GPT-5.5 (medium) scores the same as Claude Opus 4.7 (max) on our Intelligence Index at one quarter of the cost (~$1,200 vs $4,800) - although Gemini 3.1 Pro Preview scores the same at a cost of ~$900. GPT-5.5 (low) approximates Claude Opus 4.7 (Non-reasoning, high) on our Intelligence Index at half the cost to run (~$500 vs ~$1 ,000). ➤ Number one in GDPval-AA with an Elo of 1785: GPT-5.5 (xhigh) leads Claude Opus 4.7 (max) by ~30 pts and Gemini 3.1 Pro Preview by ~470 pts. GDPval-AA is Artificial Analysis’ benchmark that leverages OpenAI’s GDPval dataset to evaluate models on real-world economically valuable tasks. ➤ Top AA-Omniscience accuracy, but trailing the frontier on hallucination: Our private AA-Omniscience benchmark rewards factual knowledge across diverse topics, but punishes hallucination. GPT-5.5 (xhigh) has the highest accuracy at 57% - meaning the model can recall facts in the Omniscience corpus more effectively than any other model. However, it has a hallucination rate of 86% - vs Opus 4.7 (max) at 36%, and Gemini 3.1 Pro Preview at 50%. This makes it more likely to answer a question when it does not ‘know’ the answer. The 14 pt gain in AA-Omniscience from GPT-5.4 (xhigh) was largely driven by knowledge, with a modest improvement in hallucination. Congratulations to the team at @OpenAI and @sama on the launch

English

207

1.7K

259.1K

Yaser Martinez أُعيد تغريده

OpenAI@OpenAI·4d

Introducing GPT-5.5 A new class of intelligence for real work and powering agents, built to understand complex goals, use tools, check its work, and carry more tasks through to completion. It marks a new way of getting computer work done. Now available in ChatGPT and Codex.

English

2.4K

51.7K

12.3M

Yaser Martinez أُعيد تغريده

elie@eliebakouch·20 Nis

kimi K2.6 vs K2.5, mythos, opus 4.7, and cursor composer 2 (based on K2.5) on every benchmark i could find tl;dr: it's a really really good model

Kimi.ai@Kimi_Moonshot

Meet Kimi K2.6: Advancing Open-Source Coding 🔹Open-source SOTA on HLE w/ tools (54.0), SWE-Bench Pro (58.6), SWE-bench Multilingual (76.7), BrowseComp (83.2), Toolathlon (50.0), Charxiv w/ python(86.7), Math Vision w/ python (93.2) What's new: 🔹Long-horizon coding - 4,000+ tool calls, over 12 hours of continuous execution, with generalization across languages (Rust, Go, Python) and tasks (frontend, devops, perf optimization). 🔹Motion-rich frontend - Videos in hero sections, WebGL shaders, GSAP + Framer Motion, Three.js 3D. 🔹Agent Swarms, elevated - 300 parallel sub-agents × 4,000 steps per run (up from K2.5's 100 / 1,500). One prompt, 100+ files. 🔹Proactive Agents - K2.6 model powers OpenClaw, Hermes Agent, etc for 24/7 autonomous ops. 🔹Claw Groups (research preview) - bring your own agents, command your friends', bots & humans in the loop. - K2.6 is now live on kimi.com in chat mode and agent mode. For production-grade coding, pair K2.6 with Kimi Code: kimi.com/code - 🔗 API: platform.moonshot.ai 🔗 Tech blog: kimi.com/blog/kimi-k2-6 🔗 Weights & code: huggingface.co/moonshotai/Kim…

English

137

1.7K

147.5K

Yaser Martinez أُعيد تغريده

Yoonho Lee@yoonholeee·15 Nis

We just released code for Meta-Harness! github.com/stanford-iris-… Aside from replicating paper experiments, the repo is designed to help users implement good Meta-Harnesses in completely new domains! Just point your agent at ONBOARDING.md and have a conversation

Yoonho Lee@yoonholeee

How can we autonomously improve LLM harnesses on problems humans are actively working on? Doing so requires solving a hard, long-horizon credit-assignment problem over all prior code, traces, and scores. Announcing Meta-Harness: a method for optimizing harnesses end-to-end

English

165

1.1K

122.5K

Yaser Martinez أُعيد تغريده

OpenAI@OpenAI·9 Nis

Our existing $200 Pro tier still remains our highest usage option. And as a thank you to our existing Pro users on the $200 tier, we’re extending our 2x Codex usage promo (until May 31st) and we’ve reset your Codex rate limits (yes, again).

OpenAI@OpenAI

We’re updating our ChatGPT Pro and Plus subscriptions to better support the growing use of Codex. We’re introducing a new $100/month Pro tier. This new tier offers 5x more Codex usage than Plus and is best for longer, high-effort Codex sessions. In ChatGPT, this new Pro tier still offers access to all Pro features, including the exclusive Pro model and unlimited access to Instant and Thinking models. To celebrate the launch, we’re increasing Codex usage for a limited time through May 31st so that Pro $100 subscribers get up to 10x usage of ChatGPT Plus on Codex to build your most ambitious ideas.

English

372

300

5.3K

824.8K

Yaser Martinez أُعيد تغريده

Rivet@rivet_dev·31 Mar

Say hello to agentOS (beta) A portable open-source OS built just for agents. Powered by WASM & V8 isolates. 🔗 Embedded in your backend ⚡ ~6ms coldstarts, 32x cheaper than sbxs 📁 Mount anything as a file system (S3, SQLite, …) 🥧 Use Pi, Claude Code/Codex/Amp/OpenCode soon

English

1.1K

254.9K

Yaser Martinez أُعيد تغريده

Varun@varun_mathur·23 Mar

Introducing the Agent Virtual Machine (AVM) Think V8 for agents. AI agents are currently running on your computer with no unified security, no resource limits, and no visibility into what data they're sending out. Every agent framework builds its own security model, its own sandboxing, its own permission system. You configure each one separately. You audit each one separately. You hope you didn't miss anything in any of them. The AVM changes this. It's a single runtime daemon (avmd) that sits between every agent framework and your operating system. Install it once, configure one policy file, and every agent on your machine runs inside it - regardless of which framework built it. The AVM enforces security (91-pattern injection scanner, tool/file/network ACLs, approval prompts), protects your privacy (classifies every outbound byte for PII, credentials, and financial data - blocks or alerts in real-time), and governs resources (you say "50% CPU, 4GB RAM" and the AVM fair-shares it across all agents, halting any that exceed their budget). One config. One audit command. One kill switch. The architectural model is V8 for agents. Chrome, Node.js, and Deno are different products but they share V8 as their execution engine. Agent frameworks bring the UX. The AVM brings the trust. Where needed, AVM can also generate zero-knowledge proofs of agent execution via 25 purpose-built opcodes and 6 proof systems, providing the foundational pillar for the agent-to-agent economy. AVM v0.1.0 - Changelog - Security gate: 5-layer injection scanner with 91 compiled regex patterns. Every input and output scanned. Fail-closed - nothing passes without clearing the gate. - Privacy layer: Classifies all outbound data for PII, credentials, and financial info (27 detection patterns + Luhn validation). Block, ask, warn, or allow per category. Tamper-evident hash-chained log of every egress event. - Resource governor: User sets system-wide caps (CPU/memory/disk/network). AVM fair-shares across all agents. Gas budget per agent - when gas runs out, execution halts. No agent starves your machine. - Sandbox execution: Real code execution in isolated process sandboxes (rlimits, env sanitization) or Docker containers (--cap-drop ALL, --network none, --read-only). AVM auto-selects the tier - agents never choose their own sandbox. - Approval flow: Dangerous operations (file writes, shell commands, network requests) trigger interactive approval prompts. 5-minute timeout auto-denies. Every decision logged. - CLI dashboard: hyperspace-avm top shows all running agents, resource usage, gas budgets, security events, and privacy stats in one live-updating screen. - Node.js SDK: Zero-dependency hyperspace/avm package. AVM.tryConnect() for graceful fallback - if avmd isn't running, the agent framework uses its own execution path. OpenClaw adapter example included. - One config for all agents: ~/.hyperspace/avm-policy.json governs every agent framework on your machine. One file. One audit. One kill switch.

English

138

180

1.3K

138.1K

Yaser Martinez أُعيد تغريده

Claude@claudeai·24 Mar

You can now enable Claude to use your computer to complete tasks. It opens your apps, navigates your browser, fills in spreadsheets—anything you'd do sitting at your desk. Research preview in Claude Cowork and Claude Code, macOS only.

English

14.5K

139.6K

77.7M

Yaser Martinez أُعيد تغريده

Miguel Salinas@Vercantez·21 Mar

pi.camelai.dev

ZXX

4.4K

Yaser Martinez أُعيد تغريده

Haocheng Xi@HaochengXiUCB·17 Mar

𝗞-𝗺𝗲𝗮𝗻𝘀 𝗶𝘀 𝘀𝗶𝗺𝗽𝗹𝗲. 𝗠𝗮𝗸𝗶𝗻𝗴 𝗶𝘁 𝗳𝗮𝘀𝘁 𝗼𝗻 𝗚𝗣𝗨𝘀 𝗶𝘀𝗻’𝘁. That’s why we built Flash-KMeans — an IO-aware implementation of exact k-means that rethinks the algorithm around modern GPU bottlenecks. By attacking the memory bottlenecks directly, Flash-KMeans achieves 30x speedup over cuML and 200x speedup over FAISS — with the same exact algorithm, just engineered for today’s hardware. At the million-scale, Flash-KMeans can complete a k-means iteration in milliseconds. A classic algorithm — redesigned for modern GPUs. Paper: arxiv.org/abs/2603.09229 Code: github.com/svg-project/fl…

English

202

1.8K

305.8K

Yaser Martinez أُعيد تغريده

Alvin Sng@alvinsng·17 Mar

x.com/i/article/2028…

ZXX

158

505

5.5K

2.1M

Yaser Martinez أُعيد تغريده

Chris Tate@ctatedev·28 Şub

The "holy shit" moment when I realized agent-browser can control Slack npx skills add vercel-labs/agent-browser --skill slack

English

1.4K

550.7K

Yaser Martinez أُعيد تغريده

wevm@wevm_dev·27 Şub

Introducing 𝚒𝚗𝚌𝚞𝚛 – the CLI framework built for agents and humans. Automatic discovery for agents enabling a guided experience for humans, without compromising tokens & context windows. » npx incur skills add

English

680

162.3K

Yaser Martinez أُعيد تغريده

Ray@raysan5·26 Şub

After some months of intensive work, #raylib has finally reached the ZERO open issues** and ZERO open PRs!!! 🚀 Definitely, it's time to... Source: github.com/raysan5/raylib

English

1.7K

79.7K

Yaser Martinez أُعيد تغريده

Graham Neubig@gneubig·12 Şub

MiniMax-M2.5 is a surprising new step in open coding models. The first model where I've been able to independently confirm that it's better than the most recent Claude Sonnet. It showed up in our benchmarks below, and in my vibe checks it felt strong and diverse.

OpenHands@OpenHandsDev

Big news for open models: @MiniMax_AI M2.5 is out and it’s an excellent+affordable coding model. It ranks 4th in our benchmarks, the first open model to beat Claude Sonnet. Only Claude Opus and GPT-5.2 Codex score higher. Details on scores and limited-time free access below 🧵

English

Yaser Martinez أُعيد تغريده

Samuel Colvin@samuelcolvin·6 Şub

Fuck it, a bit early but here goes: Monty: a new python implementation, from scratch, in rust, for LLMs to run code without host access. Startup time measured in single digit microseconds, not seconds. @mitsuhiko here's another sandbox/not-sandbox to be snarky about 😜 Thanks @threepointone @dsp_ (inadvertently) for the idea. github.com/pydantic/monty

English

164

1.8K

319.5K

Yaser Martinez أُعيد تغريده

th0rgal@Th0rgal_·6 Şub

x.com/i/article/2013…

ZXX

32K

Yaser Martinez أُعيد تغريده

Claude@claudeai·5 Şub

Introducing Claude Opus 4.6. Our smartest model got an upgrade. Opus 4.6 plans more carefully, sustains agentic tasks for longer, operates reliably in massive codebases, and catches its own mistakes. It’s also our first Opus-class model with 1M token context in beta.

English

1.7K

4.8K

39.4K

10.6M

اكتشف

@OpenAI @sama @mitsuhiko @threepointone @dsp_ @elonmusk @BarackObama @taylorswift13