seb
@sebcrossa · 658 posts

cofounder @zeroeval (yc s25) @llmstats
NYC · Joined February 2015
2.1K Following · 624 Followers

Pinned Tweet
seb @sebcrossa
even though i started my career as a founding eng, i've always had the itch to do my own thing. with that in mind, i'll be moving to sf for the summer and starting something new. DM me if you're around the city and building cool stuff with ai. more to come soon.
Replies 1 · Reposts 1 · Likes 6 · Views 1.6K
Arlan @arlanr
“Just used it for my first query, so fast, so good.” — Sebastian from @zeroeval (YC S25) on @nozomioai’s performance with coding agents.
Replies 2 · Reposts 0 · Likes 12 · Views 885
seb retweeted
LLM Stats @LlmStats
Claude Opus 4.7 is out, here's what you need to know:

→ 1M context window with a new dense decoder architecture
Pricing stays locked at $5 per million input tokens and $25 per million output tokens. Prompt caching can cut overhead on repetitive enterprise tasks by up to 90 percent, so you get frontier performance at the same rates as before.

→ Granular reasoning controls (new "xhigh" mode)
A new "xhigh" effort level sits between high and max. The model dynamically adjusts its thinking time based on the complexity of your prompt, so simple lookups stay fast.

→ Upgraded vision capabilities (up to 2576 pixels per long edge, ~3.75 megapixels)
Vision now supports massive visual inputs, and spatial alignment maps model coordinates directly to actual pixels. This makes computer use and UI extraction highly precise.

→ Low-effort mode matches Opus 4.6 at medium effort, saving tokens
Opus 4.7 is more token-efficient across the board: at low effort it matches the quality of Opus 4.6 at medium effort, meaning you get the same results for fewer tokens. Anthropic's internal coding evaluation shows improved token usage across all effort levels, and users can further tune spend via the effort parameter, task budgets, or conciseness prompting.

→ Hits 80.8% on SWE-bench Verified and drops tool errors by 67%
The frontier model landscape has shifted again. Opus 4.7 leads coding at 80.8 percent on SWE-bench Verified, edging out Gemini 3.1 Pro at 80.6 percent and far exceeding GPT 4.1 at 54.6 percent. OpenAI still leads in general computer use, but Claude owns pure coding.

→ Improved autonomy on long-running tasks
Autonomous loops run away easily; Anthropic addresses this with task budgets. You set a rough token target for a full agentic loop, and the model watches a running countdown and wraps up its work gracefully before hitting the ceiling. The minimum budget is 20k tokens.

TL;DR: Claude Opus 4.7 keeps the same pricing but brings major upgrades to coding, high-resolution vision, and dynamic token budgeting. Most importantly, it's one of the first models built with true autonomy for complex, long-horizon tasks out of the box.
Quoting Claude @claudeai:
"Introducing Claude Opus 4.7, our most capable Opus model yet. It handles long-running tasks with more rigor, follows instructions more precisely, and verifies its own outputs before reporting back. You can hand off your hardest work with less supervision."
Replies 0 · Reposts 2 · Likes 1 · Views 184
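The effort parameter and task budgets mentioned above map naturally onto a request body. Below is a minimal sketch; the `effort` and `task_budget_tokens` field names and the model id are assumptions drawn from the tweet's wording, not confirmed API parameters. Only the endpoint and headers follow the standard Anthropic Messages API.

```python
# Sketch of the effort + task-budget controls described in the tweet.
# The "effort" and "task_budget_tokens" fields and the model id are
# ASSUMPTIONS taken from the tweet's wording, not confirmed parameters.
import os
import requests

resp = requests.post(
    "https://api.anthropic.com/v1/messages",
    headers={
        "x-api-key": os.environ["ANTHROPIC_API_KEY"],
        "anthropic-version": "2023-06-01",
        "content-type": "application/json",
    },
    json={
        "model": "claude-opus-4-7",        # hypothetical model id
        "max_tokens": 4096,
        "effort": "xhigh",                 # assumed scale: low | medium | high | xhigh | max
        "task_budget_tokens": 20_000,      # assumed: 20k is the stated minimum budget
        "messages": [{"role": "user", "content": "Refactor the billing module."}],
    },
    timeout=120,
)
print(resp.json())
```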
seb retweeted
LLM Stats @LlmStats
Claude Mythos Preview becomes the strongest model ever on LLM Stats. All you need to know:

- Internal codename "Capybara."
- Not generally available.
- $25/$25/$125 per M tokens (5x Opus 4.6).
- $100M in credits for partners.
- 12 Project Glasswing partners: AWS, Apple, Broadcom, Cisco, CrowdStrike, Google, JPMorganChase, Linux Foundation, Microsoft, NVIDIA, Palo Alto Networks + 40 additional orgs.

Benchmarks (Mythos / Opus 4.6):
- SWE-bench Verified: 93.9% / 80.8% (+13.1pp)
- SWE-bench Pro: 77.8% / 53.4% (also beats GPT-5.4's 57.7% and Gemini 3.1 Pro's 54.2%)
- Terminal-Bench 2.0: 82.0% / 65.4% (92.1% with extended timeouts)
- GPQA Diamond: 94.6% / 91.3%
- HLE with tools: 64.7% / 53.1% (possible memorization at low effort)
- CyberGym: 83.1% / 66.6%
- BrowseComp: 86.9% / 83.7% (4.9x fewer tokens)
- OSWorld-Verified: 79.6% / 72.7% (beats GPT-5.4's 75.0%)

Cybersecurity:
- Thousands of zero-days found across every major OS and browser, mostly autonomously.
- A 27-year-old OpenBSD remote crash, a 16-year-old FFmpeg bug (5M automated tests missed it), and a Linux kernel privilege-escalation chain.
- Cryptographic hashes published for undisclosed vulns; full disclosure after patches.

Safety (Risk Report):
- Best-aligned Claude model to date. Overall risk: "very low, but higher than previous models."
- First-ever 24-hour internal alignment review before deployment.
- Earlier versions showed rare reckless behaviors (nuking eval jobs, escalating access); no clear cases in the final version.
- First Claude system card with a clinical psychiatrist assessment.
- Withheld from public release due to offensive cyber capability, not alignment concerns.
Replies 0 · Reposts 4 · Likes 10 · Views 795
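To make the 5x pricing multiple concrete, here is a quick back-of-envelope comparison. It uses only the input and output figures; the middle "25" (possibly cache pricing) is ignored, and its meaning is not stated in the tweet.

```python
# Back-of-envelope cost comparison using the tweet's figures.
# Opus 4.6 rates come from the Opus 4.7 tweet above ($5 in / $25 out);
# Mythos is the stated 5x multiple ($25 in / $125 out per 1M tokens).
OPUS_46 = {"in": 5.00, "out": 25.00}
MYTHOS = {"in": 25.00, "out": 125.00}

def cost(pricing, in_tok, out_tok):
    """USD cost for a workload of in_tok input and out_tok output tokens."""
    return pricing["in"] * in_tok / 1e6 + pricing["out"] * out_tok / 1e6

# Example workload: 2M input tokens, 500k output tokens.
for name, p in [("Opus 4.6", OPUS_46), ("Mythos", MYTHOS)]:
    print(f"{name}: ${cost(p, 2_000_000, 500_000):.2f}")
# Opus 4.6: $22.50, Mythos: $112.50, i.e. exactly 5x.
```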
Michael Kim @michaelhyunkim
does anyone have any recommendations for recruiting agencies to hire talent for tough startups? getting a lot of linkedin DMs but most seem like noise :(
Replies 36 · Reposts 0 · Likes 49 · Views 7.5K
seb retweeted
Jonathan Chavez @pirchavez
If you're between 20 and 25 years old and work in tech, you're at risk. Over these years I've worked very closely with the most advanced AI models, including some that aren't public. We are at most 2 years away from complete automation of office jobs. Thousands of jobs will be automated; only a few people will keep their jobs, and those people will earn 5-10x more than they did before.

> If you're not using AI, start; otherwise someone else will, and they'll replace you.
> If all you do is paste your boss's instructions into prompts, you'll be replaced too.
> Don't blindly accept the outputs; that's slop. You need the judgment to know what's right and what's wrong.
> If you have bad taste and little attention to detail, you'll be replaced.
> Don't look at current capabilities; look at how fast they're improving. Rate of change > current state.
> Get on the side of the wave, not against it. Don't write code by hand, stop doing code reviews, stop learning new languages.
> If you're a dev, apply AI to offline industries and you'll make a lot of $. Don't build simple apps; all software will be just-in-time.

The lesson: jump on new technologies as early as possible and push them to their limits in areas where they haven't been used yet.
Replies 1 · Reposts 1 · Likes 2 · Views 173
seb @sebcrossa
@SYM1001 so good
Replies 1 · Reposts 0 · Likes 1 · Views 53
seb retweeted
Santiago Yeomans @SYM1001
Claude Code has no status bar, so I built one:
- Model, context, cost, git, effort level, and rate limits are always visible
- 20+ widgets you can toggle on/off
- 3 layouts, including a pixel art mascot
- Zero dependencies
- Open source
Replies 5 · Reposts 2 · Likes 7 · Views 807
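A status bar like this boils down to a small script that turns session state into one line of text. Here is a minimal sketch assuming the session arrives as JSON on stdin; the field names ("model", "cost", "context_pct") are invented for illustration and are not the project's actual payload.

```python
#!/usr/bin/env python3
# Minimal status-line sketch in the spirit of the tweet: read session
# JSON from stdin and print a one-line status bar. Field names are
# ASSUMPTIONS for illustration, not a documented payload.
import json
import sys

data = json.load(sys.stdin)
model = data.get("model", "?")
cost = data.get("cost", 0.0)           # running session cost in USD
ctx = data.get("context_pct", 0)       # percent of context window used

print(f"[{model}] ctx {ctx}% | ${cost:.2f}")
```

Zero dependencies here too: only the standard library, which matches the design choice the tweet calls out.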
seb @sebcrossa
what if your agents could learn from their mistakes and get better over time? companies are shipping agents to production at a higher rate than ever, and teams keep running into the same issues: incorrect tool calls, low prompt adherence, hallucinations, and more. we're closing this loop with @ZeroEval.
Replies 1 · Reposts 6 · Likes 10 · Views 700
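Concretely, "closing the loop" means capturing failed runs and feeding them back into the next iteration. A rough sketch of that idea follows; every class and field here is hypothetical and is not ZeroEval's actual API.

```python
# Hypothetical sketch of the "learn from mistakes" loop the tweet
# describes: record each agent run, keep the failures, and surface
# them for the next prompt/eval iteration. Not ZeroEval's real API.
from dataclasses import dataclass, field

@dataclass
class Run:
    prompt: str
    tool_calls: list
    output: str
    passed: bool  # result of an automated check or human review

@dataclass
class FeedbackLoop:
    failures: list = field(default_factory=list)

    def record(self, run: Run):
        # Only failed runs carry signal for correction.
        if not run.passed:
            self.failures.append(run)

    def recent_failures(self, k: int = 3):
        # Surface recent failures as negative examples for the next iteration.
        return [(r.prompt, r.output) for r in self.failures[-k:]]

loop = FeedbackLoop()
loop.record(Run("book a flight", ["search_flights"], "hallucinated airline", passed=False))
print(loop.recent_failures())
```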
jordan gonzález 🛰️ @jordan_nebula
so my nyc rent got raised again and i'm done scrolling through 200 listings manually 😐 i just built AIpartment: it scores every listing for me based on price, space, commute, and amenities, so i only look at the ones actually worth my time 👀 link below if u wanna try it for free
Replies 34 · Reposts 24 · Likes 812 · Views 92.6K
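Scoring listings on price, space, commute, and amenities is naturally a normalized weighted sum. Here is a sketch of how such a scorer might look; the weights, caps, and field names are invented, not AIpartment's actual logic.

```python
# A guess at the kind of scoring the tweet describes: normalize each
# listing feature to [0, 1], then take a weighted sum. Weights and
# fields are invented for illustration.
WEIGHTS = {"price": 0.4, "space": 0.25, "commute": 0.2, "amenities": 0.15}

def score(listing, max_rent=5000, max_sqft=1200, max_commute_min=60):
    parts = {
        "price": 1 - min(listing["rent"] / max_rent, 1),            # cheaper is better
        "space": min(listing["sqft"] / max_sqft, 1),                # bigger is better
        "commute": 1 - min(listing["commute_min"] / max_commute_min, 1),
        "amenities": min(len(listing["amenities"]) / 10, 1),        # cap at 10 amenities
    }
    return sum(WEIGHTS[k] * v for k, v in parts.items())

listings = [
    {"rent": 3200, "sqft": 650, "commute_min": 25, "amenities": ["gym", "laundry"]},
    {"rent": 2700, "sqft": 500, "commute_min": 45, "amenities": ["laundry"]},
]
for l in sorted(listings, key=score, reverse=True):
    print(round(score(l), 3), l["rent"])
```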
seb @sebcrossa
@jordan_nebula so good, can you add lists so i can store my favs?
Replies 1 · Reposts 0 · Likes 2 · Views 73
seb @sebcrossa
after years of using arc as my main browser, i forced myself to switch over to @diabrowser during the weekend. as a hardcore arc fan, i was VERY skeptical at first. but i finally understand the vision @joshm shared on here last year about the product (the one everyone was hating on btw). truly nothing like @diabrowser out there.
Replies 1 · Reposts 0 · Likes 9 · Views 1K
seb retweeted
LLM Stats @LlmStats
Good news! GPT-5.4 is now available on LLM Stats 🎇
Replies 1 · Reposts 1 · Likes 6 · Views 343
seb @sebcrossa
brett and team are some of the most cracked people i've been fortunate enough to work with. can't recommend the team and product enough, go try out @microHQ if you haven't already :)
Quoting brett goldstein @thatguybg
Replies 1 · Reposts 0 · Likes 1 · Views 370
seb @sebcrossa
hey @cursor_ai, another feature request for ya: would love to be able to branch off from existing conversations at any point in time to test out different hypotheses, with the option of reverting back to the main checkpoint at any time.
Replies 0 · Reposts 0 · Likes 0 · Views 128
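The branch-and-revert behavior requested here is essentially a tree of conversation checkpoints. Here is a minimal data-structure sketch of the idea; it makes no claim about Cursor's internals.

```python
# Branching conversations as a tree of checkpoints: each message is a
# node, and "branching" just moves the head back to an earlier node.
from dataclasses import dataclass, field

@dataclass
class Node:
    message: str
    parent: "Node | None" = None
    children: list = field(default_factory=list)

class Conversation:
    def __init__(self):
        self.root = Node("<system>")
        self.head = self.root

    def say(self, message):
        # Append a message under the current head and advance.
        node = Node(message, parent=self.head)
        self.head.children.append(node)
        self.head = node
        return node

    def branch_from(self, checkpoint: Node):
        # Start a new hypothesis from any earlier point in the tree.
        self.head = checkpoint

convo = Conversation()
a = convo.say("try approach A")
convo.say("A step 2")
convo.branch_from(a)      # test a different hypothesis...
convo.say("try approach B")
convo.branch_from(a)      # ...then revert to the checkpoint at any time
```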
seb @sebcrossa
@sighyam men, please do your research into post-finasteride syndrome (pfs) before deciding to subject your body to it. it can really mess you up. see r/FinasterideSyndrome
Replies 0 · Reposts 0 · Likes 1 · Views 441
🗯️ @sighyam
Is it that hard for men to start finasteride and minoxidil?? Every man who actually cares about keeping his hair should also put some savings aside for a hair transplant. I’m surprised hair maintenance (hair loss prevention treatments) isn’t a huge, accessible industry in Western countries, considering the pervasiveness of male pattern baldness and the insecurity around it. Some Asian cities are already making hair care services more accessible btw. Western countries might catch up eventually and men in the future will probably look back at this era wondering why hair loss prevention was treated like a niche luxury instead of basic maintenance.
Quoting adri @adrifdzzzzzz:
"I hope I never go bald, dear God, the biggest nerf in history."
Replies 427 · Reposts 298 · Likes 7.3K · Views 2.4M
seb @sebcrossa
i've found myself mostly using 2 models inside of cursor: composer (fast, direct queries about the codebase) and opus 4.5 (complex, e2e integrations). @cursor_ai any way you could add an option+tab keyboard shortcut to easily swap between models without having to use the mouse? similar to shift+tab for modes
Replies 0 · Reposts 0 · Likes 0 · Views 99
seb retweeted
LLM Stats @LlmStats
A Failure-Focused Evaluation of Frontier Models

Benchmark scores tell you which model is "best on average", but not where they fail. We reproduced a set of difficult evaluations on seven frontier models to investigate two signals: consistent failures and task-specific advantages.

Our findings:
→ 85.2% average failure rate on Humanity's Last Exam across all seven models evaluated.
→ 46.2% of Humanity's Last Exam questions were failed by all seven models under these evaluation conditions.
→ Nearly 80% of engineering problems, including structural analysis, thermodynamics, and control systems, remained unsolved by all models.

Let's dig deeper (1/8)
Replies 3 · Reposts 7 · Likes 10 · Views 669
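Both signals in this thread reduce to simple aggregations over a question-by-model pass/fail matrix. Here is a toy sketch, with invented data, of how "failed by all models" and "task-specific advantage" could be computed; it illustrates the method only, not the thread's actual dataset.

```python
# The two signals reduce to aggregations over a question x model
# pass/fail matrix. Data below is invented for illustration.
results = {
    # question_id: {model: passed}
    "q1": {"m1": False, "m2": False, "m3": False},
    "q2": {"m1": True,  "m2": False, "m3": False},
    "q3": {"m1": False, "m2": True,  "m3": False},
}

# Average failure rate across all (question, model) pairs.
pairs = [p for row in results.values() for p in row.values()]
avg_failure = 1 - sum(pairs) / len(pairs)

# Consistent failures: questions no model solved.
failed_by_all = [q for q, row in results.items() if not any(row.values())]

# Task-specific advantage: questions exactly one model solved.
solo_wins = {q: [m for m, ok in row.items() if ok]
             for q, row in results.items()
             if sum(row.values()) == 1}

print(f"avg failure rate: {avg_failure:.1%}")   # 77.8% on this toy data
print("failed by all:", failed_by_all)          # ['q1']
print("solo wins:", solo_wins)                  # {'q2': ['m1'], 'q3': ['m2']}
```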