kerimkaya

1.6K posts

@kerimrocks

Beauty and craft in the coming abundance of software. Co-creating Kai, the Continuous Codebase Engineer @driaforall

Joined June 2017

1.1K Following · 1.8K Followers

kerimkaya reposted

andthattoo@andthatto·1d

Qwen 3.6 is frontier for local. It also thinks forever. I tried a dumb inference-time trick: make its <think> block obey a tiny grammar. Result:
- HumanEval+: 22x fewer think tokens, no accuracy loss
- LiveCodeBench public slice: +14% pass@1, ~5x fewer total tokens

1.3K

129.4K

kerimkaya@kerimrocks·3d

Smart move by OpenAI. They’re now giving verified security researchers access to cyber-permissive models (fewer refusals, deeper capability) through a trust tier system. Critical infrastructure organizations gain access to dedicated cyber models. Meanwhile, Anthropic doubled down on restrictions with its safety policy. The problem with over-refusing is you don’t just block attackers, you block defenders too. And defenders are the ones who actually need the capability. @sama seems to have figured this out faster.

OpenAI@OpenAI

Introducing GPT-5.5 A new class of intelligence for real work and powering agents, built to understand complex goals, use tools, check its work, and carry more tasks through to completion. It marks a new way of getting computer work done. Now available in ChatGPT and Codex.

kerimkaya@kerimrocks·3d

2021: Crypto created abundant supply but not enough demand
2026: AI created abundant demand but not enough supply
2034: Robotics will ______

kerimkaya@kerimrocks·3d

She isn't real, she's reciting a script she half-learned, and the seams are showing. AI this, AI that: the same recycled noise from people who've never built anything that breaks at 3am and has to be fixed by morning. That's why I can't wait for AI to replace mids like her: strip the script and there's nothing underneath. Because real entrepreneurship (real 996) isn't the bullshit she's spouting. It's pouring your heart and soul into work that evolves you, built on incentive and reward systems that don't drain others' lives but compound them. That's the real game.

Yatırımcı Yazılımcı@yatirimci

I just fell into a parallel universe where modern slavery is marketed in "utopia" packaging. We are facing a mindset that shelves work-life balance entirely and proudly presents 18-hour workdays, seven days a week, as "dedication." Forgetting your private life, health, and family and wasting your life on someone else's dream is not visionary; it is plainly office-tower shackles. Run from toxic work cultures like this, which legitimize exploiting your labor with the fairy tale of "we are changing the world with great passion," and don't even look back.

174

kerimkaya@kerimrocks·Apr 15

1.5 years ago I sat across from a VC who kept pushing me on a competitor that had way more GitHub stars than us. I told him two things. First: "Ranking us below them because of a star count isn't first principles thinking. That's pattern matching on a vanity metric." Second: "Have you actually run their code? I have. It doesn't work. Half the examples fail. Go look at their PRs, nobody's responding. You cannot get organic reactions from devs like that. They're farming stars to raise a round." He wasn't convinced in the room. To his credit, he went and looked, and we ended up getting a term sheet from him. A year and a half later, and VCs are STILL eating this bait.

Andras Bacsai@heyandras

wtf

158

kerimkaya@kerimrocks·Apr 14

Love this work, and the efficiency is wild. One thing I keep thinking about though: Claude got really good at climbing PGR, but PGR is a proxy for alignment, not the real thing. When the thing being measured is smarter than the people who wrote the measure, climbing faster can look identical to gaming it. Feels like the real frontier now is evaluating the evaluator.

Jan Leike@janleike

New research result: we use Claude to make fully autonomous progress on scalable oversight research, as measured by performance gap recovered (PGR). Claude iterates on a number of different techniques and ends up significantly outperforming human researchers for $18k in credits.

228

kerimkaya@kerimrocks·Apr 14

The consensus take is that coding agents get better by sitting closer to the developer: faster autocomplete, tighter IDE loop, more context in the sidebar. But the actual step function is in the opposite direction: async agents you wake up to, where a vulnerability has already been patched with the exploit that proved it, a hot path has already been rewritten against a real benchmark, and a year of mixed-tool drift has already been reconciled into one coherent style, each one landing as a verified pull request with the eval trace attached. The teams still optimizing the inner loop are going to lose to the ones who treated the whole thing as a single system from day one.

Ashpreet Bedi@ashpreetbedi

New post: Systems Engineering Coding agents have lowered the barrier to writing code, but they haven't lowered the requirements of production software. Agentic software is just software. The agent replaces business logic. Everything else is the same. ashpreetbedi.com/articles/syste…

196

kerimkaya@kerimrocks·Apr 14

Two lines from OpenAI's harness team that I cannot get out of my head: "Human taste is captured once, then enforced continuously on every line of code." "Your taste keeps working while you sleep."

kerimkaya@kerimrocks·Apr 11

@ClementDelangue @QuixiAI The pieces are already sitting on the table: permissionless compute, open models, and open harnesses. I wrote up what I think can be done with them and why open source and distributed AI have to start moving together.

kerimkaya@kerimrocks

x.com/i/article/2042…

1.8K

clem 🤗@ClementDelangue·Apr 10

Should we start an open Glasswing?

126

1.2K

93.4K

kerimkaya@kerimrocks·Apr 10

@the_smart_ape Here is the solution:

kerimkaya@kerimrocks

x.com/i/article/2042…

The Smart Ape 🔥@the_smart_ape·Apr 9

x.com/i/article/2042…

197

286.9K

kerimkaya@kerimrocks·Apr 10

The most capable vulnerability researcher ever built lives inside one company. Forty organizations have a key. The rest of the internet has a blog post. The problem is too big for any one team to be precious about it. Open source and distributed AI have to unite.

kerimkaya@kerimrocks

x.com/i/article/2042…

186

kerimkaya@kerimrocks·Apr 10

x.com/i/article/2042…

2.1K

kerimkaya@kerimrocks·Apr 9

Train or be trained on. Overheard in SF lately. Smart teams are distilling frontier models for their niche at full speed, racing to bank data liquidity before the gate drops.

martin_casado@martin_casado

It's only a matter of time before only the model creators have access to the most powerful models. The rest get access to smaller, distilled versions. Or access the models through first party apps and services that don't provide direct access to the token path. The investment needs for training are too high, and distillation too effective to warrant any other future.

139

kerimkaya@kerimrocks·Apr 7

One man now holds more zero-days in your OS than any spy agency on Earth. He handed them to twelve companies you did not vote for. The only reason that is good news is a position the industry spent two years calling him a coward for. The shape of the power is the news.

kerimkaya@kerimrocks·Apr 7

Open source just had its security argument flip. And closed source got one back for the first time in twenty years.

"Many eyes make all bugs shallow" was always half-fiction. FFmpeg had many eyes for sixteen years and a reasoner found the bug on first read. What OSS actually gained this month is not more eyes. It is legibility to a frontier model. End to end, every line, in one pass, for the price of a flight.

That cuts both ways and the cut is sharp. The defender scans the whole graph. The attacker scans the whole graph. Same model, same weekend, different intent. Whoever runs the loop faster wins the window. Anthropic bought open source a 90-day head start and called it a consortium.

Meanwhile proprietary code became dark matter again. You cannot scan what you cannot exfiltrate. For the first time since the 90s, "we do not publish our source" is a defensible security posture instead of a punchline.

The second-order move is the one I keep thinking about. If your source is closed, the attacker has to compromise your scanner to see your code. The model becomes the control plane. Attacking the reasoner becomes more valuable than attacking the target, because the reasoner sees every target at once. Model monoculture is the new monoculture.

Open source is not losing here. It is entering a regime where survival depends on running the same loop the attackers will run, on the same schedule, with the same weights class, forever. The maintainers who build that loop keep their projects. The ones who do not, will not notice when they lose them.

Anthropic@AnthropicAI

Introducing Project Glasswing: an urgent initiative to help secure the world’s most critical software. It’s powered by our newest frontier model, Claude Mythos Preview, which can find software vulnerabilities better than all but the most skilled humans. anthropic.com/glasswing

237

kerimkaya@kerimrocks·Apr 7

“After only one week with AlphaEvolve, we halved our compute costs. After a month of AlphaEvolve co-designing algorithms with our team, our compute costs for the same optimization were reduced by 97%, our equivalent memory usage was 74% lower, and runtime was reduced by 680%.”

Substrate@substrate

Over the past few months, we have integrated @googledeepmind's AlphaEvolve into our computational lithography. Enabled by AlphaEvolve's algorithmic leaps, we are now printing complex patterns in a single exposure that would otherwise require multiple. substrate.com/information-to…

335

kerimkaya@kerimrocks·Apr 4

The bet is that composing well-known patterns beats generating arbitrary code when the search space is large and failure modes are systematic. Open source, full reproduction in the repo. Learn more: andthattoo.dev/blog/scaffold

kerimkaya@kerimrocks·Apr 4

The key difference from a coding agent is what gets wasted. Every candidate Scaffold proposes is type-checked before a single LLM call runs. No budget on broken imports, type errors, or runtime crashes. All 9 candidates were valid and semantically unique.
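A minimal sketch of the general idea in this post (reject ill-typed candidates before spending any model budget), not Scaffold's actual DSL or API. The `Stage` class, the type names, and both helper functions are hypothetical and exist only for illustration; types are modeled as plain strings to keep the example small.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    """One pipeline step with declared input and output types (hypothetical)."""
    name: str
    input_type: str
    output_type: str


def is_well_typed(pipeline, source_type, target_type):
    """Return True if each stage accepts what the previous stage produces."""
    current = source_type
    for stage in pipeline:
        if stage.input_type != current:
            return False
        current = stage.output_type
    return current == target_type


def filter_candidates(candidates, source_type, target_type):
    """Keep only candidates that pass the static check; no model is called here."""
    return [p for p in candidates if is_well_typed(p, source_type, target_type)]


retrieve = Stage("retrieve", "Question", "Documents")
answer = Stage("answer", "Documents", "Answer")
candidates = [
    [retrieve, answer],  # composes: Question -> Documents -> Answer
    [answer, retrieve],  # rejected: "answer" expects Documents, gets Question
]
valid = filter_candidates(candidates, "Question", "Answer")
```

Only the pipelines left in `valid` would go on to an expensive evaluation step, so a malformed candidate costs a cheap static check rather than an LLM call.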

kerimkaya@kerimrocks·Apr 4

Claude Code and Cursor improving themselves is just the beginning. The next step is LLM pipelines that evolve their own topology. @andthatto built a typed DSL for self-evolving AI. 77.1% vs Meta-Harness at 48.6%, with 4x fewer candidates.

Matthew Berman@MatthewBerman

Claude Code and Cursor... but they improve themselves. Autonomously. Meta Harness is wild. Had to make a video about it...

561
