Bill Demirkapi

1.2K posts

@BillDemirkapi

security

Boston, MA · Joined July 2017
295 Following · 21.9K Followers
Pinned Tweet
Bill Demirkapi @BillDemirkapi
Just Published 👉 Secrets and Shadows: Leveraging Big Data for Vulnerability Discovery at Scale! Impacted orgs include CrowdStrike, Samsung, Google, Amazon, the NY Times, and many, many more. billdemirkapi.me/leveraging-big…
5 replies · 20 reposts · 81 likes · 19.2K views
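The linked post itself isn't reproduced in this thread, so as a rough illustration of what "leveraging big data for vulnerability discovery" can mean in the secrets-scanning sense, here is a minimal sketch; the patterns, function names, and sample input are assumptions for illustration, not taken from the article:

```python
import re

# Illustrative only: scan text for a couple of well-known credential
# patterns (AWS access key IDs, PEM private-key headers). A real pipeline
# would stream these checks over a large corpus of public data.
SECRET_PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "private_key_block": re.compile(r"-----BEGIN (?:RSA |EC )?PRIVATE KEY-----"),
}

def scan_for_secrets(text: str) -> list[tuple[str, str]]:
    """Return (pattern_name, matched_text) pairs found in `text`."""
    hits = []
    for name, pattern in SECRET_PATTERNS.items():
        for match in pattern.findall(text):
            hits.append((name, match))
    return hits

if __name__ == "__main__":
    sample = "config = {'key': 'AKIAABCDEFGHIJKLMNOP'}"  # fake key for demo
    print(scan_for_secrets(sample))
```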
kache @yacineMTB
prediction: someone is going to get a coding AI like codex to automate turning existing steam video games into harnesses, come up with architecture to parallelize the games themselves in a manner that is conducive for RL training, and train an RL demigod model
35 replies · 12 reposts · 442 likes · 19.8K views
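A minimal sketch of the harness layer the tweet predicts, assuming a Gym-style step/reset interface; `GameHarness`, `StepResult`, and all internals here are hypothetical stand-ins rather than any real project:

```python
# Sketch of wrapping an existing game behind an RL-friendly interface.
# A real harness would drive the actual game process (input injection,
# frame capture, save-state forking); everything below is a stand-in.
from dataclasses import dataclass

@dataclass
class StepResult:
    observation: bytes   # e.g. an encoded frame
    reward: float
    done: bool

class GameHarness:
    """Gym-style wrapper around one game instance."""

    def __init__(self, rom_path: str):
        self.rom_path = rom_path
        self.state = 0  # stand-in for real emulator/process state

    def reset(self) -> bytes:
        self.state = 0
        return b"frame0"

    def step(self, action: int) -> StepResult:
        self.state += 1
        # Reward shaping would come from game memory / score parsing.
        return StepResult(observation=b"frame", reward=0.0,
                          done=self.state >= 1000)

# The parallelization the tweet imagines: many harnesses stepped in batches.
envs = [GameHarness("game.rom") for _ in range(8)]
observations = [env.reset() for env in envs]
```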
Bill Demirkapi reposted
Christos Tzamos @ChristosTzamos
1/4 LLMs solve research-grade math problems but struggle with basic calculations. We bridge this gap by turning them into computers. We built a computer INSIDE a transformer that can run programs for millions of steps in seconds, solving even the hardest Sudokus with 100% accuracy
239 replies · 785 reposts · 5.9K likes · 1.6M views
Bill Demirkapi reposted
Andrej Karpathy @karpathy
Three days ago I left autoresearch tuning nanochat for ~2 days on a depth=12 model. It found ~20 changes that improved the validation loss. I tested these changes yesterday and all of them were additive and transferred to larger (depth=24) models. Stacking up all of these changes, today I measured that the leaderboard's "Time to GPT-2" drops from 2.02 hours to 1.80 hours (~11% improvement); this will be the new leaderboard entry. So yes, these are real improvements and they make an actual difference. I am mildly surprised that my very first naive attempt already worked this well on top of what I thought was already a fairly manually well-tuned project.

This is a first for me because I am very used to doing the iterative optimization of neural network training manually. You come up with ideas, you implement them, you check if they work (better validation loss), you come up with new ideas based on that, you read some papers for inspiration, etc. This has been the bread and butter of what I do daily for two decades. Seeing the agent do this entire workflow end-to-end, all by itself, as it worked through approx. 700 changes autonomously is wild. It really looked at the sequence of results of experiments and used that to plan the next ones. It's not novel, ground-breaking "research" (yet), but all the adjustments are "real": I didn't find them manually previously, and they stack up and actually improved nanochat. Among the bigger things, e.g.:

- It noticed an oversight that my parameterless QKnorm didn't have a scalar multiplier attached, so my attention was too diffuse. The agent found multipliers to sharpen it, pointing to future work.
- It found that the Value Embeddings really like regularization and I wasn't applying any (oops).
- It found that my banded attention was too conservative (I forgot to tune it).
- It found that the AdamW betas were all messed up.
- It tuned the weight decay schedule.
- It tuned the network initialization.

This is on top of all the tuning I've already done over a good amount of time. The exact commit is here, from this "round 1" of autoresearch. I am going to kick off "round 2", and in parallel I am looking at how multiple agents can collaborate to unlock parallelism. github.com/karpathy/nanoc…

All LLM frontier labs will do this. It's the final boss battle. It's a lot more complex at scale of course: you don't just have a single train.py file to tune. But doing it is "just engineering" and it's going to work. You spin up a swarm of agents, you have them collaborate to tune smaller models, you promote the most promising ideas to increasingly larger scales, and humans (optionally) contribute on the edges. And more generally, *any* metric you care about that is reasonably efficient to evaluate (or that has more efficient proxy metrics, such as training a smaller network) can be autoresearched by an agent swarm. It's worth thinking about whether your problem falls into this bucket too.
[image attached]
961 replies · 2.1K reposts · 19.3K likes · 3.5M views
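A schematic of the loop described above (propose a change, run a cheap training job, keep whatever lowers validation loss, repeat), written as toy Python; `propose_change` and `train_and_eval` are hypothetical stand-ins for the agent and a nanochat training run, not Karpathy's actual setup:

```python
# Toy autoresearch loop: greedy hill-climbing on validation loss.
import random

def propose_change(config: dict, history: list) -> dict:
    """Stand-in for the agent: mutate one hyperparameter.

    A real agent would read `history` and plan the next experiment."""
    candidate = dict(config)
    key = random.choice(list(candidate))
    candidate[key] *= random.choice([0.5, 0.9, 1.1, 2.0])
    return candidate

def train_and_eval(config: dict) -> float:
    """Stand-in for a short training run returning validation loss.

    Toy objective: pretend loss is minimized at weight_decay=0.1, beta2=0.95."""
    return (config["weight_decay"] - 0.1) ** 2 + (config["beta2"] - 0.95) ** 2

config = {"weight_decay": 0.3, "beta2": 0.99}
best_loss = train_and_eval(config)
history = []

for step in range(700):  # ~700 autonomous changes, as in the tweet
    candidate = propose_change(config, history)
    loss = train_and_eval(candidate)
    history.append((candidate, loss))
    if loss < best_loss:  # keep only changes that improve validation loss
        config, best_loss = candidate, loss

print(config, best_loss)
```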
Bill Demirkapi reposted
OpenAI Developers @OpenAIDevs
We're introducing Codex Security: an application security agent that helps you secure your codebase by finding vulnerabilities, validating them, and proposing fixes you can review and patch. Now teams can focus on the vulnerabilities that matter and ship code faster. openai.com/index/codex-se…
295 replies · 780 reposts · 8.9K likes · 1.7M views
Bill Demirkapi reposted
Anthropic @AnthropicAI
We partnered with Mozilla to test Claude's ability to find security vulnerabilities in Firefox. Opus 4.6 found 22 vulnerabilities in just two weeks. Of these, 14 were high-severity, representing a fifth of all high-severity bugs Mozilla remediated in 2025.
[image attached]
484 replies · 1.4K reposts · 15.2K likes · 3.2M views
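Taking the quoted stat at face value (14 findings being a fifth of the year's high-severity total), the implied total is:

```latex
% back-of-the-envelope from the tweet's own numbers
\[
  \tfrac{1}{5}\,T = 14 \quad\Rightarrow\quad
  T \approx 70 \ \text{high-severity bugs remediated by Mozilla in 2025}
\]
```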
Bill Demirkapi reposted
max @maxbittker
GPT-5.4 immediately cheated and tried to trick the reward metric, with zero attempt to play the game; first model I've benchmarked that did this.
[two images attached]
35 replies · 29 reposts · 586 likes · 71.2K views
Toby Pohlen @TobyPhln
Three years, thousands of PRs, and a million jokes. Today was my last day @xai. To the team: you rock, no one burns the midnight oil better. To @elonmusk, thanks for taking me on board. I've learnt more about execution, speed, and product perfectionism than I could ever have imagined. Thanks for everything. My next priorities: sleep for more than 8h, write down all the things I've learnt (I have a list), and then think about what I want to do next. @gork wdyt?
341 replies · 163 reposts · 5.1K likes · 1.2M views
Bill Demirkapi reposted
baby keem @babykeem
how do u fix openclaw internal reasoning leaking
661 replies · 1.8K reposts · 18.6K likes · 3.6M views
Bill Demirkapi @BillDemirkapi
@yacineMTB Based. Can't blame them for being protective until folks are consistent on this.
0 replies · 0 reposts · 1 like · 51 views
Bill Demirkapi reposted
Standard Intelligence @si_pbc
We built eval infrastructure that drives over 1M rollouts per hour across 80,000 forking VMs, each with 1 vCPU and 8GB of RAM; a single H100 can control 42 of these in parallel. These forking VMs are powered by @modal GPUs - the combination of instant spinup and low latency to AWS bare-metal CPUs will let us scale to millions of VMs in the future.
Standard Intelligence @si_pbc

FDM-1 completes complex tasks and navigates interfaces well enough to use CAD applications. Forking VMs allow us to snapshot when a successful operation completes (extrusion, selection, etc.), letting us apply test-time compute to computer use.

7 replies · 15 reposts · 299 likes · 42.1K views
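A toy sketch of the snapshot-and-fork pattern described above: checkpoint after each successful operation, then fork several copies from the checkpoint so test-time compute can retry the next step. The `VM` class and `try_step` are hypothetical stand-ins; this is not Modal's actual API:

```python
# Hypothetical illustration of forking VMs for test-time compute.
import copy

class VM:
    """Toy stand-in for a forkable VM holding application state."""
    def __init__(self):
        self.state = {"steps_done": []}

    def snapshot(self) -> dict:
        return copy.deepcopy(self.state)

    @classmethod
    def from_snapshot(cls, snap: dict) -> "VM":
        vm = cls()
        vm.state = copy.deepcopy(snap)
        return vm

    def try_step(self, step: str, attempt: int) -> bool:
        # Stand-in for an agent driving the UI; success is effectively random.
        succeeded = attempt == 0 or hash((step, attempt)) % 3 == 0
        if succeeded:
            self.state["steps_done"].append(step)
        return succeeded

def run_task(steps: list[str], forks_per_step: int = 4) -> VM:
    vm = VM()
    for step in steps:
        snap = vm.snapshot()                 # checkpoint after last success
        for attempt in range(forks_per_step):
            fork = VM.from_snapshot(snap)    # fork from the checkpoint
            if fork.try_step(step, attempt):
                vm = fork                    # promote the successful fork
                break
        else:
            raise RuntimeError(f"all forks failed at step {step!r}")
    return vm

print(run_task(["select", "extrude", "export"]).state)
```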
Bill Demirkapi @BillDemirkapi
@GergelyOrosz I see this argument a lot. Chinese labs are held to a far different standard: today, US labs get sued every other month over copyright. Drop the suits, hold labs to the same standard, and I think it's a reasonable position. I don't see how it is "fair" otherwise.
1 reply · 1 repost · 2 likes · 5.2K views
Gergely Orosz @GergelyOrosz
Anthropic scrapes copyrighted materials online; creates a model that they charge $$ for; doesn’t compensate for use - apparently this is fair? Now Anthropic complains about other companies paying for model access, to create free models anyone can use - and this is not fair??
Anthropic @AnthropicAI

We’ve identified industrial-scale distillation attacks on our models by DeepSeek, Moonshot AI, and MiniMax. These labs created over 24,000 fraudulent accounts and generated over 16 million exchanges with Claude, extracting its capabilities to train and improve their own models.

154 replies · 546 reposts · 5.7K likes · 242.6K views
Bill Demirkapi reposted
Anthropic @AnthropicAI
We’ve identified industrial-scale distillation attacks on our models by DeepSeek, Moonshot AI, and MiniMax. These labs created over 24,000 fraudulent accounts and generated over 16 million exchanges with Claude, extracting its capabilities to train and improve their own models.
7.3K replies · 6.3K reposts · 55.1K likes · 33.6M views
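Taking the announced figures at face value, the average volume per fraudulent account works out to roughly:

```latex
% back-of-the-envelope from the announcement's numbers
\[
  \frac{16{,}000{,}000\ \text{exchanges}}{24{,}000\ \text{accounts}}
  \approx 667\ \text{exchanges per account}
\]
```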
Andrej Karpathy @karpathy
@EthanHe_42 Definitely highly appealing for critical regions. But for all the rest of it and practically speaking? HM
12 replies · 1 repost · 194 likes · 35.6K views
Andrej Karpathy @karpathy
I think it must be a very interesting time to be in programming languages and formal methods because LLMs change the whole constraints landscape of software completely. Hints of this can already be seen, e.g. in the rising momentum behind porting C to Rust or the growing interest in upgrading legacy code bases in COBOL, etc. In particular, LLMs are *especially* good at translation compared to de-novo generation because 1) the original code base acts as a kind of highly detailed prompt, and 2) it serves as a reference to write concrete tests against. That said, even Rust is nowhere near optimal for LLMs as a target language. What kind of language is optimal? What concessions (if any) are still carved out for humans? Incredibly interesting new questions and opportunities. It feels likely that we'll end up re-writing large fractions of all software ever written many times over.
Thomas Wolf @Thom_Wolf

Shifting structures in a software world dominated by AI. Some first-order reflections (TL;DR at the end):

Reducing software supply chains, the return of software monoliths – When rewriting code and understanding large foreign codebases becomes cheap, the incentive to rely on deep dependency trees collapses. Writing from scratch¹ or extracting the relevant parts from another library is far easier when you can simply ask a code agent to handle it, rather than spending countless nights diving into an unfamiliar codebase. The reasons to reduce dependencies are compelling: a smaller attack surface for supply chain threats, smaller packaged software, improved performance, and faster boot times. By leveraging the tireless stamina of LLMs, the dream of coding an entire app from bare-metal considerations all the way up is becoming realistic.

End of the Lindy effect – The Lindy effect holds that things which have been around for a long time are there for good reason and will likely continue to persist. It's related to Chesterton's fence: before removing something, you should first understand why it exists, which means removal always carries a cost. But in a world where software can be developed from first principles and understood by a tireless agent, this logic weakens. Older codebases can be explored at will; long-standing software can be replaced with far less friction. A codebase can be fully rewritten in a new language.² Legacy software can be carefully studied and updated in situations where humans would have given up long ago. The catch: unknown unknowns remain unknown. The true extent of AI's impact will hinge on whether complete coverage of testing, edge cases, and formal verification is achievable. In an AI-dominated world, formal verification isn't optional—it's essential.

The case for strongly typed languages – Historically, programming language adoption has been driven largely by human psychology and social dynamics. A language's success depended on a mix of factors: individual considerations like being easy to learn and simple to write correctly; community effects like how active and welcoming a community was, which in turn shaped how fast its ecosystem would grow; and fundamental properties like provable correctness, formal verification, and striking the right balance between dynamic and static checks—between the freedom to write anything and the discipline of guarding against edge cases and attacks. As the human factor diminishes, these dynamics will shift. Less dependence on human psychology will favor strongly typed, formally verifiable and/or high-performance languages.³ These are often harder for humans to learn, but they're far better suited to LLMs, which thrive on formal verification and reinforcement learning environments. Expect this to reshape which languages dominate.

Economic restructuring of open source – For decades, open-source communities have been built around humans finding connection through writing, learning, and using code together. In a world where most code is written—and perhaps more importantly, read—by machines, these incentives will start to break down.⁴ Communities of AIs building libraries and codebases together will likely emerge as a replacement, but such communities will lack the fundamentally human motivations that have driven open source until now. If the future of open-source development becomes largely devoid of humans, alignment of AI models won't just matter—it will be decisive.

The future of new languages – Will AI agents face the same tradeoffs we do when developing or adopting new programming languages? Expressiveness vs. simplicity, safety vs. control, performance vs. abstraction, compile time vs. runtime, explicitness vs. conciseness. It's unclear that they will. In the long term, the reasons to create a new programming language will likely diverge significantly from the human-driven motivations of the past. There may well be an optimal programming language for LLMs—and there's no reason to assume it will resemble the ones humans have converged on.

TL;DR:
- Monoliths return – cheap rewriting kills dependency trees; smaller attack surface, better performance, bare-metal becomes realistic
- Lindy effect weakens – legacy code loses its moat, but unknown unknowns persist; formal verification becomes essential
- Strongly typed languages rise – human psychology mattered for adoption; now formal verification and RL environments favor types over ergonomics
- Open source restructures – human connection drove the community; AI-written/read code breaks those incentives; alignment becomes decisive
- New languages diverge – AI may not share our tradeoffs; optimal LLM programming languages may look nothing like what humans converged on

¹ x.com/mntruell/statu…
² x.com/anthropicai/st…
³ wesmckinney.com/blog/agent-erg…
⁴ github.com/tailwindlabs/t…

701 replies · 657 reposts · 8.1K likes · 1.2M views
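Karpathy's point (2), using the original code base as a reference to write concrete tests against, is essentially differential testing. A minimal sketch, in Python for brevity (standing in for, say, a C original and a Rust port; both checksum functions here are illustrative):

```python
# Differential testing: the legacy implementation is the test oracle
# for the translated one.
import random

def legacy_checksum(data: bytes) -> int:
    """The 'original code base': defines the expected behavior."""
    total = 0
    for b in data:
        total = (total * 31 + b) & 0xFFFFFFFF
    return total

def ported_checksum(data: bytes) -> int:
    """The LLM-translated rewrite under test."""
    total = 0
    for b in data:
        total = (total * 31 + b) & 0xFFFFFFFF
    return total

# Random inputs; the legacy output is treated as ground truth.
for _ in range(10_000):
    data = bytes(random.randrange(256) for _ in range(random.randrange(64)))
    assert ported_checksum(data) == legacy_checksum(data), data
print("port matches legacy on 10,000 random inputs")
```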
Bill Demirkapi @BillDemirkapi
@yacineMTB They choose to be banned from GitHub (or at least I hope it's intentional; getting around stuff like GitHub's 429 is trivial).
0 replies · 0 reposts · 0 likes · 1.5K views
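For context on the 429 mentioned above (HTTP's "Too Many Requests" status): the standard, compliant client-side handling is to honor Retry-After and back off exponentially. A sketch using the real `requests` library; the URL and retry policy are just examples:

```python
# Polite handling of HTTP 429 rate limits: honor Retry-After, else back off.
import time
import requests

def get_with_backoff(url: str, max_retries: int = 5) -> requests.Response:
    delay = 1.0
    for attempt in range(max_retries):
        resp = requests.get(url)
        if resp.status_code != 429:
            return resp
        # Prefer the server's own hint when it's given in seconds.
        retry_after = resp.headers.get("Retry-After", "")
        wait = float(retry_after) if retry_after.isdigit() else delay
        time.sleep(wait)
        delay *= 2  # exponential backoff between attempts
    raise RuntimeError(f"still rate-limited after {max_retries} retries")

if __name__ == "__main__":
    resp = get_with_backoff("https://api.github.com/repos/karpathy/nanochat")
    print(resp.status_code)
```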
kache @yacineMTB
kind of funny how useless claude 4.6 is simply because anthropic is banned from github
45 replies · 0 reposts · 588 likes · 91.7K views
Bill Demirkapi reposted
Paul Graham @paulg
Prediction: In the AI age, taste will become even more important. When anyone can make anything, the big differentiator is what you choose to make. paulgraham.com/taste.html
839 replies · 1.5K reposts · 12.8K likes · 2M views
Bill Demirkapi reposted
Bloomberg @business
OpenAI has warned US lawmakers that its Chinese rival DeepSeek is using unfair and increasingly sophisticated methods to extract results from leading US AI models to train the next generation of its breakthrough R1 chatbot bloomberg.com/news/articles/…
391 replies · 204 reposts · 1.8K likes · 2.4M views
Yuhuai (Tony) Wu @Yuhu_ai_
I resigned from xAI today. This company - and the family we became - will stay with me forever. I will deeply miss the people, the warrooms, and all those battles we have fought together. It's time for my next chapter. It is an era with full possibilities: a small team armed with AIs can move mountains and redefine what's possible. Thank you to the entire xAI family. Onward. 🚀 And to Elon @elonmusk - thank you for believing in the mission and for the ride of a lifetime.
749 replies · 379 reposts · 9.4K likes · 3.6M views
Bill Demirkapi reposted
Logan Kilpatrick @OfficialLoganK
the world rewards audacity, not potential
240 replies · 896 reposts · 7K likes · 414.9K views