Aditya Gupta

2.5K posts

@adi1391

Founder @ Attify - Breaking LLM/IoT/Mobile - CFSE Author - Formal Logic & World-Model RE - Teaching Offensive Intelligence Engineering

Joined July 2011
2.4K Following · 8.2K Followers
Aditya Gupta@adi1391·
@PinkDraconian While everyone is building vuln discovery agents and workflows, the real need is a triage tool that is grounded in the codebase.
PinkDraconian@PinkDraconian·
HackerOne receives ±200 reports every hour now. This is unsustainable. Why is this happening? You can literally tell an AI agent to just "Hack this site" and it is capable enough to go and find a vulnerability (or at least something that looks like a vulnerability).
Aditya Gupta@adi1391·
congrats @axiommathai @CarinaLHong on the raise. mathematics is the right foundation for systems that reason - and essential for a safer future.

the bet on verification infrastructure over benchmark scores is the right one, and AXLE proves it:
- verify_proof tells you a proof is wrong.
- repair_proofs tells you how it's wrong and tries to fix it.

it's a genius flywheel: open-source verify_proof, repair_proofs - the whole toolkit. keep the prover proprietary.

verification: public good
feedback loop: product

open-sourcing verification grows the ecosystem → more Lean proofs written → more training data available → better prover. & sustains both the advancement of Maths and the advancement of Axiom.

maths is the first domain where this works. def. won't be the last. so many emergent possibilities.
Axiom@axiommathai

Axiom launched six months ago with one conviction: mathematics is the right foundation for building systems that reason. Today we announce Axiom's Series A. We raised $200M at a $1.6B+ valuation, led by @MenloVentures, to extend our lead in formal mathematics into Verified AI.
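To make the verify/repair loop above concrete, here is a minimal sketch in plain Lean 4 of the kind of signal such a checker produces; AXLE's actual verify_proof / repair_proofs APIs aren't shown in the thread, so everything beyond the verify-then-repair idea is an assumption:

```lean
-- plain Lean 4, illustrative only: Axiom's AXLE tooling is not public
-- in this thread, so verify_proof / repair_proofs are the idea, not an API.

-- a checked proof: the verifier accepts it because the term's type
-- matches the theorem statement exactly
theorem sum_comm (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b

-- a broken proof: uncomment it and the checker reports exactly which
-- goal fails - the error signal a repair step would consume
-- theorem sum_comm' (a b : Nat) : a + b = b * a :=
--   Nat.add_comm a b
```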

Aditya Gupta@adi1391·
doing my part to make the ai ecosystem safer. starting with llama.cpp: a Heap Overflow Bug identified using CFSE World Modeling. Just received: CVE-2026-27940. one step at a time. thanks @ggerganov for prioritizing security and fixing at light speed 🙏 github.com/ggml-org/llama…
Aditya Gupta@adi1391·
@karpathy i think the problem isn't failovers. it's that the autoresearch is coupling state → sessions. every hypothesis, prove/refute direction, evidence pointer - should be append-only files on disk, not API context. built this - github.com/adi0x90/cfse-r…
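A minimal sketch of that append-only pattern, assuming one JSONL event log per run; the actual implementation lives in the truncated cfse-r… repo link above and may differ:

```python
# illustrative sketch only - the real implementation is in the
# (truncated) cfse-r... repo linked in the tweet and may differ.
import json
import time
from pathlib import Path

STATE = Path("research_state.jsonl")  # lives on disk, not in API context

def record(kind: str, payload: dict) -> None:
    """Append one immutable research event: hypothesis, direction, evidence."""
    event = {"ts": time.time(), "kind": kind, **payload}
    with STATE.open("a") as f:           # append-only: never rewritten
        f.write(json.dumps(event) + "\n")

def replay() -> list[dict]:
    """Rebuild the full research state from disk after any session wipe."""
    if not STATE.exists():
        return []
    return [json.loads(line) for line in STATE.read_text().splitlines()]

# hypothetical events, just to show the shape
record("hypothesis", {"id": "H1", "text": "state should outlive sessions"})
record("evidence", {"for": "H1", "pointer": "run-logs/outage-2025.txt"})
print(len(replay()), "events recovered")
```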
Andrej Karpathy@karpathy·
My autoresearch labs got wiped out in the oauth outage. Have to think through failovers. Intelligence brownouts will be interesting - the planet losing IQ points when frontier AI stutters.
Aditya Gupta reposted
Chaithu@ant4g0nist·
Been experimenting/building Morgul - an AI debugger automation framework. Control LLDB with natural language: act(), extract(), observe(). It translates intent into bridge API code and executes it, similar to how @browserbase's Stagehand controls a browser... github.com/ant4g0nist/mor…
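A rough sketch of how such a bridge could work; act()/extract()/observe() are named in the tweet, but the class, the LLM callable, and the execution scope below are assumptions, not Morgul's actual code:

```python
# rough sketch, not Morgul's actual code: act/extract/observe come from
# the tweet; the bridge class, llm callable, and exec scope are assumptions.
import lldb  # LLDB's bundled Python bindings

class DebuggerBridge:
    def __init__(self, llm, target_path: str):
        self.llm = llm  # callable: prompt -> Python source that sets `result`
        self.debugger = lldb.SBDebugger.Create()
        self.target = self.debugger.CreateTarget(target_path)

    def _run(self, intent: str):
        # translate natural-language intent into bridge API code, execute it
        code = self.llm(
            f"Write lldb Python using `debugger` and `target`, "
            f"storing output in `result`, to: {intent}")
        scope = {"lldb": lldb, "debugger": self.debugger,
                 "target": self.target, "result": None}
        exec(code, scope)
        return scope["result"]

    def act(self, intent: str):      # mutate state: breakpoints, run, step
        return self._run(intent)

    def extract(self, intent: str):  # pull structured data: registers, memory
        return self._run(intent)

    def observe(self, intent: str):  # read-only snapshot of the stop state
        return self._run(intent)
```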
Aditya Gupta@adi1391·
exactly! it's time to go deeper. + it's more possible than ever before for smaller teams to take on massively sized orgs when it comes to doing serious deep research - incl. discovering vulns. It's like bringing the real research part back to the world of Security Research, with the only limits being creativity (and compute).
David Wong@cryptodavidw·
I definitely feel the heat of the competition when big LLM companies push products that not only compete with us as auditors but also with our own AI-based offerings (zkao).

If I were to venture a guess, there are different worlds in which we might exist in the next 5-10 years. In one of these futures, we, as auditors, cease to exist. If this is the future, then developers cease to exist too, and most people touching software cease to exist. My guess here is as good as any developer's guess on if their job will remain stable.

In another one of these futures, us auditors become more specialized, more niche, and bring the "human touch" needed or required. Serious companies will want to continue working with some humans, and delegating security to "someone". That someone could be embedded in the company, or they could be a SaaS+human-support system like zkao.

On the other hand, vibe coders will definitely use claude code security, maybe we should call it "vibe security"? I don't mean it as a diss, I vibe code myself, but it will most likely be as good as vibe coding in the sense that you might have to spend time understanding it, it might make a lot of mistakes, and it will be "good enough" for a lot of use cases.

I think that world is a bit more realistic today than the AGI "all of our jobs are gone in the next years" doom claim. And as @zksecurityXYZ, I don't think we're too scared of that world. These tools have been, and are, making us stronger auditors. We're a small, highly specialized team that's resilient and hard to replace. On the other hand, large consultancies, and especially consultancies that focus on low-hanging fruit like web security and smart contracts, are ngmi.
Claude@claudeai

Introducing Claude Code Security, now in limited research preview. It scans codebases for vulnerabilities and suggests targeted software patches for human review, allowing teams to find and fix issues that traditional tools often miss. Learn more: anthropic.com/news/claude-co…

Aditya Gupta@adi1391·
~$21B in market value disappeared in hours: CrowdStrike (CRWD) -8%, Cloudflare (NET) -8.1%, Okta (OKTA) -9.2%, Qualys (QLYS) -10.2%, Zscaler (ZS) -5.5% - lowest in the last few years.

For many: Panic Mode. but Zoom out → → →

This is a Huge Win for safer code and a pivotal moment in CyberSecurity. With OpenAI's Aardvark and now Anthropic's Claude Code Security, the game is evolving fast, for the better. There's zero point in humans grinding away at tasks AI crushes at scale and depth - and it's proven, like with the recent EVMbench, and others.

But that doesn't kill the industry. It frees it to level up. This is the foundation of all real progress. We've long been addicted to the adrenaline of breaking things (red team glory, bug bounties, pentest hero stories), but we've spent far less energy on making systems truly unbreakable. In that direction, it's great that the frontier labs are taking it up. And more individual researchers should too.

Vibe Coding is hitting security hard: what used to require expensive consultants and armies of specialists is now in reach of every dev team. Selling "magic" products (snake-oil) is no longer sustainable. Selling Products & Services which are now vibe-codable and in the realm of everyone is what advancement looks like. It's about democratizing deep reasoning so defenders can scale fixes faster than attackers scale exploits.

It's time for the true curiosity seekers to go deeper & pursue their true passion of figuring out how to break the unbreakable.

But what's next? What's going to be relevant in the coming times? What is red team/sec research/exploitation expertise going to look like?

Well, Expertise is no longer going to be about Information Arbitrage - that untruth is disappearing quickly. Expertise is about:
1. Can you apply your Intelligence at the highest level, consistently?
2. Do you know where to apply your Intelligence?

If you're in cybersec and worried your current skills won't keep you relevant → adapt now. How to adapt? By Mastering the Meta Layer:
- How to Orchestrate Agents?
- How to build longer Reasoning systems?
- How to Engineer Reliable, Observable Systems?
- Building Supervision, Evals, Agentic Collaboration frameworks
- Secure/Flexible Sandboxing

These are the force multipliers, which you can apply across any surface: IoT, Web3, Mobile, Cloud, Web, Infra, ICS, OT, anything.

The surface layer work? Humans won't own that much longer. It's time you face that reality. The value of human output is diminishing rapidly there, whether you like it or not.

Say Hello to an Era where Security looks radically different. It's no longer fear-fueled "secret knowledge" sales, but proactive, curiosity-driven creation. If you're a dreamer, it's time to rethink what the future could look like. And it's time to build that future.

I'm in. Are you?
Claude@claudeai

Introducing Claude Code Security, now in limited research preview. It scans codebases for vulnerabilities and suggests targeted software patches for human review, allowing teams to find and fix issues that traditional tools often miss. Learn more: anthropic.com/news/claude-co…

Aditya Gupta@adi1391·
Research ::
1. How to Look (at x)
2. Where to Look (for x)
3. Redefine x →
Aditya Gupta@adi1391·
& everyone's gotten so used to it - to using either clunky CLIs (not even TUIs), or the outdated enterprise-feel javaish ghidra sort of UI. interfaces not designed to make things simple, but that leave you drowning in complexity. can't think of anything better than @zeddotdev to build on top of -- superfast, rust, and they've cracked the aesthetics part. building the core & enabling others to build extensions (or some way to encapsulate their way of working & sharing) could be a great way to make it the de-facto RE tool in the coming times.
Chaithu@ant4g0nist·
yess! honestly i do not like the gap between how good dev tooling feels vs RE tooling. we have agentic AI, gpu-rendered editors, (maybe real-time collab as well?) in dev world. RE still feels stuck in weird looking dialogs from 2003. Just started with a goal to see if we could bring all of that to RE
Chaithu@ant4g0nist·
new kid coming to reverse engineering town…his name is Snowball 🐇
Aditya Gupta@adi1391·
Single instance at the root, with Claude/Agents md clearly specifying what it is - and that the subdirectories are the actual code, with git configured in each. And in the parent directory, have the Claude md point to the individual backend/frontend Claude md files so it can properly navigate when you open Claude from the root. Haven't had any issue with context either - just need to ensure that the root Claude md is well written and clear. Don't overcomplicate the doc.
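A minimal sketch of that layout; the directory names and the pointer wording below are illustrative assumptions, not a prescribed format:

```
repo-root/
├── CLAUDE.md          <- what this workspace is + where the code lives
├── backend/           <- actual code, own git repo
│   └── CLAUDE.md
└── frontend/          <- actual code, own git repo
    └── CLAUDE.md

# repo-root/CLAUDE.md - short and clear, don't overcomplicate:
#   This is a multi-repo workspace; the subdirectories hold the actual
#   code, each with its own git repo.
#   backend/  -> see backend/CLAUDE.md
#   frontend/ -> see frontend/CLAUDE.md
```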
Samuel Cardillo@CardilloSamuel·
quick question: when you're working on multi-repo (e.g. backend + frontend), do you open one instance of claude code/codex/whatever at the root so it has access to both folders, or do you open instances in each individual folder?
Aditya Gupta@adi1391·
Great work by OpenAI. Hacking, at its core, is about Curiosity and Thinking Differently. Instead of thinking that the game is over, the game is on. It's time to build frameworks & systems that can go beyond what the top models are capable of. The most interesting times.
OpenAI@OpenAI

Introducing EVMbench—a new benchmark that measures how well AI agents can detect, exploit, and patch high-severity smart contract vulnerabilities. openai.com/index/introduc…

Aditya Gupta@adi1391·
@roydanroy My attempt on Q10. For a few others, 4, 6, 9 - got quite far too, but couldn’t solve fully.
Aditya Gupta@adi1391

Just solved & submitted Q10 of #1stProof research-level math problems.

If you can prove it, but can't trace what your proof depends on, or query which claims are still hypotheses, or machine-verify the core theorem - it's not really a proof.

The initial bottleneck in solving it wasn't mathematical insight, but lack of a system for reliable research. Solving complex problems with LLMs requires a system composed of two things:

1. High-Quality World Model: the problem space
2. Traceable Way Finding: how the LLM experiments to find out the path to the ideal final state of the world model

And the critical third layer: a mix of Semi-Formal and Formal Verification.

For Q10 (RKHS tensor decomposition), I built: a #CFSE invariant library with explicit proved/hypothesis classification, ASIQL dependency graph queries, and a Lean4 machine-checked proof.

Thanks to #FirstProof for doing this. It forced me to solve the hard infrastructure problems of doing research with LLMs - traceability, accuracy and scale.

Submission at - github.com/adi0x90/firstp…
Invariant Library - github.com/adi0x90/firstp…

#Lean4 #AIForMath

Dan Roy@roydanroy·
Does anyone have a list of public efforts on First Proof? Of course OpenAI’s effort was very public and has generated a lot of discussion and some controversy. Wolz and Ingo’s effort was nice to read about. Reply with your favorites (or least favourites).
Arjun Narayan@narayanarjun·
I'm optimistic that formal verification is the solution to our current situation where LLMs are writing our code and nobody's reading it. Formal methods can give us a world where we write succinct specs and agent-generated code is proven to comply. But we have a long way to go.

There are several open challenges that stand between our situation today and that future, but none appear insurmountable. I've written a brief overview of what I consider to be the big open problems, and some of the directions that researchers are taking today to address them: from verifying mathematics to building standard libraries of verified code that can be built upon. Here are a few highlights:

1) A Brief History of Formal Verification

Verification is fundamentally about understanding what your program can or can't do, and verifying it with a proof. In order to verify, you must first have a specification that you are verifying your program against. Most of you leverage some formal verification day to day: namely, some of the compiler errors in statically-typed languages like C++ and Java are verification errors. Static type checking is the version of formal verification programmers are most familiar with. Type systems (and related formal verification tools) have gotten quite impressive, and they are becoming a lot more relevant in constraining the behavior of AI coding models.

2) Rust

Type checking represents a middle ground for verification. The hard part is choosing the right balance: reject too many good programs and it becomes hard to program in this language, as the programmer has to "guess what the type checker will permit". Recently the language that has brought the most interesting advances from type systems to the real world is Rust. Its ownership type language and associated type checker is known as the "borrow checker". The borrow checker is conservative, and "fighting with the borrow checker" is part and parcel of everyone's Rust experience. This gives us the following lesson: we can prove more interesting things, but at a larger burden to the developer. Finding elegant middle points is hard, and Rust represents a real design breakthrough in navigating that tradeoff.

3) Mechanically verified math

Recently, groups of mathematical researchers have been writing mathematical proofs in a specialized programming language called a proof assistant. This language, Lean, comes with a powerful type checker capable of certifying complex mathematical proofs. Lean is exciting, but working in Lean can be frustrating: because of the nontermination properties of the type checker's search, such languages rely heavily on programmer annotation. And this is why more complex type systems have stayed relatively academic: the Rust borrow checker sits at a genuinely elegant point in the design space, complex enough to reason about a complex property like memory references, yet simple enough to not need too much extra annotation.

But this is a critically important point: mathematical proofs and type checking aren't just analogous: they are the literal same task. They are different only in the degree of complexity along two axes: the complexity of the underlying objects, and the complexity of the properties we are proving.

4) There is still a long way to go for proof assistants

While the world I describe is exciting, bluntly, we're not anywhere close to that world yet. Proofs break easily when programs are modified, the standard library of proofs is too small, and specifications seldom capture everything about the program's behavior. Overall there's a long way to go before these techniques reach a mainstream programming language with broad adoption.

But AI is a huge accelerant to proof assistants. Much of the energy towards AI-assisted mathematics is coming from AI researchers who see it as a very promising domain for building better reasoning models. Verified math is a domain rich in endless lemmas, statements, and proofs, all of which can be used as "ground truth" - which means we can use them as strong reward signals in our post-training workflows. There are several startups being built by seasoned foundation model researchers - Harmonic, Math Inc - that are based on this premise.

I'm no expert here, but it sure seems to me that formally verified code would lead to a clear domain of tasks with strong verifiable rewards, ripe for use in reinforcement learning to build better agents, period. I'm excited about the efforts to use verified mathematics in reinforcement learning. But I'd love to see even more experiments in bringing verification to the agentic coding world.

This is an exciting time in programming languages and formal methods research. There's only one way out of the increasingly unwieldy mountain of LLM generated code: We must prove. We will prove.
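A minimal Lean 4 illustration of point 3's claim that proofs and type checking are the literal same task; this is plain Lean, not taken from the linked overview:

```lean
-- plain Lean 4, illustrative only: the statement is a type, the proof is
-- a term, and the same type checker judges programs and proofs alike.

-- programming: the checker verifies this term has type Nat → Nat
def twice (n : Nat) : Nat := n + n

-- mathematics: the checker verifies this term proves the statement
theorem twice_add_comm (n m : Nat) : twice (n + m) = twice (m + n) := by
  rw [Nat.add_comm n m]
```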
Aditya Gupta@adi1391·
This Opus 4.6 run went to 17h 6m before hitting weekly rate limits. But the most valuable output was not the solution. It was learning how the LLM navigates a hard problem over many hours, and then figuring out ways in which it can be steered better.

long vs short llm sessions

in longer sessions, it does things which are often invisible in shorter conversations: hitting dead ends multiple times, retracing its paths, figuring out why it took a certain path, updating its reasoning to choose a better path this time, deciding what to do if it again goes down a rabbit hole or dead-end, deciding when to give up, generating conjectures, attempting to falsify its own conjectures, and deciding when to abandon one approach for another. most of these decisions are okayish. some are remarkably good. A few are subtly wrong in ways that cascade (esp. in scientific domains / math problems).

building the map

If you treat a long autonomous run as an observation session rather than a solution session, you get something far more valuable than one answer. You get a map of the decision landscape - what forks matter, where backtracking happens, what evidence is needed before committing to a path, and how findings from different tracks need to merge.

That map is what you use to build a research harness. Once you have the harness, you stop relying on a single LLM running for 17 hours and start running multiple LLMs in parallel - each on a scoped track with explicit entry/exit criteria, refutation gates, and evidence requirements. One track tries to prove. Another tries to falsify. A third explores an alternative construction. They share artifacts, not context windows.

This is the actual hard problem in doing research with LLMs: not getting one model to run longer, but designing the infrastructure that lets multiple models work on parallel tracks with traceable, mergeable results. Long runtimes are the observation phase. The Harness is what Generalizes.

what's next?

Over the next few weeks, I'll apply it to other research areas - because the harness doesn't belong to a single domain, but can learn from all of them.
Aditya Gupta@adi1391

Great work Claude 👏🏻 10+ hours.
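A minimal sketch of the harness idea from the tweet above - scoped tracks that share artifacts on disk rather than context windows; the Track fields and gate logic are assumptions for illustration, not the actual harness:

```python
# illustrative sketch only - the Track fields and gate logic are
# assumptions, not the actual harness described in the tweet.
import json
from dataclasses import dataclass, field
from pathlib import Path

ARTIFACTS = Path("artifacts")  # shared on disk, readable by every track

@dataclass
class Track:
    name: str                   # e.g. "prove", "falsify", "alt-construction"
    entry: str                  # explicit condition for starting the track
    exit: str                   # explicit stop condition, not a timeout
    evidence_required: list[str] = field(default_factory=list)

    def emit(self, claim: str, evidence: list[str]) -> bool:
        """Refutation gate: a claim missing required evidence never merges."""
        if not all(e in evidence for e in self.evidence_required):
            return False
        ARTIFACTS.mkdir(exist_ok=True)
        with (ARTIFACTS / f"{self.name}.jsonl").open("a") as f:
            f.write(json.dumps({"claim": claim, "evidence": evidence}) + "\n")
        return True

# parallel scoped tracks that share artifacts, not context windows
prove = Track("prove", entry="lemma stated", exit="proof machine-checked",
              evidence_required=["lean-check"])
falsify = Track("falsify", entry="lemma stated",
                exit="counterexample found or search exhausted")
print(prove.emit("lemma holds for all n", evidence=["lean-check"]))  # True
print(prove.emit("lemma holds for all n", evidence=[]))              # False
```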

Aditya Gupta@adi1391·
@gdb First Proof was a great beginning. And I'm sure it led many to build systems for scalable scientific/mathematical research, which will now be applied to other research ideas as well.
Greg Brockman@gdb·
"First Proof" (firstproof.org) is a very cool concept and approach. congrats and thank you to the team who put it together!
Greg Brockman@gdb·
we are now benchmarking our models on novel frontier research, via firstproof.org. of 10 math research problems which research mathematicians have solved but never published the solutions to, in a week, our model discovered likely correct solutions to at least 6 of them.
Jakub Pachocki@merettm

Very excited about the "First Proof" challenge. I believe novel frontier research is perhaps the most important way to evaluate capabilities of the next generation of AI models.

We have run our internal model with limited human supervision on the ten proposed problems. The problems require expertise in their respective domains and are not easy to verify; based on feedback from experts, we believe at least six solutions (2, 4, 5, 6, 9, 10) have a high chance of being correct, and some further ones look promising.

We will only publish the solution attempts after midnight (PT), per the authors' guidance - the sha256 hash of the PDF is d74f090af16fc8a19debf4c1fec11c0975be7d612bd5ae43c24ca939cd272b1a.

This was a side-sprint executed in a week mostly by querying one of the models we're currently training; as such, the methodology we employed leaves a lot to be desired. We didn't provide proof ideas or mathematical suggestions to the model during this evaluation; for some solutions, we asked the model to expand upon some proofs, per expert feedback. We also manually facilitated a back-and-forth between this model and ChatGPT for verification, formatting and style. For some problems, we present the best of a few attempts according to human judgement.

We are looking forward to more controlled evaluations in the next round! 1stproof.org #1stProof

Aditya Gupta@adi1391·
@Zardus Great initiative! Definitely needed this 👏🏻
Zardus@DEFCON.social@Zardus·
Hello security researchers! Like it or not, agentic AI is here. It’s time to explore its impact on novel, academic research in cybersecurity. To this end, we’re launching the Conference for Synthetic Security Research (synsec.org). Researchers, start your agents!
Aditya Gupta@adi1391·
Just solved & submitted Q10 of #1stProof research-level math problems.

If you can prove it, but can't trace what your proof depends on, or query which claims are still hypotheses, or machine-verify the core theorem - it's not really a proof.

The initial bottleneck in solving it wasn't mathematical insight, but lack of a system for reliable research. Solving complex problems with LLMs requires a system composed of two things:

1. High-Quality World Model: the problem space
2. Traceable Way Finding: how the LLM experiments to find out the path to the ideal final state of the world model

And the critical third layer: a mix of Semi-Formal and Formal Verification.

For Q10 (RKHS tensor decomposition), I built: a #CFSE invariant library with explicit proved/hypothesis classification, ASIQL dependency graph queries, and a Lean4 machine-checked proof.

Thanks to #FirstProof for doing this. It forced me to solve the hard infrastructure problems of doing research with LLMs - traceability, accuracy and scale.

Submission at - github.com/adi0x90/firstp…
Invariant Library - github.com/adi0x90/firstp…

#Lean4 #AIForMath
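A minimal sketch of the invariant-library idea described above - explicit proved/hypothesis classification with dependency queries; illustrative only, not the actual CFSE/ASIQL implementation from the truncated repo links:

```python
# illustrative sketch only - not the actual CFSE/ASIQL implementation
# from the (truncated) repo links in the tweet.
from dataclasses import dataclass, field
from enum import Enum

class Status(Enum):
    PROVED = "proved"          # machine-checked, e.g. in Lean4
    HYPOTHESIS = "hypothesis"  # assumed, not yet verified

@dataclass
class Invariant:
    name: str
    statement: str
    status: Status
    depends_on: list[str] = field(default_factory=list)

class InvariantLibrary:
    def __init__(self):
        self.inv: dict[str, Invariant] = {}

    def add(self, i: Invariant) -> None:
        self.inv[i.name] = i

    def unverified_deps(self, name: str) -> set[str]:
        """Transitively collect every hypothesis a claim still rests on -
        the 'query which claims are still hypotheses' step."""
        out: set[str] = set()
        stack = [name]
        while stack:
            cur = self.inv[stack.pop()]
            if cur.status is Status.HYPOTHESIS:
                out.add(cur.name)
            stack.extend(cur.depends_on)
        return out

# hypothetical entries, just to show the shape of the query
lib = InvariantLibrary()
lib.add(Invariant("kernel_psd", "Gram matrix is PSD", Status.PROVED))
lib.add(Invariant("rank_bound", "decomposition rank <= r", Status.HYPOTHESIS))
lib.add(Invariant("main", "Q10 core theorem", Status.PROVED,
                  depends_on=["kernel_psd", "rank_bound"]))
print(lib.unverified_deps("main"))  # {'rank_bound'} -> not really a proof yet
```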
Aditya Gupta@adi1391·
❤️ decision trees