Nitarshan

1.6K posts

@nitarshan

computer @anthropic, PhD @cambridge_cl. prev created @aisecurityinst, AI Safety Summit, UK AI Research Resource, EU AI Code of Practice.

SF / London · Joined May 2012

2.3K Following · 2.3K Followers

Pinned Tweet

Nitarshan@nitarshan·7 May

The West has a closing window to win on AI. In our @JoinFAI article, @saroshnagar, @scott_r_singer and I argue that our leadership in AI requires "full-stack diffusion" to promote our entire AI stack globally. 1/6

Nitarshan retweeted

Jackson Kernion@JacksonKernion·2d

I simply don't understand what people have in mind when they say stuff like this. What we have is extremely capable computer use agents. They will continue to get better at computer use. But how does a capable computer use agent 'take over' and why haven't they done that today?

Elizabeth Barnes@BethMayBarnes

(1) We are likely on track to develop AI systems capable of causing human extinction/permanent disempowerment, quite possibly within the next few years

Nitarshan retweeted

Sebastien Bubeck@SebastienBubeck·4d

@kareem_carr There was 0 human involvement. The prompt is in the report. The final answer by the model is in the report. And we have a (gpt-rewritten) CoT that we released.

Nitarshan retweeted

jessica dai@jessicadai_·4d

I'd maybe deprioritize, checks notes, 5000 thinktanks or the nth nonprofit "alignment" "research" institute but, really, what do I know

Nitarshan retweeted

roon@tszzl·17 May

there really are very high degrees of biorisk, cyberrisk, whatever else that are worth trading off against having a small monopoly of cyberpunk warring-states exercise full control over frontier superintelligence imo

Nitarshan retweeted

Alex Turner@Turn_Trout·15 May

Lots of hubbub about "is LW to blame for self-fulfilling misalignment" 1. If a scientist builds a machine which does bad because people said it would, it's NOT the people's fault (morally) 2. Balance of evidence is that YES, LW & doom-speculation contributed to the problem

Nitarshan retweeted

kamilė@kamilelukosiute·21 Apr

Over a weekend and with ~$760, I (not a biologist) used Claude Code to fine-tune a biological AI model on human-infecting viral sequences. Although my experiment wasn't dangerous, it demonstrates how coding agents are changing the biosecurity risk landscape. In a new @GovAIOrg blog post with @lucafrighetti and James Black, we describe this experiment and its policy implications. Biosecurity has traditionally divided AI risks into two buckets: general LLMs that "raise the floor" by democratizing knowledge and specialized biological AI models (BAIMs) that "raise the ceiling" by enabling experts. Increasingly capable coding agents blur that line via three mechanisms: 1) Coding agents let both novices and experts operate BAIMs more effectively, expanding the pool of potential misusers and letting experts test more designs faster. 2) Data filters on BAIMs are brittle when coding agents can autonomously fine-tune the models, as my experiment shows. 3) Coding agents speed up ML engineering, making it more feasible for threat actors to train new specialized models optimized for harmful capabilities from scratch. Policy recommendations: BAIM developers should move beyond data filtering toward trusted-access programs; LLM developers should test agent interactions with BAIMs; policymakers should prioritize physical chokepoints like DNA synthesis screening. Read the piece: governance.ai/analysis/codin…

Nitarshan retweeted

Séb Krier@sebkrier·19 Apr

I think a lot of arguments in this piece are weak or out of date, and mostly repeating theoretical assumptions developed way before we had the AI systems we build today. Some of the evidence cited is a bit selective too. Some takes:
1. Grok going MechaHitler is not an example of goal misgeneralization or specification gaming, which were issues with different, older RL systems. Nor does it support the article's claim that 'AI won't do what we want (by default)'. This was a result of a bad system prompt, which was easily fixed: simonwillison.net/2025/Jul/15/xa…
2. In general, models are pretty good at instruction following, and even more so over time. The default empirical trend so far is improving steerability and instruction following, so expecting this to get worse as they get more capable requires stronger evidence than metaphors or analogies.
3. The examples used in the post don't really support the chimp to human analogy, which is rhetorically vivid but analytically sloppy; plus there are many examples of control exerted over more capable systems, like control over companies. Control is also likely not the sole frame we should use to understand and steer AI.
4. The 'we are growing AIs' meme is not very instructive. It's not true that 'all we can see are the trillions of inscrutable parameters' - there is a lot of research that now gives us a much better understanding of how models work: circuit tracing work, sparse feature work, representation engineering etc. Just because models aren't 'pre-programmed' does not imply that 'there is no way to directly specify what behaviour we want an AI system to have.'
5. We don't train AIs to 'optimise for long-term goals', this is not in fact a good description of what model training does. The article compresses too many distinct things (models, instruction tuning, scaffolds etc) into one 'goal-maximizing' or 'scheming' story. See also: lesswrong.com/posts/pdaGN6pQ…
6. It's not at all obvious that learning self-preservation is a necessary side effect of better capabilities, which is a core assumption in rationalist circles. There are in my view no strong signs pointing towards robust endogenous self-preservation drives, and if anything more signs pointing in the other direction. See also: blog.cosmos-institute.org/p/alignment-by…
7. The cited 'blackmail the engineer' test environment is widely seen as highly flawed and not instructive in any way. Even Anthropic’s own write-up makes clear these were contrived scenarios with no external validity. See also: arxiv.org/abs/2507.03409 and x.com/sebkrier/statu…
8. Reward hacking is a legitimate issue, but not one that implies the kind of loss of control alluded to in the article, nor that some hidden reward has become the system’s deep objective. See also: turntrout.com/reward-hacking…
9. The 2024 alignment faking paper is also highly stylized, not particularly instructive, and dismissed as not in fact proving deceptive intent as the post implies. The label “alignment faking” imported more intentionality and strategic coherence than the setup warranted. See also: arxiv.org/abs/2506.18032 and alignmentforum.org/posts/PWHkMac9…
10. A model inferring that it's being evaluated shows that it recognizes highly standardised/obvious evaluation environments, not that it has any deceptive intent. Interpreting eval awareness as such illustrates nicely the underlying assumptions held by some safety researchers. The jump from "models can detect standardized eval contexts" to "models are deceptively scheming" is an unwarranted interpretive leap.

Benjamin Todd@ben_j_todd

With chatbots, AI alignment looked easier than expected. But with the shift to ever smarter longer-horizon agents, the classic reasons for concern come back. New primer: four reasons why AI won't do what we want 🧵

Nitarshan retweeted

kamilė@kamilelukosiute·9 Mar

AI models' cyber capabilities keep getting meaningfully better, and fast. To determine how AI capabilities will impact cybercrime, we first need a baseline for global cybercrime damages. In a new @GovAIOrg technical report with John Halstead and @lucafrighetti, we arrive at a baseline estimate of global cybercrime damages: $500B (with 90% CI of $100B-$1T) per year.
Existing estimates of global cybercrime damages range from tens of billions to tens of trillions of dollars. Most have serious problems: they rely on reported damages only (missing the vast majority of incidents that go unreported), or they don't publish their methodology at all. We tried to do better by extrapolating mostly from survey data, which captures unreported incidents, and by being transparent about every assumption we make.
Our total estimate: ~$500B a year. This includes direct losses to individuals, direct + response costs to businesses, and defensive spending. Notably, this does not include costs that are even harder to quantify, such as IP theft, espionage, and national security costs, so the real yearly damages are presumably higher.
As AI gets better at cyber, even a modest additive effect on the volume of cybercrime is a big deal. A 20% increase would mean ~$100B in additional yearly damages.
Our estimates have extremely high uncertainty ranges. If we want to understand how AI is shaping cybercrime, we'll need to build new ways of measuring the effects by looking at real world indicators of threat actor AI usage. Read the full report here: governance.ai/research-paper…

Nitarshan retweeted

NASA Administrator Jared Isaacman@NASAAdmin·7 Mar

NASA does not have a top-line problem. We receive roughly $25 billion in annual appropriations, including more than a $10 billion plus-up from President Trump’s One Big Beautiful Bill. If that is not enough to run a lunar exploration program and do all the other things across science and discovery, then what is the right number? We don’t need to blame budgets or continuity of decision-making as the common excuse, as if a billion dollars is somehow not a billion dollars and troubled programs should perpetually stay troubled programs. NASA, like the federal government, cannot spend our way out of every problem, nor can we perpetuate bad decisions. That means not getting spread thin across too many imposed endeavors or jumping straight to the “dream state,” which is how everything becomes over budget and behind schedule. Instead, we concentrate on the needle-moving objectives, the reason NASA exists in the first place. We execute with urgency, in an iterative and safe way, and empower the workforce and our partners to get the job done. That is how we changed the world on July 20, 1969, and it is how we will do it again. Expect more from NASA and start believing again.

Nitarshan retweeted

Andrew Gordon Wilson@andrewgwils·7 Mar

There was a new language. A lot of signaling, branding, tribalism, politics. A fresh wave of people who controlled immense resources but didn't know much about what came before. And some research that felt narratively driven, but only loosely connected with the stated goals. 5/6

Nitarshan retweeted

Xander Davies@alxndrdavies·6 Mar

The Red Team at @AISecurityInst is hiring! We work with frontier AI companies to red team their misuse safeguards, control measures, and alignment techniques. As the stakes rise, we need much stronger red teaming and many more talented researchers working within gov 🧵

Nitarshan retweeted

Anthropic@AnthropicAI·6 Mar

We partnered with Mozilla to test Claude's ability to find security vulnerabilities in Firefox. Opus 4.6 found 22 vulnerabilities in just two weeks. Of these, 14 were high-severity, representing a fifth of all high-severity bugs Mozilla remediated in 2025.

Nitarshan retweeted

roon@tszzl·5 Mar

@memeticweaver @tautologer > the USG can in general do whatever they want the founders of this great nation fought several bloody wars to make sure this is not true

Nitarshan retweeted

Dean W. Ball@deanwball·5 Mar

the word of the week is “alas”

Nitarshan retweeted

pamela mishkin@manlikemishap·4 Mar

:party-parrot: slow to reply due to issues of national security :pray:

Nitarshan retweeted

Nathan Calvin@_NathanCalvin·3 Mar

Not sure without seeing full text, but it seems to me there are two options here: 1. This updated deal does not protect Anthropic's redlines (which are the same as OAI's) 2. This deal does protect them. If it does, then why was Anthropic treated so much worse by the admin?

Sam Altman@sama

Here is a re-post of an internal post: We have been working with the DoW to make some additions in our agreement to make our principles very clear.
1. We are going to amend our deal to add this language, in addition to everything else: "• Consistent with applicable laws, including the Fourth Amendment to the United States Constitution, National Security Act of 1947, FISA Act of 1978, the AI system shall not be intentionally used for domestic surveillance of U.S. persons and nationals. • For the avoidance of doubt, the Department understands this limitation to prohibit deliberate tracking, surveillance, or monitoring of U.S. persons or nationals, including through the procurement or use of commercially acquired personal or identifiable information." It’s critical to protect the civil liberties of Americans, and there was so much focus on this, that we wanted to make this point especially clear, including around commercially acquired information. Just like everything we do with iterative deployment, we will continue to learn and refine as we go. I think this is an important change; our team and the DoW team did a great job working on it.
2. The Department also affirmed that our services will not be used by Department of War intelligence agencies (for example, the NSA). Any services to those agencies would require a follow-on modification to our contract.
3. For extreme clarity: we want to work through democratic processes. It should be the government making the key decisions about society. We want to have a voice, and a seat at the table where we can share our expertise, and to fight for principles of liberty. But we are clear on how the system works (because a lot of people have asked, if I received what I believed was an unconstitutional order, of course I would rather go to jail than follow it). But
4. There are many things the technology just isn’t ready for, and many areas we don’t yet understand the tradeoffs required for safety. We will work through these, slowly, with the DoW, with technical safeguards and other methods.
5. One thing I think I did wrong: we shouldn't have rushed to get this out on Friday. The issues are super complex, and demand clear communication. We were genuinely trying to de-escalate things and avoid a much worse outcome, but I think it just looked opportunistic and sloppy. Good learning experience for me as we face higher-stakes decisions in the future.
In my conversations over the weekend, I reiterated that Anthropic should not be designated as a SCR, and that we hope the DoW offers them the same terms we’ve agreed to. We will host an All Hands tomorrow morning to answer more questions.

Nitarshan retweeted

Dean W. Ball@deanwball·3 Mar

It is so clear that the important fissure in AI politics right now is not “liberal vs. conservative,” “Democrat vs. Republican,” “e/acc vs. EA,” or “safety vs. anti-safety,” but instead “takes advanced AI seriously as a concept vs. does not take advanced AI seriously.”

Nitarshan retweeted