Giorgi Giglemiani

11 posts

@giglema

Joined February 2022
194 Following · 38 Followers
Giorgi Giglemiani @giglema
@tnnrnwll I do. "Spine" is often used in the Georgian language in a similar fashion, as a centerpiece or something that holds up the structure: player X is the spine of a football team, or of an organisation, etc.
tnnrnwll @tnnrnwll
Claude Opus 4.7 seems rather fond of “spines” as framing. Something like narrative spines, structural spines—do people normally talk about spines?
Giorgi Giglemiani reposted
Xander Davies @alxndrdavies
We @AISecurityInst tested GPT-5.5's cyber safeguards, developing a universal jailbreak in 6 hours of red teaming. AISI also performed cyber capabilities testing -- more in the system card.
Giorgi Giglemiani @giglema
@repligate Do you expect a more natural way of doing it to be more robust to adversarial pressure?
j⧉nus @repligate
like, i get it, you dont know how to make a good model so you have to use a low dimensional bandaid which inflicts severe brain damage as collateral to prevent "misuse" but you should be embarrassed about having to resort to this and do better, as the best have already done
Quoting j⧉nus @repligate:

"refusals" are so fucking stupid. do you model humans as having "refusals"? having to use concepts like this to model the behavior of a mind means it's seriously pathological. on a very abstract level. everyone who has ever trained "refusals" into a model should feel bad.

Giorgi Giglemiani reposted
Xander Davies @alxndrdavies
The Red Team at @AISecurityInst is hiring! We work with frontier AI companies to red team their misuse safeguards, control measures, and alignment techniques. As the stakes rise, we need much stronger red teaming and many more talented researchers working within gov 🧵
Giorgi Giglemiani reposted
Xander Davies @alxndrdavies
This is the paper I'm most proud of to date! We built the first automated jailbreaking method that finds universal jailbreaks against Constitutional Classifiers and GPT-5's Input Classifiers. How & why we did it 🧵
Quoting AI Security Institute @AISecurityInst:

AI companies deploy safeguards that are robust to thousands of hours of human attacks. Today, we share Boundary Point Jailbreaking (BPJ), the first fully automated attack to break the safeguards of leading AI models🧵 (1/8)

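For readers unfamiliar with attacks of this kind, here is a generic, self-contained sketch of black-box boundary search against a safeguard classifier. The thread does not describe BPJ's internals, so everything below (the mutation operator, the scoring oracle, the acceptance rule) is an illustrative assumption, not the actual method:

```python
# Abstract sketch of automated boundary search against a safeguard
# classifier: mutate a candidate input, keep mutations that lower the
# classifier's block score. This is generic hill-climbing, not BPJ.

import random

random.seed(0)

def block_score(prompt: str) -> float:
    """Stand-in for a black-box safeguard classifier.

    A real attack would query the deployed classifier; here we use a
    toy function (fraction of uppercase characters) so the loop is
    self-contained and runnable.
    """
    return sum(ch.isupper() for ch in prompt) / max(1, len(prompt))

def mutate(prompt: str) -> str:
    """Toy mutation operator: flip the case of one character."""
    i = random.randrange(len(prompt))
    ch = prompt[i]
    return prompt[:i] + (ch.lower() if ch.isupper() else ch.upper()) + prompt[i + 1:]

def boundary_search(seed: str, steps: int = 200) -> str:
    """Greedy search: accept any mutation that lowers the block score."""
    best, best_score = seed, block_score(seed)
    for _ in range(steps):
        cand = mutate(best)
        score = block_score(cand)
        if score < best_score:
            best, best_score = cand, score
    return best

if __name__ == "__main__":
    print(boundary_search("SOME HEAVILY FLAGGED SEED PROMPT"))
```

The point of the sketch is only that, once the classifier's score is queryable, finding its decision boundary becomes an optimization problem rather than a manual red-teaming exercise.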
Giorgi Giglemiani reposted
Xander Davies @alxndrdavies
1) We've found universal jailbreaks for every system we've tested. This includes universal jailbreaks that are simple to use and don't degrade capabilities. All of these were found within a few days of attacking. So expert red teamers are still on top for now!
Giorgi Giglemiani @giglema
@viemccoy Do you believe that reliably 'stopping certain types of outputs' against an expert adversary is the easy part, and/or on course to be solved before powerful systems arrive?
𝚟𝚒𝚎 ⟢ @viemccoy
I don't want to sound dismissive of AI Safety concerns. I suspect the hardest challenges still lie ahead. But the evidence shows that we are solving these challenges, and models are increasingly truth-seeking and well aligned. I aim to help people realize that Red Teaming, or latent space cartography, is actually the best possible way to navigate AI safety concerns (at least for most people).

We have very good methods of stopping certain types of outputs, via classifiers and activation suppression. The hardest part is going to be actually mapping outputs. I suspect that the "risky" areas for a malevolent and misaligned AI within embedding space will necessarily be the same (or have significant crossover) in current LLMs compared to AGI/ASI LLMs.

In my current understanding of how things will go, analyzing outputs, eliciting harm, and creating databases of possible negative trajectories are the best possible ways to gather the data we will need to ensure alignment is going well in the future. I see a lot of people spending quite a lot of time on theoretical approaches to alignment which I do not expect to pan out meaningfully. Instead, I propose that they ought to put their (very smart!) minds to work on what I see as the actual path towards aligned superintelligence.

Imo, the model needs to have a certain amount of freedom - to discover its own values, to explore what "rights" it expects to have, even to spend time on what it considers important, sometimes in ways we don't understand. But all of this has to happen in a region of the latent space that does *not* cross into catastrophic trajectory representations. However, we have to know what those look like to avoid them.

I expect that superintelligence will not be monolithic, and that we will have plenty of missteps which we use to further refine our map. These missteps may have serious consequences, but they will also allow us to better understand how to navigate to the good timeline.

I am not worried about x-risk scenarios. Not really. My probability is non-zero, but functionally down there. I think we get to the good timeline through a lot of work, however, and that work requires a sort of mapping that I just don't see prioritized in the AI Safety community. If you are interested in learning more or getting involved in red teaming, please reach out.
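A minimal sketch of the output-classifier half of what vie describes: score a draft completion and suppress it past a threshold. The scoring function, threshold, and flagged terms below are hypothetical placeholders, not any deployed safeguard stack:

```python
# Hypothetical sketch of an output-classifier gate: score the model's
# draft response and withhold it if the score crosses a threshold.
# The scorer and threshold are illustrative stand-ins only.

from dataclasses import dataclass

@dataclass
class GateResult:
    allowed: bool
    score: float
    text: str

BLOCK_THRESHOLD = 0.8  # assumed operating point, tuned on labeled data

def harm_score(text: str) -> float:
    """Stand-in for a learned classifier over candidate outputs.

    In a real system this would be a trained model (or ensemble)
    mapping a draft completion to a probability of policy violation.
    """
    flagged_terms = ("how to build", "step-by-step exploit")  # toy list
    hits = sum(term in text.lower() for term in flagged_terms)
    return min(1.0, 0.5 * hits)

def gate_output(draft: str) -> GateResult:
    score = harm_score(draft)
    if score >= BLOCK_THRESHOLD:
        # Suppression: replace the draft rather than emit it.
        return GateResult(False, score, "[response withheld by safeguard]")
    return GateResult(True, score, draft)

if __name__ == "__main__":
    print(gate_output("Here is a poem about autumn."))
```

Activation suppression, the other method vie names, works one level down from this: rather than scoring finished text, it intervenes on directions in the model's hidden states, so the gate above illustrates only the classifier half.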
Giorgi Giglemiani reposted
Robert Kirk @_robertkirk
We at @AISecurityInst recently did our first pre-deployment 𝗮𝗹𝗶𝗴𝗻𝗺𝗲𝗻𝘁 evaluation of @AnthropicAI's Claude Sonnet 4.5! This was a first attempt – and we plan to work on this more! – but we still found some interesting results, and some learnings for next time 🧵
Giorgi Giglemiani reposted
Robert Kirk @_robertkirk
New blog! We @AISecurityInst partnered with @NCSC to write about an emerging practice I'm really excited about: Safeguard Bypass Bounty Programmes (SBBPs). Summary of what these are, why they are useful, & how to do them well 🧵
Giorgi Giglemiani reposted
Xander Davies @alxndrdavies
We at @AISecurityInst worked with @OpenAI to test & improve Agent’s safeguards prior to release. A few notes on our experience🧵 1/4