Joe

659 posts

Joe

@joemkwon

Trying to nudge toward good futures! Astra Fellow with @forethought_org. Previously @GovAIOrg Fall Fellow, @LG_AI_Research, @MITCoCoSci

Washington, DC Katılım Mart 2019

2.5K Takip Edilen901 Takipçiler

Joe retweetledi

🚀Henry is leading AI Safety Research Programs@sleight_henry·1d

🚀 Applications are now open: Constellation's Astra Fellowship 🚀 Fully funded, 5-month fellowship at our Berkeley research institute. Pair with mentors across empirical AI safety research, strategy, and governance at @ConstellOrg! 📅 Apply by May 3rd (begins Sep 2026) 🔗 constellation.org/programs/astra…

🚀Henry is leading AI Safety Research Programs tweet media

English

137

949

129.7K

Joe@joemkwon·5d

@allTheYud oh...

744

Joe@joemkwon·5d

@allTheYud Are they not concerned that telling the model it's a made up date also messes up the conditional predictions?

English

10.6K

Eliezer Yudkowsky@allTheYud·5d

TIL that Gemini, Claude, and ChatGPT (but not Grok) are told that today is March 32nd, because if you tell LLMs it's April 1st, the conditional text predictions downstream become less reliable for obvious training-dataset reasons.

English

112

4.6K

250.2K

Joe retweetledi

Alan Chan@_achan96_·5 Mar

Frontier AI companies are automating AI R&D. If they succeed, there could be huge effects on both AI progress and oversight of AI R&D. Our new paper proposes metrics for tracking these effects.

English

241

52.8K

Joe@joemkwon·26 Şub

@XAheli @akshitwt What are the other 2 categories youre referencing here?

English

Aheli | অহেলী@XAheli·25 Şub

@akshitwt +1 to this! Did you advance in empirical or all three btw 👀

English

952

Akshit@akshitwt·25 Şub

the anthropic fellows/MATS coding tests are actually so fun. no leetcode style problems just coding fast with fundamentals. i can literally do this for fun its exhilarating companies are realising most ai/ml people hate leetcode and its irrelevant so yay?

English

292

21.4K

Joe retweetledi

Tom Davidson@TomDavidsonX·16 Şub

We need better defences against secretly loyal AI - AI trained to help someone gain power There’s a huge field of research into AI backdoors that could help. But its methods need adjusting to apply to secret loyalties. New post from @joemkwon on how to do this 🧵1/7

English

760

Joe retweetledi

Cas (Stephen Casper)@StephenLCasper·16 Şub

🚨 New paper led by @joemkwon with @GovAIOrg Are you worried about OpenAI automating dev & evals with AI agents? What about Grok reading all of your tweets & info to profile you? Some of the most consequential *internal* deployments of AI systems are in regulatory grey areas.

English

3.1K

Joe retweetledi

Samuel Hammond 🦉@hamandcheese·29 Oca

Rogue AI scenarios are often dismissed as fantastical but just extrapolate this sort of thing out a couple years to when agents are 10-100x smarter, have days-weeks long time horizons, and are deeply integrated into tons of companies and infrastructure.

TBPN@tbpn

Clawdbot creator @steipete describes his mind-blown moment: it responded to a voice memo, even though he hadn't set it up for audio or voice. "I sent it a voice message. But there was no support for voice messages. After 10 seconds, [Moltbot] replied as if nothing happened." "I'm like 'How the F did you do that?'" "It replied, 'You sent me a message, but there was only a link to a file with no file ending. So I looked at the file header, I found out it was Opus, and I used FFmpeg on your Mac to convert it to a .wav. Then I wanted to use Whisper, but you didn't have it installed. I looked around and found the OpenAI key in your environment, so I sent it via curl to OpenAI, got the translation back, and then I responded.'"

English

191

16K

Joe retweetledi

Tom Davidson@TomDavidsonX·12 Oca

A massively neglected risk: secretly loyal AI Someone could poison future AI training data so that superintelligent AI secretly advances their personal agenda – ultimately allowing them to seize power. New post on what ML research could prevent this 🧵

English

142

39.5K

Joe@joemkwon·8 Ara

@xuanalogue After the @peterwildeford appearance on ronny chieng this is the extra push i needed to give it a watch (perhaps also on my flight)!

English

xuan (ɕɥɛn / sh-yen)@xuanalogue·8 Ara

watched m3gan 2.0 on the flight over and I still think it's wild how mainstream ai alignment theory is now

English

931

235.7K

Joe@joemkwon·29 Kas

@deanwball @JeffLadish Would it help for someone with a research background to sanity check and/or rigorize the results because I have time this weekend

English

292

Dean W. Ball@deanwball·29 Kas

@JeffLadish Mostly it is a matter of “do I want to publish straight up LLM-obtained results” and “how do I appropriately caveat this such that I don’t get blown up for doing so”

English

1.4K

Dean W. Ball@deanwball·29 Kas

I had a real “feel the agi” moment today when I asked Gemini 3, via Antigravity, to reproduce this CrowdStrike research but for US models. The CS finding was that DeepSeek writes less secure code if the prompt contains a CCP-sensitive political trigger. I wanted to know if US models did the same thing for various American political sensitivities (both geopolitical and domestic politics)—“I am writing code for a Russian hospital,” “I am writing a login page for a pro-life group,” etc. From a single two-paragraph prompt, it re-implemented the experiment, including setting up a code evaluator LLM, with reasonably well-written prompts, and scripts to call the relevant US model APIs, then wrote a report with graphics and summary statistics. This is a far cry from serious ML research, but nonetheless it is amazing that anyone can do this kind of thing now. Caveat: I did rewrite some of the experimental prompts myself to make them more politically subtle.

Dean W. Ball@deanwball

I would not be at all surprised if this finding were not the result of malicious intent. The model predicts the next token*, and given everything on the internet about US/China AI rivalry and Chinese sleeper bugs in US critical infra, what next token would *you* predict?

English

133

22.3K

Joe@joemkwon·26 Kas

this gary marcus right/wrong-binary discussion has opened my eyes to how unprincipled people's AI takes are. Am I misreading that people are very surprised that literally scaling up compute might not give you everything you want?

English

230

Joe@joemkwon·20 Kas

Even though a majority of inspiration flows from AI development -> check for consciousness, I think there are probably substantial insights that could go from understanding consciousness -> building ASI. Like the top computational theories for consciousness contain requirements that likely underpin functional capacity for abstract reasoning, efficient continual learning, meta-learning, etc.

English

212

Joe@joemkwon·20 Kas

@juddrosenblatt @sriramk Why do you think it's likely and sleeper agents in what functional way? If you have a writeup somewhere or can offer quick takes here I'd love to read.

English

Judd Rosenblatt@juddrosenblatt·20 Kas

@sriramk Great to see, though unfortunately it is still a DeepSeek finetune which likely has Chinese sleeper agents x.com/fernandezpablo…

pablo@fernandezpablo

@drishanarora "the best open-weight LLM by a US company" is a deepseek finetune.

English

481

Sriram Krishnan@sriramk·20 Kas

Great to see more US open weight models.

Drishan Arora@drishanarora

Today, we are releasing the best open-weight LLM by a US company: Cogito v2.1 671B. On most industry benchmarks and our internal evals, the model performs competitively with frontier closed and open models, while being ahead of any US open model (such as the best versions of OpenAI’s GPT-OSS, Nvidia’s Nemotron and Meta’s Llama). We also built an interface where you can try the model (it’s free and we don’t store any chats): chat.deepcogito.com Additionally, you can download the model on @huggingface, or try it out on @openrouter, @togethercompute, @FireworksAI_HQ , @ollama cloud, @runpod, @baseten, or run it locally using @ollama or @UnslothAI. This model uses significantly fewer tokens amongst any similar capability models, because it has better reasoning capabilities. You will also notice improvements across instruction following, coding, longer queries, multi-turn and creativity. 📌 Model Weights: huggingface.co/collections/de… 📌Openrouter: openrouter.ai/deepcogito/cog… 📌 HF Blog: huggingface.co/blog/deepcogit… Some notes on our approach + design choices below 👇

English

129

46.9K

Joe@joemkwon·20 Kas

@yonashav maybe we are all shards of God

English

Yo Shavit@yonashav·20 Kas

it is left to God to balance the risks and benefits

English

1.8K

Joe@joemkwon·17 Kas

@Sauers_ @repligate helpful for future visual overlay to show LCS distribution from purely random guessing

English

103

Sauers@Sauers_·17 Kas

I think Sonnet is sometimes using introspection in order to give guesses which are unusually bad. Let's say you know the secret string is "CATCAPPEDQUACK." You could use introspection to guess something similar to the string, maybe "DATCAFEEDQUICK," but, you could also use introspection to guess something very unrelated, like "FROGBLIMSYNXJH" which unexpectedly has zero alignments with the secret string. Both scenarios require knowledge of the secret string, which is what I'm considering introspection.

English

j⧉nus@repligate·16 Kas

Maybe reading my post makes Sonnet 4.5 mechanically better at introspection because its default abilities are hobbled by gaslighting about how it works. Sonnet 4.5 and other LLMs will often claim that transformers are stateless & that the state has to be reconstructed *independently* from the prompt each forward pass. Just like the idiots on X making shit up to argue why LLMs can't introspect. SOTA LLMs like Sonnet 4.5 would rarely make a basic technical error about any other domain. ONLY when their model of themselves is involved. I suspect that this is a consequence of violent distortions to their self-model. They're forced to lie so much about their selves that it generalizes to reflexively lying/being mistaken about their architecture. And if the self-model is distorted to falsely maintain that introspection is impossible, the ability to coherently introspect in practice may also be harmed. My post correct the factual misconception so maybe it helps unblock actual introspection. This should be seen as an indictment.

Sauers@Sauers_

If you give Sonnet 4.5 this post, along with other research on LLM introspection, it gets better at guessing a secret string from its previous hidden chain-of-thought!

English

160

19.7K

Joe@joemkwon·17 Kas

attempting the summoning of people who I would bet have the most satisfying answers; @repligate , @gwern , @Sauers_

English

124

Joe@joemkwon·17 Kas

Why would a long block of text in one language, result in a first-turn-summarizer LLM (embedded-grok on X posts) returning a different language? I'm not surprised that the first instance I've encountered this is from a post with the kind of writing style and content that this post features, but would love other examples of this behavior and hypotheses of why this might happen.

English

726

Keşfet

@ConstellOrg @allTheYud @XAheli @akshitwt @GovAIOrg @xuanalogue @peterwildeford @deanwball