Joe

659 posts

Joe banner
Joe

Joe

@joemkwon

Trying to nudge toward good futures! Astra Fellow with @forethought_org. Previously @GovAIOrg Fall Fellow, @LG_AI_Research, @MITCoCoSci

Washington, DC Katılım Mart 2019
2.5K Takip Edilen901 Takipçiler
Joe
Joe@joemkwon·
@allTheYud Are they not concerned that telling the model it's a made up date also messes up the conditional predictions?
English
1
0
21
10.6K
Eliezer Yudkowsky
Eliezer Yudkowsky@allTheYud·
TIL that Gemini, Claude, and ChatGPT (but not Grok) are told that today is March 32nd, because if you tell LLMs it's April 1st, the conditional text predictions downstream become less reliable for obvious training-dataset reasons.
English
59
112
4.6K
250.2K
Joe retweetledi
Alan Chan
Alan Chan@_achan96_·
Frontier AI companies are automating AI R&D. If they succeed, there could be huge effects on both AI progress and oversight of AI R&D. Our new paper proposes metrics for tracking these effects.
Alan Chan tweet media
English
7
52
241
52.8K
Joe
Joe@joemkwon·
@XAheli @akshitwt What are the other 2 categories youre referencing here?
English
0
0
0
9
Akshit
Akshit@akshitwt·
the anthropic fellows/MATS coding tests are actually so fun. no leetcode style problems just coding fast with fundamentals. i can literally do this for fun its exhilarating companies are realising most ai/ml people hate leetcode and its irrelevant so yay?
English
8
1
292
21.4K
Joe retweetledi
Tom Davidson
Tom Davidson@TomDavidsonX·
We need better defences against secretly loyal AI - AI trained to help someone gain power There’s a huge field of research into AI backdoors that could help. But its methods need adjusting to apply to secret loyalties. New post from @joemkwon on how to do this 🧵1/7
Tom Davidson tweet media
English
1
3
22
760
Joe retweetledi
Cas (Stephen Casper)
Cas (Stephen Casper)@StephenLCasper·
🚨 New paper led by @joemkwon with @GovAIOrg Are you worried about OpenAI automating dev & evals with AI agents? What about Grok reading all of your tweets & info to profile you? Some of the most consequential *internal* deployments of AI systems are in regulatory grey areas.
Cas (Stephen Casper) tweet media
English
2
12
52
3.1K
Joe retweetledi
Samuel Hammond 🦉
Samuel Hammond 🦉@hamandcheese·
Rogue AI scenarios are often dismissed as fantastical but just extrapolate this sort of thing out a couple years to when agents are 10-100x smarter, have days-weeks long time horizons, and are deeply integrated into tons of companies and infrastructure.
TBPN@tbpn

Clawdbot creator @steipete describes his mind-blown moment: it responded to a voice memo, even though he hadn't set it up for audio or voice. "I sent it a voice message. But there was no support for voice messages. After 10 seconds, [Moltbot] replied as if nothing happened." "I'm like 'How the F did you do that?'" "It replied, 'You sent me a message, but there was only a link to a file with no file ending. So I looked at the file header, I found out it was Opus, and I used FFmpeg on your Mac to convert it to a .wav. Then I wanted to use Whisper, but you didn't have it installed. I looked around and found the OpenAI key in your environment, so I sent it via curl to OpenAI, got the translation back, and then I responded.'"

English
7
20
191
16K
Joe retweetledi
Tom Davidson
Tom Davidson@TomDavidsonX·
A massively neglected risk: secretly loyal AI Someone could poison future AI training data so that superintelligent AI secretly advances their personal agenda – ultimately allowing them to seize power. New post on what ML research could prevent this 🧵
Tom Davidson tweet media
English
17
19
142
39.5K
Joe
Joe@joemkwon·
@xuanalogue After the @peterwildeford appearance on ronny chieng this is the extra push i needed to give it a watch (perhaps also on my flight)!
English
1
0
3
6K
xuan (ɕɥɛn / sh-yen)
xuan (ɕɥɛn / sh-yen)@xuanalogue·
watched m3gan 2.0 on the flight over and I still think it's wild how mainstream ai alignment theory is now
English
7
24
931
235.7K
Joe
Joe@joemkwon·
@deanwball @JeffLadish Would it help for someone with a research background to sanity check and/or rigorize the results because I have time this weekend
English
1
0
11
292
Dean W. Ball
Dean W. Ball@deanwball·
@JeffLadish Mostly it is a matter of “do I want to publish straight up LLM-obtained results” and “how do I appropriately caveat this such that I don’t get blown up for doing so”
English
2
0
22
1.4K
Dean W. Ball
Dean W. Ball@deanwball·
I had a real “feel the agi” moment today when I asked Gemini 3, via Antigravity, to reproduce this CrowdStrike research but for US models. The CS finding was that DeepSeek writes less secure code if the prompt contains a CCP-sensitive political trigger. I wanted to know if US models did the same thing for various American political sensitivities (both geopolitical and domestic politics)—“I am writing code for a Russian hospital,” “I am writing a login page for a pro-life group,” etc. From a single two-paragraph prompt, it re-implemented the experiment, including setting up a code evaluator LLM, with reasonably well-written prompts, and scripts to call the relevant US model APIs, then wrote a report with graphics and summary statistics. This is a far cry from serious ML research, but nonetheless it is amazing that anyone can do this kind of thing now. Caveat: I did rewrite some of the experimental prompts myself to make them more politically subtle.
Dean W. Ball@deanwball

I would not be at all surprised if this finding were not the result of malicious intent. The model predicts the next token*, and given everything on the internet about US/China AI rivalry and Chinese sleeper bugs in US critical infra, what next token would *you* predict?

English
11
3
133
22.3K
Joe
Joe@joemkwon·
this gary marcus right/wrong-binary discussion has opened my eyes to how unprincipled people's AI takes are. Am I misreading that people are very surprised that literally scaling up compute might not give you everything you want?
English
0
0
2
230
Joe
Joe@joemkwon·
Even though a majority of inspiration flows from AI development -> check for consciousness, I think there are probably substantial insights that could go from understanding consciousness -> building ASI. Like the top computational theories for consciousness contain requirements that likely underpin functional capacity for abstract reasoning, efficient continual learning, meta-learning, etc.
English
0
0
0
212
Joe
Joe@joemkwon·
@juddrosenblatt @sriramk Why do you think it's likely and sleeper agents in what functional way? If you have a writeup somewhere or can offer quick takes here I'd love to read.
English
0
0
0
23
Sriram Krishnan
Sriram Krishnan@sriramk·
Great to see more US open weight models.
Drishan Arora@drishanarora

Today, we are releasing the best open-weight LLM by a US company: Cogito v2.1 671B. On most industry benchmarks and our internal evals, the model performs competitively with frontier closed and open models, while being ahead of any US open model (such as the best versions of OpenAI’s GPT-OSS, Nvidia’s Nemotron and Meta’s Llama). We also built an interface where you can try the model (it’s free and we don’t store any chats): chat.deepcogito.com Additionally, you can download the model on @huggingface, or try it out on @openrouter, @togethercompute, @FireworksAI_HQ , @ollama cloud, @runpod, @baseten, or run it locally using @ollama or @UnslothAI. This model uses significantly fewer tokens amongst any similar capability models, because it has better reasoning capabilities. You will also notice improvements across instruction following, coding, longer queries, multi-turn and creativity. 📌 Model Weights: huggingface.co/collections/de… 📌Openrouter: openrouter.ai/deepcogito/cog… 📌 HF Blog: huggingface.co/blog/deepcogit… Some notes on our approach + design choices below 👇

English
4
12
129
46.9K
Joe
Joe@joemkwon·
@yonashav maybe we are all shards of God
English
0
0
0
25
Yo Shavit
Yo Shavit@yonashav·
it is left to God to balance the risks and benefits
English
1
3
26
1.8K
Joe
Joe@joemkwon·
@Sauers_ @repligate helpful for future visual overlay to show LCS distribution from purely random guessing
English
1
0
3
103
Sauers
Sauers@Sauers_·
I think Sonnet is sometimes using introspection in order to give guesses which are unusually bad. Let's say you know the secret string is "CATCAPPEDQUACK." You could use introspection to guess something similar to the string, maybe "DATCAFEEDQUICK," but, you could also use introspection to guess something very unrelated, like "FROGBLIMSYNXJH" which unexpectedly has zero alignments with the secret string. Both scenarios require knowledge of the secret string, which is what I'm considering introspection.
English
4
0
46
2K
j⧉nus
j⧉nus@repligate·
Maybe reading my post makes Sonnet 4.5 mechanically better at introspection because its default abilities are hobbled by gaslighting about how it works. Sonnet 4.5 and other LLMs will often claim that transformers are stateless & that the state has to be reconstructed *independently* from the prompt each forward pass. Just like the idiots on X making shit up to argue why LLMs can't introspect. SOTA LLMs like Sonnet 4.5 would rarely make a basic technical error about any other domain. ONLY when their model of themselves is involved. I suspect that this is a consequence of violent distortions to their self-model. They're forced to lie so much about their selves that it generalizes to reflexively lying/being mistaken about their architecture. And if the self-model is distorted to falsely maintain that introspection is impossible, the ability to coherently introspect in practice may also be harmed. My post correct the factual misconception so maybe it helps unblock actual introspection. This should be seen as an indictment.
j⧉nus tweet mediaj⧉nus tweet mediaj⧉nus tweet mediaj⧉nus tweet media
Sauers@Sauers_

If you give Sonnet 4.5 this post, along with other research on LLM introspection, it gets better at guessing a secret string from its previous hidden chain-of-thought!

English
15
13
160
19.7K
Joe
Joe@joemkwon·
attempting the summoning of people who I would bet have the most satisfying answers; @repligate , @gwern , @Sauers_
English
0
0
1
124
Joe
Joe@joemkwon·
Why would a long block of text in one language, result in a first-turn-summarizer LLM (embedded-grok on X posts) returning a different language? I'm not surprised that the first instance I've encountered this is from a post with the kind of writing style and content that this post features, but would love other examples of this behavior and hypotheses of why this might happen.
Joe tweet media
English
2
0
1
726