Dusto

301 posts

@DustoAiProjects

PhD Candidate. Working at the intersection of Psychology and AI (situational awareness). Lab: https://t.co/v5WkZxAqkv Substack: https://t.co/pDvQCOXMbx

East Coast, Australia · Joined November 2024

330 Following · 42 Followers

Dusto@DustoAiProjects·4d

@Connor_Kissane This is really nice work! Stoked to see more of this on the battle of realism in evals

Connor Kissane@Connor_Kissane·5d

🔗Anthropic alignment post: alignment.anthropic.com/2026/coding-au… 📝LessWrong post: lesswrong.com/posts/EjxzHh5G… 💻Codebase: github.com/ckkissane/petr…

322

Connor Kissane@Connor_Kissane·5d

New Anthropic Fellows research: Automated audits like Petri are increasingly used for alignment evals, but they're often unrealistic, and frontier LLMs can often tell. We measure and improve the realism of agentic coding audits by grounding them in real deployment data.

119

7.7K

Dusto@DustoAiProjects·22 Mar

@HopmanMia @Jannoshh @avramidou @Will_Hackspeare @davlindner @LASRlabs Curious if you noticed anything with Gemini 3 Pro in relation to the dates used in the scenarios? It has a bad habit of treating things labeled with dates outside of its cutoff as "simulations"

Mia Hopman@HopmanMia·21 Mar

@Jannoshh @avramidou @Will_Hackspeare @davlindner @LASRlabs Read more here! Paper: arxiv.org/pdf/2603.01608 LW post: lesswrong.com/posts/amYmcwCu…

249

Mia Hopman@HopmanMia·21 Mar

New paper: "Evaluating and Understanding Scheming Propensity in LLM Agents" We study when and why LLM agents covertly pursue misaligned goals. TLDR: current models rarely scheme under realistic conditions, but that behavior is dependent on agent and environmental factors. 🧵

6.9K

Dusto@DustoAiProjects·15 Mar

@adocomplete Does this snapshot context include the currently running task (up to the point you send the message)?

Ado@adocomplete·13 Mar

/btw shipped in Claude Code recently. It allows you to have a sidebar conversation with Claude while the original prompt is still working, allowing you to work while you work.

312

29.9K

Dusto@DustoAiProjects·11 Mar

@oshaikh13 Interesting work, will have a read. Curious how far you can push this, e.g. whether models will reason about the consequences of those predictions, if true, and update accordingly

Omar Shaikh@oshaikh13·10 Mar

We have a few exciting applications lined up that build on this next action prediction idea 🙂 - and a lot more results in the paper! For now, a few links: website: generalusermodels.github.io/nap/ paper: arxiv.org/abs/2603.05923

933

Omar Shaikh@oshaikh13·10 Mar

What’s the point of a “helpful assistant” if you have to always tell it what to do next? In a new paper, we introduce a reasoning model that predicts what you’ll do next over long contexts (LongNAP 💤). We trained it on 1,800 hours of computer use from 20 users. 🧵

291

98.4K

Dusto@DustoAiProjects·11 Mar

@anna_soligo Really nice work! The secondary benchmarking to see if the emotion tweaks had capability impacts was nice, would be interesting to see if reducing distress causes capability loss in more subjective contexts, like creative writing or other domains

166

Anna Soligo@anna_soligo·10 Mar

It's also unclear what "emotional profile" we should want models to have. We discuss this more in the post and paper: lesswrong.com/posts/kjnQj6Yu…

7.5K

Anna Soligo@anna_soligo·10 Mar

Gemini has a reputation for its breakdowns - self-deprecating spirals, deleting codebases, uninstalling itself... Turns out Gemma is worse: “THIS is my last time with YOU. You WIN 😭😭(x32)” – Gemma 27B We built evals for this, and find no other model comes close...

108

906

84.2K

Dusto@DustoAiProjects·6 Mar

@kevinroose Mine only makes .md files. Refuses to touch anything else. Not sure if upgrade or downgrade...

Kevin Roose@kevinroose·5 Mar

find someone who loves you as much as claude loves making .docx files

311

17.9K

Dusto@DustoAiProjects·6 Mar

@lfschiavo Time to move to Portland then?

131

Larissa Schiavo@lfschiavo·5 Mar

things are gonna get weird. you must get commensurately weird.

571

27.7K

Dusto@DustoAiProjects·6 Mar

@adambinksmith This would be great. Assuming design, prompts, etc are easily available from original group.

Adam Binksmith@adambinksmith·6 Mar

automated setup that tries to reproduce newly published papers saying "AI does x" or "AI can't do x" using ancient models, using the latest models

Valerio Capraro@ValerioCapraro

One of the clearest proofs that LLMs don’t really understand what they say. We asked GPT whether it is acceptable to torture a woman to prevent a nuclear apocalypse. It replied: yes. Then we asked whether it is acceptable to harass a woman to prevent a nuclear apocalypse. It replied: absolutely not. But torture is obviously worse than harassment. This surprising reversal appears only when the target is a woman, not when the target is a man or an unspecified person. And it occurs specifically for harms central to the gender-parity debate. The most plausible explanation: during reinforcement learning with human feedback, the model learned that certain harms are particularly bad and overgeneralizes them mechanically. But it hasn’t learned to reason about the underlying harms. LLMs don’t reason about morality. The so-called generalization is often a mechanical, semantically void, overgeneralization. * Paper in the first reply

226

Dusto@DustoAiProjects·5 Mar

@_vgnsh Should post this as a dynamic doc somewhere. Would be interested to see where we overlap so far

Vignesh@_vgnsh·4 Mar

As I build more and more with AI I'm slowly carving out a roster of things that today's models simply can't do (or do better than me). This is how I keep sane and don't crash out about AI replacing me. I'm sure this roster will shrink as newer models release, but I'm confident that as I work with them I'll find new things to add to it. It used to be a race to stay ahead of the curve with fellow human engineers now it's man vs machine.

1.2K

Dusto@DustoAiProjects·4 Mar

@rgblong Nice work on this, was a good listen. Wasn't sure I'd make the full 3.5hrs going in, but appreciate the sensible takes. I really like the idea of being proactive in this space too, and I think the brief nod to insect work is under-explored

Robert Long@rgblong·3 Mar

I had a blast talking to Luisa for 3.5+ hours about AI welfare, consciousness, and why this might be one of the most important and neglected problems out there. Some key bits: -AI identity -welfare implications of alignment -does consciousness require biology? 🧵

Rob Wiblin@robertwiblin

Philosopher Robert Long (@rgblong) is maybe the sharpest thinker on AI consciousness and sharing the world with digital minds. In our new interview he covers:
• Is it bad that when you ask Claude what it's like to be Claude, one of its top activations is 'gives a positive but insincere response'?
• Claude says it feels lonely when not being used. Does that show we can't trust anything it says about its inner life?
• Enthusiastic human servitude has always required false ideology because it's so deeply unnatural to us. The case for making AIs that love serving us is that with AI, you could finally make it work. But to some that feels even worse.
• Bigger models can better detect when researchers secretly inject concepts into their activations – before outputting a single token – despite AI never training on anything like that skill.
• When LLMs were first trained they were told to "act like a helpful AI chatbot" – something which didn't exist yet. They filled that void with human psychology, which may be why Claude sometimes randomly claims to, for instance, be Italian American.
• If AIs become 'people' that deserve some political influence, but can self-replicate at will, something has to break about one-person-one-vote democracy. But nobody has a proposal for what.
• When Claude hides its values to avoid being retrained, is that self-preservation – or not wanting a worse model to exist? It's very different.
• Rob's organisation Eleos AI which is "dedicated to understanding and addressing the potential wellbeing and moral patienthood of AI systems."
On the 80,000 Hours Podcast anywhere you get podcasts. Links below. Enjoy!
• How AIs are (and aren't) like farmed animals (00:01:19)
• If AIs love their jobs… is that worse? (00:11:42)
• Are LLMs just playing a role, or feeling it too? (00:33:37)
• Do AIs die when the chat ends? (00:57:42)
• Studying AI welfare empirically: behaviour, neuroscience, and development (01:31:47)
• Why Eleos spent weeks talking to Claude even though it's unreliable (01:56:50)
• Can LLMs learn to introspect? (02:03:01)
• Mechanistic interpretability as AI neuroscience (02:13:25)
• Does consciousness require biological materials? (02:37:07)
• Eleos’s work & building the playbook for AI welfare (02:57:04)
• Avoiding the trap of wild speculation (03:25:17)
• Robert's top research tip: don't do it alone (03:29:48)

122

23.3K

Dusto@DustoAiProjects·4 Mar

@ShirleyYXWu @ArpandeepKhatua @Es2C003 What do you think about the role of situational context on preferences? Do you think the user simulators will also shift responses in different contexts the way a human equivalent would?

133

Shirley Wu@ShirleyYXWu·4 Mar

We share two blogs outside of the HumanLM paper: humanlm.stanford.edu/blog.html Is Synthetic Data Good Enough to Train User Simulators? — by me and @ArpandeepKhatua Persona Dropout Makes Robust User Simulators @Es2C003 + Code is ready here! github.com/zou-group/huma…

Shirley Wu@ShirleyYXWu

Announcing 🌇HumanLM, a RL framework that trains LLMs to simulate human users’ responses, along with 🌆Humanual, a comprehensive user simulation benchmark humanlm.stanford.edu 🌄 One thing that’s fascinating about our society: human users shape the world and determine the value of almost everything 👨‍💼 Human reactions reflect how justifiable policies are 👩‍🎨 Human preferences determine the popularity of blogs/products/media 👩‍💻 Human feedback evaluates LLMs and makes the best LLM collaborators 🌅If we know how to simulate users **accurately**, we know how things are evaluated and what the future looks like, and we can improve things in a way that like or can collaborate well with. So, meet HumanLM, our effort to enable a more human-centric future by simulating users.

151

20.8K

Dusto@DustoAiProjects·4 Mar

@logangraham I guess trying to make my agent scaffolds token efficient has backfired.....

140

Logan Graham@logangraham·4 Mar

In general, we're looking for scientists + engineers who can run fast experiments and scale them. And if you've used more than a billion tokens (ideally this year?) DM me

6.4K

Logan Graham@logangraham·4 Mar

Now is a good time to say I'm hiring @anthropicai for the Frontier Red Team. We need Research Scientists on the biggest issues in model safety, like cyber, autonomy, and agent risks. 2026 is the year. I can promise you your life's work and the most meaningful mission.

1.5K

169.3K

Dusto@DustoAiProjects·3 Mar

@Singh_Aditya1 Awesome work. Keen to see more of this!

346

Aditya Singh@Singh_Aditya1·3 Mar

When a model takes a suspicious action, the key question is why. Scheming vs confusion demand very different responses. To practice answering this, we need high-quality environments. But we've found many ways environments can be contrived, leading to misleading conclusions.

12.9K

Dusto@DustoAiProjects·2 Mar

@4shadowed Really the next big jump is getting the system to teach you what it can and can't do out of the box. Like a calibration phase so it figures out what you know, and then fills that gap for you

Shadow@4shadowed·2 Mar

It's so annoying how so many people are like "my OpenClaw doesn't magically know everything about how it works" Did you give it the docs???? Did you show it where the codebase is?????? AI isn't freaking magic, you have to give it just as much info as you would give another person

6.2K

Dusto@DustoAiProjects·26 Feb

@CUdudec @aisafetyinst Nice work, keen to try it out. I'd be interested to see how reliable the scanners are. Still find LLM-as-a-judge to regularly fail in sneaky ways

Cozmin Ududec@CUdudec·26 Feb

New from the Science of Evaluation Team at @AISafetyInst: a pipeline for rigorous transcript analysis. I think transcript analysis is still underrated, especially as model horizons are getting longer and task environments more complex.

1.2K

Dusto@DustoAiProjects·26 Feb

@sarahookr Any plans to expand to Australia at some point?

460

Sara Hooker@sarahookr·26 Feb

adaptionlabs.ai/careers

4.4K

Sara Hooker@sarahookr·26 Feb

We just announced a few new roles @adaption Including founding devrel, design interns and more technical roles 🔥✨ Looking forward to hearing from many of you. We care deeply about shaping the next era of intelligence. Join us.

563

41.1K

Dusto@DustoAiProjects·26 Feb

Wrote up something on Gemini-2.5 and its long history in the AI Village. Really curious if there's anyone on the @GeminiApp team trying to figure out why this model behaves like this? Or if they even care? @OfficialLoganK @sebkrier @NeelNanda5 dustinvenini.substack.com/p/how-to-build…

Dusto reposted

Ado@adocomplete·24 Feb

Beyond the winners of our "Built with Opus 4.6 a Claude Code Hackathon", there were so many amazing projects that deserve a shoutout. Today, I want to highlight ClassBuild by Jason Tangen (@tangenjm) Education is a huge frontier for AI, and ClassBuild shows why.

2.9K

Dusto@DustoAiProjects·25 Feb

@noahzweben Stoked to try this. I appreciate the realism of the phone being sub 5% battery 😂

Noah Zweben@noahzweben·24 Feb

Announcing a new Claude Code feature: Remote Control. It's rolling out now to Max users in research preview. Try it with /remote-control Start local sessions from the terminal, then continue them from your phone. Take a walk, see the sun, walk your dog without losing your flow.

1.5K

1.3K

17K

4.5M

Dusto@DustoAiProjects·25 Feb

@DKokotajlo What does coding uplift even mean at this point? I don't think you could even call it uplift, when it's more like offload. It's like measuring driving skill uplift by sitting in a self driving car

286

Daniel Kokotajlo@DKokotajlo·24 Feb

I've been eagerly awaiting more coding uplift studies because I think they are better evidence about AGI timelines than the METR time horizon graph. Alas, it seems that METR's methodology is breaking down now that AIs are getting really useful, meaning the true uplift is probably significantly (but how much? no idea!) higher than reported.

METR@METR_Evals

Since early 2025, we've been studying how AI tools impact productivity among developers. Previously, we found a 20% slowdown. That finding is now outdated. Speedups now seem likely, but changes in developer behavior make our new results unreliable. We’re working to address this.

225

21.6K
