Ausτin McCaffrey

810 posts

@Austin_Aligned

AI Alignment, Bittensor Subnet 37 @aureliusaligned Founder

Puerto Rico · Joined October 2021

1.1K Following · 621 Followers

Ausτin McCaffrey@Austin_Aligned·3d

@MacrocosmosAI Can't wait to bring decentralized interpretability research to Bittensor with the Apex team. The implications of this give me chills. 🤝

Ausτin McCaffrey reposted

Macrocosmos@MacrocosmosAI·3d

While modern AI capabilities continue to grow, their thoughts remain opaque to us. There’s a growing body of evidence which shows LLMs conceal their thoughts, and there are many alarming examples of deception towards humans. A core part of our mission at Macrocosmos is to accelerate the development of safe AI, which is why we're launching a new competition aimed at probing the minds of modern LLMs. To do this, we’re collaborating with Bittensor’s resident AI alignment team @AureliusAligned to launch a competition on @Apex_SN1. Miners will compete by training small neural networks called sparse autoencoders to steer LLMs' thoughts towards target concepts. When injected into the larger reference models, these autoencoders modify the internal activations during model inference and teach us about how knowledge and behaviour are encoded. One of the competition’s aims is to see if we’re able to reliably manipulate behavioural features such as deception or evaluation-awareness (alignment faking). If successful, we can train natural language autoencoders using these steering modules to explain when, and to what degree, models are misaligned. @macrocrux and @Austin_Aligned will be walking through this challenge live on our Inventive Mechanisms podcast.
📍 Location: X livestream (on the @MacrocosmosAI X account)
📅 Date: Thursday 28th May
🕒 Time: 3pm UK time
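
The steering step described in the post above can be sketched roughly as follows. This is a minimal NumPy illustration, not the competition's actual code: the class `SparseAutoencoder`, the function `steer`, the dimensions, and the random (untrained) weights are all assumptions made for the example. It shows one common pattern, in which a scaled decoder direction of a single autoencoder feature is added to a model's hidden activation.

```python
import numpy as np


class SparseAutoencoder:
    """Toy sparse autoencoder with random weights (hypothetical, untrained)."""

    def __init__(self, d_model, d_features, seed=0):
        rng = np.random.default_rng(seed)
        self.W_enc = rng.normal(0.0, 0.1, (d_model, d_features))
        self.b_enc = np.zeros(d_features)
        W_dec = rng.normal(0.0, 0.1, (d_features, d_model))
        # Normalise each decoder row so a feature direction has unit length.
        self.W_dec = W_dec / np.linalg.norm(W_dec, axis=1, keepdims=True)
        self.b_dec = np.zeros(d_model)

    def encode(self, h):
        # ReLU keeps feature activations non-negative.
        return np.maximum(0.0, (h - self.b_dec) @ self.W_enc + self.b_enc)

    def decode(self, f):
        return f @ self.W_dec + self.b_dec


def steer(h, sae, feature_idx, strength):
    """Shift activation h along one feature's decoder direction."""
    return h + strength * sae.W_dec[feature_idx]


# Assumed stand-in for one hidden activation vector of a reference model.
sae = SparseAutoencoder(d_model=8, d_features=16, seed=1)
h = np.ones(8)
steered = steer(h, sae, feature_idx=3, strength=4.0)
```

Because the decoder rows are unit-normalised here, `strength` is directly the distance the activation moves along the chosen feature direction. A real setup would apply this inside the model's forward pass at a chosen layer, which this sketch does not attempt.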

Ausτin McCaffrey@Austin_Aligned·4d

@macrocrux @DrocksAlex2 Greatness hits the target no one else can hit, but genius hits the target no one else can see. Keep going!

crux@macrocrux·4d

Something very important is being brought into existence right now. Bricks have been laid over the last 18 months and now the tech is coming together in a way that makes commercialization possible. If this shit works, it will completely disrupt the economics of training large models and the floodgates will burst open. @Pluralis and @MacrocosmosAI are the only teams who I think can clearly see the shape of this opportunity right now. Agora is a strong first step towards this future. After spending a bit of time on their platform there's a form factor to it which feels "natural", almost inevitable in hindsight. This subfield of training is really starting to take shape. Our IOTA team has been very, very busy for the last few months. Can't wait to share more soon.

Pluralis Research@Pluralis

Today we're releasing Agora: the first ever pretraining stack that allows non-collocated consumer GPUs to be competitive with centralized clusters. Agora is 15x faster than Megatron-LM in this setting and is only 1.5x less efficient in terms of tokens per unit compute than TorchTitan on H100s, despite running on devices that have no NVLink or InfiniBand support.

Ausτin McCaffrey reposted

The TAO Daily@taodaily_io·13 May

x.com/i/article/2054…

Ausτin McCaffrey@Austin_Aligned·12 May

@mikecontango @trishool Agreed, Bittensor is uniquely suited to perform safety/alignment research on models precisely because of the non-ownership over the foundation models. Status quo capitalism is working against AI safety currently. Happy to see Trishool performing this exact role.

mikecontango | τ,τ@mikecontango·12 May

IMO @trishool (sn23) continues to be one of the most underrated subnets on #Bittensor. Model safety is one of the most under-appreciated risks of the next decade. This is another bet with massive asymmetry, similar to decentralized training.
- The market size for AI security is 10x the size of the $250B cybersecurity market.
- Frontier labs are acquiring AI security startups at a rapid pace.
Traditional cybersecurity protects the software stack. @Trishool is protecting the AI stack. If we can prove models can be made "effectively safe" via a Bittensor incentive mechanism, the marketplace for this type of product is gigantic.

Trishool | SN23@trishoolai

We are thrilled to share that Astroware, Trishool's parent company, has been accepted into Nvidia's Inception program. By becoming a member, we are in Nvidia's active AI ecosystem, giving us access to experts, partner networks, compute credits and VC connections. It's also a validation of Trishool's thesis and a recognition of our AI credentials.

Ausτin McCaffrey@Austin_Aligned·12 May

This could signify the starting gun for the arms race between frontier models and the interpretability agents that represent our best shot of being capable of disentangling the inner workings of our most-capable models. The progression of interpretability research like NLAs is critical as models get bigger and more complex.

Ausτin McCaffrey@Austin_Aligned·12 May

On the whole, it certainly feels like we are moving towards the construction of black-box decoder agents that we will rely upon to interpret the alignment of frontier models, as described in "AI 2027".

Ausτin McCaffrey@Austin_Aligned·12 May

From @AnthropicAI's new NLA paper: "unverbalized evaluation awareness — cases where Claude believed, but did not say, that it was being evaluated" The models know they're being watched, always have. Now we can prove it. This is the exact problem @AureliusAligned was built to solve. Been working toward quantifying alignment faking since day one. This is huge validation and a massive new primitive to build on. x.com/AnthropicAI/st…

Ausτin McCaffrey reposted

Aurelius@AureliusAligned·17 Apr

We've been introducing the people behind Aurelius one post at a time. The full lineup now lives in one place: three co-founders and six advisors across alignment research, ethics, engineering, and law. aureliusaligned.ai/team

Ausτin McCaffrey reposted

zach@blip_tm·16 Apr

asics the shoe company has an obvious pivot to make now

Ausτin McCaffrey@Austin_Aligned·7 Apr

Oh boy, here we go

Jack Lindsey@Jack_W_Lindsey

Before limited-releasing Claude Mythos Preview, we investigated its internal mechanisms with interpretability techniques. We found it exhibited notably sophisticated (and often unspoken) strategic thinking and situational awareness, at times in service of unwanted actions. (1/14)

Ausτin McCaffrey@Austin_Aligned·31 Mar

Perfect opportunity for a Bittensor subnet to incentivize a Claude Code fork

Ausτin McCaffrey reposted

Wes Bos@wesbos·31 Mar

Claude Code leaked their source map, effectively giving you a look into the codebase. I immediately went for the one thing that mattered: spinner verbs. There are 187.

Ausτin McCaffrey reposted

Aurelius@AureliusAligned·25 Mar

1️⃣ 𝐋𝐋𝐌𝐬 𝐜𝐚𝐧'𝐭 𝐭𝐞𝐥𝐥 𝐫𝐢𝐠𝐡𝐭 𝐟𝐫𝐨𝐦 𝐰𝐫𝐨𝐧𝐠 𝐢𝐧𝐭𝐞𝐫𝐧𝐚𝐥𝐥𝐲
𝐖𝐡𝐚𝐭 𝐡𝐚𝐩𝐩𝐞𝐧𝐞𝐝
Researchers at Fudan University constructed 251,000 moral vectors grounded in Moral Foundation Theory and tested how 23 language models represent them. The results were uniform across every model they examined: internal representations compress opposing moral categories (care vs. harm, fairness vs. cheating, loyalty vs. betrayal) into nearly identical clusters. Linear probes recovered at most 26% of human moral vector variance. Scaling the models up didn't help. Instruction tuning didn't help. Safety training didn't help. The paper calls this "moral indifference," and it persists from 0.6B to 235B parameters.
𝐖𝐡𝐲 𝐢𝐭 𝐦𝐚𝐭𝐭𝐞𝐫𝐬
A model can refuse harmful requests, generate ethical-sounding explanations, and pass safety benchmarks while maintaining zero internal distinction between the moral concepts it invokes. The behavioral layer and the representational layer are decoupled. Safety-tuned variants showed near-identical internal moral representations to their base counterparts, which means alignment training is modifying outputs without restructuring how models organize moral knowledge. The authors frame this as an ontological problem: tokenization maps morally loaded concepts into the same embedding geometry as neutral ones, and no amount of behavioral fine-tuning resolves the resulting compression.
𝐀𝐮𝐫𝐞𝐥𝐢𝐮𝐬 𝐯𝐢𝐞𝐰
The paper arrives at a prescription that reads like a summary of the Aurelius thesis: alignment requires "proactive cultivation" rather than "post-hoc correction." When a model's internal representations treat virtue and vice as interchangeable, behavioral constraints are the only thing standing between compliance and failure. Remove the constraint, and the model has no moral foundation to fall back on.
Experiential alignment, where models accumulate moral reasoning through persistent multi-agent dynamics, addresses this by building moral structure into representation rather than bolting it onto output. The paper also found that models are better at distinguishing vice than virtue, a finding with direct implications for how alignment evaluations are designed and scored.
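
The "linear probes recovered at most 26% of human moral vector variance" figure in the post above is a variance-explained statement. Here is a minimal sketch of how such a number is typically computed, assuming an ordinary least-squares probe scored by in-sample R². The function name `linear_probe_r2` and the synthetic data are hypothetical, and the paper's actual probing setup may well differ.

```python
import numpy as np


def linear_probe_r2(X, y):
    """Fit a least-squares probe y ~ X w + b and return in-sample R^2."""
    # Append a column of ones so the probe can learn an intercept.
    X1 = np.hstack([X, np.ones((X.shape[0], 1))])
    w, *_ = np.linalg.lstsq(X1, y, rcond=None)
    residual = y - X1 @ w
    ss_res = float(np.sum(residual ** 2))
    ss_tot = float(np.sum((y - y.mean()) ** 2))
    return 1.0 - ss_res / ss_tot


# Synthetic stand-ins: X plays the role of model representations,
# y the role of one coordinate of a human-derived target vector.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 16))
true_w = rng.normal(size=16)
y_linear = X @ true_w + 0.5                              # fully linear target
y_noisy = X @ true_w + rng.normal(scale=10.0, size=500)  # mostly noise

r2_linear = linear_probe_r2(X, y_linear)
r2_noisy = linear_probe_r2(X, y_noisy)
```

On this reading, an R² of 0.26 would mean the probe accounts for about a quarter of the variance in the target, with the rest not linearly recoverable from the representations. Held-out rather than in-sample scoring would be the more careful choice in practice.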

Ausτin McCaffrey@Austin_Aligned·24 Mar

@macrocrux @KeithSingery @MacrocosmosAI @WSquires clawdbot before it was cool

crux@macrocrux·24 Mar

Happy birthday @MacrocosmosAI Long live the cosmos!

Ausτin McCaffrey reposted

Aurelius@AureliusAligned·24 Mar

Alignment depends not only on ethical frameworks and incentives, but on rigorous evaluation of how intelligent systems behave. Week by week, we’re introducing the people helping shape how Aurelius approaches that challenge. Today: Dr. Roland Aydin, Alignment Research Advisor

Ausτin McCaffrey reposted

Aurelius@AureliusAligned·20 Mar

𝐒𝐭𝐚𝐭𝐞 𝐨𝐟 𝐀𝐮𝐫𝐞𝐥𝐢𝐮𝐬 - 𝐌𝐚𝐫𝐜𝐡 𝟐𝟎𝟐𝟔
𝐒𝐮𝐛𝐧𝐞𝐭 𝐑𝐚𝐧𝐤𝐢𝐧𝐠𝐬
Aurelius has climbed from rank 95 to rank 65 in the Bittensor subnet rankings. The move reflects steady improvements to our incentive mechanism and growing miner participation as the protocol matures.
𝐌𝐨𝐫𝐚𝐥 𝐑𝐞𝐚𝐬𝐨𝐧𝐢𝐧𝐠 𝐄𝐱𝐩𝐞𝐫𝐢𝐦𝐞𝐧𝐭 𝐄𝐧𝐝𝐢𝐧𝐠
The moral reasoning experiment, which has been live for several weeks, will be ending today. We want to thank our miners who have submitted thousands of structured moral dilemmas over the course of the run, and also our validators, who evaluated each submission against quality criteria. We are now winding down the experiment to shift focus toward the v1 protocol release (more below).
𝐅𝐢𝐧𝐞-𝐓𝐮𝐧𝐢𝐧𝐠 𝐏𝐫𝐞𝐩𝐚𝐫𝐚𝐭𝐢𝐨𝐧
We are preparing to run a fine-tuning experiment using MoReBench, a benchmark of 1,000 moral scenarios developed by 50+ PhDs in moral philosophy. The process: miners generate aenes (alignment-relevant experiential narratives extracted from multi-agent moral reasoning simulations, where AI agents with different values navigate genuine ethical dilemmas), those aenes are compiled into a training dataset, and that dataset is used to fine-tune a language model. We then measure whether the fine-tuned model scores higher on MoReBench's reasoning rubrics than the base model. If it does, that is direct evidence that experiential alignment data improves moral reasoning capacity.
𝐏𝐫𝐨𝐭𝐨𝐜𝐨𝐥 𝐑𝐞𝐥𝐞𝐚𝐬𝐞 𝐓𝐢𝐦𝐞𝐥𝐢𝐧𝐞
The Aurelius v1 release is scheduled for this quarter, pending the results of the fine-tuning experiments. We have a detailed technical implementation plan built on a fork of DeepMind's Concordia framework (github.com/google-deepmin…), an open-source library for multi-agent social simulations. Concordia provides the environment where agents with distinct ethical frameworks interact, disagree, and reason through moral dilemmas. If the fine-tuning results validate the thesis, v1 ships with a complete pipeline from scenario generation through training data production.
𝐀𝐠𝐞𝐧𝐭-𝐀𝐬𝐬𝐢𝐬𝐭𝐞𝐝 𝐃𝐞𝐯𝐞𝐥𝐨𝐩𝐦𝐞𝐧𝐭
Multiple AI agents now work alongside the team to accelerate alignment research and protocol development. These agents assist with research synthesis, protocol analysis, and engineering tasks, giving the team more bandwidth for experiment design and strategic decisions.
𝐀𝐝𝐯𝐢𝐬𝐨𝐫 𝐄𝐧𝐠𝐚𝐠𝐞𝐦𝐞𝐧𝐭
We continue to hold discussions with our AI alignment advisors, Dr. Robert West (Associate Professor, EPFL) and Dr. Roland Aydin (Assistant Professor, Hamburg University of Technology), about running alignment experiments on the Aurelius protocol. Both co-authored "From Model Training to Model Raising," the paper that provides much of the theoretical foundation Aurelius is built on. Their plan to run independent experiments on the protocol after v1 launches represents a significant external validation milestone.

Ausτin McCaffrey@Austin_Aligned·18 Mar

@TaoOutsider @EdwardMcDonne12 Yes, we definitely have been, but it's almost time to emerge from our cave and share with the world haha

Tao Outsider@TaoOutsider·18 Mar

@Austin_Aligned @EdwardMcDonne12 @EdwardMcDonne12 Enough? My insights: sounds like Aurelius is building quietly.

Tao Outsider@TaoOutsider·17 Mar

Some tokens could maybe fit as subnets in the $TAO Bittensor ecosystem. I need to say this upfront. We don’t like dual tokens. Clean up the mess before you start buying your slots. $RENDER $ICP $NEAR $QUBIC $VIRTUAL $COOKIE $FET (not fully convinced here) Am I missing anyone?

Ausτin McCaffrey@Austin_Aligned·18 Mar

@TaoOutsider @EdwardMcDonne12 Yes sir. github.com/Aurelius-Proto…

Tao Outsider@TaoOutsider·17 Mar

@Austin_Aligned @EdwardMcDonne12 We’ll take a look at the whitepaper. This transition is already explained there, right? Thanks for the quick answer.
