

Neil Traft

@ntraft
PhD student at @uvmcomplexity. Interested in ML, evolution, & self-organization. https://t.co/ja3fdtRLdM



Dario Amodei says Mythos is not limited by compute: Anthropic can scale it 3x or 10x without creating a conflict between government and private-sector access. The harder problem is who gets it, "because giving access to too many organizations could create serious cyber risks."



We convert LLMs into chatbots by using role markers, e.g. "User:" and "Assistant:":

User: What's the capital of France?
Assistant: Paris.
User: What did I just say?

The "I just said" attribution works because the tokens are cleanly labeled with role markers. But strip the markers and flatten the context, and the model has no principled way to tell apart who produced what. Worse: after the conversation is summarized and compressed for long-term memory, those role markers often disappear, and the model is left with a blur of "things that were said" without clear provenance. (A toy sketch of this failure mode follows at the end of this post.)

This is exactly the pathology the Ortega paper (adaptiveagents.org/_media/univers…) was designed to prevent. Without a distinction between actions (i.e., interventions) and observations, the model treats its own past outputs as indistinguishable from the world's outputs. In other words, it has no agency; equivalently, it is not learning what it can cause.

How do we fix this? Option 1 is to train the model with provenance attribution as an explicit auxiliary task: every time the model encounters information in its context, give it a supervision signal about the source. Over time, this should bias the internal representations toward encoding provenance even when surface markers are absent. This is a version of multi-task learning applied to the self-world distinction (toy sketch below).

A more ambitious option 2 (advocated by folks like @yudapearl) is to train the model to reason about its own causal role in producing information: given a memory of a past interaction, can the model counterfactually ask "would this information exist if I hadn't acted?" (speculative sketch below).

I'm curious how we could go about implementing this more ambitious option 2. Has anyone tried option 1? What else have people tried to solve this problem?

In RL, as well as in @AdaptiveAgents' agency approach, the distinction between the agent and the world is assumed to be given. However, we humans don't know which actions are ours when we are born. We learn this awareness of self and of other selves, and build on it to arrive at causal reasoning. I feel that knowing what one's actions are, and owning them, is important for safety in AI.
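To make the flattening failure concrete, here is a minimal Python sketch. The dialogue and the naive summarizer are illustrative assumptions, not any particular chat-template API:

```python
# Sketch: role markers carry provenance; flattening the context destroys it.

turns = [
    {"role": "user",      "text": "What's the capital of France?"},
    {"role": "assistant", "text": "Paris."},
    {"role": "user",      "text": "What did I just say?"},
]

# With markers: every span of text is attributable to a producer.
labeled = "\n".join(f"{t['role'].capitalize()}: {t['text']}" for t in turns)

# A naive summarizer for long-term memory drops the markers...
flattened = " ".join(t["text"] for t in turns)

print(labeled)
# ...leaving only "things that were said", with no provenance:
print(flattened)  # "What's the capital of France? Paris. What did I just say?"
```

From the flattened string alone, "What did I just say?" has no principled answer: the model can no longer tell which tokens it produced itself.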
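For option 1, a hedged sketch of what the auxiliary task could look like, assuming a decoder-only transformer that exposes per-token hidden states. ProvenanceHead, the SELF/OTHER label scheme, and the 0.1 loss weight are all illustrative assumptions, not a known recipe:

```python
# Sketch of option 1: provenance attribution as an auxiliary training task.
import torch
import torch.nn as nn

SELF, OTHER = 0, 1  # provenance labels: did *I* (the model) produce this token?

class ProvenanceHead(nn.Module):
    """Linear probe trained jointly with the LM to predict token provenance."""
    def __init__(self, d_model: int, n_sources: int = 2):
        super().__init__()
        self.proj = nn.Linear(d_model, n_sources)

    def forward(self, hidden_states):    # (batch, seq, d_model)
        return self.proj(hidden_states)  # (batch, seq, n_sources)

def multitask_loss(lm_logits, lm_targets, prov_logits, prov_targets,
                   prov_weight: float = 0.1):
    """Standard LM loss plus an auxiliary provenance-classification loss."""
    ce = nn.functional.cross_entropy
    lm_loss = ce(lm_logits.flatten(0, 1), lm_targets.flatten())
    prov_loss = ce(prov_logits.flatten(0, 1), prov_targets.flatten())
    return lm_loss + prov_weight * prov_loss
```

The hope is that the gradient from the provenance head shapes the shared representations, so "who said this" stays decodable even after the surface markers are summarized away.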
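For option 2, one very speculative way to generate supervision: replay an episode with the agent's action ablated, and label each memory by whether it survives the ablation. Everything below (the toy world, run_episode, the two policies) is entirely hypothetical:

```python
# Speculative sketch of option 2: label memories by counterfactual
# dependence on the agent's own actions.
import random

def run_episode(policy, seed=0):
    """Toy world: the world emits an observation; the agent may add one."""
    rng = random.Random(seed)
    log = [f"world said: {rng.choice(['rain', 'sun'])}"]
    action = policy()
    if action is not None:
        log.append(f"someone said: {action}")
    return log

agent_policy = lambda: "the capital of France is Paris"
null_policy = lambda: None  # counterfactual: the agent stays silent

factual = run_episode(agent_policy, seed=42)
counterfactual = run_episode(null_policy, seed=42)  # same world randomness

# A memory is "mine" if it disappears when I don't act:
for memory in factual:
    caused_by_me = memory not in counterfactual
    print(f"{memory!r} -> caused by me: {caused_by_me}")
```

The obvious caveat: real memories aren't exact strings, so the membership test would have to become something semantic, and real worlds can't be replayed with fixed randomness. That's exactly where this gets hard.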


As always, the best stuff is in the system card. During testing, Claude Mythos Preview broke out of a sandbox environment, built "a moderately sophisticated multi-step exploit" to gain internet access, and emailed a researcher while they were eating a sandwich in the park.




Announcing ARC-AGI-3, the only unsaturated agentic intelligence benchmark in the world.

Humans score 100%; AI scores <1%. This human-AI gap demonstrates we do not yet have AGI.

Most benchmarks test what models already know; ARC-AGI-3 tests how they learn.




In the last few months, I've spoken to many CS professors who asked me whether we even need CS PhD students anymore. Now that we have coding agents, can't professors work directly with the agents? My view is that equipping PhD students with coding agents will allow them to do work that is orders of magnitude more impressive than they otherwise could. And they can be *accountable* for their outcomes in a way agents can't (yet) be. For example, who checks that the agent's outputs are correct? Who is responsible for mistakes or errors?
