
1/8 Training draft models for speculative decoding almost always relies on KL divergence, a proxy objective that typically converges to suboptimal drafts when capacity is limited. We introduce LK losses: training objectives that directly target acceptance rate instead. We show consistent gains across 4 architectures and 6 target models (8B → 685B), up to 8-10% longer acceptance lengths, with zero added overhead. arxiv.org/abs/2602.23881 🧵
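Context for why KL is only a proxy: under standard speculative sampling, a draft token is accepted with probability Σ_v min(p_v, q_v) (target p, draft q), and KL(p‖q) only approximates that objective. A minimal Python sketch with made-up distributions (this is not the paper's LK loss, just an illustration that KL can rank drafts differently than acceptance does):

```python
import math

def kl_divergence(p, q):
    """KL(p || q) over a shared vocabulary (lists of probabilities)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def acceptance_rate(p, q):
    """Per-token acceptance probability of standard speculative
    sampling: E_{x~q}[min(1, p(x)/q(x))] = sum_v min(p_v, q_v)."""
    return sum(min(pi, qi) for pi, qi in zip(p, q))

# Target distribution and two hypothetical draft distributions:
p  = [0.45, 0.45, 0.10]
q1 = [0.40, 0.40, 0.20]  # preferred by KL...
q2 = [0.50, 0.49, 0.01]  # ...but q2 gets more tokens accepted

# q1 has lower KL to p, yet q2 has the higher acceptance rate,
# so minimizing KL alone picks the worse draft here.
print(kl_divergence(p, q1), acceptance_rate(p, q1))
print(kl_divergence(p, q2), acceptance_rate(p, q2))
```

A loss that scores drafts by acceptance directly would prefer q2 in this toy case; KL prefers q1 because it heavily penalizes q being small where p has mass.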
















