aria /ɔˈreːliəm/

3.1K posts

@ariaurelium

sw infra @arcee_ai, opinions my own

Joined July 2024

293 Following · 996 Followers

Pinned Tweet

aria /ɔˈreːliəm/@ariaurelium·4 Feb

Our increasing dependence of machines for everyday activities has made them indispensable. Therefore many-a-times, if not always, we all tend to go overboard with their usage.

5.6K

aria /ɔˈreːliəm/@ariaurelium·1d

this option has to be my favorite part of torchtitan "you can have your reduce in any dtype as long as it's fp32"

1.6K
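
For context on why a reduction might be pinned to fp32, here is a minimal sketch of accumulation error in a low-precision dtype. It is an illustration only, not torchtitan code: it uses IEEE half precision via Python's `struct` module as a stand-in for bf16, and `round_to` and `accumulate` are hypothetical helper names.

```python
import struct


def round_to(fmt: str, x: float) -> float:
    """Round x to the nearest value representable in the given struct format.

    'e' is IEEE half precision, 'f' is IEEE single precision.
    """
    return struct.unpack(fmt, struct.pack(fmt, x))[0]


def accumulate(values, fmt: str) -> float:
    """Sum values, rounding the running total to `fmt` after every add.

    This mimics a reduction carried out entirely in that dtype.
    """
    total = 0.0
    for v in values:
        total = round_to(fmt, total + v)
    return total


# 10,000 small "gradient" contributions, each stored in half precision.
grads = [round_to("e", 0.01)] * 10_000

sum_fp16 = accumulate(grads, "e")  # reduction in half precision
sum_fp32 = accumulate(grads, "f")  # reduction in single precision

print(f"fp16 reduce: {sum_fp16:.4f}")
print(f"fp32 reduce: {sum_fp32:.4f}")
```

The half-precision total stalls well short of 100 once each addend falls below half a unit in the last place of the running sum, while the single-precision total stays close to 100.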

aria /ɔˈreːliəm/@ariaurelium·1d

@gabriberton the most common model architectures for speculative decoding today piggyback off of target/verifier model state and are therefore pretty wide and shallow. i've seen pretty marginal benefits from anything over 2B, and using some tricks have gotten decent results with <200M models

474

Gabriele Berton@gabriberton·1d

In speculative decoding the drafter just needs to be as fast as possible and its predictions need to match the verifier's. I heard of drafters that are 2-layer 4B models, can anyone confirm?

17.2K
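
The drafter/verifier relationship in this exchange can be sketched with greedy verification. This is a toy sketch under stated assumptions: `speculative_step`, `draft_next`, and `target_next` are hypothetical names, the "models" are plain functions over integer tokens, and real systems verify all drafted positions in one batched forward pass and often use a sampling-based acceptance rule instead of exact match.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]


def speculative_step(
    prefix: List[int],
    draft_next: NextToken,
    target_next: NextToken,
    k: int,
) -> List[int]:
    """Run one draft-then-verify step and return the tokens to append."""
    # 1. The cheap drafter proposes k tokens autoregressively.
    proposal: List[int] = []
    ctx = list(prefix)
    for _ in range(k):
        tok = draft_next(ctx)
        proposal.append(tok)
        ctx.append(tok)

    # 2. The verifier checks each drafted token against its own greedy choice.
    accepted: List[int] = []
    ctx = list(prefix)
    for tok in proposal:
        expected = target_next(ctx)
        if tok != expected:
            # First mismatch: keep the verifier's token and stop.
            accepted.append(expected)
            return accepted
        accepted.append(tok)
        ctx.append(tok)

    # 3. Every draft matched, so the verifier contributes one extra token.
    accepted.append(target_next(ctx))
    return accepted


def target_next(ctx: List[int]) -> int:
    """Toy verifier: count upward modulo 10."""
    return (ctx[-1] + 1) % 10


def draft_next(ctx: List[int]) -> int:
    """Toy drafter: agrees with the verifier except right after token 3."""
    return 0 if ctx[-1] == 3 else (ctx[-1] + 1) % 10


print(speculative_step([0], draft_next, target_next, k=4))
```

The output always equals what the verifier alone would have produced; the drafter only changes how many tokens are gained per verifier call, which is why its speed and agreement rate matter more than its standalone quality.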

aria /ɔˈreːliəm/@ariaurelium·1d

@code_star everyone wanted some principled method of letting LLMs do variable-length sequential computation and the most effective method turned out to be "add an ungraded scratchpad to your response format and RL on tasks you care about"

109

aria /ɔˈreːliəm/@ariaurelium·1d

@code_star yeah I really struggle to believe that increases in performance between generations even at Ant/OAI can be mostly attributed to big conceptual leaps rather than removing engineering bottlenecks on fairly mundane methods

181

Cody Blakeney@code_star·1d

In this way most people with the title of researcher are not doing research. And I think often the ROI from these “researchers” is higher.

François Fleuret@francoisfleuret

IMO a researcher studies a problem that may not be solvable, while an engineer solves a problem that is considered solvable.

2.2K

aria /ɔˈreːliəm/@ariaurelium·1d

@code_star drop the "pre-" and "post-". just "training book", it's cleaner

179

Cody Blakeney@code_star·1d

How on earth was this URL available

Nathan Lambert@natolambert

I heard people needed clarification that my book was a post-training book posttrainingbook.com

16.7K

aria /ɔˈreːliəm/@ariaurelium·2d

codex 5.5 really, really loves creating functions that could never conceivably be used more than once

420

26.7K

aria /ɔˈreːliəm/@ariaurelium·2d

@nrehiew_ reminds me of iirc the vineppo paper, where they checked what the PPO "critic model" was actually doing and found that it was assigning credit seemingly nonsensically

wh@nrehiew_·2d

Interesting. Use an LLM as a judge to filter out tokens to mask during OPSD. Slight improvements over normal OPSD but seems a lot more compute intensive?

Applied Compute@appliedcompute

Some enterprise tasks are challenging to hill-climb with RL-based methods since they involve very out-of-distribution behavior. On-policy self-distillation (OPSD) gives a model learning signal for every token it writes, far richer than the single scalar reward of RL. But that channel is noisy: most tokens don't reflect the behavior you're after. We introduce Relevance-Masked Self-Distillation (RMSD), which uses a two-step filtered loss mask to cut through the noise and find the tokens with the highest signal. Compared to OPSD it trains more stably, provides higher data efficiency, and reaches a higher performance ceiling.

7.9K

aria /ɔˈreːliəm/@ariaurelium·2d

@MangoSweet78 truth nuke

🥭@MangoSweet78·3d

i have two wolves in me.

184

aria /ɔˈreːliəm/ retweeted

shellac@she_llac·3d

ever since i was a little girl i knew i wanted to be a member of technical staff

1.1K

aria /ɔˈreːliəm/@ariaurelium·4d

@jdchawla29 @_ueaj @stochasticchasm not only will it be on-policy, it'll be full-vocab. nothing but the best for him

jellybean ❄️@jdchawla29·4d

@_ueaj @stochasticchasm it’s not on-policy if he doesn’t volunteer for it

224

jellybean ❄️@jdchawla29·4d

when is @stochasticchasm distilling himself

1.7K

aria /ɔˈreːliəm/@ariaurelium·5d

@stochasticchasm routed experts shouting constantly just to be heard

123

stochasm@stochasticchasm·5d

this part is even more crazy. they do moe_output = (routed_output + shared_output)/2 ??? wouldn't this be a really bad init for experts? the model would be so incentivized to use shared expert capacity and the routed experts would need to learn to blow up their activations

stochasm@stochasticchasm

4 shared experts with 8 routed experts active? so 12/132, that's crazy, i wonder why. most papers like Towards Greater Leverage would suggest 1 shared expert or minimal (i think we should decouple shared expert size anyway eventually). also, 128 attention heads with GQA???

6.4K
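
To make the scaling concern in this thread concrete, here is a small arithmetic sketch. It assumes only the combine rule quoted above, `moe_output = (routed_output + shared_output)/2`, and compares it with a plain sum; the vectors and function names are made up for illustration and say nothing about the actual model's initialization.

```python
from typing import List

Vector = List[float]


def combine_avg(routed: Vector, shared: Vector) -> Vector:
    """Combine rule quoted in the thread: elementwise (routed + shared) / 2."""
    return [(r + s) / 2 for r, s in zip(routed, shared)]


def combine_sum(routed: Vector, shared: Vector) -> Vector:
    """Plain elementwise sum, used here only as a point of comparison."""
    return [r + s for r, s in zip(routed, shared)]


# Hypothetical activations, chosen only to show the arithmetic.
routed = [0.2, 0.4, -0.1]
zeros = [0.0, 0.0, 0.0]

# Isolate the routed term by zeroing the shared branch.
routed_term_avg = combine_avg(routed, zeros)
routed_term_sum = combine_sum(routed, zeros)

# Routed activations must double to contribute what they would under a sum.
doubled = [2 * r for r in routed]
routed_term_avg_doubled = combine_avg(doubled, zeros)

print(routed_term_avg, routed_term_sum, routed_term_avg_doubled)
```

Note that the shared term is halved in the same way, so this shows only the overall factor of two, not how the balance between shared and routed experts evolves during training.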

aria /ɔˈreːliəm/@ariaurelium·6d

@kalomaze @1thousandfaces_ do you have people call you kalo irl

kalomaze@kalomaze·6d

@1thousandfaces_ 4 letters is all you need

1.4K

Hero Thousandfaces@1thousandfaces_·6d

ive realized that i prefer being called hero to my given name in basically all circumstances by anyone

234

9.2K

aria /ɔˈreːliəm/@ariaurelium·6d

@1thousandfaces_ somehow it had not even occurred to me that your given name was not "hero"

146

aria /ɔˈreːliəm/@ariaurelium·17 May

@1thousandfaces_ managing to get into a Lesbian Situationship at 11 is crazy

292

Hero Thousandfaces@1thousandfaces_·17 May

I did this with the first girl I loved at age 11 and then at some point turned off all notification settings forever. something poetic about that

☥𝐋𝐞𝐧𝐧𝐨𝐱@fw_lennox1

I had my girl’s number on a completely different vibration and text tone just so I’d know instantly if it was worth fishing my phone out my pocket. Everything else could wait.

118

6.3K

aria /ɔˈreːliəm/@ariaurelium·17 May

a while ago I bought an MX Master S and it is currently torturing me. it is, unfortunately, the most comfortable mouse I have ever used and is also extremely shitty and unreliable

243

aria /ɔˈreːliəm/ retweeted

snow@snowclipsed·15 May

@stochasticchasm disablle dion

630

aria /ɔˈreːliəm/@ariaurelium·15 May

@1thousandfaces_ i can't tell if they're trying to go for "low-volume audio gear created from off-the-shelf parts by a 6 person company in suburban california" or "insane fully-custom audiophile gear milled out of a solid block of aluminum" and it ends up being neither

114

aria /ɔˈreːliəm/@ariaurelium·15 May

@1thousandfaces_ their aesthetic feels like a throwback to discrete gadgets with utilitarian designs but randomly on some products they'll get scared of it looking too plain and produce interfaces like this (right is less bad than the left but I still don't like either tbh)

886

Hero Thousandfaces@1thousandfaces_·15 May

is teenage engineering tasteslop

936

73.2K

aria /ɔˈreːliəm/@ariaurelium·13 May

@stochasticchasm @scaling01 @inductionheads reportedly this "new" Mythos is the Mythos we first heard about, and the old one was a checkpoint from before that

stochasm@stochasticchasm·13 May

@scaling01 @inductionheads oh wow there actually is a mythos 1.1

581

Lisan al Gaib@scaling01·13 May

The new version completely smashes GPT-5.5 and the previous Mythos version. Before, Mythos Preview completed the cyber range 3 out of 10 times. The new version completed it 6 out of 10 times and is much more efficient!

AI Security Institute@AISecurityInst

Our cyber range results illustrate this step-up. Since our first Mythos evaluation, we received access to a newer Mythos Preview checkpoint. On a 32-step corporate network attack we estimate takes a human expert ~20 hours, this checkpoint completes the full attack in 6/10 attempts.

745

287.8K

aria /ɔˈreːliəm/@ariaurelium·11 May

@eliebakouch beyond a certain point having fewer parameters doesn't really help speed but it *is* funny

aria /ɔˈreːliəm/@ariaurelium·11 May

@eliebakouch what the arch does do right is the downprojection on inputs for smaller hidden states. ported it to dflash and am seeing comparable acceptance with ~10x smaller models and nontrivial-but-worse acceptance values down to like 24M trainable params

English

140

elie@eliebakouch·10 May

this is nice but no comparison on speed or acceptance rate vs classical MTP (stepfun flash 3.5 is 1 swa layer + MLP block for instance) or eagle, which would have been useful to see if this is an improvement or not.
also a few inconsistencies: google blog says up to 3x speed, hf readme says 2x, and the generation file mentions num_assistant_tokens=6 which suggests they had more MTP layers initially?
on the arch itself, a few additional details:
> the cluster head thing is only in the smaller model
> the larger variant does KV sharing across all layers + K=V (like the main model), very aggressive imo
> they further reduce the number of attn heads in the small model

George Grigorev@iamgrigorev

Very nice MTP modification for Gemma4
1. Uses 4 decoder layers with 3 SWA + 1 GA
2. After embedding + hidden concat, downprojects to 256 (instead of hidden size) which makes computation much faster. Before LM head, up-projects back to hidden size.
3. Much more efficient LM head in the MTP by selecting clusters of logits!
Definitely worth looking into

10.7K
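
A rough parameter count shows why projecting down to 256 makes the drafter cheap. This is back-of-the-envelope arithmetic under loud assumptions: `HIDDEN = 2048` is a hypothetical main-model width (the thread does not state it), each decoder layer is estimated at 12 × width² weights (4 × width² for attention plus 8 × width² for a 4x-expansion MLP), and biases, norms, and the LM head are ignored. Only the width of 256 and the 4 layers come from the quoted tweet.

```python
def linear_params(d_in: int, d_out: int) -> int:
    """Weight count of a dense projection, ignoring biases."""
    return d_in * d_out


def block_params(width: int) -> int:
    """Rough decoder-layer estimate: 4*w^2 (attention) + 8*w^2 (4x MLP)."""
    return 12 * width * width


HIDDEN = 2048      # hypothetical main-model hidden size (assumption)
DRAFT_WIDTH = 256  # down-projected width mentioned in the thread
N_LAYERS = 4       # decoder layers mentioned in the thread

# Variant A: project the [embedding ; hidden] concat to HIDDEN, run layers there.
full_width = linear_params(2 * HIDDEN, HIDDEN) + N_LAYERS * block_params(HIDDEN)

# Variant B: project down to DRAFT_WIDTH, run layers there, project back up.
down_projected = (
    linear_params(2 * HIDDEN, DRAFT_WIDTH)
    + N_LAYERS * block_params(DRAFT_WIDTH)
    + linear_params(DRAFT_WIDTH, HIDDEN)
)

print(f"full width:     {full_width:,} weights")
print(f"down-projected: {down_projected:,} weights")
print(f"ratio:          {full_width / down_projected:.1f}x")
```

Because per-layer cost grows with the square of the width, most of the saving comes from the layers themselves rather than from the projections.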
