aria /ɔˈreːliəm/

3.1K posts


@ariaurelium

sw infra @arcee_ai, opinions my own

Joined July 2024
293 Following · 997 Followers
Pinned tweet
aria /ɔˈreːliəm/ @ariaurelium
Our increasing dependence on machines for everyday activities has made them indispensable. Therefore many-a-times, if not always, we all tend to go overboard with their usage.
0
0
18
5.6K
aria /ɔˈreːliəm/ @ariaurelium
this option has to be my favorite part of torchtitan: "you can have your reduce in any dtype as long as it's fp32"
[image]
0
2
16
1.6K
aria /ɔˈreːliəm/ @ariaurelium
@gabriberton the most common model architectures for speculative decoding today piggyback off of target/verifier model state and are therefore pretty wide and shallow. i've seen pretty marginal benefits from anything over 2B, and using some tricks have gotten decent results with <200M models
0
0
7
475
Gabriele Berton @gabriberton
In speculative decoding the drafter just needs to be as fast as possible and its predictions need to match the verifier's. I heard of drafters that are 2-layer 4B models, can anyone confirm?
7
0
39
17.2K
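For readers following the drafter/verifier exchange above, here is a minimal sketch of one greedy speculative decoding round. It is an illustration under stated assumptions, not any particular library's implementation: `speculative_step`, `toy_drafter`, and `toy_verifier` are hypothetical names, the "models" are toy functions over integer tokens, and acceptance is exact-match greedy rather than the probabilistic acceptance rule some systems use.

```python
from typing import Callable, List

# Hypothetical type: a "model" here is any function mapping a token context to its next token.
NextToken = Callable[[List[int]], int]


def speculative_step(prefix: List[int], draft_next: NextToken,
                     verify_next: NextToken, k: int = 4) -> List[int]:
    """Return the tokens emitted by one draft-then-verify round (greedy matching)."""
    # 1. The cheap drafter proposes k tokens autoregressively.
    ctx = list(prefix)
    proposed = []
    for _ in range(k):
        tok = draft_next(ctx)
        proposed.append(tok)
        ctx.append(tok)

    # 2. The verifier checks each proposal. A real system would score all k
    #    positions in one batched forward pass; this loop is only for readability.
    ctx = list(prefix)
    emitted = []
    for tok in proposed:
        target = verify_next(ctx)
        if tok != target:
            emitted.append(target)  # first mismatch: keep the verifier's token and stop
            return emitted
        emitted.append(tok)
        ctx.append(tok)

    # 3. Every draft token matched, so the verifier's next token comes for free.
    emitted.append(verify_next(ctx))
    return emitted


def toy_verifier(ctx: List[int]) -> int:
    return (ctx[-1] + 1) % 10  # toy target model: count upward modulo 10


def toy_drafter(ctx: List[int]) -> int:
    return 0 if ctx[-1] == 3 else (ctx[-1] + 1) % 10  # disagrees with the verifier after a 3


print(speculative_step([1], toy_drafter, toy_verifier))       # [2, 3, 4]
print(speculative_step([5], toy_drafter, toy_verifier, k=2))  # [6, 7, 8]
```

The sketch makes the trade-off in the thread concrete: each round costs k drafter calls plus one verifier pass, so a drafter only pays off if it is both fast and frequently in agreement with the verifier.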
aria /ɔˈreːliəm/ @ariaurelium
@code_star everyone wanted some principled method of letting LLMs do variable-length sequential computation and the most effective method turned out to be "add an ungraded scratchpad to your response format and RL on tasks you care about"
1
0
3
109
aria /ɔˈreːliəm/ @ariaurelium
@code_star yeah I really struggle to believe that increases in performance between generations even at Ant/OAI can be mostly attributed to big conceptual leaps rather than removing engineering bottlenecks on fairly mundane methods
1
0
3
187
aria /ɔˈreːliəm/ @ariaurelium
codex 5.5 really, really loves creating functions that could never conceivably be used more than once
26
3
420
26.7K
aria /ɔˈreːliəm/ @ariaurelium
@nrehiew_ reminds me of iirc the vineppo paper, where they checked what the PPO "critic model" was actually doing and found that it was assigning credit seemingly nonsensically
0
0
0
85
🥭 @MangoSweet78
i have two wolves in me.
[image]
2
0
18
184
aria /ɔˈreːliəm/ retweeted
shellac @she_llac
ever since i was a little girl i knew i wanted to be a member of technical staff
1
4
83
1.1K
stochasm @stochasticchasm
this part is even more crazy. they do moe_output = (routed_output + shared_output)/2 ??? wouldn't this be a really bad init for experts? the model would be so incentivized to use shared expert capacity and the routed experts would need to learn to blow up their activations
[2 images]
stochasm @stochasticchasm

4 shared experts with 8 routed experts active? so 12/132, that's crazy, i wonder why. most papers like Towards Greater Leverage would suggest 1 shared expert or minimal (i think we should decouple shared expert size anyway eventually). also, 128 attention heads with GQA???
8
1
44
6.4K
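The arithmetic behind the averaging concern above can be sketched in a few lines. This only illustrates the scaling effect of `(routed_output + shared_output)/2` compared with a plain sum; it says nothing about actual training dynamics, and the function names and toy vectors are hypothetical.

```python
from typing import List


def combine_sum(routed: List[float], shared: List[float]) -> List[float]:
    # Plain residual-style sum of the two branches.
    return [r + s for r, s in zip(routed, shared)]


def combine_avg(routed: List[float], shared: List[float]) -> List[float]:
    # The averaging form quoted in the tweet: each branch is scaled by 0.5.
    return [(r + s) / 2 for r, s in zip(routed, shared)]


routed = [2.0, -4.0]   # toy routed-expert output
shared = [0.0, 0.0]    # toy shared-expert output, zeroed to isolate the routed branch

print(combine_sum(routed, shared))   # [2.0, -4.0]
print(combine_avg(routed, shared))   # [1.0, -2.0]

# To contribute as much as it would under a plain sum, the routed branch
# has to double its activations.
doubled = [2 * r for r in routed]
print(combine_avg(doubled, shared))  # [2.0, -4.0]
```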
Hero Thousandfaces @1thousandfaces_
ive realized that i prefer being called hero to my given name in basically all circumstances by anyone
22
0
234
9.2K
aria /ɔˈreːliəm/ @ariaurelium
a while ago I bought an MX Master S and it is currently torturing me. it is, unfortunately, the most comfortable mouse I have ever used and is also extremely shitty and unreliable
0
0
0
244
aria /ɔˈreːliəm/ retweeted
snow @snowclipsed
@stochasticchasm disable dion
3
1
19
631
aria /ɔˈreːliəm/ @ariaurelium
@1thousandfaces_ i can't tell if they're trying to go for "low-volume audio gear created from off-the-shelf parts by a 6 person company in suburban california" or "insane fully-custom audiophile gear milled out of a solid block of aluminum" and it ends up being neither
0
0
5
114
aria /ɔˈreːliəm/ @ariaurelium
@1thousandfaces_ their aesthetic feels like a throwback to discrete gadgets with utilitarian designs, but randomly on some products they'll get scared of it looking too plain and produce interfaces like this (right is less bad than the left but I still don't like either tbh)
[2 images]
3
0
4
886
Hero Thousandfaces @1thousandfaces_
is teenage engineering tasteslop
93
12
936
73.2K
Lisan al Gaib @scaling01
The new version completely smashes GPT-5.5 and the previous Mythos version. Before, Mythos Preview completed the cyber range 3 out of 10 times. The new version completed it 6 out of 10 times and is much more efficient!
[image]
AI Security Institute @AISecurityInst

Our cyber range results illustrate this step-up. Since our first Mythos evaluation, we received access to a newer Mythos Preview checkpoint. On a 32-step corporate network attack we estimate takes a human expert ~20 hours, this checkpoint completes the full attack in 6/10 attempts.
28
57
745
287.8K
aria /ɔˈreːliəm/ @ariaurelium
@eliebakouch beyond a certain point having fewer parameters doesn't really help speed but it *is* funny
0
0
1
39
aria /ɔˈreːliəm/ @ariaurelium
@eliebakouch what the arch does do right is the downprojection on inputs for smaller hidden states. ported it to dflash and am seeing comparable acceptance with ~10x smaller models and nontrivial-but-worse acceptance values down to like 24M trainable params
1
0
1
140
elie @eliebakouch
this is nice but no comparison on speed or acceptance rate vs classical MTP (stepfun flash 3.5 is 1 swa layer + MLP block for instance) or eagle, which would have been useful to see if this is an improvement or not.
also a few inconsistencies: google blog says up to 3x speed, hf readme says 2x, and the generation file mentions num_assistant_tokens=6 which suggests they had more MTP layers initially?
on the arch itself, a few additional details:
> the cluster head thing is only in the smaller model
> the larger variant does KV sharing across all layers + K=V (like the main model), very aggressive imo
> they further reduce the number of attn heads in the small model
[2 images]
George Grigorev @iamgrigorev

Very nice MTP modification for Gemma4
1. Uses 4 decoder layers with 3 SWA + 1 GA
2. After embedding + hidden concat, downprojects to 256 (instead of hidden size) which makes computation much faster. Before LM head, up-projects back to hidden size.
3. Much more efficient LM head in the MTP by selecting clusters of logits!
Definitely worth looking into
6
6
95
10.7K
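As a rough back-of-the-envelope check on why down-projecting to 256 makes the draft head cheaper, the sketch below counts weights for the projections described in the quoted tweet. The hidden size of 2048 is a hypothetical placeholder (the thread does not state the model's hidden size), bias terms are ignored, and `linear_params` is an illustrative helper rather than any library's API. Real attention blocks with GQA are not exactly square, so the per-layer ratio is only indicative.

```python
def linear_params(d_in: int, d_out: int) -> int:
    """Weight count of a bias-free linear layer."""
    return d_in * d_out


HIDDEN = 2048      # hypothetical hidden size, NOT taken from the thread
DRAFT_WIDTH = 256  # down-projected width mentioned in the quoted tweet

# Input side: concat(embedding, hidden) has width 2 * HIDDEN.
full_in = linear_params(2 * HIDDEN, HIDDEN)        # project straight to hidden size
small_in = linear_params(2 * HIDDEN, DRAFT_WIDTH)  # down-project to 256 instead
up_out = linear_params(DRAFT_WIDTH, HIDDEN)        # up-projection before the LM head

# Inside the decoder layers, a square projection scales with width squared.
full_square = linear_params(HIDDEN, HIDDEN)
small_square = linear_params(DRAFT_WIDTH, DRAFT_WIDTH)

print(full_in // small_in)          # 8
print(full_square // small_square)  # 64
print(up_out)                       # 524288
```

Under these assumptions the input projection shrinks linearly with the width ratio while square in-layer projections shrink quadratically, at the cost of one extra up-projection before the LM head.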