EIFY

For some reason I decided to swap out standard dot-product attention for a scaled-RBF kernel. I pretty much expected it to fail to converge or be impossibly slow, but the scaled-RBF attention is getting unexpectedly good results?? 👇
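A minimal sketch of what this swap might look like, assuming "scaled-RBF attention" means a softmax over negative squared Euclidean distances between queries and keys with a 1/√d scaling analogous to standard attention; the post gives no exact formulation, so the kernel and scaling details here are assumptions:

```python
import torch
import torch.nn.functional as F

def scaled_rbf_attention(q, k, v):
    # q, k, v: (batch, heads, seq_len, head_dim)
    d = q.size(-1)
    # Pairwise squared distances via ||q||^2 - 2 q.k + ||k||^2
    sq_dists = (
        q.pow(2).sum(-1, keepdim=True)      # (b, h, n, 1)
        - 2 * (q @ k.transpose(-2, -1))     # (b, h, n, n)
        + k.pow(2).sum(-1).unsqueeze(-2)    # (b, h, 1, n)
    )
    # RBF-style scores: closer query/key pairs get more weight.
    # The 1/sqrt(d) scaling mirrors dot-product attention (assumed).
    scores = -sq_dists / (d ** 0.5)
    return F.softmax(scores, dim=-1) @ v

# Drop-in shape check with random tensors:
q = torch.randn(2, 4, 16, 32)
k = torch.randn(2, 4, 16, 32)
v = torch.randn(2, 4, 16, 32)
out = scaled_rbf_attention(q, k, v)  # (2, 4, 16, 32)
```

Note that per query row the softmax absorbs the constant ||q||² term, so this differs from dot-product attention only by the extra -||k||² penalty on each key, which may explain why it trains at all.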

A picture I picked up somewhere



Nowadays biases are omitted in transformers for simplicity, with the same quality, but it might matter more to keep the affine layers for extra expressivity, since the bias term doesn't affect the Lipschitz constant
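A one-line version of the claim, as a sketch: for an affine map the bias cancels when comparing two inputs, so keeping it adds expressivity without loosening a Lipschitz bound:

```latex
% For f(x) = Wx + b, the bias b cancels in the difference:
\[
\|f(x) - f(y)\| = \|(Wx + b) - (Wy + b)\| = \|W(x - y)\|
\le \|W\|_2 \, \|x - y\|,
\]
% so \mathrm{Lip}(f) = \|W\|_2, independent of b: constraining
% the Lipschitz constant leaves the bias term entirely free.
```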

THAT'S CHINESE🇨🇳🤣 Imagine being so clueless you can't even identify your own history. That's 100% Chinese attire. Stop embarrassing yourselves and learn what your ancestors actually looked like. STUPID CHINESE🤣🤣


A new cryptic lineage popped up in St. Louis a few weeks ago. I've been sampling this sewershed (500k people) twice a week for years, and the first time I see this cryptic lineage, it is already 5 years old and makes up 50% of the sample. 1/
