Pinned Tweet
prath_it_is
427 posts

prath_it_is
@prathyusha2002
(She/Her) | I tweet about development 👾, and some personal anecdotes on how my brain struggles to understand the world 🧠
Austin, TX · Joined May 2015
421 Following · 137 Followers
prath_it_is retweeted

@catalinmpit @warpdotdev I use it! Except for higher RAM usage, everything else is amazing; autocorrect and autofill especially are >>>

Wondering why more people don't give @warpdotdev a go.
From my pov, it has the best interface for AI coding of all the terminal AI tools. I mean, look how nice it looks!
The performance is also really good, and it follows instructions pretty well.


@dakshgup I just applied; I'm very interested in the Generalist SWE role!

@rankdim When Llama or other models learned these capabilities via SFT and were then trained with RLVR, their performance increased too.

@rankdim Until Qwen models emerged, no other models worked well with RLVR.
Qwen inherently had capabilities like verification and backtracking, which let RLVR increase the likelihood of these capabilities.

@aishwarya_2x21 Lagrangian is basically gradient descent, good luck though

finally done with trpo theory and math. 4 major chunks:
1. taylor expansion
2. fisher information matrix
3. linearize the objective
4. lagrangian derivation (KKT)
full math and intuition notes - share.note.sx/g8vnoz06
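The four chunks above fit together into a single update rule: linearize the objective to get a gradient g, quadratically approximate the KL constraint with the Fisher matrix F, and solve the resulting constrained problem via the Lagrangian/KKT conditions. A toy numpy sketch of that final step (my own illustration of the standard TRPO derivation, not taken from the linked notes):

```python
import numpy as np

# TRPO trust-region subproblem after the approximations above:
#   maximize  g^T s   subject to  0.5 * s^T F s <= delta
# The KKT conditions give the closed form
#   s* = sqrt(2*delta / (g^T F^-1 g)) * F^-1 g
def trpo_step(g, F, delta):
    """Solve the linearized, KL-constrained step via the KKT solution."""
    Finv_g = np.linalg.solve(F, g)               # natural gradient direction
    beta = np.sqrt(2.0 * delta / (g @ Finv_g))   # Lagrange-multiplier scaling
    return beta * Finv_g

g = np.array([1.0, 2.0])                  # gradient of the surrogate objective
F = np.array([[2.0, 0.0], [0.0, 1.0]])    # toy Fisher information matrix
delta = 0.01                              # KL trust-region radius
s = trpo_step(g, F, delta)
print(0.5 * s @ F @ s)                    # lands exactly on the KL boundary
```

The step direction is just the natural gradient F⁻¹g; the KKT multiplier only rescales it so the quadratic KL estimate equals delta.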

@alec_helbling Meanwhile in RL where they use it as some sort of regularizer
prath_it_is retweeted

OpenAI's OSS model possible breakdown:
1. 120B MoE 5B active + 20B text only
2. Trained with Float4 maybe Blackwell chips
3. SwiGLU clip (-7,7) like ReLU6
4. 128K context via YaRN from 4K
5. Sliding window 128 + attention sinks
6. Llama/Mixtral arch + biases
Details:
1. 120B MoE 5B active + 20B text only
Most likely 2 models will be released as per x.com/apples_jimmy/s… : a 120B MoE with 5B/6B active, and probably a 20B dense model (or MoE).
Not multimodal most likely, just text for now.
2. Trained with Float4 maybe Blackwell chips
The MoE MLP layers probably have merged up/down projections, with 8-bit scaling factors and float4 weights. Most likely trained on Blackwell chips, since they support float4. Or maybe PTQ to float4.
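To make the "float4 weights with 8-bit scaling factors" speculation concrete, here is a hedged simulation of blockwise 4-bit float quantization. The block size, scale format, and value grid are my assumptions; the grid below is the set of non-negative magnitudes an e2m1-style fp4 format can represent:

```python
import numpy as np

# Hedged sketch: simulate blockwise fp4 (e2m1-style) quantization with a
# per-block scale factor. Representable magnitudes: 0, 0.5, 1, 1.5, 2, 3, 4, 6.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_fp4(w, block=32):
    """Round weights to the nearest fp4 grid value within each scaled block."""
    w = w.reshape(-1, block)
    scale = np.abs(w).max(axis=1, keepdims=True) / FP4_GRID[-1]
    scale[scale == 0] = 1.0                      # avoid divide-by-zero blocks
    normed = w / scale                           # now inside [-6, 6]
    idx = np.abs(np.abs(normed)[..., None] - FP4_GRID).argmin(-1)
    deq = np.sign(normed) * FP4_GRID[idx] * scale
    return deq.reshape(-1)

w = np.random.default_rng(0).normal(size=64).astype(np.float32)
w_q = quantize_fp4(w)
print(np.abs(w - w_q).max())   # error bounded by half the widest grid gap
```

The per-block scale is what keeps the coarse 8-value magnitude grid usable; without it, a single outlier would crush the resolution of the whole tensor.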
3. SwiGLU clip (-7,7) like ReLU6
Clips SwiGLU to -7 and 7 to reduce outliers and aid float4 quantization. Normally -6 to 6 is good for float4's range, but -7 and 7 is ok as well.
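A minimal sketch of what a clipped SwiGLU would look like, assuming the clip is applied to the gated product the way ReLU6 caps ReLU at 6 (the exact placement of the clip in the real model is speculation):

```python
import numpy as np

def silu(x):
    """SiLU / swish activation used inside SwiGLU."""
    return x / (1.0 + np.exp(-x))

def clipped_swiglu(x, W_gate, W_up, lo=-7.0, hi=7.0):
    """Standard SwiGLU gate, then clip outliers into an fp4-friendly range."""
    gated = silu(x @ W_gate) * (x @ W_up)
    return np.clip(gated, lo, hi)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
W_gate, W_up = rng.normal(size=(8, 16)), rng.normal(size=(8, 16))
y = clipped_swiglu(x, W_gate, W_up)
print(y.min(), y.max())   # guaranteed within [-7, 7]
```

Bounding the activation range like this matters for low-bit quantization: fp4's largest magnitude is 6, so keeping activations near that range avoids blowing up the per-block scale factors.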
4. 128K context via YaRN from 4K
Native 128K context, extended via YaRN from 4K. Long-context extension was probably done during mid-training.
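A hedged, simplified sketch of the YaRN-style RoPE frequency ramp for a 4K → 128K extension. Real YaRN also rescales the attention temperature; this only illustrates the per-dimension interpolation, and the `low`/`high` ramp thresholds are assumed defaults:

```python
import numpy as np

# Simplified YaRN ramp: fast-rotating (high-frequency) RoPE dims are kept
# as-is, slow-rotating dims are position-interpolated by the context scale,
# and dims in between are blended.
def yarn_inv_freq(dim=64, base=10000.0, old_ctx=4096, new_ctx=131072,
                  low=1.0, high=32.0):
    scale = new_ctx / old_ctx                       # 32x context extension
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)
    rotations = old_ctx * inv_freq / (2 * np.pi)    # full turns per dim at 4K
    gamma = np.clip((rotations - low) / (high - low), 0.0, 1.0)
    # gamma=1: keep the original frequency; gamma=0: interpolate by 1/scale
    return gamma * inv_freq + (1.0 - gamma) * (inv_freq / scale)

f = yarn_inv_freq()
print(f[0])    # fastest dim: unchanged
print(f[-1])   # slowest dim: divided by the 32x scale
```

The intuition: high-frequency dims encode local order and would be destroyed by interpolation, while low-frequency dims never completed a full rotation in 4K tokens, so stretching them is safe.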
5. Sliding window 128 + attention sinks
SWA of 128 was used, but to counteract the sliding window forgetting past info, attention sinks as in arxiv.org/abs/2309.17453 were used. Maybe 4 or 8 sink vectors are used. TensorRT-LLM supports the flag "sink_token_length" for attention sinks: nvidia.github.io/TensorRT-LLM/a…
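The combination is easiest to see as an attention mask: causal, limited to a 128-token window, but with a handful of sink positions that every query can always attend to. A small sketch (4 sinks assumed, matching the StreamingLLM-style guess above):

```python
import numpy as np

# Causal mask combining a 128-token sliding window with "attention sink"
# positions that remain visible to every query, per the StreamingLLM idea.
def swa_sink_mask(seq_len, window=128, n_sinks=4):
    """Boolean mask: True where query position q may attend to key position k."""
    q = np.arange(seq_len)[:, None]
    k = np.arange(seq_len)[None, :]
    causal = k <= q                 # no attending to the future
    in_window = (q - k) < window    # only the last `window` tokens
    is_sink = k < n_sinks           # sinks are always visible
    return causal & (in_window | is_sink)

m = swa_sink_mask(seq_len=300, window=128, n_sinks=4)
print(m[299, :4].all())   # True: sinks stay visible at any distance
print(m[299, 100])        # False: outside the window and not a sink
```

Without the sink columns, early tokens (which soak up a lot of attention mass) would fall out of the window and destabilize long-sequence generation; the sinks give that mass somewhere to go.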
6. Llama/Mixtral arch + biases
Merged QKV and MLP projections, and it seems biases are used on all modules. The MoE router has a bias as well.
We discussed in @AiEleuther discord here: discord.com/channels/72974…
Credits to @apples_jimmy , @secemp9 and others in the Discord server for the discussions!

secemp @secemp9
openai accidentally leaking weights live on HF