prath_it_is @prathyusha2002

427 posts

(She/Her) | I tweet about development 👾, and some personal anecdotes on how my brain struggles to understand the world 🧠

Austin, TX · Joined May 2015
421 Following · 137 Followers
Pinned Tweet
prath_it_is @prathyusha2002
I got my first internship and a freelance gig at the same time. Now I don't know if I should celebrate and chill or start working ASAP 😶
prath_it_is @prathyusha2002
the VS Code website is down
prath_it_is retweeted
alex zhang @a1zhang
All the recordings for the @GPU_MODE x @scaleml series are up as a playlist in case you missed it 😁 There's so much value in these ~8 hours of lectures, from proving quantization error bounds on a whiteboard to a deep-dive into GPU warp schedulers! Plz take advantage of it!
prath_it_is @prathyusha2002
@catalinmpit @warpdotdev I use it! Aside from the higher RAM usage, everything is amazing; autocorrect and autofill especially are >>>
Catalin @catalinmpit
Wondering why more people don't give @warpdotdev a go. From my POV, it has the best interface for AI coding of all the terminal AI tools. I mean, look how nice it looks! The performance is also really good, and it follows instructions pretty well.
prath_it_is @prathyusha2002
OK, has anyone done RLVR and actually had it work?
prath_it_is @prathyusha2002
@dakshgup I just applied; I'm very interested in the Generalist SWE role!
Daksh Gupta @dakshgup
we're hiring!
prath_it_is @prathyusha2002
@rankdim When Llama or other models learned these capabilities via SFT and were then trained with RLVR, their performance increased too.
prath_it_is @prathyusha2002
@rankdim Until the Qwen models emerged, no other models worked well with RLVR. Qwen inherently had capabilities like verification and backtracking, which let RLVR raise the likelihood of those behaviors.
rank-1 @rankdim
1. Why didn't RL work before? 2. Why does it work now?
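To make the RLVR mechanism in this thread concrete, here's a minimal sketch of the loop under a toy assumption: the "policy" is just a softmax over a few canned answers and the reward is an exact-match verifier. This is an illustration, not any real library's API; a real setup replaces the logit table with an LLM's token-level log-probs.

```python
import math
import random

# Toy RLVR: sample answers, score each with a programmatic verifier,
# and reinforce the verified-correct ones (REINFORCE with a group baseline).
ANSWERS = [
    "42",
    "41",
    "Let me verify: 6 * 7 = 42, so the answer is 42.",
    "no idea",
]
SOLUTION = "42"
logits = [0.0] * len(ANSWERS)

def softmax(ls):
    m = max(ls)
    exps = [math.exp(l - m) for l in ls]
    z = sum(exps)
    return [e / z for e in exps]

def verify(answer):
    # Verifiable reward: does the sampled answer contain the ground truth?
    return 1.0 if SOLUTION in answer else 0.0

def rlvr_step(num_samples=8, lr=0.5):
    probs = softmax(logits)
    idxs = random.choices(range(len(ANSWERS)), probs, k=num_samples)
    rewards = [verify(ANSWERS[i]) for i in idxs]
    baseline = sum(rewards) / len(rewards)  # group-mean baseline, GRPO-style
    for i, r in zip(idxs, rewards):
        # REINFORCE: the gradient of log softmax is (one-hot - probs).
        adv = r - baseline
        for j in range(len(logits)):
            logits[j] += lr * adv * ((1.0 if j == i else 0.0) - probs[j])

for _ in range(100):
    rlvr_step()
print({a: round(p, 3) for a, p in zip(ANSWERS, softmax(logits))})
```

Note the loop can only shift probability toward answers the policy already samples. If verification-style answers were never sampled by the base model, there would be nothing for the verifier to reward, which is the point of the replies above.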
cinesthetic. @TheCinesthetic
What movie do you consider “perfect”?
Aishwarya @aishwarya_2x21
finally done with TRPO theory and math. 4 major chunks:
1. Taylor expansion
2. Fisher information matrix
3. linearize the objective
4. Lagrangian derivation (KKT)
full math and intuition notes - share.note.sx/g8vnoz06
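For readers following along, the four chunks in that tweet slot together roughly like this (a compressed sketch of the standard TRPO derivation, not the linked notes):

```latex
% (1) + (3): Taylor-expand the objective to first order and the KL
% constraint to second order around theta_old; (2): the Hessian of the
% KL at theta_old is the Fisher information matrix F.
\max_{\theta}\ g^{\top}(\theta - \theta_{\text{old}})
\quad \text{s.t.} \quad
\tfrac{1}{2}\,(\theta - \theta_{\text{old}})^{\top} F\,(\theta - \theta_{\text{old}}) \le \delta,
\qquad g = \nabla_{\theta} J(\theta)\big|_{\theta_{\text{old}}}

% (4): solving the Lagrangian via the KKT conditions yields the
% natural-gradient step:
\theta = \theta_{\text{old}} + \sqrt{\frac{2\delta}{g^{\top} F^{-1} g}}\; F^{-1} g
```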
prath_it_is @prathyusha2002
I once put 1.5 ETH in a MetaMask wallet and was told to just forget about it. So I did. Now it’s worth $6.5k… and I don’t have the 12-word phrase 😭
trash @trashh_dev
my parents have no idea what ai is
Alec Helbling @alec_helbling
kids, don't vibe code your evals
prath_it_is @prathyusha2002
Can I please get more eyes on this to validate this claim?
prath_it_is @prathyusha2002
I was today years old when I learned that gradient descent is basically the same as Lagrangian mechanics; shoutout to my high-energy-physics PhD friend for blowing my mind 🤯
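One standard way to make that connection precise (a hedged sketch, not necessarily the friend's exact argument):

```latex
% Heavy-ball gradient descent discretizes damped motion in the potential f:
m\,\ddot{\theta} + \gamma\,\dot{\theta} = -\nabla f(\theta),
% which is the Euler--Lagrange equation of the time-dependent Lagrangian
L(\theta, \dot{\theta}, t)
  = e^{\gamma t / m}\left(\tfrac{m}{2}\,\|\dot{\theta}\|^{2} - f(\theta)\right).
% Plain gradient descent is the overdamped limit m -> 0, i.e. a
% discretization of gradient flow: \dot{\theta} = -\nabla f(\theta).
```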
Alec Helbling @alec_helbling
Why is KL Divergence a more commonly used term in the ML literature than the (in my opinion) much more intuitive “relative entropy”?
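For reference, the two names denote the same quantity:

```latex
D_{\mathrm{KL}}(P \,\|\, Q)
  = \sum_{x} P(x) \log \frac{P(x)}{Q(x)}
  = \underbrace{H(P, Q)}_{\text{cross-entropy}} - \underbrace{H(P)}_{\text{entropy}}
```

That is, the extra bits paid for coding samples from P with a code optimized for Q, which is exactly the "relative entropy" reading.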
prath_it_is retweeted
Daniel Han @danielhanchen
OpenAI's OSS model possible breakdown:
1. 120B MoE, 5B active + 20B text-only
2. Trained with Float4, maybe Blackwell chips
3. SwiGLU clip (-7, 7), like ReLU6
4. 128K context via YaRN from 4K
5. Sliding window 128 + attention sinks
6. Llama/Mixtral arch + biases

Details:

1. 120B MoE, 5B active + 20B text-only
Most likely 2 models will be released, as per x.com/apples_jimmy/s… - a 120B MoE with 5B/6B active and a 20B that's probably dense (or MoE). Most likely not multimodal, just text for now.

2. Trained with Float4, maybe Blackwell chips
The MoE MLP layers are probably merged up/down, with 8-bit scaling factors and float4 weights. Most likely trained on Blackwell chips, since they support float4. Or maybe PTQ to float4.

3. SwiGLU clip (-7, 7), like ReLU6
Clips SwiGLU to -7 and 7 to reduce outliers and aid float4 quantization. Normally -6 to 6 is good for float4's range, but -7 and 7 is OK as well.

4. 128K context via YaRN from 4K
Native 128K context, extended via YaRN from 4K. Long-context extension was probably done during mid-training.

5. Sliding window 128 + attention sinks
SWA of 128 was used, but to counteract SWA forgetting past info, attention sinks as in arxiv.org/abs/2309.17453 were used. Maybe 4/8 vectors are used. TensorRT-LLM supports the flag "sink_token_length" for attention sinks: nvidia.github.io/TensorRT-LLM/a…

6. Llama/Mixtral arch + biases
Merged QKV, merged MLP, and biases on all modules, it seems. The MoE router has a bias as well.

We discussed this in the @AiEleuther discord here: discord.com/channels/72974… Credits to @apples_jimmy, @secemp9 and others in the Discord server for the discussions!
secemp @secemp9
openai accidentally leaking weights live on HF
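A rough sketch of what the clipped SwiGLU in point 3 could look like, assuming the clip is applied to the activation output; this is a guess based on the tweet, not OpenAI's actual code:

```python
import torch
import torch.nn.functional as F

def clipped_swiglu(x, w_gate, w_up, limit=7.0):
    # SwiGLU: silu(x @ W_gate) * (x @ W_up), with the result clamped to
    # [-limit, limit] to suppress outliers, analogous to ReLU6's clip at 6.
    out = F.silu(x @ w_gate) * (x @ w_up)
    return out.clamp(-limit, limit)

x = torch.randn(2, 16)
w_gate, w_up = torch.randn(16, 32), torch.randn(16, 32)
y = clipped_swiglu(x, w_gate, w_up)  # values bounded in [-7, 7]
```

Bounding the activation range this way means float4 quantization (point 2) wastes fewer of its handful of representable values on rare outliers.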
prath_it_is @prathyusha2002
I fine-tuned a model 150 times, but I still don't understand the research problem I want to solve.