Pinned Tweet
prath_it_is
427 posts

prath_it_is
@prathyusha2002
(She/Her) | I tweet about development 👾, and some personal anecdotes on how my brain struggles to understand the world 🧠
Austin, TX · Joined May 2015
421 Following · 137 Followers
prath_it_is retweeted

@catalinmpit @warpdotdev I use it! Except for higher RAM usage, everything else is amazing; autocorrect and autofill especially are >>>

Wondering why more people don't give @warpdotdev a go.
From my pov, it has the best interface for AI coding of all the terminal AI tools. I mean, look how nice it looks!
The performance is also really good, and it follows instructions pretty well.


@dakshgup I just applied; I'm very interested in the Generalist SWE role!

@rankdim When Llama or other models learned these capabilities via SFT and were then trained with RLVR, their performance increased too.

@rankdim Until Qwen models emerged, no other models worked well with RLVR.
Qwen inherently had capabilities like verification and backtracking, which let RLVR increase the likelihood of these capabilities.

@aishwarya_2x21 Lagrangian is basically gradient descent, good luck though

finally done with trpo theory and math. 4 major chunks:
1. taylor expansion
2. fisher information matrix
3. linearize the objective
4. lagrangian derivation (KKT)
full math and intuition notes - share.note.sx/g8vnoz06
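The four chunks above fit together into a single update rule: linearize the objective to get a gradient g, quadratically approximate the KL constraint with the Fisher matrix F, and solve the resulting constrained problem via the Lagrangian/KKT conditions. A toy numpy sketch of that final step (my own illustration of the standard TRPO derivation, not taken from the linked notes):

```python
import numpy as np

# TRPO trust-region subproblem after the approximations above:
#   maximize  g^T s   subject to  0.5 * s^T F s <= delta
# The KKT conditions give the closed form
#   s* = sqrt(2*delta / (g^T F^-1 g)) * F^-1 g
def trpo_step(g, F, delta):
    """Solve the linearized, KL-constrained step via the KKT solution."""
    Finv_g = np.linalg.solve(F, g)               # natural gradient direction
    beta = np.sqrt(2.0 * delta / (g @ Finv_g))   # Lagrange-multiplier scaling
    return beta * Finv_g

g = np.array([1.0, 2.0])                  # gradient of the surrogate objective
F = np.array([[2.0, 0.0], [0.0, 1.0]])    # toy Fisher information matrix
delta = 0.01                              # KL trust-region radius
s = trpo_step(g, F, delta)
print(0.5 * s @ F @ s)                    # lands exactly on the KL boundary
```

The step direction is just the natural gradient F⁻¹g; the KKT multiplier only rescales it so the quadratic KL estimate equals delta.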

@alec_helbling Meanwhile in RL where they use it as some sort of regularizer
prath_it_is retweeted

OpenAI's OSS model possible breakdown:
1. 120B MoE 5B active + 20B text only
2. Trained with Float4 maybe Blackwell chips
3. SwiGLU clip (-7,7) like ReLU6
4. 128K context via YaRN from 4K
5. Sliding window 128 + attention sinks
6. Llama/Mixtral arch + biases
Details:
1. 120B MoE 5B active + 20B text only
Most likely 2 models will be released as per x.com/apples_jimmy/s… : a 120B MoE with 5B/6B active, and probably a 20B dense model (or MoE).
Not multimodal most likely, just text for now.
2. Trained with Float4 maybe Blackwell chips
The MoE MLP layers probably have merged up/down projections, with 8-bit scaling factors and float4 weights. Most likely trained on Blackwell chips, since they support float4. Or maybe PTQ to float4.
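To make the "float4 weights with 8-bit scaling factors" speculation concrete, here is a hedged simulation of blockwise 4-bit float quantization. The block size, scale format, and value grid are my assumptions; the grid below is the set of non-negative magnitudes an e2m1-style fp4 format can represent:

```python
import numpy as np

# Hedged sketch: simulate blockwise fp4 (e2m1-style) quantization with a
# per-block scale factor. Representable magnitudes: 0, 0.5, 1, 1.5, 2, 3, 4, 6.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_fp4(w, block=32):
    """Round weights to the nearest fp4 grid value within each scaled block."""
    w = w.reshape(-1, block)
    scale = np.abs(w).max(axis=1, keepdims=True) / FP4_GRID[-1]
    scale[scale == 0] = 1.0                      # avoid divide-by-zero blocks
    normed = w / scale                           # now inside [-6, 6]
    idx = np.abs(np.abs(normed)[..., None] - FP4_GRID).argmin(-1)
    deq = np.sign(normed) * FP4_GRID[idx] * scale
    return deq.reshape(-1)

w = np.random.default_rng(0).normal(size=64).astype(np.float32)
w_q = quantize_fp4(w)
print(np.abs(w - w_q).max())   # error bounded by half the widest grid gap
```

The per-block scale is what keeps the coarse 8-value magnitude grid usable; without it, a single outlier would crush the resolution of the whole tensor.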
3. SwiGLU clip (-7,7) like ReLU6
Clips SwiGLU to -7 and 7 to reduce outliers and aid float4 quantization. Normally -6 to 6 is good for float4's range, but -7 and 7 is ok as well.
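A minimal sketch of what a clipped SwiGLU would look like, assuming the clip is applied to the gated product the way ReLU6 caps ReLU at 6 (the exact placement of the clip in the real model is speculation):

```python
import numpy as np

def silu(x):
    """SiLU / swish activation used inside SwiGLU."""
    return x / (1.0 + np.exp(-x))

def clipped_swiglu(x, W_gate, W_up, lo=-7.0, hi=7.0):
    """Standard SwiGLU gate, then clip outliers into an fp4-friendly range."""
    gated = silu(x @ W_gate) * (x @ W_up)
    return np.clip(gated, lo, hi)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
W_gate, W_up = rng.normal(size=(8, 16)), rng.normal(size=(8, 16))
y = clipped_swiglu(x, W_gate, W_up)
print(y.min(), y.max())   # guaranteed within [-7, 7]
```

Bounding the activation range like this matters for low-bit quantization: fp4's largest magnitude is 6, so keeping activations near that range avoids blowing up the per-block scale factors.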
4. 128K context via YaRN from 4K
Native 128K context, extended via YaRN from 4K. Long-context extension was probably done during mid-training.
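A hedged, simplified sketch of the YaRN-style RoPE frequency ramp for a 4K → 128K extension. Real YaRN also rescales the attention temperature; this only illustrates the per-dimension interpolation, and the `low`/`high` ramp thresholds are assumed defaults:

```python
import numpy as np

# Simplified YaRN ramp: fast-rotating (high-frequency) RoPE dims are kept
# as-is, slow-rotating dims are position-interpolated by the context scale,
# and dims in between are blended.
def yarn_inv_freq(dim=64, base=10000.0, old_ctx=4096, new_ctx=131072,
                  low=1.0, high=32.0):
    scale = new_ctx / old_ctx                       # 32x context extension
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)
    rotations = old_ctx * inv_freq / (2 * np.pi)    # full turns per dim at 4K
    gamma = np.clip((rotations - low) / (high - low), 0.0, 1.0)
    # gamma=1: keep the original frequency; gamma=0: interpolate by 1/scale
    return gamma * inv_freq + (1.0 - gamma) * (inv_freq / scale)

f = yarn_inv_freq()
print(f[0])    # fastest dim: unchanged
print(f[-1])   # slowest dim: divided by the 32x scale
```

The intuition: high-frequency dims encode local order and would be destroyed by interpolation, while low-frequency dims never completed a full rotation in 4K tokens, so stretching them is safe.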
5. Sliding window 128 + attention sinks
SWA of 128 was used, but to counteract the sliding window forgetting past info, attention sinks as in arxiv.org/abs/2309.17453 were used. Maybe 4 or 8 sink vectors are used. TensorRT-LLM supports the flag "sink_token_length" for attention sinks: nvidia.github.io/TensorRT-LLM/a…
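The combination is easiest to see as an attention mask: causal, limited to a 128-token window, but with a handful of sink positions that every query can always attend to. A small sketch (4 sinks assumed, matching the StreamingLLM-style guess above):

```python
import numpy as np

# Causal mask combining a 128-token sliding window with "attention sink"
# positions that remain visible to every query, per the StreamingLLM idea.
def swa_sink_mask(seq_len, window=128, n_sinks=4):
    """Boolean mask: True where query position q may attend to key position k."""
    q = np.arange(seq_len)[:, None]
    k = np.arange(seq_len)[None, :]
    causal = k <= q                 # no attending to the future
    in_window = (q - k) < window    # only the last `window` tokens
    is_sink = k < n_sinks           # sinks are always visible
    return causal & (in_window | is_sink)

m = swa_sink_mask(seq_len=300, window=128, n_sinks=4)
print(m[299, :4].all())   # True: sinks stay visible at any distance
print(m[299, 100])        # False: outside the window and not a sink
```

Without the sink columns, early tokens (which soak up a lot of attention mass) would fall out of the window and destabilize long-sequence generation; the sinks give that mass somewhere to go.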
6. Llama/Mixtral arch + biases
Merged QKV and MLP projections, and it seems biases are used on all modules. The MoE router has a bias as well.
We discussed in @AiEleuther discord here: discord.com/channels/72974…
Credits to @apples_jimmy , @secemp9 and others in the Discord server for the discussions!

secemp @secemp9
openai accidentally leaking weights live on HF