simrat hanspal
@simsimsandy

Exploring LLMs | Data scientist with a curious engineering mind

Bengaluru, India · Joined April 2013
651 Following · 137 Followers
121 posts

Pinned Tweet
simrat hanspal @simsimsandy
My recent blog with @hasgeek - “Decoding Llama3” is out. It’s a deep dive into the Llama3 model code released in April this year. This is a fun blog with a code-first approach. hasgeek.com/simrathanspal/…
simrat hanspal @simsimsandy
🫣 Softmax is unstable with very large and very small numbers. 🤓 Here is a simple illustration of how subtracting the max (x - max) makes softmax numerically stable.
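The trick can be sketched in a few lines of NumPy (my own minimal example, not the code from the screenshots): subtracting the max shifts the largest exponent to exp(0) = 1 so nothing overflows, while the result is mathematically unchanged because the common factor exp(-max) cancels in the ratio.

```python
import numpy as np

def softmax_naive(x):
    # exp overflows to inf for large inputs -> inf/inf = nan
    e = np.exp(x)
    return e / e.sum()

def softmax_stable(x):
    # exp(x - m) / sum(exp(x - m)) == exp(x) / sum(exp(x)),
    # but the largest exponent is now exp(0) = 1
    e = np.exp(x - np.max(x))
    return e / e.sum()

x = np.array([1000.0, 1001.0, 1002.0])
with np.errstate(over="ignore", invalid="ignore"):
    naive = softmax_naive(x)   # overflows, contains nan
stable = softmax_stable(x)     # a proper probability distribution
```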
simrat hanspal @simsimsandy
What does it mean to have dropout in the attention computation? Dropout is used to prevent overfitting. In the case of attention, we drop some attention scores, which means that if the model learnt to attend to some token, it now has to focus on other related tokens. #LLM #Attention
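A minimal NumPy sketch of the idea (illustrative only, not any particular library's implementation): dropout zeroes out some attention weights after the softmax, so the surviving tokens have to carry the signal, and the usual 1/(1-p) rescaling keeps the expected magnitude unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)

def attention_with_dropout(q, k, v, p=0.2):
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                    # (Tq, Tk) similarity scores
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)               # softmax over keys
    # dropout on the attention weights: zero some, rescale the rest
    keep = rng.random(w.shape) >= p
    w = w * keep / (1.0 - p)
    return w @ v

q = rng.standard_normal((4, 8))
k = rng.standard_normal((4, 8))
v = rng.standard_normal((4, 8))
out = attention_with_dropout(q, k, v)
```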
simrat hanspal @simsimsandy
I mean token to token embedding :’D
simrat hanspal @simsimsandy
Simple illustration of what token to word embedding conversion looks like.
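The conversion boils down to a table lookup (a toy sketch with made-up numbers, not the figure's actual values): each token id produced by the tokenizer indexes one row of the embedding matrix.

```python
import numpy as np

vocab_size, d_model = 10, 4
rng = np.random.default_rng(42)
embedding = rng.standard_normal((vocab_size, d_model))  # one row per token id

token_ids = np.array([3, 1, 7])          # e.g. the output of a tokenizer
token_embeddings = embedding[token_ids]  # fancy indexing == table lookup
```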
simrat hanspal @simsimsandy
So, you use len(tokenizer). Not sure why Colab is not recognising len() :D
simrat hanspal @simsimsandy
The tokeniser lies about how many tokens it holds ;) What the tokeniser reports is the size of the base vocabulary that it learnt during training. Everything after that is special tokens. Special tokens are like metadata and help structure the context.
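A toy stand-in for the behaviour in the screenshots (this is not the real Hugging Face tokenizer, just a sketch of the mismatch): a `vocab_size` attribute reports only the base vocabulary learnt during training, while `len(tokenizer)` also counts special tokens added afterwards.

```python
class ToyTokenizer:
    """Mimics the vocab_size vs len() mismatch of real tokenizers."""

    def __init__(self, base_vocab_size):
        self.vocab_size = base_vocab_size  # learnt during training
        self.added_tokens = []             # special tokens live here

    def add_special_tokens(self, tokens):
        self.added_tokens.extend(tokens)

    def __len__(self):
        # the true table size: base vocab + special tokens
        return self.vocab_size + len(self.added_tokens)

tok = ToyTokenizer(base_vocab_size=32000)
tok.add_special_tokens(["<s>", "</s>", "<pad>"])
# tok.vocab_size stays 32000, while len(tok) reflects all 32003 entries
```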
Stutii @Sam0kayy
I mean, what’s stopping you from turning your balcony into this 🤌
simrat hanspal @simsimsandy
Trivial but worth a reminder: use np.matmul for matrix products instead of np.dot. np.dot is meant to be a flexible function that adjusts to the input shape instead of raising an error. Example: np.dot(np.array([[1, 2], [3, 4]]), 10)
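The example from the tweet, spelled out: np.dot silently accepts a scalar and falls back to elementwise scaling, while np.matmul rejects non-matrix operands instead of guessing.

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])

# np.dot with a scalar quietly degrades to scalar multiplication
dot_result = np.dot(A, 10)   # [[10, 20], [30, 40]], no error raised

# np.matmul refuses a scalar operand outright
try:
    np.matmul(A, 10)
    matmul_raised = False
except (TypeError, ValueError):
    matmul_raised = True
```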
simrat hanspal @simsimsandy
@A_K_Nain Thank you so much for the summary 👌 One question: why did they mask the prompt tokens in SFT?
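For context, masking the prompt tokens in SFT usually means setting the prompt positions' labels to an ignore index, so the loss is computed only on the response tokens and the model learns to generate answers rather than to reproduce prompts. A minimal sketch (the -100 ignore index follows the common Hugging Face convention; the token ids here are made up):

```python
IGNORE_INDEX = -100  # positions with this label contribute nothing to the loss

def mask_prompt(input_ids, prompt_len):
    """Copy input_ids into labels, blanking out the prompt positions."""
    labels = list(input_ids)
    for i in range(prompt_len):
        labels[i] = IGNORE_INDEX
    return labels

# prompt = first 4 tokens, response = last 3 (hypothetical ids)
input_ids = [101, 7592, 2088, 102, 2023, 2003, 102]
labels = mask_prompt(input_ids, prompt_len=4)
# only the last 3 positions carry a training signal
```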
Aakash Kumar Nain @A_K_Nain
I went through the Llama-3 technical report (92 pages!). The report is very detailed, and it will be hard to describe everything in a single tweet, but I will try to summarize it in the best possible way. Here we go...

Overview
- Standard dense Transformer with minor changes
- No Mixture of Experts, in favor of stable training
- Similarly, instead of RL in post-training, they use SFT, rejection sampling, and DPO
- Multimodal, combining all three modalities: text, images, and speech. Multimodal models are still under active development
- Multimodal pre-training includes separate encoders for images and speech
- Image encoder pre-trained on image-text pairs, while the speech encoder is pre-trained in a self-supervised fashion with masked inputs reconstructed via a discrete-token representation
- Both pre-trained encoders (image and speech) are connected to the pre-trained LM via vision and speech adapters respectively

Language Model

1. Pre-Training
- 405B parameters
- 15.6T tokens
- Knowledge cutoff end of 2023
- Context window of 8K tokens, increased to 128K in the continued pre-training stage
- Custom parser for HTML that maintains the structure of mathematics and code content
- Deduplication performed at URL level, document level, and line level
- n-gram coverage ratio to remove redundant text, dirty-word counting to remove adult content
- Token-distribution KL divergence to filter documents with excessive outlier tokens
- fasttext-based model to classify documents into 176 languages
- Data ratio: 50% of tokens general knowledge, 25% mathematical and reasoning tokens, 17% code tokens, and 8% multilingual tokens

2. Model Architecture
- Same as Llama and Llama-2 with a few modifications (proving once again that quality data remains the king!)
- GQA with 8 KV heads
- RoPE base frequency increased to 500,000
- Attention mask that prevents self-attention between different documents within the same sequence
- 128K vocab size (100K from tiktoken, 28K additional tokens for languages other than English)
- 126 layers, 128 attention heads, and 16,384 embedding size

3. Infra and Scaling
- Llama 3 405B trained on up to 16K H100 GPUs, each running at 700W TDP with 80GB HBM3, using Meta's Grand Teton AI server platform
- Tectonic (Meta's internal) distributed file system for storage: 240PB, 7,500 servers with SSDs, throughput ranging from 2TB/s to 7TB/s
- RoCE-based AI cluster comprises 24K GPUs connected by a three-layer network
- Parallelism and schedulers optimized for the hardware topology
- Enhanced ECMP routing and deep-buffer switches for congestion control
- 4D parallelism: a combination of tensor parallelism, pipeline parallelism, context parallelism, and data parallelism to shard the model
- In context parallelism, partitions are across the sequence dimension. Based on all-gather, where they gather all keys and values, then compute the attention output for the local query tensor chunk
- Optimized order of parallelism for better network bandwidth and latency: TP, CP, PP, DP
- BF16 MFU of 38%-43%
- Gradient accumulation in FP32. For intermediate tensors that are used in multiple places, like the vision encoder output, the gradients are accumulated in FP32
- 54-day snapshot of pre-training with 467 interruptions. GPU issues account for 58% of the total issues (this is why TPUs are superior)
- During the day, the throughput of GPUs varies by 1-2% because of higher temperature

4. Training Recipe
- Cosine schedule: peak lr 8e-5, 8,000 linear warmup steps, decay to 8e-7 over 1,200,000 training steps
- Initial batch size 4M tokens with sequence length 4,096
- After training for 252M tokens, batch size adjusted to 8M tokens with a sequence length of 8,192. Batch size doubled again to 16M tokens after training on 2.87T tokens
- Context length increased only when model performance on short-context evaluations has recovered completely and the model perfectly solves "needle in a haystack" tasks up to that length
- A total of six stages of gradual increment in context length
- Long-context pre-training done with 800B training tokens

5. Post-training
- SFT followed by DPO
- A mix of annotated and synthetic datasets (mostly generated data)
- First, train a reward model on human-annotated preference data, followed by SFT and DPO
- New capabilities like tool use, with a new multi-message chat protocol that uses various special header and termination tokens

6. Reward Modeling
- Training objective the same except for the removal of the margin term in the loss, because of diminishing returns
- The reward model is used for rejection sampling on human-annotated prompts
- For rejection sampling, K (ranging from 10-30) outputs are sampled
- PagedAttention is used to implement efficient rejection sampling
- Each preference ranking sample has two or three responses, ranked: edited > chosen > rejected

7. Supervised Finetuning
- SFT on rejection-sampled data combined with real data and synthetic data
- Prompt tokens masked (finally!)
- Trained with a lr of 1e-5 for 8.5K to 9K steps

8. Direct Preference Optimization
- Trained on preference data collected using the best-performing models from the previous alignment rounds
- lr = 1e-5 and β = 0.1
- Special tokens like header and termination tokens are masked out to stabilize training. The presence of these tokens, in both the accepted and the rejected responses, leads to a conflicting learning objective
- An additional negative log-likelihood (NLL) loss term with a scaling coefficient of 0.2 is applied to the chosen sequence

----------------------------------------------------

Vision

1. Data
- Image-text and video-text pairs for the image encoder
- Annealing dataset created by resampling the image-caption pairs to a smaller volume of approx 350M examples using n-grams
- Additional 150M examples collected using visual grounding, screenshot parsing, question-answer pairs, synthetic captions, and synthetically generated images from charts, tables, equations, etc., represented via LaTeX or markdown
- Video duration varies from 16-21 seconds, resolution varies from 320p to 4K

2. Model
- Three components: image encoder, image adapter, and video adapter
- Image encoder: ViT-H/14 variant, 630M params, trained on 2.5B image-text pairs for five epochs. Image size is 224x224, split into 16x16 patches
- Multi-layer feature extraction: features from previous layers are injected into the final layer to preserve fine-grained localization information
- 40 transformer blocks with 8 gated attention layers
- Image adapter: GQA attention. The cross-attention layers alone have almost 100B params (wt..😱🫤😵‍💫🫨). Trained in two stages
- Video adapter: 64 frames uniformly sampled from a video, each processed by the image encoder. Temporal information captured via a temporal aggregator (perceiver resampler) and some additional cross-attention layers

3. Pre-training
- For images, started with pre-trained text model and vision encoder weights
- Vision encoder unfrozen, text frozen, trained on 6B image-text pairs with a batch size of 16,384, cosine schedule, lr 10e-4 with weight decay of 0.01
- For video, start from the image pre-trained and annealed weights. Video aggregator trained from scratch while everything else is frozen

4. Post-training
- Data procedures are mostly the same as for text
- Academic datasets, human-annotated and synthetic data
- For quality tuning, a small but highly selective, high-quality dataset is curated. DPO is done on this data to improve response quality and help improve human evaluations

---------------------------------------------------------

Speech

1. Data
- 15M hours of multilingual speech data for pre-training
- ASR training data contains 230K hours of manually transcribed speech recordings spanning 34 languages. AST training data contains 90K hours of translations in two directions (33 languages -> English and English -> 33 languages)
- 25K hours of synthetic data

2. Model
- Speech encoder and speech adapter
- Speech encoder is a Conformer model with 1B params
- Input to the model is an 80-dimensional mel-spectrogram processed by a 4-stride stacking layer followed by a linear projection before passing to the Conformer encoder
- Each Conformer layer has a latent dimension of 1,536 and consists of two Macaron-net style feed-forward networks with dimension 4,096, a convolution module with kernel size 7, and a rotary attention module
- The speech adapter, OTOH, contains about 100M parameters. It is composed of a convolution layer, a rotary Transformer layer, and a linear layer

3. Training
- Pre-training leverages unlabeled data to train the speech encoder
- Self-supervised BEST-RQ algorithm used to pre-train the encoder
- A mask of 32-frame length with a probability of 2.5% is applied to the input mel-spectrogram
- Mel-spectrogram features are quantized by stacking 4 consecutive frames, projecting the 320-dimensional vectors to a 16-dimensional space, and performing a NN search with cosine-similarity metric within a codebook of 8,192 vectors
- 16 different codebooks
- Multi-softmax loss used only on masked frames, for efficiency reasons
- Encoder trained for 500K steps with a global batch size of 2,048 utterances
- In the second stage, SFT is done: the adapter and pre-trained encoder are integrated with the language model and trained jointly with it while the LLM stays frozen

---------------------------------------------------

Results

---------------------------------------------------

Conclusion
Llama-3 is an extremely good model, and the technical report is pure gold. This is the kind of report you expect from people doing serious stuff. Other labs should take note! Also, it is pretty evident that not much is happening on the modeling side. The architecture is pretty much the same (and it still surprises me that we have come this far). In my honest opinion, more than the modeling capabilities, Llama-3 is an engineering marvel. So, if you are competing with a big lab, ask yourself: "Do I have the infra to run things at this scale reliably and efficiently?" Kudos to Zuck and the rest of the Meta team for doing this. You did what we expected others to do from the beginning. Full support! ♥️🍻
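The training-recipe numbers pin down the learning-rate schedule completely. A sketch of that shape (my own implementation of the standard warmup-plus-cosine curve, plugging in the reported constants):

```python
import math

PEAK_LR, MIN_LR = 8e-5, 8e-7
WARMUP_STEPS, TOTAL_STEPS = 8_000, 1_200_000

def lr_at(step):
    if step < WARMUP_STEPS:
        # linear warmup from 0 up to the peak
        return PEAK_LR * step / WARMUP_STEPS
    # cosine decay from the peak down to the floor
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return MIN_LR + 0.5 * (PEAK_LR - MIN_LR) * (1 + math.cos(math.pi * progress))
```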
simrat hanspal retweeted
Bengaluru Systems (fka Bengaluru Systems Meetup)
First, @simsimsandy walked us through GPU architecture, optimizations, CUDA, and the challenges of running large ML models on GPUs, with a special look at the attention mechanism, KV-Cache optimizations, and PagedAttention!
simrat hanspal @simsimsandy
If you are into GenAI, @hasgeek is organizing a call today to build a community on #ResponsibleAI. Join for cross-learning. 🔗 Meeting Link: Register here to confirm your participation - lnkd.in/gFSt8bYc 🕰Time: 7 PM IST Friday, 28 June (tonight)
simrat hanspal retweeted
Tune AI @Tunehq_ai
🚀Join us in Chennai next week for our hands-on workshop: "Building AI Agents with RAG and Functions" 🤖✨ Limited seats available, so hurry and secure your spot! 🏃‍♂️💨 🔗 Register now: lu.ma/6lrnyo1b #AIWorkshop #ChennaiEvents #llm #genai