Max Jeblick
@MJeblick
1 post
Joined May 2020
92 Following · 14 Followers
Max Jeblick @MJeblick·
@IntuitMachine Would have been nice if they had released the data to boost the open-source community.
Carlos E. Perez @IntuitMachine·
1/n Apple's Web Rephrase Augmented Pre-training (WRAP) Methodology for Synthetic Training

Here is a guide to using synthetic data for large language model (LLM) pre-training, based on insights from a recent Apple paper:

1. Objective: Train LLMs with an optimal blend of real and synthetic data to enable more efficient and effective pre-training, balancing data redundancy/overfitting, generalization capability, and diversity.

2. Synthetic Data Generation
- Use a medium-sized (3B to 13B) instruction-tuned LLM rather than a base decoder-only LLM
- Prompt templates provide task-specific context to produce data matching target domains or tasks
- Rephrase real documents rather than generating freeform text, to leverage existing information
- Diversify generation styles: simplicity, complexity, formality, conversational
- Keep samples to 100-300 tokens; longer documents lose accuracy
- Filter model generations for prompt adherence

3. Corpus Composition
- Blend rephrased documents with 30-50% original real data
- Balance target-domain representation in the synthetic distribution
- Combining generation styles outperforms any single style
- Continually assess semantic similarity to the original data

4. Pre-training Details
- Train decoder-only transformer architectures
- Batch size of 1 million tokens; learning rate up to 2-3e-4
- 150B tokens across model sizes up to 1.3B parameters
- 3+ epochs on purely synthetic data starts overfitting

5. Evaluation
- Combine perplexity on the target corpus with out-of-distribution (OOD) semantic tasks
- Zero/few-shot QA benchmarks assess generalization

The specific blending ratios of real, simple, complex, and conversational rephrases must be tuned to model scale, end objectives, and the availability of real data. But this framework provides data generation capabilities to programmatically craft training distributions.
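The style-diversified rephrasing in step 2 can be sketched as prompt templates fed to an instruction-tuned model. A minimal sketch: the template wording and the `build_rephrase_prompt` helper are illustrative assumptions, not the paper's exact prompts, and the whitespace split is a crude stand-in for a real tokenizer.

```python
# Illustrative WRAP-style rephrase templates (assumed wording, not the
# paper's exact prompts). Each style targets a different register.
STYLE_PROMPTS = {
    "easy": "Rephrase the following text so a small child could understand it:\n\n{doc}",
    "medium": "Rephrase the following text in high-quality, Wikipedia-like English:\n\n{doc}",
    "hard": "Rephrase the following text in terse, scholarly language:\n\n{doc}",
    "qa": "Rewrite the following text as a conversational question-and-answer exchange:\n\n{doc}",
}

def build_rephrase_prompt(doc: str, style: str, max_words: int = 300) -> str:
    """Build a rephrasing prompt, truncating the source to roughly the
    100-300 token window recommended above (whitespace split as a rough
    token proxy)."""
    truncated = " ".join(doc.split()[:max_words])
    return STYLE_PROMPTS[style].format(doc=truncated)
```

Each prompt would then be sent to a 3B-13B instruction-tuned model; generating one rephrase per style per source document yields the diversified synthetic pool.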
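The prompt-adherence filtering in step 2 (and the ongoing semantic-similarity check in step 3) can be approximated with a cheap lexical heuristic. A sketch under the assumption that word overlap and length ratio are acceptable proxies; a production pipeline would more likely score similarity with an embedding model.

```python
def passes_filter(original: str, rephrase: str,
                  min_overlap: float = 0.2, max_len_ratio: float = 3.0) -> bool:
    """Keep a rephrase only if it shares enough vocabulary with the source
    (a rough adherence/similarity proxy) and did not balloon in length."""
    orig_words = original.lower().split()
    reph_words = rephrase.lower().split()
    if not reph_words:
        return False
    overlap = len(set(orig_words) & set(reph_words)) / max(len(set(orig_words)), 1)
    len_ratio = len(reph_words) / max(len(orig_words), 1)
    return overlap >= min_overlap and len_ratio <= max_len_ratio
```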
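The 30-50% real-data blending in step 3 can be sketched as weighted sampling over a real pool and per-style synthetic pools. The function name and the uniform treatment of styles are assumptions for illustration.

```python
import random

def blend_corpus(real_docs, synthetic_by_style, n_samples,
                 real_fraction=0.4, seed=0):
    """Draw a training mix: real_fraction (30-50% per the guidance above)
    comes from real documents, the rest uniformly across rephrase styles,
    since combining styles outperforms any single one."""
    rng = random.Random(seed)
    styles = list(synthetic_by_style)
    mix = []
    for _ in range(n_samples):
        if rng.random() < real_fraction:
            mix.append(rng.choice(real_docs))
        else:
            mix.append(rng.choice(synthetic_by_style[rng.choice(styles)]))
    return mix
```

Fixing the seed keeps the blend reproducible across pre-training runs, which matters when comparing ratios.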