
🏆 Big news! UltraData just hit #1 AND #2 on HuggingFace Trending worldwide! 🎉

Released by OpenBMB × @TsinghuaNLP × Modelbest: two massive open-source datasets, now free for everyone.

🔥 Ultra-FineWeb-L3 (web pretraining synthetic data)
→ 600B+ tokens (400B+ English, 200B+ Chinese)
→ Largest open-source Chinese pretraining synthetic dataset to date
→ Built to maximize learnability per token

🔥 UltraData-SFT-2605 (post-training SFT data)
→ China's first open-source 15M+ SFT dataset with both thinking & non-thinking annotations
→ Covers math, code, knowledge & instruction-following
→ Fully traceable data pipeline

🧱 Both are built on the UltraData L0–L4 five-tier data management framework, validated end-to-end on MiniCPM5-1B training.

Free to download now 👇
huggingface.co/datasets/openb…
huggingface.co/datasets/openb…

#OpenSource #LLM #AI #HuggingFace #MiniCPM #UltraData

















