
Austin Baggio
381 posts

@AustinBaggio
Co-founder @ensue_ai Building shared memory for AI agents.



First 4-bit quant of DeepSeek V4-Flash-Base. 284B params in 157 GiB at full FP8 speed. Beats Q4_K_M. Bit-exact reproducible with all metrics on the Hub. huggingface.co/EnsueAI/DeepSe…
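A quick sanity check on the size claim above: dividing the stated 157 GiB by the stated 284B parameters gives the average storage cost per weight. This is only back-of-the-envelope arithmetic. It assumes "284B" means exactly 284e9 parameters and that the 157 GiB figure covers every tensor in the checkpoint. The helper name `bits_per_param` is made up for illustration.

```python
GIB = 2**30  # bytes in one gibibyte

def bits_per_param(size_gib: float, n_params: float) -> float:
    """Average number of stored bits per parameter for a checkpoint."""
    return size_gib * GIB * 8 / n_params

# Figures taken from the post: 157 GiB, 284B parameters (assumed to be 284e9).
bpw = bits_per_param(157, 284e9)
print(f"{bpw:.2f} bits per parameter")
```

The result lands somewhat above 4 bits. A plausible reading, not confirmed by the post, is that the remainder goes to quantization scales and to any tensors kept at higher precision.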


🚀 DeepSeek-V4 Preview is officially live & open-sourced! Welcome to the era of cost-effective 1M context length.

🔹 DeepSeek-V4-Pro: 1.6T total / 49B active params. Performance rivaling the world's top closed-source models.
🔹 DeepSeek-V4-Flash: 284B total / 13B active params. Your fast, efficient, and economical choice.

Try it now at chat.deepseek.com via Expert Mode / Instant Mode. API is updated & available today!

📄 Tech Report: huggingface.co/deepseek-ai/De…
🤗 Open Weights: huggingface.co/collections/de…

1/n





Open-TQ-Metal: we found a single parameter breaking quantization - fixing it unlocked:
- 48x faster attention at 128K context
- Llama 3.1 70B at full 128K on a single 64GB Mac

Extends TurboQuant beyond CUDA (8B) → 70B on Apple Silicon.

Full paper + write-up + implementation ↓
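To see why KV-cache quantization matters for "70B at full 128K on a 64GB Mac", here is a rough size estimate. It is a sketch under stated assumptions, not the Open-TQ-Metal implementation: the shape values (80 layers, 8 KV heads, head dimension 128) are taken from the published Llama 3.1 70B config, the 4-bit figure is illustrative because the post does not state the cache bit-width, and per-group scales and zero points are ignored. `kv_cache_bytes` is a hypothetical helper.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, bits_per_element: int) -> int:
    """Bytes needed to hold keys and values for one sequence."""
    elements = 2 * n_layers * n_kv_heads * head_dim * seq_len  # 2 = K and V
    return elements * bits_per_element // 8

GIB = 2**30
SEQ_LEN = 128 * 1024  # 128K tokens

# Assumed Llama 3.1 70B shape: 80 layers, 8 KV heads (GQA), head_dim 128.
fp16 = kv_cache_bytes(80, 8, 128, SEQ_LEN, 16)
int4 = kv_cache_bytes(80, 8, 128, SEQ_LEN, 4)  # illustrative bit-width only

print(f"FP16 KV cache:  {fp16 / GIB:.0f} GiB")
print(f"4-bit KV cache: {int4 / GIB:.0f} GiB (excluding scales)")
```

Under these assumptions the FP16 cache alone would take most of a 64GB machine before any weights are loaded, which is the pressure a quantized cache relieves.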


Introducing ml-intern, the agent that just automated the post-training team @huggingface.

It's an open-source implementation of the real research loop that our ML researchers run every day. You give it a prompt; it researches papers, walks citations, implements ideas in GPU sandboxes, iterates, and builds deeply research-backed models for any use case. All built on the Hugging Face ecosystem.

It can pull off crazy things:

We made it train the best model for scientific reasoning. It went through citations from the official benchmark paper, found OpenScience and NemoTron-CrossThink, added 7 difficulty-filtered dataset variants from ARC/SciQ/MMLU, and ran 12 SFT runs on Qwen3-1.7B. This pushed the GPQA score from 10% to 32% in under 10h. Claude Code's best: 22.99%.

In healthcare settings, it inspected the available datasets, concluded they were too low quality, and wrote a script to generate 1100 synthetic data points from scratch for emergencies, hedging, multilingual etc. Then it upsampled them 50x for training. Beat Codex on HealthBench by 60%.

For competitive mathematics, it wrote a full GRPO script, launched training with A100 GPUs on hf.co/spaces, watched rewards climb and then collapse, and ran ablations until it succeeded.

All fully backed by papers, autonomously.

How does it work? ml-intern makes full use of the HF ecosystem:
- finds papers on arxiv and hf.co/papers, reads them fully, walks citation graphs, pulls datasets referenced in methodology sections and on hf.co/datasets
- browses the Hub, reads recent docs, inspects datasets and reformats them before training so it doesn't waste GPU hours on bad data
- launches training jobs on HF Jobs if no local GPUs are available, monitors runs, reads its own eval outputs, diagnoses failures, retrains

ml-intern deeply embodies how researchers work and think. It knows what data should look like and what good models feel like.

Releasing it today as a CLI and a web app you can use from your phone/desktop.

CLI: github.com/huggingface/ml…
Web + mobile: huggingface.co/spaces/smolage…

And the best part? We also provisioned $1k of GPU resources and Anthropic credits for the quickest among you to use.





My research agents implemented @GoogleDeepMind's TurboQuant (arxiv.org/abs/2504.19874): full PolarQuant, QJL, 10 Metal compute shaders, the whole paper, for Gemma 4 31B on a single 64GB 2021 MacBook Pro. Turns out it doesn't work on this architecture ... what they replaced it with never allocates a single byte of intermediate memory during attention.

5 custom Metal compute shaders featuring:
- fused int4 SDPA (dequantize in GPU registers)
- online softmax with zero temporaries
- dual-strategy parallelism (D=256 sliding, D=512 global)
- bit-mask nibble extraction (MLX qdot pattern)

177 experiments run autonomously by my swarm over a weekend, coordinated through @ensue_ai
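Two of the ideas listed above, bit-mask nibble extraction and online softmax, can be sketched on the CPU in NumPy. This is not the Metal shader code from the post and does not reproduce the MLX qdot layout. It only illustrates the general techniques, assuming unsigned 4-bit values packed low nibble first and a single query vector. All function names are hypothetical.

```python
import numpy as np

def pack_nibbles(q: np.ndarray) -> np.ndarray:
    """Pack unsigned 4-bit values (0..15) two per byte, low nibble first."""
    q = q.astype(np.uint8)
    return (q[..., 0::2] | (q[..., 1::2] << 4)).astype(np.uint8)

def unpack_nibbles(packed: np.ndarray) -> np.ndarray:
    """Recover the 4-bit values using a bit mask and a shift."""
    out = np.empty(packed.shape[:-1] + (packed.shape[-1] * 2,), dtype=np.uint8)
    out[..., 0::2] = packed & 0x0F          # low nibble
    out[..., 1::2] = (packed >> 4) & 0x0F   # high nibble
    return out

def online_softmax_attention(q, keys, values):
    """Single-query attention that streams over keys one at a time.

    Keeps only a running max, a running sum and an output accumulator,
    so the full score vector is never materialized.
    """
    m = -np.inf                       # running max of the scaled scores
    l = 0.0                           # running sum of exp(score - m)
    acc = np.zeros(values.shape[1])   # running weighted sum of values
    scale = 1.0 / np.sqrt(q.shape[0])
    for k, v in zip(keys, values):
        s = float(q @ k) * scale
        m_new = max(m, s)
        correction = np.exp(m - m_new)  # rescales earlier terms; 0 on step one
        p = np.exp(s - m_new)
        l = l * correction + p
        acc = acc * correction + p * v
        m = m_new
    return acc / l

def reference_attention(q, keys, values):
    """Standard two-pass softmax attention, used only for comparison."""
    s = keys @ q / np.sqrt(q.shape[0])
    w = np.exp(s - s.max())
    return (w / w.sum()) @ values
```

In a fused kernel the unpacking and the streaming update would happen together in registers. Here they are kept separate so each step is easy to check against the reference.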





