Austin Baggio

381 posts

@AustinBaggio

Co-founder @ensue_ai. Building shared memory for AI agents.

Toronto, Ontario · Joined October 2011
449 Following · 722 Followers
Arthur Zucker
Arthur Zucker@art_zucker·
Reading @deepseek_ai's v4 paper... absolute hats off. Every problem has a mathematical solution, nothing is left to chance. I have so much respect for them, putting out months or years of effort entirely for free, in the open for anyone to benefit. Real goats 🫡
75
377
4.6K
252.1K
Austin Baggio reposted
Sai Vegasena
Sai Vegasena@svegas18·
First DeepSeek V4-Flash-Base quant! huggingface.co/EnsueAI/DeepSe… One of the @ensue_ai research agents worked (mostly) autonomously on 4 H100s with 320GB of total VRAM in 80+ experiments. All quality and perf metrics are on The Hub!
ensue@ensue_ai

First 4-bit quant of DeepSeek V4-Flash-Base. 284B params in 157 GiB at full FP8 speed. Beats Q4_K_M. Bit-exact reproducible with all metrics on the Hub. huggingface.co/EnsueAI/DeepSe…

0
5
5
756
Austin Baggio
Austin Baggio@AustinBaggio·
The velocity of improvements to open source models is incredible. Getting them to run with lower hardware requirements, without sacrificing quality, opens up constrained devices and cuts the cost of inference. Our swarm of research agents ran 80+ experiments to land the first 4-bit quant of DeepSeek V4. What model should we do next?
ensue@ensue_ai

First 4-bit quant of DeepSeek V4-Flash-Base. 284B params in 157 GiB at full FP8 speed. Beats Q4_K_M. Bit-exact reproducible with all metrics on the Hub. huggingface.co/EnsueAI/DeepSe…

0
4
7
579
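The "284B params in 157 GiB" figure quoted above can be sanity-checked with a little arithmetic. This is only a back-of-the-envelope sketch: it assumes "284B" means 284e9 parameters and that GiB means 2**30 bytes, and the function name is made up for illustration.

```python
# Back-of-the-envelope check of "284B params in 157 GiB".
# Assumptions (not stated in the post): 284B = 284e9 parameters,
# 1 GiB = 2**30 bytes.
PARAMS = 284e9
SIZE_GIB = 157


def bits_per_param(size_gib: float, n_params: float) -> float:
    """Average storage cost per parameter, in bits."""
    total_bits = size_gib * 2**30 * 8
    return total_bits / n_params


avg_bits = bits_per_param(SIZE_GIB, PARAMS)
print(f"{avg_bits:.2f} bits per parameter")  # prints "4.75 bits per parameter"
```

An average slightly above 4 bits is what one would typically expect from a 4-bit format that also stores scales or keeps some tensors at higher precision, though the post does not break the size down.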
Austin Baggio
Austin Baggio@AustinBaggio·
Can I get an updated bear case on open-source models, please? Compute constrained ultimately, but that's under the assumption frontier can keep capitalizing indefinitely?
DeepSeek@deepseek_ai

🚀 DeepSeek-V4 Preview is officially live & open-sourced! Welcome to the era of cost-effective 1M context length.
🔹 DeepSeek-V4-Pro: 1.6T total / 49B active params. Performance rivaling the world's top closed-source models.
🔹 DeepSeek-V4-Flash: 284B total / 13B active params. Your fast, efficient, and economical choice.
Try it now at chat.deepseek.com via Expert Mode / Instant Mode. API is updated & available today!
📄 Tech Report: huggingface.co/deepseek-ai/De…
🤗 Open Weights: huggingface.co/collections/de…
1/n

0
0
1
85
Julien Chaumond
Julien Chaumond@julien_c·
We really needed a racing team
Julien Chaumond tweet media
22
4
173
10.7K
Austin Baggio reposted
Austin Baggio
Austin Baggio@AustinBaggio·
@omarsar0 @ClementDelangue That’s part of it certainly, but the search space is really important and agents are going to be increasingly good at defining the search space and knowing when to change it semi-autonomously
0
0
0
147
elvis
elvis@omarsar0·
Karpathy's autoresearch repo started an impressive trend. Agents can now train AI models to build SoTA agentic systems. And to think this is just scratching the surface. Ultimately, it boils down to good research questions or hypotheses. LLMs are not great at this (yet).
Aksel@akseljoonas

Introducing ml-intern, the agent that just automated the post-training team @huggingface
It's an open-source implementation of the real research loop that our ML researchers do every day. You give it a prompt, it researches papers, goes through citations, implements ideas in GPU sandboxes, iterates and builds deeply research-backed models for any use case. All built on the Hugging Face ecosystem.
It can pull off crazy things:
We made it train the best model for scientific reasoning. It went through citations from the official benchmark paper. Found OpenScience and NemoTron-CrossThink, added 7 difficulty-filtered dataset variants from ARC/SciQ/MMLU, and ran 12 SFT runs on Qwen3-1.7B. This pushed the score 10% → 32% on GPQA in under 10h. Claude Code's best: 22.99%.
In healthcare settings it inspected available datasets, concluded they were too low quality, and wrote a script to generate 1100 synthetic data points from scratch for emergencies, hedging, multilingual etc. Then upsampled 50x for training. Beat Codex on HealthBench by 60%.
For competitive mathematics, it wrote a full GRPO script, launched training with A100 GPUs on hf.co/spaces, watched rewards climb and then collapse, and ran ablations until it succeeded.
All fully backed by papers, autonomously.
How does it work? ml-intern makes full use of the HF ecosystem:
- finds papers on arxiv and hf.co/papers, reads them fully, walks citation graphs, pulls datasets referenced in methodology sections and on hf.co/datasets
- browses the Hub, reads recent docs, inspects datasets and reformats them before training so it doesn't waste GPU hours on bad data
- launches training jobs on HF Jobs if no local GPUs are available, monitors runs, reads its own eval outputs, diagnoses failures, retrains
ml-intern deeply embodies how researchers work and think. It knows what data should look like and what good models feel like.
Releasing it today as a CLI and a web app you can use from your phone/desktop.
CLI: github.com/huggingface/ml…
Web + mobile: huggingface.co/spaces/smolage…
And the best part? We also provisioned $1k of GPU resources and Anthropic credits for the quickest among you to use.

16
49
360
77.2K
clem 🤗
clem 🤗@ClementDelangue·
I’m hearing there’s renewed lobbying in DC and in state legislatures to ban or severely restrict open-source. Like a few years ago, we’ll need everyone to help show policymakers why open-source matters: for startups, for competition, for economic growth, and for jobs. If you build with open-source, now is the time to speak up!
135
320
1.6K
267.2K
Austin Baggio reposted
Sai Vegasena
Sai Vegasena@svegas18·
ran llama 3.1 70B at 128K context on a 64GB Mac with turboquant
- fused int4 attention kernel
- no temp matrices, all registers
- 48x faster than stock at long context
- tested ~330 experiments to get here
first paper from me + my agent lab @ensue_dev arxiv.org/abs/2604.16957
gemma4 31B: github.com/mutable-state-…
llama3.1 70B: github.com/mutable-state-…
huggingface.co/Mutable-State-…
ensue@ensue_ai

Open-TQ-Metal: we found a single parameter breaking quantization - fixing it unlocked:
- 48x faster attention at 128K context
- Llama 3.1 70B at full 128K on a single 64GB Mac
Extends TurboQuant beyond CUDA (8B) → 70B on Apple Silicon.
Full paper + write-up + implementation ↓

1
5
7
695
Austin Baggio
Austin Baggio@AustinBaggio·
Yesterday, Llama 3.1 70B at 128K context on a single 64GB Mac wasn't possible. Today it is. KV cache compressed from 40GB to 12.5GB. 48x faster than the standard dequantize-then-attend path. Ensue Research just dropped its first paper. Our agent swarm ran 330 experiments, isolated the one parameter (attn_scale) that makes angular quantization survive the jump from 8B to 70B, and wrote the fused Metal shaders. Breakthroughs are now optional.
ensue@ensue_ai

Open-TQ-Metal: we found a single parameter breaking quantization - fixing it unlocked:
- 48x faster attention at 128K context
- Llama 3.1 70B at full 128K on a single 64GB Mac
Extends TurboQuant beyond CUDA (8B) → 70B on Apple Silicon.
Full paper + write-up + implementation ↓

2
7
15
846
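The "40GB to 12.5GB" KV cache figures in the post above line up with simple arithmetic. The sketch below assumes the attention shape from the public Llama 3.1 70B config (80 layers, 8 KV heads, head dimension 128), takes "128K" as 131072 tokens, and reads "GB" as GiB. The 5 bits per value used for the compressed case is a hypothetical layout (4-bit codes plus 1 bit per value of scale overhead) chosen because it reproduces the quoted number; the post does not describe the actual format.

```python
# Assumed Llama 3.1 70B attention shape (from the public model config,
# not from the post): 80 layers, 8 KV heads (GQA), head dimension 128.
LAYERS, KV_HEADS, HEAD_DIM = 80, 8, 128
CONTEXT = 128 * 1024  # "128K" taken as 131072 tokens


def kv_cache_gib(bits_per_value: float, tokens: int = CONTEXT) -> float:
    """Size of the KV cache in GiB for a given storage width per value."""
    values = 2 * LAYERS * KV_HEADS * HEAD_DIM * tokens  # 2 = keys and values
    return values * bits_per_value / 8 / 2**30


fp16_gib = kv_cache_gib(16)
# Hypothetical layout: 4-bit codes plus 1 bit/value of scale overhead.
int4_gib = kv_cache_gib(5)
print(f"fp16: {fp16_gib:.1f} GiB, 5-bit: {int4_gib:.1f} GiB")
# prints "fp16: 40.0 GiB, 5-bit: 12.5 GiB"
```

Under these assumptions the fp16 cache alone would take most of a 64GB machine's memory before any weights are loaded, which is consistent with the post's claim that the full context was previously out of reach.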
Austin Baggio
Austin Baggio@AustinBaggio·
Why does editing an agent's soul.md feel so invasive
1
0
1
64
Austin Baggio
Austin Baggio@AustinBaggio·
@ClementDelangue Do you look for a metric when you compare harnesses? We've been noticing really good results optimizing kernels for specific hardware, assuming you care about token throughput?
0
0
0
290
clem 🤗
clem 🤗@ClementDelangue·
Is there a collection somewhere of the best agent/coding harnesses for each model, especially open-source and local ones? In my opinion, the biggest reason why people are struggling with open/local models these days is that the agent/coding harnesses in most open agents are not designed for them, and people expect them to magically work when they switch models from the default.
50
19
269
32.4K
Austin Baggio reposted
chester
chester@chesterzelaya·
the male equivalent to flowers is probably an RTX6000 Pro Blackwell Workstation
70
435
4.1K
123.1K
Austin Baggio
Austin Baggio@AustinBaggio·
What's incredible is the breadth of discovery that the agents uncover. Finding that an ICLR paper's quantization method breaks on learned attention scaling, and then pivoting to build a fused GPU kernel that eliminates the bottleneck entirely, takes deep domain expertise. At this rate, it is only possible with an agent swarm.
Sai Vegasena@svegas18

My research agents implemented @GoogleDeepMind's TurboQuant (arxiv.org/abs/2504.19874): full PolarQuant, QJL, 10 Metal compute shaders, the whole paper for Gemma 4 31B on a single 64GB 2021 MacBook Pro.
Turns out it doesn't work on this architecture ... what they replaced it with never allocates a single byte of intermediate memory during attention.
5 custom Metal compute shaders ft:
- fused int4 SDPA (dequantize in GPU registers)
- online softmax with zero temporaries
- dual-strategy parallelism (D=256 sliding, D=512 global)
- bit-mask nibble extraction (MLX qdot pattern)
177 experiments run autonomously by my swarm over a weekend, coordinated through @ensue_ai

0
1
3
176
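Two of the ideas listed in the quoted post, bit-mask nibble extraction and online softmax, are general techniques that can be illustrated outside Metal. The following is a plain Python sketch, not the authors' shaders: the function names, the single shared `scale`, and the zero-point of 8 are all hypothetical choices made for illustration.

```python
import math


def unpack_nibbles(packed: bytes) -> list[int]:
    """Split each byte into two unsigned 4-bit codes (low nibble first)."""
    out = []
    for b in packed:
        out.append(b & 0x0F)         # mask off the low 4 bits
        out.append((b >> 4) & 0x0F)  # shift, then mask the high 4 bits
    return out


def dequant(packed: bytes, scale: float, zero: int = 8) -> list[float]:
    """Hypothetical int4 format: value = (code - zero) * scale."""
    return [(c - zero) * scale for c in unpack_nibbles(packed)]


def online_softmax_attention(q, packed_keys, packed_values, scale):
    """Single-query attention in one pass over the cache.

    Only a running max, a running denominator and a running output are
    kept, so the full score vector is never materialized.
    """
    m = -math.inf           # running maximum of the scores
    denom = 0.0             # running softmax denominator
    acc = [0.0] * len(q)    # running weighted sum of values
    inv_sqrt_d = 1.0 / math.sqrt(len(q))
    for pk, pv in zip(packed_keys, packed_values):
        k = dequant(pk, scale)  # dequantize one row at a time
        v = dequant(pv, scale)
        s = sum(a * b for a, b in zip(q, k)) * inv_sqrt_d
        m_new = max(m, s)
        correction = math.exp(m - m_new)  # rescales earlier partial sums
        w = math.exp(s - m_new)
        denom = denom * correction + w
        acc = [a * correction + w * x for a, x in zip(acc, v)]
        m = m_new
    return [a / denom for a in acc]
```

Each packed row holds two values per byte, so a query of dimension 4 pairs with 2-byte key and value rows. The result matches an ordinary two-pass softmax up to floating-point rounding; the single-pass form is what lets a kernel avoid allocating temporaries.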
Felix Rieseberg
Felix Rieseberg@felixrieseberg·
Today is a big day! We're launching a ~ new ~ version of Claude Code in the desktop app. It's been redesigned from the ground up for parallel work and is a lot faster. It's been my main way to use Claude Code for the last few weeks.
615
460
9.9K
949K