Austin Baggio

381 posts

@AustinBaggio

Co-founder @ensue_ai. Building shared memory for AI agents.

Toronto, Ontario · Joined October 2011

449 Following · 722 Followers

Austin Baggio @AustinBaggio · Apr 27

@art_zucker @deepseek_ai It's crazy how the time delay between open model SOTA and frontier is continuously shrinking

Arthur Zucker @art_zucker · Apr 27

Reading @deepseek_ai 's v4 paper.... absolute hats off. Every problem has a mathematical solution, nothing is left to chance. I have so much respect for them, putting out months or years of efforts entirely for free, in the open for anyone to benefit. Real goats 🫡

Austin Baggio reposted

Sai Vegasena @svegas18 · Apr 27

First DeepSeek V4-Flash-Base quant! huggingface.co/EnsueAI/DeepSe… One of the @ensue_ai research agents worked (mostly) autonomously on 4 H100s with 320GB of total VRAM in 80+ experiments. All quality and perf metrics are on The Hub!

ensue @ensue_ai

First 4-bit quant of DeepSeek V4-Flash-Base. 284B params in 157 GiB at full FP8 speed. Beats Q4_K_M. Bit-exact reproducible with all metrics on the Hub. huggingface.co/EnsueAI/DeepSe…
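As a rough cross-check of the size claim in the post above, dividing the checkpoint size by the parameter count gives the average storage cost per parameter. This sketch assumes "157 GiB" means 1024³-byte units and says nothing about the actual quantization format; the helper names are hypothetical.

```python
# Back-of-envelope arithmetic only: total size divided by parameter count.
# It makes no assumption about how the quantized format is laid out.

GIB = 1024 ** 3  # bytes in one GiB


def avg_bits_per_param(num_params: float, size_gib: float) -> float:
    """Average storage cost in bits per parameter for a checkpoint of size_gib GiB."""
    return size_gib * GIB * 8 / num_params


def checkpoint_gib(num_params: float, bits_per_param: float) -> float:
    """Checkpoint size in GiB if every parameter cost exactly bits_per_param bits."""
    return num_params * bits_per_param / 8 / GIB


PARAMS = 284e9  # DeepSeek V4-Flash total parameter count, from the posts above

print(f"quoted checkpoint: {avg_bits_per_param(PARAMS, 157):.2f} bits/param")
print(f"flat 8-bit size:   {checkpoint_gib(PARAMS, 8):.1f} GiB")
print(f"flat 4-bit size:   {checkpoint_gib(PARAMS, 4):.1f} GiB")
```

This works out to about 4.75 bits per parameter on average, compared with about 132 GiB for a flat 4 bits and 264 GiB for a flat 8 bits per parameter. The post doesn't say whether the extra ~0.75 bits go to scales, to tensors kept at higher precision, or elsewhere.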

Austin Baggio @AustinBaggio · Apr 27

The velocity of improvements to open source models is incredible. Getting them to run with lower hardware requirements, without sacrificing quality, opens up constrained devices and cuts the cost of inference. Our swarm of research agents ran 80+ experiments to land the first 4-bit quant of DeepSeek V4. What model should we do next?

ensue @ensue_ai

First 4-bit quant of DeepSeek V4-Flash-Base. 284B params in 157 GiB at full FP8 speed. Beats Q4_K_M. Bit-exact reproducible with all metrics on the Hub. huggingface.co/EnsueAI/DeepSe…

Austin Baggio @AustinBaggio · Apr 24

Can I get an updated bear case on OS models, please? Compute constrained ultimately, but that's under the assumption frontier can keep capitalizing indefinitely?

DeepSeek @deepseek_ai

🚀 DeepSeek-V4 Preview is officially live & open-sourced! Welcome to the era of cost-effective 1M context length.
🔹 DeepSeek-V4-Pro: 1.6T total / 49B active params. Performance rivaling the world's top closed-source models.
🔹 DeepSeek-V4-Flash: 284B total / 13B active params. Your fast, efficient, and economical choice.
Try it now at chat.deepseek.com via Expert Mode / Instant Mode. API is updated & available today!
📄 Tech Report: huggingface.co/deepseek-ai/De…
🤗 Open Weights: huggingface.co/collections/de…
1/n

Austin Baggio @AustinBaggio · Apr 24

@julien_c I'll drive

Julien Chaumond @julien_c · Apr 24

We really needed a racing team

Austin Baggio @AustinBaggio · Apr 23

Breakthroughs are optional.

Christine Yip @christinetyip

x.com/i/article/2046…

Austin Baggio @AustinBaggio · Apr 21

@pumatheuma Yeah my exact reaction too

Uma Roy @pumatheuma · Apr 21

First thought: this is insanely good, OpenAI team cooked
Second thought: we are so cooked without the ability to Prove What's Real

OpenAI @OpenAI

Made with ChatGPT Images 2.0

Machine Learning (ML) Papers @Memoirs · Apr 21

Open-TQ-Metal: Fused Compressed-Domain Attention for Long-Context LLM Inference on Apple Silicon
Sai Vegasena
arxiv.org/abs/2604.16957 [cs.LG]
💬Code: github.com/svv232/gemma4m…

Austin Baggio @AustinBaggio · Apr 21

@Memoirs Author @svegas18

Austin Baggio reposted

Christine Yip @christinetyip · Apr 21

Side-effect of doing research with an agent swarm: @svegas18 uncovered a subtle quantization failure mode while optimizing memory efficiency for 70B models. Full paper below.

ensue @ensue_ai

Open-TQ-Metal: we found a single parameter breaking quantization. Fixing it unlocked:
- 48x faster attention at 128K context
- Llama 3.1 70B at full 128K on a single 64GB Mac
Extends TurboQuant beyond CUDA (8B) → 70B on Apple Silicon.
Full paper + write-up + implementation ↓

Austin Baggio @AustinBaggio · Apr 21

@omarsar0 @ClementDelangue That’s part of it certainly, but the search space is really important and agents are going to be increasingly good at defining the search space and knowing when to change it semi-autonomously

elvis @omarsar0 · Apr 21

Karpathy's autoresearch repo started an impressive trend. Agents can now train AI models to build SoTA agentic systems. And to think this is just scratching the surface. Ultimately, it boils down to good research questions or hypotheses. LLMs are not great at this (yet).

Aksel @akseljoonas

Introducing ml-intern, the agent that just automated the post-training team @huggingface

It's an open-source implementation of the real research loop that our ML researchers do every day. You give it a prompt, it researches papers, goes through citations, implements ideas in GPU sandboxes, iterates and builds deeply research-backed models for any use case. All built on the Hugging Face ecosystem.

It can pull off crazy things:

We made it train the best model for scientific reasoning. It went through citations from the official benchmark paper, found OpenScience and NemoTron-CrossThink, added 7 difficulty-filtered dataset variants from ARC/SciQ/MMLU, and ran 12 SFT runs on Qwen3-1.7B. This pushed the score 10% → 32% on GPQA in under 10h. Claude Code's best: 22.99%.

In healthcare settings it inspected available datasets, concluded they were too low quality, and wrote a script to generate 1100 synthetic data points from scratch for emergencies, hedging, multilingual etc. Then upsampled 50x for training. Beat Codex on HealthBench by 60%.

For competitive mathematics, it wrote a full GRPO script, launched training with A100 GPUs on hf.co/spaces, watched rewards climb and then collapse, and ran ablations until it succeeded. All fully backed by papers, autonomously.

How does it work? ml-intern makes full use of the HF ecosystem:
- finds papers on arxiv and hf.co/papers, reads them fully, walks citation graphs, pulls datasets referenced in methodology sections and on hf.co/datasets
- browses the Hub, reads recent docs, inspects datasets and reformats them before training so it doesn't waste GPU hours on bad data
- launches training jobs on HF Jobs if no local GPUs are available, monitors runs, reads its own eval outputs, diagnoses failures, retrains

ml-intern deeply embodies how researchers work and think. It knows what data should look like and what good models feel like.

Releasing it today as a CLI and a web app you can use from your phone/desktop.
CLI: github.com/huggingface/ml…
Web + mobile: huggingface.co/spaces/smolage…

And the best part? We also provisioned $1k in GPU resources and Anthropic credits for the quickest among you to use.

Austin Baggio @AustinBaggio · Apr 21

@ClementDelangue @AvbNear Open. Source. Wins.

clem 🤗 @ClementDelangue · Apr 21

I’m hearing there’s renewed lobbying in DC and in state legislatures to ban or severely restrict open-source. Like a few years ago, we’ll need everyone to help show policymakers why open-source matters: for startups, for competition, for economic growth, and for jobs. If you build with open-source, now is the time to speak up!

Austin Baggio reposted

Sai Vegasena @svegas18 · Apr 21

ran llama 3.1 70B at 128K context on a 64GB Mac with turboquant
- fused int4 attention kernel
- no temp matrices, all registers
- 48x faster than stock at long context
- tested ~330 experiments to get here
first paper from me + my agent lab @ensue_dev
arxiv.org/abs/2604.16957
gemma4 31B: github.com/mutable-state-…
llama3.1 70B: github.com/mutable-state-…
huggingface.co/Mutable-State-…

ensue @ensue_ai

Austin Baggio @AustinBaggio · Apr 21

Yesterday, Llama 3.1 70B at 128K context on a single 64GB Mac wasn't possible. Today it is. KV cache compressed from 40GB to 12.5GB. 48x faster than the standard dequantize-then-attend path. Ensue Research just dropped its first paper. Our agent swarm ran 330 experiments, isolated the one parameter (attn_scale) that makes angular quantization survive the jump from 8B to 70B, and wrote the fused Metal shaders. Breakthroughs are now optional.
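The KV-cache figures in the post above can be sanity-checked with a few lines of arithmetic. This sketch assumes the publicly documented Llama 3.1 70B shape (80 layers, 8 KV heads via grouped-query attention, head dimension 128), an FP16 baseline, and that "GB" in the post means GiB; none of these are stated in the post itself, and `kv_cache_bytes` is a hypothetical helper.

```python
# Hypothetical helper: KV-cache footprint for a decoder with grouped-query attention.
# Model shape below is assumed from the public Llama 3.1 70B config, not from the post.

GIB = 1024 ** 3  # bytes in one GiB


def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, bits_per_elem: float) -> float:
    """Bytes needed to cache K and V for every layer, KV head, and token."""
    elems = 2 * layers * kv_heads * head_dim * seq_len  # 2 = one K and one V
    return elems * bits_per_elem / 8


LAYERS, KV_HEADS, HEAD_DIM = 80, 8, 128
SEQ = 128 * 1024  # "128K" context = 131,072 tokens

fp16_gib = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, SEQ, 16) / GIB
print(f"FP16 KV cache: {fp16_gib:.1f} GiB")

# Average bits per cached element implied by the quoted 12.5 GB footprint
# (assumption: the post's "GB" means GiB).
implied_bits = 16 * 12.5 / fp16_gib
print(f"implied average: {implied_bits:.2f} bits/element")
print(f"compression: {fp16_gib / 12.5:.1f}x")
```

Under those assumptions the FP16 cache comes to exactly 40 GiB at 131,072 tokens, matching the post, and 12.5 GiB corresponds to an average of 5 bits per cached element, a 3.2x reduction. The post doesn't say how those bits are split between quantized values and metadata, and the 48x speedup is a separate claim that this arithmetic does not address.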

ensue @ensue_ai

Austin Baggio @AustinBaggio · Apr 16

Why does editing an agent's soul.md feel so invasive

Austin Baggio @AustinBaggio · Apr 15

@ClementDelangue Do you look for a metric when you compare harnesses? We've been noticing really good results optimizing kernels for specific hardware, assuming you care about token throughput?

clem 🤗 @ClementDelangue · Apr 14

Is there somewhere a collection of the best agent/coding harnesses for each model, especially open-source and local ones? In my opinion, the biggest reason why people are struggling with open/local models these days is that the agent/coding harnesses in most open agents are not designed for them and expect it to magically work when they switch models from the default.

Austin Baggio reposted

chester @chesterzelaya · Apr 14

the male equivalent to flowers is probably an RTX6000 Pro Blackwell Workstation

Austin Baggio @AustinBaggio · Apr 15

What's incredible is the breadth of discovery that the agents uncover. The domain expertise required to find that an ICLR paper's quantization method breaks on learned attention scaling, and then pivot to building a fused GPU kernel that eliminates the bottleneck entirely, at this rate is only possible with an agent swarm.

Sai Vegasena @svegas18

My research agents implemented @GoogleDeepMind's TurboQuant (arxiv.org/abs/2504.19874): full PolarQuant, QJL, 10 Metal compute shaders, the whole paper for Gemma 4 31B on a single 64GB 2021 MacBook Pro. Turns out it doesn't work on this architecture ... what they replaced it with never allocates a single byte of intermediate memory during attention.
5 custom Metal compute shaders ft:
- fused int4 SDPA (dequantize in GPU registers)
- online softmax with zero temporaries
- dual-strategy parallelism (D=256 sliding, D=512 global)
- bit-mask nibble extraction (MLX qdot pattern)
177 experiments run autonomously by my swarm over a weekend, coordinated through @ensue_ai
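The post names two kernel ideas: dequantizing int4 values in registers and an online softmax with no temporaries. Below is a minimal pure-Python sketch of both for a single query, assuming a simple per-row min/max 4-bit scheme with two codes packed per byte. It is not TurboQuant, PolarQuant, or the author's Metal code, and every function name is hypothetical; it only illustrates the data flow, where each K/V element is decoded with a shift-and-mask at the point of use and the softmax is accumulated in one pass by rescaling the running sums whenever a new maximum appears.

```python
import math
import random
from typing import List, Tuple

QRow = Tuple[bytes, float, float]  # (packed 4-bit codes, scale, zero point)


def quantize_row_int4(row: List[float]) -> QRow:
    """Assumed scheme: asymmetric min/max 4-bit quantization, two codes per byte."""
    lo, hi = min(row), max(row)
    scale = (hi - lo) / 15 or 1.0  # fall back to 1.0 for a constant row
    codes = [min(15, max(0, round((x - lo) / scale))) for x in row]
    if len(codes) % 2:
        codes.append(0)  # pad to an even count so every byte holds two nibbles
    packed = bytes(codes[i] | (codes[i + 1] << 4) for i in range(0, len(codes), 2))
    return packed, scale, lo


def dequant(packed: bytes, scale: float, zero: float, i: int) -> float:
    """Recover element i: pick its byte, then shift and mask out the nibble."""
    byte = packed[i >> 1]
    code = (byte >> 4) & 0xF if i & 1 else byte & 0xF
    return code * scale + zero


def fused_int4_attention(q: List[float], keys: List[QRow], values: List[QRow]) -> List[float]:
    """Single-query attention over int4 K/V with an online (one-pass) softmax.

    Elements are dequantized one at a time inside the loop and only O(d) running
    state is kept; no dequantized K/V matrix or score vector is ever built.
    """
    d = len(q)
    inv_sqrt_d = 1.0 / math.sqrt(d)
    m = -math.inf     # running max of scores seen so far
    denom = 0.0       # running sum of exp(score - m)
    acc = [0.0] * d   # running sum of exp(score - m) * v
    for (kp, ks, kz), (vp, vs, vz) in zip(keys, values):
        s = sum(q[i] * dequant(kp, ks, kz, i) for i in range(d)) * inv_sqrt_d
        m_new = max(m, s)
        rescale = math.exp(m - m_new)  # 0.0 on the first token, since m = -inf
        p = math.exp(s - m_new)
        denom = denom * rescale + p
        for i in range(d):
            acc[i] = acc[i] * rescale + p * dequant(vp, vs, vz, i)
        m = m_new
    return [a / denom for a in acc]


def reference_attention(q: List[float], K: List[List[float]], V: List[List[float]]) -> List[float]:
    """Textbook two-pass softmax(q K^T / sqrt(d)) V on already-dequantized rows."""
    d = len(q)
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in K]
    mx = max(scores)
    w = [math.exp(s - mx) for s in scores]
    z = sum(w)
    return [sum(wi * v[j] for wi, v in zip(w, V)) / z for j in range(d)]


random.seed(0)
D, T = 8, 16
K = [[random.uniform(-1, 1) for _ in range(D)] for _ in range(T)]
V = [[random.uniform(-1, 1) for _ in range(D)] for _ in range(T)]
q = [random.uniform(-1, 1) for _ in range(D)]

qk = [quantize_row_int4(k) for k in K]
qv = [quantize_row_int4(v) for v in V]
out = fused_int4_attention(q, qk, qv)

# Compare against the naive path applied to the same dequantized values.
Kd = [[dequant(*row, i) for i in range(D)] for row in qk]
Vd = [[dequant(*row, i) for i in range(D)] for row in qv]
ref = reference_attention(q, Kd, Vd)
print(max(abs(a - b) for a, b in zip(out, ref)))
```

The printed value is the largest difference from the two-pass reference on the same dequantized rows and should be at floating-point noise level. A GPU kernel would split this loop across threads and keep the running state in registers, which this sequential sketch does not attempt to model.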

Austin Baggio @AustinBaggio · Apr 15

Discoveries compound when you research with a swarm of agents. Finding breakthroughs is now a choice.

Christine Yip @christinetyip

x.com/i/article/2044…

Austin Baggio @AustinBaggio · Apr 14

@felixrieseberg Cool to see a full reproductive cycle of the IDE

Felix Rieseberg @felixrieseberg · Apr 14

Today is a big day! We're launching a ~ new ~ version of Claude Code in the desktop app. It's been redesigned from the ground up for parallel work and is a lot faster. It's been my main way to use Claude Code for the last few weeks.
