Dillon Erb
@dlnrb

320 posts
building something new @a____t____g — prev: CEO / co-founder @hellopaperspace (acquired by @digitalocean)

New York, USA · Joined August 2010
1.1K Following · 1.6K Followers
Dillon Erb retweeted
turbopuffer @turbopuffer
queue.json on object storage is all you need to build a reliable distributed job queue
→ FIFO execution
→ at-least-once delivery
→ 10x lower tail latencies
tpuf.link/queue
15 replies · 58 reposts · 789 likes · 206.2K views
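turbopuffer's actual design isn't public beyond this tweet, but the three claims (single JSON object, FIFO, at-least-once) can be sketched with the standard recipe: keep all queue state in one blob and serialize every mutation through conditional (compare-and-swap) writes, which object stores like S3 now support. Everything below — the `ObjectStore` stand-in, `JsonQueue`, and its field names — is a hypothetical illustration, not tpuf's implementation.

```python
import json

class ObjectStore:
    """In-memory stand-in for an object store that supports conditional
    (compare-and-swap) writes, as S3/GCS-style stores do."""
    def __init__(self):
        self.blob = None
        self.version = 0

    def get(self):
        return self.blob, self.version

    def put_if_match(self, blob, expected_version):
        # The write succeeds only if nobody else wrote in between.
        if self.version != expected_version:
            return False
        self.blob = blob
        self.version += 1
        return True

class JsonQueue:
    """FIFO, at-least-once job queue whose entire state is one queue.json blob."""
    def __init__(self, store):
        self.store = store
        if store.get()[0] is None:
            store.put_if_match(json.dumps({"pending": [], "in_flight": []}), 0)

    def _update(self, mutate):
        # Optimistic concurrency: read, mutate, conditionally write, retry on conflict.
        while True:
            blob, version = self.store.get()
            state = json.loads(blob)
            result = mutate(state)
            if self.store.put_if_match(json.dumps(state), version):
                return result

    def enqueue(self, job):
        self._update(lambda s: s["pending"].append(job))

    def dequeue(self):
        # Move the head job to in_flight instead of deleting it; if the worker
        # crashes before ack(), the job can be redelivered (at-least-once).
        def take(s):
            if not s["pending"]:
                return None
            job = s["pending"].pop(0)  # head of the list -> FIFO order
            s["in_flight"].append(job)
            return job
        return self._update(take)

    def ack(self, job):
        self._update(lambda s: s["in_flight"].remove(job))
```

The CAS loop is what makes a single shared blob safe under concurrent workers: a lost race costs a re-read, never a lost or duplicated enqueue.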
Dillon Erb @dlnrb
Excited to finally come out of stealth and share what we've been building!

Introducing Autonomous — a superintelligent financial advisor at 0% advisory fees.

We just announced our $15M fundraise led by @garrytan at @ycombinator along with some other amazing investors!

Get early access → becomeautonomous.com

We are hiring across multiple roles in NYC and SF @a____t____g
111 replies · 53 reposts · 1.4K likes · 202.7K views
Dillon Erb retweeted
Y Combinator @ycombinator
The founders of Paperspace (YC W15) just announced Autonomous (@a____t____g), an AI-native wealth strategist that brings the elite strategies used by the ultra-wealthy to everyone, at 0% advisory fees.

Millions of people already ask AI what to do with money. Autonomous is building the missing piece: the Cursor "apply" button that connects your financial life with AI.

Get early access: becomeautonomous.com
35 replies · 51 reposts · 1.3K likes · 150K views
Dillon Erb @dlnrb
Great work 👏
xjdr @_xjdr

today we’re open-sourcing nmoe: github.com/Noumena-Networ…

i started this because training deepseek-shaped ultra-sparse moes should be straightforward at research scale, but in practice it’s painful:
- expert flops get stranded (router shatters your batch → tiny per-expert gemms → gpus idle)
- router stability is fragile (especially without deepseek’s batch sizes)
- data + mixtures dominate (proxy runs are useless if mixtures aren’t deterministic/resumable)

nmoe is our attempt at a clean, production-grade reference path for moe training that you can actually read + modify (outside of the highly optimized kernels).

what’s inside:
- rdep (replicated dense / expert parallel): replicate dense/attention, shard experts, pool+dispatch routed tokens so per-expert batches are hot (no nccl all-to-all on the moe path; direct dispatch/return via ipc + nvshmem)
- mixed precision experts (bf16/fp8/nvfp4), with a focus on killing the usual “mixed precision overhead” taxes
- a frontier-ish data pipeline: deterministic mixtures, exact resume, and tooling for building/inspecting datasets (including hydra-style grading)
- metrics + nviz: sqlite experiments + duckdb timeseries + a dashboard that reads from shared storage
- container-first + toml-first, and intentionally narrow: b200-only (sm_100a), no tensor parallel, no expert all-to-all

this repo started in the spirit of nanochat (small, hackable, end-to-end), then grew into a rewrite of a bunch of the core components we wish existed as a public reference for moe training.

over the next few weeks i’ll post deep dives on:
- rdep + why per-expert batch size is the whole moe problem
- router stability in small runs
- fp8/nvfp4 expert training without drowning in overhead
- deterministic mixtures + why “close enough” sampling breaks proxy validity
- the metrics/nviz stack and what we track that actually matters

0 replies · 0 reposts · 5 likes · 2.2K views
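The "router shatters your batch → tiny per-expert gemms" problem, and the pool+dispatch fix the tweet describes, can be illustrated with a toy gather/scatter in plain Python. nmoe's real path runs on GPUs via ipc + nvshmem; the function names and list-based structure here are mine, purely to show the idea that grouping routed tokens by expert turns many tiny matmuls into one dense batch per expert.

```python
def dispatch(tokens, expert_ids, num_experts):
    """Group routed tokens by expert so each expert runs one dense batch
    instead of many tiny ones. Also record each token's original position
    so outputs can be scattered back."""
    buckets = [[] for _ in range(num_experts)]
    origins = [[] for _ in range(num_experts)]
    for pos, (tok, e) in enumerate(zip(tokens, expert_ids)):
        buckets[e].append(tok)
        origins[e].append(pos)
    return buckets, origins

def combine(expert_outputs, origins, num_tokens):
    """Scatter per-expert outputs back to the original token order."""
    out = [None] * num_tokens
    for outputs, positions in zip(expert_outputs, origins):
        for tok, pos in zip(outputs, positions):
            out[pos] = tok
    return out
```

With real tensors, `buckets[e]` becomes a contiguous activation matrix, so each expert's GEMM runs at a hot batch size instead of idling on a handful of tokens.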
Dillon Erb retweeted
Sebastian Raschka @rasbt
Implemented Olmo 3 from scratch (in a standalone notebook) this weekend! If you are a coder, this is probably the best way to read the architecture details at a glance: github.com/rasbt/LLMs-fro…
Sebastian Raschka @rasbt

Olmo models are always a highlight due to them being fully transparent and their nice, detailed technical reports. I am sure I'll talk more about the interesting training-related aspects from that 100-pager in the upcoming days and weeks. In the meantime, here's the side-by-side architecture comparison with Qwen3.

1) As we can see, the Olmo 3 architecture is relatively similar to Qwen3. However, it's worth noting that this is most likely inherited from the Olmo 2 predecessor, not Qwen3.

2) Similar to Olmo 2, Olmo 3 still uses a post-norm flavor instead of pre-norm, as they found in the Olmo 2 paper that it stabilizes training.

3) Interestingly, the 7B model still uses multi-head attention, similar to Olmo 2. However, to make things more efficient and shrink the KV cache size, they now use sliding-window attention (similar to Gemma 3). Next, let's look at the 32B model.

4) Overall, it's the same architecture, just scaled up. Also, the proportions (e.g., going from the input size to the intermediate size in the feed-forward layer, and so on) roughly match the ones in Qwen3.

5) My guess is the architecture was initially somewhat smaller than Qwen3 due to the smaller vocabulary, and they then scaled up the intermediate-size expansion from 5x in Qwen3 to 5.4x in Olmo 3 to have a 32B model for a direct comparison.

6) Also, note that the 32B model (finally!) uses grouped-query attention.

17 replies · 285 reposts · 2K likes · 165.8K views
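The thread's two KV-cache points — grouped-query attention (point 6) and why it shrinks the cache — come down to one mapping: several query heads share a single KV head, so the cache scales with the KV-head count. A minimal sketch, with illustrative head counts rather than Olmo 3's or Qwen3's actual configuration:

```python
def kv_head_for(q_head, num_q_heads, num_kv_heads):
    """Map a query head to the KV head it shares under grouped-query attention.
    Consecutive query heads form groups; each group reads one KV head."""
    assert num_q_heads % num_kv_heads == 0
    group_size = num_q_heads // num_kv_heads
    return q_head // group_size

def kv_cache_bytes(seq_len, num_kv_heads, head_dim, bytes_per_elem=2):
    """Per-layer KV cache size: K and V, each of shape
    [seq_len, num_kv_heads, head_dim], at bf16 (2 bytes) by default."""
    return 2 * seq_len * num_kv_heads * head_dim * bytes_per_elem
```

Multi-head attention is the `num_kv_heads == num_q_heads` special case, which is why moving the 7B model to sliding-window attention (bounding `seq_len`) and the 32B model to GQA (shrinking `num_kv_heads`) both attack the same memory term.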
Dillon Erb retweeted
Ryan D’Onofrio @rsdgpt
I built osgrep. It's a local code search tool that understands natural language. It works as a standalone CLI or as a plugin for Claude Code. No API keys or subscription. I wanted the power of "semantic search" without the latency, price, or privacy trade-offs. Video is in real time.
32 replies · 52 reposts · 610 likes · 68.7K views
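The tweet doesn't describe osgrep's internals, but "semantic search without API keys" generally means embedding code chunks locally and ranking by cosine similarity. A minimal sketch of just the retrieval side, with a hashed bag-of-words vector standing in for the real local embedding model (the `embed`/`search` names and everything else here are hypothetical, not osgrep's API):

```python
import math
import re
import zlib

def embed(text, dim=256):
    """Toy stand-in for a local embedding model: a unit-normalized hashed
    bag-of-words vector. A real tool would run an actual embedding model;
    only the retrieval mechanics matter for this sketch."""
    vec = [0.0] * dim
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        vec[zlib.crc32(word.encode()) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def search(query, chunks, top_k=3):
    """Rank code chunks by cosine similarity to the query, entirely locally.
    Since both vectors are unit-normalized, the dot product is the cosine."""
    q = embed(query)
    def score(chunk):
        return sum(a * b for a, b in zip(q, embed(chunk)))
    return sorted(chunks, key=score, reverse=True)[:top_k]
```

Because everything runs in-process over a pre-indexed repo, there is no network hop, per-query cost, or code leaving the machine — the three trade-offs the tweet calls out.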
Dillon Erb @dlnrb
Super talented team; excited to see what they are building!
Y Combinator @ycombinator

hillclimb (@hillclimbai) is the human superintelligence community, dedicated to building golden datasets for AGI. Starting with math, their team of IMO medalists, Lean experts, and PhDs is designing RL environments for @NousResearch.

0 replies · 0 reposts · 4 likes · 465 views