jason
2.8K posts

@jvmncs
❤️s/RTs are randomized and differentially private.

yesterday I was debugging a poorly performing training run with Claude Code and discovered that, instead of training on 30 batches of data, it had somehow decided to train a new model for 500 steps on each batch and then average the 30 sets of weights
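To make the difference concrete, here is a minimal sketch of the two procedures on a toy linear-regression problem. Everything in it is an assumption for illustration: the NumPy model, the `sgd_step` helper, the learning rate, and the choice of one gradient step per batch in the intended loop. The original run's model and code are not shown. For a convex toy like this, averaging independently trained weights still lands near the right answer, so the sketch shows the structural difference rather than reproducing the poor performance.

```python
import numpy as np

rng = np.random.default_rng(0)
true_w = np.array([2.0, -3.0, 0.5])  # hypothetical ground-truth weights

# 30 toy batches of (features, targets)
batches = []
for _ in range(30):
    X = rng.normal(size=(16, 3))
    y = X @ true_w + 0.1 * rng.normal(size=16)
    batches.append((X, y))


def sgd_step(w, X, y, lr=0.05):
    """One gradient step on mean squared error."""
    grad = 2.0 * X.T @ (X @ w - y) / len(y)
    return w - lr * grad


def train_sequential(batches):
    """Intended procedure (assumed): one model, updated once per batch."""
    w = np.zeros(3)
    for X, y in batches:
        w = sgd_step(w, X, y)
    return w


def train_per_batch_and_average(batches, steps=500):
    """What the run did: a fresh model per batch, 500 steps each, then average."""
    finals = []
    for X, y in batches:
        w = np.zeros(3)  # new model for every batch
        for _ in range(steps):
            w = sgd_step(w, X, y)
        finals.append(w)
    return np.mean(finals, axis=0)


w_seq = train_sequential(batches)
w_avg = train_per_batch_and_average(batches)
```

The second function runs 30 × 500 = 15,000 gradient steps and fits each model to only 16 examples, where the first runs 30 steps over all the data.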

Modal Auto Endpoints provide state-of-the-art open source inference perf with a click. Learn how we developed our low latency inference playbook with @DecagonAI, delivering responses 60ms faster than the best proprietary provider. modal.com/blog/achieve-s…

Modal's been super important for our velocity over the last 6 months
- Training on each user's context means scaling out to thousands of GPUs in quick bursts. Modal allowed us to do this from day zero, before we could keep a large committed cluster hot
- Our research team experiments with weird parameterizations all the time and needs to make changes to our inference and training servers. Modal makes it super easy for everyone on the team to deploy new endpoints for dogfooding and eval

It is not too late to _actually_ own your inference. Introducing: Modal Auto Endpoints.


1st and 2nd degree moots: I will be hosting another movie club in south bay in june, DM me for details
Paris, Texas (1984)

Speculation Is All You Need. In this blog post, we announce the co-release (w/ Z Lab) of six more state-of-the-art DFlash speculators for @Alibaba_Qwen 3.x. Over 1k output tps for 3.5 122B-A10B on a B200. Read the blog for why we're all-in on spec dec. modal.com/blog/spec-is-a…
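For readers new to speculative decoding, a minimal sketch of the generic greedy draft-and-verify loop may help. This is not DFlash and not the SGLang implementation. The toy `target_next` and `draft_next` functions are hypothetical stand-ins for a large model and a cheap drafter, and the verification pass is simulated with per-position calls where a real system would score all drafted positions in one forward pass.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]


def target_next(seq: List[int]) -> int:
    """Hypothetical toy 'large' model: deterministic next token."""
    return (sum(seq) * 31 + len(seq)) % 50


def draft_next(seq: List[int]) -> int:
    """Hypothetical toy drafter: agrees with the target except at every 4th position."""
    t = target_next(seq)
    return t if len(seq) % 4 else (t + 1) % 50


def greedy_decode(model: NextToken, prompt: List[int], n: int) -> List[int]:
    """Baseline: one model call per generated token."""
    seq = list(prompt)
    for _ in range(n):
        seq.append(model(seq))
    return seq


def speculative_decode(
    target: NextToken, draft: NextToken, prompt: List[int], n: int, k: int = 4
) -> Tuple[List[int], int]:
    """Draft k tokens, keep the prefix the target agrees with, repeat."""
    seq = list(prompt)
    goal = len(prompt) + n
    target_passes = 0
    while len(seq) < goal:
        # 1. The drafter proposes k tokens autoregressively.
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(seq + proposal))
        # 2. One (simulated) target pass checks every drafted position.
        target_passes += 1
        accepted: List[int] = []
        for tok in proposal:
            expected = target(seq + accepted)
            if tok != expected:
                accepted.append(expected)  # fix the first mismatch, drop the rest
                break
            accepted.append(tok)
        else:
            accepted.append(target(seq + accepted))  # all matched: bonus token
        seq.extend(accepted)
    return seq[:goal], target_passes


baseline = greedy_decode(target_next, [1, 2, 3], 20)
spec_out, passes = speculative_decode(target_next, draft_next, [1, 2, 3], 20)
```

The output matches plain greedy decoding with the target model token for token. The speedup comes from needing fewer target passes than generated tokens whenever the drafter is usually right.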


We worked with @lmsysorg and z-lab.ai to
- integrate DFlash spec into @sgl_project
- make it faster with overlap
- train a DFlash drafter for @Alibaba_Qwen 397B-A17B
The result: up to 4.3x greater throughput over baseline and 1.5x over native MTP.



So excited to be opening up OpenEnv to the whole community. It will now be owned by @huggingface, Meta-PyTorch, @reflection_ai, @UnslothAI, @modal, @PrimeIntellect, @NVIDIAAI, @mercor_ai, and @fleet_ai.

the reason is: frontier labs train the model and the harness together, so the model is fitted to its harness. that coupling is a chunk of why claude code and codex feel so good. open source can't do that. you bring whatever harness, whatever model, whatever env, whatever trainer. which is the whole point of open source and also the problem for training.

openenv is the socket in between all of this. in short: it's a protocol layer, not a reward framework. it does not have opinions about your rewards or your training loop. those live in the libs that are actually good at them.

read more in the blog post. it's early, come break it.



I get asked a lot about what actually matters in the inference space. The conversation has shifted as OSS frameworks have closed much of the gap on raw latency, but workload-specific tuning remains an open problem. Increasingly, more differentiation lives in the product layer around infrastructure.

What separates providers now:

Latency: for synchronous, latency-sensitive workloads, the ability to tailor deployments to meet specific needs (whether TTFT or e2e) is critical and highly dependent on token profiles and use case requirements.
Throughput & cost: these form a pareto frontier with latency.
Reliability: table stakes. Observability and alerting are a big part of this.
Developer velocity: underrated on most lists. Self-serve configurability is a massive force multiplier for sophisticated teams.
Autoscaling flexibility: not just "does it scale" but what triggers it and how fast.
Capacity: still a real constraint for newer hardware, and the geographic dimension for colocation can make this a harder constraint.
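As a small illustration of why TTFT and e2e latency depend on token profiles, here is a sketch that derives both from per-token arrival times. The `RequestTrace` type and the example timestamps are hypothetical, and this is not any provider's measurement API.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RequestTrace:
    sent_at: float            # time the request was sent, in seconds
    token_times: List[float]  # arrival time of each output token, in seconds


def ttft(trace: RequestTrace) -> float:
    """Time to first token: dominated by queueing and prompt processing."""
    return trace.token_times[0] - trace.sent_at


def e2e_latency(trace: RequestTrace) -> float:
    """End-to-end latency: time until the last output token arrives."""
    return trace.token_times[-1] - trace.sent_at


def decode_tokens_per_second(trace: RequestTrace) -> float:
    """Output rate after the first token; NaN if it cannot be computed."""
    n = len(trace.token_times)
    decode_time = trace.token_times[-1] - trace.token_times[0]
    if n < 2 or decode_time <= 0:
        return float("nan")
    return (n - 1) / decode_time


# Hypothetical trace: first token after 250 ms, then one token every 50 ms.
trace = RequestTrace(sent_at=0.0, token_times=[0.25, 0.30, 0.35, 0.40, 0.45])
first_token = ttft(trace)
total = e2e_latency(trace)
rate = decode_tokens_per_second(trace)
```

In this trace, TTFT makes up more than half of the e2e latency because the output is short. A request with a long output would be dominated by the decode rate instead.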
