Harley Wiltzer

226 posts

@harwiltz

PhD Student @McGillU / @Mila_Quebec, distributional RL enthusiast

Montréal, Québec · Joined July 2020
409 Following · 252 Followers
Harley Wiltzer retweeted
Volodymyr Kuleshov 🇺🇦
The transformer made training parallel and unlocked scaling laws for pre-training. Inference is still sequential. Diffusion is the evolution for inference.
Inception @_inception_ai

Will the next decade of LLMs run on autoregression, or on diffusion? One of the top questions we got at MLSys this week. Part 6, the final part of our founder story series with @timt at @MenloVentures. Featuring @StefanoErmon, @adityagrover_, @volokuleshov

Harley Wiltzer retweeted
Stefano Ermon @StefanoErmon
This is the future of coding agents: specialized, high-speed subagents handling critical workflows in the background. Proud to see @augmentcode running Mercury 2 in production for context compaction with major latency and cost improvements.
Inception @_inception_ai

@augmentcode rebuilt their context compaction layer around Mercury 2. 82% latency cut. 90% cost cut. Comparable quality to Opus 4.7. Running in production today.
"We took a counter-intuitive bet. We decoupled summarization entirely, offloading it to Mercury 2 as a dedicated subagent. Mercury 2 is the highly efficient engine powering our most critical workflows."
-@RustagiAnkur & @jm1234567890, Members of Technical Staff at Augment Code
The subagent layer needs the most efficient model. Full methodology and eval setup in the writeup. inceptionlabs.ai/blog/rise-of-r…

Harley Wiltzer retweeted
Augment Code @augmentcode
At @augmentcode, we took a counter-intuitive bet on our AI architecture. Instead of using the primary coding model to preserve KV cache (the industry standard), we used Mercury 2 by @_inception_ai as a dedicated subagent. The payoff for our users: 82% faster context compaction, 90% lower summarization costs, <1s tool-search summaries, and 30% lower LLM spend via Prism routing. Read the full story here: inceptionlabs.ai/blog/rise-of-r…
Harley Wiltzer retweeted
Inception @_inception_ai
@augmentcode rebuilt their context compaction layer around Mercury 2. 82% latency cut. 90% cost cut. Comparable quality to Opus 4.7. Running in production today.
"We took a counter-intuitive bet. We decoupled summarization entirely, offloading it to Mercury 2 as a dedicated subagent. Mercury 2 is the highly efficient engine powering our most critical workflows."
-@RustagiAnkur & @jm1234567890, Members of Technical Staff at Augment Code
The subagent layer needs the most efficient model. Full methodology and eval setup in the writeup. inceptionlabs.ai/blog/rise-of-r…
Harley Wiltzer retweeted
Ben Lang @benln
Pulled the fastest-growing startups based on X follower growth over the past 90 days:
Ben Lang tweet media
Harley Wiltzer retweeted
Tomas Hernando Kofman @tomas_hk
Our model router now supports @_inception_ai's Mercury 2, the fastest code gen model in existence. Use it with Not Diamond or @OpenRouter's /auto mode. For max speeds, use the latency tradeoff in nd or the plugins param in OpenRouter to route bw Mercury and a stronger model.
Harley Wiltzer retweeted
Sid Sharma @phylera14
p50: 175ms vs 686ms
p99: 517ms vs 1183ms
a top-10 US tech company benchmarked Mercury 2 from @_inception_ai against Gemini Flash on their search pipelines in prod. same tasks. same eval. diffusion LLMs are a different animal.
Harley Wiltzer retweeted
Mohamad H. Danesh @mo_danesh
📢 New paper out! We introduce QWM: a single locomotion world model trained across 8 quadrupeds and deployed zero-shot on robots it had never seen by conditioning on their morphology specs: ANYmal-D and Unitree Go1 🦾 No fine-tuning, no warm-up, no retraining from scratch. The key insight: robot morphology isn't a latent variable to infer from motion history, it's a known engineering spec sitting in the USD (or URDF) file. So we just use it directly.
Harley Wiltzer retweeted
Stefano Ermon @StefanoErmon
If your fast inference capacity is constrained, Mercury 2 is ready. Same speed class, GPU-based deployment, and enough capacity for production traffic.
Sid Sharma @phylera14

Abandoned by Cerebras and your prod traffic has nowhere to go? Mercury 2 is still here:
>1,000 tok/s
Haiku/Flash-level quality
$0.75 / 1M output tokens
Fast inference shouldn’t disappear just because your provider changed strategy. DM: sid@inceptionlabs.ai

Harley Wiltzer retweeted
Sid Sharma @phylera14
Abandoned by Cerebras and your prod traffic has nowhere to go? Mercury 2 is still here:
>1,000 tok/s
Haiku/Flash-level quality
$0.75 / 1M output tokens
Fast inference shouldn’t disappear just because your provider changed strategy. DM: sid@inceptionlabs.ai
Harley Wiltzer retweeted
Inception @_inception_ai
Mercury 2 is in a league of its own. 1,200 tok/s at comparable quality to speed-optimized autoregressive models, per @ArtificialAnlys.
Ziyan "Ray" Luo @RayZiyan41307
Nishanth’s expertise in RL is exceptional—precise, insightful, and reliable. A true contributor to both research and community, and an outstanding teacher who taught me a lot. Congrats, my friend, glad to be part of this beautiful photo.
Nishanth Anand @itsNVA7

Thrilled to share that on March 20th, I defended my Ph.D. with flying colours! 🎨🎓🧠 My thesis, "The permanent and transient framework for continual RL," had an amazing reception from my committee: Peter Dayan, Marc Bellemare, Paul Masset, and Doina Precup.

Harley Wiltzer retweeted
Inception @_inception_ai
Artificial Analysis launched a Model Recommender. Set your priorities for intelligence, speed, and cost, and it ranks the best models for your stack. Mercury 2 ranks first. See the full ranking on @ArtificialAnlys: artificialanalysis.ai/models/recomme…
Harley Wiltzer retweeted
Inception @_inception_ai
1,000+ tokens per second. 10x faster than autoregressive models. On standard GPUs. @StefanoErmon and @volokuleshov break down where that speed matters most: voice agents, coding, and production agent systems where latency compounds across every call. Our founder series with @timt at @MenloVentures.
Harley Wiltzer retweeted
Inception @_inception_ai
An Evening with Inception x @iendeavors - Tues, April 14, Palo Alto. @StefanoErmon and the Inception team. Drinks, bites, and conversation about diffusion LLMs. No talks. No slides. We'd love to meet researchers, engineers, and students who are curious about dLLMs and where they're headed. Space is limited - invite here: luma.com/5p7cz4tq
Harley Wiltzer retweeted
Lucas Bunzel @LBunzel
Everyone's talking about which model to run in OpenClaw. Looked at the PinchBench data. Step 3.5 Flash is the most popular model on OpenRouter, with 3T+ tokens in March. Here's how it compares to Mercury 2 on real agent tasks:
Step 3.5 Flash → Mercury 2
→ Speed: 62 → 96
→ Cost efficiency: 94 → 99
→ Consistency: 89 → 90
Comparable task success to GPT-5 Mini, GPT-4o, Gemini Flash, and DeepSeek Chat.
Agents chain 20+ tool calls per task. Latency compounds. Cost compounds. That's where this gap matters most. pinchbench.com/?view=graphs&g…
Harley Wiltzer retweeted
Inception @_inception_ai
We're hiring. Inception builds diffusion-based LLMs that generate tokens in parallel, not one at a time. Our founders helped invent diffusion models, flash attention, decision transformers, and DPO. Team from DeepMind, OpenAI, Meta AI, Microsoft AI, AWS, Scale, and Stripe. Open roles: inceptionlabs.ai/careers