left pocket cheesecake
936 posts

@TraeMurray24
kubernetes enjoyer
Joined December 2012
773 Following, 238 Followers

@ItsmeAjayKV @ggerganov Google released MTP for Gemma4 today but this post specifically mentions and links to a separate draft model - x.com/googlegemma/st…
Am I misunderstanding something or are they just calling their standard speculative decoding MTP?
Google Gemma@googlegemma

The one big thing I'm waiting for in llama.cpp (shout out to @ggerganov ) right now is MTP. With Qwen 3.6 (my fav model) already supporting it, we are going to see massive improvements in generation speed once it's fully merged.
So, what exactly is MTP?
It stands for Multi-Token Prediction. If you understand speculative decoding, this is the next level. Instead of relying on a smaller, separate draft model, MTP is built directly into the model during its initial training. The main model produces draft tokens on its own, using auxiliary heads that let it output multiple future tokens simultaneously. It's leaner, faster, and incredibly efficient for local hardware.
How is it different from other methods? Well, let's go over them in brief.
1. Standard Speculative Decoding (Draft models)
You load two models into memory: the big target model (e.g. 35B) and a tiny, fast draft model (< 2B) from the same family. The small draft model runs ahead, generating 4 or 5 tokens sequentially. The massive target model then does a single forward pass to check the draft's math.
Pros: Consistent speedups across workloads.
Cons: Eats more VRAM. In my 3060 case, where I try to squeeze a heavy model into 12GB of VRAM, sacrificing a GB or two just to host a draft model can be a painful trade-off.
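A minimal sketch of the draft-then-verify loop described above, assuming greedy decoding. The names `speculative_step`, `generate`, `toy_target`, and `toy_draft` are hypothetical, and the "models" are toy functions that map a token list to the next token. A real engine would score all drafted positions in one batched forward pass of the target model; this sketch loops over them for clarity.

```python
from typing import Callable, List

# Hypothetical interface: a "model" maps the tokens so far to its next token.
NextToken = Callable[[List[int]], int]


def speculative_step(target_next: NextToken, draft_next: NextToken,
                     context: List[int], k: int = 4) -> List[int]:
    """Return the tokens accepted in one draft-then-verify round."""
    # Draft phase: the small model runs ahead k tokens, one at a time.
    drafted: List[int] = []
    for _ in range(k):
        drafted.append(draft_next(context + drafted))

    # Verify phase: compare each drafted token with the target's own choice.
    accepted: List[int] = []
    for tok in drafted:
        expected = target_next(context + accepted)
        if tok != expected:
            # First mismatch: keep the target's token and drop the rest.
            accepted.append(expected)
            return accepted
        accepted.append(tok)

    # Every draft matched, so the target contributes one extra token.
    accepted.append(target_next(context + accepted))
    return accepted


def generate(target_next: NextToken, draft_next: NextToken,
             prompt: List[int], n_new: int, k: int = 4) -> List[int]:
    """Generate n_new tokens; output matches plain greedy target decoding."""
    out = list(prompt)
    while len(out) - len(prompt) < n_new:
        out.extend(speculative_step(target_next, draft_next, out, k))
    return out[:len(prompt) + n_new]


def toy_target(ctx: List[int]) -> int:
    # Toy target: counts upward modulo 10.
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[int]) -> int:
    # Toy draft: agrees with the target except right after a 4.
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10
```

With these toy models, a round starting from `[1]` accepts three drafted tokens and then takes the target's correction for the fourth, so four tokens come out of one verification round while the output stays identical to what the target alone would produce.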
2. n-gram speculative decoding (prompt lookup)
This one is interesting and a different idea from draft models. N-gram decoding simply looks at the text already in the "prompt" and guesses that it will be repeated (which is also its biggest issue). Good for coding, JSON formatting, or even RAG.
Pros: Zero VRAM overhead. Nothing extra to load. Good speedup for above mentioned tasks.
Cons: Very situational. For creative writing it fails miserably and offers almost no speedup.
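A rough sketch of the prompt-lookup idea, assuming tokens are plain integers. The function name `prompt_lookup_draft` and its parameters `ngram_size` and `max_draft` are hypothetical, not taken from any particular library. It finds the most recent earlier occurrence of the last few tokens and proposes whatever followed them as the draft; the target model would then verify that draft the same way it verifies a draft model's output.

```python
from typing import List


def prompt_lookup_draft(tokens: List[int], ngram_size: int = 2,
                        max_draft: int = 4) -> List[int]:
    """Propose draft tokens by matching the trailing n-gram earlier in the text."""
    if len(tokens) < ngram_size:
        return []
    tail = tokens[-ngram_size:]
    # Scan backwards from the most recent position, skipping the tail itself.
    for start in range(len(tokens) - ngram_size - 1, -1, -1):
        if tokens[start:start + ngram_size] == tail:
            # Propose the tokens that followed the earlier occurrence.
            end = start + ngram_size
            return tokens[end:end + max_draft]
    # No repetition found: nothing to draft, so no speedup on this step.
    return []
```

When the text never repeats, the function returns an empty draft, which matches the "very situational" downside: there is simply nothing to verify.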
3. DFlash (Block diffusion drafting)
DFlash replaces the traditional autoregressive draft model with a lightweight block diffusion model. Instead of guessing tokens sequentially, DFlash generates an entire block of tokens in parallel in a single forward pass. It achieves this by pulling hidden state features directly from the target model and using them as context to denoise a block of next tokens immediately.
Pros: Super fast. By removing the sequential bottleneck of the drafting phase, it can achieve high lossless acceleration.
Cons: Nothing much actually, though it does require specialized checkpoints trained specifically to align with the target model.
Also take a look at LuceBox-hub D-Flash and P-flash by @davideciffa


@evanjconrad Are yall looking to build more datacenters in the near future?

San Francisco Compute is growing rapidly & we're hiring across the board for our systems engineering, data center development, & (product & brand) design teams.
We're the local supercomputing company. We sell people GPU clusters on contracts they can sublease.
SFC's goal is to reduce the financial risk of one of the largest infrastructure build-outs in history. To do that, we vertically integrated. That means we build data centers, the clusters in the data center, and a cloud platform that we built on top of the most order book of its kind.
This lets you do cool stuff like "buy a 1 month contract 3 months out, but only if I can get it colocated, and only if the price is at a 25% discount to the current market price." You can walk in the door on Friday, buy a 3-year contract, and then walk out the door on Monday by selling the whole thing. In other words, we build the cloud for people who care about margins & their risk exposure.
We did that because SFC was originally "Junelark" (a teeny tiny 2-person AI lab), which bought too big of a GPU cluster & was forced to sublease it. The first year of the company was tremendously stressful because if we didn't sell the cluster, we'd go bankrupt. This forced us to become a very rough accidental cloud. We'd operate on top of other clouds, but ran out of folks who would give us access to key parts of the cluster (like BMC, UFM, & switch access) needed to offer a viable experience. To build something great, we vertically integrated down and down until we hit the dirt.
These days, I like to operate the company somewhat quietly. Our website's a single page (we may change this). We don't show up on VC market maps and we're not in the news much. I hope this doesn't deter you; SFC operates at very large scale & has been growing at an incredible pace. We're just very focused on standing up clusters & shipping features that help our customers.
Our team includes industry veterans, like the cofounder of Voltage Park, key folks from Tesla, Meta, Lambda, Redhat, Hut8, Canonical, & Sun.
We'd love for you to join us! My DMs are open, or you can reach me at evan at sf compute dot com.

@menhguin What memory stocks are in? Been looking at those but they've pumped so much in the last year I'm not sure how much juice is left

fyi, nowadays im busy so i just have openclaw automations+deep research tracking @zephyr_z9 and @aleabitoreddit for new positions.
up about ~60% YTD mostly from existing positions:
memory stocks, intel calls, palantir puts, zai and minimax shares all of which are up ~100-200%.
Minh Nhat Nguyen@menhguin
Leopold's having fun, so here's my AI Safety twink portfolio. Total 1-year return: +892%. Criteria: Product advances human civilisation + good team. 50% Oklo (+1700%) 45% Tesla (+87%) 5% Nvidia (+55%)

@heyimandy This is awesome. Where can I find the design for this?

@jxmnop What are some good resources to learn writing kernels?

models I am still using for research:
gpt-oss series
dsv3 series
and new, nemotron super entering the canon
Neil Chowdhury@ChowdhuryNeil
i disagree. gpt-oss-120b is *the* model i use most frequently for my research. it is ridiculously good for how cheap it is (5b active). it gets hate for being worse than larger chinese models, but it is one of my favorites -- i really hope that openai releases future oss models
left pocket cheesecake retweeted

Just a quick detour to build the gooner bot but we promise to cure cancer next
*Walter Bloomberg@DeItaone
SAM ALTMAN SAYS OPENAI WILL ALLOW EROTICA FOR ADULT USERS - AXIOS

@tekbog Me at last job deploying nvidia gpu operator onto eks nodes with preinstalled drivers


@tekbog relearned networking for the 67th consecutive week













