left pocket cheesecake

936 posts

@TraeMurray24

kubernetes enjoyer

Joined December 2012
773 Following 238 Followers
AJ
AJ@ItsmeAjayKV·
The one big thing I'm waiting for in llama.cpp (shout out to @ggerganov) right now is MTP. With Qwen 3.6 (my fav model) already supporting it, we are going to see massive improvements in generation speed once it's fully merged.
So, what exactly is MTP? It stands for Multi-Token Prediction. If you understand speculative decoding, this is the next level. Instead of relying on a smaller, separate draft model, MTP is built directly into the model during its initial training. The main model produces draft tokens itself, using auxiliary heads that let it output multiple future tokens at once. It's leaner, faster, and incredibly efficient for local hardware.
How is it different from other methods? Let's go over them in brief.
1. Standard Speculative Decoding (draft models)
You load two models into memory: the big target model (e.g. 35B) and a tiny, fast draft model (< 2B) from the same family. The small draft model runs ahead, generating 4 or 5 tokens sequentially. The massive target model then does a single forward pass to check the draft's math.
Pros: Consistent speedups across workloads.
Cons: Eats more VRAM. In my 3060 case, where I try to squeeze a heavy model into 12GB of VRAM, sacrificing a GB or two just to host a draft model can be a painful trade-off.
2. N-gram speculative decoding (prompt lookup)
This one is interesting and a different idea from draft models: n-gram decoding simply looks at the text already in the "prompt" and guesses that it will be repeated (which is also its biggest issue). Good for coding, JSON formatting, or even RAG.
Pros: Zero VRAM overhead. Nothing extra to load. Good speedup for the tasks mentioned above.
Cons: Very situational. For creative writing it fails miserably and offers almost no speedup.
3. DFlash (block diffusion drafting)
DFlash replaces the traditional autoregressive draft model with a lightweight block diffusion model. Instead of guessing tokens sequentially, DFlash generates an entire block of tokens in parallel in a single forward pass. It achieves this by pulling hidden-state features directly from the target model and using them as context to denoise a block of next tokens immediately.
Pros: Super fast. By removing the sequential bottleneck of the drafting phase, it can achieve high lossless acceleration.
Cons: Nothing much, actually, though it does require specialized checkpoints trained specifically to align with the target model.
Also take a look at LuceBox-hub D-Flash and P-flash by @davideciffa
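The draft-then-verify loop that the thread describes can be sketched in a few lines of Python. This is a toy illustration only, not how llama.cpp or any real runtime implements it: `toy_target`, `ngram_draft`, `speculative_generate`, and `greedy_generate` are hypothetical names, the "target model" is a lookup over a fixed token cycle, and the verification that a real system does in one batched forward pass is simulated here with a plain loop. The drafter is the prompt-lookup (n-gram) variant, since it needs no second model.

```python
from typing import Callable, List, Tuple

Token = str
NextTokenFn = Callable[[List[Token]], Token]

# Hypothetical stand-in for a target model: it deterministically continues
# a fixed cycle of tokens based on the last token it sees.
CYCLE: List[Token] = ["def", "f", "(", "x", ")", ":"]


def toy_target(context: List[Token]) -> Token:
    """Greedy 'model': return the token that follows context[-1] in CYCLE."""
    return CYCLE[(CYCLE.index(context[-1]) + 1) % len(CYCLE)]


def ngram_draft(context: List[Token], n: int = 2, k: int = 4) -> List[Token]:
    """Prompt lookup: find the most recent earlier occurrence of the last
    n tokens and propose up to k tokens that followed it."""
    if len(context) < n:
        return []
    tail = context[-n:]
    # Scan right to left, skipping the position of the tail itself.
    for start in range(len(context) - n - 1, -1, -1):
        if context[start:start + n] == tail:
            return context[start + n:start + n + k]
    return []


def speculative_generate(
    target_next: NextTokenFn, prompt: List[Token], max_new: int,
    n: int = 2, k: int = 4,
) -> Tuple[List[Token], int]:
    """Greedy speculative decoding. Returns (tokens, verification_passes)."""
    out = list(prompt)
    passes = 0
    while len(out) - len(prompt) < max_new:
        draft = ngram_draft(out, n, k)
        passes += 1  # stands in for one batched target forward pass
        accepted: List[Token] = []
        for tok in draft:
            # Keep draft tokens only while they match the target's choice.
            if tok != target_next(out + accepted):
                break
            accepted.append(tok)
        # The target always contributes one token of its own: the correction
        # at the first mismatch, or a bonus token if every draft was accepted.
        accepted.append(target_next(out + accepted))
        out.extend(accepted)
    return out[:len(prompt) + max_new], passes


def greedy_generate(
    target_next: NextTokenFn, prompt: List[Token], max_new: int
) -> List[Token]:
    """Baseline: one target call per generated token."""
    out = list(prompt)
    for _ in range(max_new):
        out.append(target_next(out))
    return out


if __name__ == "__main__":
    prompt = CYCLE + ["def"]
    tokens, passes = speculative_generate(toy_target, prompt, max_new=12)
    print(" ".join(tokens[len(prompt):]))
    print("verification passes:", passes)
```

Because a draft token is kept only when it equals what the target would have picked, the output is identical to plain greedy decoding; the gain is fewer target passes when the drafts are good. On repetitive text like this toy cycle the drafts are almost always accepted, which matches the thread's point that prompt lookup helps with code and JSON but not with creative writing. Swapping `ngram_draft` for a small draft model, MTP heads, or a block drafter changes only where the draft comes from, not the verify step.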
AJ tweet media
4
7
52
8.7K
Google Gemma
Google Gemma@googlegemma·
Gemma 4 just got even faster! We're releasing Multi-Token Prediction (MTP) drafters that deliver up to a 3x speedup, without any degradation in output quality or reasoning logic.
GIF
99
356
3.4K
205.1K
evan conrad
evan conrad@evanjconrad·
San Francisco Compute is growing rapidly & we're hiring across the board for our systems engineering, data center development, & (product & brand) design teams.
We're the local supercomputing company. We sell people GPU clusters on contracts they can sublease. SFC's goal is to reduce the financial risk of one of the largest infrastructure build-outs in history.
To do that, we vertically integrated. That means we build data centers, the clusters in the data center, and a cloud platform that we built on top of the most order book of its kind. This lets you do cool stuff like "buy a 1 month contract 3 months out, but only if I can get it colocated, and only if the price is at a 25% discount to the current market price." You can walk in the door on Friday, buy a 3-year contract, and then walk out the door on Monday by selling the whole thing. In other words, we build the cloud for people who care about margins & their risk exposure.
We did that because SFC was originally "Junelark" (a teeny tiny 2-person AI lab), which bought too big of a GPU cluster & was forced to sublease it. The first year of the company was tremendously stressful because if we didn't sell the cluster, we'd go bankrupt. This forced us to become a very rough accidental cloud. We'd operate on top of other clouds, but ran out of folks who would give us access to key parts of the cluster (like BMC, UFM, & switch access) needed to offer a viable experience. To build something great, we vertically integrated down and down until we hit the dirt.
These days, I like to operate the company somewhat quietly. Our website's a single page (we may change this). We don't show up on VC market maps and we're not in the news much. I hope this doesn't deter you; SFC operates at very large scale & has been growing at an incredible pace. We're just very focused on standing up clusters & shipping features that help our customers.
Our team includes industry veterans, like the cofounder of Voltage Park, key folks from Tesla, Meta, Lambda, Red Hat, Hut 8, Canonical, & Sun. We'd love for you to join us! My DMs are open, or you can reach me at evan at sf compute dot com.
7
22
300
43.3K
left pocket cheesecake
left pocket cheesecake@TraeMurray24·
@menhguin What memory stocks are you in? Been looking at those, but they've pumped so much in the last year I'm not sure how much juice is left
0
0
0
130
Minh Nhat Nguyen
Minh Nhat Nguyen@menhguin·
new positions in q2: seed investment + small 2-3% allocation in solana and worldcoin (agentic payments)
current positions, might rebalance:
40% memory and AIXA and intel calls
15% zai and minimax
15% PLTR puts
10% nuclear (oklo)
10% aerospace (fly and rklb)
10% optics/photonics
4
1
27
3.5K
Minh Nhat Nguyen
Minh Nhat Nguyen@menhguin·
fyi, nowadays im busy so i just have openclaw automations+deep research tracking @zephyr_z9 and @aleabitoreddit for new positions. up about ~60% YTD mostly from existing positions: memory stocks, intel calls, palantir puts, zai and minimax shares all of which are up ~100-200%.
Minh Nhat Nguyen@menhguin

Leopold's having fun, so here's my AI Safety twink portfolio. Total 1-year return: +892%. Criteria: Product advances human civilisation + good team. 50% Oklo (+1700%) 45% Tesla (+87%) 5% Nvidia (+55%)

15
15
582
83.1K
Andy Wood
Andy Wood@heyimandy·
Quick weekend 3D print: 4 of these cups to hold my daughter's art supplies on her art easel. I think every household should have a basic 3D printer.
Andy Wood tweet media
6
0
25
12.3K
Jack Morris
Jack Morris@jxmnop·
Learning to write kernels might be the highest-ROI activity for displaced SWEs:
→ prereq: reasonable engineering ability
→ six to twelve months of study
→ millions of dollars, mark zuckerberg showing up at your house to hire you, etc.
i wish this were an exaggeration
41
63
1.9K
124.1K
Robinhood
Robinhood@RobinhoodApp·
You deserve a treat. Comment below and we may send you some merch.
14.7K
408
9K
2M
left pocket cheesecake reposted
cowboy
cowboy@nextokens·
Here Comes Another Bubble (AI Edition)
113
487
3.7K
425.7K
left pocket cheesecake
left pocket cheesecake@TraeMurray24·
Saw a dad at the airport repping a delta t shirt like it's game day
0
0
1
42
Taco MacArthur
Taco MacArthur@Accenture69·
Hey man, I know you’re trying to do your job, but I didn’t come to Costco to sign up for AT&T internet
1
0
2
57
NVIDIA GeForce
NVIDIA GeForce@NVIDIAGeForce·
🟢 GEFORCE DAY IS BACK 🟢 To celebrate, we're giving away TWO GeForce RTX 5080 Founders Edition GPUs, signed by NVIDIA CEO Jensen Huang. Want one? Comment "GeForce Day" for a chance to WIN & stay tuned for more!
NVIDIA GeForce tweet media
57.7K
3.5K
47.1K
5.9M
left pocket cheesecake
left pocket cheesecake@TraeMurray24·
Getting ready to board a cross country flight with a toddler
left pocket cheesecake tweet media
0
0
1
34
left pocket cheesecake
left pocket cheesecake@TraeMurray24·
@tekbog Me at last job deploying nvidia gpu operator onto eks nodes with preinstalled drivers
left pocket cheesecake tweet media
0
0
7
44
terminally onλine εngineer
MLOps sounds cool until you discover you just deal with cuda versions, python package versions and supported gpus
terminally onλine εngineer tweet media
15
1
158
4.2K