stableAPY.hl

0

1

46

stableAPY.hl retweetledi

stableAPY.hl@stableAPY·1d

I still can't get my head around the fact that my 3060 12gb is running Qwen 3.6 35B at 40 tok/s this card is around 200$ second-hand while everyone's shilling super expensive 128gb unified memory or RTX 6000 cards, one 3060 12gb can be far from enough for first local AI experiments it's cheap, and paired with a bit of RAM and a somehow decent CPU it'll do the work of course decode drops when context grows and you won't be able to run multiple sub agents, but it's a nice cheap entry point for example, it pairs very well with my 3090: > 3090 running main agent 35B -np 2 => so I can have 2 concurrent agent > 3060 sub-agent 35B -np1 this way my main hermes can delegate work to this sub agent while working on something else I also run Hermes cron job so they does not overload the main agent and I don't mind it behind slower because it happens in the background

English

18

11

155

10.3K

stableAPY.hl@stableAPY·17h

@JakeKAllDay @ItsmeAjayKV @UnslothAI Ty I'll

English

1

23

Jake@JakeKAllDay·21h

@stableAPY @ItsmeAjayKV @UnslothAI Try the 4 k M version, it seems to be much faster. Have consistently seen IQ cause slower dequant even at smaller model sizes (6 K M also good too)

English

0

2

38

stableAPY.hl@stableAPY·1d

it seems that Unsloth is all in on MTP (Multi-Token Prediction) they released Qwen 3.5 0.8B, 2B, 4B and 9B MTP versions looks like I'll have new stuff to test out on my 3060 huggingface.co/unsloth/Qwen3.…

English

6

35

262

12.3K

stableAPY.hl@stableAPY·1d

@Russian_ikigai not the MTP on this screen but had some good tests today for MTP on the 3060 might use it as the "prod" setup

English

94

ал@Russian_ikigai·1d

@stableAPY Это MTP версия? - она еще быстрее

Русский

0

99

stableAPY.hl@stableAPY·1d

@jonorozcoc I find it very good! i've also tried gemma, but not much real coding test, more benchmarks : x.com/stableAPY/stat…

After testing out Qwen 3.6 35B on my 3060, it's Gemma 4 26B A4B UD-IQ4_XS's turn sadly it seems that it requires more CPU offloading than Qwen, leading to lower prefill and decode speeds while it felt counterintuitive, it's actually pretty logical even though Gemma is 26B vs 35B for Qwen, the 4B active parameters lead to higher CPU offload I'll still try it out to see how good the model is. I've been hyperfocused on Qwen and it's time to see what other labs have to offer

English

1

157

Jonatan Orozco@jonorozcoc·1d

@stableAPY How good it is? Have you tried with Gemma 4?

English

0

130

stableAPY.hl@stableAPY·1d

@darth_turnip try ik_llama.cpp as a start, then try their MTP maybe x.com/stableAPY/stat…

small GPU owners, 3060's fellows, 8 or 12gb VRAM chads, > if you're using llama.cpp you should check out ik_llama.cpp it gives about 10% better performances at 128k total context running Qwen3.6-35B-A3B IQ4_XS on a 3060 12gb 31,7 tok/s at 50% of context is not that bad for a subagent for local hermes setup github.com/ikawrakow/ik_l…

English

21

Charlie 🇬🇧@darth_turnip·1d

@stableAPY 12GB VRAM. Various back ends, Ollama, Kobold. I have 64GB of system RAM.

English

0

26

stableAPY.hl@stableAPY·1d

@darth_turnip I think behind Q3_XL it's pretty degraded I use IQ4_XS, what version of the 3060, 8 or 12gb? What backend? kaitchup.substack.com/p/summary-of-q…

English

0

1

24

Charlie 🇬🇧@darth_turnip·1d

@stableAPY Qwen3.6-35B-A3B-UD-IQ2_M - What would you recommend?

English

0

23

stableAPY.hl@stableAPY·1d

@BitcoinComfy @georgecursor bought mine 750 eur and got another one for around 600, there are some opportunities, but I've alerts and stuff

English

1

18

Bitcoin Comfy@BitcoinComfy·1d

@stableAPY @georgecursor 3090 is around 900/950 (was half price 1y ago)

English

0

1

11

George Saoulidis@georgecursor·1d

It's about 700 euros second hand

I still can't get my head around the fact that my 3060 12gb is running Qwen 3.6 35B at 40 tok/s this card is around 200$ second-hand while everyone's shilling super expensive 128gb unified memory or RTX 6000 cards, one 3060 12gb can be far from enough for first local AI experiments it's cheap, and paired with a bit of RAM and a somehow decent CPU it'll do the work of course decode drops when context grows and you won't be able to run multiple sub agents, but it's a nice cheap entry point for example, it pairs very well with my 3090: > 3090 running main agent 35B -np 2 => so I can have 2 concurrent agent > 3060 sub-agent 35B -np1 this way my main hermes can delegate work to this sub agent while working on something else I also run Hermes cron job so they does not overload the main agent and I don't mind it behind slower because it happens in the background

English

0

1

87

stableAPY.hl@stableAPY·1d

@LenSeaside yep a PR branch

English

3

237

Len Seaside@LenSeaside·1d

@stableAPY Are they still asking us to load a special fork of llama.cpp?

English

0

2

284

stableAPY.hl@stableAPY·1d

@khkh656fhff idk if lmstudio has MTP tho

English

1

201

Marla Memes@khkh656fhff·1d

@stableAPY Doesn't work in lmstudio

English

0

223

stableAPY.hl@stableAPY·1d

@darth_turnip Running which model? Which quant?

English

0

95

Charlie 🇬🇧@darth_turnip·1d

@stableAPY I got 17 TKS on my 3060. Not sure why.

English

0

106

stableAPY.hl@stableAPY·1d

@EditorEnBici What quant are you using?

English

0

104

CEO del Socialismo de Mercado🌹🕊️ 市场社会主义CEO@EditorEnBici·1d

@stableAPY Mine was kinda dumb 😂 Have to test it more, just started yesterday.

English

0

112

stableAPY.hl@stableAPY·1d

@HYPEconomist seems like we already had a stablecoin

English

1

94

stableAPY.hl retweetledi

HYPEconomist | Theo Arc@HYPEconomist·1d

coinbase announcement summary: > expected to add ~$150M in annual revenue (~25% increase) > coinbase will use builder codes to relaunch perps on its platform despite the news, hyperliquid:native is only up 4% today

English

28

12

242

17K

stableAPY.hl retweetledi

AJ@ItsmeAjayKV·4 May

The one big thing I'm waiting for in llama.cpp (shout out to @ggerganov ) right now is MTP. With Qwen 3.6 (my fav model) already supporting it, we are going to see massive improvements in generation speed once it's fully merged. So, what exactly is MTP? It stands for Multi-Token Prediction. If you understand speculative decoding, this is the next level. Instead of relying on a smaller, separate draft model, MTP is built directly into the model during its initial training. The main model simply produces draft tokens on its own auxiliary heads that allow it to naturally output multiple future tokens simultaneously. It's leaner, faster, and incredibly efficient for local hardware. How is it different from other methods. Well lets go over them in brief. 1. Standard Speculative Decoding (Draft models) You load two models into memory: the big target model (e.g 35B) and a tiny fast draft model ( < 2B) from the same family. The small draft model runs ahead, generating 4 or 5 tokens sequentially. The massive target model then does a single forward pass to check the drafts math. Pros: Consistent speedups across workloads. Cons: Eats more VRAM, if i talk about my 3060 case, where i try to squeeze a heavy model into 12GB of VRAM, sacrificing a GB or two just to host a draft model can be a painful trade-off. 2. n-gram speculative decoding (prompt lookup) This one is interesting, different idea than draft, n-gram decoding simply looks at the text already in the "prompt" and guesses that it will be repeated (which is also its biggest issue). Good for coding, JSON formatting, or even rag. Pros: Zero VRAM overhead. Nothing extra to load. Good speedup for above mentioned tasks. Cons: Very situational. For creative writing it fails miserably and offers almost no speedup. 3. DFlash (Block diffusion drafting) DFlash replaces traditional autoregressive draft model with a lightweight block diffusion model. Instead of guessing tokens sequentially, DFlash generates an entire block of tokens in parallel in a single forward pass. It achieves this by pulling hidden state features directrly from the target model andusing them as context to denoise a block of next tokens immedietly. Pros: Super fast, by removing sequential bottleneck of drafting phase, this can achieve high loseless acceleration. Cons: Nothing much actually, it does requires specialized checkpoints trained specifically to align with the target model. Also take a look at LuceBox-hub D-Flash and P-flash by @davideciffa

English

4

7

52

8.6K

stableAPY.hl@stableAPY·1d

@ElkaFirmanda 🤝

QME

3

13

Elka Firmanda@ElkaFirmanda·1d

can confirm, this is true, i use my 3060 for Qwen 3.6 35B and got around 44 tok/s

I still can't get my head around the fact that my 3060 12gb is running Qwen 3.6 35B at 40 tok/s this card is around 200$ second-hand while everyone's shilling super expensive 128gb unified memory or RTX 6000 cards, one 3060 12gb can be far from enough for first local AI experiments it's cheap, and paired with a bit of RAM and a somehow decent CPU it'll do the work of course decode drops when context grows and you won't be able to run multiple sub agents, but it's a nice cheap entry point for example, it pairs very well with my 3090: > 3090 running main agent 35B -np 2 => so I can have 2 concurrent agent > 3060 sub-agent 35B -np1 this way my main hermes can delegate work to this sub agent while working on something else I also run Hermes cron job so they does not overload the main agent and I don't mind it behind slower because it happens in the background

English

2

0

2

115

stableAPY.hl@stableAPY·1d

@0xasrequired "5) coinbase gives up on fighting hyperliquid and kisses the ring, as required." > my favorite part

English