Punch Taylor
@Punch_Taylor
6.2K posts

🪨🇺🇸 🦅 6 finger patriot 👊🏻 317. pro-community. anti-communist. politics. ai. video games 🎮. pro #1A 💬. pro #2A 🔫.

Joined January 2024
2.7K Following · 2.3K Followers
Punch Taylor retweeted
Eric ⚡️ Building...@outsource_·
My 4090 went from 26 -> 154 tok/s on Qwen 3.6 27B 🤯 Same GPU. Same Q4_K_M. No FP8, no extra quant. The unlock: ik_llama.cpp + speculative decoding using Qwen3-1.7B as the draft model. 85% acceptance rate. Full config + benchmarks 👇🏻
69 replies · 134 reposts · 1.5K likes · 97.3K views
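For anyone trying to reproduce the setup Eric describes, a minimal sketch of a speculative-decoding launch (flag names are mainline llama.cpp's; ik_llama.cpp is a fork whose flags may differ, and both .gguf filenames here are illustrative, not confirmed paths):

  # the small draft model proposes runs of tokens; the 27B target verifies them in one
  # batched pass, so the gain depends on the ~85% acceptance rate claimed above
  ./llama-server -m Qwen3.6-27B-Q4_K_M.gguf -md Qwen3-1.7B-Q4_K_M.gguf \
    -ngl 99 -ngld 99 -fa on --draft-max 16 --draft-min 1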
Punch Taylor retweeted
Sudo su@sudoingX·
hey if you are running new qwen 3.6 27b dense on an rtx 4090 read this carefully, it could save you a few hours of head scratching.

@Punch_Taylor ran my exact flags on 4090 wsl2 ubuntu cuda 13.2, three warm runs on q4_k_m. average landed at 43.1 tok/s, 8.3 percent above my 3090 baseline of 39.82. that delta tracks the memory bandwidth gap almost perfectly, 1008 gb/s on 4090 vs 936 gb/s on 3090. the math is honest, the speed bump is architecture level, not magic.

vram at 262k context q4_0 kv cache is tight at 23 out of 24 gigs. wsl2 + cuda driver reserves eat about 2 gigs of headroom. if you are on bare metal linux you get that back, punch estimates 45 to 48 tok/s range for native runs.

also flagging a real world cost. a single youtube tab in chrome drops his numbers to 39.9 tok/s, roughly 7-8 percent throughput loss from browser scheduling on wsl. close everything before measuring, especially on daily driver machines.

now the community call. what are amd users getting on halo strix, tinygrad on 7900 xt, or any other consumer chip on the same model + same flags? drop your numbers, i'll stack them into the community chart tonight. bandwidth data across architectures is the content the major labs never publish.
Punch Taylor@Punch_Taylor

4090 datapoint, WSL2 Ubuntu CUDA 13.2, your exact flags + Q4_K_M:

  ./llama-server -m Qwen3.6-27B-Q4_K_M.gguf -ngl 99 -c 262144 -np 1 -fa on --cache-type-k q4_0 --cache-type-v q4_0

three warm runs on "yo" with thinking auto, system fully idle:
- run 1: 42.83 tok/s
- run 2: 43.18 tok/s
- run 3: 43.33 tok/s
- avg ~43.1 tok/s

VRAM at 262k provisioned: 23.0GB / 1.1GB free of 24GB. tighter than your 21/3 split — WSL2 + cuda driver reserves eating ~2GB of headroom. native linux would likely give that back.

so 4090 + WSL2 = +8.3% over your 3090 native baseline. roughly tracks the bandwidth gap (1008 vs 936 GB/s). bare metal linux on a 4090 should land higher still — would estimate 45-48 tok/s range for someone running native.

side observation worth flagging: a single youtube tab in chrome dropped these numbers to ~39.9 tok/s in earlier runs. ~7-8% throughput cost from the browser competing for CPU/scheduling on the WSL side. anyone running this on a daily-driver PC should close everything before measuring.

10 replies · 5 reposts · 152 likes · 15.3K views
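A quick sanity check of the "tracks the bandwidth gap" arithmetic above, using only figures quoted in the thread (1008 and 936 GB/s are the two cards' published memory-bandwidth specs):

  # expected speedup if generation is purely memory-bandwidth-bound:
  #   1008 / 936   ≈ 1.077  (+7.7%)
  # measured across the thread's averages:
  #   43.1 / 39.82 ≈ 1.082  (+8.2%, reported as 8.3% above)
  python3 -c 'print(1008/936, 43.1/39.82)'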
Punch Taylor retweeted
Alec Lace@AlecLace·
🚨 Joe Biden repeatedly called white supremacy the most dangerous, most lethal, greatest terrorist threat to America. Turns out the SPLC was funding it and Biden shut down the investigation into the SPLC. Everything was staged. The whole narrative was a hoax.
1.8K replies · 23K reposts · 66.9K likes · 511K views
Punch Taylor@Punch_Taylor·
No worries. I'm new too. Native Windows CUDA binary is probably slightly faster — WSL2 adds a thin GPU virtualization layer (I measured ~8% penalty when the host wasn't idle, ~0% when fully idle). For pure tok/s on a single model, native is simpler.

I went WSL2 because sudo's command was Linux-shell, and most llama.cpp tutorials/scripts assume bash. Easier to reproduce published numbers that way.

If you just want to run models: grab the prebuilt Windows CUDA release from the llama.cpp GitHub and you're done. If you also want to do other Linux dev stuff: WSL2.
0 replies · 0 reposts · 1 like · 343 views
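A minimal sketch of the two paths described above (the model filename follows the thread's naming and is illustrative; the exact prebuilt release asset to grab varies by CUDA version):

  # native Windows: unzip the prebuilt CUDA release from the llama.cpp GitHub, then
  llama-server.exe -m Qwen3.6-27B-Q4_K_M.gguf -ngl 99 -fa on
  # WSL2: the same server and flags from the Ubuntu shell, matching published bash commands
  ./llama-server -m Qwen3.6-27B-Q4_K_M.gguf -ngl 99 -fa on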
Sudo su@sudoingX·
this was supposed to be a normal evening, then i saw on the timeline that qwen 3.6 27b dense q4 weights from unsloth are live and i could not sit still. compiled llama.cpp with cuda on the single rtx 3090 at 2am from bangkok, launched with the exact same flags that crowned 3.5-27b dense the undisputed king six weeks ago. q4_k_m, 262k context, q4_0 kv cache, flash attention on, single slot, no quant tricks, no dynamic ggufs, no turbo, just the straight cut to get a clean baseline.

first pass said "yo" to the model as a warmup. it ran a six step thinking chain to formulate "yo what's up how can i help you today". full reasoning visible in the web ui. thinking mode goes hard, even for a greeting.

the number improved. 39.82 tokens per second on the first real generation. march baseline on this exact hardware was 35.3 flat across every context size. that is a 13 percent speed bump. same card, same quant, same every flag, only the model changed. pure model level efficiency on ampere. the model is actually faster at the token level on consumer silicon.

262k context fills 21 gigs of the 24. three gigs headroom for prompt fill. fresh session, zero cache, honest baseline.

next i am pushing context, probing the vram ceiling, finding the sweet spot on this card. then autonomous agent tasks on hermes agent using the same prompt that 3.5 dense one-shotted in march. same octopus invaders test, same hermes agent harness, same single 3090 hardware, one model against the ghost of its predecessor. the king might be changing hands.
Sudo su@sudoingX

fuck it i am pulling the weights right now. cannot sit still since qwen 3.6-27b dense dropped two hours ago and @UnslothAI just put the dynamic ggufs live, 18gb ram footprint, that fits my rtx 3090 24gb. they moved faster than me, that is fine, the open source machine is working.

here is what has me restless. the chart says a 27 billion parameter open weight model matching claude 4.5 opus on terminal-bench 2.0 at 59.3 flat, beats claude on skillsbench, gpqa diamond, mmmu, and realworldqa. opus 4.5 level agentic intelligence on your single rtx 3090 24gb vram tier. if that chart survives first contact with real hermes agent runs on my hardware, the best model for single consumer gpu just changed in the middle of my sprint.

my benchmark is the only voice that matters to me. same hermes agent harness, same quant, head to head against 3.5-27b dense which has held the 3090 crown for weeks. i settle it on my cards or not at all.

pulling now. benchmarking tonight if i can stay awake long enough. you have no idea how restless this makes me. if you see numbers on your timeline before morning, the chart held. if you don't, i crashed and data drops first thing. this is what open source looks like when the whole chain moves same day.

19 replies · 8 reposts · 243 likes · 29.5K views
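For anyone reproducing the 2am build above, a minimal sketch of a CUDA build of llama.cpp plus the weight pull (standard upstream build steps; the Hugging Face repo and file names are hypothetical, patterned on the thread's naming):

  git clone https://github.com/ggml-org/llama.cpp && cd llama.cpp
  cmake -B build -DGGML_CUDA=ON        # CUDA backend; needs the CUDA toolkit installed
  cmake --build build --config Release -j
  # hypothetical repo/file names matching the quant discussed in the thread
  huggingface-cli download unsloth/Qwen3.6-27B-GGUF Qwen3.6-27B-Q4_K_M.gguf --local-dir .
  # launch with the thread's exact flags: 262k context, q4_0 kv cache, flash attention
  ./build/bin/llama-server -m Qwen3.6-27B-Q4_K_M.gguf -ngl 99 -c 262144 -np 1 -fa on \
    --cache-type-k q4_0 --cache-type-v q4_0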
Punch Taylor@Punch_Taylor·
still disagree. AI art will eventually become a standard while high quality art made by a human will still be a luxury. mediocre artists will either need to adapt, improve their business model, and put business before personal opinions, or be replaced by a bot with no preferences and only results.
1 reply · 0 reposts · 0 likes · 28 views
blep@PersonUnnamedno·
@Punch_Taylor @reddit_lies I'm not saying AI is bad on the whole. I'm saying AI art is soulless slop and you won't get far using it. Pick up a pencil or pay someone who will.
1 reply · 0 reposts · 0 likes · 35 views
Punch Taylor@Punch_Taylor·
brother, look around you. businesses of all kinds are already using AI, even your phone that you are staring at, and people still want them. why? because people don’t give a shit as long as it works and looks good. and like humans, AIs can produce slop or something genuinely aesthetic if you work with it - kind of like every other medium. also here’s a link to a repo of an AI i built for fun since you called me lazy. github.com/TaylorSh1ft/Ph…
2 replies · 0 reposts · 0 likes · 16 views
blep@PersonUnnamedno·
@Punch_Taylor @reddit_lies You won't go far if nobody wants your business lmao, AI slop underperforms actual talent by a hell of a lot. You'd know that if you practiced what you preached but you're too lazy to even do that.
1 reply · 0 reposts · 1 like · 40 views
Punch Taylor@Punch_Taylor·
hence “unapologetically”. because anyone who wants to go far won’t do it by listening to a bunch of losers online who don’t like how you choose to get there. i’d much rather take AI slop on the cheap than genuine slop from unoriginal and mediocre “artists” at far more than it’s worth, or really good art from an artist who is unbearable to interact with. 🤷🏻‍♂️
1 reply · 0 reposts · 0 likes · 31 views
blep@PersonUnnamedno·
@Punch_Taylor @reddit_lies I mean you can use AI art if you want, I'm not stopping you, but just remember that you're not owed respect and people can hate your soulless slop all they want.
1 reply · 0 reposts · 1 like · 32 views
Punch Taylor@Punch_Taylor·
and how do you feel about the smartphone turning everyone into “photographers”? honestly, i find your argument to be lazy. because if one can design and render a Live2D model using the tools at their disposal, then that is a skill in itself. it’s like being mad at a carpenter who uses a nail gun instead of hiring a team of people to hammer in the nails.
0 replies · 0 reposts · 0 likes · 22 views
Grongus 2.0@PunishedChode·
@Punch_Taylor @reddit_lies AI art is dogshit and the data centers necessary to keep them operating are harming the environment and the economy.
2 replies · 0 reposts · 5 likes · 59 views
Punch Taylor retweeted
Savanah Hernandez@Savsays·
The fact that the Ostroushko family is on a press tour instead of sitting in JAIL is infuriating to me. Paige Ostroushko is literally on Instagram bragging about how she has faced ZERO consequences so far. Demoralizing.
1.2K replies · 7.1K reposts · 36.7K likes · 248.4K views
Punch Taylor retweeted
indy reporter@Indy_reporter_·
Want to see something WILD? A non profit organization received almost $32 million in grants. Who did they give some of the grants to for "research"?
- Almost $12k to a distillery
- $215k for 2 different butchers
- $75k to an autoshop
- $75k to an Indonesian Grill
- $50k for a cabinet shop
All in the name of......"research"
What's the name of the non-profit? NINETWELVE INSTITUTE INC
If you get bored google "chad pittman 3 kings"
35 replies · 121 reposts · 322 likes · 11.5K views
Punch Taylor retweeted
Wall Street Apes@WallStreetApes·
An American is staying at an Airbnb in Indianapolis. The crime must be bad in the neighborhood because the outdoor air conditioners are chained down with huge locks to prevent theft. I looked it up: Democrats hold a supermajority in the Indianapolis City County Council. Of course…. We don’t have to live like this. Stop voting Democrat
172 replies · 787 reposts · 3.4K likes · 142K views