azzurro

3.5K posts

azzurro

@therealazzurro

Nerd. Shitposting all day long. Not Russian. Cloud Insultant.

Katılım Nisan 2021

188 Takip Edilen40 Takipçiler

azzurro@therealazzurro·4h

@geerlingguy ai gonna take my job writing comments wasting people's neurons

English

Jeff Geerling@geerlingguy·6h

My kingdom for a way to automatically filter all comments on X, GitHub, Bsky, Reddit, blogs, etc. that were authored by an LLM so my brain doesn't have to waste any neurons doing it.

English

189

7.5K

azzurro@therealazzurro·1d

@populartourist i just had qwen3-coder MoE analyze and fix something that gemma 4 MoE would chase its own tail on over and over and over again. it just kept looping.

English

148

wd 🔺@populartourist·1d

I remember when Qwen3 30B-A3 Coder was the hype, and Devstral Small 2507 variant wasn't, and yet it beat the crap out of it in real work without reasoning blocks and far fewer tokens, even against GPT-OSS-120B. Devstral Small 2 is the last local coding monster that didn't get its due upgrade (last December). Benchmarks seem to be heavily skewed to benchmaxxing and Mistral never did that. I also remember OpenAI's own GPT-OSS model cards saying it did over 60% on SWE Bench with High reasoning mode and yet they never gave the harness to replicate it. GPT-OSS-20B was rife with reasoning loops. Let that sink in.

English

3.8K

azzurro@therealazzurro·1d

@witcheer repost? 3.6 has been out for a WHILE

English

439

witcheer ☯︎@witcheer·1d

qwen 3.6 is out and here’s what you need to know before upgrading from 3.5: qwen3.6-27B is dense (all 27B params fire every token). runs on a single RTX 4090 or 24GB mac. 262K native context, extensible to 1M with YaRN. gets within 4 points of claude opus 4.6 on SWE-bench Verified. apache 2.0. qwen3.6-35B-A3B is MoE (only ~3B active per token). same model I recommended yesterday for the RTX 4060 Ti + 32GB RAM setup. 128K context. two things to watch: 1. qwen3.6 GGUFs don’t work in ollama yet. the vision model needs separate mmproj files that ollama doesn’t handle. use llama.cpp, unsloth studio, or vLLM instead. if you set up qwen3.5-9B via ollama yesterday, keep it running. it works. upgrade to 3.6 when ollama support lands. if you’re on nvidia CUDA 13.2, don’t run qwen3.6. you’ll get gibberish output. nvidia is working on a fix. 2. for mac users: unsloth uploaded dynamic 4-bit MLX quants. qwen3.6-27B runs on 18GB unified memory. qwen3.6-35B-A3B runs on 22GB. if you have the M4 pro with 24GB+, the 27B dense model is now your best local coding model. stay on qwen3.5-9B via ollama if: you have 16GB, you want zero friction, or you need it working today. upgrade to 3.6 via llama.cpp if: you have 24GB+, you want coding performance close to frontier, and you’re comfortable with manual setup.

English

124

13.3K

azzurro@therealazzurro·1d

@sudoingX where's 16 and 32 gigs at 😭

English

Sudo su@sudoingX·1d

drop your vote on the next benchmark sweep. which vram tier should i test local ai models + tool calls on?

English

19.3K

azzurro@therealazzurro·1d

@wbic16 they run on glue fumes maybe

English

will bickford@wbic16·2d

100 MHz was more than adequate to run a GUI. At 100 fps, we had a budget of 1 million cycles per frame. At 4 GHz with 8 cores, no UI task should ever take more than 1 ms. We have 320 MHz per frame available. What the actual fuck is Microsoft doing these days?!

Dave W Plummer@davepl1968

I worked on the XP run dialog. I'm a grizzled old man now, barely recognizable in the mirror, but even I think 94ms is a long-assed time to wait for a dialog to open.

English

124

333

5.5K

206.5K

azzurro@therealazzurro·1d

@vmiss33 need a quite decent CPU though. Broadwell Xeons ain't cutting the mustard here. You're basically running parts of inference on your CPU there.

English

vmiss@vmiss33·2d

My RTX 4070 thanks you for sharing this.

AboveSpec@above_spec

"You need a 24 GB GPU for serious local LLMs in 2026." Everyone repeats this. It's not true anymore. Just ran a 35B-parameter model on an RTX 4060 Ti 8 GB: • 41 tok/s at 16k context • 24 tok/s at 200k context Recipe + benchmarks below 🧵

English

5.5K

azzurro@therealazzurro·1d

@TechnoTimLive yes, that i knew.

English

Techno Tim@TechnoTimLive·1d

@therealazzurro Also, kernel 7 is what Ubuntu 26.04 ships with

English

Techno Tim@TechnoTimLive·2d

Just a heads up if you are updating Proxmox to kernel 7.0.0-3-pve. Some LXCs might not be compatible with that kernel version. This is the first time I have run into this but figured I would mention it in case you are wondering why some of your LXC services might be crashing after that update. VMs of course are not affected because they are fully isolated. To be clear you should update to the latest kernel, just be sure to check your LXCs afterwards.

English

6.7K

azzurro@therealazzurro·3d

@sudoingX R9700 and Arc Pro B70 🫠

English

Sudo su@sudoingX·3d

what gpu runs your local llm? drop your tier. let's see who's winning the battle ground in local ai.

English

134

20.7K

azzurro@therealazzurro·4d

@stupidtechtakes fedoraaaaaaaaaaaaa #engagementfarming

Español

stupid tech takes@stupidtechtakes·5d

i might try out linux again again, what distro is supposed to be good?

English

434

729

43.4K

azzurro@therealazzurro·4d

@JusseSav @Lexcyn @linusgsebastian @Snapdragon but it would still have WIndows on it.

English

Monsterix@JusseSav·4d

@therealazzurro @Lexcyn @linusgsebastian @Snapdragon You can get a W11 laptop 16" with new Snapdragon X2 Elite Extreme with 48GB RAM at $1699. It comes with touchscreen 120Hz OLED, 1000 nits brightness. You could get this + base MacBook Air M5 and still have left over money vs MacBook Pro 16" with 48GB RAM.

English

159

Devin Arthur@Lexcyn·6d

Sorry @linusgsebastian but I disagree that Windows on ARM can't take on Apple's M-series and the Neo. The @Snapdragon X series still provides the *best* experience of Windows IMO (responsiveness, fluidity, battery life, etc) compared to x86.

English

19.7K

azzurro@therealazzurro·5d

@Lexcyn @linusgsebastian @Snapdragon like IF you gonna go down a route that might introduce any incompatibilities, you might aswell just buy a macbook

English

164

Devin Arthur@Lexcyn·5d

@linusgsebastian @Snapdragon That's fair - and I think one reason is lack of competition in the ARM space. We need another player like NVIDIA (or even AMD) to make their own ARM chip, which I think would tip the scale

English

1.2K

azzurro@therealazzurro·5d

@Lexcyn @linusgsebastian @Snapdragon Windows is the problem.

English

114

azzurro@therealazzurro·6d

@chmousset @eevblog he shits you not

English

Chmouss@chmousset·6d

@eevblog Use a salad bowl and thin-foil to create a makeshift parabola, toss the 5g antenna in its center and point to the nearest GSM tower

English

Dave Jones@eevblog·6d

Better 5G hotspot search time

English

azzurro@therealazzurro·6d

@sudoingX does quality not win, though?

English

105

Sudo su@sudoingX·6d

"how do you fit qwen 3.6 27b q4 on 24gb at 262k context" lands in my dms 5 times a week. here is the exact memory math. model bytes at idle = 16gb (q4_k_m of 27b dense) kv cache at 262k context with q4_0 for both k and v = 5gb total = 21gb on the card headroom = 3gb for prompts and tool call traces the magic is the kv cache type. most people leave it at default fp16 or push to q8 thinking quality wins. on qwen 3.6 27b dense at 262k: - fp16 kv cache = does not fit at all - q8 kv cache = fits at 23gb but runs 3x slower (double penalty: more vram, less speed) - q4_0 kv cache = fits at 21gb at full speed (40 tok/s flat curve, same speed at 4k or 262k) most builders never test the kv cache type because tutorials never mention it. it is the single biggest unlock on consumer 24gb hardware. flags i run: ./llama-server -m Qwen3.6-27B-Q4_K_M.gguf -ngl 99 -c 262144 -np 1 -fa on --cache-type-k q4_0 --cache-type-v q4_0 what they do: -ngl 99 = offload everything to gpu -c 262144 = 262k context window -np 1 = single user slot (do not enable multi-slot, eats headroom) -fa on = flash attention on (memory and speed both win) --cache-type-k q4_0 --cache-type-v q4_0 = the unlock if you are sitting on 24gb and not running this config, you are leaving 250k of context on the table. or worse, you are running q8 kv cache and burning 3x your speed for nothing. q4 is not a compromise on consumer hardware. it is the right call.

English

110

1.3K

73.1K

azzurro@therealazzurro·6d

@davepl1968 @lauriewired the bend in the led strip in the upper right corner would drive me nuts though

English

Dave W Plummer@davepl1968·27 Nis

@lauriewired Here's where the magic happens :-). I would have cleaned up if I knew people were coming by, but...

English

LaurieWired@lauriewired·27 Nis

Every era has had a “1% computer nerd” setup. I like to think, what would a homelab person look like in previous generations? Today: Server Rack, GPU lab, Home Assistant, NAS 1990s: Linux or BSD server, Sun, maybe SGI workstation, Web + email hosting 1980s: BBS admin, dot matrix printer, Also home automation (X10!) 1970s: Teletype, Altair, maybe homebrew computer clubs? I’m sure I’m missing some, if you lived during any of these eras, I’d be super curious what the 1% hobbyist looked like.

English

994

42.4K

azzurro@therealazzurro·27 Nis

@pupposandro would love to see something like this for Intel Arc 🥲

English

157

Sandro@pupposandro·27 Nis

89.7 tok/s with Qwen3.6-27B at 60K context on a single RTX 3090. 3.64x faster than full attention, 100% speculative acceptance. Just merged sliding window flash attention + two-phase cache into Luce DFlash. FA now attends to the last 2048 KV positions instead of the full 60K, decode jumps from 25 to 91 tok/s. Two-phase cache skips ~1.4 GB of rollback tensors during prefill, migrates them after. Freed enough VRAM to bump prefill ubatch from 192 to 384. Huge thanks to @dusterbloom for the PR, @davideciffa for the review. Repo in the first comment ⬇️

English

436

26.1K

azzurro@therealazzurro·26 Nis

@Anaya_sharma876 ubuntu for servers (for now), fedora for desktops

English

Anaya@Anaya_sharma876·25 Nis

Linux users be honest. Ubuntu or Fedora?

Indonesia

383

437

34.6K

azzurro@therealazzurro·26 Nis

@BrodieOnLinux they might be juuuuust a little bit retarded, but i don't know.

English

Brodie Robertson@BrodieOnLinux·26 Nis

I am fascinated by GNOME's choice to hide the log out button unless you're on a multi-user system or have multiple desktops, we must study how choices are made in this environment

English

1.2K

55.9K

azzurro@therealazzurro·26 Nis

@sudoingX how are the usage limits at wherever you are working with gpt-5.5? does only openai offer it atm?

English

375

Sudo su@sudoingX·26 Nis

lately opus 4.7 sounds so retarded next to gpt-5.5. i did not expect this but i am so back. so so fucking back baby

English

217

11.2K

Keşfet

@geerlingguy @populartourist @witcheer @sudoingX @wbic16 @vmiss33 @TechnoTimLive @stupidtechtakes