azzurro

3.5K posts

@therealazzurro

Nerd. Shitposting all day long. Not Russian. Cloud Insultant.

Joined April 2021
188 Following · 40 Followers
azzurro
azzurro@therealazzurro·
@geerlingguy ai gonna take my job writing comments wasting people's neurons
0
0
0
32
Jeff Geerling
Jeff Geerling@geerlingguy·
My kingdom for a way to automatically filter all comments on X, GitHub, Bsky, Reddit, blogs, etc. that were authored by an LLM so my brain doesn't have to waste any neurons doing it.
24
5
189
7.5K
azzurro
azzurro@therealazzurro·
@populartourist i just had qwen3-coder MoE analyze and fix something that gemma 4 MoE would chase its own tail on, over and over and over again. it just kept looping.
0
0
0
148
wd 🔺
wd 🔺@populartourist·
I remember when Qwen3 30B-A3 Coder was the hype and the Devstral Small 2507 variant wasn't, and yet Devstral beat the crap out of it in real work, with no reasoning blocks and far fewer tokens, even against GPT-OSS-120B. Devstral Small 2 is the last local coding monster that never got its due upgrade (last December). Benchmarks seem heavily skewed toward benchmaxxing, and Mistral never played that game. I also remember OpenAI's own GPT-OSS model cards claiming over 60% on SWE-bench with High reasoning mode, yet they never released the harness to replicate it. GPT-OSS-20B was rife with reasoning loops. Let that sink in.
6
2
22
3.8K
azzurro
azzurro@therealazzurro·
@witcheer repost? 3.6 has been out for a WHILE
1
0
1
439
witcheer ☯︎
witcheer ☯︎@witcheer·
qwen 3.6 is out and here’s what you need to know before upgrading from 3.5:

qwen3.6-27B is dense (all 27B params fire every token). runs on a single RTX 4090 or 24GB mac. 262K native context, extensible to 1M with YaRN. gets within 4 points of claude opus 4.6 on SWE-bench Verified. apache 2.0.

qwen3.6-35B-A3B is MoE (only ~3B active per token). same model I recommended yesterday for the RTX 4060 Ti + 32GB RAM setup. 128K context.

two things to watch:

1. qwen3.6 GGUFs don’t work in ollama yet. the vision model needs separate mmproj files that ollama doesn’t handle. use llama.cpp, unsloth studio, or vLLM instead. if you set up qwen3.5-9B via ollama yesterday, keep it running. it works. upgrade to 3.6 when ollama support lands. if you’re on nvidia CUDA 13.2, don’t run qwen3.6. you’ll get gibberish output. nvidia is working on a fix.

2. for mac users: unsloth uploaded dynamic 4-bit MLX quants. qwen3.6-27B runs on 18GB unified memory. qwen3.6-35B-A3B runs on 22GB. if you have the M4 pro with 24GB+, the 27B dense model is now your best local coding model.

stay on qwen3.5-9B via ollama if: you have 16GB, you want zero friction, or you need it working today.

upgrade to 3.6 via llama.cpp if: you have 24GB+, you want coding performance close to frontier, and you’re comfortable with manual setup.
15
12
124
13.3K
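For the llama.cpp path that post recommends, a minimal sketch of loading the 3.6 vision model with its separate mmproj file, assuming a llama.cpp build with server-side mmproj support (both filenames here are hypothetical placeholders for whatever GGUFs you actually downloaded):

./llama-server \
  -m Qwen3.6-27B-Q4_K_M.gguf \
  --mmproj mmproj-Qwen3.6-27B-F16.gguf \
  -ngl 99 -c 262144

The -m/--mmproj pair is exactly the part ollama can't handle yet; everything else matches a plain text-only launch.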
azzurro
azzurro@therealazzurro·
@sudoingX where's 16 and 32 gigs at 😭
0
0
0
40
Sudo su
Sudo su@sudoingX·
drop your vote on the next benchmark sweep. which vram tier should i test local ai models + tool calls on?
19
0
23
19.3K
azzurro
azzurro@therealazzurro·
@wbic16 they run on glue fumes maybe
0
0
1
59
will bickford
will bickford@wbic16·
100 MHz was more than adequate to run a GUI. At 100 fps, we had a budget of 1 million cycles per frame. At 4 GHz with 8 cores, no UI task should ever take more than 1 ms. That's 320 million cycles per frame available. What the actual fuck is Microsoft doing these days?!
Dave W Plummer@davepl1968·
I worked on the XP run dialog. I'm a grizzled old man now, barely recognizable in the mirror, but even I think 94ms is a long-assed time to wait for a dialog to open.
124
333
5.5K
206.5K
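The cycle math in that post, spelled out with bc (assuming all 8 cores could be driven perfectly in parallel, which no real UI workload achieves):

echo '100 * 10^6 / 100' | bc    # 100 MHz at 100 fps = 1,000,000 cycles per frame
echo '4 * 10^9 * 8 / 100' | bc  # 4 GHz x 8 cores at 100 fps = 320,000,000 cycles per frame

By the same math, the 94ms run dialog Plummer mentions costs roughly 0.094 x 32e9, about 3 billion cycles.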
azzurro
azzurro@therealazzurro·
@vmiss33 need a pretty decent CPU though. Broadwell Xeons ain't cutting the mustard here. You're basically running parts of inference on your CPU there.
0
0
0
44
Techno Tim
Techno Tim@TechnoTimLive·
Just a heads up if you are updating Proxmox to kernel 7.0.0-3-pve: some LXCs might not be compatible with that kernel version. This is the first time I have run into this, but figured I would mention it in case you are wondering why some of your LXC services might be crashing after that update. VMs of course are not affected because they are fully isolated. To be clear, you should update to the latest kernel, just be sure to check your LXCs afterwards.
4
7
67
6.7K
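One way to run that post-update container check, assuming stock Proxmox pct tooling on the host:

pct list                                           # VMID, status, and name for every LXC
for id in $(pct list | awk 'NR>1 {print $1}'); do
  pct status "$id"                                 # anything that should be "status: running" and isn't needs a look
done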
Sudo su
Sudo su@sudoingX·
what gpu runs your local llm? drop your tier. let's see who's winning the battleground in local ai.
134
6
82
20.7K
stupid tech takes
stupid tech takes@stupidtechtakes·
i might try out linux again again, what distro is supposed to be good?
434
4
729
43.4K
Monsterix
Monsterix@JusseSav·
@therealazzurro @Lexcyn @linusgsebastian @Snapdragon You can get a 16" W11 laptop with the new Snapdragon X2 Elite Extreme and 48GB RAM for $1699. It comes with a 120Hz touchscreen OLED at 1000 nits brightness. You could buy this plus a base MacBook Air M5 and still have money left over vs a MacBook Pro 16" with 48GB RAM.
2
0
2
159
Devin Arthur
Devin Arthur@Lexcyn·
Sorry @linusgsebastian but I disagree that Windows on ARM can't take on Apple's M-series and the Neo. The @Snapdragon X series still provides the *best* experience of Windows IMO (responsiveness, fluidity, battery life, etc) compared to x86.
10
0
35
19.7K
Devin Arthur
Devin Arthur@Lexcyn·
@linusgsebastian @Snapdragon That's fair - and I think one reason is lack of competition in the ARM space. We need another player like NVIDIA (or even AMD) to make their own ARM chip, which I think would tip the scale
4
0
2
1.2K
Chmouss
Chmouss@chmousset·
@eevblog Use a salad bowl and tin foil to create a makeshift parabola, toss the 5G antenna in its center, and point it at the nearest GSM tower
1
0
0
30
Dave Jones
Dave Jones@eevblog·
Better 5G hotspot search time
[image]
7
0
17
2K
azzurro
azzurro@therealazzurro·
@sudoingX does quality not win, though?
0
0
0
105
Sudo su
Sudo su@sudoingX·
"how do you fit qwen 3.6 27b q4 on 24gb at 262k context" lands in my dms 5 times a week. here is the exact memory math. model bytes at idle = 16gb (q4_k_m of 27b dense) kv cache at 262k context with q4_0 for both k and v = 5gb total = 21gb on the card headroom = 3gb for prompts and tool call traces the magic is the kv cache type. most people leave it at default fp16 or push to q8 thinking quality wins. on qwen 3.6 27b dense at 262k: - fp16 kv cache = does not fit at all - q8 kv cache = fits at 23gb but runs 3x slower (double penalty: more vram, less speed) - q4_0 kv cache = fits at 21gb at full speed (40 tok/s flat curve, same speed at 4k or 262k) most builders never test the kv cache type because tutorials never mention it. it is the single biggest unlock on consumer 24gb hardware. flags i run: ./llama-server -m Qwen3.6-27B-Q4_K_M.gguf -ngl 99 -c 262144 -np 1 -fa on --cache-type-k q4_0 --cache-type-v q4_0 what they do: -ngl 99 = offload everything to gpu -c 262144 = 262k context window -np 1 = single user slot (do not enable multi-slot, eats headroom) -fa on = flash attention on (memory and speed both win) --cache-type-k q4_0 --cache-type-v q4_0 = the unlock if you are sitting on 24gb and not running this config, you are leaving 250k of context on the table. or worse, you are running q8 kv cache and burning 3x your speed for nothing. q4 is not a compromise on consumer hardware. it is the right call.
85
110
1.3K
73.1K
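The 5gb kv cache figure is model-specific, but the general formula is easy to sanity-check: bytes = 2 (K and V) x layers x kv heads x head dim x context x bytes per element, where q4_0 packs 32 values into 18 bytes (~0.5625 bytes/element) and fp16 costs a flat 2. A sketch with made-up dimensions for illustration (48 layers, 4 KV heads, head dim 128; the real numbers depend on the model card):

echo '2 * 48 * 4 * 128 * 262144 * 0.5625 / 1024^3' | bc -l  # ~6.75 GiB at q4_0
echo '2 * 48 * 4 * 128 * 262144 * 2 / 1024^3' | bc -l       # ~24 GiB at fp16, which is why default settings blow past a 24gb card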
Dave W Plummer
Dave W Plummer@davepl1968·
@lauriewired Here's where the magic happens :-). I would have cleaned up if I'd known people were coming by, but...
5
2
99
3K
LaurieWired
LaurieWired@lauriewired·
Every era has had a “1% computer nerd” setup. I like to think, what would a homelab person look like in previous generations?

Today: Server Rack, GPU lab, Home Assistant, NAS
1990s: Linux or BSD server, Sun, maybe SGI workstation, Web + email hosting
1980s: BBS admin, dot matrix printer, Also home automation (X10!)
1970s: Teletype, Altair, maybe homebrew computer clubs?

I’m sure I’m missing some, if you lived during any of these eras, I’d be super curious what the 1% hobbyist looked like.
[two images]
84
41
994
42.4K
azzurro
azzurro@therealazzurro·
@pupposandro would love to see something like this for Intel Arc 🥲
0
0
1
157
Sandro
Sandro@pupposandro·
89.7 tok/s with Qwen3.6-27B at 60K context on a single RTX 3090. 3.64x faster than full attention, 100% speculative acceptance. Just merged sliding window flash attention + two-phase cache into Luce DFlash. FA now attends to the last 2048 KV positions instead of the full 60K, decode jumps from 25 to 91 tok/s. Two-phase cache skips ~1.4 GB of rollback tensors during prefill, migrates them after. Freed enough VRAM to bump prefill ubatch from 192 to 384. Huge thanks to @dusterbloom for the PR, @davideciffa for the review. Repo in the first comment ⬇️
[image]
42
29
436
26.1K
Anaya
Anaya@Anaya_sharma876·
Linux users be honest. Ubuntu or Fedora?
[two images]
383
16
437
34.6K
azzurro
azzurro@therealazzurro·
@BrodieOnLinux they might be juuuuust a little bit retarded, but i don't know.
0
0
0
45
Brodie Robertson
Brodie Robertson@BrodieOnLinux·
I am fascinated by GNOME's choice to hide the log out button unless you're on a multi-user system or have multiple desktops, we must study how choices are made in this environment
69
38
1.2K
55.9K
azzurro
azzurro@therealazzurro·
@sudoingX how are the usage limits wherever you're running gpt-5.5? is openai the only one offering it atm?
0
0
2
375
Sudo su
Sudo su@sudoingX·
lately opus 4.7 sounds so retarded next to gpt-5.5. i did not expect this but i am so back. so so fucking back baby
16
3
217
11.2K