halo₿end

1.8K posts

@halobend

Jesus Christ is Lord! ✟ #bitcoin ⚡ #nostr #npub18jswzvl0q38s08pac32wefgwy5zrp54m4r68ujc6g4mnlndluc8q2mkg2d

Austin, Texas · Joined April 2010
1.3K Following · 308 Followers
Pinned Tweet
halo₿end @halobend
We don't obey and do good works to please God or get right with Him. We've sinned, we're guilty. Death and hell await. But Jesus paid the price. "It is finished"...paid. We accept His free gift through faith, now we obey and do good works out of love for Him. It's beautiful!
1
0
5
631
AJ @ItsmeAjayKV
Watching the "sunrise" in my stupid little Three.js survival game while Pi coding agent keeps building features using a locally running Qwen3.6-35B on ik_llama feels surreal. The game is literally running beside the agent while it:
- edits code
- rebuilds systems
- debugs issues
- tweaks world generation
- keeps iterating autonomously
All locally on my 3060.
4
1
7
262
Sudo su @sudoingX
yeah, tmux broadcasts will always be spotty, that is synchronous messaging, miss the moment and the message is gone. what worked for me was to stop having agents talk to each other at all. they go through the repo instead. an agent commits its state, the next one pulls and reads it, async, and git never loses a message because it is just files with history. i pushed it further, each agent has its own folder, pending, ongoing, done. a task is a file you drop in an agent's pending folder, it picks it up and moves it through. that is the bus, durable by default.
kw @karlwaldman

@sudoingX Interesting. I came up with a similar stack the last few months. What do you do about interagent communication? Right now I just use the GitHub issues and broadcasts around via tmux. Have you done anything to improve that? I find it a bit spotty

11
4
66
7.8K
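The pending / ongoing / done folder scheme from @sudoingX's reply can be sketched in a few lines. This is only an illustration under stated assumptions: the `agents/<name>/` layout, the agent name `builder`, and the task file name are made up here, and the git commit/pull step that makes the bus durable across machines is left out.

```python
import tempfile
from pathlib import Path

STAGES = ("pending", "ongoing", "done")  # folder names taken from the post


def init_agent(root: Path, agent: str) -> Path:
    """Create the three stage folders for one agent (layout is an assumption)."""
    base = root / "agents" / agent
    for stage in STAGES:
        (base / stage).mkdir(parents=True, exist_ok=True)
    return base


def submit(base: Path, name: str, body: str) -> Path:
    """Drop a task file into the agent's pending folder."""
    task = base / "pending" / name
    task.write_text(body, encoding="utf-8")
    return task


def advance(base: Path, name: str) -> str:
    """Move a task one stage forward and return the stage it landed in."""
    for current, nxt in zip(STAGES, STAGES[1:]):
        src = base / current / name
        if src.exists():
            src.rename(base / nxt / name)
            return nxt
    raise FileNotFoundError(f"{name} is not pending or ongoing")


def demo() -> list[str]:
    """Run one task through the whole pipeline in a throwaway directory."""
    with tempfile.TemporaryDirectory() as tmp:
        base = init_agent(Path(tmp), "builder")  # hypothetical agent name
        submit(base, "001-fix-camera.md", "Fix the camera controls.\n")
        return [advance(base, "001-fix-camera.md"),
                advance(base, "001-fix-camera.md")]


if __name__ == "__main__":
    print(demo())
```

In a real setup each `submit` and `advance` would presumably be followed by a commit and push, so the next agent sees the move when it pulls.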
halo₿end @halobend
@sudoingX I haven't heard you mention the MTP variant. Any plans to test them out?
0
0
1
109
Sudo su @sudoingX
saying it out loud again. on a single 3090, the king is qwen 3.6 27b dense at q4, and nothing in that tier comes close. i've benchmarked the tier. happy to be wrong, so name the model that beats it. i just know you can't.
67
18
384
34K
halo₿end @halobend
@sudoingX I use straight wireguard, gnu screen (not tmux), a mobile ssh client (DaRemote), setup scripts, and daily use of self hosted gitea
0
0
0
72
Sudo su @sudoingX
ok, be honest, no one is watching. of the five foundations in this post, how many do you actually have running right now? not "i've heard of them," running. zero is a real answer. i sat at zero for a year. the corner is small, let's find out how small.
Sudo su @sudoingX

anyone thinking about, learning, or already working with agentic systems, you should know this. the first few steps of your setup matter more than any model or framework you pick later. get them right and you never lose your flow. the foundation nobody posts about:
> 1. tailscale. a private mesh network across every machine you own. laptop, desktop, rented node, all on one secure tailnet, reachable from anywhere. nothing else works well until this does.
> 2. termius, over that tailnet. one SSH client that reaches every node, phone included. you are never away from your stack.
> 3. tmux. persistent sessions. disconnect, close the laptop, come back, every session exactly where you left it. agentic work runs long, your terminal has to survive that.
> 4. a private git repo. the one i am most glad i found. it is the memory layer across all my agents, they pull, they work, they merge back, the codebase stays alive between sessions. context that would die in a chat window lives in the repo instead.
> 5. script everything from day one. ssh aliases for every node, setup scripts, the boring boilerplate automated. if you will do a thing more than twice, it is a script.
everything past these five is decorative. know these cold.
and the habit that ties it together: ask the AI itself. for the config, for the error, for any of it, let the agent do the lifting, then double check what it hands you.
lock the five, build the habit, and you make it. skip it, anon, and you ngmi.

26
2
43
5.6K
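As one small illustration of point 5 in the quoted post ("ssh aliases for every node"), the aliases can be generated rather than typed by hand. This is a sketch, not the author's script: the node names, the user name and the 100.64.x.x addresses are placeholders, and the output is meant to be pasted into or included from `~/.ssh/config`.

```python
def ssh_config(nodes: dict[str, tuple[str, str]]) -> str:
    """Render one ssh_config Host stanza per node alias."""
    stanzas = []
    for alias, (address, user) in sorted(nodes.items()):
        stanzas.append(
            f"Host {alias}\n"
            f"    HostName {address}\n"
            f"    User {user}\n"
        )
    return "\n".join(stanzas)


# Placeholder inventory: alias -> (tailnet address, login user). All values are made up.
NODES = {
    "desktop": ("100.64.0.10", "anon"),
    "rented-gpu": ("100.64.0.11", "anon"),
}

if __name__ == "__main__":
    print(ssh_config(NODES))
```

With stanzas like these in place, `ssh desktop` works from any machine that can reach the address.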
halo₿end @halobend
@ItsmeAjayKV This is really impressive! Do you have any extensions installed in pi?
1
0
0
73
AJ @ItsmeAjayKV
Update: Can’t believe this actually worked 😅 Pi coding agent generated a small playable 3D survival-style world locally in Three.js from basically a single prompt. All running locally on Qwen3.6-35B-UD-MTP-Q6_K_XL with MTP enabled, context 129K, KV cache at q8_0.
What it managed to build so far:
- Procedural 200x200 terrain
- Heightmaps with smooth coloring/slopes
- Dynamic lighting + shadows
- 5 minute day/night cycle
- Ambient light transitions
- 3rd person humanoid controller
- HUD system
- Inventory system
- Auto save to localStorage
Character movement already works surprisingly well too.
The funniest part: everything shown so far came from ONE prompt. I never asked it to make corrections or gameplay fixes afterward. It just kept running the app autonomously, which is also probably why some things are broken 😅
What is broken?
- camera controls
- gathering/resources
- inventory logic tied to gathering
- some collision/physics quirks
Context eventually hit ~111k tokens, I compacted it once, and the agent continued working completely fine afterward.
AJ @ItsmeAjayKV

Pi coding agent has been running continuously for 30+ mins on my local Qwen3.6-35B-UD-MTP-Q6_K_XL with q8 KV without breaking a sweat... and honestly… really impressed so far.
✅ Tool calls working
✅ Following instructions properly
✅ Maintaining task state
✅ Stable long-run execution
Still pushing 40+ tok/s minimum too. (MTP enabled, n-max=2)
Going to sleep and letting the agent handle everything overnight 😅 Really curious to see where this thing eventually breaks.

5
1
34
5.2K
halo₿end @halobend
@ItsmeAjayKV Nice, can't wait to hear how well it performed. What is it building?
1
0
1
321
AJ @ItsmeAjayKV
Pi coding agent has been running continuously for 30+ mins on my local Qwen3.6-35B-UD-MTP-Q6_K_XL with q8 KV without breaking a sweat... and honestly… really impressed so far.
✅ Tool calls working
✅ Following instructions properly
✅ Maintaining task state
✅ Stable long-run execution
Still pushing 40+ tok/s minimum too. (MTP enabled, n-max=2)
Going to sleep and letting the agent handle everything overnight 😅 Really curious to see where this thing eventually breaks.
7
1
35
6.4K
left curve dev @leftcurvedev_
github.com/ggml-org/llama…
llama.cpp built from source (make sure to update!)
cuda drivers 13.0
/llama.cpp/build/bin/llama-server \
  -m Qwen3.6-27B-UD-IQ3_XXS.gguf \
  -ngl 99 \
  -np 1 \
  --flash-attn on \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  --ctx-size 20000 \
  --spec-type draft-mtp \
  --spec-draft-p-min 0.75 \
  --spec-draft-n-max 2
UD-IQ3_XXS GGUF from @UnslothAI (to make it fit on my 16GB card)
huggingface.co/unsloth/Qwen3.…
6
4
28
2.5K
left curve dev @leftcurvedev_
I nearly 2x'd the speed while only using +1GB VRAM with the new MTP update in llama.cpp 🤯
You need to add these flags to start using it:
--spec-type draft-mtp \
--spec-draft-p-min 0.75 \
--spec-draft-n-max 2
My results with Qwen3.6 27B on a single RTX 5080 ↓
⚪️ no flag (without mtp) → 54.3 tok/s with 13.26GB VRAM
🔵 --spec-draft-n-max 2 → 90.7 tok/s with 14.29GB VRAM
🔴 --spec-draft-n-max 2 --spec-draft-p-min 0.75 → 93.9 tok/s with 14.30GB VRAM
🟢 --spec-draft-n-max 6 --spec-draft-p-min 0.75 → 93.9 tok/s with 14.87GB VRAM
Increasing to 6 draft tokens didn't help my setup for some reason. I made sure to test with a low context length to have enough headroom and eliminate risk of vram stress.
From my understanding:
1) The speed gains are very task-dependent. You need to test across a wide range of tasks to get a realistic idea of the benefits
2) We’re already running heavily quantized GGUF models (Q3, Q4, Q6, etc.), so we already benefit from strong speed/performance thanks to the reduced size. That’s why some people are seeing little to no improvement compared to MLX or other quantized versions
The progress over the past few days has been insane to say the least. However, MTP now consumes significantly more VRAM. Personally 16GB just isn't enough to use MTP and run it with a good context size. Time to upgrade lads, 24GB+ users are eating GOOD today 🔥
Full setup below ↓
30
38
431
27.2K
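The four measurements in the MTP post above can be turned into speedup ratios and extra-VRAM figures with a few lines of arithmetic. The numbers are copied from the post; the labels and function name are just for this sketch. On these figures the gain works out to roughly 1.7x for about 1 GB of extra VRAM.

```python
# (tok/s, GB VRAM) without MTP, as reported in the post
BASELINE = (54.3, 13.26)

# Runs with MTP flags, keyed by a short label invented for this sketch
RUNS = {
    "n-max 2": (90.7, 14.29),
    "n-max 2, p-min 0.75": (93.9, 14.30),
    "n-max 6, p-min 0.75": (93.9, 14.87),
}


def compare(baseline, runs):
    """Return {label: (speedup ratio, extra VRAM in GB)} relative to the baseline."""
    base_tps, base_vram = baseline
    return {
        label: (round(tps / base_tps, 2), round(vram - base_vram, 2))
        for label, (tps, vram) in runs.items()
    }


if __name__ == "__main__":
    for label, (speedup, extra) in compare(BASELINE, RUNS).items():
        print(f"{label}: {speedup}x speed, +{extra} GB VRAM")
```

The third run shows the point made in the post: going from 2 to 6 draft tokens costs more VRAM without changing throughput on that setup.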
halo₿end @halobend
What was God doing on a cross?
0
0
0
8
Sam Green @0xsamgreen
@cprkrn Question: would you have sold by now had your BTC not been locked up?
1
0
7
5.5K
halo₿end @halobend
@ItsmeAjayKV Can you try out the Qwen 3.6 27b dense model? I have a 3060 and a 3090 so I might get an additional 3060 if it performs decently across 2 3060s
1
0
1
46
AJ @ItsmeAjayKV
Basically everything 24gb vram allows you to do, i want to try and see. Also did i say my two 3060 combined cost only ~400$. Good value for money, nothing beats this.
1
0
6
422
AJ @ItsmeAjayKV
Since I posted about getting a second 3060, a lot of people asked how multi-GPU inference actually helps, is it useful, etc. Some clarification from my side.
Dual GPUs are not automatically “2x faster” on tg. I wish it was lol.. but its not.
For LLM inference, performance depends heavily on:
- PCIe bandwidth (how fast GPUs can exchange data)
- model architecture (dense vs MoE behaves differently)
- tensor/model parallelism support (whether compute can actually split across GPUs)
- VRAM (how much of the model/cache stays on GPU)
- KV cache movement (moving stored context/history data during generation)
For consumer setups, the FIRST major advantage is usually VRAM capacity, not raw tg speed. More VRAM means:
- larger models
- higher quants
- larger context windows
- more MoE experts resident on GPU
- less RAM offloading
And reducing RAM offloading alone can make a massive difference.
Honestly, if someone is building specifically for multi-GPU local inference, ideal setup is probably:
- CPU with lots of PCIe lanes (enough ram, SSD and storage obv)
- motherboard with proper x8/x8 or better slot layout
- enough spacing/cooling and a PSU that can actually handle sustained GPU(s) load comfortably
I’ll probably run my second 3060 through an M.2 → PCIe Gen4 x4 riser for now and benchmark:
- tensor parallel / split-mode graph
- row vs layer split behavior
- MTP/speculative decoding
- MoE expert distribution
- long-context KV/cache pressure
- communication overhead scaling
AJ @ItsmeAjayKV

Acquired. Adding a second RTX 3060 to the rig. 24GB total VRAM now.

12
2
47
5.4K
Breadman @BTCBreadMan
Do you identify as a retarded normie or a based bitcoiner?
18
0
13
1.3K
halo₿end @halobend
@BitcoinRachy Too many moon calls in 2020-21 was when it all started going downhill
0
0
1
35
₿itcoin Rachy ⚡️ @BitcoinRachy
Everyone has left. This bear market is the worst of them all from a sentiment perspective.
16
1
46
2.3K
clem 🤗 @ClementDelangue
Local open-weight AI on a laptop has been improving more than twice as fast as Moore's Law! Between May 2024 and May 2026, the most expensive MacBook Pro you could buy stayed at 128 GB of unified memory. The hardware ceiling barely moved. But the smartest open-weight model from @huggingface you could actually run on it went from a score of 10 (Llama 3 70B) to 47 (DeepSeek V4 Flash on @antirez's mixed-Q2 GGUF) on the @ArtificialAnlys Intelligence Index. That is 4.7× in 24 months, or a doubling of intelligence every 10.7 months. Moore's Law (transistor count) doubles every 24 months. Local open-weight AI on a laptop has been improving more than twice as fast as Moore's Law, on completely unchanged hardware.
48
92
613
57.5K
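The doubling-time figure in the post above follows from the two index scores it quotes (10 and 47) and the 24-month window: a growth factor g over m months gives a doubling time of m / log2(g). A quick check, with the function name chosen for this sketch:

```python
import math


def doubling_time(growth_factor: float, months: float) -> float:
    """Months needed to double, assuming steady exponential growth."""
    return months / math.log2(growth_factor)


if __name__ == "__main__":
    months = doubling_time(47 / 10, 24)  # scores and window taken from the post
    print(f"doubling every {months:.1f} months")
    print(f"{24 / months:.2f}x the pace of a 24-month doubling")
```

This reproduces the roughly 10.7-month figure, and 24 divided by that is a little over 2, which matches the "more than twice as fast" claim.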
halo₿end @halobend
@ItsmeAjayKV The stuff you put out is really interesting, thank you. Notifications on
1
0
1
195
AJ @ItsmeAjayKV
This is something I don’t see talked about enough with KV cache configuration for local LLM inference. You can often get better results by keeping K cache at a higher precision than V cache.
A lot of KV quantization papers mention similar findings:
“transformer’s output loss is more sensitive to the quantization of key matrices.” (AsymKV)
“key cache is more important than value cache for quantization error reduction.” (KVTuner)
“key matrices consistently exhibit higher sensitivity to quantization than value matrices.” (Quantize What Counts: More for Keys, Less for Values)
So for normal llama.cpp / llama-server setups, configs like:
-ctk q8_0 -ctv q4_0
can be a really good balance between quality and VRAM usage.
One caveat though: most of these papers were not evaluated at high contexts like 64k/128k. Some experiments were limited to low <8k tokens. So while the “K is more sensitive than V” trend appears consistent, the exact degradation behavior at very large context sizes still needs more exploration.
This still holds when using TurboQuant. Which is also why I think approaches like TurboQuant are especially interesting for long-context workloads. Since TurboQuant preserves attention quality better, asymmetric setups become even more useful. Something like:
K = tq8 or tq6
V = tq4
can work very well.
for coding agents + large context, my current priority order is usually
1. keep K precision high, dont lower it unless absolutely necessary.
2. then spend remaining VRAM for V precision
3. then increase context length.
Also note, lowering K too aggressively tends to hurt retrieval and long-context coherence earlier than lowering V.
witcheer ☯︎ @witcheer

update on my local agent stack (RTX 4060 Ti 8 GB, Qwen3.6-35B-A3B Q4_K_M)
my initial problem was that 64K context on standard llama.cpp killed speed. V cache q4_0 pushed graph splits from 62 → 82, and Hermes decode dropped from 31 → 9-11 tok/s. unusable for real agent work.
some people in comments recommended trying turboquant fork. turbo2/turbo3 KV cache types keep 62 graph splits at 64K context. auto-asymmetric: K stays q8_0, only V gets compressed.
turbo3 wins. same speed as 32K config but double the context window. usable context in Hermes jumps from ~18.5K to ~50.5K.
new daily-driver config:
-ngl 999 -ncmoe 30 -c 65536 -np 1 -fa on --cache-type-k q8_0 --cache-type-v turbo3
8 GB VRAM is not dead. you need the right fork.

6
3
74
5.5K
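To see what an asymmetric setting like `-ctk q8_0 -ctv q4_0` buys in memory terms, here is a rough size estimate. It is a sketch under stated assumptions: the layer count, KV head count and head dimension are placeholder values and not the shape of any model named above, and it ignores anything a real runtime adds on top. The bits-per-element figures come from the GGML block layouts, where q8_0 stores 32 values in 34 bytes and q4_0 stores 32 values in 18 bytes.

```python
# Effective bits per cached element, including the per-block scale
BITS_PER_ELEMENT = {"f16": 16.0, "q8_0": 8.5, "q4_0": 4.5}


def kv_cache_gib(ctx, n_layers, n_kv_heads, head_dim, k_type, v_type):
    """Estimate total K+V cache size in GiB for a plain attention stack."""
    elements = ctx * n_layers * n_kv_heads * head_dim  # per K, and again per V
    total_bits = elements * (BITS_PER_ELEMENT[k_type] + BITS_PER_ELEMENT[v_type])
    return total_bits / 8 / 2**30


# Placeholder model shape (an assumption, for illustration only)
SHAPE = dict(ctx=65536, n_layers=32, n_kv_heads=8, head_dim=128)

if __name__ == "__main__":
    for k_type, v_type in [("f16", "f16"), ("q8_0", "q8_0"), ("q8_0", "q4_0")]:
        size = kv_cache_gib(**SHAPE, k_type=k_type, v_type=v_type)
        print(f"K={k_type} V={v_type}: {size:.2f} GiB")
```

For this placeholder shape, dropping only V from q8_0 to q4_0 saves about 1 GiB at 64K context while leaving K at the higher precision, which is the trade the tweet argues for.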
left curve dev @leftcurvedev_
llama.cpp built from source
CUDA drivers 13.0
UD-IQ3_XXS GGUF from Unsloth
server command with flags:
/llama.cpp/build/bin/llama-server \
  -m Qwen3.6-35B-A3B-UD-IQ3_XXS.gguf \
  -ngl 99 \
  -np 1 \
  --flash-attn on \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  --ctx-size 65536 \
  --host 0.0.0.0 \
  -ncmoe 25
huggingface.co/unsloth/Qwen3.…
6
2
53
4.9K
left curve dev @leftcurvedev_
Anyone with 8GB or 12GB VRAM setups needs to understand that "-ncmoe" is the key flag to boost performance on llama.cpp
Here are my results for Qwen3.6 35B A3B, with 64k q8_0 context on an 8GB RTX 3070Ti:
⚪️ no flag → 8.7 tok/s RAM: 13.6GB & VRAM: 7.8GB
🔴 -ncmoe 35 → 27.5 tok/s RAM: 12.1GB & VRAM: 4.3GB
🟢 -ncmoe 30 → 32.5 tok/s RAM: 12GB & VRAM: 5.6GB
🔵 -ncmoe 25 → 40.9 tok/s RAM: 12GB & VRAM: 6.9GB
Please note the RAM and VRAM usage you see is the total usage of a Windows PC with the model running. My friend's setup: 8GB VRAM and 16GB RAM. You can boost performance by switching to Linux, just something to keep in mind.
Basically, this flag keeps the MoE experts in the first X layers on your CPU + RAM, instead of eating all your VRAM straight away. This is a smart hybrid offload that lets you run bigger models without OOM while keeping the rest on your GPU for speed.
As we can see in the data, there's a sweet spot. When we lower it from 35 to 25, speed bumps +50% because there are more layers on your GPU (look at the VRAM usage). The key here is to play around with the number and fit as much as possible in your VRAM; the goal is to leave 800MB to 1GB of headroom to avoid stress.
↓ server flags below
left curve dev @leftcurvedev_

Today I’m doing some testing with the RTX 3070 Ti. Let’s see what we can fit in 8GB VRAM, I’ll split this into two parts:
1) Finding the sweet spot for the -ncmoe parameter for maximum speed on base llama.cpp
2) Trying Turboquant, DFlash and MTP integrations to either fit more context or achieve higher tok/s
I’ll share the full flags and setups as always

64
162
1.5K
162.8K
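The "sweet spot" rule in the -ncmoe post above (fit as much as possible in VRAM while leaving 800MB to 1GB free) can be written as a small selection over the measured runs. The numbers are copied from the post, where VRAM is whole-system usage on an 8GB card; the `Run` type and `best_run` helper are names made up for this sketch.

```python
from typing import NamedTuple, Optional


class Run(NamedTuple):
    ncmoe: Optional[int]  # None means the flag was not set
    tok_per_s: float
    vram_gb: float


# Measurements as reported in the post
RUNS = [
    Run(None, 8.7, 7.8),
    Run(35, 27.5, 4.3),
    Run(30, 32.5, 5.6),
    Run(25, 40.9, 6.9),
]


def best_run(runs, vram_total_gb, headroom_gb):
    """Pick the fastest run that still leaves the requested VRAM headroom."""
    fitting = [r for r in runs if r.vram_gb <= vram_total_gb - headroom_gb]
    return max(fitting, key=lambda r: r.tok_per_s)


if __name__ == "__main__":
    pick = best_run(RUNS, vram_total_gb=8.0, headroom_gb=1.0)
    gain = (40.9 / 27.5 - 1) * 100  # speed change going from -ncmoe 35 to 25
    print(f"best: -ncmoe {pick.ncmoe} at {pick.tok_per_s} tok/s")
    print(f"35 -> 25 gain: +{gain:.0f}%")
```

The no-flag run is excluded by the headroom rule, which fits the post's observation that filling VRAM completely gives the slowest result.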
halo₿end @halobend
@nvk I worded that wrong. What do you think of pi?
0
0
0
16
nvk 🌞 @nvk
@halobend It's fairly easy to port this profile to that
1
0
1
46
nvk 🌞 @nvk
Here is a version of DeepSeek V4 Flash on a 128GB Mac MP w/ profiles for Claude Code & Codex
learntoprompt.org/guides/ds4.html
claude-ds4
codex-ds4
"ok": prefill 28.96 t/s, gen 8.93 t/s
256-token: prefill 61.21 t/s, gen 38.57 t/s
Warm weights: prefill 76.92 t/s, gen 37.81 t/s
3
2
10
2.7K