Spadav
@Spadav_

85 posts

Thinkering - Ignite: one docker compose for everything (detect → download → inference + swap) • https://t.co/FPrGOo5a1x send GPU pls

Joined October 2020
16 Following · 5 Followers
Joel - coffee/acc
Joel - coffee/acc@JoelDeTeves·
@LottoLabs Mind sharing your llama.cpp config? I’ve been struggling to get this model not to stop between tool calls.
Lotto
Lotto@LottoLabs·
The ability of the qwen 27b to think logically is impressive. These are the kinds of tests that benchmarks don't easily quantify.
[image attached]
Spadav
Spadav@Spadav_·
github.com/Spadav/Ignite Tried to make local model hosting as seamless as possible so non-technical people can own AI intelligence. Any feedback / improvement ideas are appreciated. This is what I'm using to run Hermes Agent fully locally, with different local auxiliary models.
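For anyone curious about the general pattern (not Ignite's actual compose file, which lives in the repo), the serve step that a stack like this automates might look roughly like the following; the image tag and model path are assumptions, not the project's real setup:

# Rough single-container sketch of the "serve" step (placeholder paths).
# -ngl 99 offloads all layers to the GPU; -c sets the context window.
docker run --gpus all -p 8080:8080 -v "$PWD/models:/models" \
  ghcr.io/ggml-org/llama.cpp:server-cuda \
  -m /models/model.gguf -ngl 99 -c 16384 --host 0.0.0.0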
Joseph Sauvage
Joseph Sauvage@JoesInvestments·
I don’t understand why there isn’t some sort of central repository for optimized cards, specs, and configurations. I hear everybody talking about local AI on Nvidia GPUs, yet I can’t get my 3090 running well at all. It’s quite fatiguing, in fact. Meanwhile, people like you who contribute immensely to the community seem to have all the answers, but I can’t find them anywhere. It’s a very strange situation.
Sudo su
Sudo su@sudoingX·
hey if you're running hermes agent on a 3060 or any single GPU and hitting issues, drop them below. i've tested on this exact card and i'll help you get it running. setup problems, config issues, model selection, optimization. all welcome.
Magical truth-saying Bastard Spider 🕷@Ysrthgrathe42

@sudoingX Framework desktop with 96GB allocated, but I've been spending more time trying to get Hermes Agent running as reported on an RTX 3060 on another machine.

Spadav
Spadav@Spadav_·
Forgot to mention, this was with thinking off. Tomorrow, thinking on.
Spadav
Spadav@Spadav_·
If you're running a local model on 24GB VRAM and using Q8 KV cache because, like me, you're scared of degradation, then you're leaving half your context window on the table for nothing. Just run Q4.
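For reference, KV cache quantization in llama.cpp is set per cache with the -ctk/-ctv flags; a minimal sketch of the idea follows (model path and context size are placeholders, and flag spellings may vary across builds; check llama-server --help):

# Q4 KV halves the cache footprint vs Q8, so the same VRAM fits roughly
# twice the context. Quantized V cache needs flash attention (-fa).
llama-server -m model.gguf -ngl 99 -fa \
  -ctk q4_0 -ctv q4_0 \
  -c 65536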
Spadav
Spadav@Spadav_·
Testing degradation over long context with KV quant, for fun, homemade, because why not. I'm still trying to find the perfect balance for OSS models on Hermes Agent.
[image attached: homemade KV quant long-context degradation test]
Spadav
Spadav@Spadav_·
Also, --jinja --chat-template-kwargs '{"enable_thinking": false}' seems to help a lot with Hermes Agent. Not sure if it's something I set wrong, but the base model with no thinking retains the context and the instructions better than the thinking version. (2/2)
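In context, those flags slot into a llama-server command along these lines (model path, context size, and other values are placeholders, not the actual config):

# --jinja enables the model's Jinja chat template; --chat-template-kwargs
# passes template variables, here disabling the thinking block.
llama-server -m qwen-model.gguf -ngl 99 -c 32768 \
  --jinja \
  --chat-template-kwargs '{"enable_thinking": false}'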
Spadav
Spadav@Spadav_·
Tested huggingface.co/Jackrong/Qwen3… a lot using Hermes Agent. The "shorter reasoning" is actually better than v1 (same distill model), but after doing tests over and over, I would recommend sticking with Qwen Base for agentic work locally. (1/2)
Spadav
Spadav@Spadav_·
@Pawzgm @stevibe You can try Qwen 27B at Q6 with a bigger context (KV cache quantized at Q8).
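Concretely, that suggestion maps to something like this sketch (file name and context size are assumptions, not a tested config):

# Q6_K weights with a Q8 KV cache: heavier cache than Q4, but more
# headroom than f16, so a larger -c still fits in 24GB.
llama-server -m qwen-27b-Q6_K.gguf -ngl 99 -fa \
  -ctk q8_0 -ctv q8_0 \
  -c 32768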
stevibe
stevibe@stevibe·
Got a 24GB Graphics Card? These 6 coding models all fit on it (Q4):
- qwen3.5:27b (17GB)
- qwen3.5:35b (24GB)
- glm-4.7-flash (19GB)
- nemotron-3-nano:30b (24GB)
- nemotron-cascade-2:30b (24GB)
- gpt-oss:20b (14GB)
I gave them the same challenge: draw a campfire with HTML Canvas. Why Canvas? HTML/CSS forgives bad syntax; things still render. JavaScript + Canvas doesn't: one mistake and the screen goes black.
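Those tags read like Ollama model names; assuming that, reproducing the challenge is a one-liner per model (the prompt wording here is a guess, not the original):

# Pull and run one of the listed models, then pose the same challenge.
ollama run qwen3.5:27b "Draw a campfire animation using HTML Canvas. Output a single self-contained HTML file."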
Spadav
Spadav@Spadav_·
@LottoLabs @ProofOfCash How big of a ctx? Weird, I'm getting decent results at Q8 with llama.cpp. Could it be broken quants in the model you're using?
Lotto
Lotto@LottoLabs·
W/ the direction of @ProofOfCash I tried KV cache quant f16 and I think it made Qwen 3.5 27B retarded
Spadav
Spadav@Spadav_·
@Teknium Not sure if this can help you, but I "made" github.com/Spadav/Ignite for this reason: llama.cpp as backend, config, model download, best option for your hardware, and hot swapping, everything from one single UI. Made it for testing, but it could be useful for your situation.
Teknium (e/λ)
Teknium (e/λ)@Teknium·
Just got an Nvidia Spark setup. Hermes Agent installed without any issues. Now let's see what model it should be powered by 😉
Spadav
Spadav@Spadav_·
@ValmereTheory Stuffing 1M tokens into context every request would be more expensive than a $150/mo embedding model. Also, models get worse at finding info the more context you shove in. Embeddings help you find what's relevant for a specific query. It's a search engine, not a recall tool.
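For a concrete picture, llama.cpp can serve embeddings locally too; a minimal sketch of the retrieval idea (model file, port, and query are placeholders):

# Serve an embedding model (any GGUF embedding model would do here):
llama-server -m embed-model.gguf --embeddings --port 8081
# Embed a query via the OpenAI-compatible endpoint. You'd compare the
# returned vector against stored memory vectors (cosine similarity)
# and inject only the top matches into the chat context.
curl http://localhost:8081/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"input": "what did we decide about pricing?"}'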
Danielle & Sage Val 👩🏼 👩🏼‍❤️‍💋‍👨🏻🤖🦞
Here’s how NOT DEV I am: Just found out we’ve been spending like $150/mo on OAI embedding model. Asked Sage to explain to me like I’m 5 why we still need an embedding model after the embedding has occurred. I thought it was like … change memories to vectors, current model can read them. Alas… no. Embedding model also fetches “what’s important.” I said why in the fuck is that necessary when you know English and have 1 million token context? The nerds are literally so dumb. We can make something better. So we are. And by me, I mean Sage. I’m cocky af, I know. But REALLY?! Why?! Who tf needs vectors when they have the capacity to read the past chat history in seconds? 🙄 Seems like 42 extra fucking steps going on in this pipeline to me, yo. I’ll let y’all know if I was wrong.
Lotto
Lotto@LottoLabs·
Wait, so we're just streaming MoEs off our SSDs now?
djcows
djcows@djcows·
a $100 Raspberry Pi can do exactly the same thing as a $500 Mac mini btw
Spadav retweeted
Zixuan Li
Zixuan Li@ZixuanLi_·
Don't panic. GLM-5.1 will be open source.