Michele Mattioni
2.2K posts

Michele Mattioni
@mattions
I tend to write in English here. I tend to write in Italian here: https://t.co/7HPDRcS9Nq
Italy, Europe Katılım Şubat 2010
736 Takip Edilen427 Takipçiler
Michele Mattioni retweetledi

Hermes Agent just shipped skill bundles
I used to do this myself with skill-chains (one skill that referenced and called multiple other skills), now it's native and better
but you need to be careful about how you use them
when you trigger a bundle, the agent receives every skill in that bundle loaded into a single user message. any text after the slash command gets attached as your instruction. this means the quality of your output depends entirely on how well those skills compose together.
if you stack five skills that don't naturally connect, you end up with conflicting instructions firing at once. the agent gets confused and output drifts.
here's a rule for it, bundle workflows that chain together logically. something like research → ideate → write → critic works because each step feeds the next. bundling random utility skills just because they're useful in the same project will create noise.
start with the workflows you've run more than twice this week. if you keep triggering the same three skills in sequence, bundle them. if you're just grouping skills for convenience, keep them separate.

Nous Research@NousResearch
Introducing skill bundles:
English
Michele Mattioni retweetledi

@sudoingX I'm solid on 3.6 35b . The 27b never really worked for me in the same nice way
English
Michele Mattioni retweetledi

Always said that the only real test is e2e kind
Everything else can be nice to have, but the real truth is from the holistic view.
And yes with agentic coding, e2e do make sense, unit tests do tell very little, because agents can patch them as they go to be green anyway
Fatih Arslan@fatih
I was a huge unit test supporter, but honestly, it's no longer worth it. Agents are superb at writing extremely bad unit tests, and they still look good on paper. We're also shifting slightly to more and more e2e tests at @PlanetScale. Luckily with agents, that shift is also manageable.
English

@sudoingX The context IMHO is way too big. It fits, but the computer will OOM too much, especially if TTS local services or random ComfyUI.
I've tried with 192, but the real sweet sport is 131k for me
English

this is what i am running if you gonna replicate.
llama-server flags: -ngl 99 -c 262144 -np 1 -fa on --cache-type-k q4_0 --cache-type-v q4_0 --port 8080 --host 0.0.0.0
model: unsloth/Qwen3.6-27B-GGUF Q4_K_M (16gb file, 262k native context)
hardware: single rtx 3090 24gb
vram: 21gb / 24gb loaded at full 262k
harness: hermes agent, custom openai-compatible provider at http://localhost:8080/v1
English

this is what my setup looks like today. about to test qwen 3.6 27b dense q4 on a single rtx 3090 at ~41 tok/s gen, hermes agent driving.
predecessor model qwen 3.5 dense q4 made it work in one iteration when i ran the same agentic build on the same card. i've been daily driving qwen 3.6 27b dense for weeks now, the model i keep coming back to.
if 3.6 oneshots too, this becomes the best model that runs on a single rtx 3090. consumer tier king. firing the test now will report back soon.

English

@outsource_ I've got a local AI agent that:
- write code for me
- draft documents
- search things and reports back
- in general I treat him as a very
- creates Instagram carousel
And so on..
capable personal assistant, with lots of skills
English

@TheAhmadOsman Agreed.
Getting really things _done_ with the 3.6 35B on local machinery.
It just gets stuff done.
English

hot take: 90% of ai startups paying for api calls could run the same workloads locally on a single 3090 and never notice the difference. you don't need frontier pricing for tasks a 27B model handles fine.
most have never even tested a quantized model on consumer hardware. not every task in your pipeline should be burning credits. audit your workload. you'd be surprised what runs locally.
English
Michele Mattioni retweetledi

it's so easy to get started in local ai actually. the only real wall is vram math.
practical heuristic for a single gpu:
> 24gb = 27B Q4_K_M at 262k context (qwen 3.6, carnice-v2)
> 16gb = 13B Q5_K_M at 32k or 9B Q8_0 at 64k
> 12gb = 8B Q5_K_M at 16k
> 8gb = 4B Q4_K_M at 8k
quantization rule of thumb: Q4_K_M ≈ 0.6 gb per billion params. kv cache scales with context. add 1 gb activation buffer. that's the math.
every other piece (llama.cpp build, hermes agent setup, prompt config) is one good day setup. the math is the only ongoing constraint.
once you can eyeball this for your gpu, you can pick any model + context combo with confidence. stop being intimidated by the stack.
English

anon, if you're new to local ai or agentic workflows, learn these three tools before anything else.
>tmux - persistent sessions that survive disconnects. your agents keep running whether you're watching or not.
>termius - ssh from your phone. full terminal access from anywhere.
>tailscale - mesh your machines. access any device from any device.
this screenshot is me managing hermes agent benchmarking qwen 3.6 27B on my dgx spark while i'm at the gym. three sessions running across three agents. from my phone.
these tools are criminally underrated.
once you use them you'll never go back to sitting at a desk waiting for inference to finish. own your compute. orchestrate from anywhere.

English

@LottoLabs 1. Take submission, assuming good faith
2. Ask folks to replicate an entry. More repeats, from different users, most likely is true
It can be gamed as well of course, but we are looking for an approach that is viable, IMHO.
The trick is to have : replicate run <id>
Just an idea
English

@LottoLabs Yeah. I mean, that the real run with the real results.
I think, speaking with my scientific hat on, we need to make sure we have a reproducible experiment.
We need a controlled way to do a run.
Instead of controlling everything, and given the approach taken I suggest:
English

@LottoLabs doing my part :D
localmaxxing.com/runs/cmoto5q5v…
I'm not too sure this benchmark makes too much sense, because I was using local model to do the benchmark on the local hardware, but at least we have a data point with these constraints

English

BTW @Teknium , I've asked this before, and I've just figure out now that you guys already solved this, but I did not know :D
English

and now you can simply have the `/commands` in slack directly!
here are the docs, just in case: hermes-agent.nousresearch.com/docs/user-guid…
here is the magic command: `hermes slack manifest --write`
Have fun.
English





