Ivan Rocha retweeted

I'm pretty excited to test this one:
Gemopus-4-26B-A4B-it-GGUF Q6_K
Using @spiritbuun's Llama.cpp TurboQuant fork:
- Speed: 75 tokens/sec
- VRAM usage: 95% (22.7 GB)
- Context size: 131072
- GPU: RTX A5000 (Ampere) 24 GB
Pretty amazing that you can fit this entire model on the GPU at Q6 quality and still have room for a large context! Plus, MoE models stay fast even at higher-quality quants.
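Rough arithmetic on why it fits. A back-of-envelope sketch, assuming the 26B total parameter count implied by the model name and ~6.56 bits/weight for Q6_K (an approximation, not an exact figure):

```shell
# Estimate Q6_K weight footprint vs. a 24 GB card.
# Assumptions: 26B params (from the model name), ~6.56 bits/weight for Q6_K.
awk 'BEGIN {
  params = 26e9          # total parameters
  bpw    = 6.56          # approx. bits per weight for Q6_K
  vram   = 24.0          # RTX A5000 VRAM in GB
  weights = params * bpw / 8 / 1e9   # bits -> bytes -> GB
  printf "weights: ~%.1f GB, headroom: ~%.1f GB\n", weights, vram - weights
}'
```

That leaves only a couple of GB of headroom, which is why the 4-bit KV-cache flags below matter for reaching a 131072-token context.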
Woodchuck Norris vibe check: PASSED
Square root of 999999999 -> Correct
Hermes Agent -> Interesting behavior. Retains the 26B's speed on short prompts and thinks deeply on more complex requests; sometimes it thinks a little too much, so it might be worth playing with the sampling (top/temperature) settings
Coding test -> One-shotted a fully working Tetris game; no other MoE model, including the vanilla 26B, was able to do this
A very interesting model
-m Gemopus-4-26B-A4B-it-Preview-Q6_K.gguf --n-gpu-layers 99 --ctx-size 131072 --cont-batching --cache-type-k turbo4 --cache-type-v turbo4 --fit on --jinja --reasoning-format auto --flash-attn on
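For reference, here is how the flags above would slot into a full launch command. This is a sketch with assumptions: the binary name (`llama-server`) and port are guesses based on stock llama.cpp conventions, since the fork's entry point isn't named in the post; the flags themselves are taken verbatim from it.

```shell
# Hypothetical full invocation (binary name and --port are assumed;
# all other flags are from the post above).
./llama-server \
  -m Gemopus-4-26B-A4B-it-Preview-Q6_K.gguf \
  --n-gpu-layers 99 \
  --ctx-size 131072 \
  --cont-batching \
  --cache-type-k turbo4 --cache-type-v turbo4 \
  --fit on \
  --jinja \
  --reasoning-format auto \
  --flash-attn on \
  --port 8080
```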
huggingface.co/Jackrong/Gemop…