Revived my old 2080 Ti, loaded Gemma 3 12B with llama.cpp. This is where we are at:
prompt eval time = 76.59 ms / 17 tokens ( 4.51 ms per token, 221.96 tokens per second)
eval time = 32043.16 ms / 1598 tokens ( 20.05 ms per token, 49.87 tokens per second)
total time = 32119.75 ms / 1615 tokens
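Those throughput figures follow directly from the token counts and wall-clock times in the log. A minimal sanity check in Python, plugging in the values above:

# Recompute llama.cpp's reported throughput from the raw timings above.
prompt_ms, prompt_tokens = 76.59, 17
eval_ms, eval_tokens = 32043.16, 1598

prompt_tps = prompt_tokens / (prompt_ms / 1000.0)  # ~221.96 t/s (prefill)
eval_tps = eval_tokens / (eval_ms / 1000.0)        # ~49.87 t/s (decode)
ms_per_token = eval_ms / eval_tokens               # ~20.05 ms per generated token

print(f"prefill {prompt_tps:.2f} t/s, decode {eval_tps:.2f} t/s ({ms_per_token:.2f} ms/token)")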
If the correctness tests pass, expect noticeably faster DS4 inference on DGX Spark, and especially much flatter prefill as context grows. Coming to the repo soon if everything goes as expected.
hot take: unrestricted social media algorithms are as dangerous as weapons of mass destruction
you can literally reprogram the minds of billions of humans
@GaryMarcus I've recently started using it for all general searches and for automating basic tasks. Experimenting with code as well, but not as much as Codex / Claude yet.
DS4 running on DGX Spark (GB10 / CUDA), private branch for now. 12 tokens/sec; memory bandwidth is the limit on this system, at 270 GB/s. But prefill is much more aligned with the M3 Max, at ~200 t/s. I'll release it when it's more mature, but it's almost certain to get merged.
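As a back-of-the-envelope on why decode sits around 12 t/s: at batch size 1 every generated token has to stream the active weights (plus KV cache) from memory once, so decode speed is roughly bounded by bandwidth divided by bytes read per token. A small Python sketch using only the numbers from the post; the implied per-token read below is derived from those numbers, not measured:

# Bandwidth-bound decode estimate at batch size 1.
bandwidth_gb_s = 270.0  # DGX Spark (GB10) memory bandwidth
reported_tps = 12.0     # observed DS4 decode speed

# If decode is fully bandwidth-bound, this is how much data each token streams:
implied_gb_per_token = bandwidth_gb_s / reported_tps
print(f"~{implied_gb_per_token:.1f} GB read per generated token")  # ~22.5 GB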