The_tale_May_told...

2K posts


The_tale_May_told...

@_2nji_

M.S. NLP, Technical artist - writer @actvt_io @comicverse_ai

Paris, France · Joined August 2012
616 Following · 423 Followers
Ursula von der Leyen@vonderleyen·
Our app ticks all the boxes.
✅ Highest privacy standards in the world
✅ Works on any device
✅ Easy to use
✅ Fully open source
531 replies · 113 reposts · 554 likes · 200.4K views
ÆON FORGE ✨@SpaceTimeViking·
@_2nji_ @TheAhmadOsman @heydave7 They do, 100%, at least with Qwen; with the Gemma 4 ones it’s harder to tell. That said, 50 tok/s MoE vs 13 tok/s dense makes the selection more complicated on something like a DGX Spark.
2 replies · 0 reposts · 1 like · 55 views
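A quick back-of-the-envelope on what those two throughputs mean for wait time. The tok/s figures come from the tweet above; the 500-token response length is an assumed figure purely for illustration.

```python
# Back-of-the-envelope: generation wait time at the throughputs quoted above.
# The 500-token response length is an illustrative assumption.
moe_tps, dense_tps = 50, 13   # tok/s figures from the tweet
response_tokens = 500

for name, tps in [("MoE", moe_tps), ("Dense", dense_tps)]:
    print(f"{name}: {response_tokens / tps:.0f} s for a {response_tokens}-token reply")

# MoE: 10 s for a 500-token reply
# Dense: 38 s for a 500-token reply
```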
Dave Lee@heydave7·
This afternoon I picked up a new Nvidia DGX Spark computer with the goal of trying to run Gemma 4 31b (4bit) on it locally as a server. Just 1.5 hours later, it’s working! I’m using Open WebUI on my MacBook as the interface, connected to my DGX Spark running as a Gemma 4 server.
99 replies · 30 reposts · 964 likes · 129.4K views
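A minimal sketch of the client side of a setup like this, assuming the DGX Spark exposes an OpenAI-compatible chat endpoint (the kind of API Open WebUI connects to). The hostname, port, and model id are placeholders, not details from the tweet.

```python
# Minimal sketch: a MacBook talking to a DGX Spark that serves Gemma over an
# OpenAI-compatible API. Hostname, port, and model id are hypothetical.
import requests

SPARK_URL = "http://dgx-spark.local:8000/v1/chat/completions"  # placeholder

resp = requests.post(SPARK_URL, json={
    "model": "gemma-4-31b-4bit",  # placeholder model id
    "messages": [{"role": "user", "content": "Hello from the MacBook!"}],
})
print(resp.json()["choices"][0]["message"]["content"])
```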
The_tale_May_told...@_2nji_·
@SpaceTimeViking @TheAhmadOsman @heydave7 In my experience, the dense models (Qwen and Gemma) have a certain oomph over their MoE counterparts. I use Claude Code for agentic stuff, and my major local LLM use cases are web search and an LLM server for development, so the dense latency is fine for me.
2 replies · 0 reposts · 1 like · 74 views
The_tale_May_told...@_2nji_·
One of my favourite use cases for my local LLM is having extremely weird discussions.
0 replies · 0 reposts · 0 likes · 17 views
sparkarena@spark_arena·
Check out how nvidia/Gemma-4-31B-IT-NVFP4 achieved 10.96 tokens/sec on text generation on NVIDIA DGX Spark with vllm! View full benchmark at spark-arena.com/benchmark/sub1…
2 replies · 0 reposts · 9 likes · 478 views
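For context, a minimal sketch of loading that checkpoint with vLLM's offline Python API. This is not spark-arena's benchmark harness, just the basic generation loop; vLLM typically reads the quantization config from the checkpoint itself, so no explicit quantization flag is assumed here.

```python
# Sketch of loading the NVFP4 checkpoint with vLLM's offline API.
# Not the benchmark harness from the tweet; sampling settings are arbitrary.
from vllm import LLM, SamplingParams

llm = LLM(model="nvidia/Gemma-4-31B-IT-NVFP4")
params = SamplingParams(temperature=0.7, max_tokens=256)

outputs = llm.generate(["Explain NVFP4 quantization in one paragraph."], params)
print(outputs[0].outputs[0].text)
```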
The_tale_May_told...@_2nji_·
I think they have niches they’re good at; your task was more suited to GPT 5.4. It reminds me of using both Opus 4.6 and GPT 5.4 on a CTF task: Opus found about 5 candidates and GPT 5.4 about 3. Both had a first option that was very plausible and didn’t appear on the other’s list, and all their other options were slop.
0 replies · 0 reposts · 0 likes · 664 views
antirez@antirez·
The difference between the two, for serious engineering work, is simply brutal. Claude Code with Opus is, when the task at hand is very complicated, borderline useless, while GPT 5.4 can do reverse engineering, mixing hardware knowledge, major disassembly skills, and so on.
10 replies · 52 reposts · 794 likes · 108.7K views
antirez@antirez·
During the last week I executed very long autonomous sessions of Claude Code Opus 4.6 and Codex GPT 5.4 (both at max thinking budget), in cloned directories (refreshed every time one was behind). I burned a lot of tokens (flat rate, my OSS free account + my PRO account)...
64 replies · 178 reposts · 2.2K likes · 912.2K views
Benjamin Marie@bnjmn_marie·
List of quantized Gemma 4 31B I’m evaluating:
- Intel/gemma-4-31B-it-int4-AutoRound (19.2 GB)
- cyankiwi/gemma-4-31B-it-AWQ-4bit (20.5 GB)
- RedHatAI/gemma-4-31B-it-NVFP4 (23.3 GB)
- nvidia/Gemma-4-31B-IT-NVFP4 (32.7 GB)
- RedHatAI/gemma-4-31B-it-FP8-block (33.3 GB)
→ yes, NVIDIA’s NVFP4 checkpoint is as large as an FP8 checkpoint. This is what happens when you don’t quantize the attention layers of a dense model.
16 replies · 9 reposts · 183 likes · 15.7K views
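The arithmetic behind that observation, as a rough sketch. The 30/70 attention/FFN weight split and the ~4.5 effective bits per weight for NVFP4 (4-bit values plus scale overhead) are illustrative assumptions, not measured figures for Gemma 4 31B.

```python
# Rough arithmetic: if attention weights stay in bf16 while only FFN weights
# go to 4-bit, the checkpoint lands near FP8 size. The 30% attention share
# and 4.5 effective bits for NVFP4 are illustrative assumptions.
params = 31e9
attn_frac = 0.30  # assumed share of weights in attention layers

def size_gb(attn_bits, other_bits):
    bits = params * (attn_frac * attn_bits + (1 - attn_frac) * other_bits)
    return bits / 8 / 1e9

print(f"all FP8:               {size_gb(8, 8):5.1f} GB")      # ~31.0 GB
print(f"all NVFP4 (~4.5 b/w):  {size_gb(4.5, 4.5):5.1f} GB")  # ~17.4 GB
print(f"bf16 attn + NVFP4 FFN: {size_gb(16, 4.5):5.1f} GB")   # ~30.8 GB, FP8-sized
```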
The_tale_May_told...@_2nji_·
Even if we never get AGI, these transformer-based models are very complementary to human intelligence: they fill the gap of memory storage with near-perfect retrieval. Having that at the scale of the internet is the real win.
0 replies · 0 reposts · 1 like · 30 views
The_tale_May_told...@_2nji_·
@fishright Haha, I absolutely love Qwen3.5 27B, so I’m not in a hurry to put in the extra effort it takes to use Gemma 4 31B.
0 replies · 0 reposts · 0 likes · 48 views
Mike McQuade@fishright·
@_2nji_ I think it’s better on Mac, it spends less time thinking.
1 reply · 0 reposts · 1 like · 20 views
The_tale_May_told...@_2nji_·
What's the vibe for Gemma 4 31B? I haven't found any comprehensive benchmark comparison to Qwen 3.5 27B. It also seems to be broken on LM Studio.
1 reply · 0 reposts · 0 likes · 419 views
The_tale_May_told...@_2nji_·
@AndrewMayne @ClementDelangue @NaveenGRao Alt alt take: because of the compute constraints, APIs become stupidly expensive (high demand & low supply) compared with the same amount of tokens on their direct products, which still makes it scary and uneconomical to compete.
0 replies · 0 reposts · 0 likes · 22 views
Andrew Mayne@AndrewMayne·
Alt take: Both OpenAI and Anthropic have mega-sized deals with AWS, Azure, etc. to serve their models via API because these are such huge money earners for them. These are contractual obligations going into the next decade. APIs will continue to serve the majority of tokens. While Anthropic is struggling with capacity issues, their API business is huge and a major part of their revenue. I don't think they enjoy limiting API access right now and wish they were in OpenAI's shoes regarding capacity. But they're very smart and have lots of money. They'll figure it out. Meanwhile, OpenAI has even talked about eventually renting capacity in Stargate to other providers.
1 reply · 0 reposts · 15 likes · 1K views
clem 🤗@ClementDelangue·
I think it’s @NaveenGRao who said it before but wouldn’t be surprised if the frontier labs cut their APIs entirely at some point. In a compute constrained world, they’ll always prioritize their own direct products/customers. Makes it scary and unsustainable to only build on top of their APIs!
55 replies · 36 reposts · 496 likes · 77.4K views
The_tale_May_told...@_2nji_·
@TheAhmadOsman I use Qwen 3.5 27B with llama.cpp on DGX Spark, average wait time is decent and I think there’s room to optimize my setup.
0 replies · 0 reposts · 0 likes · 257 views
Ahmad@TheAhmadOsman·
Which model to use locally with Hermes agent?
On unified memory hardware* > Gemma 4 26B-A4B
On GPUs > Qwen 3.5 27B
* Mac Studio, DGX Spark, MacBook, etc.
55 replies · 11 reposts · 422 likes · 40.2K views
The_tale_May_told...@_2nji_·
@menhguin I have an agent skill that uses rsync and /tmp to transfer files across devices in my tailnet; it feels like one unified file system.
0 replies · 0 reposts · 0 likes · 32 views
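A minimal sketch of what such a skill could look like: shelling out to rsync over SSH toward a Tailscale MagicDNS hostname. The device name, file names, and paths are hypothetical, not details from the tweet.

```python
# Sketch of the kind of helper described above: push a file to another tailnet
# device's /tmp via rsync over SSH. "dgx-spark" is a hypothetical MagicDNS
# hostname; error handling is minimal on purpose.
import subprocess

def push_to_device(local_path: str, device: str, remote_name: str) -> None:
    """Copy local_path to /tmp/<remote_name> on a tailnet device."""
    subprocess.run(
        ["rsync", "-az", local_path, f"{device}:/tmp/{remote_name}"],
        check=True,
    )

push_to_device("results.json", "dgx-spark", "results.json")  # placeholder host
```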
Ahmad@TheAhmadOsman·
I am happy to announce 2 new additions to the x/LocalLLaMA community mod team:
- 0xSero
- sudoingX
Hoping this helps streamline growth of the local inference community and deliver grounded, real content over hype. Find the community spotlighted at the top of our profiles to join.
28 replies · 14 reposts · 458 likes · 26.6K views