Aivan Monceller
@aivandroid

956 posts

I share interesting things I find, what I’m learning, and projects I’m trying. Follow me if you like tech, creative stuff, and learning new things. INFJ-T

Singapore ⇄ Philippines · Joined March 2018
995 Following · 183 Followers
Rohan Joshi @ron_joshi
Introducing Kitten TTS V0.8: open-source TTS that fits in 25MB. Three variants: 80M | 40M | 14M (<25MB). Highly expressive. Runs on CPU. Built for edge. No GPU? No problem. Ship voice anywhere. Check it out:
77 replies · 208 reposts · 1.8K likes · 99.9K views
gabriel @gabriel1
only bottleneck is consuming code, so make sure to tell codex that you want just that: "write extremely easy to consume code, optimize for how easy the code is to read. make the code skimmable. avoid cleverness. use early returns."
70 replies · 61 reposts · 1.9K likes · 110.2K views
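The prompt above is easy to demonstrate concretely. A minimal sketch of what "skimmable, early-return" code looks like in practice (the `can_ship` function and its fields are hypothetical, not from any tweet in this thread):

```python
def can_ship(order) -> bool:
    """Early-return style: each guard reads top to bottom,
    and the happy path is the final line."""
    if order is None:
        return False
    if not order.get("paid"):
        return False
    if not order.get("address"):
        return False
    return True

# The "clever" equivalent packs the same logic into one line,
# but forces the reader to mentally unpack it:
#   return bool(order and order.get("paid") and order.get("address"))
```

Same behavior either way; the early-return version is the one a reviewer (or a coding agent) can verify guard by guard.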
Thomas Ricouard @Dimillian
Codex is one-shotting a port of a full Animal Crossing PC build to macOS, and from 32-bit to 64-bit.
[image]
32 replies · 18 reposts · 532 likes · 46.7K views
Aivan Monceller @aivandroid
Stack Overflow just turned their users into a free labeling farm. That new "Challenges" tab isn't for fun. It’s a solution to the LLM data contamination problem. By creating "fresh" puzzles, they generate data that no AI has seen yet. They aren't a Q&A site anymore. They are a data moat.
[image]
1 reply · 0 reposts · 0 likes · 24 views
Alex Ellis @alexellisuk
@chriswinfield @UnslothAI Nothing special in this case. It looks like @UnslothAI did all the work in the GGUF file. The unlock for me was getting off Ollama and moving to llama.cpp directly. Newer builds and more flexible tuning.
1 reply · 0 reposts · 0 likes · 573 views
Alex Ellis @alexellisuk
I did a thing.. and now local Claude feels _really_ fast. (Qwen3.5-35B-A3B Q4 M from @UnslothAI - Superterm for the terminal/AI manager)
8 replies · 11 reposts · 285 likes · 36.8K views
Aivan Monceller @aivandroid
why moving checkpoints to s3 is a bottleneck: it treats every 10GB save as a 100% new file. @huggingface’s new Storage Buckets use Xet deduplication to solve the training-to-inference loop: Training: it only uploads the 5% weight changes (content-defined chunking). Inference: localized CDN cache lets you stream models directly to GPU nodes without the pull-then-load lag. no git overhead. no full-file re-syncs. just deltas.

Quoting Victor M @victormustar
Introducing Storage Buckets on Hugging Face 🧑‍🚀 The first new repo type on the Hub in 4 years: S3-like object storage, mutable, non-versioned, built on Xet deduplication. Starting at $8/TB/mo. That's 3x cheaper than S3. You (and your coding agents) need somewhere to dump checkpoints, logs, and artifacts. Now they have a home.

0 replies · 0 reposts · 1 like · 42 views
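The "content-defined chunking" mentioned above can be sketched in a few lines. This is a toy illustration of the idea, not Xet's actual algorithm: a rolling hash over roughly the last 16 bytes decides where chunks end, so cut points depend on content rather than offsets and re-synchronize shortly after an edit.

```python
import hashlib

def chunks(data: bytes, mask: int = 0x3F) -> list[bytes]:
    """Toy content-defined chunker: cut wherever a rolling hash
    matches a boundary pattern (~64-byte average chunks here)."""
    out, start, h = [], 0, 0
    for i, b in enumerate(data):
        h = ((h << 1) ^ b) & 0xFFFF
        if h & mask == mask:  # boundary hit
            out.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        out.append(data[start:])
    return out

def upload_fraction(old: bytes, new: bytes) -> float:
    """Fraction of the new file's chunks not already stored."""
    stored = {hashlib.sha256(c).digest() for c in chunks(old)}
    fresh = chunks(new)
    missing = sum(1 for c in fresh if hashlib.sha256(c).digest() not in stored)
    return missing / len(fresh)
```

Prepending a header to a checkpoint only invalidates the chunks near the edit; a fixed-offset scheme (or a whole-object store like S3) would re-upload everything.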
Yuchen Jin @Yuchenj_UW
GPT-5.4 xhigh seems bad at following instructions. Last night I launched two AI research agents running @karpathy’s autoresearch.

Claude Opus 4.6 (high):
> ran for 12+ hours, 118 experiments done, still running

GPT-5.4 xhigh:
> stopped after 6 experiments
> blamed me for “manually interrupting” it
> I interrogated it
> It admitted it made a mistake and stopped the loop itself, despite an explicit LOOP FOREVER instruction in the md file. 💀

[image]
160 replies · 72 reposts · 1.5K likes · 238.4K views
Fahd Mirza @fahdmirza
💥 We ran Qwen3.5 9B at INT4 with ParoQuant — and it barely lost anything 🌀
♠ A brand new ICLR 2026 quantization method that finally solves reasoning accuracy at 4-bit 🔬
🔹 4x smaller than FP16 — runs locally on consumer hardware
🔹 AIME-24 score drops only 2.3 points vs FP16 — AWQ drops 13
🔹 Smarter than AWQ, faster than QTIP — best of both worlds
🔹 Single fused CUDA kernel — less than 10% overhead
🔹 Works with vLLM and Open WebUI out of the box
🔹 Tested live with a reasoning prompt to see if the quality holds
🔹 The quantization method that reasoning models actually needed 🔥
Watch the full video below 👇
16 replies · 14 reposts · 224 likes · 18.3K views
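For context, here is what baseline group-wise 4-bit round-to-nearest quantization looks like. This is the simple scheme that methods like AWQ and ParoQuant improve upon, not ParoQuant itself: each group of weights shares one scale, and values are rounded into the 16-level INT4 range.

```python
import numpy as np

def quantize_int4(w: np.ndarray, group: int = 64):
    """Group-wise round-to-nearest INT4: each run of `group` weights
    shares one FP16 scale; quantized values land in [-8, 7]."""
    g = w.reshape(-1, group)
    scale = np.maximum(np.abs(g).max(axis=1, keepdims=True) / 7.0, 1e-8)
    q = np.clip(np.round(g / scale), -8, 7).astype(np.int8)
    return q, scale.astype(np.float16)

def dequantize_int4(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    """Reconstruct approximate FP32 weights from INT4 codes + scales."""
    return (q.astype(np.float32) * scale.astype(np.float32)).ravel()
```

Four bits per weight plus one FP16 scale per 64 weights is about 4.25 bits/weight, hence the "4x smaller than FP16" figure; the hard part such methods tackle is keeping the per-weight rounding error from compounding across a long reasoning chain.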
Eric @Ex0byt
3/9/26 Experiment Updates — Qwen 3.5 INT4 · Raw Safetensors · WebGPU · M4 Max · Pure WGSL
220 tok/s sustained on 0.8B — flat regardless of context window size.
Full family: 0.8B: ~220 tok/s · 2B: 163 tok/s · 4B: 74 tok/s · 9B: 55 tok/s · 27B: 17 tok/s (49% of theoretical max)
Three underutilization enhancements that got us here:
→ DeltaNet WG 256→128 — 50% of threads were idle → zero wasted SIMD slots
→ MLP gate+SiLU 4T split — 192→768 workgroups → 4× memory pipeline saturation
→ Major breakthrough: FlashDecode splits 32→64 — tok/s now stays flat as context size grows!
[image]
5 replies · 1 repost · 43 likes · 3.9K views
Aivan Monceller @aivandroid
@alexellisuk @stevibe Yes, I did not use a chat template file; I followed the official Unsloth docs. They did not mention the need to customize these parameters, so I thought you intentionally needed that for some reason. It compacted multiple times, so it did reach 200k context several times.
0 replies · 0 reposts · 0 likes · 20 views
Alex Ellis @alexellisuk
@aivandroid @stevibe Yes, this seems to be a persistent issue. I noticed that you didn't use --jinja or --chat-template-file? How much context did you get up to?
1 reply · 0 reposts · 0 likes · 20 views
stevibe @stevibe
The RTX 3090 is a 5-year-old GPU and it still runs a 27B model at 20 tok/s. I tested Qwen3.5:27b across 3 generations of NVIDIA:
5090 → ~60 tok/s
4090 → ~40 tok/s
3090 → ~20 tok/s
Perfectly linear scaling. Double the generation, double the speed.
83 replies · 108 reposts · 1.3K likes · 140.9K views
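The near-proportional scaling reported above is what you would expect from a memory-bandwidth-bound workload: single-stream decoding must stream every weight once per generated token, so the ceiling is roughly memory bandwidth divided by model size. A back-of-envelope sketch (the 0.55 bytes/parameter figure is an approximation for a Q4_K_M-style 4-bit quant, an assumption rather than a measured value):

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float, params_b: float,
                         bytes_per_param: float = 0.55) -> float:
    """Upper bound on single-stream decode speed for a dense model:
    each token reads all weights once, so tok/s <= bandwidth / model size."""
    model_gb = params_b * bytes_per_param
    return bandwidth_gb_s / model_gb

# RTX 3090 memory bandwidth is ~936 GB/s, so a 27B 4-bit model has a
# ceiling of roughly 936 / (27 * 0.55) ≈ 63 tok/s; the observed
# 20 tok/s is about a third of that bound, with the rest lost to
# kernel overheads, attention, and KV-cache traffic.
```

On this model, doubling memory bandwidth doubles the ceiling, which is the pattern the 3090 → 4090 → 5090 numbers trace.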
Alex Ellis @alexellisuk
@aivandroid @stevibe Third time trying to get it to write out its analysis ~34k context into a markdown file. Runs intensely for >2 mins every time and never writes a file or any text to disk.
[image]
1 reply · 0 reposts · 0 likes · 35 views
Aivan Monceller @aivandroid
@alexellisuk @stevibe I've been using it continuously yesterday over the course of 3 hours on Claude Code, improving the same single codebase on a single 3090. It's slower when it compacts, but it never really halted.
1 reply · 0 reposts · 1 like · 44 views
Alex Ellis @alexellisuk
@aivandroid @stevibe Thanks, glad I could help. Still a bit bumpy.. I got a great deep dive after almost 5m of processing - it produced beautiful and concise output. But then I said "now write that response to ARCH.md" and it took several more minutes then did nothing at all. Like I've seen before..
[images]
1 reply · 0 reposts · 0 likes · 29 views
Ostris @ostrisai
Codex, Claude Code, or something else? What is the current best for a big repo, this week?
20 replies · 0 reposts · 24 likes · 5.6K views
Aivan Monceller @aivandroid
Try this 262k context on 3090 this time, I learned from you. This is very usable in Claude Code:
llama-server -hf unsloth/Qwen3.5-27B-GGUF:Q4_K_M -np 1 -ngl 99 -c 262144 -fa on \
  --cache-type-k q4_0 --cache-type-v q4_0 \
  --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.00 --presence_penalty 0.0
1 reply · 0 reposts · 1 like · 75 views
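Why the q4_0 KV-cache flags matter at -c 262144: KV memory scales linearly with context length, and llama.cpp's q4_0 stores 4.5 bits per element (18 bytes per 32-element block) versus 16 bits for f16. A sketch of the arithmetic, using illustrative GQA dimensions (32 layers, 4 KV heads of dim 128 are assumptions for the example, not the real Qwen3.5-27B config):

```python
def kv_cache_gb(ctx: int, layers: int, kv_heads: int,
                head_dim: int, bytes_per_elt: float) -> float:
    """KV cache = 2 tensors (K and V) x layers x kv_heads x head_dim
    x context length x bytes per element."""
    return 2 * layers * kv_heads * head_dim * ctx * bytes_per_elt / 1e9

# Illustrative dimensions only (assumed, not the real model config):
f16 = kv_cache_gb(262144, 32, 4, 128, 2.0)     # ~17.2 GB at f16
q4  = kv_cache_gb(262144, 32, 4, 128, 0.5625)  # ~4.8 GB at q4_0
```

The roughly 3.6x reduction is what makes a 262k-token cache plausible next to a ~16 GB Q4_K_M model on a 24 GB card.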
Alex Ellis @alexellisuk
Got the same error again with: CLAUDE_AUTOCOMPACT_PCT_OVERRIDE=80
⎿ API Error: 400 {"error":{"code":400,"message":"request (66049 tokens) exceeds the available context size (66048 tokens), try increasing it","type":"exceed_context_size_error","n_prompt_tokens":66049,"n_ctx":66048}}
1 reply · 0 reposts · 0 likes · 71 views
Aivan Monceller reposted
Melvin Vivas @DonvitoAI
This is output of Qwen3.5-27B-GGUF by @UnslothAI using Claude Code and the frontend skill. This is promising!!! Coding just using a local model on a consumer GPU (RTX 3090). Use these settings to run:
llama-server -hf unsloth/Qwen3.5-27B-GGUF:Q4_K_M -ngl 99 -c 262144 -fa on \
  --cache-type-k q4_0 --cache-type-v q4_0 \
  --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.00
Thanks @aivandroid
[image]
0 replies · 1 repost · 1 like · 194 views
Aivan Monceller @aivandroid
@thsottiaux make it configurable to the point that there is no reason to fork it for most use cases.
0 replies · 0 reposts · 0 likes · 23 views
Tibo @thsottiaux
With GPT-5.4 out, what should Codex ship or improve next?
1.1K replies · 17 reposts · 1.2K likes · 112.1K views