Aivan Monceller

956 posts

Aivan Monceller

@aivandroid

I share interesting things I find, what I’m learning, and projects I’m trying. Follow me if you like tech, creative stuff, and learning new things. INFJ-T

Sinapore ⇄ Philippines 가입일 Mart 2018

995 팔로잉183 팔로워

고정된 트윗

Aivan Monceller@aivandroid·2 Ara

Created a video how to train Flux LoRAs using my convenience script utilizing ai-toolkit on @quickpodio youtu.be/0I06JHyvuwQ

YouTube

English

2.9K

Aivan Monceller 리트윗함

Theo - t3.gg@theo·10h

Responses to this are hilarious. Codex implemented this correctly in 15 minutes btw

Theo - t3.gg@theo

Just let Opus go for over an hour on a new feature. When it was done, I asked how I can test it. 20 minutes later, it realized I can't test it because it did the whole thing entirely wrong. Idk how you guys use this model every day for real work 🙃

English

721

85.9K

Aivan Monceller@aivandroid·12h

@ron_joshi Can it clone voice?

English

Rohan Joshi@ron_joshi·22h

Introducing Kitten TTS V0.8: open-source TTS that fits in 25MB. Three variants: 80M | 40M | 14M (<25MB) Highly expressive. Runs on CPU. Built for edge. No GPU? No problem. Ship voice anywhere. Check it out:

English

208

1.8K

99.9K

Aivan Monceller@aivandroid·1d

@gabriel1 Only if I want to read the code.

English

gabriel@gabriel1·2d

only bottleneck is consuming code, so make sure to tell codex that you want just that: "write extremely easy to consume code, optimize for how easy the code is to read. make the code skimmable. avoid cleverness. use early returns."

English

1.9K

110.2K

Aivan Monceller@aivandroid·2d

@Dimillian Definitely my cup of tea ..

English

Thomas Ricouard@Dimillian·3d

Codex is one-shotting a full Animal Crossing PC port to a macOS port, and from 32 to 64 bit

English

532

46.7K

Aivan Monceller@aivandroid·2d

Stack Overflow just turned their users into a free labeling farm. That new "Challenges" tab isn't for fun. It’s a solution to the LLM data contamination problem. By creating "fresh" puzzles, they generate data that no AI has seen yet. They aren't a Q&A site anymore. They are a data moat.

English

Aivan Monceller@aivandroid·6d

@alexellisuk @chriswinfield @UnslothAI What changed since our last conversation? , it just works right?

English

Alex Ellis@alexellisuk·6d

@chriswinfield @UnslothAI Nothing special in this case. It looks like @UnslothAI did all the work in the GGUF file. The unlock for me was getting off Ollama and moving to llama.cpp directly. Newer builds and more flexible tuning.

English

573

Alex Ellis@alexellisuk·13 Mar

I did a thing.. and now local Claude feels _really_ fast. (Qwen3.5-35B-A3B Q4 M from @UnslothAI - Superterm for the terminal/AI manager)

English

285

36.8K

Aivan Monceller@aivandroid·12 Mar

why moving checkpoints to s3 is a bottleneck: it treats every 10GB save as a 100% new file. @huggingface’s new Storage Buckets use Xet deduplication to solve the training to inference loop: Training: it only uploads the 5% weight changes (content defined chunking). Inference: localized CDN cache lets you stream models directly to GPU nodes without the pull-then-load lag. no git overhead. no full-file re-syncs. just deltas.

Victor M@victormustar

Introducing Storage Buckets on Hugging Face 🧑‍🚀 The first new repo type on the Hub in 4 years: S3-like object storage, mutable, non-versioned, built on Xet deduplication. - Starting at $8/TB/mo. That's 3x cheaper than S3. You (and your coding agents) need somewhere to dump checkpoints, logs, and artifacts. Now they have a home.

English

Aivan Monceller@aivandroid·10 Mar

@LeeLeepenkman @karpathy @Yuchenj_UW Do you keep up with upstream?

English

Lee Penkman@LeeLeepenkman·10 Mar

@karpathy @Yuchenj_UW good ad for my codex fork btw - plz try github.com/lee101/codex-i… see what you think :D a lot of this was a issue with prompting inside codex itself

English

189

Yuchen Jin@Yuchenj_UW·9 Mar

GPT-5.4 xhigh seems bad at following instructions. Last night I launched two AI research agents running @karpathy’s autoresearch. Claude Opus 4.6 (high): > ran for 12+ hours, 118 experiments done, still running GPT-5.4 xhigh: > stopped after 6 experiments > blamed me for “manually interrupting” it > I interrogated it > It admitted it made a mistake and stopped the loop itself, despite an explicit LOOP FOREVER instruction in the md file. 💀

English

160

1.5K

238.4K

Aivan Monceller@aivandroid·10 Mar

@fahdmirza @zhijianliu_ @QuixiAI Please release 27B and support GGUF , will test in a heartbeat

English

Fahd Mirza@fahdmirza·10 Mar

youtu.be/J3EdGgez1-I?si… @zhijianliu_ @QuixiAI

YouTube

QME

3.7K

Fahd Mirza@fahdmirza·10 Mar

💥 We ran Qwen3.5 9B at INT4 with ParoQuant — and it barely lost anything 🌀 ♠ A brand new ICLR 2026 quantization method that finally solves reasoning accuracy at 4-bit 🔬 🔹 4x smaller than FP16 — runs locally on consumer hardware 🔹 AIME-24 score drops only 2.3 points vs FP16 — AWQ drops 13 🔹 Smarter than AWQ, faster than QTIP — best of both worlds 🔹 Single fused CUDA kernel — less than 10% overhead 🔹 Works with vLLM and Open WebUI out of the box 🔹 Tested live with a reasoning prompt to see if the quality holds 🔹 The quantization method that reasoning models actually needed 🔥 Watch the full video below 👇

English

224

18.3K

Aivan Monceller@aivandroid·9 Mar

@Ex0byt Will this work on an Nvidia GPU?

English

147

Eric@Ex0byt·9 Mar

3/9/26 Experiment Updates: — Qwen 3.5 INT4 · Raw Safetensors · WebGPU · M4 Max · Pure WGSL 220 tok/s sustained on 0.8B — flat regardless of context window size. Full family: 0.8B: ~220tok/s · 2B: 163 tok/s· 4B: 74 tok/s· 9B: 55 tok/s · 27B: tok/s 17 tok/s (49% of theoretical max) Three underutilization enhancements that got us here: → DeltaNet WG 256→128 — 50% threads were idle → zero wasted SIMD slots → MLP gate+SiLU 4T split — 192→768 workgroups → 4× memory pipeline saturation → Major breaktorugh: FlashDecode splits 32→64 — tok/s now stays flat as context size grows!

English

3.9K

Aivan Monceller@aivandroid·9 Mar

@ToNYD2WiLD @DonvitoAI @ComfyUI What do you mean by using WAN to run LTX. The WAN I know is the WAN video models. Is there an inference engine called WAN? is this Wan2GP?

English

ToNYD2WiLD@ToNYD2WiLD·8 Mar

@aivandroid @DonvitoAI @ComfyUI I’m using WAN to run LTX though but I’ll try comfy to see if it works now it was giving issues

English

Melvin Vivas@DonvitoAI·7 Mar

Running LTX 2.3 locally in @ComfyUI RTX 3090

English

375

Aivan Monceller@aivandroid·9 Mar

@alexellisuk @stevibe Yes I did not use a chat template file, I followed the official docs of unsloth. They did not mention the need to customize these parameters. So I thought you intentionally needed that for some reason. It compacted multiple times , so it did reach 200k context several times.

English

Alex Ellis@alexellisuk·9 Mar

@aivandroid @stevibe Yes this seems to be a persistent issue. I noticed that you didn't use --jinja or --chat-template-file? How much context did you get up to?

English

stevibe@stevibe·7 Mar

The RTX 3090 is a 5-year-old GPU and it still runs a 27B model at 20 tok/s I tested Qwen3.5:27b across 3 generations of NVIDIA: 5090 → ~60 tok/s 4090 → ~40 tok/s 3090 → ~20 tok/s Perfectly linear scaling. Double the generation, double the speed.

English

108

1.3K

140.9K

Aivan Monceller@aivandroid·9 Mar

@alexellisuk @stevibe I wonder why, it doesn't have a problem writing and updating files for me.

English

Alex Ellis@alexellisuk·9 Mar

@aivandroid @stevibe Third time trying to get it to write out its analysis ~ 34k context into a markdown file. Runs intensely for > 2mins every time and never writes a file or any text to disk.

English

Aivan Monceller@aivandroid·9 Mar

@alexellisuk @stevibe I've been using it continously yesterday over the course of 3 hours on claude code , improving the same single codebase, on a single 3090 , it's slower when it compacts but never really halted

English

Alex Ellis@alexellisuk·9 Mar

@aivandroid @stevibe Thanks glad I could help. Still a bit bumpy.. I got a great deep dive after almost 5m of processing - it produced beautiful and concise output But then I said "now write that response to ARCH.md" and it took several more minutes then did nothing at all Like I've seen before..

English

Aivan Monceller@aivandroid·9 Mar

@CoinhubD @ostrisai How do you teach it overtime?

English

Altered Suda | thesudio@CoinhubD·24 Şub

@ostrisai Kimi on a run pod and just teach it over time?

English

Ostris@ostrisai·24 Şub

Codex, Claude Code, or something else? What is the current best for a big repo, this week.

English

5.6K

Aivan Monceller@aivandroid·9 Mar

Try this 262k context on 3090 this time, I learned from you. This is very useable nn claude code llama-server -hf unsloth/Qwen3.5-27B-GGUF:Q4_K_M -np 1 -ngl 99 -c 262144 -fa on --cache-type-k q4_0 --cache-type-v q4_0 --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.00 --presence_penalty 0.0

English

Alex Ellis@alexellisuk·7 Mar

Got the same error again with: CLAUDE_AUTOCOMPACT_PCT_OVERRIDE=80 ⎿ API Error: 400 {"error":{"code":400,"message":"request (66049 tokens) exceeds the available context size (66048 tokens), try increasing it","type":"exceed_context_size_error","n_prompt_tokens":66049,"n_ctx":66048}}

English

Aivan Monceller@aivandroid·8 Mar

@ToNYD2WiLD @DonvitoAI @ComfyUI WAN times are not the same as LTX times. WAN is slower

English

ToNYD2WiLD@ToNYD2WiLD·8 Mar

@DonvitoAI @ComfyUI im doing something wrong then, im doing WAN and it is taking 30mins

English

Aivan Monceller 리트윗함

Melvin Vivas@DonvitoAI·7 Mar

This is output of Qwen3.5-27B-GGUF by @UnslothAI Using Claude Code and the frontend skill This is promising!!! Coding just using a local model in a consumer gpu RTX 3090 Use this settings to run llama-server -hf unsloth/Qwen3.5-27B-GGUF:Q4_K_M -ngl 99 -c 262144 -fa on --cache-type-k q4_0 --cache-type-v q4_0 --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.00 Thanks @aivandroid