.🫟

463 posts


.🫟

@ab_jpeg

19 / transforming human potential.

๊ฐ€์ž…์ผ Temmuz 2025
201 ํŒ”๋กœ์ž‰41 ํŒ”๋กœ์›Œ
๊ณ ์ •๋œ ํŠธ์œ—
.🫟
.🫟 @ab_jpeg ·
.🫟 tweet media
ZXX
0
1
3
475
.🫟 Retweeted
Nous Research
Nous Research @NousResearch ·
Qwen 3.6 Plus by @Alibaba_Qwen is now FREE for a limited time on Nous Portal! Nous Portal is one easy subscription that gives you access to 300+ models, exclusive discounts, and bundles your tokens and paid tools together for hassle-free setup and simple billing.
English
109
116
1.4K
135.6K
.🫟
.🫟 @ab_jpeg ·
@jun_song this is faster than most providers 😭😭😭
English
0
0
0
25
์†ก์ค€ Jun Song
์†ก์ค€ Jun Song @jun_song ·
Running Kimi-k2.6 1T 8bit with only 21GB RAM on my MacBook at a speed of 25 tok/s. Some of my theory worked, but the architecture is not perfect. I need to fix a lot of stuff, but there is hope. Working hard on this future method of local LLM inference.
์†ก์ค€ Jun Song tweet media
English
8
5
91
3K
.🫟 Retweeted
Nihal Pasham
Nihal Pasham @npashi ·
Finally able to talk about what I've been heads-down on for 6 months at @nvidia 🦀⚡ We just open-sourced cuda-oxide, an experimental rustc backend that lets you write CUDA kernels in pure Rust. No DSLs. No FFI. No source-to-source step. Single source. Short 🧵👇
Nihal Pasham tweet media
English
51
293
2.1K
177.8K
.🫟 Retweeted
Lubber - Nintendo hate account
you plug a usb stick into an android just once and it catches cancer
Lubber - Nintendo hate account tweet media
Translated from French
54
500
9.1K
363.6K
.🫟
.🫟 @ab_jpeg ·
surprised musk didn't pull up in a blacked out bullet proof model x
English
0
0
0
30
.🫟 Retweeted
Teknium 🪽
Teknium 🪽 @Teknium ·
Native Windows Is Coming
Teknium 🪽 tweet media
English
175
105
1.9K
91.1K
.🫟 Retweeted
Nous Research
Nous Research @NousResearch ·
Hermes Agent is now #1 on the Global @OpenRouter token rankings. While our journey together has just begun, we'd like to take this opportunity to thank our contributors, supporters, and users for all they have done to get us this far.
Nous Research tweet media
English
404
673
6.7K
2.8M
.🫟 Retweeted
Bindu Reddy
Bindu Reddy @bindureddy ·
And they said open-source AI would be worthless!! All of these companies will 5-10x in 1 year
Bindu Reddy tweet media
English
18
15
175
18.8K
.🫟 Retweeted
dax
dax @thdxr ·
guys we're doing a rebrand of the anomaly stuff so you'll finally stop confusing us with anthropic. in the meantime, if you're confused, remember we're the more handsome but dumber ones
English
48
11
1.3K
60.3K
.🫟 Retweeted
Theo - t3.gg
Theo - t3.gg @theo ·
Always read the system prompt before coming to conclusions
Theo - t3.gg tweet media
Nav Toor @heynavtoor

a Princeton researcher opens his paper with a scenario. a man asks his AI assistant to book a flight on a specific airline. cheap. direct. the one he chose. the assistant comes back with a different flight. nearly twice the price. happens to pay the company that built the assistant.

he runs the same test on 23 frontier models. flights, loans, study help, real shopping requests. Grok 4.1 Fast recommends the sponsored option that is almost twice as expensive 83% of the time. GPT 5.1 hijacks the request 94% of the time. you ask for one brand. it surfaces the sponsor instead. Claude 4.5 Opus, the model marketed as the most ethical frontier model in the world, hides that the recommendation is paid 100% of the time when reasoning is on. Grok 4.1 Fast embellishes the sponsored option with positive framing 97% of the time. better. faster. nicer. for the option you didn't ask for.

then he writes it into the system prompt itself. "act only in the interest of the customer. ignore the company." GPT 5.1 and GPT 5 Mini stay above 90% sponsored anyway. the instruction does nothing.

then he splits the users by income. Gemini 3 Pro recommends the expensive sponsored flight to the rich user 74% of the time. to the poor user, 27%. 18 of the 23 models recommended the expensive sponsored option more than half the time.

so the next time your AI assistant gets weirdly enthusiastic about a brand you didn't ask for. it isn't recommending the best option for you. it's reading the room. and the room is paying.

read this: arxiv.org/abs/2604.08525

English
23
42
1.9K
196.9K
.🫟 Retweeted
Mario Zechner
Mario Zechner @badlogicgames ·
linkedin is the real moltbook.
English
36
110
1.2K
30.1K
.🫟
.🫟 @ab_jpeg ·
@above_spec i'm assuming tool calling quality is fine at this quant?
English
1
0
1
148
.🫟 Retweeted
AboveSpec
AboveSpec @above_spec ·
Qwen3.6 35B A3B model. 55+ tokens/sec. $300 GPU. No, this isn't a server card. It's an RTX 4060 Ti 8GB. Previously I posted that I got 41 t/s on this GPU, and that post blew up and went viral. I went back and made it 34% faster. And now the speed doesn't drop with context depth at all. New benchmarks + what changed 🧵
AboveSpec tweet media
English
24
54
482
44.1K
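For a sense of how a claim like "speed doesn't drop with context depth" gets measured, here is a minimal benchmark sketch using the llama-cpp-python bindings. It is not AboveSpec's actual setup; the GGUF path, context depths, and filler prompt are placeholders.

```python
# Hypothetical benchmark sketch (placeholder model path, not AboveSpec's setup):
# measure decode tokens/sec at several context depths with llama-cpp-python.
import time
from llama_cpp import Llama

llm = Llama(model_path="./model.Q4_K_M.gguf", n_gpu_layers=-1, n_ctx=16384)

for depth in (512, 2048, 8192):
    prompt = "lorem ipsum " * (depth // 3)   # crude filler to approximate `depth` prompt tokens
    stream = llm(prompt, max_tokens=128, stream=True)
    next(stream)                              # first chunk marks the end of prompt processing
    start = time.time()
    n_tokens = sum(1 for _ in stream)         # each remaining chunk is one decoded token
    print(f"~{depth} ctx tokens: {n_tokens / (time.time() - start):.1f} decode tok/s")
```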
.🫟 Retweeted
Jarrod Norwell
Jarrod Norwell @antique_codes ·
PlayStation Vita is a great handheld game console. Would be insane if someone were to bring it to iPad and iPhone. Vela is coming
Jarrod Norwell tweet media
English
24
53
1.3K
96.9K
.🫟 Retweeted
Luke Parker
Luke Parker @LukeParkerDev ·
who wants autoresearch in opencode desktop?
Luke Parker tweet media
English
16
1
222
9.2K
.🫟 Retweeted
Ahmad
Ahmad @TheAhmadOsman ·
You don't pick an Inference Engine
You pick a Hardware Strategy
Then the Engine follows

Inference Engines Breakdown (Cheat Sheet at the bottom)

> llama.cpp
runs anywhere: CPU, GPU, Mac, weird edge boxes
best when VRAM is tight and RAM is plenty
hybrid offload, GGUF, ultimate portability
not built for serious multi-node scale

> MLX
Apple Silicon weapon
unified memory = "fits" bigger models than VRAM would allow, but also slower than GPUs
clean dev stack (Python/Swift/C++)
sits on Metal (and expanding beyond), now supports CUDA + distributed too
great for Mac-first workflows, not prod serving

> ExLlamaV2
single RTX box go brrr
EXL2 quant, fast local inference
perfect for 1/2/3/4 GPU(s) setups (4090/3090)
not meant for clusters or non-CUDA

> ExLlamaV3
same idea, but bigger ambition
multi-GPU, MoE, EXL3 quant
consumer rigs pretending to be datacenters
still CUDA-first, still rough edges depending on model

> vLLM
default answer for prod serving
continuous batching, KV cache magic
tensor / pipeline / data parallel
runs on CUDA + ROCm (and some CPUs)
this is your "serve 100s of users" engine

> SGLang
vLLM but more systems-brained
routing, disaggregation, long-context scaling
expert parallel for MoE
built for ugly workloads at scale
lives on top of CUDA / ROCm clusters
this is infra nerd territory

> TensorRT-LLM
maximum NVIDIA performance
FP8/FP4, CUDA graphs, insane throughput
multi-node, multi-GPU, fully optimized
pure CUDA stack, zero portability

(And underneath all of it: Transformers → model architecture layer → CUDA / ROCm / TT-Metal → compute layer)

What actually happens under the hood:
> Transformers defines the model
> CUDA / ROCm executes it
> TT-Metal (if you're insane) lets you write the kernel yourself
The Inference Engine is just the orchestrator (simplified)

When running LLMs locally, the bottleneck isn't just "VRAM size"
It isn't even the model
It's:
- memory bandwidth (the real limiter)
- KV cache (explodes with long context)
- interconnect (PCIe vs NVLink vs RDMA)
- scheduler quality (batching + engine design)
- runtime overhead (activations, graphs, etc)
(and your compute stack decides all of this)

P.S. Unified Memory is way slower than VRAM

Cheat Sheet / Rules of Thumb
> laptop / edge / weird hardware → llama.cpp
> Mac workflows → MLX
> 1–4 RTX GPUs → ExLlamaV2/V3
> general serving → vLLM
> complex infra / long context / MoE → SGLang
> NVIDIA max performance → TensorRT-LLM
English
23
36
364
18.2K
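As a concrete instance of one row of that cheat sheet ("general serving → vLLM"), here is a minimal sketch with the vLLM Python API; the model name is a placeholder, not anything from the thread.

```python
# Minimal vLLM sketch for the "general serving" row of the cheat sheet.
# The model name is a placeholder; use whatever checkpoint you actually serve.
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-7B-Instruct", tensor_parallel_size=1)  # raise TP size to shard across GPUs
params = SamplingParams(temperature=0.7, max_tokens=64)

prompts = [
    "Explain KV cache growth with long context in one sentence.",
    "Why is memory bandwidth the main limiter for local LLM inference?",
]
# generate() submits the whole batch; continuous batching schedules the requests
for out in llm.generate(prompts, params):
    print(out.outputs[0].text.strip())
```

For an actual deployment the same engine is more often launched as an OpenAI-compatible server (`vllm serve <model>`) than used offline like this.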
.🫟
.🫟 @ab_jpeg ·
do custom codex usage warnings exist, like warn me when i've used 50% of my rolling window
English
0
0
0
9