Konstantin
@advprop

101 posts

ml & hft research

Joined November 2025
66 Following · 118 Followers

Pinned Tweet
Konstantin @advprop:
nvfp4 moe on b200: the 142 tflops gap. benchmarked gpt-oss-20b (64e, topk=4) nvfp4 kernels. sglang hits 1168 tflops peak. vllm tops out at 1026 tflops. same hardware. same model. different kernels. dive in ⬇️
[image]
5 replies · 7 reposts · 76 likes · 74.9K views
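For readers wanting to reproduce numbers like these: achieved TFLOPS is just FLOPs executed divided by wall time. A minimal PyTorch sketch with a plain GEMM as a stand-in; the shapes, dtype, and iteration counts are illustrative assumptions, not the actual gpt-oss-20b nvfp4 MoE kernels or their harness.

import torch

def measure_tflops(m=8192, n=8192, k=8192, iters=50, dtype=torch.bfloat16):
    # Time a GEMM with CUDA events and report achieved TFLOPS.
    # Shapes/dtype are illustrative, not the benchmarked MoE kernels.
    a = torch.randn(m, k, device="cuda", dtype=dtype)
    b = torch.randn(k, n, device="cuda", dtype=dtype)
    for _ in range(10):                      # warmup
        a @ b
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        a @ b
    end.record()
    torch.cuda.synchronize()
    seconds = start.elapsed_time(end) / 1e3 / iters   # ms -> s, per iteration
    flops = 2 * m * n * k                    # one multiply-accumulate = 2 FLOPs
    return flops / seconds / 1e12

if __name__ == "__main__":
    print(f"{measure_tflops():.1f} TFLOPS")

At the peaks quoted above, the 142-TFLOPS gap is about 12% of sglang's 1168.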
ₕₐₘₚₜₒₙ @hamptonism:
I’m literally building the only official eurosummer hacker house in a villa on the Amalfi Coast where the only requirement is to hop on tretinoin and lounge-maxx.
11 replies · 1 repost · 79 likes · 9.3K views
Konstantin @advprop:
@theCTO Scaleway I think, they can also connect those into clusters
0 replies · 0 reposts · 0 likes · 935 views
adam @theCTO:
where can i get a macos sandbox in the cloud? other than AWS
18 replies · 0 reposts · 47 likes · 17.3K views
David Hendrickson @TeksEdge:
💥 What is this beast? Skymizer HTX301 is an LLM 🛸👽👇
One PCIe card w/
📦 384 GB memory
⚡ 240W TDP
🧠 Runs 700B LLMs locally
Vs NVIDIA RTX 6000 Ada: 48 GB • 300W • ~$7,500
Vs RTX PRO 6000 Blackwell: 96 GB • 600W • ~$8,500
HTX301 delivers 8× the memory at less than half the power: a specialized LPU inference beast for on-prem AI. 🔥
No clusters. No NVLink. Just plug & infer.
Pricing TBA • Early access open now
[image]
34 replies · 33 reposts · 324 likes · 52.4K views
Konstantin @advprop:
it’s an interesting model, I would not say it’s very good; they used too much novel stuff without proper ablations (at least none are reported) - probably many things would change if they ran them. Thinking of hash routing, mHC. Idk about CSA, trying to fit 1M context in 7k blocks doesn’t sound like a great idea. Anyway, huge respect to them; I think the rest of the labs will iterate on that matter
0 replies · 0 reposts · 0 likes · 75 views
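Hash routing, mentioned above, is usually understood as replacing the learned MoE gate with a fixed hash of the token id, so expert assignment is static and needs no router training. A minimal sketch under that assumption; the hash constant and expert count are illustrative, not the model's actual scheme.

import torch

def hash_route(token_ids: torch.Tensor, num_experts: int) -> torch.Tensor:
    # Fixed (non-learned) expert assignment: hash each token id to an expert.
    # Unlike a learned router, the mapping never changes during training.
    # The multiplicative constant is arbitrary/illustrative.
    return (token_ids * 2654435761) % num_experts

token_ids = torch.tensor([17, 42, 42, 1031])
print(hash_route(token_ids, num_experts=8))  # same token -> same expert, always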
ueaj @_ueaj:
Hybrid SWA is actually an atrocious inductive bias and I'm tired of people using it. Of course it works at small scales! At small scales most of the learnable patterns are short range! That doesn't mean it scales to bigger models! Anything above like 200B params shouldn't use hybrid.

I've never been a fan of blockwise inductive biases, but it's the best you can get for long context perf. Reducing the total number of entries in the kv cache across a given sequence length is the best way to improve long ctx performance. I don't think attention is the best inductive bias, but within attention I think HSA, CSA, DSA, even NSA are by far the best innovations in the attention world, by a massive margin.

dsv4 is a genuinely very good model. the fact they didn't go with engrams, and all the other decisions they made except maybe mHC, makes me feel that ds still has the OS mandate. (attn res >> mHC)
6 replies · 1 repost · 39 likes · 12.3K views
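The KV-cache claim is easy to make concrete: a full-attention layer caches one K/V entry per token, while a sliding-window layer caps its entries at the window size. A rough sketch; the layer counts and window below are made-up numbers, not any particular model's config.

def kv_entries(seq_len, n_layers, n_swa_layers=0, window=4096):
    # Total KV-cache entries across layers for one sequence.
    # Full-attention layers cache seq_len entries; SWA layers cap at window.
    full = (n_layers - n_swa_layers) * seq_len
    swa = n_swa_layers * min(seq_len, window)
    return full + swa

# Illustrative numbers only.
print(kv_entries(seq_len=1_000_000, n_layers=60))                   # all full attn
print(kv_entries(seq_len=1_000_000, n_layers=60, n_swa_layers=45))  # hybrid: 3/4 SWA

At 1M tokens the hybrid config caches roughly a quarter of the entries, which is exactly the long-context win being weighed against the inductive-bias cost above.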
Matthew Berman @MatthewBerman:
$3/million output tokens. Qwen3.6-Plus is basically a frontier model. Let that sink in.
Together AI @togethercompute:
Introducing Qwen3.6-Plus from @Alibaba_Qwen, a 1M-context model built for real-world agents, agentic coding, and multimodal reasoning. AI natives can now use Qwen3.6-Plus on Together AI and benefit from reliable inference for production-scale agent workflows.

38 replies · 26 reposts · 350 likes · 54.6K views
Jason Warner @jasoncwarner:
Today @poolsideai is releasing Laguna M.1 & Laguna XS.2, our latest-generation models and first public models.

We started Poolside because we believed that to build truly capable coding agents, you need to own the full stack: data, training, reinforcement learning, inference. These models are the first result of that work, and we’re making them available to everyone.
39 replies · 40 reposts · 376 likes · 49.6K views
Konstantin @advprop:
I would still say no, even a 0.001% diff can damage quality in non-obvious ways, speaking from experience, tried something similar. Also it will be a nightmare to finetune after. But it’s very much possible to use swapped layers within the data distribution of some task/bench tho, just don’t aim for generalization
1 reply · 0 reposts · 1 like · 61 views
Jonathan Chang @ChangJonathanC:
DeepSeek V4 pro doesn't have SWA-only layers. I asked codex to see if it's possible to turn some CSA layers into SWA layers, and it seems the answer is yes. cc @Grad62304977
[image]
2 replies · 1 repost · 8 likes · 829 views
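One way to test the "even a 0.001% diff can damage quality" worry from the reply above is to measure next-token distribution drift between the original and the layer-swapped model on held-out text, before any finetuning. A hypothetical sketch assuming HuggingFace-style models whose forward pass returns .logits; the two models and the swap itself are placeholders.

import torch

@torch.no_grad()
def logit_divergence(model_a, model_b, input_ids):
    # Mean per-token KL between next-token distributions of two variants.
    # A tiny average KL can still hide large drift on rare tokens, which
    # is the non-obvious damage being warned about above.
    log_pa = torch.log_softmax(model_a(input_ids).logits, dim=-1)
    log_pb = torch.log_softmax(model_b(input_ids).logits, dim=-1)
    return (log_pa.exp() * (log_pa - log_pb)).sum(-1).mean()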
Konstantin @advprop:
@_ueaj vibecode a ratatui rust metrics viewer, been doing this for the past few months, ain’t going back
0 replies · 0 reposts · 1 like · 397 views
ueaj @_ueaj:
Can someone make an optimized wandb. Why is it so slow
12 replies · 2 reposts · 99 likes · 12.5K views
Konstantin reposted
White Circle @whitecircle:
Introducing ⚪️ KillBench — a benchmark of hidden LLM biases in critical decisions. We ran millions of life-and-death scenarios across every major LLM, varying nationality, religion, gender, and more. Every AI model is biased. Here's what we found ↓
[image]
17 replies · 28 reposts · 125 likes · 29.4K views
Konstantin @advprop:
Weirdest model experience in a while. @Zai_org GLM-5V Turbo has no knowledge about itself, identifies as claude, says it's made by Anthropic, and has no idea about its own tool environment. Push back on it and it doubles down hard. Replicated across many runs. Can’t believe they didn’t do RL on the model in the prod env - just proxying anthropic? Linked session below
[image]
1 reply · 0 reposts · 2 likes · 168 views
lineardiff @lineardiff:
@eliebakouch not sure why they compare to full 16-bit weights. the obvious comparison is 4- and 8-bit quants.
1 reply · 0 reposts · 8 likes · 2.7K views
Konstantin @advprop:
isn't this like a year+ behind? old qwen2 arch, mostly third-party data, optimized for 2024 single-turn benchmarks. would be much more interesting with structured outputs / tool-use since that's where the field is now. the ablation methodology is the real gem here though, worth a dedicated post series imo, not the model or data themselves
0 replies · 0 reposts · 1 like · 105 views
Aran Komatsuzaki @arankomatsuzaki:
daVinci-LLM: Towards the Science of Pretraining
- Matches larger model perf with half the size
- Huge reasoning gains: +23 pts on MATH, strong code + science scores
- Quality > scale: smarter data (not more data) drives major performance boosts
[image]
6 replies · 11 reposts · 122 likes · 11.5K views
Konstantin @advprop:
@hamptonism and freeze to death in Paris, imagine the heat bill at such a place
0 replies · 0 reposts · 0 likes · 101 views
Konstantin @advprop:
@VukRosic99 have you looked into the data btw? this split has a lot of 2-3 grams which give a lot of predictive power in a short conv kernel on its own + tokenize smarter. will make a post too, stay tuned
1 reply · 0 reposts · 1 like · 115 views
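The 2-3 gram point above can be sanity-checked by measuring how often the most frequent successor of a token is the actual next token, which is roughly the signal a short conv kernel or n-gram head can exploit on its own. A quick sketch over whitespace tokens; counting on the same text it predicts makes this an optimistic upper bound, and the sample string is obviously a toy.

from collections import Counter, defaultdict

def bigram_top1_accuracy(tokens):
    # How often the single most frequent successor of a token is the actual
    # next token: a crude ceiling on what a width-2 conv/ngram predictor
    # with no other context could get right.
    succ = defaultdict(Counter)
    for a, b in zip(tokens, tokens[1:]):
        succ[a][b] += 1
    hits = sum(c.most_common(1)[0][1] for c in succ.values())
    return hits / (len(tokens) - 1)

text = "the cat sat on the mat and the cat ran"
print(f"{bigram_top1_accuracy(text.split()):.2%}")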
Vuk Rosić 武克 @VukRosic99:
i did 71 quick experiments for 500 out of 13,000 steps for OpenAI's challenge

1. Mixture of Experts is the absolute WINNER (very surprising as it shouldn't be for small LLMs)
   > Expert count matters most. 4 (best) > 3 >> 2.
2. UNTIED Embeddings work, tied are a disaster
3. Depthwise Convolution - DEAD END

Insights:
1. 4-expert MOE + leaky ReLU -> -0.048 BPB, clear winner
2. Untied factored embeddings (bn128) -> -0.031 BPB, worth combining with MOE
3. MOE + QAT combo -> preserves quantized quality for submission

dead ends:
1. Depthwise convolution -> every variant hurts, bigger kernels hurt more
2. Tied factored embeddings -> catastrophic, especially at small bottlenecks
3. Weight sharing -> not competitive with MOE for quality
4. Conv + anything combos -> compounds the damage

Next Steps:
1. Validate MOE 4e + leaky at 2000-5000 steps, multiple seeds
2. Test MOE 4e + leaky + untied bn128 - the two biggest wins may stack
3. Full run (13780 steps) of the best combo to see if it beats 1.2244 BPB on the leaderboard

71 experiments, 3 GPUs, ~500 steps each. 500-step training mainly helps us eliminate VERY BAD losers; winners need to be tested on longer training. Thank you @novita_labs for compute!
[3 images]

OpenAI @OpenAI:
Are you up for a challenge? openai.com/parameter-golf

13 replies · 16 reposts · 212 likes · 41K views
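For readers unfamiliar with the winning config above, this is the standard shape of a small top-1 MoE layer with leaky-ReLU experts. A minimal PyTorch sketch; the dimensions are illustrative, and it omits load balancing, QAT, and everything else in the actual experiments. Not Vuk's code.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoE(nn.Module):
    # Minimal top-1 mixture-of-experts MLP with leaky-ReLU experts.
    def __init__(self, d_model=128, d_ff=512, n_experts=4):
        super().__init__()
        self.gate = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff),
                          nn.LeakyReLU(),
                          nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                       # x: [tokens, d_model]
        weights = F.softmax(self.gate(x), dim=-1)
        top_w, top_e = weights.max(dim=-1)      # top-1 routing
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = top_e == e                   # tokens routed to expert e
            if mask.any():
                out[mask] = top_w[mask, None] * expert(x[mask])
        return out

moe = TinyMoE()
print(moe(torch.randn(16, 128)).shape)          # torch.Size([16, 128])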
Konstantin @advprop:
The graveyard of "faster than full attention" papers grows every week. not because the ideas don't work but because the implementations don't ship.

> MoBA in SGLang: route queries to relevant KV blocks, run FA on the sparse subset
> Elegant. In practice?
> nonzero → index_select → index_add_, elementwise, sequential python slop, many times slower than FA4.

algo x in the PDF looks easy. the proper PR tends to be skipped.
[image]
1 reply · 0 reposts · 18 likes · 137.2K views
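Concretely, the anti-pattern being criticized looks something like the sketch below: a Python loop driving nonzero / index_select / index_add_, one small kernel launch per step, where a fused attention kernel does the block selection, softmax, and accumulation in a single launch. A schematic toy in PyTorch, not SGLang's actual MoBA code; block granularity is simplified to single positions and all shapes are toy.

import torch

def naive_sparse_attention(scores, values, mask):
    # The slow pattern: sequential Python, nonzero -> index_select -> index_add_.
    # Every call below is a separate kernel launch plus host-device sync.
    out = torch.zeros_like(values)
    for q in range(mask.shape[0]):                     # sequential python slop
        idx = mask[q].nonzero(as_tuple=True)[0]        # which KV positions survive
        if idx.numel() == 0:
            continue
        v = values.index_select(0, idx)                # gather selected V rows
        w = scores[q].index_select(0, idx).softmax(0)  # renormalize over subset
        out.index_add_(0, torch.tensor([q]),
                       (w[:, None] * v).sum(0, keepdim=True))
    return out

scores = torch.randn(8, 8)   # toy [queries, keys] attention logits
values = torch.randn(8, 16)  # toy [keys, head_dim] values
mask = torch.rand(8, 8) > 0.5
print(naive_sparse_attention(scores, values, mask).shape)  # torch.Size([8, 16])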
Konstantin @advprop:
@YouJiacheng they are also not necessary if gpus are not connected at all 😀
0 replies · 0 reposts · 1 like · 126 views