Konstantin
@advprop

101 posts

ml & hft research

Joined November 2025
66 Following · 118 Followers

Pinned Tweet
Konstantin @advprop:
nvfp4 moe on b200: the 142 tflops gap. benchmarked gpt-oss-20b (64e, topk=4) nvfp4 kernels. sglang hits 1168 tflops peak. vllm tops out at 1026 tflops. same hardware. same model. different kernels. dive in ⬇️
[image]
5 replies · 7 reposts · 76 likes · 74.9K views
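For readers wanting to reproduce numbers like these: achieved TFLOPS is just FLOPs executed divided by wall time. A minimal PyTorch sketch with a plain GEMM as a stand-in; the shapes, dtype, and iteration counts are illustrative assumptions, not the actual gpt-oss-20b nvfp4 MoE kernels or their harness.

import torch

def measure_tflops(m=8192, n=8192, k=8192, iters=50, dtype=torch.bfloat16):
    # Time a GEMM with CUDA events and report achieved TFLOPS.
    # Shapes/dtype are illustrative, not the benchmarked MoE kernels.
    a = torch.randn(m, k, device="cuda", dtype=dtype)
    b = torch.randn(k, n, device="cuda", dtype=dtype)
    for _ in range(10):                      # warmup
        a @ b
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        a @ b
    end.record()
    torch.cuda.synchronize()
    seconds = start.elapsed_time(end) / 1e3 / iters   # ms -> s, per iteration
    flops = 2 * m * n * k                    # one multiply-accumulate = 2 FLOPs
    return flops / seconds / 1e12

if __name__ == "__main__":
    print(f"{measure_tflops():.1f} TFLOPS")

At the peaks quoted above, the 142-TFLOPS gap is about 12% of sglang's 1168.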
ₕₐₘₚₜₒₙ @hamptonism:
I’m literally building the only official eurosummer hacker house in a villa on the Amalfi Coast where the only requirement is to hop on tretinoin and lounge-maxx.
11 replies · 1 repost · 79 likes · 9.3K views
Konstantin @advprop:
@theCTO Scaleway I think, they can also connect those into clusters
0 replies · 0 reposts · 0 likes · 935 views
adam @theCTO:
where can i get a macos sandbox in the cloud? other than AWS
18 replies · 0 reposts · 47 likes · 17.3K views
David Hendrickson @TeksEdge:
💥 What is this beast? Skymizer HTX301 is an LLM 🛸👽👇
One PCIe card w/
📦 384 GB memory
⚡ 240W TDP
🧠 Runs 700B LLMs locally
Vs NVIDIA RTX 6000 Ada: 48 GB • 300W • ~$7,500
Vs RTX PRO 6000 Blackwell: 96 GB • 600W • ~$8,500
HTX301 delivers 8× the memory at less than half the power: a specialized LPU inference beast for on-prem AI. 🔥
No clusters. No NVLink. Just plug & infer.
Pricing TBA • Early access open now
[image]
34 replies · 33 reposts · 324 likes · 52.4K views
Konstantin @advprop:
it’s an interesting model, I would not say it’s very good; they used too much novel stuff without proper ablations (at least none are reported) - probably many things would change if they ran them. Thinking of hash routing, mHC. Idk about CSA, trying to fit 1M context in 7k blocks doesn’t sound like a great idea. Anyway, huge respect to them; I think the rest of the labs will iterate on that matter
0 replies · 0 reposts · 0 likes · 75 views
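Hash routing, mentioned above, is usually understood as replacing the learned MoE gate with a fixed hash of the token id, so expert assignment is static and needs no router training. A minimal sketch under that assumption; the hash constant and expert count are illustrative, not the model's actual scheme.

import torch

def hash_route(token_ids: torch.Tensor, num_experts: int) -> torch.Tensor:
    # Fixed (non-learned) expert assignment: hash each token id to an expert.
    # Unlike a learned router, the mapping never changes during training.
    # The multiplicative constant is arbitrary/illustrative.
    return (token_ids * 2654435761) % num_experts

token_ids = torch.tensor([17, 42, 42, 1031])
print(hash_route(token_ids, num_experts=8))  # same token -> same expert, always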
ueaj @_ueaj:
Hybrid SWA is actually an atrocious inductive bias and I'm tired of people using it. Of course it works at small scales! At small scales most of the learnable patterns are short range! That doesn't mean it scales to bigger models! Anything above like 200B params shouldn't use hybrid.

I've never been a fan of blockwise inductive biases, but it's the best you can get for long context perf. Reducing the total number of entries in the kv cache across a given sequence length is the best way to improve long ctx performance. I don't think attention is the best inductive bias, but within attention I think HSA, CSA, DSA, even NSA are by far the best innovations in the attention world, by a massive margin.

dsv4 is a genuinely very good model. the fact they didn't go with engrams, and all the other decisions they made except maybe mHC, makes me feel that ds still has the OS mandate. (attn res >> mHC)
6 replies · 1 repost · 39 likes · 12.3K views
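The KV-cache claim is easy to make concrete: a full-attention layer caches one K/V entry per token, while a sliding-window layer caps its entries at the window size. A rough sketch; the layer counts and window below are made-up numbers, not any particular model's config.

def kv_entries(seq_len, n_layers, n_swa_layers=0, window=4096):
    # Total KV-cache entries across layers for one sequence.
    # Full-attention layers cache seq_len entries; SWA layers cap at window.
    full = (n_layers - n_swa_layers) * seq_len
    swa = n_swa_layers * min(seq_len, window)
    return full + swa

# Illustrative numbers only.
print(kv_entries(seq_len=1_000_000, n_layers=60))                   # all full attn
print(kv_entries(seq_len=1_000_000, n_layers=60, n_swa_layers=45))  # hybrid: 3/4 SWA

At 1M tokens the hybrid config caches roughly a quarter of the entries, which is exactly the long-context win being weighed against the inductive-bias cost above.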
Matthew Berman @MatthewBerman:
$3/million output tokens. Qwen3.6-Plus is basically a frontier model. Let that sink in.
Together AI @togethercompute:
Introducing Qwen3.6-Plus from @Alibaba_Qwen, a 1M-context model built for real-world agents, agentic coding, and multimodal reasoning. AI natives can now use Qwen3.6-Plus on Together AI and benefit from reliable inference for production-scale agent workflows.

38 replies · 26 reposts · 350 likes · 54.6K views
Jason Warner @jasoncwarner:
Today @poolsideai is releasing Laguna M.1 & Laguna XS.2, our latest-generation models and first public models.

We started Poolside because we believed that to build truly capable coding agents, you need to own the full stack: data, training, reinforcement learning, inference. These models are the first result of that work, and we’re making them available to everyone.
39 replies · 40 reposts · 376 likes · 49.6K views
Konstantin @advprop:
I would still say no, even a 0.001% diff can damage quality in non-obvious ways, speaking from experience, tried something similar. Also it will be a nightmare to finetune after. But it’s very much possible to use swapped layers within the data distribution of some task/bench tho, just don’t aim for generalization
1 reply · 0 reposts · 1 like · 61 views
Jonathan Chang @ChangJonathanC:
DeepSeek V4 pro doesn't have SWA-only layers. I asked codex to see if it's possible to turn some CSA layers into SWA layers, and it seems the answer is yes. cc @Grad62304977
[image]
2 replies · 1 repost · 8 likes · 829 views
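One way to test the "even a 0.001% diff can damage quality" worry from the reply above is to measure next-token distribution drift between the original and the layer-swapped model on held-out text, before any finetuning. A hypothetical sketch assuming HuggingFace-style models whose forward pass returns .logits; the two models and the swap itself are placeholders.

import torch

@torch.no_grad()
def logit_divergence(model_a, model_b, input_ids):
    # Mean per-token KL between next-token distributions of two variants.
    # A tiny average KL can still hide large drift on rare tokens, which
    # is the non-obvious damage being warned about above.
    log_pa = torch.log_softmax(model_a(input_ids).logits, dim=-1)
    log_pb = torch.log_softmax(model_b(input_ids).logits, dim=-1)
    return (log_pa.exp() * (log_pa - log_pb)).sum(-1).mean()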
Konstantin @advprop:
@_ueaj vibecode a ratatui rust metrics viewer, been doing this for the past few months, ain’t going back
0 replies · 0 reposts · 1 like · 397 views
ueaj @_ueaj:
Can someone make an optimized wandb. Why is it so slow
12 replies · 2 reposts · 99 likes · 12.5K views
Konstantin reposted
White Circle @whitecircle:
Introducing ⚪️ KillBench — a benchmark of hidden LLM biases in critical decisions. We ran millions of life-and-death scenarios across every major LLM, varying nationality, religion, gender, and more. Every AI model is biased. Here's what we found ↓
[image]
17 replies · 28 reposts · 125 likes · 29.4K views
Konstantin @advprop:
Weirdest model experience in a while. @Zai_org GLM-5V Turbo has no knowledge about itself, identifies as claude, says it's made by Anthropic, and has no idea about its own tool environment. Push back on it and it doubles down hard. Replicated across many runs. Can’t believe they didn’t do RL on the model in the prod env - just proxying anthropic? Linked session below
[image]
1 reply · 0 reposts · 2 likes · 168 views
lineardiff @lineardiff:
@eliebakouch not sure why they compare to full 16-bit weights. the obvious comparison is 4- and 8-bit quants.
1 reply · 0 reposts · 8 likes · 2.7K views
Konstantin @advprop:
isn't this like a year+ behind? old qwen2 arch, mostly third-party data, optimized for 2024 single-turn benchmarks. would be much more interesting with structured outputs / tool-use since that's where the field is now. the ablation methodology is the real gem here though, worth a dedicated post series imo, not the model or data themselves
0 replies · 0 reposts · 1 like · 105 views
Aran Komatsuzaki @arankomatsuzaki:
daVinci-LLM: Towards the Science of Pretraining
- Matches larger model perf with half the size
- Huge reasoning gains: +23 pts on MATH, strong code + science scores
- Quality > scale: smarter data (not more data) drives major performance boosts
[image]
6 replies · 11 reposts · 122 likes · 11.5K views
Konstantin @advprop:
@hamptonism and freeze to death in Paris, imagine the heat bill at such a place
0 replies · 0 reposts · 0 likes · 101 views
Konstantin @advprop:
@VukRosic99 have you looked into the data btw? this split has a lot of 2-3 grams which give a lot of predictive power in a short conv kernel on its own + tokenize smarter. will make a post too, stay tuned
1 reply · 0 reposts · 1 like · 115 views
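The 2-3 gram point above can be sanity-checked by measuring how often the most frequent successor of a token is the actual next token, which is roughly the signal a short conv kernel or n-gram head can exploit on its own. A quick sketch over whitespace tokens; counting on the same text it predicts makes this an optimistic upper bound, and the sample string is obviously a toy.

from collections import Counter, defaultdict

def bigram_top1_accuracy(tokens):
    # How often the single most frequent successor of a token is the actual
    # next token: a crude ceiling on what a width-2 conv/ngram predictor
    # with no other context could get right.
    succ = defaultdict(Counter)
    for a, b in zip(tokens, tokens[1:]):
        succ[a][b] += 1
    hits = sum(c.most_common(1)[0][1] for c in succ.values())
    return hits / (len(tokens) - 1)

text = "the cat sat on the mat and the cat ran"
print(f"{bigram_top1_accuracy(text.split()):.2%}")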
Vuk Rosić 武克 @VukRosic99:
i did 71 quick experiments for 500 out of 13,000 steps for OpenAI's challenge

1. Mixture of Experts is the absolute WINNER (very surprising as it shouldn't be for small LLMs)
   > Expert count matters most. 4 (best) > 3 >> 2.
2. UNTIED Embeddings work, tied are a disaster
3. Depthwise Convolution - DEAD END

Insights:
1. 4-expert MOE + leaky ReLU -> -0.048 BPB, clear winner
2. Untied factored embeddings (bn128) -> -0.031 BPB, worth combining with MOE
3. MOE + QAT combo -> preserves quantized quality for submission

dead ends:
1. Depthwise convolution -> every variant hurts, bigger kernels hurt more
2. Tied factored embeddings -> catastrophic, especially at small bottlenecks
3. Weight sharing -> not competitive with MOE for quality
4. Conv + anything combos -> compounds the damage

Next Steps:
1. Validate MOE 4e + leaky at 2000-5000 steps, multiple seeds
2. Test MOE 4e + leaky + untied bn128 - the two biggest wins may stack
3. Full run (13780 steps) of the best combo to see if it beats 1.2244 BPB on the leaderboard

71 experiments, 3 GPUs, ~500 steps each. 500-step training mainly helps us eliminate VERY BAD losers; winners need to be tested on longer training. Thank you @novita_labs for compute!
[3 images]

OpenAI @OpenAI:
Are you up for a challenge? openai.com/parameter-golf

13 replies · 16 reposts · 212 likes · 41K views
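For readers unfamiliar with the winning config above, this is the standard shape of a small top-1 MoE layer with leaky-ReLU experts. A minimal PyTorch sketch; the dimensions are illustrative, and it omits load balancing, QAT, and everything else in the actual experiments. Not Vuk's code.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoE(nn.Module):
    # Minimal top-1 mixture-of-experts MLP with leaky-ReLU experts.
    def __init__(self, d_model=128, d_ff=512, n_experts=4):
        super().__init__()
        self.gate = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff),
                          nn.LeakyReLU(),
                          nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                       # x: [tokens, d_model]
        weights = F.softmax(self.gate(x), dim=-1)
        top_w, top_e = weights.max(dim=-1)      # top-1 routing
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = top_e == e                   # tokens routed to expert e
            if mask.any():
                out[mask] = top_w[mask, None] * expert(x[mask])
        return out

moe = TinyMoE()
print(moe(torch.randn(16, 128)).shape)          # torch.Size([16, 128])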
Konstantin @advprop:
The graveyard of "faster than full attention" papers grows every week. not because the ideas don't work but because the implementations don't ship.

> MoBA in SGLang: route queries to relevant KV blocks, run FA on the sparse subset
> Elegant. In practice?
> nonzero → index_select → index_add_, elementwise, sequential python slop, many times slower than FA4.

algo x in the PDF looks easy. the proper PR tends to be skipped.
[image]
1 reply · 0 reposts · 18 likes · 137.2K views
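Concretely, the anti-pattern being criticized looks something like the sketch below: a Python loop driving nonzero / index_select / index_add_, one small kernel launch per step, where a fused attention kernel does the block selection, softmax, and accumulation in a single launch. A schematic toy in PyTorch, not SGLang's actual MoBA code; block granularity is simplified to single positions and all shapes are toy.

import torch

def naive_sparse_attention(scores, values, mask):
    # The slow pattern: sequential Python, nonzero -> index_select -> index_add_.
    # Every call below is a separate kernel launch plus host-device sync.
    out = torch.zeros_like(values)
    for q in range(mask.shape[0]):                     # sequential python slop
        idx = mask[q].nonzero(as_tuple=True)[0]        # which KV positions survive
        if idx.numel() == 0:
            continue
        v = values.index_select(0, idx)                # gather selected V rows
        w = scores[q].index_select(0, idx).softmax(0)  # renormalize over subset
        out.index_add_(0, torch.tensor([q]),
                       (w[:, None] * v).sum(0, keepdim=True))
    return out

scores = torch.randn(8, 8)   # toy [queries, keys] attention logits
values = torch.randn(8, 16)  # toy [keys, head_dim] values
mask = torch.rand(8, 8) > 0.5
print(naive_sparse_attention(scores, values, mask).shape)  # torch.Size([8, 16])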
Konstantin @advprop:
@YouJiacheng they are also not necessary if gpus are not connected at all 😀
0 replies · 0 reposts · 1 like · 126 views