cudnn_cu12

1.9K posts

@_proteuss_

Interests: Machine learning research, learning in neural networks, applied AI, startups, etc

Joined September 2022

952 Following · 305 Followers

Pinned Tweet

cudnn_cu12@_proteuss_·Jul 25

Lord God, I come to You a sinner And I humbly repent for my sins I believe that Jesus is Lord I believe You raised Him from the dead I would ask that Jesus come into my life And be my Lord and Savior I receive Jesus to take control of my life And that I may live for Him from this day forward Thank you, Lord Jesus For saving me with Your precious blood In Jesus' name, amen

1.1K

cudnn_cu12@_proteuss_·1h

@datavorous_ thanks, i havent been able to crack under 10ms using agents only. will have to do it by hand

datavorous@datavorous_·5h

I tried to speedrun the QR_2 challenge of GPU MODE in 16 hours. i stopped at 4.7ms. this was my first "proper" gpu mode contest.

constraints: no way to tokenmaxx, no gpu to profile against extensively.

what I did to understand the math:
> Householder reflection method to do QR decomposition
- reading the problem statement multiple times
- did a 3x3 matrix dry run by hand
- consulted gemini for validation of my mental model
- studied parallel prefix sums
- drew out the tree structure on paper before writing a single line of code
- thought of a simpler approach (blocked method) instead of directly jumping to the tree-like implementation

what worked:
- blocked Householder with GEMM trailing update
- loading panels at ONCE into the SRAM
- tuning block sizes to prevent spilling
- tuning warp count for better occupancy
- (unintended) pytorch fallback for n=4096

what did not work:
- fp8/nvfp4 quantisation (precision violated)
- Cholesky method (not allowed)
- mixed precision with iterative correction
- TF32 global flag (failed tests lmao)
- full panel resident in shared memory (spilled badly)

takeaways: i didn't get a chance to do extensive kernel programming before, so I had to bet on my intuitions:
- bytes moved is the real bottleneck, i have to reduce the round trips of data
- asymptotic wins can still lose in practice
- if you're broke, tokenmaxxing is not an option. manual? doable, but time is a constraint. human x ai? hell yeah!
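For readers who have not seen "blocked Householder with GEMM trailing update" written out, here is a minimal CPU sketch in NumPy. It is not the kernel from the post: the function names, the block size, and the use of the compact WY form (Q_panel = I - V T V^T) are assumptions chosen for illustration. Each panel is factored with ordinary reflections, then the rest of the matrix is updated with a few matrix-matrix products, which is the part a GPU kernel would hand to GEMM.

```python
import numpy as np

def unit_householder(x):
    """Unit vector v such that (I - 2 v v^T) x is a multiple of e1."""
    v = np.array(x, dtype=float)
    norm_x = np.linalg.norm(v)
    if norm_x == 0.0:
        return v  # zero column: v = 0 makes the reflector the identity
    v[0] += np.copysign(norm_x, v[0])  # sign choice avoids cancellation
    return v / np.linalg.norm(v)

def blocked_householder_qr(A, block=2):
    """Return (Q, R) with Q @ R == A, processing `block` columns per panel."""
    R = np.array(A, dtype=float)
    m, n = R.shape
    Q = np.eye(m)
    steps = min(m, n)
    for k in range(0, steps, block):
        b = min(block, steps - k)
        V = np.zeros((m - k, b))   # reflector vectors of this panel
        T = np.zeros((b, b))       # upper-triangular factor of the WY form
        for j in range(b):
            # Panel factorization: one reflection at a time, panel columns only.
            v = unit_householder(R[k + j:, k + j])
            panel = R[k + j:, k + j:k + b]
            panel -= 2.0 * np.outer(v, v @ panel)
            V[j:, j] = v
            # Grow T so that H_1 ... H_j = I - V T V^T (each tau is 2 here).
            T[:j, j] = -2.0 * T[:j, :j] @ (V[:, :j].T @ V[:, j])
            T[j, j] = 2.0
        # Trailing update: apply (I - V T V^T)^T with matrix-matrix products.
        trailing = R[k:, k + b:]
        trailing -= V @ (T.T @ (V.T @ trailing))
        # Accumulate Q <- Q (I - V T V^T).
        Q[:, k:] -= (Q[:, k:] @ V) @ T @ V.T
    return Q, np.triu(R)

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 4))
Q, R = blocked_householder_qr(A, block=2)
print(np.linalg.norm(Q @ R - A))  # reconstruction error
```

The block size trades panel work (memory-bound, column at a time) against trailing-update work (large products), which is the same knob the post describes tuning to avoid spilling.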

5.8K

cudnn_cu12@_proteuss_·11h

@blelbach thank u

158

Bryce, the CUDA Colonel@blelbach·12h

This isn't one problem - it's a dozen or more! Specialized kernels for different shapes & kinds of matrices will win the day. Leveraging tensorcores & a mixture of numeric formats (FP64, FP32, TF32, FP16, FP8) is likely key to top answers. This may even end up memory bound.

Mark Saroufim@marksaroufim

We've released the QR problem, a more robust qr_v2 with a fresh leaderboard so please resubmit! Thank you to @blelbach, @myainotez and @nikhilbarhate99 for sharing feedback. Sorry if I missed anyone! I considered automatically backfilling all submissions but the rankings do change quite a bit so I figured a refresh would be better.

Changelog
* Fail submissions if they fail when we change random seeds
* Add nasty correctness cases with more degenerate inputs in mixed batches
* Recheck correctness when doing perf testing to avoid Volkswagen cheat
* Reject NaN/Inf residuals
* Validate each matrix factorization residual, since averaging was hiding bad matrices
* Old QR is still open so folks can see submissions but you can't submit anything to it

Wontfix
* Stream hacking is still banned via a very blunt ban of the word "stream"; we don't have a good solution for this
* CUDA graphs are allowed but not particularly interesting to us

Best submissions so far if I resubmit their solutions are
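Two of the changelog items (reject NaN/Inf residuals, validate each matrix rather than an average) can be sketched as below. This is not the leaderboard's actual checker: the function name `validate_qr_batch`, the three residuals, and the tolerance `tol` are assumptions made for illustration.

```python
import numpy as np

def validate_qr_batch(A, Q, R, tol=1e-4):
    """Return one boolean per matrix: True only if that factorization passes."""
    A, Q, R = (np.asarray(x, dtype=np.float64) for x in (A, Q, R))
    eye = np.eye(Q.shape[-1])
    a_norm = np.linalg.norm(A, axis=(-2, -1))       # Frobenius norm per matrix
    scale = np.where(a_norm > 0, a_norm, 1.0)       # avoid dividing by zero
    recon = np.linalg.norm(Q @ R - A, axis=(-2, -1)) / scale
    orth = np.linalg.norm(np.swapaxes(Q, -1, -2) @ Q - eye, axis=(-2, -1))
    lower = np.linalg.norm(np.tril(R, k=-1), axis=(-2, -1)) / scale
    resid = np.stack([recon, orth, lower])
    # Checked per matrix: a NaN/Inf residual fails outright, and one bad
    # matrix cannot be hidden by good neighbours in an average.
    return np.isfinite(resid).all(axis=0) & (resid <= tol).all(axis=0)

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4, 4))
Q = np.stack([np.linalg.qr(a)[0] for a in A])
R = np.stack([np.linalg.qr(a)[1] for a in A])
R[1] *= 1.5          # wrong factorization for matrix 1
Q[2, 0, 0] = np.nan  # non-finite entry for matrix 2
passed = validate_qr_batch(A, Q, R)
print(passed)
```

Only the first matrix passes here; the other two are rejected individually, one for a large reconstruction residual and one for a NaN.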

8.4K

cudnn_cu12@_proteuss_·11h

@hotschmoe and you will try some kernel engineering?

StrongEngineer_@hotschmoe·11h

@_proteuss_ Totally usable in my opinion. Mature W4A8/W8A8 should be much better imo to activate int8 fastpaths x.com/i/status/20685…

StrongEngineer_@hotschmoe

one intel b70 ($950), first day setup Qwen3.6-27B W4A16 (autoround), *no* MTP 128k context, kv cache fp16 1 session: 28.1 tok/s 2 concurrent sessions: 52.0 tok/s cumulative 4 concurrent sessions: 87.8 tok/s cumulative 64 concurrent sessions: 234.7 tok/s cumulative
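To read these numbers per session rather than cumulatively, a small sketch of the arithmetic follows. The figures are copied from the post; the helper name `per_session` is hypothetical.

```python
# Cumulative throughput in tok/s keyed by concurrent sessions (from the post).
cumulative = {1: 28.1, 2: 52.0, 4: 87.8, 64: 234.7}

def per_session(cumulative_tps, sessions):
    """Average tok/s each session sees when `sessions` run concurrently."""
    return cumulative_tps / sessions

baseline = cumulative[1]
for sessions, total in cumulative.items():
    each = per_session(total, sessions)
    gain = total / baseline  # aggregate throughput relative to one session
    print(f"{sessions:>2} sessions: {each:6.2f} tok/s each, {gain:5.2f}x aggregate")
```

Aggregate throughput rises with concurrency while each individual session slows down, which is the usual batching trade-off.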

267

StrongEngineer_@hotschmoe·22h

I almost bought a dgx spark at $2999, then I almost bought a rtx pro 5000 for $3400. Watched as prices continued to march. Not getting left behind, I decided to throw my hat in the ring with Intel, got 2xB70s for $1900 Has barely been 2 days and I can't believe how much fun this is. Excited to contribute to Intel optimizations and provide useful AI to my family and friends

112

15.7K

cudnn_cu12@_proteuss_·12h

@Dadahelper1 interested in how you make these!!

yuxinlu1@Dadahelper1·12h

v2 is out on HF and the jump in agentic tool-use is huge — on tau2-bench telecom it goes from ~15% (base Gemma 4 12B) to ~55%: it actually diagnoses → fixes → verifies instead of bailing after step one. A lot of folks are hitting issues across different runtimes (llama.cpp, Ollama, LM Studio, koboldcpp…) — almost all of it is template / tool-format quirks, not the weights. I've written every fix up in the HF discussions, so check there first 🙏 And honestly? Google's models are way less convenient to work with than Qwen — the non-standard chat template + native tool-call format trip up half the ecosystem, while Qwen's standard ChatML just works everywhere. Shipped it anyway 💪 huggingface.co/yuxinlu1/gemma…

672

cudnn_cu12@_proteuss_·12h

my version of alopex, a gradient free / correlative learning algorithm now beats adamW on a mlp learning mnist with zero probes and only forward passes. they said it couldn't be done

cudnn_cu12@_proteuss_·16h

@andreisavin @oleksoleksoleks its not economical. i have vpc at $30 a month

112

Andrei Savin@andreisavin·16h

@_proteuss_ @oleksoleksoleks I've had good success with vast.ai

117

Olek@oleksoleksoleks·18h

You can just rent Blackwell hours at 90% discount of inference API prices. Model providers are playing you all like fiddles.

483

75.3K

cudnn_cu12@_proteuss_·17h

what are some examples of people that have made their own harness?

cudnn_cu12@_proteuss_·17h

@0xSero thanks ill go thru it. but assume im dumb and if i want to make small version like this -what do i have to do?

0xSero@0xSero·17h

@_proteuss_ Model cards are on huggingface.co/0xsero

0xSero@0xSero·19h

GLM-5.2-REAP checks its work to catch errors before serving me the game. I love this MF

3.4K

cudnn_cu12@_proteuss_·18h

i will try glm 5.2 again on a fresh repo for kernel competition but which harness, inference endpoint, ide should i use?

cudnn_cu12@_proteuss_·18h

@Dadahelper1 @HuggingModels how was this fine-tuned?

yuxinlu1@Dadahelper1·4d

Thank you @HuggingModels for the feature, this made my day 🙏 v1 has been incredible to see take off in the community. Good news: v2 is already cooking — dataset's basically done. It'll be stronger across the board. Releasing as soon as it's ready 🚀

Hugging Models@HuggingModels

Gemma 4 12B Coder is here and it's a game changer for local code generation. This GGUF model packs Google's latest gemma-4 architecture into a compact 12B size, perfect for running on consumer hardware. It's optimized for reasoning and thinking, making it ideal for developers who want fast, private coding assistance without the cloud.

961

cudnn_cu12@_proteuss_·23h

@minu_who @hrishioa what harness , provider, ide are yall using?

minu@minu_who·1d

glm 5.2 outperformed opus 4.8 on our undefeated (by humans and agents) take-home. it also moved fewer tokens and we've known it's cheaper, like 3 times cheaper. i'm kinda having a moment

Hrishi@hrishioa

This is a watershed moment. GLM-5.2 solidly beat Opus 4.8 and human participants in our backend take-home, making the whole thing obsolete. It also pushed forward the state-of-the-art for multi-stage media-to-transcript, with a new release: offmute-v2. I come with receipts.

194

17.1K

cudnn_cu12@_proteuss_·1d

@difficultyang i was using this for this and it couldn't get to a solution that was sub 2ms and i spent nearly $100 x.com/_proteuss_/sta…

cudnn_cu12@_proteuss_

first submission - im not last

1.2K

difficultyang@difficultyang·1d

My verdict on GLM-5.2 is that it passes the inflection point threshold. It isn't too cheap to meter but in interactive use you'll have trouble pushing to $100 spend a day, I think.

609

46.8K

cudnn_cu12@_proteuss_·1d

this algo loves ai slop articles only viewed by bots or something

cudnn_cu12@_proteuss_·1d

im not in the weights

cudnn_cu12@_proteuss_·1d

@ericzedd pc - i have some old gpus but no where near enough vram for large models

eric zedd@ericzedd·1d

@_proteuss_ PC or Mac? 👀

cudnn_cu12@_proteuss_·1d

ive hit usage limits on everything this month : codex, copilot, and claude. i really need to get a local coding model hmmm

cudnn_cu12@_proteuss_·2d

@blelbach @_arohan_ interesting ..curious how one reward hacks this

Bryce, the CUDA Colonel@blelbach·2d

@_proteuss_ @_arohan_ It was reward hacked. More validation was added.

rohan anil@_arohan_·2d

dhu.randhar is beating the CUDA Colonel's forces and is currently ahead by 101 µs. a cool username I must say.

rohan anil@_arohan_

Here is a tutorial on QR for the uninitiated who want to enter the competition. All you need to know is a little bit of algebra. Here @ means matrix multiplication. Let A be a 3 x 3 matrix; the task is: given A, can you find Q and R? A special property of Q is that Q @ Q.T = I, and R is an upper triangular matrix.

Given A, find Q and R such that Q @ R = A:

[ [0, 1, 0],     [ [2, 3, 4],     [ [0, 5, 6],
  [1, 0, 0],  @    [0, 5, 6],  =    [2, 3, 4],
  [0, 0, 1] ]      [0, 0, 7] ]      [0, 0, 7] ]

The R matrix has a special property:

[ [2, 3, 4],
  [0, 5, 6],
  [0, 0, 7] ]

We are generally fine with any R matrix with this structure, see the zeros on the lower triangle.

    [ [0, 5, 6],
A =   [2, 3, 4],
      [0, 0, 7] ]

First column of A:

    [ [0],
x =   [2],
      [0] ]

Householder matrix:

    [ [0, 1, 0],
H =   [1, 0, 0],
      [0, 0, 1] ]

Now compute Hx:

      [ [0, 1, 0],     [ [0],     [ [2],
H x =   [1, 0, 0],  @    [2],  =    [0],
        [0, 0, 1] ]      [0] ]      [0] ]

Now how do we get H? Well it turns out these are called reflectors. The Householder reflectors. Think of putting a mirror on your vector.
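One way to answer "how do we get H?" for this example is the standard reflector formula H = I - 2 v v^T / (v^T v) with v = x - ||x|| e1. The sketch below is an illustration rather than part of the original thread; the function name `householder_matrix` is hypothetical, and the sign of v is chosen to reproduce the H shown above (numerical libraries usually pick the opposite sign to avoid cancellation).

```python
import numpy as np

def householder_matrix(x):
    """H = I - 2 v v^T / (v^T v), which reflects x onto ||x|| * e1."""
    x = np.asarray(x, dtype=float)
    e1 = np.zeros_like(x)
    e1[0] = 1.0
    v = x - np.linalg.norm(x) * e1   # the normal of the "mirror"
    if np.allclose(v, 0.0):
        return np.eye(len(x))        # x already lies along e1
    return np.eye(len(x)) - 2.0 * np.outer(v, v) / (v @ v)

A = np.array([[0.0, 5.0, 6.0],
              [2.0, 3.0, 4.0],
              [0.0, 0.0, 7.0]])

H = householder_matrix(A[:, 0])  # reflector built from the first column
R = H @ A                        # one reflection already makes this A triangular
Q = H.T                          # H is symmetric and orthogonal, so Q = H

print(H)
print(R)
print(Q @ R)
```

For a general matrix the same step is repeated on the lower-right submatrix for each remaining column, and Q is the product of the reflectors.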

8.5K
