cudnn_cu12

1.9K posts

@_proteuss_

Interests: Machine learning research, learning in neural networks, applied AI, startups, etc

Joined September 2022
952 Following 305 Followers
Pinned Tweet
cudnn_cu12
cudnn_cu12@_proteuss_·
Lord God, I come to You a sinner And I humbly repent for my sins I believe that Jesus is Lord I believe You raised Him from the dead I would ask that Jesus come into my life And be my Lord and Savior I receive Jesus to take control of my life And that I may live for Him from this day forward Thank you, Lord Jesus For saving me with Your precious blood In Jesus' name, amen
English
0
0
6
1.1K
cudnn_cu12
cudnn_cu12@_proteuss_·
@datavorous_ thanks, i havent been able to crack under 10ms using agents only. will have to do it by hand
English
0
0
0
31
datavorous
datavorous@datavorous_·
I tried to speedrun the QR_2 challenge of GPU MODE in 16 hours. I stopped at 4.7ms; this was my first "proper" GPU MODE contest. Constraints: no way to tokenmaxx, no GPU to profile against extensively.

what I did to understand the math:
> Householder reflection method to do QR decomposition
- reading the problem statement multiple times
- did a 3x3 matrix dry run by hand
- consulted gemini for validation of my mental model
- studied parallel prefix sums
- drew out the tree structure on paper before writing a single line of code
- thought of a simpler approach (blocked method) instead of directly jumping to the tree-like implementation

what worked:
- blocked Householder with GEMM trailing update
- loading panels at ONCE into the SRAM
- tuning block sizes to prevent spilling
- tuning warp count for better occupancy
- (unintended) pytorch fallback for n=4096

what did not work:
- fp8/nvfp4 quantisation (precision violated)
- Cholesky method (not allowed)
- mixed precision with iterative correction
- TF32 global flag (failed tests lmao)
- full panel resident in shared memory (spilled badly)

takeaways: i didn't get a chance to do extensive kernel programming before, so I had to bet on my intuitions:
- bytes moved is the real bottleneck, i have to reduce the round trips of data
- asymptotic wins can still lose in practice
- if you're broke, tokenmaxxing is not an option. manual? doable, but time is a constraint. human x ai? hell yeah!
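For readers who want to repeat the "3x3 dry run by hand" step, here is a minimal pure-Python sketch of unblocked Householder QR. It is not the blocked GPU kernel described in the tweet; the function names `householder_qr` and `matmul` and the sample matrix are illustrative assumptions, and nothing here is tuned for speed.

```python
import math

def matmul(a, b):
    """Plain triple-loop matrix product of two lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def householder_qr(a):
    """Return (q, r) with a = q @ r, q orthogonal, r upper triangular."""
    m, n = len(a), len(a[0])
    r = [[float(v) for v in row] for row in a]
    q = [[1.0 if i == j else 0.0 for j in range(m)] for i in range(m)]
    for k in range(min(m - 1, n)):
        # x is column k from the diagonal downwards.
        x = [r[i][k] for i in range(k, m)]
        norm_x = math.sqrt(sum(t * t for t in x))
        if norm_x == 0.0:
            continue  # nothing to eliminate in this column
        # Pick the sign of alpha opposite to x[0] to avoid cancellation.
        alpha = -math.copysign(norm_x, x[0])
        v = x[:]
        v[0] -= alpha
        vtv = sum(t * t for t in v)
        # r <- H r with H = I - 2 v v^T / (v^T v), acting on rows k..m-1.
        for j in range(n):
            dot = sum(v[i] * r[k + i][j] for i in range(m - k))
            f = 2.0 * dot / vtv
            for i in range(m - k):
                r[k + i][j] -= f * v[i]
        # q <- q H, acting on columns k..m-1 (H is symmetric).
        for i in range(m):
            dot = sum(q[i][k + j] * v[j] for j in range(m - k))
            f = 2.0 * dot / vtv
            for j in range(m - k):
                q[i][k + j] -= f * v[j]
    return q, r

# Hypothetical 3x3 input for a dry run.
a = [[12.0, -51.0, 4.0], [6.0, 167.0, -68.0], [-4.0, 24.0, -41.0]]
q, r = householder_qr(a)
```

A blocked variant would group several reflectors into one panel and apply them to the trailing matrix with a single GEMM, which is where the "bytes moved" saving in the tweet comes from.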
English
7
2
80
5.8K
Bryce, the CUDA Colonel
This isn't one problem - it's a dozen or more! Specialized kernels for different shapes & kinds of matrices will win the day. Leveraging tensorcores & a mixture of numeric formats (FP64, FP32, TF32, FP16, FP8) is likely key to top answers. This may even end up memory bound.
Mark Saroufim@marksaroufim

We've released the QR problem, a more robust qr_v2 with a fresh leaderboard so please resubmit! Thank you to @blelbach, @myainotez and @nikhilbarhate99 for sharing feedback. Sorry if I missed anyone! I considered automatically backfilling all submissions but the rankings do change quite a bit so I figured a refresh would be better.

Changelog
* Fail submissions if they fail when we change random seeds
* Add nasty correctness cases with more degenerate inputs in mixed batches
* Recheck correctness when doing perf testing to avoid Volkswagen cheat
* Reject NaN/Inf residuals
* Validate each matrix factorization residual, since averaging was hiding bad matrices
* Old QR is still open so folks can see submissions but you can't submit anything to it

Wontfix
* Stream hacking is still banned via very blunt ban of the word "stream"; we don't have a good solution for this
* CUDA graphs are allowed but not particularly interesting to us

Best submissions so far if I resubmit their solutions are
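To illustrate why averaging residuals can hide a bad matrix, here is a small pure-Python sketch. It is not the leaderboard's actual checker: the relative Frobenius residual, the tolerance `tol`, and the names `residual` and `check_batch` are assumptions made for this example.

```python
import math

def matmul(a, b):
    """Plain triple-loop matrix product of two lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def frob(m):
    """Frobenius norm of a matrix given as a list of rows."""
    return math.sqrt(sum(v * v for row in m for v in row))

def residual(a, q, r):
    """Relative residual ||A - QR||_F / ||A||_F (absolute if A is zero)."""
    qr = matmul(q, r)
    diff = [[a[i][j] - qr[i][j] for j in range(len(a[0]))]
            for i in range(len(a))]
    denom = frob(a)
    return frob(diff) / denom if denom > 0 else frob(diff)

def check_batch(batch, tol=1e-6):
    """Return (residuals, mean_ok, per_matrix_ok) for a list of (A, Q, R)."""
    residuals = [residual(a, q, r) for a, q, r in batch]
    finite = all(math.isfinite(x) for x in residuals)  # rejects NaN/Inf
    mean_ok = finite and sum(residuals) / len(residuals) <= tol
    per_matrix_ok = finite and all(x <= tol for x in residuals)
    return residuals, mean_ok, per_matrix_ok

# Hypothetical batch: nine exact factorizations and one slightly wrong R.
eye = [[1.0, 0.0], [0.0, 1.0]]
bad_r = [[1.0, 5e-6 * math.sqrt(2.0)], [0.0, 1.0]]
batch = [(eye, eye, eye)] * 9 + [(eye, eye, bad_r)]
residuals, mean_ok, per_matrix_ok = check_batch(batch)
```

In this batch the mean residual stays under the assumed tolerance while the last matrix alone exceeds it, so only the per-matrix check flags the submission.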

English
3
3
51
8.4K
StrongEngineer_
StrongEngineer_@hotschmoe·
I almost bought a dgx spark at $2999, then I almost bought a rtx pro 5000 for $3400. Watched as prices continued to march. Not getting left behind, I decided to throw my hat in the ring with Intel, got 2xB70s for $1900 Has barely been 2 days and I can't believe how much fun this is. Excited to contribute to Intel optimizations and provide useful AI to my family and friends
English
13
0
112
15.7K
yuxinlu1
yuxinlu1@Dadahelper1·
v2 is out on HF and the jump in agentic tool-use is huge — on tau2-bench telecom it goes from ~15% (base Gemma 4 12B) to ~55%: it actually diagnoses → fixes → verifies instead of bailing after step one. A lot of folks are hitting issues across different runtimes (llama.cpp, Ollama, LM Studio, koboldcpp…) — almost all of it is template / tool-format quirks, not the weights. I've written every fix up in the HF discussions, so check there first 🙏 And honestly? Google's models are way less convenient to work with than Qwen — the non-standard chat template + native tool-call format trip up half the ecosystem, while Qwen's standard ChatML just works everywhere. Shipped it anyway 💪 huggingface.co/yuxinlu1/gemma…
English
4
1
15
672
cudnn_cu12
cudnn_cu12@_proteuss_·
my version of alopex, a gradient free / correlative learning algorithm now beats adamW on a mlp learning mnist with zero probes and only forward passes. they said it couldn't be done
English
0
0
0
53
Olek
Olek@oleksoleksoleks·
You can just rent Blackwell hours at 90% discount of inference API prices. Model providers are playing you all like fiddles.
English
49
2
483
75.3K
cudnn_cu12
cudnn_cu12@_proteuss_·
what are some examples of people that have made their own harness?
English
1
0
0
77
cudnn_cu12
cudnn_cu12@_proteuss_·
@0xSero thanks, i'll go through it. but assume i'm dumb: if i want to make a small version like this, what do i have to do?
English
0
0
0
15
0xSero
0xSero@0xSero·
GLM-5.2-REAP checks its work to catch errors before serving me the game. I love this MF
English
5
0
41
3.4K
cudnn_cu12
cudnn_cu12@_proteuss_·
i will try glm 5.2 again on a fresh repo for kernel competition but which harness, inference endpoint, ide should i use?
English
0
0
0
30
yuxinlu1
yuxinlu1@Dadahelper1·
Thank you @HuggingModels for the feature, this made my day 🙏 v1 has been incredible to see take off in the community. Good news: v2 is already cooking — dataset's basically done. It'll be stronger across the board. Releasing as soon as it's ready 🚀
Hugging Models@HuggingModels

Gemma 4 12B Coder is here and it's a game changer for local code generation. This GGUF model packs Google's latest gemma-4 architecture into a compact 12B size, perfect for running on consumer hardware. It's optimized for reasoning and thinking, making it ideal for developers who want fast, private coding assistance without the cloud.

English
2
0
3
961
minu
minu@minu_who·
glm 5.2 outperformed opus 4.8 on our undefeated (by humans and agents) take-home. it also moved fewer tokens and we've known it's cheaper, like 3 times cheaper. i'm kinda having a moment
Hrishi@hrishioa

This is a watershed moment. GLM-5.2 solidly beat Opus 4.8 and human participants in our backend take-home, making the whole thing obsolete. It also pushed forward the state-of-the-art for multi-stage media-to-transcript, with a new release: offmute-v2. I come with receipts.

English
9
8
194
17.1K
difficultyang
difficultyang@difficultyang·
My verdict on GLM-5.2 is that it passes the inflection point threshold. It isn't too cheap to meter but in interactive use you'll have trouble pushing to $100 spend a day, I think.
English
20
18
609
46.8K
cudnn_cu12
cudnn_cu12@_proteuss_·
this algo loves ai slop articles only viewed by bots or something
English
0
0
0
31
cudnn_cu12
cudnn_cu12@_proteuss_·
im not in the weights
English
0
0
0
20
cudnn_cu12
cudnn_cu12@_proteuss_·
@ericzedd pc - i have some old gpus but nowhere near enough vram for large models
English
0
0
0
5
cudnn_cu12
cudnn_cu12@_proteuss_·
i've hit usage limits on everything this month: codex, copilot, and claude. i really need to get a local coding model hmmm
English
1
0
1
60