Shauray

779 posts

@Shauray7

I'm Jack's wasted life

Joined September 2017

771 Following · 209 Followers

Pinned post

Shauray@Shauray7·3 Mar

Twinflow distillation blog for qwen-image-2512 is finally up: the idea: 2 NFE on a 20B flow model. standard twinflow gives you a performance ceiling (student can't beat teacher). so i bolted on latent-space rl gradients to escape the teacher's distribution + dynamic renoise sampling to stop early training from collapsing. did it work? kinda! checkpoints from step 5k-10k are on hf if you want to poke at them and figure out what i did wrong. my guess is the rl weight ramp needs to be sigmoidal not a hard switch and more steps! All the links to the blog, code and the checkpoints are in comments (just so that X does not limit this post).
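A minimal sketch of the two RL-weight schedules the post contrasts: a hard switch versus a sigmoidal ramp. The function names, the ramp width, and the 2,000-step center are illustrative assumptions, not values taken from the released training code.

```python
import math

def hard_switch(step, start=2000, w_max=1.0):
    """RL weight jumps from 0 to w_max at `start` (hypothetical default)."""
    return w_max if step >= start else 0.0

def sigmoid_ramp(step, center=2000, width=300, w_max=1.0):
    """RL weight rises smoothly, reaching w_max / 2 at `center`.

    `width` (an assumed knob) sets how many steps the transition takes:
    larger values give a gentler ramp.
    """
    return w_max / (1.0 + math.exp(-(step - center) / width))

if __name__ == "__main__":
    for step in (0, 1000, 1700, 2000, 2300, 3000, 5000):
        print(step, hard_switch(step), round(sigmoid_ramp(step), 4))
```

With the ramp, the student sees a gradually growing RL gradient instead of a sudden change in the loss mix at one step.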

Shauray@Shauray7

Distilling Qwen-Image-2512 using TwinFlow, student sucking up knowledge from a monster teacher. I think qwen-image is capable of doing far more than Z-image, at least on the realistic front (personal observations). Slashed batch times with MP and custom augs, 8xH200 pinned at max pretty much. Also added RL to the loop; in theory it should get better than the teacher, but that remains to be seen since the RL kicks in after 2k steps. The loss won't be very indicative of how the training is working, I guess, since it's a distillation run. Attaching some results on how it's going [at 1100 and 1600].

6.9K

Tanmay Patil@TanmayPatil79·5h

🚀 Just released: Krea-2 Depth ControlNet-LoRA Keeps near-perfect 3D structure while letting you completely change the image with any prompt. Works great with Krea-2 Turbo too. huggingface.co/Patil/Krea-2-d… Big thanks to @edwixxxx @Shauray7 who helped test and refine it along the way. Thanks @krea_ai for great model 🤗.

978

Shauray@Shauray7·5h

@TanmayPatil79 @edwixxxx dropout always helps, oh wait i forgot did you dropout on cfg?

117

Shauray@Shauray7·5h

@SubhoGhosh02 @tqchenml agreed. the explicit dispatch control is the interesting part though both still lower through the same TVM layer underneath

subho ghosh@SubhoGhosh02·5h

@Shauray7 @tqchenml TIRx felt like tilelang but more control over proxies and dispatch

152

Shauray@Shauray7·7h

this is gold ! Crazy they only credited @tqchenml for XGBoost, he literally maintains TVM and the FFI, also I see TIRx in here !!

机器之心 JIQIZHIXIN@jiqizhixin

New book from Tianqi Chen, the creator of XGBoost! Modern GPU Programming For MLSys Link: mlc.ai/modern-gpu-pro…

143

14K

Shauray@Shauray7·21h

clever. I think descending only matters because the scheduler is dynamic. on a static strided persistent scheduler it should be symmetric under reversal, so ascending should give similar speedups? though I wouldn't call it LPT anymore; it would just be grouping equal-K tiles to balance CTA bands.
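A toy model of the argument in this reply, assuming tile cost is proportional to K. "Static strided" here means tile i always goes to CTA i % num_ctas; "dynamic" means whichever CTA frees up first takes the next tile. Both functions and the cost list are illustrative and do not model the real CUTLASS or FA4 schedulers.

```python
import heapq

def static_strided_makespan(costs, num_ctas):
    """Tile i is pinned to CTA (i % num_ctas); return the busiest CTA's load."""
    loads = [0] * num_ctas
    for i, cost in enumerate(costs):
        loads[i % num_ctas] += cost
    return max(loads)

def dynamic_makespan(costs, num_ctas):
    """Each tile goes to the CTA that becomes free earliest (greedy queue)."""
    free_at = [0] * num_ctas
    heapq.heapify(free_at)
    for cost in costs:
        earliest = heapq.heappop(free_at)
        heapq.heappush(free_at, earliest + cost)
    return max(free_at)

# Hypothetical per-tile costs (proportional to each tile's group K).
costs = [1, 1, 1, 1, 2, 2, 8, 8, 8]
ascending = sorted(costs)
descending = ascending[::-1]

if __name__ == "__main__":
    for name, order in (("asc", ascending), ("desc", descending)):
        print(name,
              "static:", static_strided_makespan(order, 4),
              "dynamic:", dynamic_makespan(order, 4))
```

In this model, reversing the order keeps the same tiles together on a CTA under the static strided mapping (tiles that share an index residue still share one after reversal), so ascending and descending give the same makespan. Under the greedy dynamic queue, descending order finishes earlier on this input.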

Shauray reposted

subho ghosh@SubhoGhosh02·1d

FA4's SingleTileLPTScheduler exploits that causal attention work grows with block index, so it just visits blocks in reverse (block = num_block - 1 - block). So why not try something similar on grouped gemm! In grouped GEMM the analog is that a tile's mainloop time is proportional to its group's K, and StaticPersistentGroupTileScheduler visits tiles in group-metadata order. So LPT = order groups by descending K. Result is 1.74x speedup in grouped gemm, just by sorting the scheduling path.
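The two reorderings described in this post can be sketched in a few lines. The block-reversal expression is quoted from the post; `lpt_group_order` and the example K values are hypothetical helpers for illustration, not part of FA4 or CUTLASS.

```python
def reversed_block(block, num_block):
    """Visit causal-attention blocks last-to-first, as quoted in the post."""
    return num_block - 1 - block

def lpt_group_order(group_ks):
    """Return group indices sorted by descending K (longest tiles first).

    Ties keep their original metadata order because Python's sort is stable.
    """
    return sorted(range(len(group_ks)), key=lambda g: group_ks[g], reverse=True)

if __name__ == "__main__":
    print([reversed_block(b, 4) for b in range(4)])
    print(lpt_group_order([128, 4096, 512, 4096]))
```

The scheduler would then walk groups in the returned order instead of group-metadata order; the GEMM results are unchanged, only the visiting order differs.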

2.5K

Shauray@Shauray7·1d

If you're buying a UGREEN NVMe enclosure for anything chip-specific (at least in India): their new batches, at least of the USB3.x enclosures, come with an RTL chip (RTL9210) even though they mention ASM chips on some of their enclosures. Might just be silicon lottery across their batches. This is for USB3.x; their USB4 enclosures might still come with an ASM chip.

123

Shauray reposted

Yifei Wang@WangYw251·1d

Pixel-space autoregressive generation. The model demonstrates strong generative performance while maintaining high linear probing accuracy. Can we unify image understanding and generation with AR? arxiv.org/abs/2606.27978

3.2K

Shauray@Shauray7·2d

@__tinygrad__ @bfspector @Etched x.com/geoffreywoo/st…

GEOFF@geoffreywoo

we love @Etched 🚀 congrats @robertwachen @UbertiGavin and the entire team. @LoganPaul and I @antifund saw the early demos last winter and what this team is building is insane. excited to run my own frontier inference cluster in my garage 😂

the tiny corp@__tinygrad__·2d

@bfspector @Etched x.com/__tinygrad__/s…

the tiny corp@__tinygrad__

@Etched If you see a technical person in the replies saying good things about them, cross check the (paid) advisor list. It's a classic playbook from crypto.

718

Etched@Etched·3d

We're coming out of stealth. We've built our first racks after a successful A0 tapeout, $1B+ in customer contracts, and $800m raised. Early customer tests show us achieving SOTA throughput, latency, and power efficiency on inference workloads. Our first racks ship this summer.

607

901

9.3K

5.8M

Shauray@Shauray7·2d

arxiv.org/pdf/2507.13347

Shauray@Shauray7·4d

blog.hellas.ai/blog/thunderbo…

243

Shauray@Shauray7·4d

step 11600, changed a lot of stuff. had a measurement bug on PSNR (it is the only metric i have, however bullshit it might be). I thought it was undersampling so I added a 50-step sample (it wasn't), added a decode recon test in order to measure the right PSNR, and more importantly made the run true to the paper: fixed LR, global BS 64, and much more balanced sigma selection. I feel it's just underconverged; the paper mentions 100k steps and I'm not even close to being done, but it's good to remove all paths to failure. [compare the images to the quoted ones, noticeably less noisy now] more details on GitHub [github.com/shauray8/l2p_q…]
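For reference, a minimal PSNR sketch in plain Python. It assumes both images are flat lists of pixel values on the same scale ([0, 1] by default); a mismatch in value range between the two inputs is a common source of PSNR measurement bugs. This is a generic definition, not the metric code from the linked repo.

```python
import math

def psnr(reference, test, max_val=1.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel lists.

    `max_val` is the peak pixel value (1.0 for [0, 1] images, 255 for uint8).
    """
    if len(reference) != len(test) or not reference:
        raise ValueError("inputs must be non-empty and the same length")
    mse = sum((r - t) ** 2 for r, t in zip(reference, test)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_val ** 2 / mse)

if __name__ == "__main__":
    clean = [0.0, 0.5, 1.0, 0.25]
    noisy = [0.1, 0.4, 0.9, 0.35]
    print(round(psnr(clean, noisy), 2))
```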

Shauray@Shauray7

5800, moving slowly ofc, no visible changes, though I finally got to writing the readme for it and everything is in its place now ! links to everything below ofc, using the readme as a report here on what worked and the progress of it all.

340

Shauray@Shauray7·26 Jun

finally got time to play around with krea 2 and yeah, it does blow my mind how much this does out of the box. always wanted to make something inspired by "Into the Spider-Verse", so let's see how well it can replicate that. btw I'm using it with the layer scaling from here huggingface.co/Beinsezii/Krea…

192

Shauray@Shauray7·23 Jun

@__tinygrad__ tilelang

134

the tiny corp@__tinygrad__·23 Jun

Today, if you were writing a bunch of kernels, what would you reach for? Raw CUDA? tile-lang? Triton? ThunderKittens?

375

46.1K

Shauray@Shauray7·23 Jun

links to the training script and the dataset - github.com/shauray8/l2p_q…

Shauray@Shauray7

So some progress on this: got a decent overfit run. Using 6/6 blocks rather than the 3/3 (I think) in the paper, because that just works better on my runs. The thing does memorize the images in pixel space, and single-step recon is pretty good. From pure noise it's a little soft, but hey, it was a small run. It does plateau after a while, but that could be just a scaling thing. Training more on low-noise steps makes it plateau a little later, but it still plateaus. I remember talking to @NagaSaiAbhinay about Muon a few days back, so I tried that as well: Muon on 2D attn/mlp weights and AdamW for embeddings, norm and decoder did not quite beat AdamW on my test runs. Could be because most of the work is being done by the detailer head here and Muon just doesn't finetune an Adam-pretrained model well. Attached a couple of samples from the overfit run and the loss for the Adam and Muon runs; the sudden jumps in the loss are due to low sigma.
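A sketch of the parameter split described above: 2D attention/MLP weight matrices go to Muon, everything else (embeddings, norms, decoder, biases) stays on AdamW. The parameter names and the ".attn." / ".mlp." naming convention are assumptions for illustration; the actual model in the linked repo may name things differently.

```python
def split_for_muon(named_shapes):
    """Partition (name, shape) pairs into Muon and AdamW groups.

    A parameter goes to Muon only if it is a 2D matrix AND lives in an
    attention or MLP sub-module (detected by a hypothetical naming scheme).
    """
    muon, adamw = [], []
    for name, shape in named_shapes:
        is_matrix = len(shape) == 2
        in_block = ".attn." in name or ".mlp." in name
        (muon if is_matrix and in_block else adamw).append(name)
    return muon, adamw

if __name__ == "__main__":
    # Hypothetical parameter names and shapes.
    params = [
        ("embed.weight", (32000, 1024)),
        ("blocks.0.attn.qkv.weight", (3072, 1024)),
        ("blocks.0.attn.qkv.bias", (3072,)),
        ("blocks.0.mlp.fc1.weight", (4096, 1024)),
        ("blocks.0.norm1.weight", (1024,)),
        ("decoder.out.weight", (768, 1024)),
    ]
    muon_names, adamw_names = split_for_muon(params)
    print("muon:", muon_names)
    print("adamw:", adamw_names)
```

With a real PyTorch model the same rule could be applied over `model.named_parameters()`, using `p.ndim` in place of `len(shape)`.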

290

Shauray@Shauray7·23 Jun

training script and other stuff- github.com/shauray8/l2p_q… cleaned dataset - huggingface.co/datasets/shaur…

Shauray@Shauray7·23 Jun

Shauray@Shauray7

4600. there were no GPUs where I wanted for the whole of yesterday, let's see how today goes. still on 2 GPUs, I'm seriously thinking of setting grad accum to 2 to match the paper. also, on the second image the guy started walking in the right direction now (look at the same image in the quoted post below)

564

Shauray reposted

Bohan Hou@bohanhou1998·22 Jun

We release TIRx today, a minimal compiler stack and hardware-native DSL for frontier ML kernels, built around storage-first tensor layouts and reusable tile primitives. tvm.apache.org/2026/06/22/tirx On NVIDIA B200, TIRx delivers up to ~1.08× over cuBLASLt on dense GEMM, outperforms DeepGEMM on all FP8 blockwise workloads with up to ~1.09× speedup, keeps FlashAttention-4 (FA4) typically within ~±2% of CuTeDSL, and remains competitive with cuBLASLt/FlashInfer on NVFP4 GEMM. Through our past experiences building frontier ML kernels, megakernels, and agentic kernel systems, we kept seeing the same boundary problem: new operators and new hardware require new optimization strategies that often break old programming models or compiler passes. TIRx builds on top of Apache TVM and moves toward a simple goal: let users and agents express the best-performing program, even for future hardware generations, while keeping the engineering effort for new kernels and new hardware as low as possible.