Shauray

779 posts


@Shauray7

I'm Jack's wasted life

Joined September 2017
771 Following 209 Followers
Pinned Tweet
Shauray
Shauray@Shauray7·
Twinflow distillation blog for qwen-image-2512 is finally up: the idea: 2 NFE on a 20B flow model. standard twinflow gives you a performance ceiling (student can't beat teacher). so i bolted on latent-space rl gradients to escape the teacher's distribution + dynamic renoise sampling to stop early training from collapsing. did it work? kinda! checkpoints from step 5k-10k are on hf if you want to poke at them and figure out what i did wrong. my guess is the rl weight ramp needs to be sigmoidal not a hard switch and more steps! All the links to the blog, code and the checkpoints are in comments (just so that X does not limit this post).
Shauray@Shauray7

Distilling Qwen-Image-2512 using TwinFlow: the student is sucking up knowledge from a monster teacher. I think qwen-image is capable of far more than Z-image, at least on the realistic front (personal observation). Slashed batch times with MP and custom augs; 8xH200 pinned at max, pretty much. Also added RL to the loop. In theory it should get better than the teacher, but that remains to be seen since the RL only kicks in after 2k steps. The loss won't be very indicative of how training is going, I guess, since it's a distillation run. Attaching some results on how it's going [at steps 1100 and 1600].

1
7
64
6.9K
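The guess in the pinned post, that the RL weight should ramp sigmoidally instead of switching on hard, can be sketched as below. This is only an illustration: the 2k-step switch point comes from the quoted post, while `width`, `w_max`, and both function names are assumptions, not values from the actual training code.

```python
import math

def hard_switch(step, start=2000, w_max=1.0):
    """RL loss weight that jumps from 0 to w_max at `start` (2k, per the quoted post)."""
    return w_max if step >= start else 0.0

def sigmoid_ramp(step, center=2000, width=500.0, w_max=1.0):
    """Smooth alternative: the weight follows a logistic curve centred on `center`.

    `width` (in steps) is a hypothetical knob for how gradual the ramp is.
    """
    return w_max / (1.0 + math.exp(-(step - center) / width))

# Compare the two schedules at a few training steps.
schedule = [(s, hard_switch(s), sigmoid_ramp(s)) for s in (0, 1000, 2000, 3000, 5000)]
```

With the ramp, the RL term is already contributing a little before step 2k and reaches full weight well after it, so the student never sees an abrupt change in objective.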
Tanmay Patil
Tanmay Patil@TanmayPatil79·
🚀 Just released: Krea-2 Depth ControlNet-LoRA. Keeps near-perfect 3D structure while letting you completely change the image with any prompt. Works great with Krea-2 Turbo too. huggingface.co/Patil/Krea-2-d… Big thanks to @edwixxxx @Shauray7 who helped test and refine it along the way. Thanks @krea_ai for the great model 🤗.
2
3
10
978
Shauray
Shauray@Shauray7·
@SubhoGhosh02 @tqchenml agreed. the explicit dispatch control is the interesting part, though both still lower through the same TVM layer underneath
0
0
5
80
Shauray
Shauray@Shauray7·
clever, I think descending only matters because the scheduler is dynamic. on a static strided persistent scheduler it should be symmetric under reversal, so ascending should give similar speedups? though I wouldn't call it LPT anymore, it would just be grouping equal-K tiles to balance CTA bands.
1
0
2
90
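A toy model of the symmetry argument above, assuming a static strided persistent schedule where CTA c simply takes tiles c, c + P, c + 2P, ... and each tile's cost is proportional to its group's K. The group sizes, K values and function name are made up for illustration; this is not the CUTLASS scheduler itself, and it ignores wave and memory effects.

```python
def strided_cta_loads(tile_costs, num_ctas):
    """Total work per CTA when CTA c runs tiles c, c + num_ctas, c + 2*num_ctas, ..."""
    return [sum(tile_costs[c::num_ctas]) for c in range(num_ctas)]

# Hypothetical problem: three groups with different K, 8 tiles each.
# Per-tile cost is modelled as proportional to K.
ascending = [512] * 8 + [2048] * 8 + [8192] * 8
descending = ascending[::-1]
interleaved = [512, 2048, 8192] * 8  # same tiles, equal-K tiles not kept together

num_ctas = 3
loads_asc = strided_cta_loads(ascending, num_ctas)
loads_desc = strided_cta_loads(descending, num_ctas)
loads_mixed = strided_cta_loads(interleaved, num_ctas)

# Reversing the list keeps the same partition of tiles (two tiles share a CTA
# iff their indices are congruent mod num_ctas, before and after reversal),
# so only the CTA labels change and the slowest CTA does the same work.
makespan_asc = max(loads_asc)
makespan_desc = max(loads_desc)
makespan_mixed = max(loads_mixed)
```

In this model ascending and descending give the same makespan, while the interleaved order piles every K=8192 tile onto one CTA. That is consistent with reading the gain as "group equal-K tiles" rather than "longest first" when the assignment is static.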
Shauray retweeted
subho ghosh
subho ghosh@SubhoGhosh02·
FA4's SingleTileLPTScheduler exploits that causal attention work grows with block index, so it just visits blocks in reverse (block = num_block - 1 - block). So why not try something similar on grouped gemm! In grouped GEMM the analog is that a tile's mainloop time is proportional to its group's K, and StaticPersistentGroupTileScheduler visits tiles in group-metadata order. So LPT = order groups by descending K. Result is 1.74x speedup in grouped gemm, just by sorting the scheduling path.
3
9
57
2.5K
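Why descending K can help shows up in a small greedy model, assuming (hypothetically) that whichever CTA frees up first takes the next tile in the list and that a tile's time is proportional to its group's K. The function name and the K values are illustrative, not taken from FA4 or CUTLASS, and the 1.74x figure above is not reproduced here.

```python
import heapq

def greedy_makespan(tile_costs, num_ctas):
    """Finish time of the last CTA when each free CTA takes the next tile in order."""
    finish_times = [0] * num_ctas  # min-heap of per-CTA finish times
    heapq.heapify(finish_times)
    for cost in tile_costs:
        earliest = heapq.heappop(finish_times)
        heapq.heappush(finish_times, earliest + cost)
    return max(finish_times)

# Hypothetical workload: four tiles with K=512 and one tile with K=2048.
tiles = [512, 512, 512, 512, 2048]

ascending_order = sorted(tiles)          # the long tile is scheduled last
lpt_order = sorted(tiles, reverse=True)  # longest processing time first

makespan_ascending = greedy_makespan(ascending_order, num_ctas=2)
makespan_lpt = greedy_makespan(lpt_order, num_ctas=2)
```

With the long tile last, one CTA sits idle while the other finishes it. Scheduling it first lets the short tiles fill in around it.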
Shauray
Shauray@Shauray7·
If you're buying a UGREEN NVMe enclosure for anything chip-specific (at least in India): their new batches, at least of their USB3.x enclosures, come with an RTL chip (RTL9210) even though they mention ASM chips on some of their enclosures. Might just be silicon lottery across their batches. This is for USB3.x; their USB4 enclosures might still come with an ASM chip.
0
0
4
123
Shauray retweeted
Yifei Wang
Yifei Wang@WangYw251·
Pixel-space autoregressive generation. The model demonstrates strong generative performance while maintaining high linear probing accuracy. Can we unify image understanding and generation with AR? arxiv.org/abs/2606.27978
0
5
32
3.2K
Etched
Etched@Etched·
We're coming out of stealth. We've built our first racks after a successful A0 tapeout, $1B+ in customer contracts, and $800m raised. Early customer tests show us achieving SOTA throughput, latency, and power efficiency on inference workloads. Our first racks ship this summer.
607
901
9.3K
5.8M
Shauray
Shauray@Shauray7·
step 11600, changed a lot of stuff. had a measurement bug on PSNR (it is the only metric i have, however bullshit it might be). I thought it was undersampling so I added a 50-step sample (it wasn't), added a decode recon test in order to measure the right PSNR, and more importantly made the run true to the paper: fixed LR, global BS 64, and much more balanced sigma selection. I feel it's just underconverged; the paper mentions 100k steps and I'm not even close to being done, but it's good to remove all paths to failure. [compare the images to the quoted ones, noticeably less noisy now] more details on GitHub [github.com/shauray8/l2p_q…]
Shauray@Shauray7

step 5800, moving slowly ofc, no visible changes, though I finally got to writing the readme for it and everything is in its place now! links to everything below ofc, using the readme as a report here on what worked and the progress of it all.

0
1
9
340
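For reference, the PSNR mentioned in this post is 10·log10(MAX² / MSE). A minimal pure-Python sketch, assuming images are flat sequences of floats in [0, 1]; the function name and the `max_val` default are assumptions, and this is not the repo's metric code.

```python
import math

def psnr(reference, reconstruction, max_val=1.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences."""
    if len(reference) == 0 or len(reference) != len(reconstruction):
        raise ValueError("inputs must be non-empty and the same length")
    mse = sum((a - b) ** 2 for a, b in zip(reference, reconstruction)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_val ** 2 / mse)
```

A round-trip check (encode then decode the same image) scored with the same function is one way to sanity-check the measurement path. Whether that matches the repo's decode recon test is an assumption.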
Shauray
Shauray@Shauray7·
finally got time to play around with krea 2 and yeah, it does blow my mind how much this does out of the box. always wanted to make something inspired by "into the spiderverse", so let's see how well it can replicate that look. btw I'm using it with the layer scaling from here huggingface.co/Beinsezii/Krea…
0
0
4
192
the tiny corp
the tiny corp@__tinygrad__·
Today, if you were writing a bunch of kernels, what would you reach for? Raw CUDA? tile-lang? Triton? ThunderKittens?
77
7
375
46.1K
Shauray
Shauray@Shauray7·
links to the training script and the dataset - github.com/shauray8/l2p_q…
Shauray@Shauray7

So some progress on this: got a decent overfit run, using 6/6 blocks rather than the 3/3 I think the paper uses, because that just works better on my runs. The thing does memorize the images in pixel space, and single-step recon is pretty gooood. From pure noise it's a little soft, but hey, it was a small run. It does plateau after a while, but that could just be a scaling thing. Training more on low-noise steps makes it plateau a little later, but it still plateaus. I remember talking to @NagaSaiAbhinay about Muon a few days back so I tried that as well: Muon on 2D attn/mlp weights and AdamW for embeddings, norms and decoder did not quite beat AdamW on my test runs. Could be because most of the work is being done by the detailer head here and Muon just doesn't finetune an Adam-pretrained model well. Attached a couple of samples from the overfit run and the loss curves for the Adam and Muon runs; the sudden jumps in the loss are due to low sigma.

0
0
3
290
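The optimizer split described in the quoted post (Muon on 2D attn/mlp weights, AdamW on embeddings, norms and the decoder) amounts to a routing rule over parameter names and shapes. A framework-free sketch, where the parameter names, shapes, and substring checks are all hypothetical and not taken from the training script:

```python
def route_params(named_shapes):
    """Split (name, shape) pairs into a Muon group and an AdamW group."""
    muon, adamw = [], []
    for name, shape in named_shapes:
        is_matrix = len(shape) == 2
        in_block = "attn" in name or "mlp" in name
        excluded = any(key in name for key in ("embed", "norm", "decoder"))
        if is_matrix and in_block and not excluded:
            muon.append(name)
        else:
            adamw.append(name)
    return muon, adamw

# Made-up parameter names and shapes, for illustration only.
example = [
    ("blocks.0.attn.qkv.weight", (3072, 1024)),
    ("blocks.0.attn.qkv.bias", (3072,)),
    ("blocks.0.mlp.fc1.weight", (4096, 1024)),
    ("blocks.0.norm1.weight", (1024,)),
    ("patch_embed.weight", (1024, 768)),
    ("decoder.out.weight", (768, 1024)),
]
muon_names, adamw_names = route_params(example)
```

In a real run each list would be handed to its own optimizer instance. Biases and other 1D tensors fall through to AdamW here because the rule only sends 2D matrices to Muon.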
Shauray
Shauray@Shauray7·
step 5800, moving slowly ofc, no visible changes, though I finally got to writing the readme for it and everything is in its place now! links to everything below ofc, using the readme as a report here on what worked and the progress of it all.
Shauray@Shauray7

step 4600. there were no GPUs where I wanted them for the whole of yesterday, let's see how today goes. still on 2 GPUs, and I'm seriously thinking of setting grad accum to 2 to match the paper. also, in the second image the guy has started walking in the right direction now (look at the same image in the quoted post below)

1
0
6
564
Shauray retweeted
Bohan Hou
Bohan Hou@bohanhou1998·
We release TIRx today, a minimal compiler stack and hardware-native DSL for frontier ML kernels, built around storage-first tensor layouts and reusable tile primitives. tvm.apache.org/2026/06/22/tirx On NVIDIA B200, TIRx delivers up to ~1.08× over cuBLASLt on dense GEMM, outperforms DeepGEMM on all FP8 blockwise workloads with up to ~1.09× speedup, keeps FlashAttention-4 (FA4) typically within ~±2% of CuTeDSL, and remains competitive with cuBLASLt/FlashInfer on NVFP4 GEMM. Through our past experiences building frontier ML kernels, megakernels, and agentic kernel systems, we kept seeing the same boundary problem: new operators and new hardware require new optimization strategies that often break old programming models or compiler passes. TIRx builds on top of Apache TVM and moves toward a simple goal: let users and agents express the best-performing program, even for future hardware generations, while keeping the engineering effort for new kernels and new hardware as low as possible.
4
44
143
37.8K
Shauray retweeted
NYRE
NYRE@sleenyre·
Spent a lot of time writing this over the past week. If you want juicy details on our training stack and infra, go read it! krea.ai/blog/krea-2-te…
9
28
162
20.1K