Isalia20

607 posts

@Is36E

PyTorch enjoyer @ HF 🤗 https://t.co/G5bfywg70T

Joined May 2017

93 Following · 177 Followers

Isalia20@Is36E·9h

@baggiponte Agree on it being a second-class citizen. AFAIK there is no official roadmap, but it should become much better/faster in the next 2 releases (2.13/2.14)

Luca Baggi@baggiponte·9h

@Is36E Is there a sort of "roadmap" for pytorch on MPS? I _feel_ it's like a second-class citizen, but I might be wrong.

Isalia20@Is36E·10h

Shipped specialized SDPA kernels for PyTorch MPS, up to 16x faster than the previous MPSGraph path 🚀 Metal kernels for both decode (q_len=1) and prefill (long causal)
- Decode, 16k ctx, D=128: 1.42 → 0.087 ms (16.3x)
- Prefill, 4k seq, D=96: 99.6 → 18.8 ms (5.3x)
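
For readers unfamiliar with the two cases named in the post, here is a minimal pure-Python reference for scaled dot-product attention. It is only a sketch of the math, not the Metal kernels from the post; the function name `sdpa`, the nested-list layout, and the end-aligned causal mask are assumptions made for illustration.

```python
import math

def sdpa(q, k, v, causal=False):
    """Reference scaled dot-product attention on nested lists (illustrative only).

    q: [q_len][D], k: [kv_len][D], v: [kv_len][D]. Returns [q_len][D].
    Assumption: with causal=True, query i is aligned to the END of the keys,
    so it attends to keys 0..(kv_len - q_len + i).
    """
    q_len, kv_len, d = len(q), len(k), len(q[0])
    scale = 1.0 / math.sqrt(d)
    out = []
    for i, qi in enumerate(q):
        last = kv_len - q_len + i if causal else kv_len - 1
        scores = [scale * sum(a * b for a, b in zip(qi, k[j])) for j in range(last + 1)]
        m = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out.append([sum(w * v[j][c] for j, w in enumerate(weights)) / total for c in range(d)])
    return out

k = [[1.0, 0.0], [0.0, 1.0]]
v = [[1.0, 0.0], [3.0, 2.0]]

# Decode: a single query (q_len=1) attends to the whole key/value cache.
decode_out = sdpa([[0.0, 0.0]], k, v)

# Prefill: every position is a query, and the causal mask hides later keys.
prefill_out = sdpa([[0.0, 0.0], [0.0, 0.0]], k, v, causal=True)
```

Decode does one row of work per step over a long cache, while prefill does a full triangular block at once, which is one plausible reason the two cases get separate kernels.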

Isalia20@Is36E·25 Apr

@0xkeenz yes, please let me know if any op is slower than CPU on MPS and I'll look into it

KeenZ😶‍🌫️@0xkeenz·25 Apr

Great... When I used PyTorch before, it felt really weird that CPU processing was faster than MPS. Is this finally getting fixed? 😭

Isalia20@Is36E

This marks the end of my first week at @huggingface! I'm joining as a founding engineer on HF's PyTorch team. My first project: safetensors on Mac is up to 3x faster 🚀 Parallel reads straight into MPS unified memory, no CPU staging. MB Pro M5 Pro
- Cold 16 GB: 2.97 → 8.23 GB/s (2.8×)
- Warm 3 GB: 10.3 → 26.6 GB/s (2.6×)

Isalia20@Is36E·24 Apr

Isalia20@Is36E·21 Apr

@francoisfleuret Also worth mentioning that getting good GPU utilization/power usage with pipeline parallelism is quite tricky

François Fleuret@francoisfleuret·21 Apr

Nothing shockingly dumb?

Isalia20@Is36E·12 Apr

Pesky bug killing performance on @PyTorch MPS. Can you spot it?

Isalia20@Is36E·6 Apr

@josephjojoe @awnihannun Did you torch.compile the MPS one?

Joseph Jojoe@josephjojoe·6 Apr

I'm open-sourcing an MLX port of the ESM-2 protein model family so more people can tinker with AI & biology on Apple silicon! (1/5)

Isalia20@Is36E·30 Mar

@mohitwt_ I think fusing means not having an extra kernel launch. Doing: GEMM -> inplace op -> GEMM isn't fusion. GEMM still writes the output back to global memory, inplace still needs to read each element from global memory and write it back
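
The point about global memory can be made concrete with a toy accounting of kernel launches and bytes moved for the intermediate activation tensor. This is only a sketch under stated assumptions: the function name, the 2-byte element size, and the "GELU applied in the GEMM epilogue" variant are illustrative choices, not something measured or described in the thread.

```python
def activation_traffic(n_elements, bytes_per_element=2):
    """Toy model: launches and global-memory bytes for the tensor between two GEMMs.

    Only traffic on the intermediate activation is counted; weights, inputs,
    and caches are ignored (assumption made to keep the comparison simple).
    """
    n = n_elements * bytes_per_element
    # GEMM1 writes n, the elementwise kernel reads n and writes n, GEMM2 reads n.
    separate = {"kernels": 3, "bytes": n + (n + n) + n}
    return {
        "gemm_gelu_gemm": separate,
        # In-place saves an allocation, but the read and write still happen.
        "gemm_inplace_gelu_gemm": dict(separate),
        # Hypothetical fused variant: GELU is applied before GEMM1 writes its output.
        "gelu_in_gemm_epilogue": {"kernels": 2, "bytes": n + n},
    }

traffic = activation_traffic(1024)
```

Under this model the in-place version moves exactly as many bytes as the out-of-place one; only the fused variant drops a kernel launch and a round trip to global memory.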

mohit@mohitwt_·30 Mar

Built a fused GELU CUDA kernel from scratch and plugged it into an MLP block (Linear -> GELU -> Linear).

Instead of: GEMM -> GELU -> GEMM (3 separate kernels, intermediate buffer written and read back from global memory)
Now: GEMM -> in-place GELU kernel -> GEMM (GELU applied directly on the Linear1 output buffer, no extra allocation)

Results:
- up to 2.9x speedup on smaller sizes
- 1.1 to 1.6x on larger ones

cuBLAS handles the matmuls. The custom kernel only targets what cuBLAS can't do: the elementwise activation sitting between two projections.

Isalia20@Is36E·24 Mar

@mohitwt_ If you do augmentations on data, this isn't as easy

mohit@mohitwt_·24 Mar

You don't need the teacher model present during student training. Run the teacher model separately across your entire data and save the output (logits) to disk. Then load the raw logits during student training:
1. Load teacher on GPU > run it over the data > save output > unload teacher
2. Load student > load logits > apply softmax+temp > train
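
The two phases above can be sketched in plain Python. This is a minimal illustration, not the poster's code: the function names, the in-memory `cache` dict standing in for logits saved to disk, the temperature of 2.0, and the customary T² scaling of the KL term are all assumptions.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax with max-subtraction for stability."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, cached_teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p = softmax(cached_teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q) if pi > 0.0)
    return kl * temperature ** 2

# Phase 1 (teacher only): compute logits once and store them.
# A dict stands in for files on disk here.
cache = {"sample_0": [2.0, 0.5, -1.0]}

# Phase 2 (student only): the teacher is no longer in memory;
# the raw logits are loaded and softened at training time.
loss = distillation_loss([1.5, 0.7, -0.8], cache["sample_0"])
```

Storing raw logits rather than probabilities keeps the temperature a training-time choice. The cache is keyed by sample, which is why random augmentations complicate the scheme: each augmented view would need its own teacher output.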

Raj Dabre@prajdabre

ML interview question: Suppose you are implementing Knowledge distillation, and you have a teacher and a student model. However you simply do not have the necessary GPU resources to fit both the teacher and the student into the GPU at the same time. What is your solution?

Isalia20@Is36E·23 Mar

@Birchlabs oh cool, must have missed it

Birchlabs@Birchlabs·23 Mar

@Is36E it says it is github.com/pytorch/pytorc… github.com/pytorch/pytorc… > Use Metal-4.0 on MacOS-26 (supports lambdas, more C++17 features and tensor arguments)

Birchlabs@Birchlabs·23 Mar

pytorch 2.11 is now out! flex attention now has an FA4 backend. differentiable collectives. varlen attention supports sliding window. Metal 4, MPS operator expansion. inductor can now emit NVGEMM. and… LSTMs are back? github.com/pytorch/pytorc…

Isalia20@Is36E·19 Mar

@qtnx_ @qkvproj legendary handle

Isalia20@Is36E·7 Mar

@mmaaz_98 not too hard

Maaz@mmaaz_98·7 Mar

How hard is it to translate CUDA to MPS

Isalia20@Is36E·4 Mar

Hopefully soon distributed on @PyTorch MPS will come to life

Isalia20@Is36E·24 Feb

@vikhyatk or power usage

vik@vikhyatk·24 Feb

this is why it’s best to measure utilization by gpu temperature

Amin@__aminima__

recently learned that "volatile gpu utilization" (in nvidia-smi) shows the % of time AT LEAST ONE kernel was executing if we have a single kernel running an infinite loop on one block, nvidia-smi would show gpu util as 100% despite most of the gpu being idle
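
A toy model shows how the two readings diverge. It is a sketch of the definition quoted above, not of how nvidia-smi samples the hardware; the function name, the per-sample "busy SM count" input, and the 100-SM GPU are assumptions for illustration.

```python
def utilization_report(busy_sms_per_sample, num_sms):
    """Compare two utilization metrics over a series of samples.

    busy_sms_per_sample: number of SMs running work at each sample time.
    Returns (fraction of samples with at least one kernel active,
             mean fraction of SMs that were actually busy).
    """
    n = len(busy_sms_per_sample)
    any_kernel_active = sum(1 for s in busy_sms_per_sample if s > 0) / n
    mean_sm_busy = sum(busy_sms_per_sample) / (n * num_sms)
    return any_kernel_active, mean_sm_busy

# One block spinning forever on a single SM of a hypothetical 100-SM GPU.
reported, actual = utilization_report([1] * 100, num_sms=100)
```

In this scenario the "at least one kernel" metric reads 1.0 (100%) while only 1% of the SMs are doing anything, which is why the thread suggests temperature or power draw as sanity checks.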

Isalia20@Is36E·21 Feb

@_avichawla You can also do div_(255) to divide it in place

Avi Chawla@_avichawla·20 Feb

Here's a neural net optimization trick that leads to ~4x faster CPU to GPU transfers.

Imagine an image classification task.
- We define the network, load the data and transform it.
- In the training loop, we transfer the data to the GPU and train.

Here's the problem with this. If you look at the profiler:
- Most of the time/resources will be allocated to the kernel (the actual training code).
- However, a significant amount of time will also be dedicated to data transfer from CPU to GPU (this appears under cudaMemcpyAsync).

Reducing the data transfer is simple. Recall that the original dataset was composed of pixel values. These were 8-bit integers, and we transformed them to 32-bit floats. Next, we transferred these 32-bit floating-point tensors to the GPU. This meant that transforming the data led to more data (4x) being transferred.

The solution is simple: move the transformation step after the data transfer, since in that case we transfer 8-bit integers instead of 32-bit floats. As a result, you will notice a significant drop in the data transfer step.

Of course, this technique doesn’t apply to all neural network use cases, like NLP, where we inherently deal with 32-bit float embeddings. However, whenever I have identified any possibility to use this trick, I have experienced noticeable gains from it.

👉 Over to you: What other NN optimization techniques are you aware of?
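
The 4x figure follows directly from the element sizes, as this small arithmetic sketch shows. The batch shape is a hypothetical example and the helper name is made up for illustration. In PyTorch terms the reordering roughly corresponds to moving the uint8 tensor first and converting afterwards, for example `x.to(device).float().div_(255)`.

```python
def transfer_bytes(batch, channels, height, width, bytes_per_value):
    """Bytes copied host-to-device for one batch of images."""
    return batch * channels * height * width * bytes_per_value

shape = (64, 3, 224, 224)  # hypothetical batch of RGB images

# Convert on the CPU first, then transfer 32-bit floats.
as_float32 = transfer_bytes(*shape, bytes_per_value=4)

# Transfer the raw 8-bit pixels, then convert on the GPU.
as_uint8 = transfer_bytes(*shape, bytes_per_value=1)

ratio = as_float32 / as_uint8
```

The saving applies to the bytes copied; how much wall-clock time it recovers depends on how large the copy is relative to the rest of the training step.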