Isalia20

607 posts

@Is36E

PyTorch enjoyer @ HF 🤗 https://t.co/G5bfywg70T

Joined May 2017
93 Following · 178 Followers
Isalia20
Isalia20@Is36E·
@baggiponte Agree on it being a second-class citizen. AFAIK there is no official roadmap, but it should become much better/faster in the next 2 releases (2.13/2.14)
English
1
0
5
139
Luca Baggi
Luca Baggi@baggiponte·
@Is36E Is there a sort of "roadmap" for pytorch on MPS? I _feel_ it's like a second-class citizen, but I might be wrong.
English
1
0
4
177
Isalia20
Isalia20@Is36E·
Shipped specialized SDPA kernels for PyTorch MPS, up to 16x faster than the previous MPSGraph path 🚀 Metal kernels for both decode (q_len=1) and prefill (long causal):
- Decode, 16k ctx, D=128: 1.42 → 0.087 ms (16.3x)
- Prefill, 4k seq, D=96: 99.6 → 18.8 ms (5.3x)
Isalia20 tweet media
English
2
6
78
5K
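The call these kernels accelerate is the standard PyTorch SDPA entry point. A minimal sketch of a decode-shaped call on MPS, assuming the shapes from the tweet (q_len=1 against a 16k-token KV cache, head_dim 128); batch size, head count, and dtype are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

# Decode-shaped SDPA: a single query token against a long KV cache,
# mirroring the 16k-context, head_dim=128 case from the tweet.
device = "mps" if torch.backends.mps.is_available() else "cpu"

batch, heads, ctx, head_dim = 1, 8, 16_384, 128
q = torch.randn(batch, heads, 1, head_dim, device=device)
k = torch.randn(batch, heads, ctx, head_dim, device=device)
v = torch.randn(batch, heads, ctx, head_dim, device=device)

# q_len=1 decode step; kernel/backend selection happens inside PyTorch.
out = F.scaled_dot_product_attention(q, k, v)
print(out.shape)  # torch.Size([1, 8, 1, 128])
```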
Isalia20
Isalia20@Is36E·
@0xkeenz yes, please let me know if any op is slower than CPU on MPS and I'll look into it
English
1
0
1
16
KeenZ😶‍🌫️
KeenZ😶‍🌫️@0xkeenz·
That's great... When I used PyTorch before, it felt really odd that the CPU was faster than MPS. Is this finally getting fixed? 😭
Isalia20@Is36E

This marks the end of my first week at @huggingface! I'm joining as a founding engineer on HF's PyTorch team. My first project: safetensors on Mac is up to 3x faster 🚀 Parallel reads straight into MPS unified memory, no CPU staging. MacBook Pro M5 Pro:
- Cold, 16 GB: 2.97 → 8.23 GB/s (2.8×)
- Warm, 3 GB: 10.3 → 26.6 GB/s (2.6×)

Chinese
2
0
4
620
Isalia20
Isalia20@Is36E·
This marks the end of my first week at @huggingface! I'm joining as a founding engineer on HF's PyTorch team. My first project: safetensors on Mac is up to 3x faster 🚀 Parallel reads straight into MPS unified memory, no CPU staging. MacBook Pro M5 Pro:
- Cold, 16 GB: 2.97 → 8.23 GB/s (2.8×)
- Warm, 3 GB: 10.3 → 26.6 GB/s (2.6×)
Isalia20 tweet media
English
6
7
154
12.3K
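A minimal sketch of what loading a checkpoint straight onto MPS looks like with the safetensors API; the file name and tensor sizes here are made up for illustration:

```python
import torch
from safetensors.torch import save_file, load_file

# Hypothetical file name, used only for illustration.
path = "model.safetensors"
device = "mps" if torch.backends.mps.is_available() else "cpu"

# Round trip: write a small state dict, then load it directly onto MPS.
# The device argument asks safetensors to materialize tensors on that
# device instead of loading to CPU and copying afterwards.
save_file({"weight": torch.randn(1024, 1024)}, path)
state_dict = load_file(path, device=device)
print(state_dict["weight"].device)  # mps:0 on Apple silicon
```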
Isalia20
Isalia20@Is36E·
@francoisfleuret Also worth mentioning that getting good GPU utilization/power usage with pipeline parallelism is quite tricky
English
0
0
1
369
François Fleuret
François Fleuret@francoisfleuret·
Nothing shockingly dumb?
François Fleuret tweet media
English
8
0
38
25K
Isalia20
Isalia20@Is36E·
Pesky bug killing performance on @PyTorch MPS. Can you spot it?
Isalia20 tweet media
English
0
0
2
223
Joseph Jojoe
Joseph Jojoe@josephjojoe·
I'm open-sourcing an MLX port of the ESM-2 protein model family so more people can tinker with AI & biology on Apple silicon! (1/5)
Joseph Jojoe tweet media
English
15
22
174
22.9K
Isalia20
Isalia20@Is36E·
@mohitwt_ I think fusing means not having an extra kernel launch. Doing GEMM -> inplace op -> GEMM isn't fusion: the first GEMM still writes its output back to global memory, and the inplace op still needs to read each element from global memory and write it back
English
0
0
2
63
mohit
mohit@mohitwt_·
Built a fused GELU CUDA kernel from scratch and plugged it into an MLP block (Linear -> GELU -> Linear).
Instead of: GEMM -> GELU -> GEMM (3 separate kernels, intermediate buffer written and read back from global memory)
Now: GEMM -> in-place GELU kernel -> GEMM (GELU applied directly on the Linear1 output buffer, no extra allocation)
Results:
- up to 2.9x speedup on smaller sizes
- 1.1 to 1.6x on larger ones
cuBLAS handles the matmuls; the custom kernel only targets what cuBLAS can't do, the elementwise activation sitting between two projections.
mohit tweet media
English
2
3
52
1.8K
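To make the distinction from the exchange above concrete, here is a rough sketch of the same MLP in plain PyTorch, with torch.compile shown as one way to get actual fusion of the elementwise activation rather than an extra (even in-place) kernel between the two GEMMs; whether the GELU ends up as a true GEMM epilogue depends on the backend and compiler settings:

```python
import torch
import torch.nn.functional as F

# Eager MLP: linear1 writes its full output to global memory, the GELU
# (even if done in place) reads and rewrites every element, and linear2
# reads it again — three kernels, two extra passes over the intermediate.
def mlp(x, w1, b1, w2, b2):
    h = F.linear(x, w1, b1)
    h = F.gelu(h)
    return F.linear(h, w2, b2)

# One way to get real fusion in stock PyTorch: let torch.compile/Inductor
# fuse the elementwise GELU with neighbouring ops (and, where supported,
# into the GEMM epilogue) instead of launching it as its own kernel.
mlp_compiled = torch.compile(mlp)

x = torch.randn(64, 1024)
w1, b1 = torch.randn(4096, 1024), torch.randn(4096)
w2, b2 = torch.randn(1024, 4096), torch.randn(1024)
out = mlp_compiled(x, w1, b1, w2, b2)  # shape (64, 1024)
```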
Isalia20
Isalia20@Is36E·
@mohitwt_ If you do augmentations on the data, this isn't as easy
English
0
0
0
125
mohit
mohit@mohitwt_·
you don't need the teacher model present during student training. run the teacher model separately across your entire dataset and save the output (logits) to disk, then load the raw logits during student training:
1. load teacher on GPU > run inference > save outputs > unload teacher
2. load student > load logits > apply softmax+temp > train
Raj Dabre@prajdabre

ML interview question: Suppose you are implementing Knowledge distillation, and you have a teacher and a student model. However you simply do not have the necessary GPU resources to fit both the teacher and the student into the GPU at the same time. What is your solution?

English
4
0
38
3.2K
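A rough sketch of that two-phase recipe, with hypothetical teacher/student/loader placeholders; note the caveat from the reply above that random augmentations break the assumption that the cached logits match the inputs the student actually sees:

```python
import torch
import torch.nn.functional as F

# Phase 1 (run once): teacher inference over the full dataset, logits to disk.
# `teacher` and `loader` are hypothetical placeholders for this sketch.
@torch.no_grad()
def dump_teacher_logits(teacher, loader, path="teacher_logits.pt"):
    teacher.eval()
    logits = [teacher(x) for x, _ in loader]
    torch.save(torch.cat(logits), path)

# Phase 2: student training against the cached logits, no teacher in memory.
# Assumes the student sees the *same* inputs the teacher saw; random
# augmentations would invalidate the cached logits (the point made above).
def distill_step(student, x, cached_logits, T=2.0):
    loss = F.kl_div(
        F.log_softmax(student(x) / T, dim=-1),
        F.softmax(cached_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    return loss
```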
Birchlabs
Birchlabs@Birchlabs·
pytorch 2.11 is now out! flex attention now has an FA4 backend. differentiable collectives. varlen attention supports sliding window. Metal 4, MPS operator expansion. inductor can now emit NVGEMM. and… LSTMs are back? github.com/pytorch/pytorc…
Birchlabs tweet media
English
1
3
14
963
Maaz
Maaz@mmaaz_98·
How hard is it to translate CUDA to MPS
English
6
0
15
4.7K
Isalia20
Isalia20@Is36E·
Hopefully soon distributed on @PyTorch MPS will come to life
Isalia20 tweet media
English
0
0
1
137
Isalia20
Isalia20@Is36E·
@_avichawla You can also do div_(255) to divide it inplace
English
0
0
0
57
Avi Chawla
Avi Chawla@_avichawla·
Here's a neural net optimization trick that leads to ~4x faster CPU to GPU transfers.

Imagine an image classification task. We define the network, load the data and transform it. In the training loop, we transfer the data to the GPU and train.

Here's the problem with this: if you look at the profiler, most of the time/resources will be allocated to the kernel (the actual training code), but a significant amount of time will also be dedicated to data transfer from CPU to GPU (this appears under cudaMemcpyAsync).

Reducing the data transfer is simple. Recall that the original dataset was composed of pixel values. These were 8-bit integers, and we transformed them to 32-bit floats. Next, we transferred these 32-bit floating-point tensors to the GPU. This meant that transforming the data led to 4x more data being transferred.

The solution is simple: move the transformation step after the data transfer, so that we transfer 8-bit integers instead of 32-bit floats. As a result, you will notice a significant drop in the data transfer step.

Of course, this technique doesn't apply to all neural network use cases, like NLP, where we inherently deal with 32-bit float embeddings. However, whenever I have identified any possibility to use this trick, I have experienced noticeable gains from it.

👉 Over to you: What other NN optimization techniques are you aware of?
Avi Chawla tweet media
English
10
17
138
17.1K
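A minimal sketch of the trick plus the in-place division from the reply above, assuming a hypothetical uint8 image batch: move the float conversion and /255 normalization to after the host-to-device copy so only 1 byte per pixel crosses the bus:

```python
import torch

# Hypothetical batch of 8-bit images (N, C, H, W), as they come off disk.
batch_u8 = torch.randint(0, 256, (64, 3, 224, 224), dtype=torch.uint8)
device = "cuda" if torch.cuda.is_available() else "cpu"

# Before: convert to float32 on the CPU, then ship 4x the bytes to the GPU.
slow = (batch_u8.float() / 255.0).to(device)

# After: ship 1 byte per pixel, then convert and normalize on the device.
# div_(255) divides in place, as suggested in the reply.
fast = batch_u8.to(device).float().div_(255)

assert torch.allclose(slow, fast)
```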
Isalia20
Isalia20@Is36E·
@_xjdr setting dim to 1025 to piss off kernel developers
Isalia20 tweet media
English
0
0
27
1.8K
xjdr
xjdr@_xjdr·
if your model dim isn't divisible by 128, what the actual fuck are you doing? I've seen this like 4 times in the last few days
English
19
4
275
35.5K
Isalia20
Isalia20@Is36E·
@SkyLi0n Why do I feel this post targets me?😅
English
0
0
0
7
Aaron Gokaslan
Aaron Gokaslan@SkyLi0n·
Not enough people know about the advantages of using log1p and expm1 when appropriate!
English
2
0
3
276
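A tiny example of why: in float32, 1 + x rounds away a very small x entirely, so the naive forms return 0 while log1p/expm1 keep the leading term:

```python
import torch

x = torch.tensor(1e-10, dtype=torch.float32)

# Naive forms: 1 + 1e-10 rounds to exactly 1 in float32, so the result is 0.
print(torch.log(1 + x))   # tensor(0.)
print(torch.exp(x) - 1)   # tensor(0.)

# Specialized forms keep the leading term for arguments near zero.
print(torch.log1p(x))     # ~1e-10
print(torch.expm1(x))     # ~1e-10
```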