Ramsha Khan

645 posts

@__ramshaaa__

Learning & exploring ML systems & infra, RL

Mumbai · Joined December 2023
230 Following · 64 Followers
Harsh @weiiiisuiii
looking for AI/ML internship, here is my resume. Feedback on my profile/projects would be appreciated
Ramsha Khan @__ramshaaa__
With NVLink, training is faster than without it. If you're curious about NVLink, give this a read: intuitionlabs.ai/articles/nvidi… Quick note: I might take a break from this series for a few days, will pick it up again soon!
Ramsha Khan @__ramshaaa__
NVLink = direct GPU interconnect -> much higher bandwidth than PCIe. So communication paths actually matter a lot for distributed training. I was reading a Hugging Face article where they compared DDP performance with & without NVLink, and the difference is pretty clear: (2/n)
Ramsha Khan @__ramshaaa__
Day 14 of learning distributed training: Exploring GPU topology 👇 X means the GPU is referring to itself (so yeah, no communication with itself xD) When you see PHB, it means GPUs are connected via the PCIe Host Bridge (CPU), so there's no direct GPU <-> GPU link (1/n)
Quoting Ramsha Khan @__ramshaaa__:

Day 13 of learning distributed training: We've covered collective operations where multiple processes take part in communication. Now there’s this -> point-to-point communication (one-to-one) where you pass data from one specific process to another (not all processes). (1/n)

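Since the topology screenshot isn't shown here, a tiny sketch of the idea: parse a hand-written, hypothetical `nvidia-smi topo -m`-style matrix and read off the link type for each GPU pair. The matrix values and GPU count below are made up for illustration; real output also has CPU-affinity columns.

```python
# A hypothetical `nvidia-smi topo -m`-style matrix: X on the diagonal
# (GPU vs itself), NV# for NVLink pairs, PHB for PCIe Host Bridge paths.
SAMPLE_TOPO = """\
      GPU0  GPU1  GPU2  GPU3
GPU0  X     NV1   PHB   PHB
GPU1  NV1   X     PHB   PHB
GPU2  PHB   PHB   X     NV1
GPU3  PHB   PHB   NV1   X
"""

def classify_links(topo: str) -> dict:
    """Map (gpu_i, gpu_j) -> link type string for every pair with i != j."""
    lines = topo.strip().splitlines()
    headers = lines[0].split()          # column labels: GPU0, GPU1, ...
    links = {}
    for row in lines[1:]:
        cells = row.split()
        src = cells[0]
        for dst, link in zip(headers, cells[1:]):
            if src != dst:              # skip the X diagonal (self-reference)
                links[(src, dst)] = link
    return links
```

So `classify_links(SAMPLE_TOPO)[("GPU0", "GPU1")]` gives `"NV1"` (direct NVLink), while `("GPU0", "GPU2")` gives `"PHB"` (traffic goes through the CPU's PCIe host bridge).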
Ramsha Khan @__ramshaaa__
Both should NOT send at the same time: each will block on its send, waiting forever for the other to post a receive, which leads to deadlock. Also, don't modify the tensor before .wait() when using non-blocking functions.
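A toy way to see that deadlock, with no torch involved: simulate two ranks running blocking ops, where a send completes only when the peer is at a matching recv. The parity trick (even ranks send first, odd ranks receive first) is one standard way to avoid it; both function names below are made up for this sketch.

```python
def exchange_order(rank: int) -> list:
    """A deadlock-free op ordering for a pairwise exchange with blocking
    send/recv: even ranks send first, odd ranks receive first, so every
    blocking send has a matching receive posted on the other side."""
    return ["send", "recv"] if rank % 2 == 0 else ["recv", "send"]

def deadlocks(ops0: list, ops1: list) -> bool:
    """Simulate two ranks running blocking ops against each other. A send
    completes only when the peer is currently at a recv (and vice versa);
    if both ranks sit at the same op, neither can ever progress."""
    i = j = 0
    while i < len(ops0) and j < len(ops1):
        if {ops0[i], ops1[j]} == {"send", "recv"}:
            i, j = i + 1, j + 1   # matched pair: both ops complete
        else:
            return True           # both sending (or both receiving): stuck
    return False
```

`deadlocks(["send", "recv"], ["send", "recv"])` is `True` (both send first and wait forever), while `deadlocks(exchange_order(0), exchange_order(1))` is `False`.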
Ramsha Khan @__ramshaaa__
Things NOT to do! Be careful while setting src & dst and the operation: if process0 sends to process1, then process1 should receive from process0 (3/n)
Ramsha Khan @__ramshaaa__
Send the tensor between processes using send() and recv(). There's also isend() and irecv() - non-blocking functions -> the transfer can happen in the background while other computation/work runs simultaneously (2/n)
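A runnable sketch of the pattern using only the standard library: a Pipe between two threads stands in for two ranks. In torch.distributed the calls would be dist.send/dist.recv, and the futures here play the role of isend/irecv, with .result() acting like .wait(). All names are illustrative, not a real distributed API.

```python
from concurrent.futures import ThreadPoolExecutor
from multiprocessing import Pipe

def worker(rank, conn):
    """One 'rank' of a pairwise exchange: rank 0 sends a payload and waits
    for a reply; rank 1 receives the payload and sends back its sum."""
    if rank == 0:
        conn.send([1.0, 2.0, 3.0])   # like dist.send(tensor, dst=1)
        return conn.recv()           # like dist.recv(tensor, src=1)
    payload = conn.recv()            # like dist.recv(tensor, src=0)
    conn.send(sum(payload))          # like dist.send(tensor, dst=0)
    return payload

def run_pair():
    a, b = Pipe()                    # duplex channel between the two "ranks"
    with ThreadPoolExecutor(max_workers=2) as pool:
        f0 = pool.submit(worker, 0, a)   # submit() returns a future: other
        f1 = pool.submit(worker, 1, b)   # work can run before we block on it,
        return f0.result(), f1.result()  # like isend/irecv + .wait()
```

Here `run_pair()` returns `(6.0, [1.0, 2.0, 3.0])`: rank 0 gets back the sum, rank 1 gets the payload.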
Ramsha Khan @__ramshaaa__
Day 13 of learning distributed training: We've covered collective operations where multiple processes take part in communication. Now there’s this -> point-to-point communication (one-to-one) where you pass data from one specific process to another (not all processes). (1/n)
Quoting Ramsha Khan @__ramshaaa__:

Day 12 of learning distributed training: We saw a linear relationship between workers and the steps needed for communication, so there was an assumption that latency doesn’t matter much. But in large distributed systems, that assumption breaks as latency is not negligible. (1/n)

Saad @sodakeyEatsMush
@__ramshaaa__ Thanks!! Also btw your posts on distributed training are pretty cool!!
Saad @sodakeyEatsMush
Wrote some tests for my utf8 implementation for the text editor. Btw this is like the first time I have written tests, so it was nice to learn a new thing. Now that all the tests have passed I can move forward with the other stuff.
Ramsha Khan @__ramshaaa__
A leaf that was idle in Tree A now becomes active in Tree B. So even if some GPUs are idle in one tree, they are active in the other.
Ramsha Khan @__ramshaaa__
Then, to fix the utilization issue, Double Binary Trees were introduced. It’s a sweet spot between the previous two approaches. There are two trees: a root in Tree A acts as a leaf in Tree B. (3/n)
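The two-tree idea can be made concrete with a hand-built pair of trees over 7 ranks, stored as child -> parent maps. This is illustrative only, not NCCL's actual construction: the point is just that every interior node of Tree A is a leaf of Tree B and vice versa, so no GPU sits idle in both trees.

```python
# Hand-built double binary trees over ranks 0..6, as child -> parent maps
# (the root maps to None). Ranks and tree shapes are illustrative only.
TREE_A = {3: None, 1: 3, 5: 3, 0: 1, 2: 1, 4: 5, 6: 5}
TREE_B = {2: None, 0: 2, 4: 2, 1: 0, 6: 0, 3: 4, 5: 4}

def interior_nodes(tree: dict) -> set:
    """Interior nodes = every rank that appears as some child's parent."""
    return {p for p in tree.values() if p is not None}

def leaves(tree: dict) -> set:
    """Leaves = ranks that are never anyone's parent."""
    return set(tree) - interior_nodes(tree)
```

Here `interior_nodes(TREE_A)` is `{1, 3, 5}` and `leaves(TREE_A)` is `{0, 2, 4, 6}`, while Tree B flips the roles: its interior nodes `{0, 2, 4}` are all leaves of Tree A.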
Ramsha Khan @__ramshaaa__
Day 12 of learning distributed training: We saw a linear relationship between workers and the steps needed for communication, so there was an assumption that latency doesn’t matter much. But in large distributed systems, that assumption breaks as latency is not negligible. (1/n)
Quoting Ramsha Khan @__ramshaaa__:

Day 11 of learning distributed training: Let's keep going with collective ops by zooming into All-Reduce and what's happening behind the scenes. So I found a couple of ways to do naive all-reduce: (1/n)

Saad @sodakeyEatsMush
@__ramshaaa__ If you’ve seen neovim or emacs, I’m working on smth similar to that. Basically my own TUI text editor. Tho it’s more of a passion project which I work on during my free time, so it will take a bit of time to complete.
pdawg @prathamgrv
I made a Claude Code skill that turns any arxiv paper into working code. Every line traces back to the paper section it came from, & any implementation detail the paper skips is flagged, not assumed. open sourcing it - github.com/PrathamLearnsT…
srija @srijatwt
got a mail I never thought I'd receive :) would love to connect with fellow @MATSprogram scholars, looking forward to this summer!
Ramsha Khan @__ramshaaa__
all good so far… but yeah, nothing's perfect: ring also comes with the downside of latency scaling linearly with the number of processes/workers. we'll see in the next thread what comes up to solve this.
Ramsha Khan @__ramshaaa__
so each phase takes (N − 1) steps, and since there are 2 phases, total steps = 2 × (N − 1) (N: number of GPUs). in every step, each GPU sends/receives a chunk of size D/N, so the overall data moved per GPU becomes 2 × (N − 1) × (D / N), which is roughly ~2x the data size
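That arithmetic as a tiny helper (the function name is mine, not a real API):

```python
def ring_all_reduce_cost(n_gpus: int, data_size: float):
    """Cost model for ring all-reduce: a reduce-scatter phase plus an
    all-gather phase, each taking N - 1 steps, with each GPU moving a
    D / N sized chunk per step."""
    steps = 2 * (n_gpus - 1)                      # 2 phases x (N - 1) steps
    moved_per_gpu = steps * (data_size / n_gpus)  # 2(N-1) x D/N, ~2D for large N
    return steps, moved_per_gpu
```

For 4 GPUs and 1.0 GB of gradients this gives 6 steps and 1.5 GB moved per GPU, approaching 2 × D as N grows.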
Ramsha Khan @__ramshaaa__
Day 11 of learning distributed training: Let's keep going with collective ops by zooming into All-Reduce and what's happening behind the scenes. So I found a couple of ways to do naive all-reduce: (1/n)
Quoting Ramsha Khan @__ramshaaa__:

Day 10 of learning distributed training: How do processes talk to each other? We make use of collective operations and here's an example of using one of them! Processes need to share data with each other so let’s make these processes communicate using all_gather.

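The all_gather idea from Day 10, and a naive all-reduce built on top of it as in Day 11, can be modeled in plain Python, with lists standing in for per-rank tensors (in torch this would be dist.all_gather followed by a local sum):

```python
def all_gather(per_rank_values: list) -> list:
    """Naive all_gather: every rank ends up with a copy of every rank's
    value (the i-th outer list is what rank i holds afterwards)."""
    return [list(per_rank_values) for _ in per_rank_values]

def naive_all_reduce(per_rank_values: list) -> list:
    """Naive all-reduce via all_gather: gather everything everywhere, then
    each rank reduces (sums) locally. Correct, but it moves far more data
    than chunked schemes like ring all-reduce."""
    return [sum(held) for held in all_gather(per_rank_values)]
```

With gradients `[1, 2, 3, 4]` on four ranks, `naive_all_reduce` leaves every rank holding the same reduced value, `[10, 10, 10, 10]`.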