c0de517e/AngeloPesce

34.1K posts

@kenpex

he/him 🌞: Senior Director, Simulation. 🌚: Photography & Creative coding. Blog: c0de517e. My opinions are not my own. I DON'T tweet about ROBLOX.

Vancouver & San Francisco · Joined June 2007
590 Following · 8.3K Followers
c0de517e/AngeloPesce
@yiningkarlli Practically all software is some sort of bytecode VM - you only have to look from the right vantage point :)
Yining Karl Li @yiningkarlli
Here's a fun blog post about bytecode VMs in surprising places. Mildly hot take of mine: a bytecode VM in a GPU kernel is not as bad of an idea as one might think, and in some cases it can actually be the best solution. Fun examples in thread: (1/5) dubroy.com/blog/bytecode-…
c0de517e/AngeloPesce
@AgileJebrim @xlrndo ...matter because we're not in the business of writing loops to add numbers etc - it's not about microbenchmarks and speed of light. But - if you want to play these games etc... Ofc it's all silly and I imagine the "interview question" is just aimed at having a conversation.
c0de517e/AngeloPesce
@AgileJebrim @xlrndo No. But the same is true if you use all SMs, and even more if all SMs then have a ton of waves and keep trying to randomly flip between them to issue even more. There is a point where issuing more in parallel is good, and a point where it slows you down. In practice it doesn't
c0de517e/AngeloPesce
@AgileJebrim @xlrndo Also, speed of light (i.e. the theoretical max speed - here it should effectively equal the bandwidth posted for GDDR on a given GPU) is really hard to achieve - likely impossible on top of generic spir-v / vulkan etc stuff. U might get lucky but in general it's hard.
Jiayin Cao @Jiayin_Cao
@maxliani Exactly what I did. I took a detour by first attempting to SIMD the whole shading language, only to realize that I had to introduce too many new concepts to deal with divergence. The AD solution is quite neat. It has its problems, but it should be good for my use
Jiayin Cao @Jiayin_Cao
After years of thinking about it, I finally got a reasonable approximation for mipmap selection working in my renderer. The biggest challenge, by far, was supporting procedural UV generation in my shading language. I believe I have enough content to put down a blog post next.
c0de517e/AngeloPesce
@AgileJebrim @xlrndo Not really. If you want a fun experiment, you can make a testbed that tries various numbers for TG/WG sizes, and various numbers for loops within a single shader run - and different ways of fetching. You'll see that optimal configurations are not necessarily intuitive.
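The shape of such a testbed can be sketched on the CPU. This is only a stand-in under stated assumptions: a real version would dispatch a compute shader and time it on the GPU, whereas here `run_config` and `sweep` are hypothetical helper names, the "threads" run sequentially, and the timings say nothing about actual GPU behaviour. It only shows how the two knobs (workgroup size and loop count per thread) span a grid of configurations that all have to produce the same sum.

```python
import itertools
import time

def run_config(data, wg_size, items_per_thread):
    """Model one dispatch: each thread loops over `items_per_thread` items."""
    num_threads = -(-len(data) // items_per_thread)   # ceiling division
    num_workgroups = -(-num_threads // wg_size)
    total = 0
    for wg in range(num_workgroups):
        for local in range(wg_size):
            tid = wg * wg_size + local
            start = tid * items_per_thread
            # Out-of-range threads see an empty slice and contribute 0.
            total += sum(data[start:start + items_per_thread])
    return num_workgroups, total

def sweep(data, wg_sizes, loop_counts):
    """Try every (workgroup size, loop count) pair and record the outcome."""
    results = []
    for wg_size, loops in itertools.product(wg_sizes, loop_counts):
        t0 = time.perf_counter()
        num_workgroups, total = run_config(data, wg_size, loops)
        elapsed = time.perf_counter() - t0
        results.append((wg_size, loops, num_workgroups, elapsed, total))
    return results

if __name__ == "__main__":
    data = list(range(100_000))
    for row in sweep(data, [32, 64, 128, 256], [1, 4, 16, 64]):
        print(row)
```

On real hardware the interesting column is the timing, which is exactly the part this sketch cannot reproduce.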
Jebrim @AgileJebrim
The followup question to this, assuming you weren’t wise enough to ask for clarifying details first, would be how you would do this knowing that we had 16m 8-bit unsigned integers to sum up?
Jebrim@AgileJebrim

I’ve taken to using a simple interview question with folks. It’s surprising how much people seem to struggle with answering it. Tell me how you would efficiently sum a large array of numbers on a GPU into a single accumulated value. Don’t need to see code, just explain it conceptually in relation to the hardware.
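One common textbook answer to the question is a two-level tree reduction: each workgroup reduces its chunk to a partial sum, then the partials are reduced again. The sketch below models that in plain Python, with lanes executed sequentially. The function names are hypothetical, and it is not a claim about what answer the interviewer expects. For the 8-bit follow-up it also checks the accumulator width, reading "16m" as 2^24 elements (an assumption): the worst case is 2^24 × 255 = 2^32 − 2^24, which still fits in an unsigned 32-bit accumulator.

```python
import random

def workgroup_sum(values):
    """Tree-reduce one chunk: halve the number of active lanes each step."""
    vals = list(values)
    size = 1
    while size < len(vals):
        size *= 2
    vals += [0] * (size - len(vals))      # pad to a power of two
    stride = size // 2
    while stride > 0:
        for lane in range(stride):        # on a GPU these lanes run in parallel
            vals[lane] += vals[lane + stride]
        stride //= 2                      # a barrier would sit here on a GPU
    return vals[0]

def gpu_style_sum(data, workgroup_size=256):
    """Reduce per-workgroup partial sums repeatedly until one value remains."""
    partials = [workgroup_sum(data[i:i + workgroup_size])
                for i in range(0, len(data), workgroup_size)]
    while len(partials) > 1:
        partials = [workgroup_sum(partials[i:i + workgroup_size])
                    for i in range(0, len(partials), workgroup_size)]
    return partials[0] if partials else 0

# Worst case for 2**24 unsigned 8-bit values (assumed reading of "16m").
WORST_CASE_SUM = (1 << 24) * 255

if __name__ == "__main__":
    random.seed(0)
    data = [random.randrange(256) for _ in range(10_000)]
    print(gpu_style_sum(data) == sum(data))
    print(WORST_CASE_SUM < 2**32)
```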

c0de517e/AngeloPesce
@AgileJebrim @xlrndo So you (whatever "you" means - I don't know what's "we" there - is that a company?) know that the unrolling issue you hallucinated above is incorrect.
Jebrim @AgileJebrim
@kenpex @xlrndo We perform a lot of tests analyzing what backend compilers generate for a given input.
c0de517e/AngeloPesce
@AgileJebrim @xlrndo The way I read it his idea is to use persistent or persistent-ish threads to keep accumulating. There's no problem w/register use, and definitely no sane driver will unroll so much to cause spills. Nobody can know what the driver compiler does but that's a very safe bet.
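A generic persistent-thread (grid-stride) accumulation looks roughly like the sketch below. It is not the code under discussion, which does not appear in the thread, and `persistent_sum` and `num_threads` are hypothetical names. The point it illustrates is that each thread carries a single running accumulator however long the array is, so register pressure does not grow with the loop trip count unless a compiler chooses to unroll aggressively.

```python
def persistent_sum(data, num_threads=1024):
    """Fixed pool of threads, each striding through the whole array."""
    partials = [0] * num_threads
    for tid in range(num_threads):        # on a GPU these run concurrently
        acc = 0                           # the only live accumulator per thread
        for i in range(tid, len(data), num_threads):
            acc += data[i]                # consecutive threads read adjacent items
        partials[tid] = acc
    # A final small reduction combines the per-thread partial sums.
    return sum(partials)

if __name__ == "__main__":
    data = [i % 256 for i in range(1_000_000)]
    print(persistent_sum(data) == sum(data))
```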
Jebrim @AgileJebrim
His approach appears very hardware-specific and makes a lot of assumptions. He has no idea how many registers are available, how much the compiler will unroll it, or what the bandwidth of his hardware will be. He certainly won’t have any predictability in execution times as he’s given the warp scheduler a ton of work to do (as opposed to bouncing back and forth between the same two warps).
Jebrim @AgileJebrim
An interview question to work on our tech stack for our company, obviously. AMD GPUs can subdivide evenly into the same construct with CUs that ultimately result in 1KB registers per lane existing there as well. In any case, we compile from our bytecode to our SPIR-V at runtime, not ahead of time, for our portable non-SC version. This means we have the potential to adapt accordingly if we ever needed to, all without changing the higher level bytecode. That said, there is no issue here with AMD. The real concern I’ve had with mobile GPUs is actually over whether they create a proper scratchpad vs just emulate it with an L1 cache.
c0de517e/AngeloPesce
@AgileJebrim @xlrndo Yes, if everything's deterministic, everything's deterministic! There's no solution to this anyone can imagine, where you'd introduce divergence in the code - a sum is a sum. I said something else: using ALL SMs for stuff like this is not usually the way U get speed of light.
Jebrim @AgileJebrim
If they operate down identical code and data paths and successfully get evenly distributed across the hardware with no other resource sharing with other processes, then every run of the software will be deterministic and take extremely close to the same amount of time regardless of changing data.
c0de517e/AngeloPesce
@AgileJebrim @xlrndo You are assuming that threads all magically start in the same instant, which they do not. They can take the same time, but not end at the same time. Also, memory reads will always "jitter" as you say.
c0de517e/AngeloPesce
@AgileJebrim @xlrndo You posed this as an interview question... Moreover even the math that you have 4 warps per SM etc is all contingent on NV-specific ideas of warp sizes. ZERO portability, we don't even need to go to mobile GPUs to show that this is wrong.
c0de517e/AngeloPesce
@AgileJebrim @xlrndo I'm not going to argue - as I've been falling in that trap with you in the past. What I said is factually correct. Feel free to use your time to learn about it.
Jebrim @AgileJebrim
Work-stealing style algorithms, where you let everything take a varying amount of time, result in highly unpredictable execution times and an uneven distribution of work on hardware. Pipelining in general also suffers from a problem where you’re only as fast as the slowest bottleneck (making optimizing other things unhelpful) and you end up with significant idle hardware every time. Our approach of going fully data parallel across all hardware, syncing whilst flushing caches, then going fully data parallel again enables much better predictability in performance. It also enables serial algorithms to be applied where cross-hardware read-after-writes can occur across sync points. Furthermore, maximizing available register space per lane lets you essentially work with data in the highest performance regime closest to the ALUs. You’re concerned about bandwidth, yes? The more you work out of registers and less out of VRAM, the better your performance will be. This addition by itself may seem like a light workload, but one can start expanding it with a lot more complex workloads integrated into the same shader without needing a full rewrite. Lastly, some shader compilers will literally hoist those memory loads to start occurring as early as possible anyways, even if you didn’t personally code it that way.
Jebrim @AgileJebrim
There are 4 physical warps per SM. We assign 8 logical warps per SM with the goal of hyperthreading 2 per physical warp and a 1KB register file per lane. We have a special compiler that guarantees divergence never occurs. All work is designed to take an equal amount of cycles such that there are no meaningfully slower ones relative to other ones. This setup is also identical for the set of all possible shaders that can be implemented in our environment. At sync points, there may be a very slight amount of waiting that occurs, but it is pretty negligible. These are to be minimized by design.
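The numbers in this tweet can be checked with a few lines of arithmetic. The only figure not taken from the tweet is the warp width of 32 lanes, which is an NVIDIA-specific assumption and does not hold on other vendors' hardware.

```python
# Figures stated in the tweet.
PHYSICAL_WARPS_PER_SM = 4
LOGICAL_WARPS_PER_SM = 8
REGISTER_BYTES_PER_LANE = 1024        # "1KB register file per lane"

# Assumption (not in the tweet): NVIDIA-style 32-lane warps.
LANES_PER_WARP = 32

logical_per_physical = LOGICAL_WARPS_PER_SM // PHYSICAL_WARPS_PER_SM
lanes_per_sm = LOGICAL_WARPS_PER_SM * LANES_PER_WARP
register_file_bytes = lanes_per_sm * REGISTER_BYTES_PER_LANE
regs32_per_lane = REGISTER_BYTES_PER_LANE // 4    # 32-bit registers

if __name__ == "__main__":
    print("logical warps per physical warp:", logical_per_physical)
    print("resident lanes per SM:", lanes_per_sm)
    print("implied register file per SM (KiB):", register_file_bytes // 1024)
    print("32-bit registers per lane:", regs32_per_lane)
```

Under that assumption the setup implies 256 resident lanes and a 256 KiB register file per SM, with each lane holding the equivalent of 256 32-bit registers.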
c0de517e/AngeloPesce
@AgileJebrim @xlrndo ...the idea that doing this "takes control away from the scheduling hardware" is wrong. It's not that because U divided work this way you ENSURE you get a perfect division per SM and even less that you get SMs to start at the same time (and thus end w/o waiting for "slower" ones)
c0de517e/AngeloPesce
@AgileJebrim @xlrndo AFAICT you're assuming that going wide first - i.e. saturating with at least some work per SM - and hopefully enough for multiple waves (which you pretty much ensure with 256-sized WGs - as no HW has 256-thread waves) will be optimal. It's not BAD but not optimal. Definitely...
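The remark about 256-sized workgroups is a ceiling division: a workgroup always splits into more than one wave when the wave is narrower than the workgroup. A small sketch, using the 32- and 64-wide waves found on current desktop GPUs as example widths (the helper name is hypothetical):

```python
def waves_per_workgroup(workgroup_size, wave_size):
    """Number of hardware waves needed to cover one workgroup."""
    return -(-workgroup_size // wave_size)   # ceiling division

if __name__ == "__main__":
    for wave_size in (32, 64):
        print(wave_size, waves_per_workgroup(256, wave_size))
```

With 256 threads per workgroup this gives 8 waves at width 32 and 4 waves at width 64, so every workgroup supplies several waves to switch between.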
Jebrim @AgileJebrim
Saturate the bottleneck as early as possible. The alternative is that you risk creating bubbles, which involves needless stalls in between periods of saturation. My approach is also highly portable across a wide range of hardware and algorithms and is very simple to implement. I’m also taking control over scheduling away from the black box hardware through occupancy, critical for real-time guarantees.