c0de517e/AngeloPesce

34.1K posts

@kenpex

he/him 🌞: Senior Director, Simulation. 🌚: Photography & Creative coding. Blog: c0de517e. My opinions are not my own. I DON'T tweet about ROBLOX.

Vancouver & San Francisco · Joined June 2007

590 Following · 8.3K Followers

Pinned Tweet

c0de517e/AngeloPesce@kenpex·28 Mar

Why nothing great is ever good c0de517e.blogspot.com/2017/10/the-cu… - Over-engineering is the root of all evil c0de517e.blogspot.com/2016/10/over-e…

c0de517e/AngeloPesce@kenpex·17h

@yiningkarlli Practically all software is some sort of bytecode VM - you only have to look from the right vantage point :)

Yining Karl Li@yiningkarlli·17h

Here's a fun blog post about bytecode VMs in surprising places. Mildly hot take of mine: a bytecode VM in a GPU kernel is not as bad of an idea as one might think, and in some cases it can actually be the best solution. Fun examples in thread: (1/5) dubroy.com/blog/bytecode-…

c0de517e/AngeloPesce@kenpex·17h

@JustDeezGuy Radiant, you mean...

Paul Snively@JustDeezGuy·23h

Or, hell, use the brilliant Quake III source code.

Ian JCV@jnvcia

Everyone's too busy forcing UE5 to look like a 1998 engine. What they could do instead is just make 1998 engines, which were routinely created by teams of 1-2 people. Use OpenGL 1.2, people, it's still available, and your games will run at 5 trillion FPS while looking authentic.

c0de517e/AngeloPesce@kenpex·17h

@AgileJebrim @xlrndo ...matter because we're not in the business of writing loops to add numbers etc - it's not about microbenchmarks and speed of light. But - if you want to play these games etc... Ofc it's all silly and I imagine the "interview question" is just aimed at having a conversation.

c0de517e/AngeloPesce@kenpex·17h

@AgileJebrim @xlrndo No. But the same is true if you use all SMs, and even more if all SMs then have a ton of waves and keep trying to randomly flip between them to issue even more. There is a point where issuing more in parallel is good, and a point where it slows you down. In practice it doesn't

c0de517e/AngeloPesce@kenpex·18h

@AgileJebrim @xlrndo Also, speed of light (i.e. the theoretical max speed - here it should effectively equal the bandwidth posted for GDDR on a given GPU) is really hard to achieve - likely impossible on top of generic SPIR-V / Vulkan etc stuff. U might get lucky but in general it's hard.
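
A rough illustration of the "speed of light" bound in the tweet above: for a pass limited purely by memory traffic, the runtime floor is bytes moved divided by peak bandwidth. This is plain Python arithmetic, not GPU code. The 500 GB/s bandwidth and the 16 MiB buffer are placeholder assumptions, not figures from the thread.

```python
def speed_of_light_seconds(bytes_moved, peak_bandwidth_bytes_per_s):
    """Lower bound on runtime for a pass limited only by memory bandwidth."""
    return bytes_moved / peak_bandwidth_bytes_per_s


def fraction_of_speed_of_light(measured_seconds, bytes_moved, peak_bandwidth_bytes_per_s):
    """How close a measured run came to the bandwidth floor (1.0 means it hit the floor)."""
    floor = speed_of_light_seconds(bytes_moved, peak_bandwidth_bytes_per_s)
    return floor / measured_seconds


# Assumptions for illustration only: substitute the posted bandwidth of the GPU under test.
ASSUMED_PEAK_BW = 500e9            # bytes per second, hypothetical
BUFFER_BYTES = 16 * 1024 * 1024    # 16 MiB of input, hypothetical

floor_s = speed_of_light_seconds(BUFFER_BYTES, ASSUMED_PEAK_BW)
```

A measured time can then be compared against `floor_s`; a run that takes twice the floor is at 50% of speed of light under these assumptions.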

c0de517e/AngeloPesce@kenpex·17h

@Jiayin_Cao @maxliani Is this for secondary rays? Wouldn't these need path differentials anyways?

Jiayin Cao@Jiayin_Cao·23h

@maxliani Exactly what I did. I took a detour by first attempting to SIMD the whole shading language, only to realize that I had to introduce too many new concepts to deal with divergence. The AD solution is quite neat. It has its problems, but it should be good for my use

Jiayin Cao@Jiayin_Cao·1d

After years of thinking about it, I finally got a reasonable approximation for mipmap selection working in my renderer. The biggest challenge, by far, was supporting procedural UV generation in my shading language. I believe I have enough content to put down a blog post next.

c0de517e/AngeloPesce@kenpex·18h

@AgileJebrim @xlrndo Not really. If you want a fun experiment, you can make a testbed that tries various numbers for TG/WG sizes, and various numbers for loops within a single shader run - and different ways of fetching. You'll see that optimal configurations are not necessarily intuitive.
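
One way to lay out the testbed suggested above is to enumerate every combination of workgroup size and per-thread loop count, derive the dispatch size for each, and time them. The sketch below only builds and ranks the configurations. `time_dispatch` is a hypothetical callback standing in for whatever actually launches the shader and returns seconds; the different fetch patterns mentioned in the tweet are left out.

```python
from itertools import product


def sweep_configs(n_elements, wg_sizes, items_per_thread_options):
    """Yield (wg_size, items_per_thread, num_workgroups) for every combination."""
    for wg_size, items in product(wg_sizes, items_per_thread_options):
        per_group = wg_size * items
        num_workgroups = -(-n_elements // per_group)  # integer ceiling division
        yield wg_size, items, num_workgroups


def run_sweep(n_elements, wg_sizes, items_per_thread_options, time_dispatch):
    """Time each configuration with the caller's (hypothetical) dispatch function.

    Returns a list of (seconds, wg_size, items_per_thread, num_workgroups), fastest first.
    """
    results = []
    for wg_size, items, groups in sweep_configs(n_elements, wg_sizes, items_per_thread_options):
        seconds = time_dispatch(wg_size, items, groups)
        results.append((seconds, wg_size, items, groups))
    return sorted(results)
```

On real hardware the ranking would come from measured timings, repeated several times per configuration to smooth out noise.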

Jebrim@AgileJebrim·1d

The followup question to this, assuming you weren’t wise enough to ask for clarifying details first, would be how you would do this knowing that we had 16m 8-bit unsigned integers to sum up?

Jebrim@AgileJebrim

I’ve taken to using a simple interview question with folks. It’s surprising how much people seem to struggle with answering it. Tell me how you would efficiently sum a large array of numbers on a GPU into a single accumulated value. Don’t need to see code, just explain it conceptually in relation to the hardware.
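
For readers unfamiliar with the question, a common textbook answer is a two-level reduction: each workgroup tree-reduces its slice in on-chip memory, then the per-workgroup partial sums are reduced again. The sketch below is a CPU-side Python model of that idea, not GPU code and not necessarily the answer either participant has in mind. The workgroup size of 256 and the reading of "16m" as 2^24 elements are assumptions.

```python
def workgroup_tree_reduce(values):
    """Model one workgroup: halve the number of active lanes on every step."""
    scratch = list(values)  # stands in for shared / LDS memory
    size = 1
    while size < len(scratch):
        size *= 2
    scratch += [0] * (size - len(scratch))  # pad to a power of two with zeros
    stride = size // 2
    while stride > 0:
        for lane in range(stride):  # on a GPU these lanes run in parallel
            scratch[lane] += scratch[lane + stride]
        stride //= 2  # real hardware needs a barrier between steps
    return scratch[0]


def gpu_style_sum(data, wg_size=256):
    """Reduce per-workgroup slices, then reduce the partial sums until one value is left."""
    partials = [workgroup_tree_reduce(data[i:i + wg_size])
                for i in range(0, len(data), wg_size)]
    while len(partials) > 1:
        partials = [workgroup_tree_reduce(partials[i:i + wg_size])
                    for i in range(0, len(partials), wg_size)]
    return partials[0] if partials else 0


# Follow-up case: 2**24 unsigned 8-bit values (assumed reading of "16m").
WORST_CASE_TOTAL = (1 << 24) * 255
FITS_IN_U32 = WORST_CASE_TOTAL < (1 << 32)
```

Under that reading the worst-case total is 4,278,190,080, just below 2^32, so a single 32-bit unsigned accumulator would not overflow.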

c0de517e/AngeloPesce@kenpex·18h

@AgileJebrim @xlrndo So you (whatever "you" means - I don't know what's "we" there - is that a company?) know that the unrolling issue you hallucinated above is incorrect.

Jebrim@AgileJebrim·18h

@kenpex @xlrndo We perform a lot of tests analyzing what backend compilers generate for a given input.

c0de517e/AngeloPesce@kenpex·18h

@AgileJebrim @xlrndo The way I read it his idea is to use persistent or persistent-ish threads to keep accumulating. There's no problem w/register use, and definitely no sane driver will unroll so much to cause spills. Nobody can know what the driver compiler does but that's a very safe bet.
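
A minimal Python model of the persistent-threads reading described above, under the assumption that a fixed pool of threads each strides through the input and keeps one running sum. It is an illustration of the pattern, not the code that was posted in the thread. The point it shows is that each thread needs a single accumulator regardless of input size.

```python
def persistent_thread_sum(data, num_threads):
    """Each modelled thread walks the array with a stride of num_threads.

    Returns (total, per_thread_accumulators). On a GPU the outer loop over
    tid would be the parallel threads and acc would live in one register.
    """
    accumulators = [0] * num_threads
    for tid in range(num_threads):
        acc = 0  # the only per-thread state the loop needs
        for i in range(tid, len(data), num_threads):
            acc += data[i]
        accumulators[tid] = acc
    # A final small reduction combines the per-thread results.
    return sum(accumulators), accumulators
```

For example, `persistent_thread_sum(list(range(10)), 4)` gives per-thread sums of 12, 15, 8 and 10, which combine to 45.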

Jebrim@AgileJebrim·18h

His approach appears very hardware-specific and makes a lot of assumptions. He has no idea how many registers are available, how much the compiler will unroll it, or what the bandwidth of his hardware will be. He certainly won’t have any predictability in execution times as he’s given the warp scheduler a ton of work to do (as opposed to bouncing back and forth between the same two warps).

c0de517e/AngeloPesce@kenpex·18h

@AgileJebrim @xlrndo What company?

Jebrim@AgileJebrim·19h

An interview question to work on our tech stack for our company, obviously. AMD GPUs can subdivide evenly into the same construct with CUs that ultimately result in 1KB registers per lane existing there as well. In any case, we compile from our bytecode to our SPIR-V at runtime, not ahead of time, for our portable non-SC version. This means we have the potential to adapt accordingly if we ever needed to, all without changing the higher level bytecode. That said, there is no issue here with AMD. The real concern I’ve had with mobile GPUs is actually over whether they create a proper scratchpad vs just emulate it with an L1 cache.

c0de517e/AngeloPesce@kenpex·18h

@AgileJebrim @xlrndo Yes, if everything's deterministic, everything's deterministic! There's no solution to this anyone can imagine, where you'd introduce divergence in the code - a sum is a sum. I said something else: using ALL SMs for stuff like this is not usually the way U get speed of light.

Jebrim@AgileJebrim·18h

If they operate down identical code and data paths and successfully get evenly distributed across the hardware with no other resource sharing with other processes, then every run of the software will be deterministic and take extremely close to the same amount of time regardless of changing data.

c0de517e/AngeloPesce@kenpex·18h

@AgileJebrim @xlrndo You are assuming that threads all magically start in the same instant, which they do not. They can take the same time, but not end at the same time. Also, memory reads will always "jitter" as you say.

c0de517e/AngeloPesce@kenpex·19h

@AgileJebrim @xlrndo Lastly, what I said about how threads are scheduled has nothing to do with divergence.

c0de517e/AngeloPesce@kenpex·19h

@AgileJebrim @xlrndo You posed this as an interview question... Moreover even the math that you have 4 warps per SM etc is all contingent on NV-specific ideas of warp sizes. ZERO portability, we don't even need to go to mobile GPUs to show that this is wrong.

c0de517e/AngeloPesce@kenpex·19h

@AgileJebrim @xlrndo I'm not going to argue - as I've been falling in that trap with you in the past. What I said is factually correct. Feel free to use your time to learn about it.

Jebrim@AgileJebrim·19h

Work-stealing style algorithms, where you let everything take a varying amount of time, result in highly unpredictable execution times and an uneven distribution of work on hardware. Pipelining in general also suffers from a problem where you’re only as fast as the slowest bottleneck (making optimizing other things unhelpful) and you end up with significant idle hardware every time. Our approach of going fully data parallel across all hardware, syncing whilst flushing caches, then going fully data parallel again enabled much better predictability in performance. It also enables serial algorithms to be applied where cross-hardware read-after-writes can occur across sync points. Furthermore, maximizing available register space per lane lets you essentially work with data in the highest performance regime closest to the ALUs. You’re concerned about bandwidth, yes? The more you work out of registers and the less out of VRAM, the better your performance will be. This addition by itself may seem like a light workload, but one can start expanding it with a lot more complex workloads integrated into the same shader without needing a full rewrite. Lastly, some shader compilers will literally hoist those memory loads to start occurring as early as possible anyway, even if you didn’t personally code it that way.

c0de517e/AngeloPesce@kenpex·19h

@AgileJebrim @xlrndo None of that is generic and portable across a "wide range of GPUs". A special compiler. Our setup. WTF?

Jebrim@AgileJebrim·19h

There are 4 physical warps per SM. We assign 8 logical warps per SM with the goal of hyperthreading 2 per physical warp and a 1KB register file per lane. We have a special compiler that guarantees divergence never occurs. All work is designed to take an equal amount of cycles such that there are no meaningfully slower ones relative to other ones. This setup is also identical for the set of all possible shaders that can be implemented in our environment. At sync points, there may be a very slight amount of waiting that occurs, but it is pretty negligible. These are to be minimized by design.
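
The "1KB register file per lane" figure above can be reproduced as simple division, assuming a register file of 65,536 32-bit registers per SM and 32 lanes per warp. Both numbers are assumptions added here, not stated in the tweet. The second call uses a hypothetical 64-lane wave with the same register file, only to show how the budget depends on the warp width.

```python
def register_bytes_per_lane(regs_per_sm, bytes_per_reg, warps_per_sm, lanes_per_warp):
    """Register bytes available to each lane when the file is split evenly."""
    lanes = warps_per_sm * lanes_per_warp
    return (regs_per_sm * bytes_per_reg) // lanes


# Assumed for illustration: 65,536 registers of 4 bytes each per SM.
ASSUMED_REGS_PER_SM = 65536

# 8 logical warps of 32 lanes, as described in the tweet.
narrow_warp_budget = register_bytes_per_lane(ASSUMED_REGS_PER_SM, 4, warps_per_sm=8, lanes_per_warp=32)

# Hypothetical 64-lane waves with the same register file size.
wide_wave_budget = register_bytes_per_lane(ASSUMED_REGS_PER_SM, 4, warps_per_sm=8, lanes_per_warp=64)
```

Under these assumptions the first budget is 1024 bytes (256 registers) per lane and the second is 512 bytes, so the per-lane figure is tied to the warp width and register file size that were assumed.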

c0de517e/AngeloPesce@kenpex·19h

@AgileJebrim @xlrndo ...the idea that doing this "takes control away from the scheduling hardware" is wrong. It's not that because U divided work this way you ENSURE you get a perfect division per SM and even less that you get SMs to start at the same time (and thus end w/o waiting for "slower" ones)

c0de517e/AngeloPesce@kenpex·19h

@AgileJebrim @xlrndo AFAICT you're assuming that going wide first - i.e. saturating with at least some work per SM - and hopefully enough for multiple waves (which you pretty much ensure with 256-sized WGs - as no HW has 256-thread waves) will be optimal. It's not BAD but not optimal. Definitely...

c0de517e/AngeloPesce@kenpex·19h

@AgileJebrim @xlrndo You're not achieving any of that with what you posted.

Jebrim@AgileJebrim·19h

Saturate the bottleneck as early as possible. The alternative is that you risk creating bubbles, which involves needless stalls in between periods of saturation. My approach is also highly portable across a wide range of hardware and algorithms and is very simple to implement. I’m also taking control over scheduling away from the black box hardware through occupancy, critical for real-time guarantees.
