Locuza

4.9K posts

@Locuza_

Content archive: https://t.co/XOUB5IcvCc

Joined February 2018
72 Following · 4K Followers
Locuza
Locuza@Locuza_·
@NOTimothyLottes @JBrooksBSI @bmcnett @SebAaltonen The client-focused variants of Blackwell still logically use a single L2 cache: 128 MB in GB202 (96 MB active on the 5090). x.com/Kurnalsalts/st… It's the AI/datacenter variant of Blackwell that uses a dual-die design with 10 TB/s D2D bandwidth and a local/remote L2.
Kurnal@Kurnalsalts

GB202 Dieshot/5090 Dieshot Thanks By @ASUS Tony 俞元麟 by Chip @万扯淡 by Dieshot @Kurnalsalts Layout Photo1 GB202 Dieshot Photo2 AD102 vs GB202 full Pixel Photo pls join in Kurnal’s Telegram team t.me/+DjmQ-kcsAXIyM…

English
0
0
3
186
NOTimothyLottes
NOTimothyLottes@NOTimothyLottes·
@JBrooksBSI @bmcnett @SebAaltonen Wasn't the 5090 already using dual die with chip2chip interconnect? They have NUMA to L2 (local and remote). But otherwise standard practice. Won't necessarily get radically different designs due to fan-out ...
English
2
0
3
300
Sebastian Aaltonen
Sebastian Aaltonen@SebAaltonen·
When you design data structures, always think in cache lines (64B or 128B). You don't want to have tiny nodes scattered around the memory. Often it's better to have wider nodes (preferably 1 cache line each) and shallower structures. Less pointer/offset indirections.
English
11
70
859
46.3K
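A minimal sketch of the cache-line point above, assuming a 64-byte line and 32-bit keys and child offsets. `WideNode` and `levels_needed` are hypothetical names for illustration and do not come from the post. The struct fills exactly one line, and the helper counts how many nodes (roughly one line fetch each) a lookup visits for narrow versus wide nodes.

```python
import ctypes

CACHE_LINE = 64  # bytes; assumed line size (the post mentions 64B or 128B)


class WideNode(ctypes.Structure):
    """Hypothetical search-tree node sized to fill one 64-byte cache line."""
    _fields_ = [
        ("count", ctypes.c_uint32),         # number of keys in use
        ("keys", ctypes.c_uint32 * 7),      # up to 7 sorted keys
        ("children", ctypes.c_uint32 * 8),  # child offsets instead of pointers
    ]


def levels_needed(num_keys, keys_per_node):
    """Levels of a full tree needed to hold num_keys keys."""
    fanout = keys_per_node + 1
    levels, capacity = 0, 0
    while capacity < num_keys:
        # Adding a level multiplies capacity by the fanout, plus one more node's keys.
        capacity = capacity * fanout + keys_per_node
        levels += 1
    return levels


if __name__ == "__main__":
    print("node size:", ctypes.sizeof(WideNode), "bytes")
    print("binary nodes:", levels_needed(1_000_000, 1), "levels")
    print("wide nodes:", levels_needed(1_000_000, 7), "levels")
```

Each level visited is a dependent memory access, so under these assumptions the shallower wide-node tree touches fewer cache lines per lookup.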
Alovon
Alovon@IanMcCabe_1·
@Darth_Goldsmith @Kurnalsalts Yeah, the point is Orin has it and Switch 2 doesn't. So building a bench for Switch 2 on a part that has lockstep, without making absolutely sure you accounted for the lockstep behavior, makes the data notably more inaccurate than usual.
English
2
0
1
156
Kurnal
Kurnal@Kurnalsalts·
The world's first Nintendo Switch 2 die shot: Samsung 8N, 8-core A78C with shared 4 MB L2, 1536 CUDA cores / 6 TPC Ampere GPU. A detailed process and chip analysis report will be released on YouTube and Bili at 9:30 pm tomorrow. High-resolution photos in Telegram group: t.me/+DjmQ-kcsAXIyM…
English
62
347
1.9K
596.2K
Locuza
Locuza@Locuza_·
@IanMcCabe_1 @CentroLeaks At the same clock speed, L2 access latency was 13% higher on N21, with 8x L1$ interfacing 16x L2$ slices (4M), than on the Steam Deck, with 1x L1$ interfacing 4x L2$ slices (1M). Considering this, what would the latency increase likely be going from 1 GPC x 12 SM to 2 GPC x 8 SM?
English
0
0
0
226
Alovon
Alovon@IanMcCabe_1·
@Locuza_ @CentroLeaks The point is latency. T239 has 1 GPC with 12 SMs, with that 1MB of L2 going into the single GPC. The 2050M has only 8 SMs per GPC, with the 2MB having to be divided between the GPCs. That difference notably affects latency and effective cache bandwidth.
English
2
0
1
326
Centro LEAKS
Centro LEAKS@CentroLeaks·
Switch 2 performance: A simulated benchmark performed on a PC with the closest specs to Switch 2 shows that the GPU in docked mode is pretty good, similar to a GTX 1050 Ti. In portable mode, it's on par with PS4. However, the CPU is pretty weak, considerably weaker than the Steam Deck's.
English
96
174
2.7K
406.2K
Locuza
Locuza@Locuza_·
@IanMcCabe_1 @CentroLeaks It's not dramatically different. The 128KB L1 Caches are private per SM, they don't add up for a single workload on a GPC level. Furthermore, the L2 is globally shared by all SMs, it's not split by GPC count.
English
1
0
1
280
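The distinction in the reply above can be put in numbers. This is a small sketch using only figures from the thread (128KB of L1 per SM, 1 GPC x 12 SMs for T239, 2 GPCs x 8 SMs for the 2050M); `l1_view` is a hypothetical helper name. It separates the aggregate L1 on the chip from what a single SM can actually reach.

```python
L1_PER_SM_KIB = 128  # per-SM L1 size as stated in the thread


def l1_view(gpcs, sms_per_gpc):
    """Aggregate L1 across the GPU versus the L1 one SM can use."""
    total_sms = gpcs * sms_per_gpc
    return {
        "aggregate_kib": total_sms * L1_PER_SM_KIB,  # sum over all SMs
        "visible_per_sm_kib": L1_PER_SM_KIB,         # private, never pooled
    }


if __name__ == "__main__":
    print("T239 :", l1_view(gpcs=1, sms_per_gpc=12))
    print("2050M:", l1_view(gpcs=2, sms_per_gpc=8))
```

The aggregate figures match the 1.5MB and 2MB quoted in the thread, but the L1 visible to any one SM is the same 128KB in both configurations.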
Alovon
Alovon@IanMcCabe_1·
@CentroLeaks Also the GPU used is the 2050M which has a dramatically different memory system than T239 (As in worse). T239 has 1.5MB of L1 on a single cluster versus 2MB split on two GPCs on 2050M
English
1
0
5
605
Locuza
Locuza@Locuza_·
@LeoWaldock @highyieldYT Indeed, smaller-scope process and timing improvements are invisible on such die shots, as the structural changes are too tiny to see. You also won't see any changes on the base layer between "14"nm Zen 1 and "12"nm Zen 1+ dies, while the latter did improve FMAX by ~5%.
English
0
0
4
124
High Yield
High Yield@highyieldYT·
Soon™ 🏹🌊👀
English
8
2
170
8.1K
Locuza
Locuza@Locuza_·
@OneRaichu @LeoWaldock @highyieldYT What do you mean, I'm almost always around... Though, obviously, I haven't been posting much, if anything, for quite some time, and that will likely not change.
English
0
0
4
213
Locuza
Locuza@Locuza_·
@LeoWaldock @highyieldYT I could imagine that the performance profile may look quite a bit different vs. Alder Lake CPUs, but afaik at least APO was later also enabled for 12th Gen. Anyhow, almost all benefits vs. Golden Cove should come from outside the core architecture itself (excluding the larger L2).
English
3
0
7
758
Locuza
Locuza@Locuza_·
@LeoWaldock @highyieldYT On the base layer a few layout changes can be spotted, but it looks mostly the same, excluding the enlarged L2. In SPEC2017 most gains come from increased clock frequencies, coupled with slightly faster DDR5 and larger L2 caches. Depending on the workload and if APO was in use...
English
2
0
17
774
Locuza
Locuza@Locuza_·
@carygolomb @PinaJr That you'll have to ask someone else. 😅 Even before GCN, AMD had that 1:4 ratio; at one point workload analysis apparently showed that this was a good ratio and worth the cost. For RB+ they decided that 1:2 was the better compromise. One could also check Intel/NV ratios.
English
0
0
6
237
Cary Golomb
Cary Golomb@carygolomb·
@Locuza_ @PinaJr Sorry I phrased that question wrong. What are they doing with all those Z-operations? I've played around with force enabling VRS 2x2 on RDNA2 in limited bandwidth scenarios and saw zero gains with same settings.
English
2
0
0
255
Cary Golomb
Cary Golomb@carygolomb·
Another thing people generally don't understand is how pushing clocks higher has a hockey-stick-like response in terms of power required. As a result, even though the Xbox Series X has around 15% more GPU performance than the PS5, the PS5 uses ~12% *more* power.
English
2
3
29
5K
Locuza
Locuza@Locuza_·
@carygolomb @PinaJr An earlier project start, and settling on the HW design earlier than Microsoft. In that regard I wouldn't describe it as over-engineered; Sony started with an older pool of IP options than MS. Based on the GFX IP versions listed for the initial PS5 HW, this all started even before PC RDNA1.
English
1
0
2
261
Cary Golomb
Cary Golomb@carygolomb·
@Locuza_ @PinaJr Any guess why Sony over-engineered the dROPs? Considering XSX has HW VRS, I suppose then it's just a matter of game devs *using* it. But considering that VRS can extend to cROPs in terms of grouping, the 20% raw increase on PS5 could be a wash when 20-30% of a frame is 2x2
English
1
0
0
249
Locuza
Locuza@Locuza_·
@carygolomb @PinaJr While there might be cases where the old configuration is faster, I would doubt that Sony specifically picked the old config, if they had the chance with no risk to use RB+ instead. Advantages are big regarding area efficiency, VRS support and likely better power efficiency too.
English
1
0
4
365
Locuza
Locuza@Locuza_·
@carygolomb @PinaJr I'm not sure whether RDNA1 ROPs would manage 2.23 GHz from a physical design standpoint, but functionally that comparison appears to apply (RDNA1 vs. 2 ROPs). What the PS5 uses is a "legacy" balance point, which was also used by GCN for multiple generations.
English
1
0
4
423
Locuza
Locuza@Locuza_·
@carygolomb @PinaJr Raw paper:
PS5: 16x 4 cROP = 64 C @ 2.23 GHz
PS5: 16x 16 dROP = 256 D @ 2.23 GHz
XSX: 8x 8 cROP = 64 C @ 1.825 GHz
XSX: 8x 16 dROP = 128 D @ 1.825 GHz
Polski
1
0
3
278
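The paper numbers above can be turned into peak rates by multiplying units per clock by clock speed. This is a rough sketch that assumes one operation per ROP per clock and ignores bandwidth and format limits; `CONFIGS` and `peak_rates` are hypothetical names, and only the unit counts and clocks come from the post.

```python
# Paper configurations as listed in the post: render backends, ROPs per backend, clock in GHz.
CONFIGS = {
    "PS5": {"rbs": 16, "color_per_rb": 4, "depth_per_rb": 16, "ghz": 2.23},
    "XSX": {"rbs": 8, "color_per_rb": 8, "depth_per_rb": 16, "ghz": 1.825},
}


def peak_rates(cfg):
    """Return (color, depth) peak rates in G ops/s, assuming 1 op per ROP per clock."""
    color_units = cfg["rbs"] * cfg["color_per_rb"]
    depth_units = cfg["rbs"] * cfg["depth_per_rb"]
    return color_units * cfg["ghz"], depth_units * cfg["ghz"]


if __name__ == "__main__":
    for name, cfg in CONFIGS.items():
        color, depth = peak_rates(cfg)
        print(f"{name}: {color:.2f} G color/s, {depth:.2f} G depth/s")
```

Under these assumptions the color rates differ only by the clock ratio (2.23 / 1.825, roughly 1.22x), while the PS5's depth rate is roughly 2.44x that of the XSX.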
Cary Golomb
Cary Golomb@carygolomb·
@PinaJr @Locuza_ Aren't the larger ROPs on PS5 essentially just "catching up" to the full-fat RB+ on XSX? Each RB on PS5 has four color ROPs and sixteen depth ROPs. Each RB+ on XSX doubles the throughput from four to eight 32-bit pixels per cycle, along with sixteen depth samples.
English
1
0
0
255
Locuza
Locuza@Locuza_·
@opinali @harukaze5719 This time around it was even 0%, as I was notified very early on. 😄 Quite a lot of interesting stuff to be seen there.
English
0
0
1
133
Locuza
Locuza@Locuza_·
@opinali ..issues with the article, which would quickly change. Just at this point they mention the "weird register naming", and give some historic context why that is. Essentially that was all, but apparently enough to trigger annoyance and jumping to conclusions, in turn triggering me😞
English
0
0
0
204
Locuza
Locuza@Locuza_·
@opinali Eh, the first ~10 minutes is not true either, is it? ;) Already at 1:40 they bring up the counter article from C&C and follow up saying it's a great site with a good article on that. After quoting the first paragraph at 4:40 it's already telegraphed that "so far" nobody has...
English
1
0
0
235
Osvaldo Pinali Doederlein
These dudes side with "x86 should die", and they start it with a long dunking on x86 because... many registers have weird names that reveal their legacy like AL, EBX, etc. IQ going through the stratosphere here lol. youtube.com/watch?v=xCBrto…
English
14
0
26
14K
Jacob Balma
Jacob Balma@jacob_balma·
Working on getting a cross section view of this @nvidiaGPU die after cutting with diamond saw and sanding, inspired by @Locuza_
English
2
0
2
549
Gray
Gray@Olrak29_·
Is there a die shot of Alder Lake-N somewhere?
English
1
0
13
2.5K
Locuza
Locuza@Locuza_·
@LupintheI Oh my, it really looks sorta like a computer chip with multiple dense high-level structures being repeated and interconnected. And yeah, I'm still fairly actively around, just not posting much. 😅
English
1
0
3
322