Locuza

4.9K posts

@Locuza_

Content archive: https://t.co/XOUB5IcvCc

Joined February 2018

72 Following · 4K Followers

Pinned Tweet

Locuza@Locuza_·18 Feb

Well, that's likely it, for the foreseeable future I'm retired as a content creator! My last analysis goes over the area breakdown of N31 and AD102/103, looks at very rough cost estimates for the chips and perf/mm². YT: youtube.com/watch?v=D34qur… Substack: locuza.substack.com/p/radeon-n31-v…

YouTube

テカナリエ清水@techanalye1

NVIDIA AD103 NVIDIA AD102

Locuza@Locuza_·1 Oct

@NOTimothyLottes @JBrooksBSI @bmcnett @SebAaltonen The client-focused variants of Blackwell still logically use a single L2 cache, 128 MB in GB202 (96 MB active on the 5090). x.com/Kurnalsalts/st… It's the AI/datacenter variant of Blackwell that uses a dual-die design with 10 TB/s D2D bandwidth and a local/remote L2.

Kurnal@Kurnalsalts

GB202 Dieshot/5090 Dieshot Thanks By @ASUS Tony 俞元麟 by Chip @万扯淡 by Dieshot @Kurnalsalts Layout Photo1 GB202 Dieshot Photo2 AD102 vs GB202 full Pixel Photo pls join in Kurnal’s Telegram team t.me/+DjmQ-kcsAXIyM…

NOTimothyLottes@NOTimothyLottes·30 Sep

@JBrooksBSI @bmcnett @SebAaltonen Wasn't the 5090 already using dual die with chip2chip interconnect? They have NUMA to L2 (local and remote). But otherwise standard practice. Won't necessarily get radically different designs due to fan-out ...

Sebastian Aaltonen@SebAaltonen·30 Sep

When you design data structures, always think in cache lines (64B or 128B). You don't want to have tiny nodes scattered around the memory. Often it's better to have wider nodes (preferably 1 cache line each) and shallower structures. Less pointer/offset indirections.
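
A minimal sketch of the point above, assuming a 64-byte cache line and a search tree with 32-bit keys and 32-bit child indices. The names `BinaryNode`, `WideNode` and `levels_needed` are hypothetical, chosen only for illustration. It compares a narrow node with a node sized to fill one line, then counts how many levels (and so how many dependent indirections) each layout needs for the same number of keys.

```python
import ctypes

CACHE_LINE_BYTES = 64  # assumed line size; the tweet mentions 64B or 128B


class BinaryNode(ctypes.Structure):
    """Narrow node: one key and two child indices (12 bytes)."""
    _fields_ = [
        ("key", ctypes.c_uint32),
        ("left", ctypes.c_uint32),
        ("right", ctypes.c_uint32),
    ]


class WideNode(ctypes.Structure):
    """Wide node sized to one 64-byte line: a count, 7 keys, 8 children."""
    _fields_ = [
        ("count", ctypes.c_uint32),
        ("keys", ctypes.c_uint32 * 7),
        ("children", ctypes.c_uint32 * 8),
    ]


def levels_needed(num_keys: int, keys_per_node: int) -> int:
    """Levels of a perfectly filled tree that can hold num_keys keys."""
    fanout = keys_per_node + 1
    levels, capacity = 0, 0
    while capacity < num_keys:
        # Adding one level multiplies capacity by the fan-out, plus one root.
        capacity = capacity * fanout + keys_per_node
        levels += 1
    return levels


if __name__ == "__main__":
    n = 1_000_000
    print("BinaryNode bytes:", ctypes.sizeof(BinaryNode))
    print("WideNode bytes:", ctypes.sizeof(WideNode))
    print("binary tree levels:", levels_needed(n, 1))
    print("wide tree levels:", levels_needed(n, 7))
```

The sketch only checks sizes and depth. It does not guarantee that a node starts on a line boundary, which in a real allocator would need aligned storage.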

Locuza@Locuza_·9 May

@IanMcCabe_1 @Darth_Goldsmith @Kurnalsalts But what if Lockstep wasn't even used?

Alovon@IanMcCabe_1·9 May

@Darth_Goldsmith @Kurnalsalts Yeah, the point is Orin has it and Switch 2 doesn't. So making a bench for Switch 2 using a part that has Lock-Step without making absolute sure you accounted for the behavior of Lockstep makes the data notably more inaccurate than usual

Kurnal@Kurnalsalts·7 May

The world's first Nintendo Switch 2 Dieshot Samsung 8N 8Core A78C,Share 4M L2 1536Cuda/6TPC ampere GPU A detailed process and chip analysis report will be released on Youtube and Bili at 9:30 pm tomorrow. High-resolution photos in Telegram group: t.me/+DjmQ-kcsAXIyM…

Locuza@Locuza_·7 May

@IanMcCabe_1 @CentroLeaks At the same clock speed, L2 access latency was 13% higher on N21, with 8x L1$ interfacing 16x L2$ Slices (4M), than on the Steam Deck, with 1x L1$ interfacing 4x L2$ Slices (1M). Considering this, what would the latency increase likely be going from 1GPC x 12SM to 2GPC x 8SM?

Alovon@IanMcCabe_1·7 May

@Locuza_ @CentroLeaks The point is latency. T239 has 1 GPC with 12 SMs, with that 1MB of L2 going into the single GPC. The 2050M has only 8 per GPC, with the 2MB having to be divided between the GPCs. That difference notably affects latency and effective cache bandwidth.

Centro LEAKS@CentroLeaks·7 May

Switch 2 performance: A simulated benchmark performed on a PC with the closest specs to Switch 2 shows that the GPU in docked mode is pretty good, similar to a GTX 1050 Ti. On portable mode, it's on par with PS4. However the CPU is pretty weak, considerably less than Steamdeck.

Locuza@Locuza_·7 May

@IanMcCabe_1 @CentroLeaks It's not dramatically different. The 128KB L1 Caches are private per SM, they don't add up for a single workload on a GPC level. Furthermore, the L2 is globally shared by all SMs, it's not split by GPC count.
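
A small sketch of the arithmetic behind this exchange, assuming 128 KB of L1 per SM on both parts and the GPC/SM layouts quoted in the thread (1 GPC x 12 SM for T239, 2 GPC x 8 SM for the 2050M). The helper `l1_summary` is hypothetical. It shows where the 1.5 MB and 2 MB aggregate figures come from, and that the capacity any single SM can use is the same on both.

```python
L1_KB_PER_SM = 128  # per-SM private L1, as stated in the tweet above

# (GPC count, SMs per GPC), as quoted in the thread
GPUS = {
    "T239": (1, 12),
    "RTX 2050M": (2, 8),
}


def l1_summary(gpcs: int, sms_per_gpc: int) -> dict:
    """Aggregate L1 across the chip vs. what a single SM can actually use."""
    total_sms = gpcs * sms_per_gpc
    return {
        "total_sms": total_sms,
        "aggregate_l1_kb": total_sms * L1_KB_PER_SM,
        # L1 is private per SM, so the per-SM view ignores the GPC layout.
        "usable_l1_per_sm_kb": L1_KB_PER_SM,
    }


if __name__ == "__main__":
    for name, (gpcs, sms) in GPUS.items():
        print(name, l1_summary(gpcs, sms))
```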

Alovon@IanMcCabe_1·7 May

@CentroLeaks Also the GPU used is the 2050M which has a dramatically different memory system than T239 (As in worse). T239 has 1.5MB of L1 on a single cluster versus 2MB split on two GPCs on 2050M

Locuza@Locuza_·20 Apr

@LeoWaldock @highyieldYT Indeed, smaller scope process and timing improvements are invisible on such die shots, as the structure changes are too tiny to see. You also won't see any changes on the base layer between "14"nm Zen 1 and "12"nm Zen 1+ dies, while the latter did improve FMAX by ~5%.

LeoWTech@LeoWaldock·20 Apr

@Locuza_ @highyieldYT You may have seen this. We discuss Raptor near the end. I get the impression there are plenty of refinements that are invisible to the eye m.youtube.com/watch?v=EJGr-H…

High Yield@highyieldYT·20 Apr

Soon™ 🏹🌊👀

Locuza@Locuza_·20 Apr

@OneRaichu @LeoWaldock @highyieldYT What do you mean, I'm almost always around... Though, obviously, I haven't been posting much, if anything, for quite some time, and that will likely not change.

Locuza@Locuza_·20 Apr

@LeoWaldock @highyieldYT I could imagine that the performance profile may look quite a bit different vs. Alder Lake CPUs, but afaik at least APO was later also enabled for 12th Gen. Anyhow, almost all benefits should come outside of the core architecture itself vs. Golden Cove (excluding larger L2).

Locuza@Locuza_·20 Apr

@LeoWaldock @highyieldYT On the base layer a few layout changes can be spotted, but it looks mostly the same, excluding the enlarged L2. In SPEC2017 most gains come from increased Clock Frequencies, coupled with a bit of faster DDR5 and larger L2 Caches. Depending on the workload and if APO was in use...

Locuza@Locuza_·10 Oct

@carygolomb @PinaJr That you'd have to ask someone else. 😅 Even before GCN AMD had that 1:4 ratio; at one point workload analysis apparently showed that this was a good ratio, and worth the cost. For RB+ they decided that 1:2 was the better compromise. One could also check Intel/NV ratios.

Cary Golomb@carygolomb·10 Oct

@Locuza_ @PinaJr Sorry I phrased that question wrong. What are they doing with all those Z-operations? I've played around with force enabling VRS 2x2 on RDNA2 in limited bandwidth scenarios and saw zero gains with same settings.

Cary Golomb@carygolomb·9 Oct

Another thing people generally don't understand is how pushing clocks higher has a hockey-stick like response in terms of power required. As a result, even tho the Xbox Series X has around 15% more GPU performance than the PS5, the PS5 uses around 12% *more* power.

Locuza@Locuza_·10 Oct

@carygolomb @PinaJr An earlier project start, and settling on the HW design earlier than Microsoft. In that regard I wouldn't describe it as over-engineered; Sony started with/used an older pool of IP options than MS. Based on the GFX IP versions listed for the initial PS5 HW, this all started even before PC RDNA1.

Cary Golomb@carygolomb·10 Oct

@Locuza_ @PinaJr Any guess why Sony over-engineered the dROPs? Considering XSX has HW VRS, I suppose then it's just a matter of game devs *using* it. But considering that VRS can extend to cROPs in terms of grouping, the 20% raw increase on PS5 could be a wash when 20-30% of a frame is 2x2

Locuza@Locuza_·10 Oct

@carygolomb @PinaJr While there might be cases where the old configuration is faster, I would doubt that Sony specifically picked the old config, if they had the chance with no risk to use RB+ instead. Advantages are big regarding area efficiency, VRS support and likely better power efficiency too.

Locuza@Locuza_·10 Oct

@carygolomb @PinaJr I'm not sure if, from a physical design standpoint, RDNA1 ROPs would manage 2.23 GHz, but functionally that comparison appears to apply (RDNA1 vs. 2 ROPs). What the PS5 uses is a "legacy" balance point, which was also used by GCN for multiple generations.

Locuza@Locuza_·10 Oct

@carygolomb @PinaJr Raw paper:
PS5: 16x 4 cROP = 64 C @ 2.23 GHz
PS5: 16x 16 dROP = 256 D @ 2.23 GHz
XSX: 8x 8 cROP = 64 C @ 1.825 GHz
XSX: 8x 16 dROP = 128 D @ 1.825 GHz
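
The paper figures above can be turned into raw rates with a few lines. This is a rough sketch that assumes every ROP retires one operation per clock, which ignores formats, blending and bandwidth limits. The `CONFIGS` table and `raw_rates` helper are hypothetical names. The unit counts and clocks are the ones quoted in the tweet.

```python
# Unit counts and clocks as quoted in the tweet above.
CONFIGS = {
    "PS5": {"rbs": 16, "crop_per_rb": 4, "drop_per_rb": 16, "ghz": 2.23},
    "XSX": {"rbs": 8, "crop_per_rb": 8, "drop_per_rb": 16, "ghz": 1.825},
}


def raw_rates(cfg: dict) -> dict:
    """Unit totals and peak G-ops/s, assuming one op per ROP per clock."""
    color_units = cfg["rbs"] * cfg["crop_per_rb"]
    depth_units = cfg["rbs"] * cfg["drop_per_rb"]
    return {
        "color_units": color_units,
        "depth_units": depth_units,
        "color_gops": color_units * cfg["ghz"],
        "depth_gops": depth_units * cfg["ghz"],
    }


if __name__ == "__main__":
    ps5 = raw_rates(CONFIGS["PS5"])
    xsx = raw_rates(CONFIGS["XSX"])
    print("PS5:", ps5)
    print("XSX:", xsx)
    # With equal color unit counts, the color ratio is just the clock ratio.
    print("color ratio PS5/XSX:", round(ps5["color_gops"] / xsx["color_gops"], 3))
    print("depth ratio PS5/XSX:", round(ps5["depth_gops"] / xsx["depth_gops"], 3))
```

Under this assumption the color rates differ only by the clock ratio (about 1.22x in the PS5's favor), while the depth rate differs by both the clock and the doubled unit count.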

Cary Golomb@carygolomb·10 Oct

@PinaJr @Locuza_ Aren't the larger ROPs on PS5 essentially just "catching up" to the full fat RB+ on XSX? Each RB on PS5 has four color ROPs and sixteen depth ROPs. Each RB+ on XSX doubles the throughput from four to eight 32-bit pixels per cycle, along with sixteen depth samples

Locuza@Locuza_·5 Aug

@opinali @harukaze5719 This time around it was even 0%, as I was notified very early on. 😄 Quite a lot of interesting stuff to be seen there.

Osvaldo Pinali Doederlein@opinali·29 Jul

@harukaze5719 cc @Locuza_ in the 1% chance he hasn't seen this

포시포시@harukaze5719·29 Jul

Strix Point full die shot t.bilibili.com/95921729844333…

Locuza@Locuza_·21 Apr

@opinali ..issues with the article, which would quickly change. Just at this point they mention the "weird register naming", and give some historic context why that is. Essentially that was all, but apparently enough to trigger annoyance and jumping to conclusions, in turn triggering me😞

Locuza@Locuza_·21 Apr

@opinali Eh, the first ~10 minutes is not true either, is it? ;) Already at 1:40 they bring up the counter article from C&C and follow up saying it's a great site with a good article on that. After quoting the first paragraph at 4:40 it's already telegraphed that "so far" nobody has...

Osvaldo Pinali Doederlein@opinali·19 Apr

These dudes side with "x86 should die", and they start it with a long dunking on x86 because... many registers have weird names that reveal their legacy like AL, EBX, etc. IQ going through the stratosphere here lol. youtube.com/watch?v=xCBrto…

YouTube

Locuza@Locuza_·24 Mar

@jacob_balma @nvidiaGPU Looks like a fairly loose inspiration, as iirc I never really posted about cross-sections of chips. 😅 It reminds me of some of the work ChipsByLayers did twitter.com/ChipsByLayers/…

Chips by their layers@ChipsByLayers

Substrates for higher performance chips are vastly more dense than those for normal PCBs. Below is an old AM2 chip, and above an old Atom based laptop PCB; you can see about 3 full layers in the same space as about 1 and half a layer of insulation in a "normal" PCB.

Jacob Balma@jacob_balma·20 Mar

Working on getting a cross section view of this @nvidiaGPU die after cutting with diamond saw and sanding, inspired by @Locuza_

Locuza@Locuza_·20 Mar

@Olrak29_ @forgotten_leo Yeah

Gray@Olrak29_·20 Mar

@forgotten_leo Yeah

Gray@Olrak29_·20 Mar

Is there a die shot of Alder Lake-N somewhere?

Locuza@Locuza_·11 Mar

@LupintheI Oh my, it really looks sorta like a computer chip with multiple dense high-level structures being repeated and interconnected. And yeah, I'm still fairly actively around, just not posting much. 😅

LupintheIII@LupintheI·11 Mar

It's funny how some parts actually make sense in both cases... large deposit area in the middle -> cache... offload area on the lower edge -> DDR PHY... I think I must tag @Locuza_ on that for a breakdown :-) Are you still around?

Tim Urban@waitbutwhy

This isn’t a close up of a computer chip, it’s an aerial photo of a giant steel plant in South Korea called Gwangyang Steel Works (courtesy of @DOverview).
