Dc nigma

2.2K posts

@Dcnigma

I am a bunny!

Internet · Joined January 2011
427 Following · 44 Followers
Dc nigma reposted
Falco Girgis @falco_girgis
HOLY SHIT, I just achieved MASSIVE GAINZ on multiplying and accumulating a 4x4 matrix held within memory onto the "active matrix" held within the 4x4 FP register back-bank of the SH4 FPU in my accelerated math library for the Sega Dreamcast! After an extremely tense two-hour session of playing inline assembly Tetris by meticulously hand-scheduling and reorganizing SH4 instructions, I am finally SPANKING the legacy mat_apply() routine from KallistiOS, rather than barely winning against it, with SH4ZAM, my library of accelerated math routines targeting the Dreamcast's SH4.

What you see in the top left pane is the original out-of-line ASM implementation of mat_apply(), which is offered by KOS's minimalist matrix.h API. What you see in the top right pane is my inline ASM implementation, which is part of SH4ZAM's XMTRX API. The bottom left pane is the unit test suite which benchmarks the two implementations against each other, and the bottom right pane is the output after I ran the test suite on my physical HW, using the cycle-accurate SH4 performance counters to measure timing...

The results? When the matrix which gets passed to the routines as an argument is not already resident within the cache, I get about an 83% performance improvement. When the operand matrix is already resident within the cache, I get about a 21% perf improvement... which is ASTRONOMICAL GAINZ for a routine this hot!!!

So what the hell did I do to achieve this?

1) First of all, I worked WITH the compiler instead of against it. Rather than implementing my routine as a black-box out-of-line ASM routine, which has to pay the cost of a full function call (saving and restoring certain registers and managing the stack frame according to the C ABI), I opted to implement mine as a forcibly inlined routine written in inline ASM. By doing this, I'm able to tell the compiler precisely which registers I'm using and clobbering, so the compiler doesn't have to save and restore as much shit as it potentially would for a full C ABI call and instead only does it for the registers it actually gives a shit about preserving across the call.

2) Smarter stack management for when I need to push and pop values to and from the stack within the routine itself. Rather than using FMOV.S to load and store single FP values to and from the stack, I align the stack up to 8 bytes, which allows me to swap to pairwise FMOV mode and use FMOV.D to load or store TWO floats for the exact same cycle cost as one.

3) Strategic prefetching. Since I know exactly what data I'm going to be operating on (the source matrix) and in what order, I can manually preload the data into the cache before I actually attempt to load it from memory, while I'm doing other stuff with the CPU. I'm prefetching the first cache line of the matrix while I'm dicking around with aligning the stack, so that when I start loading the first values right afterwards, it's already there. Then the second cache line gets prefetched while the first cache line is being used, so by the time I get done with the first cache line, the second is also filled.

4) Grouping instructions by pairs in a manner that maximizes superscalar dual dispatch on the SH4. This one's tricky. Not all instructions can be executed 2-at-a-time. Only instruction pairs in certain compatible groups can be run in parallel, so I had to be very careful about how I grouped the work. Pairing integer work with floating point work, or integer work with loading/storing, for example, lets the two instructions use different areas of the CPU. Pairing two loads together, on the other hand, results in only single instruction dispatch, as the two would compete for resources.

5) Reducing vector instruction pipeline stalls. So it turns out there's an undocumented bit of bullshit with how the pipeline forwarding works when loading operands into FP regs and then attempting to use them with vector instructions, like the FTRV instruction you see there. For regular FP instructions, there is circuitry which forwards the result of a load on to an arithmetic FP instruction needing it as input before the load instruction is fully retired. That circuitry is evidently not connected to the vector unit, so there's an extra number of cycles one must wait between loading operands into FP regs and trying to use them with vector instructions, or else you'll stall the pipeline. The only way we even know this is from rigorous measurements using the SH4 performance counters... and since I did know it, I was able to Tetris a few extra instructions of work between FMOV.D loading the operands into regs and FTRV trying to operate on them.

Anyway... stoked AF that after two hours of getting my ass kicked by the SH4, I made HUGE wins!
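Points 1 and 3 of the tweet (forced inlining and prefetching the source matrix one cache line ahead) can be roughly illustrated in portable C. This is only a sketch under stated assumptions, not SH4ZAM or KallistiOS code: the names `mat4_sketch` and `mat4_apply_sketch` are hypothetical, the column-major layout and the multiplication order (`active = active * src`) are assumptions, and it relies on the GCC/Clang extensions `__attribute__((always_inline))` and `__builtin_prefetch` rather than hand-scheduled SH4 assembly. It also assumes 32-byte cache lines, so a 64-byte 4x4 float matrix spans exactly two lines.

```c
/* Hypothetical type: a 4x4 float matrix, assumed column-major, aligned to a
 * 32-byte cache line so that its 64 bytes occupy exactly two lines. */
typedef struct {
    _Alignas(32) float m[16];
} mat4_sketch;

/* Forced inlining: the compiler sees the body at the call site and only
 * preserves the registers it actually needs, instead of paying for a full
 * out-of-line call. (GCC/Clang attribute; an assumption of this sketch.) */
static inline __attribute__((always_inline))
void mat4_apply_sketch(mat4_sketch *active, const mat4_sketch *src)
{
    mat4_sketch out;

    /* Hint that the first cache line (columns 0-1) is about to be read.
     * In the tweet this overlaps with stack setup; here it only marks
     * where such a hint would go. */
    __builtin_prefetch(&src->m[0], 0, 3);

    for (int col = 0; col < 4; ++col) {
        /* Request the second cache line (columns 2-3) while the first
         * one is still being consumed. */
        if (col == 0)
            __builtin_prefetch(&src->m[8], 0, 3);

        for (int row = 0; row < 4; ++row) {
            float sum = 0.0f;
            for (int k = 0; k < 4; ++k)
                sum += active->m[k * 4 + row] * src->m[col * 4 + k];
            out.m[col * 4 + row] = sum;
        }
    }

    *active = out; /* accumulate the result back into the "active" matrix */
}
```

Prefetch hints never change the result, only (possibly) the timing, so whether they help on a given target has to be measured, as the tweet does with the SH4 performance counters.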
Falco Girgis tweet media
20
46
484
14.5K
TracketPacer @TracketPacer
i found some cage nuts
116
63
1.4K
48.9K
Dc nigma @Dcnigma
@sciencegirl Sometimes they will measure for magnetism if they doubt the results, and then you're in serious trouble
0
0
0
603
Science girl @sciencegirl
An old-school electricity meter trick (Don’t do this at home)
31
45
530
124.2K
Dc nigma @Dcnigma
@droidbuilds And buy an HDMI dummy dongle for higher resolution 😉
0
0
0
4
DROID @droidbuilds
finally got a macbook 🥹 what’s the first thing I should install?
DROID tweet media
252
9
339
25.3K
Cipher @Cipher_twt
😭😭
Cipher tweet media
15
6
219
7.2K
Procyon @Procyon86
First time holding a Cyrix chip 🫠
Procyon tweet media
31
11
180
4.7K
Syra @Syraavibes
What do these two colors remind you of??
Syra tweet media (2 images)
3.7K
694
9.6K
8.4M
Kr$na @krishdotdev
Memory leaks on macbook neo are frequent now.
Kr$na tweet media
171
45
1.5K
261.1K
Dc nigma @Dcnigma
@kmcnam1 Found a big bug 10 years ago at a company I worked at. They used roaming profiles, and when you disconnected the LAN at a certain moment during login, your privileges were elevated to domain admin 😂 good times
0
0
0
41
sudox @kmcnam1
sudox tweet media
3
7
108
2.8K
SolidSnake @phinn888
@Dcnigma @1GamewithDave1 Not just the kid factor; it was so limited. By late '96 Quake had dedicated multiplayer servers, custom maps, mods, clans, a huge online community, LAN parties, etc.
1
0
0
15
Retro Dave @1GamewithDave1
Most beloved game of the 90s?
Retro Dave tweet media
98
130
1.3K
56.4K
Dc nigma @Dcnigma
@GameStalgiaX Yup, a few, and I repurposed one of them into an arcade cabinet 😜
0
0
2
90
Dc nigma @Dcnigma
@brockpierson I burned so many CDs that I needed to buy a new one every year 😅😂🤣
0
0
0
107
⭕ Brock Pierson @brockpierson
Did you own a CD burner back in the day when they used to cost upwards of $300?
⭕ Brock Pierson tweet media
398
50
1.4K
40.5K
Dc nigma @Dcnigma
@JustBeingDanny I got my NES and SNES when I was 35; I never cared for Nintendo. I got my first Sega when I was 9. I own a Master System 1 and 2, a Mega Drive, 2 Saturns, and 5 Dreamcasts. I still remember the day Sega said they would stop making consoles. 🥲
0
0
2
47
Danny Major 🧟‍♂️ @JustBeingDanny
This is why I've almost had enough. Unfortunately, dick heads will believe this.
Danny Major 🧟‍♂️ tweet media
45
3
52
6.5K