
Andreas Abel
120 posts

Pinned Tweet

I have added latency, throughput, and port usage data for Emerald Rapids, Meteor Lake, Arrow Lake, and Zen 5 to uops.info/table.html.

@IanCutress Lmao 😂🤣 I feel like they used chatGPT when writing it

The most amusing thing about this paper is that they're comparing the performance of superoptimizers analysis tools for simulated Intel architectures by running the benchmark suite on an AMD Ryzen 5900X. 😂👌
Matt @matt_dz
Facile: Fast, Accurate, and Interpretable Basic-Block Throughput Prediction arxiv.org/abs/2310.13212 IEEE International Symposium on Workload Characterization (IISWC) 2023 Andreas Abel (@uops_info), Shrey Sharma, Jan Reineke
Andreas Abel retweeted

Facile: Fast, Accurate, and Interpretable Basic-Block Throughput Prediction
arxiv.org/abs/2310.13212
IEEE International Symposium on Workload Characterization (IISWC) 2023
Andreas Abel (@uops_info), Shrey Sharma, Jan Reineke





@FUZxxl @AgnerFog_ On Skylake, rd*sbase has 6 uops and a throughput of 6 cycles, wr*sbase has 7 uops and a throughput of 18.
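To put the figures in the reply above into perspective, here is a minimal Python sketch that turns them into a rough cycle estimate. It assumes "throughput" means reciprocal throughput in cycles per instruction and that the instructions do not overlap with each other; both are assumptions for illustration, not measured results. The function name `serialized_cycles` and the example instruction mix are hypothetical.

```python
# Skylake figures quoted in the reply above.
# "cycles" is assumed to be reciprocal throughput (cycles per instruction).
SKYLAKE = {
    "rd*sbase": {"uops": 6, "cycles": 6},
    "wr*sbase": {"uops": 7, "cycles": 18},
}


def serialized_cycles(counts):
    """Rough estimate: cycles needed if the listed instructions
    execute back to back without overlapping (an assumption)."""
    return sum(SKYLAKE[name]["cycles"] * n for name, n in counts.items())


# Hypothetical loop body: one base-register write and two reads.
estimate = serialized_cycles({"wr*sbase": 1, "rd*sbase": 2})
print(estimate)
```

Under these assumptions, every write to FS/GS base costs about as much as three reads, which matters when judging the "extra index register" idea.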

Does anybody know the latency/throughput of rdfsbase, rdgsbase, wrfsbase, and wrgsbase? These could be (ab)used to turn FS/GS into extra index registers, but the usual tables (@uops_info @AgnerFog_) don't have any information on them.

Latency, throughput, and port usage data for #Zen4 is now available at uops.info/table.html

@corsix @trav_downs @geofflangdale @Wunkolo My benchmarks on SKX for VPADDD don't show such an extra uop: uops.info/html-tp/SKX/VP…

@uops_info @trav_downs @geofflangdale @Wunkolo They tend to decompose into several uops though, one of which is a merge op that combines the old and new contents of the destination.

Given:
1. crc32 has throughput 1 on port 1
2. pclmulqdq has throughput 1 on port 5
3. pclmulqdq+pxor can emulate crc32
It seems that the fastest crc32 code should divide the input in half and issue a crc32 _and_ a pclmulqdq every cycle.
Code and numbers at corsix.org/content/fast-c…
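Splitting the input as described above only works because a CRC is affine over GF(2), so parts computed independently can be recombined with XOR. The Python sketch below illustrates just that algebraic property. It uses `zlib.crc32`, which implements a different polynomial than the x86 crc32 instruction (CRC-32C), and it pads with zero bytes instead of using the carry-less-multiply folding from the linked post; the function name `crc_of_split` is hypothetical.

```python
import zlib


def crc_of_split(data: bytes) -> int:
    """Compute the CRC of `data` from two independently processed halves.

    For equal-length messages x and y:
        crc(x ^ y) == crc(x) ^ crc(y) ^ crc(all-zero message of that length)
    Here x is the first half followed by zeros, and y is zeros followed
    by the second half, so x ^ y is the original message.
    """
    half = len(data) // 2
    first, second = data[:half], data[half:]
    left = zlib.crc32(first + bytes(len(second)))    # first half, zero-padded
    right = zlib.crc32(bytes(len(first)) + second)   # zero-padded, second half
    zeros = zlib.crc32(bytes(len(data)))             # affine correction term
    return left ^ right ^ zeros


message = b"latency, throughput, and port usage"
print(crc_of_split(message) == zlib.crc32(message))
```

A real implementation would not pad with zeros; it would shift one partial CRC by the length of the other part using polynomial multiplication, which is where pclmulqdq comes in.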

@trav_downs @geofflangdale @corsix @Wunkolo There are several instructions with a writemask (such as "VPADDD (XMM, K, XMM, XMM)") that technically also read all three XMM registers. Other than that, TERNLOG indeed seems to be unique.

@uops_info @geofflangdale @corsix @Wunkolo Good point, though those only arrive in VBMI2.
I wonder if TERNLOG is unique in that respect in SKX-ish?

@geofflangdale @corsix @Wunkolo It *is* very nice.
Are there even any other 1-latency 3-[xyz]mm input instructions out there?
cc @uops_info
Andreas Abel retweeted

Good feature of uops.info: "URL" button in the top right corner gets you a URL that preserves the state of the table you've selected (which can be slowish to reconstruct).
I was too dim to notice this! Thanks @uops_info for pointing this feature out.


@_monoid @trav_downs Whether Zen2 actually runs this at 1 cyc/iteration depends on how xmm1 is initialized. If the previous write to xmm1 zeros the upper bits (like "vmovd xmm1, eax") it works. On the other hand, for, e.g., "vmovupd xmm1, [r14]" it runs at 9 cyc/iteration (even if [r14] contains 0).

@trav_downs @uops_info Have you seen any discussion of which CPUs manage to avoid the false dependency on scalar SSE ops such as roundss that merge unmodified high bits into the result?
Zen2 can: it runs this loop at 1 cyc/iteration, while uiCA says all Intels stall:
bit.ly/39HgkDr

@InstLatX64 Golden Cove throughput/latency tables going to air soon too

#Intel released the 45th edition of the x86/x64 Software Optimization Manual with #AlderLake #GoldenCove and #Gracemont microarchitecture
intel.com/content/www/us…


InstLatX64 @InstLatX64
#Intel released the 44th edition of the x86/x64 Software Optimization Manual with fixed and downloadable code samples: software.intel.com/content/dam/de… GitHub: github.com/intel/optimiza…

@IanCutress @BloodyTangerine AVX512 data for Alder Lake is now available at uops.info. I have also added instruction data for Tremont, Goldmont (Plus), Airmont, and Bonnell.

Latency, throughput, and port usage data for Alder Lake is now available at uops.info/table.html.
#Intel #AlderLake (1/4)

On the Skylakes that didn't get their LSD disabled, are there documented corner cases of JCC erratum mitigation not behaving as it should?
This uops.info snippet with offset 50 bit.ly/3fJnTJx has a suspiciously high count of DSB+LSD when I actually run it

@trav_downs @pervognsen @gamozolabs What would be examples of things that happen only on odd or even cycles?

@uops_info @pervognsen @gamozolabs Yes, but are there tricks for *all* of them?
I think we don't even know all of them: there is a long tail of hidden state that starts to matter less and less, but plenty of things which happen only on odd or even cycles (so the "parity" of your start cycle matters), lots of \

Today, I released uiCA, the "uops.info Code Analyzer". uiCA is based on data from uops.info, combined with a new detailed pipeline model. An online version (that also supports other tools) is available at uica.uops.info (1/3)

@uops_info Awesome work!
Is the extension to nanoBench which allows cycle-by-cycle measurement (so-called "Falk diagrams") available?
cc @gamozolabs



