Andreas Abel

120 posts

@uops_info

Zürich, Switzerland · Joined March 2014
46 Following · 694 Followers
Pinned Tweet
Andreas Abel @uops_info
I have added latency, throughput, and port usage data for Emerald Rapids, Meteor Lake, Arrow Lake, and Zen 5 to uops.info/table.html.
𝐷𝑟. 𝐼𝑎𝑛 𝐶𝑢𝑡𝑟𝑒𝑠𝑠
The most amusing thing about this paper is that they're comparing the performance of superoptimizers analysis tools for simulated Intel architectures by running the benchmark suite on an AMD Ryzen 5900X. 😂👌
Matt @matt_dz

Facile: Fast, Accurate, and Interpretable Basic-Block Throughput Prediction arxiv.org/abs/2310.13212 IEEE International Symposium on Workload Characterization (IISWC) 2023 Andreas Abel (@uops_info), Shrey Sharma, Jan Reineke

Andreas Abel retweeted
Matt @matt_dz
Facile: Fast, Accurate, and Interpretable Basic-Block Throughput Prediction arxiv.org/abs/2310.13212 IEEE International Symposium on Workload Characterization (IISWC) 2023 Andreas Abel (@uops_info), Shrey Sharma, Jan Reineke
Andreas Abel @uops_info
@FUZxxl @AgnerFog_ On Skylake, rd*sbase has 6 uops and a throughput of 6 cycles, wr*sbase has 7 uops and a throughput of 18.
Robert Clausecker @FUZxxl
Does anybody know the latency/throughput of rdfsbase, rdgsbase, wrfsbase, and wrgsbase? These could be (ab)used to turn FS/GS into extra index registers, but the usual tables (@uops_info @AgnerFog_) don't have any information on them.
Pete Cawley @corsix
Given:
1. crc32 has throughput 1 on port 1
2. pclmulqdq has throughput 1 on port 5
3. pclmulqdq+pxor can emulate crc32
It seems that fastest crc32 code should divide input in half and issue a crc32 _and_ a pclmulqdq every cycle. Code and numbers at corsix.org/content/fast-c…
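The "divide input in half" step can be checked with a little arithmetic. The sketch below is only an illustration under stated assumptions: it takes 8 bytes per `crc32` (a 64-bit source operand) and, as an assumption not stated in the tweet, 8 bytes per `pclmulqdq` (two multiplies per 16-byte fold). The helper `balanced_split` is hypothetical, and the cost of merging the two partial results is ignored.

```python
def balanced_split(total_bytes, rate_a, rate_b):
    """Split total_bytes between two units so that both finish together.

    rate_a and rate_b are bytes per cycle for each unit (assumed values).
    Returns (bytes for unit a, bytes for unit b, cycles needed).
    """
    share_a = total_bytes * rate_a / (rate_a + rate_b)
    share_b = total_bytes - share_a
    cycles = total_bytes / (rate_a + rate_b)
    return share_a, share_b, cycles


# Assumed rates: one crc32 per cycle on port 1 at 8 bytes each,
# one pclmulqdq per cycle on port 5 at an effective 8 bytes each.
CRC32_BYTES_PER_CYCLE = 8
PCLMUL_BYTES_PER_CYCLE = 8

split = balanced_split(4096, CRC32_BYTES_PER_CYCLE, PCLMUL_BYTES_PER_CYCLE)
print(split)
```

With equal rates on both ports, the balanced split is an even one, which matches the tweet's suggestion. If the rates differed, the same helper would give an uneven split.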
Andreas Abel @uops_info
@trav_downs @geofflangdale @corsix @Wunkolo There are several instructions with a writemask (such as "VPADDD (XMM, K, XMM, XMM)") that technically also read all three XMM registers. Other than that, TERNLOG indeed seems to be unique.
Andreas Abel retweeted
Geoff Langdale @geofflangdale
Good feature of uops.info: "URL" button in the top right corner gets you a URL that preserves the state of the table you've selected (which can be slowish to reconstruct). I was too dim to notice this! Thanks @uops_info for pointing this feature out.
Andreas Abel @uops_info
@_monoid @trav_downs Whether Zen2 actually runs this at 1 cyc/iteration depends on how xmm1 is initialized. If the previous write to xmm1 zeros the upper bits (like "vmovd xmm1, eax") it works. On the other hand, for, e.g., "vmovupd xmm1, [r14]" it runs at 9 cyc/iteration (even if [r14] contains 0).
Alexander Monakov @_monoid
@trav_downs @uops_info Have you seen discussion which CPUs manage to avoid false dependency on scalar SSE ops such as roundss that merge unmodified high bits into the result? Zen2 can, it runs this loop at 1cyc/iteration while UICA says all Intels stall: bit.ly/39HgkDr
Stanislav @Stanisl61420489
@InstLatX64 Golden Cove throughput/latency tables going to air soon too
Victor Michel @vic_mic_
On the Skylakes that didn't get their LSD disabled, are there documented corner cases of JCC erratum mitigation not behaving as it should? This uops.info snippet with offset 50 bit.ly/3fJnTJx has a suspiciously high count of DSB+LSD when I actually run it
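For context, the JCC erratum mitigation concerns jump instructions that cross or end on a 32-byte boundary. Here is a minimal sketch of that placement check, assuming you already know a jump's start offset and encoded length. The helper `touches_boundary` and the example offsets are hypothetical and do not reproduce the linked snippet.

```python
BOUNDARY = 32  # bytes; the mitigation is defined in terms of 32-byte blocks


def touches_boundary(start, length, boundary=BOUNDARY):
    """Return True if an instruction occupying [start, start + length)
    crosses a boundary or has its last byte at the end of a block."""
    last = start + length - 1
    crosses = start // boundary != last // boundary
    ends_on = (start + length) % boundary == 0
    return crosses or ends_on


# Hypothetical 2-byte jump placed at a few different offsets.
for offset in (0, 30, 31, 62):
    print(offset, touches_boundary(offset, 2))
```

This covers only the placement rule. Whether a given loop is then served from the DSB or the LSD has to be measured, as the tweet does.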
Travis Downs @trav_downs
@uops_info @pervognsen @gamozolabs Yes, but are there tricks for *all* of them? I think we don't even know all of them: there is a long tail of hidden state that starts to matter less and less, but plenty of things which happen only on odd or even cycles (so the "parity" of your start cycle matters), lots of \
Andreas Abel @uops_info
Today, I released uiCA, the "uops.info Code Analyzer". uiCA is based on data from uops.info, combined with a new detailed pipeline model. An online version (that also supports other tools) is available at uica.uops.info (1/3)
Travis Downs @trav_downs
@uops_info Awesome work! Is the extension to nanoBench which allows cycle-by-cycle measurement (so-called "Falk diagrams") available? cc @gamozolabs