Travis Downs

12.6K posts

@trav_downs

Making stuff fast at @redpandadata.

Canada Joined December 2018
176 Following 4.2K Followers
Pinned Tweet
Travis Downs@trav_downs·
Do you like footnotes? Do you like long, rambling lists? Then, this might be the thing for you: travisdowns.github.io/blog/2019/06/1… There, I smash my personal footnote count record (20). As for lists? It has lists *within* lists. The only thing left is to write it in LISP.
Travis Downs@trav_downs·
@supahvee1234 This does work, though I had to disable layering check and the sandbox. It's possible that one or both could be re-enabled with more work.
Vittorio Romeo@supahvee1234·
Is there anyone who managed to get precompiled headers to work in Bazel for GCC/Clang? #cpp
Le Saboteur@lesaboteur87·
@TD_Canada uhh whats going on with accounts this morning, first it tells me out of nowhere to reset my password and then tells me to call. But your app is saying theres a high call volume right now and delays. Is everyone locked out of their account?!
Travis Downs@trav_downs·
@harupy36 It is down for me on http (https seems to be up).
Travis Downs@trav_downs·
@tavianator back-to-back transitions (e.g., to S, then E). However, my testing seems to show that even without that, stuff works better than you'd expect if the CPU were just using simple textbook rules about what state to initially bring in a line; no doubt there are predictors involved.
Travis Downs@trav_downs·
@tavianator That's interesting. My intuition is that prefetchw is the right thing before a CAS because you want the line in E state, since a CAS, like any write, requires the line in that state. So prefetchw provides the hint needed to get it into that state, potentially two \
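A minimal sketch of the pattern described here: hint that the line will be written, then run the CAS loop. `fetch_max` is a hypothetical helper, not code from the thread. `__builtin_prefetch` is the GCC/Clang builtin, and whether its write hint is actually emitted as PREFETCHW depends on the compiler and target flags, so treat that part as an assumption.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical helper: atomically raise `target` to at least `candidate`.
// Returns the value that was observed before any update took place.
inline std::uint64_t fetch_max(std::atomic<std::uint64_t>& target,
                               std::uint64_t candidate) {
    // Second argument 1 means "prepare for write". The intent is to request
    // the line in an exclusive state up front, instead of loading it shared
    // and upgrading it when the CAS executes. (Assumption: the compiler
    // lowers this to a write prefetch on the target CPU.)
    __builtin_prefetch(&target, 1);

    std::uint64_t seen = target.load(std::memory_order_relaxed);
    while (seen < candidate &&
           !target.compare_exchange_weak(seen, candidate,
                                         std::memory_order_acq_rel,
                                         std::memory_order_relaxed)) {
        // On failure compare_exchange_weak refreshes `seen`,
        // and the loop condition is evaluated again.
    }
    return seen;
}
```

Whether the hint helps in practice is workload and microarchitecture dependent; measuring with and without it is the only reliable check.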
Tavian Barnes@tavianator·
Does anyone have a good writeup of the behaviour of explicit prefetches on modern x86 uarches? In particular the cost of prefetching an invalid address (unmapped or even non-canonical).
Tavian Barnes@tavianator·
Addendum: the 33% reduction in stalled cycles was actually from unrelated icache improvements due to (I assume) code alignment differences. But with that fixed, the prefetch is still good for a ~10% IPC boost
Travis Downs retweeted
Daniel Lemire@lemire·
The latest release of the simdutf C++ library (6.0.0) brings in a more convenient API for C++20 users. While you used to have to provide both a pointer and a size parameter... often you can now just pass your container...
std::vector data{1, 2, 3, 4, 5};
// C++11 API
auto cpp11 = simdutf::autodetect_encoding(data.data(), data.size());
// C++20 API
auto cpp20 = simdutf::autodetect_encoding(data);
Link in the comments.
Travis Downs@trav_downs·
@corsix @davidtgoldblatt Yeah, good point and yeah I was thinking of the scalar side. Without affineqb I guess you could do something not terrible with vanilla PSHUFB: split the top and bottom nibbles and use a PSHUFB LUT to reverse each, then re-assemble reversed and PSHUFB again to reverse the bytes.
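The nibble-LUT idea above can be modelled in scalar code. This is only a sketch of the logic, one byte at a time; a PSHUFB version would perform the same 16-entry table lookup on every byte of a vector at once and use a second shuffle for the byte reversal. The names `kRev4`, `reverse_byte`, and `reverse_bits64` are made up for illustration.

```cpp
#include <cassert>
#include <cstdint>

// kRev4[n] is the 4-bit value n with its bits reversed. A PSHUFB-based
// version would keep this table in a vector register.
constexpr std::uint8_t kRev4[16] = {
    0x0, 0x8, 0x4, 0xC, 0x2, 0xA, 0x6, 0xE,
    0x1, 0x9, 0x5, 0xD, 0x3, 0xB, 0x7, 0xF};

// Reverse the bits of one byte: look up each nibble, then swap their places.
inline std::uint8_t reverse_byte(std::uint8_t b) {
    return static_cast<std::uint8_t>((kRev4[b & 0x0F] << 4) | kRev4[b >> 4]);
}

// Reverse all 64 bits: bit-reverse each byte, then reverse the byte order
// (the step the second shuffle would handle in the vector version).
inline std::uint64_t reverse_bits64(std::uint64_t x) {
    std::uint64_t out = 0;
    for (int i = 0; i < 8; ++i) {
        const std::uint8_t byte = static_cast<std::uint8_t>(x >> (8 * i));
        out |= static_cast<std::uint64_t>(reverse_byte(byte)) << (8 * (7 - i));
    }
    return out;
}
```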
David Goldblatt@davidtgoldblatt·
TIL: Starting with Arm V8.9/9.4, there are "good" count-trailing-zero and popcount instructions (well, at least the instruction; dunno about the impl).
David Goldblatt@davidtgoldblatt·
Previously you had to do "rbit; clz;" for the former, and move back and forth from a vector register for the latter. Same extension ("Common Short Sequence Compression") has absolute value and signed/unsigned min/max.
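For readers unfamiliar with the "rbit; clz" idiom: reversing the bits turns trailing zeros into leading zeros, so counting leading zeros of the reversed value gives the trailing-zero count. Below is a portable sketch with loop-based stand-ins for the two instructions; `rbit32`, `clz32`, and `ctz32` are illustrative names, not intrinsics.

```cpp
#include <cassert>
#include <cstdint>

// Portable stand-in for a 32-bit RBIT: bit i of the input becomes bit 31 - i.
inline std::uint32_t rbit32(std::uint32_t x) {
    std::uint32_t r = 0;
    for (int i = 0; i < 32; ++i) {
        r = (r << 1) | (x & 1u);
        x >>= 1;
    }
    return r;
}

// Portable stand-in for a 32-bit CLZ. Returns 32 for an input of zero.
inline int clz32(std::uint32_t x) {
    int n = 0;
    for (std::uint32_t mask = 0x80000000u; mask != 0 && (x & mask) == 0; mask >>= 1) {
        ++n;
    }
    return n;
}

// Count trailing zeros as the two-instruction sequence would: reverse, then CLZ.
inline int ctz32(std::uint32_t x) {
    return clz32(rbit32(x));
}
```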
Robert Clausecker@FUZxxl·
RISC fell prey to the end of Dennard scaling. Now it's “how can we do as much work as possible per cycle” instead of “how can we make each cycle as simple as possible so the CPU can be clocked as high as possible”
Travis Downs@trav_downs·
@tavianator @corsix Yes. There's another one about flag handling for folded immediates too, which is pretty interesting if only to understand the complexity caused by fault handling.
Tavian Barnes@tavianator·
I wrote up what I think is the explanation for the Alder Lake shift latency anomaly. Thanks to @corsix, @trav_downs, and others on here/HN for your help!
Tavian Barnes@tavianator·
@trav_downs @eigenform Yes, this sequence runs at 3 IPC:
andn rax, rbx, rcx
lea rbx, [rax + 1]
lea rcx, [rax + 2]
This one runs at 4 IPC:
andn rax, rbx, rcx
lea rax, [rax + 1]
lea rbx, [rax + 2]
lea rcx, [rbx + 3]
Quite impressive IMO
Travis Downs@trav_downs·
@tavianator Yeah, I stopped following all this closely when there were still only 7 ports or whatever so not a problem I ran into :).
Travis Downs@trav_downs·
@tavianator @eigenform BTW, feel free to stop testing my suggestions as soon as you get bored (I don't have an Alder Lake, so I can't do it myself). Now I'm curious if a chain something like:
andn rax, rbx, rcx
lea rbx, [rax + 1]
lea rcx, [rax + 2]
...
executes 3 instructions/cycle.
Travis Downs@trav_downs·
@tavianator e.g. a 3 uop operation may have 2 independent uops which feed into the third, or have them all serially dependent, or have 1 initial uop which feeds into the other two (this makes sense if there are multiple outputs, or side effects).
Travis Downs@trav_downs·
@tavianator Yeah the syntax is like MpAB NpCD ... which is M uops to any of ports A, B and N uops to any of ports C, D, and so on. If N/M are omitted they are 1. This still isn't really enough to figure out all the details since the uops may have different dependency relationships,
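One way to read the notation described above is as a list of (count, port set) groups. The sketch below parses strings such as "2p01 p23" under that reading. It assumes whitespace-separated groups, single-digit port numbers, and well-formed input; `UopGroup` and `parse_port_usage` are hypothetical names, and this is not the parser of any particular tool.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// One group from the notation: `count` uops, each of which can issue
// to any of the ports listed in `ports`.
struct UopGroup {
    int count;
    std::vector<int> ports;
};

// Parse groups of the form [M]pAB..., e.g. "2p01 p23".
// An omitted count means 1, matching the description in the tweet.
inline std::vector<UopGroup> parse_port_usage(const std::string& text) {
    std::vector<UopGroup> groups;
    std::istringstream in(text);
    std::string token;
    while (in >> token) {
        const std::size_t p = token.find('p');
        if (p == std::string::npos) {
            continue;  // not a port group; ignore it
        }
        UopGroup g;
        g.count = (p == 0) ? 1 : std::stoi(token.substr(0, p));
        for (std::size_t i = p + 1; i < token.size(); ++i) {
            if (std::isdigit(static_cast<unsigned char>(token[i]))) {
                g.ports.push_back(token[i] - '0');  // single-digit ports only
            }
        }
        groups.push_back(g);
    }
    return groups;
}
```

As the tweet notes, this captures only port usage; it says nothing about how the uops depend on one another.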