Travis Downs

12.6K posts

@trav_downs

Making stuff fast at @redpandadata.

Canada · Joined December 2018

176 Following · 4.2K Followers

Pinned Tweet

Travis Downs@trav_downs·11 Jun

Do you like footnotes? Do you like long, rambling lists? Then, this might be the thing for you: travisdowns.github.io/blog/2019/06/1… There, I smash my personal footnote count record (20). As for lists? It has lists *within* lists. The only thing left is to write it in LISP.

Travis Downs@trav_downs·6 Jan

@supahvee1234 This does work, though I had to disable layering check and the sandbox. It's possible that one or both could be re-enabled with more work.

Travis Downs@trav_downs·20 Dec

@supahvee1234 Looking into this myself now, though clang-only. There's this: github.com/envoyproxy/env… which seems to have worked out as it's still in there today: github.com/search?q=repo%…

Vittorio Romeo@supahvee1234·8 Dec

Is there anyone who managed to get precompiled headers to work in Bazel for GCC/Clang? #cpp

Travis Downs@trav_downs·26 Nov

@lesaboteur87 @TD_Canada Even if you do call you will be immediately disconnected once you get to the "waiting for agent" stage.

Le Saboteur@lesaboteur87·26 Nov

@TD_Canada Uhh, what's going on with accounts this morning? First it tells me out of nowhere to reset my password and then tells me to call. But your app is saying there's a high call volume right now and delays. Is everyone locked out of their account?!

Travis Downs@trav_downs·29 May

@harupy36 It is down for me on http (https seems to be up).

harupy@harupy36·29 May

Is archive.ubuntu.com down?

Travis Downs@trav_downs·18 Jan

@tavianator back-to-back transitions (e.g., to S, then E). However, my testing seems to show that even without that, stuff works better than you'd expect if the CPU was just using simple textbook rules about what state to initially bring in a line; no doubt there are predictors involved.

Travis Downs@trav_downs·18 Jan

@tavianator That's interesting. My intuition is that prefetchw is the right thing before a CAS because you want the line in E state, since a CAS, like any write, requires the line in that state. So prefetchw provides the hint needed to get it into that state, potentially two \

Tavian Barnes@tavianator·25 Nov

Does anyone have a good writeup of the behaviour of explicit prefetches on modern x86 uarches? In particular the cost of prefetching an invalid address (unmapped or even non-canonical).

Travis Downs@trav_downs·16 Jan

@WikiChip down for the count?

Travis Downs@trav_downs·16 Jan

@tavianator Are you using prefetchw or something else?

Tavian Barnes@tavianator·26 Nov

Addendum: the 33% reduction in stalled cycles was actually from unrelated icache improvements due to (I assume) code alignment differences. But with that fixed, the prefetch is still good for a ~10% IPC boost

Travis Downs retweeted

Daniel Lemire@lemire·6 Jan

The latest release of the simdutf C++ library (6.0.0) brings in more convenience for C++20 users. While you used to have to provide both a pointer and a size parameter... often you can now just pass your container...
std::vector data{1, 2, 3, 4, 5};
// C++11 API
auto cpp11 = simdutf::autodetect_encoding(data.data(), data.size());
// C++20 API
auto cpp20 = simdutf::autodetect_encoding(data);
Link in the comments.

Travis Downs@trav_downs·16 Jan

@corsix @davidtgoldblatt Yeah, good point and yeah I was thinking of the scalar side. Without affineqb I guess you could do something not terrible with vanilla PSHUFB: split the top and bottom nibbles and use a PSHUFB LUT to reverse each, then re-assemble reversed and PSHUFB again to reverse the bytes.
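
The nibble-LUT idea in the tweet above can be sketched in plain Python. This is only a model of the data flow, not real SIMD code: `pshufb` here is a hypothetical helper that imitates the byte-shuffle semantics on a 16-byte value, and `NIBBLE_REV`, `BYTE_REV` and `reverse_bits_128` are names invented for illustration.

```python
def pshufb(table, idx):
    """Model of a 16-byte PSHUFB: out[i] = table[idx[i] & 15], or 0 if idx[i] has bit 7 set."""
    return bytes(0 if i & 0x80 else table[i & 0x0F] for i in idx)

# NIBBLE_REV[n] is the 4-bit value n with its bits reversed (e.g. 0b0001 -> 0b1000).
NIBBLE_REV = bytes(int(format(n, "04b")[::-1], 2) for n in range(16))
# Shuffle control that reverses the order of the 16 bytes.
BYTE_REV = bytes(range(15, -1, -1))

def reverse_bits_128(v):
    """Reverse all 128 bits of a 16-byte little-endian value."""
    assert len(v) == 16
    lo = bytes(b & 0x0F for b in v)      # bottom nibble of every byte
    hi = bytes(b >> 4 for b in v)        # top nibble of every byte
    rev_lo = pshufb(NIBBLE_REV, lo)      # LUT lookup reverses each nibble
    rev_hi = pshufb(NIBBLE_REV, hi)
    # Re-assemble with the nibbles swapped: the reversed low nibble becomes the high one.
    swapped = bytes((l << 4) | h for l, h in zip(rev_lo, rev_hi))
    return pshufb(swapped, BYTE_REV)     # final shuffle reverses the byte order
```

In a real implementation the masking and shifting would be vector AND/shift instructions and the three `pshufb` calls would be the actual instruction; the sketch only shows why two LUT lookups plus one byte-reversing shuffle are enough.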

Pete Cawley@corsix·10 Jan

@trav_downs @davidtgoldblatt wunkolo.github.io/post/2020/11/g… can be used, albeit only on recent CPUs, and it’s a vector instruction rather than a scalar one.

David Goldblatt@davidtgoldblatt·9 Jan

TIL: Starting with Arm V8.9/9.4, there are "good" count-trailing-zero and popcount instructions (well, at least the instruction; dunno about the impl).

Travis Downs@trav_downs·10 Jan

@davidtgoldblatt I wish x86 had rbit. It's a handy building block and hard to emulate.

David Goldblatt@davidtgoldblatt·9 Jan

Previously you had to do "rbit; clz;" for the former, and move back and forth from a vector register for the latter. Same extension ("Common Short Sequence Compression") has absolute value and signed/unsigned min/max.
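
The "rbit; clz" idiom above works because reversing the bits turns trailing zeros into leading zeros. A minimal Python model, assuming 64-bit registers; `rbit`, `clz` and `ctz` are illustrative helper names, not real intrinsics:

```python
WIDTH = 64                  # assumed register width
MASK = (1 << WIDTH) - 1

def rbit(x):
    """Reverse the bit order of a WIDTH-bit value."""
    return int(format(x & MASK, f"0{WIDTH}b")[::-1], 2)

def clz(x):
    """Count leading zeros of a WIDTH-bit value; returns WIDTH for zero."""
    return WIDTH - (x & MASK).bit_length()

def ctz(x):
    """Count trailing zeros by reversing the bits, then counting leading zeros."""
    return clz(rbit(x))
```

For example, `ctz(0b101000)` is 3: the lowest set bit (bit 3) lands at bit 60 after the reversal, leaving three leading zeros.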

Travis Downs@trav_downs·9 Jan

@lemire @FUZxxl No.

Daniel Lemire@lemire·6 Jan

@FUZxxl Is RISC-V limited with respect to superscalarity?

Robert Clausecker@FUZxxl·6 Jan

RISC fell prey to the end of Dennard scaling. Now it's “how can we do as much work as possible per cycle” instead of “how can we make each cycle as simple as possible so the CPU can be clocked as high as possible”

Travis Downs@trav_downs·7 Jan

@tavianator @corsix Yes. There's another one about flag handling for folded immediates too, which is pretty interesting if only to understand the complexity caused by fault handling.

Tavian Barnes@tavianator·6 Jan

@trav_downs @corsix This one right? patents.google.com/patent/US20170…

Tavian Barnes@tavianator·4 Jan

I wrote up what I think is the explanation for the Alder Lake shift latency anomaly. Thanks to @corsix, @trav_downs, and others on here/HN for your help!

Travis Downs@trav_downs·3 Jan

@tavianator @eigenform
> Quite impressive IMO
Yes, exactly!

Tavian Barnes@tavianator·3 Jan

@trav_downs @eigenform Yes, this sequence runs at 3 IPC:
andn rax, rbx, rcx
lea rbx, [rax + 1]
lea rcx, [rax + 2]
This one runs at 4 IPC:
andn rax, rbx, rcx
lea rax, [rax + 1]
lea rbx, [rax + 2]
lea rcx, [rbx + 3]
Quite impressive IMO

Tavian Barnes@tavianator·3 Jan

New blog post: tavianator.com/2025/shlx.html

Travis Downs@trav_downs·3 Jan

@tavianator Yeah, I stopped following all this closely when there were still only 7 ports or whatever so not a problem I ran into :).

Tavian Barnes@tavianator·3 Jan

@trav_downs Kinda makes the two-digit port numbers ambiguous

Travis Downs@trav_downs·3 Jan

@tavianator @corsix @eigenform Right, but we only need 1 register to have it to trigger the slow case, in the SHLX case anyway.

Tavian Barnes@tavianator·3 Jan

@trav_downs @corsix @eigenform Just the destination register (rax) loses it

Travis Downs@trav_downs·3 Jan

@tavianator @eigenform This is testing whether the CPU can accept a hidden immediate on _both_ input arguments.

Travis Downs@trav_downs·3 Jan

@tavianator @eigenform BTW, feel free to stop testing my suggestions as soon as you get bored (I don't have an Alder Lake, so I can't do it myself). Now I'm curious if a chain something like:
andn rax, rbx, rcx
lea rbx, [rax + 1]
lea rcx, [rax + 2]
...
executes 3 instructions/cycle.

Travis Downs@trav_downs·3 Jan

@tavianator e.g. a 3 uop operation may have 2 independent uops which feed into the third, or have them all serially dependent, or have 1 initial uop which feeds into the other two (this makes sense if there are multiple outputs, or side effects).

Travis Downs@trav_downs·3 Jan

@tavianator Yeah the syntax is like MpAB NpCD ... which is M uops to any of ports A, B and N uops to any of ports C, D, and so on. If N/M are omitted they are 1. This still isn't really enough to figure out all the details since the uops may have different dependency relationships,
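
The "MpAB NpCD" notation described above is simple enough to parse mechanically. A minimal sketch, assuming groups are separated by whitespace and every port number is a single digit (with two-digit port numbers the string becomes ambiguous); `parse_port_usage` is a hypothetical helper name, not part of any existing tool:

```python
import re

def parse_port_usage(spec):
    """Parse e.g. "2p01 p23" into [(2, [0, 1]), (1, [2, 3])].

    Each tuple is (number of uops, ports any of which can execute them).
    """
    groups = []
    for token in spec.split():
        m = re.fullmatch(r"(\d*)p(\d+)", token)
        if m is None:
            raise ValueError(f"unrecognised port group: {token!r}")
        count = int(m.group(1)) if m.group(1) else 1   # omitted count means 1
        ports = [int(d) for d in m.group(2)]           # one digit per port (assumption)
        groups.append((count, ports))
    return groups
```

As the thread points out, this only captures how many uops go to which ports; it says nothing about the dependency relationships between those uops.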
