Pinned Tweet
Fabian Giesen
83.6K posts

Fabian Giesen
@rygorous
Abstraction maker, abstraction breaker. @[email protected] he/him
Joined December 2009
91 Following, 14.8K Followers

@Streetware_ This is comparable to, or slightly less work than, just converting a regular (u)int32 to float!

New blog post: "UNORM and SNORM to float, hardware edition" fgiesen.wordpress.com/2024/12/24/uno…
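For reference, a minimal sketch of what the conversions in the post title compute, using the textbook definitions (divide by 255 for UNORM8; divide by 127 and clamp to -1 for SNORM8). This is not the hardware-oriented method the post describes, and the function names are made up for illustration.

```python
def unorm8_to_float(x: int) -> float:
    """Map an unsigned 8-bit value 0..255 onto [0.0, 1.0]."""
    if not 0 <= x <= 255:
        raise ValueError("UNORM8 input must be in 0..255")
    return x / 255.0


def snorm8_to_float(x: int) -> float:
    """Map a signed 8-bit value -128..127 onto [-1.0, 1.0].

    Both -128 and -127 map to -1.0, which is why the clamp is needed.
    """
    if not -128 <= x <= 127:
        raise ValueError("SNORM8 input must be in -128..127")
    return max(x / 127.0, -1.0)
```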

@daniel_collin On x86, same thing with PSUBW + PMULHRSW + PADDW, FWIW. (PMULHRSW is basically the same as ARM SQRDMULH, the just-multiply-not-multiply-accumulate version of SQRDMLAH.)

My new favorite ARM Neon instruction is: sqrdmlah
developer.arm.com/architectures/…
It allows doing a LERP of 8 x i16 values with only two instructions (a vsub and the instruction above). Super useful for what I'm currently fiddling with :)
Thanks to @rygorous for the tip!
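A scalar sketch of the LERP discussed in this exchange, modelling a single 16-bit lane of the x86 PSUBW + PMULHRSW + PADDW sequence in Python. It assumes the blend factor `t` is a Q15 fixed-point value (0..32767) and that `b - a` fits in a signed 16-bit lane; the helper names are hypothetical.

```python
def wrap_i16(x: int) -> int:
    """Wrap an integer to a signed 16-bit lane (two's complement)."""
    x &= 0xFFFF
    return x - 0x10000 if x & 0x8000 else x


def pmulhrsw(a: int, b: int) -> int:
    """Model one lane of PMULHRSW: rounded (a * b) >> 15, no saturation."""
    return wrap_i16((a * b + 0x4000) >> 15)


def lerp_q15(a: int, b: int, t: int) -> int:
    """a + (b - a) * t / 32768, with t in Q15."""
    diff = wrap_i16(b - a)        # PSUBW
    scaled = pmulhrsw(diff, t)    # PMULHRSW
    return wrap_i16(a + scaled)   # PADDW
```

The ARM SQRDMULH/SQRDMLAH forms differ from this model only in that they saturate, which matters for the single case of -32768 times -32768.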

New blog post: "Exact UNORM8 to float" fgiesen.wordpress.com/2024/11/06/exa… a satisfying solution to a problem that, quite possibly, nobody has

New blog post: "BC7 optimal solid-color blocks" fgiesen.wordpress.com/2024/11/03/bc7… clearing out my "I should write this up" queue, this technique is from... *checks git logs* May 2017. Oh my. (I have quite the backlog.)

@tom_forsyth PMULHW is at 0x0f 0xe5. PMULHUW is 0x0f 0xe4. MUL and IMUL are ModR/M reg=4 and reg=5 (/4 and /5) in their group. It's possible they just blocked out things this way by coincidence, but given this and Andy's comments, I doubt it.

@tom_forsyth Because at the time there was a mandate to be "more RISC-y" which management at the time interpreted as "fewer instructions is good". Andy Glew was still publicly salty about it 5 years later. web.stanford.edu/class/ee380/Ab…

New blog post: "Why those particular integer multiplies?" fgiesen.wordpress.com/2024/10/26/why… some explanation and some speculation on the integer SIMD multiplies offered in x86, along with some history

@rygorous Would you also try to fit pclmulqdq in to the same data path? It is after all kind of an integer mul, just without carries.
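To make "an integer mul, just without carries" concrete, here is a small sketch of carry-less multiplication, the operation PCLMULQDQ performs on 64-bit inputs. It is the usual shift-and-add loop with XOR in place of addition; the function name is made up for illustration.

```python
def clmul(a: int, b: int) -> int:
    """Carry-less multiply of two non-negative integers.

    Same shift-and-add structure as ordinary binary multiplication,
    but partial products are combined with XOR, so no carries propagate.
    """
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result
```

When no two partial products overlap in the same bit position, the result matches the ordinary product; otherwise the two differ.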

@geofflangdale It's different for every "iteration" and BC7 decode does it 1-3 times in a row. The actual decoder has this in vector regs so I don't have PDEP/PEXT to begin with.

@rygorous Nice!
In your application, how often is the 'pos' parameter a delightful surprise that varies unpredictably per iteration? Do you ever need this twice in a row?
I think this wins vs PDEP (I'm not as sure whether the "remove 0" wins vs PEXT), and is more portable, natch.

New blog post: "Inserting a 0 bit in the middle of a value" fgiesen.wordpress.com/2024/10/24/ins… I guess it's 2-for-1 bit hacks week.
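A minimal sketch of one common way to do this without PDEP/PEXT, assuming non-negative inputs: adding the bits at or above `pos` to the value doubles them, which shifts them up by one and leaves a 0 at `pos`. The post may formulate it differently, and the function names here are hypothetical.

```python
def insert_zero_bit(x: int, pos: int) -> int:
    """Insert a 0 bit at bit position pos; bits >= pos move up by one."""
    low_mask = (1 << pos) - 1
    high = x & ~low_mask      # bits at or above pos
    return x + high           # low + 2 * high


def remove_zero_bit(y: int, pos: int) -> int:
    """Inverse of insert_zero_bit; assumes bit pos of y is 0."""
    low_mask = (1 << pos) - 1
    high = (y >> 1) & ~low_mask
    return y - high
```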

New blog post: "Zero or sign extend" fgiesen.wordpress.com/2024/10/23/zer…
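For context, a short sketch of the two operations named in the title, using the well-known XOR-and-subtract identity for sign extension. This is a generic illustration, not necessarily the approach the post takes; the function names are made up.

```python
def zero_extend(x: int, bits: int) -> int:
    """Keep the low `bits` bits of x and treat them as unsigned."""
    return x & ((1 << bits) - 1)


def sign_extend(x: int, bits: int) -> int:
    """Interpret the low `bits` bits of x as a two's complement value."""
    sign_bit = 1 << (bits - 1)
    x = zero_extend(x, bits)
    # Flipping the sign bit and subtracting it back maps
    # [0, 2**bits) onto [-2**(bits-1), 2**(bits-1)).
    return (x ^ sign_bit) - sign_bit
```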

@nothings my first association on reading that string of letters is cs.toronto.edu/~simon/html/un…


@nothings @aras_p It's already shipped in UE 5.4! dev.epicgames.com/documentation/…

@liam_whan @cmuratori Casey has me blocked so I can't even read the tweet in question. (Not that I'm really active on here anymore anyway.)

@cmuratori This is cheating, but I'm curious about what @rygorous thinks...

@gdamjan @jonmasters It shipped in several Skylake SKUs. en.wikichip.org/wiki/intel/mic…

@jonmasters IIRC Intel has been talking about “Memory Side Cache” for about 10 years. They tried something with the eDRAM, but I guess they then didn't follow through

Don’t forget split scheduler on the back end, and a “Memory Side Cache”. I know my own Bingo card was full by the time they were done describing everything Apple already shipped 4 years ago
INIYSA @lafaiel
Intel is now essentially following Apple's design philosophy, with an integrated memory architecture, a large front-end, a large L1 cache, removal of SMT, 4+4 cores

@Simon_Fe1 @tom_forsyth @FreyaHolmer I was talking about Booth encoding in a regular multiplier (you never Booth encode both operands). I'm pretty sure squarers don't Booth encode at all, yes.

@tom_forsyth @FreyaHolmer The main application I'm aware of is "High-Speed Function Approximation Using a Minimax Quadratic Interpolator" by Piñeiro, Oberman, Muller and Bruguera. (Internals of NVidia GPU SFUs at some point, I think their current SFUs are still descended from this.)

@tom_forsyth @FreyaHolmer Squarers are mostly a thing in special function units for polynomial eval.
You only ever Booth encode one of the arguments; the other is left alone, so that doesn't save anything, but IIRC there are some shortcuts you can do for squaring.
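To illustrate the "only one operand is Booth encoded" point, here is a software sketch of radix-4 Booth recoding: the multiplier `y` is rewritten as digits in {-2, -1, 0, 1, 2}, while the multiplicand `x` is used unchanged in each partial product. It assumes an even bit width and a two's complement `y`; the function names are hypothetical and this models the arithmetic only, not any particular hardware design.

```python
def booth_radix4_digits(y: int, bits: int) -> list[int]:
    """Recode a `bits`-wide two's complement y into radix-4 Booth digits.

    Digit i is -2*y[2i+1] + y[2i] + y[2i-1], with y[-1] taken as 0,
    so sum(d * 4**i) equals the signed value of y.
    """
    if bits % 2:
        raise ValueError("bits must be even")
    y &= (1 << bits) - 1
    digits = []
    prev = 0  # implicit bit below the LSB
    for i in range(0, bits, 2):
        b0 = (y >> i) & 1
        b1 = (y >> (i + 1)) & 1
        digits.append(b0 + prev - 2 * b1)
        prev = b1
    return digits


def booth_multiply(x: int, y: int, bits: int) -> int:
    """Multiply x by signed y using one partial product per Booth digit."""
    return sum(d * x * 4**i
               for i, d in enumerate(booth_radix4_digits(y, bits)))
```

Halving the number of partial products comes entirely from recoding `y`; `x` only ever needs to be negated or doubled.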


