camel-cdr

828 posts

@CamelCdr

🐘 @[email protected]

Joined March 2024

69 Following · 262 Followers

camel-cdr@CamelCdr·2d

@ashvardanian DC Roma II is a laptop with RVV 1.0. You can get SSH access to hardware via the RISC-V labs program: riscv.org/developers/lab… cloud-v has a milkv-jupiter instance; EPCC also has them, plus a bananapi bpi-f3 instance. You can also download RVV 1.0 RTL and simulate it with Verilator.

Ash Vardanian@ashvardanian·2d

@CamelCdr Yes, I should probably adjust wording. I mean something accessible in a phone, tablet, laptop, desktop… like real “consumer tech” or at least a cloud hosted VM that I can rent hourly

Ash Vardanian@ashvardanian·3d

My biggest open-source release! NumKong — 2'000+ SIMD kernels for mixed-precision numerics, from Float6 to Float118. Started in 2023. Opened the PR in 2024. Finally, merged this week! RISC-V, Intel AMX & AVX-512, Apple SME & SVE, WASM Relaxed SIMD. 200'000 lines of code in a 5 MB binary. Same scale as OpenBLAS. Available for C 99, C++ 23, Python 3, Rust, Swift, GoLang, & JavaScript. Int4 dot products via nibble algebra. Ozaki Float64 GEMMs on Float32 tile hardware. 6-bit and 8-bit floats back-ported to 10-year-old CPUs. 5'300x faster Geospatial metrics than GeoPy. 200x faster Kabsch than BioPython. 0 ULP where OpenBLAS hits 56... and a lot more! pip install numkong Or pull it from NPM, Crates, GitHub... and let me know what breaks 🤗 Links & highlights ⬇️

camel-cdr@CamelCdr·2d

@ashvardanian “no consumer silicon supporting RVV 1.0” yeah, no, there have been RVV 1.0 SBCs available since 2023.

Ash Vardanian@ashvardanian·3d

Here's the full write up: ashvardanian.com/posts/numkong/

camel-cdr@CamelCdr·5d

@dzaima @FelixCLC_ @FUZxxl Yeah, I forgot the name.

dzaima@dzaima·5d

@CamelCdr @FelixCLC_ @FUZxxl Known as a funnel shift; `shld` on x86

@fclc@FelixCLC_·5d

Thinking about the viability of a sign-preserving right shift; we already have the machinery to do this from the float side of things, which soft-eliminates the need for 2's complement?

camel-cdr@CamelCdr·5d

@FelixCLC_ @FUZxxl Maybe you can add something to get it to cover arithmetic shifts as well?

camel-cdr@CamelCdr·5d

@FelixCLC_ @FUZxxl Logical shift shifts in zeros, arithmetic shifts in src[xlen-1], rotate shifts in src[0], a 2->1 GPR shift shifts in from src[0] of the second GPR. The latter generalizes to everything but arithmetic shifts:
logical: shift2 rd, rs1, x0, n
rotate: shift2 rd, rs1, rs1, n
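
A minimal Python sketch of the idea in this reply, assuming that `shift2` is a right funnel shift on a 64-bit register pair (the tweet does not fix the direction or the width, so both are assumptions here). It models how one 2->1 shift covers the logical and rotate cases, and why the arithmetic case needs extra help: the second source has to be a sign-fill word, which costs another instruction.

```python
XLEN = 64                      # assumed register width
MASK = (1 << XLEN) - 1

def shift2(rs1, rs2, n):
    """Hypothetical 2->1 funnel shift right: low XLEN bits of (rs2:rs1) >> n."""
    n %= XLEN
    pair = ((rs2 & MASK) << XLEN) | (rs1 & MASK)
    return (pair >> n) & MASK

def srl(x, n):
    # Logical right shift: the second source is x0, so zeros are shifted in.
    return shift2(x, 0, n)

def ror(x, n):
    # Rotate right: the register's own low bits re-enter at the top.
    return shift2(x, x, n)

def sra(x, n):
    # Arithmetic right shift is not covered directly: the second source
    # must first be filled with copies of the sign bit.
    sign_fill = MASK if (x >> (XLEN - 1)) & 1 else 0
    return shift2(x, sign_fill, n)
```

For example, `shift2(0, 1, 1)` moves bit 0 of the second register into the top bit of the result, which is the "shifts in from src[0] of the second GPR" behaviour described above.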

camel-cdr@CamelCdr·13 Mar

@FUZxxl fgiesen.wordpress.com/2013/08/29/sim…

Robert Clausecker@FUZxxl·13 Mar

We implement transposition of square matrices by recursively transposing block matrices of varying sizes ((AB|CD) → (AC|BD)). If we transpose an arbitrary set of block matrix sizes instead of all of them we get generalised transposition. Has this been studied before?
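
The block-swap description can be sketched in Python as follows. This is an illustrative model only, assuming a square matrix whose side is a power of two and block sizes that are powers of two; the function names are hypothetical and not taken from the thread or the linked post. Swapping B and C at block size s exchanges bit s of the row index with bit s of the column index, so using every size gives the ordinary transpose, and using a subset gives the generalised variant.

```python
def swap_blocks(m, s):
    """In every aligned 2s x 2s tile (A B | C D), exchange B and C as whole blocks."""
    n = len(m)
    out = [row[:] for row in m]
    for i in range(n):
        for j in range(n):
            if (i & s) == 0 and (j & s) != 0:
                # (i, j) lies in an upper-right block B; its partner in the
                # lower-left block C is (i + s, j - s).
                out[i][j] = m[i | s][j & ~s]
                out[i | s][j & ~s] = m[i][j]
    return out

def generalised_transpose(m, sizes):
    """Apply the block swap for each chosen block size (each a power of two)."""
    for s in sizes:
        m = swap_blocks(m, s)
    return m

matrix = [[4 * i + j for j in range(4)] for i in range(4)]
full = generalised_transpose(matrix, [1, 2])     # all sizes: ordinary transpose
partial = generalised_transpose(matrix, [2])     # only the 2x2 blocks move
```

Because each size touches a different index bit, the order in which the sizes are applied does not change the result.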

camel-cdr@CamelCdr·13 Mar

@FelixCLC_ Say you load 8-bit data and create an mpreg which masks out all negative values. Then you use the positive ones to look up some 32-bit float values and continue your computation with 32-bit values, still under the same mpreg.
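
A rough Python model of this scenario, assuming a mask with one predicate bit per element regardless of element width, and merging behaviour for inactive lanes. The helper names and the sample data are hypothetical. The point it illustrates is that the mask is produced from 8-bit data and then reused unchanged for 32-bit operations.

```python
def compare_ge_zero(vec_i8):
    """Build a hypothetical mpreg: one predicate bit per element, width-independent."""
    return [x >= 0 for x in vec_i8]

def masked_gather_f32(table, indices, mask, passthru):
    """Look up 32-bit floats for active lanes; inactive lanes keep passthru."""
    return [table[i] if m else p for i, m, p in zip(indices, mask, passthru)]

def masked_mul_f32(a, b, mask, passthru):
    """Multiply active lanes; inactive lanes keep passthru."""
    return [x * y if m else p for x, y, m, p in zip(a, b, mask, passthru)]

data_i8 = [3, -1, 0, -128, 2, 1, -7, 3]           # 8-bit inputs
table_f32 = [0.5, 1.0, 2.0, 4.0]                  # 32-bit lookup table
mpreg = compare_ge_zero(data_i8)                  # generated at 8-bit width
zeros = [0.0] * len(data_i8)

looked_up = masked_gather_f32(table_f32, data_i8, mpreg, zeros)   # used at 32-bit width
result = masked_mul_f32(looked_up, looked_up, mpreg, zeros)       # same mpreg again
```

With a per-element layout the mask carries over as-is. A layout with one bit per byte would instead need the mask widened alongside the data, which is the cost this reply is asking about.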

camel-cdr@CamelCdr·13 Mar

@FelixCLC_ I mean when you want to use the same mpreg value to predicate operations of different element widths. This is relevant if you have SIMT-style code which operates on different element widths (e.g. widen at some point, but still in the same if).

@fclc@FelixCLC_·13 Mar

Fun SIMD design criteria problem: Given mask predication register mpreg0 that is 16 bits long, and Packed Reg 0 that is 128 bits long, and that you're trying to do a lane-wise op on 16-bit entries:

camel-cdr@CamelCdr·13 Mar

@FelixCLC_ Have you considered how to deal with predication in mixed-width workloads?

@fclc@FelixCLC_·13 Mar

(Thoughts requested and very welcome)

camel-cdr@CamelCdr·11 Mar

The RVP spec is coming along: github.com/riscv/riscv-p-… Here is an untested implementation of JPEG upsample in RVP: godbolt.org/z/r5bGGPsj5 This uses the current draft intrinsics. With the overloaded ones this will be less verbose. __riscv_preinterpret is still way too long IMO.

camel-cdr@CamelCdr·10 Mar

@FelixCLC_ No

@fclc@FelixCLC_·10 Mar

SIMD ISA: Do you care about predication for data less than 8 bits?

camel-cdr@CamelCdr·9 Mar

@FUZxxl There are a good number of problems that you should be able to accelerate with gather/scatter state machines, e.g. Huffman decoding from multiple synchronization points in parallel or batched operations on a binary heap. I haven't done benchmarks yet though.

Robert Clausecker@FUZxxl·9 Mar

I want to add some examples for scatter operations to my thesis, but all I can come up with are actually bit scatters, for which I have a separate section :-S. Any cool techniques I should cover? Ideally ones that don't need conflict resolution.

camel-cdr@CamelCdr·3 Mar

@FelixCLC_ Element slide is a must. Full register bit shift is cool as well, if you expect hardware to make it perform better than emulating it with an element slide + two in-element shifts + ADD.
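
The emulation mentioned here can be sketched in Python, assuming a register modelled as a list of 64-bit elements with element 0 least significant, a left shift, and a shift amount smaller than the element width. All names are hypothetical. The sketch uses one element slide, two in-element shifts, and an add; since the two shifted parts never overlap, the add behaves like an OR.

```python
ELEN = 64                          # assumed element width in bits
ELEM_MASK = (1 << ELEN) - 1

def slide_up(v, k):
    """Element slide: element i receives element i - k, zeros enter at the bottom."""
    return [0] * k + v[:len(v) - k]

def full_shift_left(v, n):
    """Shift the whole register left by n bits, for 0 <= n < ELEN."""
    assert 0 <= n < ELEN
    carried = slide_up(v, 1)                       # one element slide
    return [((x << n) & ELEM_MASK)                 # in-element shift 1
            + (c >> (ELEN - n))                    # in-element shift 2, then ADD
            for x, c in zip(v, carried)]

# Helpers to compare against a plain big-integer shift.
def to_int(v):
    return sum(x << (ELEN * i) for i, x in enumerate(v))

def from_int(x, count):
    return [(x >> (ELEN * i)) & ELEM_MASK for i in range(count)]
```

Checking `to_int(full_shift_left(v, n))` against `(to_int(v) << n)` truncated to the register width is a quick way to confirm the model for any `n` below 64.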

@fclc@FelixCLC_·3 Mar

Is cross lane/full register shifting useful?

camel-cdr@CamelCdr·1 Mar

The problem with mixing scalar and SIMD today is that you need to be conservative with the number of scalar elements processed, because if the scalar iteration is slower than the SIMD one, the SIMD has to wait. With s1first you wouldn't have to wait if one path is faster.

camel-cdr@CamelCdr·1 Mar

This allows you to dynamically load balance different code paths, which can be especially useful if you have a parallel problem and you want to process the elements with SIMD but one or two simultaneously with scalar.

camel-cdr@CamelCdr·1 Mar

@FUZxxl A similar idea to the fast-path branch: s1first rd, rs1, rs2 sets rd to 1 if rs1 becomes ready first, or 0 if rs2 becomes ready first; the other dependency is discarded.

camel-cdr@CamelCdr·28 Feb

@FelixCLC_ You need to support unmasked operations, or at a minimum have both merging and zeroing predication. In terms of tradeoffs, I think toggleable mpreg0 (hot) predication is better than the SVE route of destructive operations. (I tried to think of a pun but failed)
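
For readers less familiar with the two predication modes named here, a small Python sketch of the difference, assuming one mask bit per lane and a hypothetical `masked_add` helper. With merging, inactive lanes keep the old destination value; with zeroing, they are cleared. An unmasked operation is the same thing with every mask bit set.

```python
def masked_add(a, b, mask, dest, zeroing):
    """Add active lanes; inactive lanes keep dest (merging) or become 0 (zeroing)."""
    inactive = [0] * len(dest) if zeroing else dest
    return [x + y if m else keep
            for x, y, m, keep in zip(a, b, mask, inactive)]

a = [1, 2, 3, 4]
b = [10, 20, 30, 40]
mask = [True, False, True, False]
old_dest = [7, 7, 7, 7]

merged = masked_add(a, b, mask, old_dest, zeroing=False)
zeroed = masked_add(a, b, mask, old_dest, zeroing=True)
unmasked = masked_add(a, b, [True] * 4, old_dest, zeroing=False)
```

Merging makes the old destination an extra input to the operation, while zeroing does not, which is one reason an ISA may want both.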

@fclc@FelixCLC_·28 Feb

@CamelCdr I’m not convinced it’s the right long term option, but it’s a reasonable approach; if you add stronger masking options and make the instructions more universal, it gets interesting. In a 32-bit space, it’s an imperfect but reasonable

@fclc@FelixCLC_·28 Feb

In between the Correct Thing in green field, and the practical thing. Reads for elements are always done from mask predication register mpreg0, but mask register generation can result in any of 1-N masks? Saves a lot of opcode encodings, but requires extra mpMovs

@fclc@FelixCLC_

Time to try and build The Good SIMD ISA

camel-cdr@CamelCdr·27 Feb

@FUZxxl @FelixCLC_ @lauriewired Ah, ok. That makes sense, especially if you were going to overlap FP and SIMD anyways.

Robert Clausecker@FUZxxl·27 Feb

@CamelCdr @FelixCLC_ @lauriewired My proposal does not touch the scalar unit, instead the objective is to remove the special case of “scalar FP,” turning scalar FP ops into packed FP ops with a special mask register.

@fclc@FelixCLC_·27 Feb

More for the RE/Security folks than anything, how much do you care about opcode adjacency? IE: the packed SIMD version of op Foo has similar encoding to the scalar version of Foo? Thinking mainly for when a random binary lands on your desk CC @lauriewired
