camel-cdr

828 posts

@CamelCdr

🐘 @[email protected]

Joined March 2024
69 Following 262 Followers
camel-cdr@CamelCdr·
@ashvardanian DC Roma II is a laptop with RVV 1.0. You can get ssh access to hardware via the RISC-V labs program: riscv.org/developers/lab… cloud-v has a milkv-jupiter instance, EPCC also has them and a bananapi bpi-f3 instance. You can download RVV 1.0 RTL and simulate it with verilator.
Ash Vardanian@ashvardanian·
@CamelCdr Yes, I should probably adjust wording. I mean something accessible in a phone, tablet, laptop, desktop… like real “consumer tech” or at least a cloud hosted VM that I can rent hourly
Ash Vardanian@ashvardanian·
My biggest open-source release! NumKong — 2'000+ SIMD kernels for mixed-precision numerics, from Float6 to Float118. Started in 2023. Opened the PR in 2024. Finally, merged this week! RISC-V, Intel AMX & AVX-512, Apple SME & SVE, WASM Relaxed SIMD. 200'000 lines of code in a 5 MB binary. Same scale as OpenBLAS. Available for C 99, C++ 23, Python 3, Rust, Swift, GoLang, & JavaScript. Int4 dot products via nibble algebra. Ozaki Float64 GEMMs on Float32 tile hardware. 6-bit and 8-bit floats back-ported to 10-year-old CPUs. 5'300x faster Geospatial metrics than GeoPy. 200x faster Kabsch than BioPython. 0 ULP where OpenBLAS hits 56... and a lot more! pip install numkong Or pull it from NPM, Crates, GitHub... and let me know what breaks 🤗 Links & highlights ⬇️
camel-cdr@CamelCdr·
@ashvardanian “no consumer silicon supporting RVV 1.0” yeah, no, there have been RVV 1.0 SBCs available since 2023.
@fclc@FelixCLC_·
Thinking about the viability of a sign preserving right shift; we already have the machinery to do this from the float side of things soft eliminates the need for 2's Comp?
camel-cdr@CamelCdr·
@FelixCLC_ @FUZxxl Logical shift shifts in zeros, arithmetic shifts in src[xlen-1], rotate shifts in src[0], a 2->1 GPR shift shifts in from src[0] of the second GPR. The latter generalizes to everything but arithmetic shifts: logical: shift2 rd, rs1, x0, n; rotate: shift2 rd, rs1, rs1, n
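A minimal C sketch of the 2->1 shift described in this post, assuming a 64-bit right shift. The function names (`shift2`, `srl`, `rotr`, `sra`) are illustrative and model the hypothetical `shift2 rd, rs1, rs2, n` only; they are not taken from any ISA specification.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical 2->1 right shift: the bits vacated at the top of `lo`
   are filled from the low bits of `hi`. */
static uint64_t shift2(uint64_t lo, uint64_t hi, unsigned n)
{
    assert(n < 64);
    if (n == 0)
        return lo;                       /* avoids an undefined 64-bit shift */
    return (lo >> n) | (hi << (64 - n));
}

/* logical: second operand is zero (x0) */
static uint64_t srl(uint64_t x, unsigned n)  { return shift2(x, 0, n); }

/* rotate: second operand is the source itself */
static uint64_t rotr(uint64_t x, unsigned n) { return shift2(x, x, n); }

/* arithmetic: needs a sign-filled second operand, which costs an extra
   operation; this is the case the 2->1 form does not cover directly */
static uint64_t sra(uint64_t x, unsigned n)
{
    uint64_t sign = (x >> 63) ? ~UINT64_C(0) : 0;
    return shift2(x, sign, n);
}
```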
Robert Clausecker@FUZxxl·
We implement transposition of square matrices by recursively transposing block matrices of varying sizes ((AB|CD) → (AC|BD)). If we transpose an arbitrary set of block matrix sizes instead of all of them we get generalised transposition. Has this been studied before?
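One way to read the idea in this post, sketched in C for a row-major n x n matrix with n a power of two. The function name and the `levels` bitmask parameter are assumptions made for illustration: bit value 2^j in `levels` selects the block swap at block size 2^j, and `levels = n - 1` selects every level, which gives the ordinary transpose.

```c
#include <assert.h>
#include <stddef.h>

/* For each selected level, swap the off-diagonal blocks (B and C in
   (AB|CD)): element (r, c) with bit j of r clear and bit j of c set
   trades places with (r | 2^j, c & ~2^j). */
static void generalized_transpose(int *m, size_t n, size_t levels)
{
    assert(n != 0 && (n & (n - 1)) == 0);   /* n must be a power of two */
    for (size_t bit = 1; bit < n; bit <<= 1) {
        if (!(levels & bit))
            continue;                        /* this block size is skipped */
        for (size_t r = 0; r < n; r++) {
            if (r & bit)
                continue;
            for (size_t c = 0; c < n; c++) {
                if (!(c & bit))
                    continue;
                size_t a = r * n + c;
                size_t b = (r | bit) * n + (c & ~bit);
                int t = m[a];
                m[a] = m[b];
                m[b] = t;
            }
        }
    }
}
```

Each level exchanges one bit between the row index and the column index, so choosing a subset of levels exchanges only a subset of index bits.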
camel-cdr@CamelCdr·
@FelixCLC_ Say you load 8-bit data and create an mpreg which masks out all negative values. Then you use the positive ones to look up some 32-bit float values and continue your computation with 32-bit values, still under the same mpreg.
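The scenario in this post can be emulated in scalar C. This is only a sketch under stated assumptions: 16 lanes, one mask bit per element index regardless of element width, and merging behavior for inactive lanes. The names `mask_nonnegative` and `lookup_and_scale` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { LANES = 16 };

/* Build the mask from 8-bit data: bit i is set when src[i] >= 0. */
static uint16_t mask_nonnegative(const int8_t *src)
{
    uint16_t m = 0;
    for (size_t i = 0; i < LANES; i++)
        if (src[i] >= 0)
            m |= (uint16_t)(1u << i);
    return m;
}

/* Use the same mask for an 8-bit-indexed lookup and for the 32-bit
   float arithmetic that follows. `table` must have 128 entries.
   Inactive lanes keep their previous dst value. */
static void lookup_and_scale(float *dst, const int8_t *src,
                             const float *table, float scale)
{
    assert(dst && src && table);
    uint16_t m = mask_nonnegative(src);
    for (size_t i = 0; i < LANES; i++) {
        if (!((m >> i) & 1u))
            continue;
        float v = table[src[i]];   /* 8-bit index, 32-bit element */
        dst[i] = v * scale;        /* 32-bit op under the same mask */
    }
}
```

With one bit per element the mask carries over unchanged when the element width changes; a layout with one mask bit per byte would need a conversion step at the widening point.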
camel-cdr@CamelCdr·
@FelixCLC_ I mean when you want to use the same mpreg value to predicate operations of different element widths. This is relevant if you have SIMT-style code which operates on different element widths (e.g. widen at some point, but still in the same if).
@fclc@FelixCLC_·
Fun SIMD design criteria problem: Given mask predication register mpreg0 that is 16 bits long, and Packed Reg 0 that is 128 bits long, and that you're trying to do a lane-wise op on 16-bit entries:
camel-cdr@CamelCdr·
@FelixCLC_ Have you considered how to deal with predication in mixed-width workloads?
@fclc@FelixCLC_·
(Thoughts requested and very welcome)
camel-cdr@CamelCdr·
The RVP spec is coming along: github.com/riscv/riscv-p-… Here is an untested implementation of JPEG upsample in RVP: godbolt.org/z/r5bGGPsj5 This uses the current draft intrinsics. With the overloaded ones this will be less verbose. __riscv_preinterpret is still way too long IMO.
@fclc@FelixCLC_·
SIMD ISA: Do you care about predication for data less than 8 bits?
camel-cdr@CamelCdr·
@FUZxxl There are a good number of problems that you should be able to accelerate with gather/scatter state machines, e.g. Huffman decoding from multiple synchronization points in parallel, or batched operations on a binary heap. I haven't done benchmarks yet, though.
Robert Clausecker@FUZxxl·
I want to add some examples for scatter operations to my thesis, but all I can come up with are actually bit scatters, for which I have a separate section :-S. Any cool techniques I should cover? Ideally ones that don't need conflict resolution.
camel-cdr@CamelCdr·
@FelixCLC_ Element slide is a must; full register bit shift is cool as well, if you expect hardware to make it perform better than emulating it with an element slide + two in-element shifts + ADD.
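The emulation mentioned in this post can be sketched in C. The assumptions here are illustrative: a 128-bit register modeled as four 32-bit elements with element 0 least significant, and a left shift. The type `vreg` and both function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

enum { ELEMS = 4 };
typedef struct { uint32_t e[ELEMS]; } vreg;

/* Element slide: move every element up by k positions, filling with zeros. */
static vreg slide_up(vreg v, unsigned k)
{
    vreg r = {{0}};
    for (unsigned i = k; i < ELEMS; i++)
        r.e[i] = v.e[i - k];
    return r;
}

/* Full-register left shift by n bits: whole-element part via a slide,
   then one more slide plus two in-element shifts and an ADD. */
static vreg shift_left_bits(vreg v, unsigned n)
{
    assert(n < 32 * ELEMS);
    vreg a = slide_up(v, n / 32);
    unsigned part = n % 32;
    if (part == 0)
        return a;
    vreg b = slide_up(a, 1);      /* neighbours that supply the carried bits */
    vreg r;
    for (unsigned i = 0; i < ELEMS; i++)
        r.e[i] = (a.e[i] << part) + (b.e[i] >> (32 - part));
    return r;
}
```

The two shifted terms never have set bits in the same position, so ADD and OR give the same result here.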
@fclc@FelixCLC_·
Is cross lane/full register shifting useful?
camel-cdr@CamelCdr·
The problem with mixing scalar and SIMD today is that you need to be conservative with the number of scalar elements processed, because if the scalar iteration is slower than the SIMD one, the SIMD has to wait. With s1first you wouldn't have to wait if one path is faster.
camel-cdr@CamelCdr·
This allows you to dynamically load-balance different code paths, which can be especially useful if you have a parallel problem and you want to process the elements with SIMD, but one or two simultaneously with scalar.
camel-cdr@CamelCdr·
@FUZxxl A similar idea to the fast-path branch: s1first rd, rs1, rs2 sets rd to 1 if rs1 becomes ready first, or 0 if rs2 becomes ready first; the other dependency is discarded.
camel-cdr@CamelCdr·
@FelixCLC_ You need to support unmasked operations, or at a minimum have both merging and zeroing predication. In terms of tradeoffs, I think toggleable mpreg0 (hot) predication is better than the SVE route of destructive operations. (I tried to think of a pun but failed)
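For readers unfamiliar with the two predication modes named in this post, a scalar C sketch of the difference follows. The lane count and function names are assumptions for illustration, not part of any proposed ISA.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { LANES = 8 };

/* Merging predication: inactive lanes keep the old destination value. */
static void add_merging(int32_t *dst, const int32_t *a, const int32_t *b,
                        uint8_t mask)
{
    assert(dst && a && b);
    for (size_t i = 0; i < LANES; i++)
        if ((mask >> i) & 1u)
            dst[i] = a[i] + b[i];
}

/* Zeroing predication: inactive lanes are written with zero. */
static void add_zeroing(int32_t *dst, const int32_t *a, const int32_t *b,
                        uint8_t mask)
{
    assert(dst && a && b);
    for (size_t i = 0; i < LANES; i++)
        dst[i] = ((mask >> i) & 1u) ? a[i] + b[i] : 0;
}
```

Merging makes the old destination an extra input to the operation, while zeroing does not; that is one reason an ISA may want both.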
@fclc@FelixCLC_·
@CamelCdr I’m not convinced it’s the right long-term option, but it’s a reasonable approach; if you add stronger masking options and make the instructions more universal, it gets interesting. In a 32-bit space, it’s an imperfect but reasonable
@fclc@FelixCLC_·
In between the Correct Thing in green field, and the practical thing. Reads for elements are always done from mask predication register mpreg0, but mask register generation can result in any of 1-N masks? Saves a lot of opcode encodings, but requires extra mpMovs
@fclc@FelixCLC_

Time to try and build The Good SIMD ISA

Robert Clausecker@FUZxxl·
@CamelCdr @FelixCLC_ @lauriewired My proposal does not touch the scalar unit, instead the objective is to remove the special case of “scalar FP,” turning scalar FP ops into packed FP ops with a special mask register.
@fclc@FelixCLC_·
More for the RE/Security folks than anything, how much do you care about opcode adjacency? IE: the packed SIMD version of op Foo has similar encoding to the scalar version of Foo? Thinking mainly for when a random binary lands on your desk CC @lauriewired