Apex Compute

2 posts


@apexcompute

Check our GitHub: https://t.co/owtsUfnqfc

Joined January 2025
21 Following · 246 Followers

Apex Compute reposted
Hasan @hasanunlu9
We’ve released hardware v1.1 of the Apex Compute Unified Engine. If you have one of our FPGAs, please update to the latest bin file version in the repo. We’ve also added many new models. Here are the updates in this release; many more hardware optimizations and features are coming in upcoming releases.

RTL updates
• Unified activation pipeline — GELU, SiLU, sigmoid, and tanh implemented via a single (a+x)*sigmoid(-b*x) hardware block with configurable a and b parameters. ReLU is included as well.
• Software reset added.
• Add-reduce block optimization — switched to a multi-input FP adder; latency reduced from 21 to 12 cycles, >45% reduction in total LUT/FF usage.
• Argmax — top-4 selection for MoE support.
• Instruction set improvements — the instruction queue was replaced with a direct-mapped instruction cache that supports absolute and relative jumps. This significantly reduces microcode size for matmul kernels and leaves almost no instruction DMA overhead.
• Timing — Kintex-7 frequency increased to 194 MHz.

Other FPGAs / multi-engine support
• Bittware (Kintex UltraScale 15P) — new project, 400 MHz engine.
• Alveo U50 — stabilized build, HBM AXI tuning; 280 MHz engine, 450 MHz HBM, 8 engines synthesized.
• Kintex-7 — dual-engine build, 194 MHz target, updated address map.

Test infrastructure
• Fmax debug test
• Image patching test

New models
• Llama 3.2 1B
• GPT-2
• Qwen 3 1.7B
• SmolVLM2 500M
• Parakeet
• Swin Transformer

Apex Compute - Unified Engine repo: github.com/apex-compute/u…
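The unified activation form (a+x)*sigmoid(-b*x) can be modeled in software. The (a, b) settings below are illustrative assumptions rather than the actual RTL parameters: SiLU is exactly x*sigmoid(x), and x*sigmoid(1.702x) is a well-known sigmoid approximation of GELU.

```python
import math

def unified_act(x: float, a: float, b: float) -> float:
    # Software model of the unified activation block:
    # out = (a + x) * sigmoid(-b * x), with configurable a and b.
    return (a + x) * (1.0 / (1.0 + math.exp(b * x)))

def silu(x: float) -> float:
    # SiLU(x) = x * sigmoid(x)  ->  a = 0, b = -1
    return unified_act(x, a=0.0, b=-1.0)

def gelu_approx(x: float) -> float:
    # Sigmoid approximation of GELU: x * sigmoid(1.702*x)  ->  a = 0, b = -1.702
    return unified_act(x, a=0.0, b=-1.702)
```

How sigmoid and tanh map onto the same block (and the exact hardware parameterization) is not spelled out in the post, so only the two activations above are shown.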
1 reply · 2 reposts · 11 likes · 1.1K views
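The top-4 argmax for MoE routing mentioned in the release notes can be sketched functionally. The hardware presumably uses a streaming comparator structure; this is only a software reference model.

```python
import heapq

def top4(scores):
    """Return indices of the 4 largest router scores (MoE expert selection).
    Functional model only; the RTL implementation is not described in the post."""
    # nlargest on (score, index) pairs orders by score, highest first
    return [i for _, i in heapq.nlargest(4, ((s, i) for i, s in enumerate(scores)))]
```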
Apex Compute reposted
Hasan @hasanunlu9
After 8+ years on the Tesla Autopilot team and 3 years at Intel, I started @apexcompute to design a new architecture for efficient AI inference. For the past 9 months we’ve been building our custom inference accelerator. Today we’re releasing Unified Engine v1.

Last June we raised our seed round with @maxitechinc, DeepFin Research, @Soma_Capital, and an incredible group of angel investors. In less than 9 months we completed our RTL architecture and brought our first pre-silicon prototype to life on FPGA.

Our architecture combines a systolic array and vector processing in a single compute engine with multiple architectural optimizations, achieving very high FLOPs utilization. A single engine is extremely lean: it uses fewer than 90K LUTs and 1 MB of block RAM. It may also be one of the smallest-logic-footprint compute engines developed so far.

Unified Engine v1 supports:
- matrix-matrix multiplication (~95% FLOPs utilization)
- softmax (~90% FLOPs utilization)
- broadcast and element-wise operations
- RMSNorm / LayerNorm
- block quantization/dequantization (fp4, int4)
- multi-engine synchronization
and many other operations. We even implemented memory-efficient attention similar to FlashAttention, reaching ~90% FLOPs utilization.

Full benchmarks and the software stack are available on our GitHub: github.com/apex-compute/u…

We have a basic compiler written in Python that supports PyTorch tensors directly, making it easy to test and transfer tensors between the accelerator and host in bf16, fp4, and int4 formats.

Our FPGA prototype can already run LLM inference and outperform the NVIDIA Jetson Orin Nano, even on a mid-tier FPGA setup (6.4x lower memory bandwidth, 18% slower clock speed, at 4.5 watts). Check the side-by-side comparison video below.

Our GitHub includes low-level operator implementations, examples for tiled matrix multiplication, operation chaining, tensor parallelism, an attention kernel, and a full Gemma 3 1B model implementation. Many more models (Vision Transformers and VLA) are coming soon.

Our accelerator IP is AXI-ready for deployment on any AMD (Xilinx) FPGA platform today. Even better, our two-engine prototype runs on an entry-level AMD (Xilinx) FPGA as a PCIe accelerator card. You can purchase it for $50 at buy.stripe.com/6oUaEQf6365bgA… to experiment with our pre-silicon prototype on your desktop PC or Raspberry Pi 5. We will release hardware bitstream updates as the architecture gains new features. More to come soon!

We are expanding our team and looking for compiler engineers and floating-point hardware design engineers. If you're interested, please send me a DM.
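The block quantization/dequantization listed above can be illustrated with a generic symmetric int4 scheme. The actual hardware block size, fp4/int4 encodings, and rounding mode are not specified in the post, so everything here is an assumption for illustration.

```python
def quantize_int4_block(block):
    """Symmetric per-block int4 quantization (illustrative sketch only).
    Returns (scale, list of ints clamped to the int4 range [-8, 7])."""
    amax = max(abs(v) for v in block) or 1.0   # avoid divide-by-zero on all-zero blocks
    scale = amax / 7.0
    q = [max(-8, min(7, round(v / scale))) for v in block]
    return scale, q

def dequantize_int4_block(scale, q):
    # Reconstruct approximate values from the shared scale and int4 codes
    return [scale * v for v in q]
```

Per-block scales like this keep quantization error local to each block, which is why block formats are common for weight storage at 4-bit precision.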
29 replies · 39 reposts · 386 likes · 36.7K views
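The memory-efficient attention mentioned in the post (similar to FlashAttention) rests on the online-softmax trick: scores are processed in blocks with a running max and normalizer, so the full score row is never materialized. Below is a single-query NumPy sketch of that technique; the block size and layout are simplifications, not the hardware design.

```python
import numpy as np

def streaming_attention(q, K, V, block=2):
    """FlashAttention-style single-query attention via online softmax.
    K: (n, d) keys, V: (n, dv) values, q: (d,) query.
    Processes K/V in blocks, rescaling the running accumulator whenever
    a new score maximum is seen. Illustrative model only."""
    d = q.shape[-1]
    m = -np.inf                                   # running max of scores
    l = 0.0                                       # running softmax normalizer
    acc = np.zeros_like(V[0], dtype=np.float64)   # running weighted sum of values
    for start in range(0, K.shape[0], block):
        s = K[start:start + block] @ q / np.sqrt(d)   # scores for this block
        m_new = max(m, float(s.max()))
        p = np.exp(s - m_new)                         # block weights, stabilized
        correction = np.exp(m - m_new)                # rescale previous partials
        l = l * correction + float(p.sum())
        acc = acc * correction + p @ V[start:start + block]
        m = m_new
    return acc / l
```

The result matches ordinary softmax attention exactly (up to floating-point rounding), while only one block of scores lives in memory at a time.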