Robert Miller (🦄,🦄)

@RobertMiller
Data Gerd (Geek-Nerd), Cloud Architect, MS SQL DBA/Architect/BI juggler, Big Data (Streaming), Full-Stack Developer, Chesapeake Bay Retrievers, and electronics.

BOOM! MAJOR AI SPEEDUP! Hot Rod AI: 100x faster inference, 100,000x less power! Reviving Analog Circuits: A Leap Toward Ultra-Efficient AI with In-Memory Attention

I got my start in analog electronics as a kid and always thought analog computers would make a comeback. The neural networks of the 1960s were analog machines: voltage-based circuits rather than clocked binary logic.

Analog is Faster Than Digital

At the core of large language models lies the transformer architecture, where self-attention sifts through vast sequences of data to predict the next word or token. On conventional GPUs, shuttling data between memory caches and processing units devours time and energy, bottlenecking the entire system: the clocked movement of bits in and out of memory and registers accounts for the vast majority (>90%) of the time and energy overhead. But now a groundbreaking study proposes a custom in-memory computing setup that could slash these inefficiencies, potentially reshaping how we deploy generative AI.

The innovation centers on "gain cells": emerging charge-based analog memories that double as both storage and computation engines. Unlike digital GPUs, which laboriously load token projections from cache into SRAM for each generation step, this architecture keeps the data where the math happens: right ON THE CHIP! And because analog signals flow continuously instead of being switched on and off by a binary clock, the array computes at essentially wire speed. By leveraging parallel analog dot-product operations, the design computes self-attention natively, sidestepping the data movement that plagues GPU hardware (toy sketch below).

To bridge the gap between ideal digital models and the noisy realities of analog circuits, the researchers devised a clever initialization algorithm. It adapts pre-trained LLMs, such as GPT-2, without the need for full retraining, preserving performance despite non-idealities like voltage drift and limited precision.

The results are nothing short of staggering! Simulations show the system slashing attention latency by two orders of magnitude (100x faster token generation) while curbing energy use by a jaw-dropping five orders of magnitude: 100,000x less power-hungry than GPU baselines. For context, this could mean running a full LLM on a device no larger than a deck of cards, with none of the thermal throttling or grid-straining demands of today's data centers. The approach targets the attention block specifically (the transformer's energy hog), but it also points toward broader integration with other in-memory techniques to turbocharge the entire model pipeline.

Analog tech isn't pie-in-the-sky quantum wizardry; it's grounded in mature, decades-old electronics theory, with gain cells already prototyped in labs. The remaining engineering is straightforward: noise tolerances, scaling up arrays of cells, and fabricating at microchip densities, mostly tweaks to existing CMOS processes for analog fidelity. From there, full ecosystem integration, including software stacks for model adaptation, could happen in a year, disrupting GPU dominance sooner than skeptics predict. Risks are low, though hybrid digital-analog interfaces could introduce unforeseen bugs; those can be rapidly iterated on and addressed.

This isn't just hardware tinkering; it's a philosophical pivot back to AI's analog origins, where computation flows continuously rather than ticking in discrete cycles. In-memory attention could democratize AI power, bringing low-power, lightning-fast AI to even the smallest devices, not as a luxury but as an inevitability.
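To make the data-movement point concrete, here's a minimal NumPy sketch of attention where both the score and value reductions are "in-memory" analog dot products with read noise and rail clipping. This is my toy numerical model, not the paper's circuit: the names (`analog_dot`, `in_memory_attention`), the Gaussian noise level, and keeping softmax digital are all my assumptions.

```python
# Toy model of analog in-memory attention (my sketch, not the paper's circuit).
# Assumptions (mine): keys/values sit as charge in a gain-cell array, each
# analog dot product picks up Gaussian read noise, and outputs saturate at the
# supply rails like a real amplifier would.
import numpy as np

rng = np.random.default_rng(0)

def analog_dot(matrix, vector, noise_sigma=0.02, v_rail=1.0):
    """One parallel in-memory step: every row of `matrix` (stored charges)
    multiplies `vector` (applied voltages) simultaneously. Digital hardware
    would instead stream these rows through SRAM one tile at a time."""
    ideal = matrix @ vector                                # what the physics computes
    noisy = ideal + rng.normal(0.0, noise_sigma, ideal.shape)
    return np.clip(noisy, -v_rail, v_rail)                 # amplifier saturation

def in_memory_attention(q, K, V):
    """Scores and the weighted value sum are both analog dot products, so K and
    V never leave the array: that is the whole speed/energy win."""
    scores = analog_dot(K, q)                              # all tokens in one shot
    weights = np.exp(scores)
    weights /= weights.sum()                               # softmax kept digital here
    return analog_dot(V.T, weights)

d, n = 64, 128                                             # head dim, context length
K = rng.normal(0.0, 0.1, (n, d))
V = rng.normal(0.0, 0.1, (n, d))
q = rng.normal(0.0, 0.1, d)
print(in_memory_attention(q, K, V)[:4])
```

The point the toy makes: K and V are touched exactly once, in place, per generated token; on a GPU the same matrices would be streamed through caches and SRAM every single step.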
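And here's a hedged sketch of what a retraining-free adaptation step could look like, in the spirit of the paper's initialization algorithm. The closed-form gain calibration below is my stand-in, not the authors' method; the only idea it demonstrates is that mapping a pre-trained model onto noisy hardware can be a cheap fitting problem rather than a full retraining run.

```python
# Retraining-free adaptation sketch (my stand-in for the paper's algorithm).
# Instead of fine-tuning weights, fit one output gain `s` so the noisy,
# clipped analog layer matches the ideal digital layer on a few probe inputs:
# closed-form least squares, no gradients, no retraining.
import numpy as np

rng = np.random.default_rng(1)

def analog_layer(W, x, sigma=0.05, v_rail=1.0):
    """Noisy, saturating analog matrix-vector product (toy non-idealities)."""
    return np.clip(W @ x + rng.normal(0.0, sigma, W.shape[0]), -v_rail, v_rail)

def calibrate_gain(W, probes):
    """Least-squares s minimizing sum ||W @ x - s * analog_layer(W, x)||^2."""
    d = np.concatenate([W @ x for x in probes])               # ideal outputs
    a = np.concatenate([analog_layer(W, x) for x in probes])  # analog outputs
    return float(d @ a) / float(a @ a)

W = rng.normal(0.0, 0.2, (32, 64))                 # stand-in pre-trained weights
probes = [rng.normal(0.0, 0.3, 64) for _ in range(16)]
s = calibrate_gain(W, probes)
print(f"calibrated gain: {s:.3f}")                 # rises above 1.0 as clipping
                                                   # attenuates the analog output
```

The real adaptation also has to cope with the analog non-idealities inside the attention path itself, but the shape of the fix is the same: measure, fit, deploy.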
Most have no idea how big this is: it is the biggest shift in AI since the invention of LLMs. The world will struggle to find truly experienced analog engineers; most are gone. In my garage I will have a test analog CMOS gain cell running on off-the-shelf parts in the next few days; if Radio Shack were still around, I would have done it today. I suspect I can scale to a prototype AI model in a few weeks. PAPER: arxiv.org/abs/2409.19315



