Alen Capalik

38 posts


@capalik

CTO and Chief AI Officer of https://t.co/ern34Oy3Iv & https://t.co/W8Q1cOhKyj. Founder of CounterTack (https://t.co/55foNoBYv3) & https://t.co/snWLnZolVI. Hacker, The Hidden Layer secrets hunter, AI whisperer

Newport Beach, CA · Joined April 2009
2.1K Following · 189 Followers
Alen Capalik@capalik·
@TheCinesthetic It was beyond amazing. I remember people going NUTS! The greatest movie preview EVER!
cinesthetic.@TheCinesthetic·
Imagine being in a theater in 1999 and seeing this for the first time.
Danielle Mejia@DanielleMejiaCA·
Can you take a Guess?
Alen Capalik@capalik·
Rapid prototyping at its absolute finest 🔥
Andrej Karpathy@karpathy

I gave a talk at GPU MODE workshop last week on llm.c:
- the origin story of llm.c: being naked in the world without PyTorch and having to re-invent Array, Autograd, Device, Dtype, Compile, Distributed
- how to port a PyTorch layer: 1) write it in explicit PyTorch, then 2) write the backward pass, 3) port forward & backward pass to C, 4) string all the layers together, achieving one file of C with no dependencies that compiles and runs ~instantly, where all memory is pre-planned and allocated a single time: fully deterministic, portable code that can run on a potato or a von Neumann probe
- how most of llm.c was built at 1am-7am on a water villa porch in the Maldives, and why this is the recommended way to develop software
- convert all of it to run in CUDA on GPU in fp32, port matmul to cuBLAS, port attention to cuDNN flash-attention, introduce bfloat16 mixed precision
- introduce many more optimizations and features like kernel fusions, Packed128, stochastic rounding, full determinism
- add multi-GPU training, NCCL, sharded optimizer; add multi-node with MPI or file system or socket
- reproduce GPT-2 (1.6B) on one 8XH100 node in 24 hours for $672 in llm.c, achieving (at the time) 29% less memory, 19% faster training than PyTorch nightly, and much faster compile & run
- how open source development attracts Avengers from the internet
- port to training Llama 3 is imminent (branch exists); many other notable forks
- last thought: how software abstractions like Python/PyTorch and everything else really exist only because humans are finite in knowledge, IQ and attention, and how with increasing AI capability LLMs may export custom binaries like llm.c for any application directly, tearing apart and refactoring all abstractions as needed. <|endoftext|> More links in reply
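Steps 1-2 of the recipe in the quoted thread (an explicit forward pass, then a hand-written backward pass) can be sketched for a single linear layer. This is an illustrative NumPy toy, not code from llm.c; the function names are made up, and the gradient is checked against a finite difference:

```python
import numpy as np

def linear_forward(x, W, b):
    # x: (batch, d_in), W: (d_in, d_out), b: (d_out,)
    return x @ W + b

def linear_backward(x, W, dout):
    # dout: gradient of the loss w.r.t. the layer output, shape (batch, d_out)
    dx = dout @ W.T        # gradient w.r.t. the input
    dW = x.T @ dout        # gradient w.r.t. the weights
    db = dout.sum(axis=0)  # gradient w.r.t. the bias
    return dx, dW, db

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 3))
W = rng.standard_normal((3, 2))
b = rng.standard_normal(2)

out = linear_forward(x, W, b)
# Use a loss of sum(out), so dout is all ones
dx, dW, db = linear_backward(x, W, np.ones_like(out))

# Finite-difference check on one weight entry
eps = 1e-5
Wp = W.copy()
Wp[0, 0] += eps
num = (linear_forward(x, Wp, b).sum() - out.sum()) / eps
assert abs(num - dW[0, 0]) < 1e-4
```

Once a layer's backward pass is written out explicitly like this, porting both functions to C is mostly a matter of replacing the array operations with loops over preallocated buffers.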

Adrian | The Web Scraping Guy@adrian_horning_·
I just scraped 2.8 million companies from Crunchbase 🤯 Name, website, Semrush stats, etc. I'm giving the entire thing away in the next 24 hours. Comment "crunchbase" and I'll send it to you. Make sure DMs are open
Alen Capalik@capalik·
23+ years listening to #SomaFM. Those guys got me through a lot of marathon #hacking sessions 😉. Still going strong. Even with all the choices we have today for music listening, I always find myself going back to the original. somafm.com
Alen Capalik retweeted
Steven Adler@sjgadler·
Think you can tell if a social media account is a bot? What about as AI gets better? A new paper—co-authored with researchers from ~20 orgs, & my OpenAI teammates Zoë Hitzig and David Schnurr—asks this question: What are AI-proof ways to tell who’s real online? (1/n)
Alen Capalik@capalik·
I've been using Linux for 30 years. For a long time I used xterm as my terminal in Linux. I have never ever gone through and read its man page 🤣. Has anyone read the whole thing? #Linux
Alen Capalik@capalik·
This kind of advancement and research, if true, is the future. AI's only ceiling right now is efficiency: it's power hungry.
Rohan Paul@rohanpaul_ai

This is really a 'WOW' paper. 🤯

Claims that MatMul operations can be completely eliminated from LLMs while maintaining strong performance at billion-parameter scales, and that by utilizing an optimized kernel during inference, their model's memory consumption can be reduced by more than 10× compared to unoptimized models. 🤯

'Scalable MatMul-free Language Modeling'

Concludes that it is possible to create the first scalable MatMul-free LLM that achieves performance on par with state-of-the-art Transformers at billion-parameter scales.

📌 The proposed MatMul-free LLM replaces MatMul operations in dense layers with ternary accumulations using weights constrained to {-1, 0, +1}. This reduces computational cost and memory utilization while preserving network expressiveness.

📌 To remove MatMul from self-attention, the Gated Recurrent Unit (GRU) is optimized to rely solely on element-wise products, creating the MatMul-free Linear GRU (MLGRU) token mixer. The MLGRU simplifies the GRU by removing hidden-state-related weights, enabling parallel computation, and replacing the remaining weights with ternary matrices.

📌 For MatMul-free channel mixing, the Gated Linear Unit (GLU) is adapted to use BitLinear layers with ternary weights, eliminating expensive MatMuls while maintaining effectiveness in mixing information across channels.

📌 The paper introduces a hardware-efficient fused BitLinear layer that optimizes RMSNorm and BitLinear operations. By fusing these operations and utilizing shared memory, training speed improves by 25.6% and memory consumption is reduced by 61% over an unoptimized baseline.

📌 Experimental results show that the MatMul-free LLM achieves competitive performance compared to Transformer++ baselines on downstream tasks, with the performance gap narrowing as model size increases. The scaling-law projections suggest the MatMul-free LLM can outperform Transformer++ in efficiency, and potentially in loss, when scaled up.

📌 A custom FPGA accelerator is built to exploit the lightweight operations of the MatMul-free LLM. The accelerator processes billion-parameter-scale models at 13 W beyond human-readable throughput, demonstrating the potential for brain-like efficiency in future lightweight LLMs.
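The core trick the quoted thread describes, replacing a dense MatMul with ternary accumulation over weights constrained to {-1, 0, +1}, can be sketched in a few lines of NumPy. This is an illustrative toy, not the paper's kernel; `ternary_matmul` is a made-up name, and a real implementation would pack the ternary weights and fuse the adds on hardware:

```python
import numpy as np

def ternary_matmul(x, W):
    # x: (batch, d_in); W: (d_in, d_out) with entries in {-1, 0, +1}.
    # With ternary weights, each output column is just a sum of some
    # input columns minus a sum of others: no multiplications needed.
    out = np.zeros((x.shape[0], W.shape[1]))
    for j in range(W.shape[1]):
        plus = W[:, j] == 1
        minus = W[:, j] == -1
        out[:, j] = x[:, plus].sum(axis=1) - x[:, minus].sum(axis=1)
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 5))
W = rng.integers(-1, 2, size=(5, 3))  # ternary weights in {-1, 0, +1}

# Accumulation-only result matches the ordinary matmul exactly
assert np.allclose(ternary_matmul(x, W), x @ W)
```

The memory win the paper claims comes from the same constraint: each weight needs under two bits instead of 16 or 32, and the multiplier units disappear from the hardware path.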

Alen Capalik@capalik·
@BitValentine They do not at all, yet. Obviously, just reading that article and looking at the generated image of just a tiny portion of the human brain, you can see the complexity of it. Artificial neural networks are nowhere near that. However, we can learn from it and design accordingly :).
Alen Capalik@capalik·
I don't think we're there yet with artificial neural networks :). #AI #ArtificialIntelligence #ML #MachineLearning “The word ‘fragment’ is ironic,” Lichtman says. “A terabyte is, for most people, gigantic, yet a fragment of a human brain—just a minuscule, teeny-weeny little bit of human brain—is still thousands of terabytes.” nih.gov/news-events/ni…