anukool

176 posts


anukool

@anukools

Technology Leader | Cloud, Data and AI. Passionate about Quantum computing, Thinker's Toolkit and Dan Ariely | "Tweets are personal"

The Netherlands · Joined June 2009
395 Following · 142 Followers
anukool retweeted
Vivek Bhatnagar@VibrantVivek·
Mind-blowing, getting goosebumps as I watch, the sheer perfection! "What a role model" is an understatement I think🙏. She is the epitome of unwavering determination. Congratulations #SheetalDevi 🇮🇳, THANK YOU for everything! #AsianParaGames2023
THE SKIN DOCTOR@theskindoctor13

Sheetal, a 16-year-old teen from Jammu, was born without arms due to phocomelia. She is the first female archer without arms to compete internationally. Two gold, one silver in Asian Para Games, and one silver in World Para Games. What a role model! twitter.com/TheKhelIndia/s…

0 replies · 3 reposts · 3 likes · 447 views
anukool retweeted
Bindu Reddy@bindureddy·
Making LLMs More Accessible To The GPU Poor - Flash Attention and Flash Decoding

By now it's common knowledge that LLMs take boatloads of money and tons of compute to train and host. Fundamentally, LLMs require a hefty amount of memory, which increases quadratically with the length of the sequences they process. As models grow, capturing long-range dependencies becomes increasingly challenging. LLMs also have a gargantuan number of parameters, which demands significant computational resources and sophisticated optimization techniques to train efficiently.

TLDR; each of these challenges can impede the training, deployment, or real-time responsiveness of LLMs, especially as they grow in size or are tasked with processing longer sequences of data. One of the biggest complaints about GPT-4 is its latency.

Flash Attention digs into the core of transformer architectures to tweak the attention mechanism for better scaling, particularly catering to the needs of Large Language Models (LLMs). LLMs hog High Bandwidth Memory (HBM) to store, read, and write keys, queries, and values. Flash Attention is IO-aware: it is finely tuned to optimize the Input/Output (IO) operations between the GPU's High Bandwidth Memory (HBM) and on-chip SRAM, minimizing the number of memory reads and writes.

By employing a technique known as tiling, Flash Attention segments the attention computation into smaller, more manageable chunks, reducing the memory burden. It's like breaking down a hefty tome into smaller booklets, making it easier to handle. The algorithm also reorders the attention computation in a way that enhances efficiency, ensuring that computational resources are utilized in the best way possible. It's like rearranging the assembly line in a factory for optimal performance.

Unlike the standard attention mechanism, whose memory footprint scales quadratically with sequence length, Flash Attention's memory footprint scales linearly. This is a major leap in optimizing memory and computational resources for LLMs, which are often required to process long sequences of data. Amidst all the optimization, Flash Attention doesn't compromise on the quality of the attention computation: it provides exact attention, which is crucial for maintaining the model's performance. With Flash Attention you get computational efficiency without trading off accuracy.

Flash Attention significantly reduces training and inference times. This is extremely important for production traffic, where latency and response times matter. This technique has made it possible for the GPU poor to work on and fine-tune LLMs. With Flash Attention, fine-tuning LLMs with a technique called LoRA becomes more efficient and uses less memory than training the same LoRA adapter with the standard attention mechanism. Using Flash Attention in fine-tuned Llama models likewise reduces the time and memory required for inference, which makes it possible to deploy them on more devices and make them more accessible to users.

Earlier this month, Flash Decoding was introduced. This technique takes Flash Attention a step further by specifically targeting long-context inference, introducing an additional parallelization dimension based on the length of the keys/values sequence. Flash Decoding helps with parallelization by splitting the keys/values into smaller chunks and computing the attention for each chunk in parallel.

This strategy is tailored to harness the GPU's power fully, especially when the batch size is small but the context length is large. Flash Decoding orchestrates a three-step workflow: splitting the keys/values, computing attention for each split in parallel using Flash Attention, and a final reduction step using log-sum-exp to combine the partial results into the actual output. These three steps are orchestrated so that the attention outputs remain exact despite the parallelization.

Flash Attention has already made a big difference in the open-source world. Flash Decoding is bound to be hugely helpful as well. These clever techniques will reduce computational costs, make LLMs accessible and increase LLM speeds while not sacrificing performance 👏👏
Bindu Reddy tweet media
13 replies · 75 reposts · 341 likes · 58.3K views
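The tiling idea the post above describes can be illustrated with a small, hedged sketch. This is not the actual CUDA kernel from the FlashAttention project; it is a plain NumPy illustration of how processing keys/values block by block with a running ("online") softmax yields exact attention without ever materializing the full L×L score matrix. The function name `tiled_attention`, the block size, and the tensor shapes are illustrative assumptions.

```python
# Illustrative sketch (NumPy, not a GPU kernel): blockwise "tiled" attention
# with an online softmax. Each key/value block plays the role of the tile
# loaded into on-chip SRAM; the running max and running denominator let us
# rescale earlier partial results so the final answer equals exact attention.
import numpy as np

def tiled_attention(Q, K, V, block_size=128):
    """Exact softmax attention, computed one key/value block at a time."""
    L, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros_like(Q, dtype=np.float64)   # running (unnormalized) output
    row_max = np.full(L, -np.inf)              # running max logit per query
    row_sum = np.zeros(L)                      # running softmax denominator

    for start in range(0, K.shape[0], block_size):
        k_blk = K[start:start + block_size]    # one key block ("tile")
        v_blk = V[start:start + block_size]
        logits = (Q @ k_blk.T) * scale         # (L, block) partial scores

        new_max = np.maximum(row_max, logits.max(axis=1))
        correction = np.exp(row_max - new_max) # rescale old partial sums
        p = np.exp(logits - new_max[:, None])

        row_sum = row_sum * correction + p.sum(axis=1)
        out = out * correction[:, None] + p @ v_blk
        row_max = new_max

    return out / row_sum[:, None]

# Sanity check against the naive implementation that builds the full matrix.
rng = np.random.default_rng(0)
Q = rng.standard_normal((256, 64))
K = rng.standard_normal((256, 64))
V = rng.standard_normal((256, 64))
S = (Q @ K.T) / np.sqrt(64)
P = np.exp(S - S.max(axis=1, keepdims=True))
reference = (P / P.sum(axis=1, keepdims=True)) @ V
assert np.allclose(tiled_attention(Q, K, V), reference, atol=1e-6)
```

The correction factor rescales the previously accumulated numerator and denominator whenever a new block raises the running maximum, which is why the blockwise result matches standard softmax attention exactly rather than approximately.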
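The three-step Flash Decoding workflow (split the keys/values, attend to each chunk, merge with log-sum-exp) can be sketched the same way. This is an illustrative single-query version of the idea, not the real kernel: the function name `flash_decode`, the chunk count, the shapes, and the sequential loop standing in for GPU-parallel chunk processing are all assumptions.

```python
# Illustrative sketch (NumPy): Flash-Decoding-style split-KV attention for one
# decoding query over a long cached context. Each chunk is attended to
# independently (in practice, in parallel across GPU blocks), and the partial
# outputs are merged exactly using their log-sum-exp weights.
import numpy as np

def flash_decode(q, K, V, num_chunks=4):
    """Exact attention for a single query vector q over a long K/V cache."""
    d = q.shape[0]
    scale = 1.0 / np.sqrt(d)
    partial_outs, partial_lses = [], []

    # Steps 1 and 2: split the K/V cache and attend to each chunk on its own.
    for K_chunk, V_chunk in zip(np.array_split(K, num_chunks),
                                np.array_split(V, num_chunks)):
        logits = (K_chunk @ q) * scale            # (chunk_len,) scores
        m = logits.max()
        p = np.exp(logits - m)
        partial_outs.append(p @ V_chunk / p.sum())  # chunk-local attention
        partial_lses.append(m + np.log(p.sum()))    # chunk log-sum-exp

    # Step 3: reduce the per-chunk outputs with their log-sum-exp weights.
    lses = np.array(partial_lses)
    weights = np.exp(lses - lses.max())
    weights /= weights.sum()
    return np.sum(np.stack(partial_outs) * weights[:, None], axis=0)

# Sanity check against full softmax attention over the entire cache.
rng = np.random.default_rng(1)
q = rng.standard_normal(64)
K = rng.standard_normal((8192, 64))
V = rng.standard_normal((8192, 64))
logits = (K @ q) / np.sqrt(64)
p = np.exp(logits - logits.max())
reference = p @ V / p.sum()
assert np.allclose(flash_decode(q, K, V), reference, atol=1e-6)
```

Because each chunk records its own log-sum-exp, the final weighted merge reproduces the global softmax denominator, so the parallel split changes only where the work happens, not the result.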
anukool retweeted
Adam Grant@AdamMGrant·
Impostor syndrome is not a clue that you're unqualified. It's a sign of hidden potential. When you think others are overestimating you, it's more likely that you're underestimating yourself. They have an outside view. They see capacity for growth that's not yet visible to you.
Adam Grant tweet media
54 replies · 523 reposts · 2.7K likes · 294K views
anukool retweeted
Cliff Pickover@pickover·
Do you think the number "1" will ever again be considered a prime number? In the mid-18th century, mathematician Christian Goldbach listed 1 as prime. However, Euler did not consider 1 to be prime. In the 19th century many mathematicians still considered 1 to be prime, and lists of primes that included 1 continued to be published as recently as 1956. By the early 20th century, mathematicians began to agree that 1 should not be listed as prime. Source: bit.ly/3FrM2Rw
Cliff Pickover tweet media
28 replies · 28 reposts · 143 likes · 29.9K views
anukool retweeted
World of Statistics@stats_feed·
% of people that believe in God or a supreme being:
Indonesia 🇮🇩 - 93%
Turkey 🇹🇷 - 91%
Brazil 🇧🇷 - 84%
South Africa 🇿🇦 - 83%
Mexico 🇲🇽 - 78%
USA 🇺🇸 - 70%
Argentina 🇦🇷 - 62%
Russia 🇷🇺 - 56%
India 🇮🇳 - 56%
Poland 🇵🇱 - 51%
Italy 🇮🇹 - 50%
Canada 🇨🇦 - 46%
Hungary 🇭🇺 - 29%
Australia 🇦🇺 - 29%
Spain 🇪🇸 - 28%
Germany 🇩🇪 - 27%
UK 🇬🇧 - 25%
Belgium 🇧🇪 - 20%
France 🇫🇷 - 19%
Sweden 🇸🇪 - 18%
South Korea 🇰🇷 - 18%
China 🇨🇳 - 9%
Japan 🇯🇵 - 4%
1.3K replies · 1.5K reposts · 12.9K likes · 4.4M views