José Vergara de la Fuente (@PP_borre) - Twitter Profile

José Vergara de la Fuente retweeted

The study shows how individual fingerprint ridges deform when we touch different textures, revealing how subtle stretching and shifting along the ridge flanks may drive our remarkably fine tactile sensitivity. elifesciences.org/articles/93554…

José Vergara de la Fuente retweeted

Yuriria Vazquez @yurivazu · Jun 15

The brain is hierarchically organized to process sensory signals. How do functional connections within and across areas contribute to this hierarchy? We explored these questions in the thalamocortical network while monkeys detected a tactile stimulus. Check it out:

iScience journal @iScience_CP

Online now: Thalamocortical interactions shape hierarchical neural variability during stimulus perception dlvr.it/T8F7Np

José Vergara de la Fuente @PP_borre · Feb 27

@ArtyomAstafurov @karpathy Amazing! Thanks!!

José Vergara de la Fuente @PP_borre · Feb 25

@ArtyomAstafurov @karpathy Very cool work! I wonder how easy it is to make it multi-language (e.g. Spanish)

Andrej Karpathy @karpathy · Feb 22

Fun LLM challenge that I'm thinking about: take my 2h13m tokenizer video and translate the video into the format of a book chapter (or a blog post) on tokenization. Something like:

1. Whisper the video
2. Chop up into segments of aligned images and text
3. Prompt engineer an LLM to translate piece by piece
4. Export as a page, with links citing parts of original video

More generally, a workflow like this could be applied to any input video and auto-generate "companion guides" for various tutorials in a more readable, skimmable, searchable format. Feels tractable but non-trivial.
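A rough sketch of steps 2 and 4 of the workflow above, in Python. It assumes the transcript is already available as a list of segments with `start`, `end`, and `text` fields (an assumed shape, not something stated in the thread), leaves image alignment out, and takes the LLM rewrite of step 3 as a caller-supplied function. The names `chunk_segments`, `fmt_ts`, and `render_page`, and the `&t=...s` link format, are hypothetical.

```python
from typing import Callable

# Assumed segment shape: {"start": float, "end": float, "text": str}
Segment = dict


def chunk_segments(segments: list[Segment], max_seconds: float = 120.0) -> list[list[Segment]]:
    """Group consecutive segments into chunks spanning at most ~max_seconds."""
    chunks: list[list[Segment]] = []
    current: list[Segment] = []
    for seg in segments:
        # Start a new chunk once adding this segment would exceed the window.
        if current and seg["end"] - current[0]["start"] > max_seconds:
            chunks.append(current)
            current = []
        current.append(seg)
    if current:
        chunks.append(current)
    return chunks


def fmt_ts(seconds: float) -> str:
    """Format a second offset as HH:MM:SS."""
    s = int(seconds)
    return f"{s // 3600:02d}:{s % 3600 // 60:02d}:{s % 60:02d}"


def render_page(
    segments: list[Segment],
    rewrite: Callable[[str], str],  # stand-in for the LLM call in step 3
    video_url: str,                 # hypothetical base URL of the video
    max_seconds: float = 120.0,
) -> str:
    """Rewrite each chunk and append a link back to its start time."""
    parts = []
    for chunk in chunk_segments(segments, max_seconds):
        start = chunk[0]["start"]
        raw = " ".join(seg["text"].strip() for seg in chunk)
        link = f"[{fmt_ts(start)}]({video_url}&t={int(start)}s)"
        parts.append(f"{rewrite(raw)} {link}")
    return "\n\n".join(parts)
```

Passing `rewrite=lambda s: s` gives a plain timestamped transcript, which may be a useful baseline before any prompt engineering.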

José Vergara de la Fuente retweeted

Yuriria Vazquez @yurivazu · Feb 22

If you have a video or audio and you want to generate a blog post (an article, summary, or pretty much any text) - reach out to @ArtyomAstafurov. Below is an example of what it looks like for an amazing 2h+ tutorial video from @karpathy. 🙏🙏 learnt tons about LLMs & tokenization

Artyom Astafurov @ArtyomAstafurov

Here you are, with timestamps.

## Understanding Tokenization in Language Models: From GPT-2 to GPT-4

Tokenization is a fundamental process in language models, involving the conversion of text into tokens. It must handle multiple languages and special characters such as emojis 15:05. However, tokenization is not without its challenges, especially when dealing with large language models (LLMs) and non-English languages 04:11.

## Tokenization in Language Models

Tokenization in language models varies by position, case, and language. For instance, English text typically encodes into fewer tokens than the same content in languages like Korean, which can affect model training and performance 08:15. Tokenization also involves converting strings to integers for model input, which can be a complex process due to the size of the vocabulary and changes in standards 17:00.

## Encoding Methods: UTF-8 and Byte-Level Encoding

UTF-8 is a popular encoding method that translates Unicode code points into variable-length byte sequences of one to four bytes 18:27. It is preferred for its compatibility with ASCII and its efficiency, and it is widely used online 19:13. However, naive use of UTF-8 bytes as tokens leads to long sequences over a vocabulary of only 256 symbols, which wastes the limited context of a transformer. Byte pair encoding offers a solution to this problem 21:06.

Byte-level encoding is another method used in large language models. GPT-2, for example, uses a 50,257-token vocabulary and a 1,024-token context 02:52. The byte pair encoding algorithm compresses sequences by iteratively replacing the most frequent pair of tokens with a new token, reducing sequence length while expanding the vocabulary 22:54.

## Improvements from GPT-2 to GPT-4

The GPT-4 tokenizer demonstrates significant efficiency improvements over GPT-2, particularly in handling programming languages like Python. By grouping more whitespace into single tokens and increasing the vocabulary size from roughly 50k to roughly 100k tokens, GPT-4 reduces token bloat and allows for denser input, enabling the transformer to consider a larger context when predicting the next token. This results in better performance, especially in coding tasks, due to more efficient representation and attention to relevant context 10:48.

## Special Tokens in Tokenization

Special tokens play a crucial role in data structuring and encoder vocab mapping 01:18:26. Language models use a special end-of-text token to delimit documents, aiding in training data segmentation 01:19:11. Special tokens bypass the typical byte pair encoding (BPE) merges and are handled by custom code, as seen in the tiktoken library, which is implemented in Rust. GPT-4 introduces new special tokens such as the FIM ("fill in the middle") tokens; adding special tokens requires model surgery to accommodate them in the transformer's embedding matrix and final layer 01:20:43.

## Challenges and Anomalies in Tokenization

Tokenization in language models can lead to unexpected behaviors and anomalies. For instance, GPT-2 tokenized the spaces in Python indentation inefficiently, which used up context length; GPT-4 addressed this. Special tokens can confuse LLMs, posing potential attack surfaces 01:56:47. Moreover, clusters of 'unstable tokens' like 'SolidGoldMagikarp' can cause erratic LLM responses, potentially linked to Reddit user mentions in the tokenization dataset 02:04:06.

## Conclusion

Tokenization is a crucial but complex process in language models. It has seen significant improvements from GPT-2 to GPT-4, particularly in efficiency and handling of special tokens. However, challenges persist, especially with non-English languages and anomalies in tokenization. As we continue to refine and develop language models, understanding and addressing these challenges will be key to improving their performance and utility 02:10:20.

References:

Understanding Tokenization in Language Models 00:00
Byte-Level Encoding in Language Models 02:52
Efficiency Improvements in Tokenization from GPT-2 to GPT-4 10:48
Understanding Special Tokens in GPT Tokenization 01:20:43
Challenges in Tokenization for GPT Models 01:56:47
Unstable Tokens and LLM Behavior 02:04:06
Tokenization in AI and Its Challenges 02:10:20

You are more than welcome to experiment here: web.platogram.ai/summary?thread… Happy to give you a perpetual license for all of your videos. Reach out!
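The byte pair encoding step mentioned in the summary above can be illustrated with a minimal sketch. It assumes the starting vocabulary is the 256 raw UTF-8 byte values and that new token ids are assigned from 256 upward; the function names are illustrative and are not the API of any particular tokenizer library. It also omits the regex pre-splitting and special-token handling that production tokenizers add.

```python
from collections import Counter


def pair_counts(ids: list[int]) -> Counter:
    """Count how often each adjacent pair of token ids occurs."""
    return Counter(zip(ids, ids[1:]))


def merge(ids: list[int], pair: tuple[int, int], new_id: int) -> list[int]:
    """Replace every non-overlapping occurrence of `pair` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train_bpe(text: str, num_merges: int) -> tuple[list[int], dict]:
    """Learn up to `num_merges` merges over the UTF-8 bytes of `text`."""
    ids = list(text.encode("utf-8"))  # one id per byte, values 0..255
    merges: dict[tuple[int, int], int] = {}
    for step in range(num_merges):
        counts = pair_counts(ids)
        if not counts:
            break  # fewer than two tokens left, nothing to merge
        pair = max(counts, key=counts.get)  # most frequent adjacent pair
        new_id = 256 + step
        ids = merge(ids, pair, new_id)
        merges[pair] = new_id
    return ids, merges


if __name__ == "__main__":
    ids, merges = train_bpe("aaabdaaabac", num_merges=3)
    print(len("aaabdaaabac".encode("utf-8")), "bytes ->", len(ids), "tokens")
    print(merges)
```

Each merge shortens the sequence and adds one vocabulary entry, which is the trade-off between sequence length and vocabulary size that the summary describes.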

José Vergara de la Fuente retweeted

Tatiana Engel @EngelTatiana · Feb 23

Attentional mechanisms enable selective processing of information. With Ruobing Xia, @Xiaomo_CCLab, and Tirin Moore, we review work across species, forms of attention, and analysis levels, revealing convergence and differences. Out today in @TrendsCognSci: sciencedirect.com/science/articl…

José Vergara de la Fuente retweeted

Raymundo Báez-Mendoza @thunderNeurosci · Nov 21

Excited to take part as a speaker in this neuroscience workshop on Thursday, Nov. 23 at 10 am (CDMX). The workshop will be streamed live on the @ColegioNal_mx channels. More details here: colnal.mx/agenda/las-neu…

José Vergara de la Fuente retweeted

Jeff Yau 姚明穎 @JeffMYau · Aug 12

Love you @slimanjbensmaia Shine on

José Vergara de la Fuente retweeted

Secretaría de Ciencia @Secihti_Mx · Aug 12

#Convocatoria | Call for applications: for those interested in applying for admission, continuation, or promotion in the Sistema Nacional de Investigadores (National System of Researchers). See the terms: #SNI2022 ➡️ bit.ly/3w1cf5o
