Shiv

2.8K posts


@TensorTunesAI

Agentic AI | Learning ML and DL in public | Sharing resources & daily notes

Bangalore, Karnataka · Joined September 2024
510 Following · 406 Followers
Pinned Tweet
Shiv@TensorTunesAI·
I just trained my first LLM from scratch. No APIs or pre-trained models: a real transformer training pipeline. But the interesting part isn't the model, it's the hours of debugging before it finally worked. I wanted to deeply understand how modern language models actually work.

Model: huggingface.co/hey-shiv/mini-…

So I built a mini LLM pipeline myself:
Dataset → BPE Tokenizer → Transformer → Training loop → Text generation

This journey was inspired by Rishab sir's AI classes, which pushed me to actually implement these systems instead of just learning the theory.

Step 1: Dataset
I used the TinyStories dataset, which contains millions of short stories designed for training small language models.

Challenges I ran into:
• dataset shards (~250 MB each)
• slow downloads
• broken scripts
• environment issues

Eventually everything was merged into one training corpus.
Final dataset: ~1.78 GB of text, ~472M training tokens.

Step 2: BPE Tokenizer
Before the model can read text, it must be converted into tokens, so I trained a Byte Pair Encoding (BPE) tokenizer.
Vocabulary size: ~2,000 tokens
Example: "I love machine learning" → [41, 893, 176, 512]
Tokenization is critical because it defines how the model sees language.

Step 3: Modern Transformer Architecture
The model itself is a small transformer implemented in PyTorch, with several modern improvements used in today's LLMs:
• RoPE (Rotary Positional Embeddings)
• RMSNorm
• SwiGLU feed-forward layers
• Grouped Query Attention (GQA)
Even though the model is small, the architecture follows modern LLM design patterns.

Step 4: Training
Training setup:
Dataset size: 1.78 GB
Training tokens: 472M
Validation tokens: 52M
Vocabulary size: ~2,000
Hardware: Apple Silicon GPU (MPS)

Then came the best moment: watching the model actually learn.
Training loss: 7.63 at the start → ~2.29 after training.
That starting loss is no accident: ln(2000) ≈ 7.6 is the loss of a uniform guess over the vocabulary, so the drop means the model is genuinely learning patterns from the dataset.
Biggest lesson: building ML systems is 90% debugging infrastructure, not fancy models. You spend most of your time fighting with:
• datasets
• tokenizers
• training pipelines
• environment issues
But that's where the real learning happens.

Next experiments:
• try different datasets
• scale the model
• improve tokenizer quality
• experiment with RL approaches

This was just the beginning. Huge thanks to @rishabh10x sir for the classes that inspired me to build this from scratch instead of just using APIs. If you're learning AI, try building at least one model pipeline yourself. It's chaotic.
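The BPE step (Step 2) can be sketched with a toy merge loop. This is a from-scratch illustration, not the actual training code: the function names are made up for this sketch, and the token IDs in the tweet's example are illustrative, so real IDs will differ.

```python
from collections import Counter

def learn_merges(corpus, num_merges):
    """Learn BPE merge rules: repeatedly merge the most frequent adjacent pair."""
    words = [list(word) for line in corpus for word in line.split()]
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for w in words:
            pairs.update(zip(w, w[1:]))  # count adjacent symbol pairs
        if not pairs:
            break  # every word is a single symbol; nothing left to merge
        best = pairs.most_common(1)[0][0]
        merges.append(best)
        words = [apply_merge(w, best) for w in words]
    return merges

def apply_merge(symbols, pair):
    """Replace every occurrence of `pair` in `symbols` with the fused symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out

def build_vocab(corpus, merges):
    """Vocabulary = base characters plus every merged symbol."""
    symbols = {ch for line in corpus for word in line.split() for ch in word}
    symbols.update(a + b for a, b in merges)
    return {s: i for i, s in enumerate(sorted(symbols))}

def encode(text, merges, vocab):
    """Apply learned merges in order, then map symbols to integer IDs."""
    ids = []
    for word in text.split():
        symbols = list(word)
        for pair in merges:
            symbols = apply_merge(symbols, pair)
        ids.extend(vocab[s] for s in symbols)
    return ids

corpus = ["i love machine learning", "machines learn language", "deep learning models"]
merges = learn_merges(corpus, num_merges=40)
vocab = build_vocab(corpus, merges)
print(encode("i love machine learning", merges, vocab))
```

A real tokenizer (e.g. the Hugging Face `tokenizers` BPE trainer) works at the byte level and handles a ~2,000-symbol vocabulary efficiently, but the merge logic is the same idea.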
[image]
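Two of the Step 3 components are compact enough to show inline. This is a hedged sketch of standard RMSNorm and SwiGLU definitions (LLaMA-style), not the author's actual code; RoPE and GQA are omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Normalize by the root-mean-square of the features; no mean-centering, no bias."""
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        inv_rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * inv_rms * self.weight

class SwiGLU(nn.Module):
    """SiLU-gated feed-forward: down(silu(gate(x)) * up(x))."""
    def __init__(self, dim, hidden):
        super().__init__()
        self.w_gate = nn.Linear(dim, hidden, bias=False)
        self.w_up = nn.Linear(dim, hidden, bias=False)
        self.w_down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x):
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

# One pre-norm feed-forward sub-block over a (batch, seq, dim) activation:
x = torch.randn(2, 16, 64)
out = x + SwiGLU(64, 4 * 64)(RMSNorm(64)(x))
```

The residual-plus-pre-norm pattern shown in the last line is what modern decoder blocks wrap around both the attention and feed-forward sub-layers.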
Ashh!! 🧋@AnshikaK7·
Does the process know I'm trusting it? 😶‍🌫️
Shiv@TensorTunesAI·
Hey @lambdaviking, I recently trained a small transformer (~472M tokens) with BPE, RoPE, GQA, etc., and now I'm exploring applying similar ideas to Indian classical music. Specifically, I'm looking at representing ragas as sequence data (starting with MIDI, possibly moving to audio later). Curious whether transformers can actually capture deeper raga structure: not just note sequences, but progression, mood, and inherent constraints. Do you think this is something transformers can learn with scale, or would it require a different modeling approach / inductive bias? Or should I focus on classic ML: CNNs, RNNs, LSTMs? Would love your perspective. x.com/i/status/20331…
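One very rough starting point for the MIDI idea is an event-token vocabulary, as commonly used for symbolic music models. Everything below is hypothetical: the function name, the token scheme, and the pitch mapping (Sa fixed at C4 = MIDI 60 with a Bilawal-like scale) are assumptions for illustration, not a standard.

```python
# Hypothetical sketch: serialize a raga phrase as interleaved pitch/duration tokens.
# Assumption: Sa = C4 (MIDI 60), other shuddha swaras mapped Bilawal-style.
SWARA_TO_MIDI = {"S": 60, "R": 62, "G": 64, "M": 65, "P": 67, "D": 69, "N": 71}

def phrase_to_tokens(swaras, durations):
    """Emit one NOTE_<midi> token and one DUR_<beats> token per swara."""
    tokens = []
    for swara, beats in zip(swaras, durations):
        tokens.append(f"NOTE_{SWARA_TO_MIDI[swara]}")
        tokens.append(f"DUR_{beats}")
    return tokens

# An ascending (aaroha-like) phrase, held longer on the last note:
tokens = phrase_to_tokens(list("SRGMPDN"), [1, 1, 1, 1, 1, 1, 2])
```

A sequence model trained on such tokens can only see what the tokenization exposes, so the open question in the tweet (progression, mood, constraints beyond note order) is really a question about whether richer events (ornaments, microtonal shruti, phrase boundaries) get their own tokens.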
William Merrill@lambdaviking·
[1/8] New paper with Hongjian Jiang, @YanhongLi2062, Anthony Lin, @Ashish_S_AI: 📜Why Are Linear RNNs More Parallelizable? We identify expressivity differences between linear/nonlinear RNNs and, conversely, barriers to parallelizing nonlinear RNNs 🧵👇
[image]
Shiv@TensorTunesAI·
Recently trained a small transformer (~472M tokens) with BPE, RoPE, GQA, etc., and now I'm exploring applying similar ideas to Indian classical music. Specifically, I'm looking at representing ragas as sequence data (starting with MIDI, possibly moving to audio later). Curious whether transformers can actually capture deeper raga structure: not just note sequences, but progression, mood, and inherent constraints. Do you think this is something transformers can learn with scale, or would it require a different modeling approach / inductive bias? Or should I focus on classic ML: CNNs, RNNs, LSTMs? Would love your perspective. x.com/i/status/20331…
Mayank Mishra@MayankMish98·
Big news! 🎉 TPU support for pretraining is now live on lm-engine, powered by PyTorch-XLA. Faster, scalable training is just a clone away: github.com/open-lm-engine… (tested on TPU v6e)
Ramakrishna kompella@jojokompella·
I did some tests myself; putting it out soon. I expected it to be significantly better than the competition for Indian languages. For lower-resource ones, it is, but not for high-resource ones. Sarvam 30B is not significantly worse than 105B, though.
nullptr@resetptr

Ran some quick weekend experiments on @SarvamAI's 105B model on a subset of the IndicMMLU-Pro dataset. Sarvam's model is really good at reasoning efficiency: it uses ~2.5x fewer tokens to reach ~the same accuracy.

Ashanvi@ashanviii·
i tried designing something in 20 mins and yeah… not really sure what direction i was going in, kinda just winged it
[image]
mihir@mihirss2·
@TensorTunesAI @lossfunk Yeah, I am an Indian classical vocalist. A raga is complex. Yet it has many elements of adaptation from past renditions; a lot of scope for 'learning.' Students borrow styles from their gurus all the time. It is a curious problem how this notion translates to LMs/ML in general.
Lossfunk@lossfunk·
🚨 Shocking: Frontier LLMs score 85-95% on standard coding benchmarks. We gave them equivalent problems in languages they couldn't have memorized. They collapsed to 0-11%. Presenting EsoLang-Bench. Accepted to the Logical Reasoning and ICBINB workshops at ICLR 2026 🧵
mihir@mihirss2·
@TensorTunesAI @lossfunk Would love to explore the idea of representing ragas in models further with you. This is literally what I was trying to brainstorm over yesterday lol. Lmk
Shiv@TensorTunesAI·
Starting a 10-day mini-series on NLP. Day 1 of learning NLP in 2 mins: starting with the basics. Before any model, we clean and structure the text. Bad text → bad model. Next: text preprocessing and tokenization.
[image]
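The "clean and structure the text" step from Day 1 can be sketched as follows. This is a generic minimal example (lowercase, strip punctuation, collapse whitespace), not the exact pipeline from the notes.

```python
import re
import string

def preprocess(text: str) -> str:
    """Minimal cleaning: lowercase, strip ASCII punctuation, collapse whitespace."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()

def tokenize(text: str) -> list[str]:
    """Whitespace tokenization over the cleaned text."""
    return preprocess(text).split()

tokens = tokenize("Bad  text ->  Bad Model!!")
# → ['bad', 'text', 'bad', 'model']
```

Real pipelines layer more on top (Unicode normalization, handling numbers and emoji, language-specific rules), but every step follows this same text-in, cleaner-text-out shape.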
Shiv@TensorTunesAI·
[image]