I just trained my first LLM from scratch.
No APIs or pre-trained models.
A real transformer training pipeline.
But the interesting part isn’t the model.
It’s the hours of debugging before it finally worked.
I wanted to deeply understand how modern language models actually work.
Model: huggingface.co/hey-shiv/mini-…
So I built a mini LLM pipeline myself:
Dataset
→ BPE Tokenizer
→ Transformer
→ Training loop
→ Text generation
This journey was inspired by Rishab sir’s AI classes, which pushed me to actually implement these systems instead of just learning the theory.
Step 1 — Dataset
I used the TinyStories dataset, which contains millions of short stories designed for training small language models.
Challenges I ran into:
• dataset shards (~250MB each)
• slow downloads
• broken scripts
• environment issues
Eventually everything was merged into one training corpus.
Final dataset:
~1.78 GB text
~472M training tokens
Step 2 — BPE Tokenizer
Before the model can read text, the text must be converted into tokens.
I trained a Byte Pair Encoding (BPE) tokenizer.
Vocabulary size:
~2000 tokens
Example:
"I love machine learning"
→
[41, 893, 176, 512]
Tokenization is critical because it defines how the model sees language.
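For anyone curious, the core idea of BPE training fits in a few lines of pure Python — start from characters and repeatedly merge the most frequent adjacent pair. This is a simplified character-level sketch, not my exact tokenizer:

```python
from collections import Counter

def get_pair_counts(words):
    # words: dict mapping a tuple of symbols -> frequency in the corpus
    pairs = Counter()
    for word, freq in words.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(words, pair):
    # Replace every occurrence of the pair (a, b) with the merged symbol "ab".
    a, b = pair
    merged = {}
    for word, freq in words.items():
        out, i = [], 0
        while i < len(word):
            if i < len(word) - 1 and word[i] == a and word[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

def train_bpe(corpus, num_merges):
    # Begin with words split into characters, then learn merge rules greedily.
    words = Counter(tuple(w) for w in corpus.split())
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(words)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        words = merge_pair(words, best)
        merges.append(best)
    return merges
```

Running this on a toy corpus like "low low low lower lowest" learns merges such as ("l", "o") and then ("lo", "w"). Real tokenizers add byte-level fallback, special tokens, and a fast encode step on top of this loop.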
Step 3 — Modern Transformer Architecture
The model itself is a small transformer implemented in PyTorch.
I implemented several modern transformer improvements used in today’s LLMs:
• RoPE (Rotary Positional Embeddings)
• RMSNorm
• SwiGLU feed-forward layers
• Grouped Query Attention (GQA)
Even though the model is small, the architecture follows modern LLM design patterns.
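Two of those pieces are compact enough to show inline. Here's roughly what RMSNorm and a SwiGLU feed-forward block look like in PyTorch — a sketch with illustrative dimensions and names, not my exact implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    # RMSNorm: rescale by the root-mean-square of the features.
    # Unlike LayerNorm, there is no mean-centering and no bias.
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        rms = torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)
        return x * rms * self.weight

class SwiGLU(nn.Module):
    # SwiGLU feed-forward: silu(x @ W1) gates (x @ W3), then W2 projects back.
    def __init__(self, dim, hidden):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden, bias=False)
        self.w3 = nn.Linear(dim, hidden, bias=False)
        self.w2 = nn.Linear(hidden, dim, bias=False)

    def forward(self, x):
        return self.w2(F.silu(self.w1(x)) * self.w3(x))
```

RoPE and GQA follow the same spirit: RoPE rotates query/key vectors by position-dependent angles instead of adding position embeddings, and GQA lets several query heads share one key/value head to cut memory.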
Step 4 — Training
Training setup:
Dataset size: 1.78 GB
Training tokens: 472M
Validation tokens: 52M
Vocabulary size: ~2000
Hardware used:
Apple Silicon GPU (MPS)
Then came the best moment.
Watching the model actually learn.
Training loss:
Start → 7.63
After training → ~2.29
That drop means the model is actually learning patterns from the dataset.
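Fun detail: an untrained model predicts roughly uniformly over the vocabulary, so the starting loss should sit near ln(vocab_size) — ln(2000) ≈ 7.6, right where mine started. The training step itself is simple next-token prediction; here's a sketch (the TinyLM stand-in is just for illustration — the real model is the transformer above):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Pick Apple Silicon's MPS backend when available, else fall back to CPU.
device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")

class TinyLM(nn.Module):
    # Stand-in model (embedding + linear head); the real transformer goes here.
    def __init__(self, vocab_size=2000, dim=64):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, dim)
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, x):
        return self.head(self.emb(x))

def train_step(model, optimizer, batch):
    # batch: (B, T+1) token ids; predict each next token from its prefix.
    inputs, targets = batch[:, :-1], batch[:, 1:]
    logits = model(inputs)                      # (B, T, vocab_size)
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           targets.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Everything else — batching, checkpointing, validation — is plumbing around this one function.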
Biggest lesson
Building ML systems is 90% debugging infrastructure.
Not fancy models.
You spend most of your time fighting with:
• datasets
• tokenizers
• training pipelines
• environment issues
But that’s where the real learning happens.
Next experiments:
• try different datasets
• scale the model
• improve tokenizer quality
• experiment with RL approaches
This was just the beginning.
Huge thanks to @rishabh10x sir for the classes that inspired me to build this from scratch instead of just using APIs.
If you're learning AI, try building at least one model pipeline yourself.
It’s chaotic.
