Shiv

2.8K posts


@TensorTunesAI

Agentic AI | Learning ML and DL in public | Sharing resources & daily notes

Bangalore, Karnataka · Joined September 2024
515 Following · 408 Followers
Pinned Tweet
Shiv
Shiv@TensorTunesAI·
I just trained my first LLM from scratch. No APIs or pre-trained models. A real transformer training pipeline. But the interesting part isn’t the model. It’s the hours of debugging before it finally worked. I wanted to deeply understand how modern language models actually work.

Model: huggingface.co/hey-shiv/mini-…

So I built a mini LLM pipeline myself:
Dataset → BPE Tokenizer → Transformer → Training loop → Text generation

This journey was inspired by Rishab sir’s AI classes, which pushed me to actually implement these systems instead of just learning the theory.

Step 1 — Dataset
I used the TinyStories dataset, which contains millions of short stories designed for training small language models.
Challenges I ran into:
• dataset shards (~250 MB each)
• slow downloads
• broken scripts
• environment issues
Eventually everything was merged into one training corpus.
Final dataset: ~1.78 GB of text, ~472M training tokens.

Step 2 — BPE Tokenizer
Before the model can read text, it must convert it into tokens, so I trained a Byte Pair Encoding (BPE) tokenizer.
Vocabulary size: ~2,000 tokens
Example: "I love machine learning" → [41, 893, 176, 512]
Tokenization is critical because it defines how the model sees language.

Step 3 — Modern Transformer Architecture
The model itself is a small transformer implemented in PyTorch, with several modern improvements used in today’s LLMs:
• RoPE (Rotary Positional Embeddings)
• RMSNorm
• SwiGLU feed-forward layers
• Grouped Query Attention (GQA)
Even though the model is small, the architecture follows modern LLM design patterns.

Step 4 — Training
Training setup:
• Dataset size: 1.78 GB
• Training tokens: 472M
• Validation tokens: 52M
• Vocabulary size: ~2,000
• Hardware: Apple Silicon GPU (MPS)

Then came the best moment: watching the model actually learn.
Training loss: 7.63 at the start → ~2.29 after training.
That drop means the model is actually learning patterns from the dataset.
Biggest lesson: building ML systems is 90% debugging infrastructure, not fancy models. You spend most of your time fighting with:
• datasets
• tokenizers
• training pipelines
• environment issues
But that’s where the real learning happens.

Next experiments:
• try different datasets
• scale the model
• improve tokenizer quality
• experiment with RL approaches

This was just the beginning. Huge thanks to @rishabh10x sir for the classes that inspired me to build this from scratch instead of just using APIs. If you're learning AI, try building at least one model pipeline yourself. It’s chaotic.
Shiv tweet media
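The merge loop behind Step 2 can be sketched in plain Python. This is a hypothetical minimal byte-level BPE trainer, not the pinned pipeline's actual code; the function names and the toy corpus are my own:

```python
from collections import Counter

def get_pairs(tokens):
    """Count adjacent token pairs."""
    return Counter(zip(tokens, tokens[1:]))

def merge(tokens, pair, new_id):
    """Replace every occurrence of `pair` with `new_id`."""
    out, i = [], 0
    while i < len(tokens):
        if i < len(tokens) - 1 and (tokens[i], tokens[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

def train_bpe(text, num_merges):
    tokens = list(text.encode("utf-8"))  # start from raw bytes (ids 0-255)
    merges, next_id = {}, 256
    for _ in range(num_merges):
        pairs = get_pairs(tokens)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair becomes a new token
        merges[best] = next_id
        tokens = merge(tokens, best, next_id)
        next_id += 1
    return merges, tokens

merges, toks = train_bpe("low lower lowest low low", 10)
```

Each merge shrinks the token sequence while growing the vocabulary, which is exactly the compression/vocab-size trade-off a ~2,000-token tokenizer is tuning.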
8 replies · 0 reposts · 15 likes · 2.4K views
Shiv
Shiv@TensorTunesAI·
@bleuonbase Just went through 10+ of your posts. You don't need to be a researcher. Let's connect!
0 replies · 0 reposts · 0 likes · 29 views
agusti
agusti@bleuonbase·
I'm not an ML researcher but got a bit nerd-sniped by OAI's new parameter-golf challenge. I set up my pi-autoresearch loop on it, of course. I asked my clanker to do some research on all the related papers that could help it come up with better ideas, etc. It ended up making this knowledge base. It's nothing revolutionary, mostly notes and links to related papers. I hope it's useful to someone: golf.agustif.com. Also, if you have any feedback or ideas, hmu.
agusti tweet media
OpenAI@OpenAI

Are you up for a challenge? openai.com/parameter-golf

5 replies · 4 reposts · 156 likes · 16.5K views
Jyothi Venkat
Jyothi Venkat@jyothiwrites·
Ok, I have been reading more of the Anthropic report. 81k sounds impressive until you ask who those 81k actually are: Anthropic users who opted into talking to an AI interviewer. That's already a massively self-selected sample! No weighting, no normalization to the population, no accounting for who refused to participate and why. The barriers to participation are data points too. This tells us what engaged Claude users think about AI, which is interesting but very different from what the world thinks about AI.
Abhishek Nagaraj 🗺️@abhishekn

I'm very bullish on the role of AI qualitative interviewers, but all the results from this exercise should have a big asterisk around what the specific sample is and what it says about AI in general. Who are these 81k users around the world that are responding to this call? Is this telling us something about how AI is generally perceived or what the 81k Claude users think about AI? We already know that Claude users are likely to be quite different than the average "user" on the consumer side, let alone how this selection varies across countries/occupations and continents. Survey research scholars have written entire textbooks about sample selection, and the good folks at organizations like the @uscensusbureau fret night and day about collecting representative data - but I saw very little discussion on this topic or any disclaimers in this report except for this paragraph below in the appendix.

0 replies · 0 reposts · 3 likes · 105 views
Shiv
Shiv@TensorTunesAI·
Recently trained a small transformer (~472M tokens) with BPE, RoPE, GQA, etc., and now I'm exploring applying similar ideas to Indian classical music. Specifically, I'm looking at representing ragas as sequence data (starting with MIDI, possibly moving to audio later). Curious whether transformers can actually capture deeper raga structure: not just note sequences, but progression, mood, and inherent constraints. @karpathy do you think this is something transformers can learn with scale, or would it require a different modeling approach / inductive bias? Would love your perspective.
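For the MIDI-as-sequence idea, one common starting point is turning note events into discrete tokens. A hypothetical encoding sketch; the token scheme, bucket size, and toy melody are my assumptions, not an established raga representation:

```python
def encode(notes):
    """Encode a monophonic melody as tokens.

    notes: list of (midi_pitch, duration_in_beats) tuples.
    Each note becomes a pitch token and a quantized duration token.
    """
    tokens = []
    for pitch, dur in notes:
        tokens.append(f"PITCH_{pitch}")
        # quantize duration into sixteenth-beat buckets, capped at 15
        tokens.append(f"DUR_{min(int(dur * 4), 15)}")
    return tokens

# toy three-note phrase: D4, E4, F4
seq = encode([(62, 0.5), (64, 0.5), (65, 1.0)])
# → ['PITCH_62', 'DUR_2', 'PITCH_64', 'DUR_2', 'PITCH_65', 'DUR_4']
```

A vocabulary like this feeds straight into the same BPE + transformer pipeline as text; whether progression and mood survive such a flat encoding is exactly the open question.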
0 replies · 0 reposts · 1 like · 10 views
Andrej Karpathy
Andrej Karpathy@karpathy·
Had to go see Project Hail Mary right away (it's based on the book of Andy Weir, of also The Martian fame). Both very pleased and relieved to say that 1) the movie sticks very close to the book in both content and tone and 2) is really well executed. The book is one of my favorites when it comes to alien portrayals because a lot of thought was clearly given to the scientific details of an alternate biochemistry, evolutionary history, sensorium, psychology, language, tech tree, etc. It's different enough that it is highly creative and plausible, but also similar enough that you get a compelling story and one of the best bromances in fiction. Not to mention the other (single-cellular) aliens. I can count fictional portrayals of aliens of this depth on one hand. A lot of these aspects are briefly featured - if you read the book you'll spot them but if you haven't, the movie can't spend the time to do them justice. I'll say that the movie inches a little too much into the superhero movie tropes with the pacing, the quips, the Bathos and such for my taste, and we get a little bit less the grand of Interstellar and a little bit less of the science of The Martian, but I think it's ok considering the tone of the original content. And it does really well where it counts - on Rocky and the bromance. Thank you to the film crew for the gem!
175 replies · 151 reposts · 4.1K likes · 214.9K views
Shiv
Shiv@TensorTunesAI·
@xaemio Yep, it's something called analysis paralysis. You learn only when you get stuck.
0 replies · 0 reposts · 0 likes · 8 views
saumya
saumya@xaemio·
ENOUGH TUTORIALS. DIRECTLY GOING TO CODING. WILL LEARN THINGS THROUGH CODING. TUTORIAL HELL IS REAL.
1 reply · 0 reposts · 18 likes · 122 views
Shiv retweeted
Shiv
Shiv@TensorTunesAI·
Day 16 of building neural networks from first principles.

Today: why Softmax + Cross-Entropy gives such a clean gradient → notes and NumPy implementation attached.

Most people memorize this: gradient = (y_pred - y_true). But here’s the real insight: Softmax and Cross-Entropy aren’t separate. They’re designed to collapse the chain rule. Instead of messy derivatives across layers, everything simplifies to:
→ error signal = prediction − truth

That’s why:
• only the correct class gets a strong correction
• wrong classes get proportional penalties
• training becomes stable and efficient

Batch version: (y_pred - y_true) / N

This is the exact signal that flows backward and updates the network.
[attached media: 3 images]
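The (y_pred − y_true) / N claim is easy to verify numerically. A small NumPy sketch (the example logits and labels are made up) comparing the analytic gradient of mean cross-entropy against finite differences:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# toy batch: 2 examples, 3 classes
logits = np.array([[2.0, 1.0, 0.1],
                   [0.5, 2.5, 0.3]])
y_true = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0]])  # one-hot labels
N = logits.shape[0]

y_pred = softmax(logits)
grad = (y_pred - y_true) / N  # the "collapsed" analytic gradient w.r.t. logits

def loss(z):
    """Mean cross-entropy over the batch."""
    p = softmax(z)
    return -np.mean(np.sum(y_true * np.log(p), axis=1))

# central finite differences, one logit at a time
eps = 1e-6
num_grad = np.zeros_like(logits)
for i in range(logits.shape[0]):
    for j in range(logits.shape[1]):
        zp, zm = logits.copy(), logits.copy()
        zp[i, j] += eps
        zm[i, j] -= eps
        num_grad[i, j] = (loss(zp) - loss(zm)) / (2 * eps)

assert np.allclose(grad, num_grad, atol=1e-5)
```

The two gradients agree to numerical precision, which is the whole point: composing softmax with cross-entropy cancels the messy Jacobian terms and leaves prediction minus truth.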
1 reply · 0 reposts · 10 likes · 314 views
Shiv
Shiv@TensorTunesAI·
@Anushka69532 Yes, not exactly research... but what I wanna do (Music + Deep Reinforcement Learning) does require a lot of study 😅
0 replies · 0 reposts · 0 likes · 15 views
Shiv
Shiv@TensorTunesAI·
Day 2 of Learning NLP. Ever noticed how messy raw text is? Typos, extra spaces, random symbols: models see the same mess. Before using text in models, we clean and normalize it. Notes and summary attached. Next: tokenization.
[attached media: 2 images]
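A minimal sketch of the kind of cleaning step described above, assuming lowercase English text and a small punctuation whitelist; the exact rules here are my choices, not the attached notes:

```python
import re
import unicodedata

def normalize(text):
    text = unicodedata.normalize("NFKC", text)     # unify unicode variants ("…" -> "...")
    text = text.lower()                            # case-fold
    text = re.sub(r"http\S+", " ", text)           # drop URLs
    text = re.sub(r"[^a-z0-9\s.,!?']", " ", text)  # strip stray symbols
    return re.sub(r"\s+", " ", text).strip()       # collapse whitespace

print(normalize("Hello   WORLD!! visit https://x.com  now… :)"))
# hello world!! visit now...
```

Order matters: stripping symbols before removing URLs would mangle the `://` and leave URL fragments behind, which is exactly the kind of mess that surfaces later at tokenization time.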
2 replies · 0 reposts · 5 likes · 70 views
Shiv
Shiv@TensorTunesAI·
@xaemio Woah, that's very flexible
0 replies · 0 reposts · 0 likes · 6 views
saumya
saumya@xaemio·
@TensorTunesAI Mostly I run on 4 hrs of sleep and sometimes on 10 hrs of sleep.
1 reply · 0 reposts · 1 like · 18 views
Anushka
Anushka@Anushka69532·
90% of portfolios today could be generated by AI in a day. So what actually makes a developer stand out anymore?
3 replies · 0 reposts · 6 likes · 76 views
Shiv
Shiv@TensorTunesAI·
@Anushka69532 I don't think personal portfolio websites matter anymore. I've seen some of the best developers keep the most minimal ones, but of course backed by proof of work.
1 reply · 0 reposts · 2 likes · 21 views
Shiv
Shiv@TensorTunesAI·
@fykawise Beautifully said
0 replies · 0 reposts · 0 likes · 12 views
Fyka Ansari
Fyka Ansari@fykawise·
A person can be deeply grateful and still ambitious 🌱
1 reply · 0 reposts · 12 likes · 152 views
Shiv
Shiv@TensorTunesAI·
@AnshikaK7 Happy Ugadi 🤞
0 replies · 0 reposts · 0 likes · 5 views
Ashh!! 🧋
Ashh!! 🧋@AnshikaK7·
Does the process know I'm trusting it ? 😶‍🌫️
3 replies · 1 repost · 18 likes · 274 views
Shiv
Shiv@TensorTunesAI·
Hey @lambdaviking Recently trained a small transformer (~472M tokens) with BPE, RoPE, GQA, etc., and now I'm exploring applying similar ideas to Indian classical music. Specifically, I'm looking at representing ragas as sequence data (starting with MIDI, possibly moving to audio later). Curious whether transformers can actually capture deeper raga structure: not just note sequences, but progression, mood, and inherent constraints. Do you think this is something transformers can learn with scale, or would it require a different modeling approach / inductive bias? Or should I focus on classic ML: CNN / RNN / LSTM? Would love your perspective. x.com/i/status/20331…
0 replies · 0 reposts · 0 likes · 12 views
William Merrill
William Merrill@lambdaviking·
[1/8] New paper with Hongjian Jiang, @YanhongLi2062, Anthony Lin, @Ashish_S_AI: 📜Why Are Linear RNNs More Parallelizable? We identify expressivity differences between linear/nonlinear RNNs and, conversely, barriers to parallelizing nonlinear RNNs 🧵👇
William Merrill tweet media
4 replies · 20 reposts · 143 likes · 10.2K views
Shiv
Shiv@TensorTunesAI·
Recently trained a small transformer (~472M tokens) with BPE, RoPE, GQA, etc., and now I'm exploring applying similar ideas to Indian classical music. Specifically, I'm looking at representing ragas as sequence data (starting with MIDI, possibly moving to audio later). Curious whether transformers can actually capture deeper raga structure: not just note sequences, but progression, mood, and inherent constraints. Do you think this is something transformers can learn with scale, or would it require a different modeling approach / inductive bias? Or should I focus on classic ML: CNN / RNN / LSTM? Would love your perspective. x.com/i/status/20331…
0 replies · 0 reposts · 0 likes · 20 views
Mayank Mishra
Mayank Mishra@MayankMish98·
Big news! 🎉 TPU support for pretraining is now live on lm-engine, powered by PyTorch-XLA. Faster, scalable training is just a clone away: github.com/open-lm-engine… (tested on TPU v6e)
1 reply · 2 reposts · 13 likes · 904 views
Ramakrishna kompella
Ramakrishna kompella@jojokompella·
I did some tests myself, putting them out soon. I expected it to be significantly better than the competition for Indian languages. For lower-resource ones, it is, but not for high-resource ones. Sarvam 30B is not significantly worse than 105B, though.
nullptr@resetptr

Ran some quick weekend experiments on @SarvamAI's 105B model on a subset of the IndicMMLU-Pro dataset. Sarvam's model is really good at reasoning efficiency: it uses ~2.5x fewer tokens to reach ~the same accuracy.

2 replies · 0 reposts · 2 likes · 594 views