Outcome School

1.6K posts

@outcome_school

Get a High-Paying Tech Job. Software engineers like you join Outcome School to achieve one outcome: a high-paying tech job.

Internet · Joined June 2024

2 Following · 1K Followers

Pinned Tweet

Outcome School@outcome_school·Apr 5

Our recent 6 articles on X:
- KV Cache in LLMs
- Paged Attention in LLMs
- Causal Masking in Attention
- Byte Pair Encoding in LLMs
- Harness Engineering in AI
- Math behind Attention - Q, K, and V

X is a knowledge sharing platform.

Amit Shekhar@amitiitbhu

x.com/i/article/2039…

538

Outcome School@outcome_school·7h

64 heads × 100K tokens = massive KV Cache. That is Multi-Head Attention during long-context inference. GQA with 8 groups cuts that by 8x. Same model. Same GPU. 8x more room.

Amit Shekhar@amitiitbhu

x.com/i/article/2046…

562
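
The arithmetic behind the 8x figure can be sketched roughly as below. Only the 64 heads, 100K tokens, and 8 groups come from the post; the head dimension, layer count, and fp16 storage are assumptions picked for illustration.

```python
def kv_cache_bytes(num_kv_heads, seq_len, head_dim=128, num_layers=80, bytes_per_value=2):
    """Bytes needed to cache keys and values for one sequence.

    head_dim, num_layers and bytes_per_value (fp16) are assumed values,
    not numbers taken from the post.
    """
    # Factor 2: one cached tensor for keys and one for values.
    return 2 * num_layers * num_kv_heads * seq_len * head_dim * bytes_per_value


mha = kv_cache_bytes(num_kv_heads=64, seq_len=100_000)  # every query head has its own K/V head
gqa = kv_cache_bytes(num_kv_heads=8, seq_len=100_000)   # 8 groups, one K/V head per group

print(f"MHA: {mha / 2**30:.1f} GiB")
print(f"GQA: {gqa / 2**30:.1f} GiB")
print(f"Reduction: {mha // gqa}x")
```

The reduction factor depends only on the ratio of K/V heads (64 / 8), so it stays at 8x whatever values are assumed for the other parameters.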

Outcome School retweeted

Pallavi@pallavishekhar_·9h

Multi-Head Attention was designed for quality. Multi-Query Attention was designed for speed. GQA was designed for both.

Amit Shekhar@amitiitbhu

x.com/i/article/2046…

444
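
One way to see the three variants as points on a single dial is to look at which key/value head each query head reads from. This is a minimal sketch; `kv_head_for` and the head counts are hypothetical names and sizes chosen for illustration.

```python
def kv_head_for(query_head, num_query_heads, num_kv_heads):
    """Index of the key/value head that a given query head reads from."""
    if num_query_heads % num_kv_heads != 0:
        raise ValueError("num_query_heads must be divisible by num_kv_heads")
    group_size = num_query_heads // num_kv_heads
    return query_head // group_size


NUM_QUERY_HEADS = 8  # illustrative size

# MHA: one K/V head per query head. MQA: a single shared K/V head. GQA: in between.
for name, num_kv_heads in [("MHA", 8), ("GQA", 2), ("MQA", 1)]:
    mapping = [kv_head_for(h, NUM_QUERY_HEADS, num_kv_heads) for h in range(NUM_QUERY_HEADS)]
    print(name, mapping)
```

Fewer distinct K/V heads means less to cache and move (the speed side); more of them means each query head gets more specialized keys and values (the quality side).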

Outcome School retweeted

Amit Shekhar@amitiitbhu·10h

x.com/i/article/2046…

104

5.2K

Outcome School retweeted

Amit Shekhar@amitiitbhu·15h

Why do frameworks fuse softmax and cross-entropy into one operation? Because exp() of large numbers overflows, and log() of tiny numbers underflows. The fused version avoids both. Math that works on paper can break on hardware.

Amit Shekhar@amitiitbhu

x.com/i/article/2046…

1.3K
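
A small pure-Python sketch of the difference, assuming the fused version is built on the log-sum-exp trick (a common approach, though individual frameworks may implement it differently). The function names are hypothetical.

```python
import math


def naive_cross_entropy(logits, target):
    """Textbook two-step version: softmax first, then -log."""
    exps = [math.exp(z) for z in logits]   # overflows for large logits
    total = sum(exps)
    probs = [e / total for e in exps]
    return -math.log(probs[target])        # fails if the probability rounds to 0


def fused_cross_entropy(logits, target):
    """Same quantity via log-sum-exp, without materializing probabilities."""
    m = max(logits)                        # shift so the largest exponent is 0
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]


print(fused_cross_entropy([2.0, 1.0, 0.1], 0))  # agrees with the naive version
print(fused_cross_entropy([1000.0, 0.0], 1))    # naive version raises OverflowError here
print(fused_cross_entropy([0.0, -1000.0], 1))   # naive version fails on log(0) here
```

Subtracting the maximum logit keeps every exponent at or below zero, so `exp()` cannot overflow, and the logarithm is taken of a sum that is at least 1, so it never sees zero.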

Outcome School@outcome_school·17h

Chain-of-Thought vs Direct Answer. Same question. Two different ways to ask the LLM.

GIF

250
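
As a rough illustration of the two ways of asking, here is a sketch that builds both prompts for the same question. The wording of each instruction and the sample question are assumptions, not text from the GIF.

```python
def direct_prompt(question):
    """Ask for the answer only (illustrative wording)."""
    return f"{question}\nAnswer with the final result only."


def chain_of_thought_prompt(question):
    """Ask the model to reason before answering (illustrative wording)."""
    return f"{question}\nThink through the problem step by step, then give the final result."


question = "A train travels 120 km in 1.5 hours. What is its average speed?"
print(direct_prompt(question))
print()
print(chain_of_thought_prompt(question))
```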

Outcome School retweeted

Amit Shekhar@amitiitbhu·1d

Cross-entropy penalizes hedging moderately and confident mistakes severely. Saying ‘33% each’ when unsure is okay. Saying ‘97% wrong answer’ is catastrophic. The log makes this distinction automatic.

Amit Shekhar@amitiitbhu

x.com/i/article/2046…

5.4K
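
The two cases can be put into numbers with a short sketch. It assumes three classes, the natural log, and that the model putting 97% on a wrong answer splits the remaining 3% evenly; none of these details are stated in the post.

```python
import math


def cross_entropy(probs, target):
    """Loss for one example: -log of the probability given to the correct class."""
    return -math.log(probs[target])


target = 0  # index of the correct class

hedged = [1 / 3, 1 / 3, 1 / 3]           # "33% each"
confident_wrong = [0.015, 0.97, 0.015]   # "97% wrong answer", rest split evenly (assumption)

print(f"hedged:          {cross_entropy(hedged, target):.2f}")
print(f"confident wrong: {cross_entropy(confident_wrong, target):.2f}")
```

Because the loss is a logarithm of the probability on the correct class, it grows without bound as that probability approaches zero, which is why the confident mistake costs so much more than the hedge.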

Outcome School retweeted

Pallavi@pallavishekhar_·1d

The Multi-Agent Handoff

One agent cannot do everything well. Hence, specialists come into the picture. Multi-Agent Handoff = delegate + work + return.

Here is the flow:
- The user asks Agent A a question.
- Agent A plans and decides to hand the work off.
- Agent A delegates the subtask to Agent B.
- Agent B does the work.
- Agent B returns the result to Agent A.
- Agent A returns the final answer to the user.

Agent A is the coordinator. Agent B is the specialist. Each one does what it is best at. That is the power of the handoff pattern.

Use case: A research agent hands off math problems to a math specialist agent.

Watch the GIF for the full flow.

GIF

472
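
The delegate + work + return flow can be sketched without any LLM at all. The class names, the `"calculate "` routing rule, and the tiny arithmetic parser below are hypothetical stand-ins for real agents.

```python
import operator

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul, "/": operator.truediv}


class MathAgent:
    """Specialist (Agent B): only knows how to do simple arithmetic."""

    def run(self, subtask):
        left, op, right = subtask.split()
        return OPS[op](float(left), float(right))


class ResearchAgent:
    """Coordinator (Agent A): talks to the user and decides when to hand off."""

    def __init__(self, specialist):
        self.specialist = specialist

    def run(self, question):
        prefix = "calculate "
        if question.lower().startswith(prefix):    # plan: this is a math subtask
            subtask = question[len(prefix):]       # delegate to Agent B
            result = self.specialist.run(subtask)  # Agent B works and returns
            return f"The answer is {result:g}."    # Agent A answers the user
        return "I can answer that myself."


agent = ResearchAgent(specialist=MathAgent())
print(agent.run("Calculate 12 * 7"))
```

The user only ever talks to the coordinator; the specialist never sees the original question, just the subtask it was handed.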

Outcome School retweeted

Amit Shekhar@amitiitbhu·1d

Cross-entropy answers the simplest question in machine learning: how surprised are you by the correct answer? Not surprised at all? Low loss. Very surprised? High loss. -log(p) quantifies surprise.

Amit Shekhar@amitiitbhu

x.com/i/article/2046…

5.1K
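
Written out, assuming a one-hot target $y$ and predicted probabilities $p$ (the symbols are not from the post), the general cross-entropy collapses to the single surprise term:

```latex
\[
  H(y, p) = -\sum_{i} y_i \log p_i = -\log p_{\text{correct}}
\]
\[
  p_{\text{correct}} = 1 \;\Rightarrow\; \text{loss} = 0,
  \qquad
  p_{\text{correct}} \to 0 \;\Rightarrow\; \text{loss} \to \infty
\]
```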

Outcome School retweeted

Amit Shekhar@amitiitbhu·2d

The next decade belongs to three domains:
• Math & research in AI
• Data centers
• Robotics

Pick one. Go deep.

267

9.5K

Outcome School retweeted

Amit Shekhar@amitiitbhu·2d

Physics tells you what's happening. Math tells you why it had no choice.

1.1K

Outcome School retweeted

Pallavi@pallavishekhar_·2d

A model that says '97% cat' when the answer is dog gets punished about 5x harder than a model that says '50% cat' (a loss of about 3.51 versus 0.69, assuming two classes and the natural log). Cross-entropy does not just penalize mistakes. It destroys overconfidence.

Amit Shekhar@amitiitbhu

x.com/i/article/2046…

Outcome School@outcome_school·2d

Cross-Entropy Loss is just one question: how much probability did you give to the right answer? High probability = small loss. Low probability = big loss. That is the entire idea.

Amit Shekhar@amitiitbhu

x.com/i/article/2046…

137

Outcome School retweeted

Amit Shekhar@amitiitbhu·2d

x.com/i/article/2046…

247

25K

Outcome School retweeted

Pallavi@pallavishekhar_·2d

Causal Masking in Attention outcomeschool.com/blog/causal-ma…

1.3K

Outcome School retweeted

Pallavi@pallavishekhar_·2d

The feed-forward network is the most underrated part of the Transformer. It holds most of the parameters, stores most of the knowledge, and runs on every single token. Yet we barely talk about it.

Amit Shekhar@amitiitbhu

x.com/i/article/2043…
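
A back-of-the-envelope count for one Transformer layer illustrates the parameter claim. The sketch assumes the common layout of four attention projections and a two-layer feed-forward network with a hidden width of 4 × `d_model`, and ignores biases, layer norms, and embeddings; real models vary.

```python
def layer_param_counts(d_model, d_ff=None):
    """Weight counts for one layer's attention block and feed-forward block."""
    d_ff = 4 * d_model if d_ff is None else d_ff   # 4x expansion is an assumption
    attention = 4 * d_model * d_model              # W_q, W_k, W_v, W_o
    feed_forward = 2 * d_model * d_ff              # up-projection and down-projection
    return attention, feed_forward


attn, ffn = layer_param_counts(d_model=4096)
print(f"attention:    {attn:,}")
print(f"feed-forward: {ffn:,}")
print(f"FFN share of layer weights: {ffn / (attn + ffn):.0%}")
```

Under these assumptions the feed-forward block holds twice as many weights as attention, about two thirds of each layer.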

Outcome School retweeted

Amit Shekhar@amitiitbhu·3d

I've already decoded a few AI research papers and published articles on them: outcomeschool.com/blog Currently working on decoding more.

Amit Shekhar@amitiitbhu

x.com/i/article/2045…

114

10.2K

Outcome School@outcome_school·3d

AI Research Papers tell you what didn’t work and why. That’s often more valuable than knowing what did.

Amit Shekhar@amitiitbhu

x.com/i/article/2045…

2.2K

Outcome School@outcome_school·3d

Every Transformer runs on one quiet square root. Not tuned, not guessed, but derived. That's the kind of thing that makes math feel like poetry.

Amit Shekhar@amitiitbhu

x.com/i/article/2040…

3.1K
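
Assuming the square root in question is the $\sqrt{d_k}$ of scaled dot-product attention, the usual derivation goes as follows. It treats the components of a query $q$ and a key $k$ as independent with zero mean and unit variance, which is a modeling assumption rather than something stated in the post.

```latex
\[
  q \cdot k = \sum_{i=1}^{d_k} q_i k_i,
  \qquad
  \mathbb{E}[q_i k_i] = 0,
  \qquad
  \operatorname{Var}(q_i k_i) = \mathbb{E}[q_i^2]\,\mathbb{E}[k_i^2] = 1
\]
\[
  \operatorname{Var}(q \cdot k) = d_k
  \quad\Longrightarrow\quad
  \operatorname{Var}\!\left(\frac{q \cdot k}{\sqrt{d_k}}\right) = 1
\]
```

Dividing by $\sqrt{d_k}$ keeps the scores at unit variance regardless of the head dimension, so the softmax does not saturate as $d_k$ grows.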

Outcome School@outcome_school·3d

Q × Kᵀ tells the model how relevant every word is to every other word. Softmax turns that into probabilities. V delivers the actual content. One formula. Three steps. The entire foundation of modern AI.

Amit Shekhar@amitiitbhu

x.com/i/article/2039…

5.7K
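
The three steps fit in a few lines of plain Python. This sketch uses nested lists for the matrices and includes the $1/\sqrt{d_k}$ scaling from the standard formulation, which the post does not mention; the toy inputs are made up.

```python
import math


def softmax(row):
    """Turn a row of scores into probabilities that sum to 1."""
    m = max(row)                               # shift for numerical stability
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Scaled dot-product attention over lists of row vectors."""
    d_k = len(K[0])
    # Step 1: Q x K^T, the relevance of every token to every other token.
    scores = [[sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K] for q in Q]
    # Step 2: softmax turns each row of scores into probabilities.
    weights = [softmax(row) for row in scores]
    # Step 3: V delivers the content as a weighted sum of value rows.
    return [[sum(w * v[j] for w, v in zip(row, V)) for j in range(len(V[0]))]
            for row in weights]


Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0]]
out = attention(Q, K, V)
print(out)
```

With identity-like values, each output row is just the attention weights, so the first token puts more weight on itself than on the second token.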

Outcome School@outcome_school·3d

CNNs had a head start - built-in knowledge that nearby pixels are related. ViT starts from zero. But starting from zero with enough data means no limits on what it can learn.

Amit Shekhar@amitiitbhu

x.com/i/article/2044…

1.9K
