AmitP

174 posts

@amitp_ai

Chip Designer using AI

San Jose, CA · Joined April 2021

235 Following · 41 Followers

AmitP@amitp_ai·Dec 20

@techNmak LLM

Tech with Mak@techNmak·Dec 19

These are literally the kind of LLM interview questions most candidates wish they had seen earlier. A curated list of LLM interview questions, shared by Hao Hoang. Want this doc? Follow @techNmak and comment “LLM” - I’ll send it over.

1.4K

497

4.2K

409.6K

AmitP retweeted

Pedro Domingos@pmddomingos·Oct 7

Schmidhuber's revenge.

1.1K

152.9K

AmitP retweeted

PyQuant News 🐍@pyquantnews·Oct 4

Free YouTube channel for deep learning from UC Berkeley. Learn deep learning with end-to-end projects:

228

1.4K

82.5K

AmitP retweeted

Neo Kim@systemdesignone·Oct 4

Good engineers never stop learning. Here are some newsletters curating important resources each week:

328

51.2K

AmitP retweeted

Nathan Lambert@natolambert·Oct 3

New paper from Nvidia's alignment team. These are always worth reading, right up there with Llama for post training insights. Focuses on different types of reward model training with HelpSteer2 data.

294

29K

AmitP retweeted

François Chollet@fchollet·Oct 3

Interesting work on reviving RNNs: arxiv.org/abs/2410.01201. In general, the fact that many recent architectures coming from different directions roughly match Transformers is proof that architectures aren't fundamentally important in the curve-fitting paradigm (aka deep learning). Curve-fitting is about embedding a dataset on a curve. The critical factor is the dataset, not the specific hard-coded bells and whistles that constrain the curve's shape. As long as your curve is sufficiently expressive, all architectures will converge to the same performance in the large-data regime.

253

1.9K

222.5K

AmitP retweeted

Soumith Chintala@soumithchintala·Oct 2

There are three parts:

1. Fitting as large a network and as large a batch size as possible onto the 10k/100k/1M H100s, by parallelizing and using memory-saving tricks.
2. Communicating state between these GPUs as quickly as possible.
3. Recovering from failures (hardware, software, etc.) as quickly as possible.

1. Fitting as large a network and as large a batch size as possible onto the 10k H100s

Parallelizing:
1. Parallelize over batches.
2. Parallelize over layers (i.e. split a layer across GPUs).
3. Parallelize across layers (i.e. layers 1 to N are on GPU1, layers N+1 to N+10 are on GPU2).

Keep parallelizing until you are able to use all GPUs well, with maximum utilization.

Checkpointing / compute vs. memorize:
* You need to save certain terms from the forward pass to compute the backprop (save_for_backward). However, if the network is sufficiently large, it is more profitable to free these terms in order to fit a larger batch size, and recompute them when you need them to compute the backprop.
* Tricks like FSDP discard parts of weights that are held in one GPU (to save memory), and ask for the shards of weights from other GPUs right before they need them.

2. Communicating state between these GPUs as quickly as possible

Communication overlap: When you need to communicate among GPUs, try to start communication as soon as you can.
* Example: when the Nth layer is done with backward, and while the (N-1)th layer is computing backward, all GPUs with an Nth layer can all-reduce their gradients.

Discover and leverage the underlying networking topology: Communicating large amounts of state (gradients, optimizer state) across multiple nodes is complicated. With sync SGD, you have to communicate this state in a burst, as quickly as you can. We might have multiple layers of switches, and have RDMA (the ability to copy GPU memory directly to the NIC, bypassing CPU RAM entirely), and have frontend and backend NICs (the frontend connects to storage like NFS, the backend connects GPUs to other GPUs in the cluster). So it's important to leverage all this info when running communication collectives like all-reduce or scatter/gather. All-reduce, for example, can be done in log(n) steps if you tree-reduce, and the constant factors, which change based on the type of fiber connecting one node to another in the tree of networking fiber, are important for reducing overall time and latency. Libraries like NCCL do sophisticated discovery of the underlying networking topology and leverage it when we run all-reduce and other collectives.

3. Recovering from failures (hardware, software, etc.) as quickly as possible

At 10k GPU scale, things fail all the time: GPUs, NICs, cables, etc. Some of these failures are easy to detect quickly; some you can only detect because one node isn't replying in time (say a NCCL all-reduce is stuck). We build various tools to monitor and detect fleet health, and remove failed nodes from the fleet as quickly as possible. This is quite hard.

Separately, at this large a scale you can have silent data corruptions from memory bits flipping randomly (due to basic physics, with the probability amplified at this scale), and you suddenly have loss explosions for no reason other than this random phenomenon. These happen at small scale too, but so infrequently that you barely notice. This is very hard to detect beforehand in software. Some hardware has built-in circuitry that computes checksums after it computes things, so that if bit flips occur the hardware can throw an interrupt. H100s and previous NVIDIA GPUs don't have this feature.

To counter all these failures, you want to save your model state as frequently and as quickly as you can, and when a failure occurs, you want to recover and continue as quickly as you can. Usually, we save model state really quickly to CPU memory in a separate thread, and in the background we save from CPU memory to disk or remote storage. We also save model state in shards (this is torch.distributed's checkpointing feature), i.e. not every GPU needs to save all of the model weights; each GPU only needs to save a portion of the weights, and it can recover the rest of the weights from other GPUs' shard checkpoints.
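The checkpointing pattern in the last paragraph (a fast in-memory snapshot, a slow write in the background, and one shard per rank) can be sketched in plain Python. This is a toy illustration under stated assumptions, not torch.distributed's checkpointing API: the helper names `shard_keys`, `async_save_shard`, and `load_all_shards`, the round-robin sharding rule, and the JSON file layout are all hypothetical, and plain dicts of floats stand in for GPU tensors.

```python
import copy
import json
import os
import tempfile
import threading


def shard_keys(keys, rank, world_size):
    """Round-robin assignment: each rank persists only its slice of the parameter names."""
    return sorted(keys)[rank::world_size]


def async_save_shard(state, rank, world_size, out_dir):
    """Snapshot this rank's shard in memory, then write it to disk in the background."""
    # Stage 1 (fast, blocks training briefly): copy the shard into host memory.
    snapshot = {k: copy.deepcopy(state[k]) for k in shard_keys(state, rank, world_size)}

    def _write():
        # Stage 2 (slow, off the critical path): persist the snapshot.
        tmp_path = os.path.join(out_dir, f"shard_{rank}.json.tmp")
        with open(tmp_path, "w") as f:
            json.dump(snapshot, f)
        # Atomic rename so a crash never leaves a half-written shard file behind.
        os.replace(tmp_path, os.path.join(out_dir, f"shard_{rank}.json"))

    writer = threading.Thread(target=_write)
    writer.start()
    return writer


def load_all_shards(world_size, out_dir):
    """Rebuild the full state by merging every rank's shard file."""
    merged = {}
    for rank in range(world_size):
        with open(os.path.join(out_dir, f"shard_{rank}.json")) as f:
            merged.update(json.load(f))
    return merged


# Toy "model state": parameter name -> list of floats.
state = {f"layer{i}.weight": [float(i), float(i) + 0.5] for i in range(6)}
world_size = 3
ckpt_dir = tempfile.mkdtemp()

writers = [async_save_shard(state, r, world_size, ckpt_dir) for r in range(world_size)]
state["layer0.weight"][0] = 999.0  # training keeps mutating the live state
for w in writers:
    w.join()

restored = load_all_shards(world_size, ckpt_dir)
```

Because the snapshot is taken before the background thread starts, the later mutation of `state` does not leak into the saved checkpoint, which is the property that lets training continue while the write is in flight.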

167

1.3K

241.6K

AmitP retweeted

Alex Xu@alexxubyte·Oct 1

Big O Notation 101: The Secret to Writing Efficient Algorithms

744

4.3K

435.3K

AmitP retweeted

Aritra 🤗@ariG23498·Oct 1

I love @GoogleColab because I can do dirty pip installs and then delete runtime once I am done. To do the same on my system:
* python -m venv .venv
* source .venv/bin/activate
* pip install <>

Do not tell me I am the only one!
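The same throwaway-environment workflow can be scripted with Python's standard `venv` module. This is a minimal sketch, assuming a temporary directory as the project root; `with_pip=False` keeps the example fast and offline, so a real workflow that wants to run `pip install` would pass `with_pip=True` instead.

```python
import shutil
import tempfile
import venv
from pathlib import Path

# Hypothetical scratch location standing in for a project directory.
workdir = Path(tempfile.mkdtemp())
env_dir = workdir / ".venv"

# Equivalent of `python -m venv .venv`, without bootstrapping pip.
venv.create(env_dir, with_pip=False)
created = (env_dir / "pyvenv.cfg").exists()

# The "delete runtime" step: throw the whole environment away.
shutil.rmtree(workdir)
removed = not env_dir.exists()
```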

189

11.9K

AmitP retweeted

Stas Bekman@StasBekman·Sep 30

This is a pretty awesome simple step-by-step guide showing you how to build your own PyTorch (a subset of ops supported) which requires just basic knowledge of C/C++/Python. towardsdatascience.com/recreating-pyt… The reason to walk through it is to better understand how some of the common PyTorch ops work. I highly recommend the read.

467

21.3K

AmitP retweeted

Max Weinbach@mweinbach·Sep 24

@lafaiel So I noticed this on a security update. Once I uninstalled, I got parity between battery and wall; it should be fixed soon with an update.

620

AmitP retweeted

Bojan Tunguz@tunguz·Sep 25

I just got a copy of “Large Language Models: A Deep Dive.” I’ve been planning for a while to do just that with LLMs - delve deeper. ;) This book seems like an excellent, up-to-date (as much as that is possible these days) overview of this fascinating and important subject. Thanks Uday Kamath for sending this one to me! amzn.to/4ewmzql #AI #GenAI #LLM #LLMs

82.3K

AmitP retweeted

himanshu@himanshustwts·Sep 23

working of RLHF. had fun learning the core, can't wait to write on this. preparing myself to decode anthropic's 'constitutional ai' paper.

281

14.3K

AmitP retweeted

Jakub Tomczak@jmtomczak·Sep 20

🎊 It has arrived 🎊, the 2nd edition of my "Deep Generative Modeling" book. It has 100 new pages, 3 new chapters (incl. #LLMs) and new sections. It covers all deep generative models that constitute the core of all #GenerativeAI techs! Check it out: 💻tinyurl.com/mwj9dw83

555

44.3K

AmitP retweeted

Philipp Schmid@_philschmid·Sep 20

Is this @OpenAI o1's missing secret? @GoogleDeepMind developed a multi-turn chain-of-thought online reinforcement learning (RL) approach, SCoRe, to improve self-correction using entirely self-generated data. SCoRe achieves state-of-the-art self-correction, improving performance by 15.6% and 9.1% on MATH and HumanEval, respectively. 👀

Self-Correction via Reinforcement Learning (SCoRe) trains a single model that can both produce a response to a reasoning problem and correct its errors without receiving any oracle feedback, entirely by training on self-generated data.

Implementation:
0️⃣ Select a pre-trained LLM (e.g., Gemini 1.0 or 1.5 Flash) as the base model for self-correction enhancement, and collect an initial set of training tasks.
1️⃣ SCoRe Stage 1: Use RL (REINFORCE) to train the model to produce high-reward revisions, but force it not to change the first attempt using KL-divergence. This decouples the distribution of the first and second attempts.
2️⃣ SCoRe Stage 2: Remove the restriction on changing the first attempt and train both attempts toward optimizing the reward, including a shaped reward to maximize self-correction (higher reward for traces that flip correctness from the first to the second attempt).

Insights:
🚫 Supervised fine-tuning (SFT) on offline model-generated correction traces is insufficient for instilling self-correction behavior.
⚖️ Stage 1 uses RL, since SFT leads to a model that is only good at correcting or at reasoning.
🔄 The REINFORCE algorithm was used during both stages.
🔑 On-policy sampling was crucial for successful multi-turn self-correction.
📈 Improved MATH by 15.6% and HumanEval by 9.1% over the base model.
⏱️ Single-turn training improved initial performance but did not enhance self-correction in subsequent turns.
📉 Replacing REINFORCE with STaR resulted in lower performance.
➕ Combining SCoRe with inference-time scaling (maj@32) led to a 10.5% improvement.

Paper: huggingface.co/papers/2409.12…

Great work, and kudos for publishing the research! 🤗
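The Stage 2 shaped reward described above could be sketched as follows. The post says only that traces which flip correctness from the first to the second attempt get a higher reward; the additive bonus form, the `alpha` weight, and the function name `shaped_reward` are assumptions made for illustration, not details taken from the post.

```python
def shaped_reward(first_correct: bool, second_correct: bool, alpha: float = 1.0) -> float:
    """Toy shaped reward for a two-attempt trace.

    Base reward: 1.0 if the second (final) attempt is correct, else 0.0.
    Bonus: alpha * (second - first), which is positive when the trace flips
    from wrong to right and negative when it flips from right to wrong.
    """
    base = 1.0 if second_correct else 0.0
    progress = float(second_correct) - float(first_correct)
    return base + alpha * progress


# Enumerate all four traces (first attempt, second attempt).
rewards = {
    (first, second): shaped_reward(first, second)
    for first in (False, True)
    for second in (False, True)
}
```

Under this assumed form, a wrong-to-right trace scores higher than one that was right both times, and a right-to-wrong trace is penalized, which is the ordering the shaped reward is meant to encourage.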

122

639

54.3K

AmitP retweeted

Scholarship for PhD@ScholarshipfPhd·Sep 16

AmitP retweeted

Tomasz Gawroński@GawroskiT·Sep 11

Comparison of Lunar Lake vs. Z1 Extreme handheld. These 4 Lion Cove cores and 4 Skymont E-cores are almost 1.5-2x faster than the 8-core/16-thread Zen 4 at a low 30 W power.

AmitP@amitp_ai·Sep 9

@G_melo_ding @SaintjohnD @harukaze5719 Thanks but I think notebookcheck is measuring the total system power and not just the cpu package power for those tests… so it’s different

Game.Keeps.Loading@G_melo_ding·Sep 9

@amitp_ai @SaintjohnD @harukaze5719 notebookcheck.net/AMD-Zen-5-Stri…

포시포시@harukaze5719·Sep 9

Intel Lunar Lake Power XPS 13 youtu.be/AAKS3nV6QLE?si… youtu.be/xtkC4OD8iAs?si…

YouTube

121

20.4K

AmitP retweeted

Atul Kumar@atulkumarzz·Sep 8

500 TB Tutorials + Books + Courses + Trainings + Workshops
- Data science
- Python
- AI
- Cloud
- BIG DATA
- Data Analytics
- BI
- Google Cloud Training
- Machine Learning
- Deep Learning
- Ethical Hacking

To get it, just:
- Follow me
- Like & RT it
- Comment "Free"

1.1K

481

1.3K

284.1K
