AAAAAAAAAA

112 posts


AAAAAAAAAA

@aaaaaaaaaaorg

a lab focused on making research more accessible.

www · Joined August 2023
6 Following · 2.1K Followers
AAAAAAAAAA retweeted
naklecha @naklecha
i’m building something new for teams that want to automate 100s of tiny research experiments on gpus (think @karpathy’s auto research). if you want to try out an early version, reach out to me!
AAAAAAAAAA retweeted
naklecha @naklecha
for people who run experiments on gpus: i built a cli tool that gives coding agents access to cloudy's infra (~2x cheaper than other sandbox / serverless clouds). with this cli, you can ask claude to: "finetune kimi k2 on 64 h100s & to save money test your finetune on 8 h100s"
AAAAAAAAAA retweeted
naklecha @naklecha
introducing simple-llm: a ~950 line, powerful & extensible inference engine that performs on par with vllm. enjoy :)
performance (gpt-oss-120b, on an h100):
- batch=1: 135 tok/s (vllm: 138)
- batch=64: 4,041 tok/s (vllm: 3,846)
github.com/naklecha/simpl…
AAAAAAAAAA retweeted
Jacob Rintamaki @jacobrintamaki
THE FINAL OFFSHORING

If you try to look to the past to understand the future of automation, the future is bleak. On one hand, if we don’t go forward with robotics, then we’ll likely enter a stagnant, Malthusian hellscape of a world. But on the other hand, if we do go forward with robotics, then society might not survive the transition intact.

If these are the two default futures, then it's no wonder so many people are pessimistic. Is there an option to just have neither of these, please?

Yes, actually. But it won’t be easy. It won’t be simple. And it will require reading a lot of footnotes.
AAAAAAAAAA retweeted
naklecha @naklecha
saturday night random idea: i'm vibe reimplementing vllm into a single & simple inference file + a folder with custom kernels.
rules:
- allowed libs: torch, numpy, flash-attn
- must match vllm's gpt-oss-120b inference speed
- can optimise for the model and hardware (h100)
AAAAAAAAAA retweeted
naklecha @naklecha
Introducing "public volumes" on Cloudy - with this update, open-source projects can share as reproducible sandboxes that can be forked and mounted on GPU instances within seconds. Share, fork & mount 100TB+ sandboxes seamlessly with Cloudy. Here is a demo:
AAAAAAAAAA retweeted
Andrej Karpathy @karpathy
Excited to release new repo: nanochat! (it's among the most unhinged I've written).

Unlike my earlier similar repo nanoGPT which only covered pretraining, nanochat is a minimal, from-scratch, full-stack training/inference pipeline of a simple ChatGPT clone in a single, dependency-minimal codebase. You boot up a cloud GPU box, run a single script, and in as little as 4 hours you can talk to your own LLM in a ChatGPT-like web UI.

It weighs ~8,000 lines of imo quite clean code to:
- Train the tokenizer using a new Rust implementation
- Pretrain a Transformer LLM on FineWeb, evaluate CORE score across a number of metrics
- Midtrain on user-assistant conversations from SmolTalk, multiple choice questions, tool use
- SFT, evaluate the chat model on world knowledge multiple choice (ARC-E/C, MMLU), math (GSM8K), code (HumanEval)
- RL the model optionally on GSM8K with "GRPO"
- Run efficient inference on the model in an Engine with KV cache, simple prefill/decode, and tool use (Python interpreter in a lightweight sandbox); talk to it over CLI or a ChatGPT-like WebUI
- Write a single markdown report card, summarizing and gamifying the whole thing

Even for as low as ~$100 in cost (~4 hours on an 8XH100 node), you can train a little ChatGPT clone that you can kind of talk to, and which can write stories/poems and answer simple questions. About ~12 hours surpasses the GPT-2 CORE metric. As you further scale up towards ~$1000 (~41.6 hours of training), it quickly becomes a lot more coherent and can solve simple math/code problems and take multiple choice tests. E.g. a depth-30 model trained for 24 hours (about equal to the FLOPs of GPT-3 Small 125M, and 1/1000th of GPT-3) gets into the 40s on MMLU, the 70s on ARC-Easy, the 20s on GSM8K, etc.

My goal is to get the full "strong baseline" stack into one cohesive, minimal, readable, hackable, maximally forkable repo. nanochat will be the capstone project of LLM101n (which is still being developed). I think it also has potential to grow into a research harness, or a benchmark, similar to nanoGPT before it. It is by no means finished, tuned or optimized (actually I think there's likely quite a bit of low-hanging fruit), but I think the overall skeleton is ok enough that it can go up on GitHub where all the parts of it can be improved. Link to repo and a detailed walkthrough of the nanochat speedrun is in the reply.
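One step in the pipeline above that is easy to show in miniature is the "GRPO" RL step: GRPO scores a group of sampled answers to the same prompt by their reward relative to the group's own statistics. A hedged toy sketch follows (group-relative advantages only, no policy update; the numbers and helper are illustrative, nothing here is nanochat's actual code):

```python
# toy sketch of the group-relative scoring at the heart of GRPO:
# sample several answers to one prompt, then turn raw rewards into
# advantages by normalizing against the group's mean and std.

def group_advantages(rewards):
    """advantage_i = (r_i - mean(r)) / std(r), computed per group."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5 or 1.0          # guard the all-equal-rewards case
    return [(r - mean) / std for r in rewards]

# e.g. 4 sampled answers to one GSM8K problem, reward 1.0 = correct
rewards = [1.0, 0.0, 0.0, 1.0]
advs = group_advantages(rewards)
# correct answers get positive advantage, wrong ones negative,
# and the group's advantages always average to zero
```

Because the baseline is the group mean rather than a learned value function, this is the piece that lets GRPO skip training a critic.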
AAAAAAAAAA retweeted
naklecha @naklecha
Announcing Cloudy, a platform that seamlessly handles your training infrastructure. You can rent a single H100 or a cluster of 1000 H100s, manage petabyte-scale storage volumes & seamlessly go from running experiments to managing large scale training runs on a single interface.
AAAAAAAAAA retweeted
naklecha @naklecha
today, i'm shutting down @aaaaaaaaaaorg. although i deeply believe in a10's mission, the only right way to build a10 would be to stay open, focus on education & stay non-profit. it will make a comeback someday, self-funded. until then, i'm focused on making fuck you money.
AAAAAAAAAA retweeted
naklecha @naklecha
i'm working on an open-source repo that teaches people llm inference optimisations. rn the fastest files on the repo are at 6.5-7k tps, as compared to vllm's 10k for the same batch size. i added a c++ implementation today. feel free to try it out (wip): github.com/naklecha/llm-i…
AAAAAAAAAA retweeted
naklecha @naklecha
i'm working on a repository that implements increasingly complex llm inference optimizations as zero-abstraction, single-file implementations. to start, i put up 4 files on the repository that implement llama3.2 1b at 15tps, 60tps, 180tps, and 1385tps on my 4090.

the project will end when i am close to vllm's performance (7218 tps on my testset). then, i will do the same thing again with multiple gpus and implement parallel inference (maybe on a larger model). once i'm done with that, i'll write a blog / make a video explaining it all.

i have a few more working optimisations and a few buggy ones that i'll fix and push to the repository this week. i wanted to get the ball rolling & set up the repository for you guys to check out. feedback/thoughts are welcome :)
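The repo itself isn't quoted here, but a progression like this classically starts with KV caching. As a hedged illustration (toy code, fake "keys", op counting only; none of this is the repo's actual implementation):

```python
# toy sketch of KV caching, the classic first llm inference
# optimization. "keys" are fake 4-dim vectors; we only count how
# many key projections each decoding strategy has to build.

def make_key(token):
    # stand-in for the (expensive) key projection of one token
    return [float(token)] * 4

def decode_naive(prompt, steps):
    """rebuild every token's key on every decode step."""
    seq = list(prompt)
    key_builds = 0
    for _ in range(steps):
        keys = [make_key(t) for t in seq]   # recomputed from scratch
        key_builds += len(keys)
        seq.append(len(seq))                # fake "sampled" next token
    return seq, key_builds

def decode_cached(prompt, steps):
    """build each token's key once and keep it in a cache."""
    seq = list(prompt)
    cache = [make_key(t) for t in seq]      # prompt keys, computed once
    key_builds = len(cache)
    for _ in range(steps):
        nxt = len(seq)                      # fake "sampled" next token
        seq.append(nxt)
        cache.append(make_key(nxt))         # only the new token's key
        key_builds += 1
    return seq, key_builds

naive_seq, naive_ops = decode_naive([1, 2, 3], steps=100)
cached_seq, cached_ops = decode_cached([1, 2, 3], steps=100)
# same decoded sequence, but key builds grow quadratically without
# the cache and only linearly with it
```

The same shape of argument (identical output, strictly less recomputation) is what each later single-file optimization in a progression like this has to preserve.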
AAAAAAAAAA retweeted
Pramod Goyal @goyal__pramod
Llama 3 implementation, each matrix multiplication explained one by one. One of the greatest resources to learn about LLMs from a very low level.
AAAAAAAAAA retweeted
naklecha @naklecha
i love writing!! it gives me an excuse to learn topics and is my small way of giving back to the community (inspired by @karpathy's videos). my past blogs:
1. a blog on lcms (10k readers)
2. llama3 from scratch (14k github stars, ~100k readers)
3. an rl guide (20k readers)
English
0
2
17
1.5K
AAAAAAAAAA retweeted
naklecha @naklecha
for my next blog post -- i am working on a comprehensive guide on data parallelism. the guide will explain relevant parallelization tricks, training on thousands of h100s, networking within gpus and nodes, some relevant cuda concepts & a bunch of pytorch code :) eta ~2 months.
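The core idea a data-parallelism guide builds on can be shown in a few lines: for an equal-size split of a batch, the mean of per-worker gradients equals the full-batch gradient, so the all-reduce step loses nothing. A minimal sketch (toy 1-parameter least-squares model and made-up numbers, not from the guide):

```python
# data parallelism in one line of math: with equal-size shards,
# averaging per-worker gradients (the "all-reduce") reproduces the
# full-batch gradient exactly. toy 1-parameter least-squares model.

def grad(w, xs, ys):
    # dL/dw for L = mean((w*x - y)^2)
    n = len(xs)
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.1, 5.9, 8.2]
w = 0.5

full_grad = grad(w, xs, ys)

# two "workers", each holding half the batch
local_grads = [grad(w, xs[:2], ys[:2]), grad(w, xs[2:], ys[2:])]
allreduced = sum(local_grads) / len(local_grads)   # mean = all-reduce
```

This identity only holds as written for equal shard sizes and mean-reduced losses; uneven shards need a weighted average, which is one of the details such a guide would cover.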
naklecha @naklecha

today, i'm excited to release a reinforcement learning guide that carefully explains the intuition and implementation details behind every single fundamental algorithm in the field. enjoy :) naklecha.com/reinforcement-…

AAAAAAAAAA retweeted
Antaripa Saha @doesdatmaksense
with deepseek's r1 release, now is the best time to learn about reinforcement learning @naklecha already dropped a bomb guide covering almost everything for someone to get started with RL. crazy thing is it covers both theory plus code examples
AAAAAAAAAA retweeted
naklecha @naklecha
notes on my reinforcement learning for rocket league side quest so far:

i’m in awe that we live in a time where i can train a reinforcement learning model that is learning to play rocket league and, at the same time, test community-made rocket league models, with both processes running in parallel & locally on my hardware.

in the game window, i’m playing against a reinforcement learning ppo model that’s running locally. and in the vscode window, i’m training my own reinforcement learning model (just a dummy example script for now).

also, the rocket league botting ecosystem has incredible tooling where you can test out different reinforcement learning ideas against other algorithms, plus great documentation for linux and windows. it’s honestly a beautiful hidden gem of a rabbit hole :)
naklecha @naklecha

my new ml side quest -- i'm going to build rocket league bots, using reinforcement learning. my goal is to train a bot (in 1v1s) that is better than ~95% of rocket league's player base (diamond/champ rank).

AAAAAAAAAA retweeted
naklecha @naklecha
my new ml side quest -- i'm going to build rocket league bots, using reinforcement learning. my goal is to train a bot (in 1v1s) that is better than ~95% of rocket league's player base (diamond/champ rank).