Jannis

2.5K posts

@basement_agi

Deep Learner

Joined December 2013
1.6K Following · 521 Followers
Jannis
Jannis@basement_agi·
@InterstellarUAP @PeterMcCormack Already wrong. They are waves AND particles at the same time. Some operators are compatible with each other and some are not, and momentum and position are not.
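The "compatible operators" point can be sketched numerically. A minimal illustration (the grid, names, and discretization are mine, not from the thread): a position matrix and a finite-difference momentum matrix fail to commute, while two functions of position commute exactly.

```python
import numpy as np

# Toy discretization of quantum operators on a 1D grid (illustrative only):
# position X = diag(x), momentum P = -i * hbar * d/dx via central differences.
n = 200
x = np.linspace(-5, 5, n)
dx = x[1] - x[0]
hbar = 1.0

X = np.diag(x)
D = (np.diag(np.ones(n - 1), 1) - np.diag(np.ones(n - 1), -1)) / (2 * dx)
P = -1j * hbar * D

# Incompatible observables: [X, P] = XP - PX is nonzero (entries ~ i*hbar/2
# on the off-diagonals, approximating i*hbar on smooth states).
comm_xp = X @ P - P @ X

# Compatible observables: any two functions of X commute exactly.
A, B = np.diag(x**2), np.diag(np.cos(x))
comm_ab = A @ B - B @ A

print(np.linalg.norm(comm_xp) > 1e-6)  # True: x and p are incompatible
print(np.linalg.norm(comm_ab))         # 0.0: compatible operators commute
```

The nonzero commutator is exactly why position and momentum admit no common eigenbasis, i.e. no simultaneous sharp values.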
Interstellar
Interstellar@InterstellarUAP·
🚨 Simulation Theory: The Double Slit Experiment proves particles act like waves until observed, then they snap into particles. What if our reality only "renders" when we're looking, just like a video game optimizing resources? Check out this episode from The Why Files breaking it down and tying it to Simulation Theory. Are we in a sim? This could be the key to unlocking the true nature of existence! The Why Files video did a great job of explaining the Double Slit Experiment & Simulation Theory. What do YOU think, real or rendered? Drop your thoughts below!
Rohan Paul
Rohan Paul@rohanpaul_ai·
Tinder is launching AI tools to fix dating app burnout by using computer vision to scan your camera roll and LLMs to improve safety. The engine, Chemistry, scans your camera roll to understand your personality through photo patterns. This tool uses computer vision to build a profile without you having to type out interests. Learning Mode tracks real-time activity to adjust profile suggestions while you are active. Internal tests on 14mn users showed these adjustments increased engagement for new users. --- ibtimes.co.uk/tinder-ai-features-dating-app-fatigue-1785489
Jannis
Jannis@basement_agi·
@art_zucker Coool! But still not a fan of GRPO. It feels like finetuning on rollouts instead of real reinforcement learning.
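The "finetuning on rollouts" feel comes from how GRPO scores its samples: instead of a learned value function, each rollout is ranked against the other rollouts for the same prompt. A minimal sketch of the group-relative advantage (function name and numbers are mine):

```python
import numpy as np

def grpo_advantages(rewards):
    """Group-relative advantage, the core trick in GRPO: score each sampled
    rollout against the mean/std of its own group, with no learned critic."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

# One prompt, four sampled rollouts with scalar rewards (toy values):
adv = grpo_advantages([1.0, 0.0, 0.0, 1.0])
print(adv)  # above-mean rollouts get positive advantage, below-mean negative
```

These advantages then weight a clipped policy-gradient loss on the rollout tokens, which is why it can read as supervised finetuning on the group's better completions.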
Arthur Zucker
Arthur Zucker@art_zucker·
If you don't realize what that means: for easy dev/eval, but mostly for GRPO, this is kinda a game changer! No weight synchronization. No accuracy drop. You just use the exact same codepath for training and generating. Once the model is trained, you put it into prod in vLLM / SGLang, same code.
Rémi Ouazan@remi_or_

The inference stack just got simpler. PagedAttention, the kernel that made vLLM fast, now ships natively in 🤗 Transformers CB. Result: 84% of vLLM throughput on a single GPU. Near SOTA with no extra runtime. The gap is closing 📈

R A W S A L E R T S
R A W S A L E R T S@rawsalerts·
🚨#BREAKING: Tinder has announced a new AI feature that would scan users’ camera rolls to help find better matches. The company says the tool would use artificial intelligence to analyze photos and better understand users’ interests and preferences, saying, “We are using AI to better understand what you’re into.”
Ahmad
Ahmad@TheAhmadOsman·
BREAKING Elon Musk endorsed my Top 26 Essential Papers for Mastering LLMs and Transformers

Implement those and you’ve captured ~90% of the alpha behind modern LLMs. Everything else is garnish. This list bridges the Transformer foundations with the reasoning, MoE, and agentic shift.

Recommended Reading Order

1. Attention Is All You Need (Vaswani et al., 2017)
> The original Transformer paper. Covers self-attention, multi-head attention, and the encoder-decoder structure (even though most modern LLMs are decoder-only).
2. The Illustrated Transformer (Jay Alammar, 2018)
> Great intuition builder for understanding attention and tensor flow before diving into implementations.
3. BERT: Pre-training of Deep Bidirectional Transformers (Devlin et al., 2018)
> Encoder-side fundamentals, masked language modeling, and representation learning that still shape modern architectures.
4. Language Models are Few-Shot Learners (GPT-3) (Brown et al., 2020)
> Established in-context learning as a real capability and shifted how prompting is understood.
5. Scaling Laws for Neural Language Models (Kaplan et al., 2020)
> First clean empirical scaling framework for parameters, data, and compute. Read alongside Chinchilla to understand why most models were undertrained.
6. Training Compute-Optimal Large Language Models (Chinchilla) (Hoffmann et al., 2022)
> Demonstrated that token count matters more than parameter count for a fixed compute budget.
7. LLaMA: Open and Efficient Foundation Language Models (Touvron et al., 2023)
> The paper that triggered the open-weight era. Introduced architectural defaults like RMSNorm, SwiGLU, and RoPE as standard practice.
8. RoFormer: Rotary Position Embedding (Su et al., 2021)
> Positional encoding that became the modern default for long-context LLMs.
9. FlashAttention (Dao et al., 2022)
> Memory-efficient attention that enabled long context windows and high-throughput inference by optimizing GPU memory access.
10. Retrieval-Augmented Generation (RAG) (Lewis et al., 2020)
> Combines parametric models with external knowledge sources. Foundational for grounded and enterprise systems.
11. Training Language Models to Follow Instructions with Human Feedback (InstructGPT) (Ouyang et al., 2022)
> The modern post-training and alignment blueprint that instruction-tuned models follow.
12. Direct Preference Optimization (DPO) (Rafailov et al., 2023)
> A simpler and more stable alternative to PPO-based RLHF. Preference alignment via the loss function.
13. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models (Wei et al., 2022)
> Demonstrated that reasoning can be elicited through prompting alone and laid the groundwork for later reasoning-focused training.
14. ReAct: Reasoning and Acting (Yao et al., 2022 / ICLR 2023)
> The foundation of agentic systems. Combines reasoning traces with tool use and environment interaction.
15. DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning (Guo et al., 2025)
> The R1 paper. Proved that large-scale reinforcement learning without supervised data can induce self-verification and structured reasoning behavior.
16. Qwen3 Technical Report (Yang et al., 2025)
> A lightweight overview of a modern architecture. Introduced unified MoE with Thinking Mode and Non-Thinking Mode to dynamically trade off cost and reasoning depth.
17. Outrageously Large Neural Networks: Sparsely-Gated Mixture of Experts (Shazeer et al., 2017)
> The modern MoE ignition point. Conditional computation at scale.
18. Switch Transformers (Fedus et al., 2021)
> Simplified MoE routing using single-expert activation. Key to stabilizing trillion-parameter training.
19. Mixtral of Experts (Mistral AI, 2024)
> Open-weight MoE that proved sparse models can match dense quality while running at small-model inference cost.
20. Sparse Upcycling: Training Mixture-of-Experts from Dense Checkpoints (Komatsuzaki et al., 2022 / ICLR 2023)
> Practical technique for converting dense checkpoints into MoE models. Critical for compute reuse and iterative scaling.
21. The Platonic Representation Hypothesis (Huh et al., 2024)
> Evidence that scaled models converge toward shared internal representations across modalities.
22. Textbooks Are All You Need (Gunasekar et al., 2023)
> Demonstrated that high-quality synthetic data allows small models to outperform much larger ones.
23. Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet (Templeton et al., 2024)
> The biggest leap in mechanistic interpretability. Decomposes neural networks into millions of interpretable features.
24. PaLM: Scaling Language Modeling with Pathways (Chowdhery et al., 2022)
> A masterclass in large-scale training orchestration across thousands of accelerators.
25. GLaM: Generalist Language Model (Du et al., 2022)
> Validated MoE scaling economics with massive total parameters but small active parameter counts.
26. The Smol Training Playbook (Hugging Face, 2025)
> Practical end-to-end handbook for efficiently training language models.

Bonus Material
> T5: Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer (Raffel et al., 2019)
> Toolformer (Schick et al., 2023)
> GShard (Lepikhin et al., 2020)
> Adaptive Mixtures of Local Experts (Jacobs et al., 1991)
> Hierarchical Mixtures of Experts (Jordan and Jacobs, 1994)

If you deeply understand these fundamentals (Transformer core, scaling laws, FlashAttention, instruction tuning, R1-style reasoning, and MoE upcycling), you already understand LLMs better than most.

Time to lock in, good luck!
Jannis
Jannis@basement_agi·
@0xzak I just saw total VRAM is 64 GB, so it's the 16 GB variant. Then you have to choose a 6 or 7 billion parameter model, so it will be 12 to 14 GB in fp16 and you have something left for KV cache. You will get roughly memory bandwidth divided by active-parameter bytes in tokens per second.
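The arithmetic in that reply can be written out. A back-of-envelope sketch (function names are mine; the ~900 GB/s V100 bandwidth figure is an assumption for illustration): fp16 weight size is 2 bytes per parameter, and single-stream decoding is roughly bandwidth-bound because every generated token streams all active weights once.

```python
def fp16_weights_gb(params_b):
    """fp16 = 2 bytes/param, so a 7B-parameter model is ~14 GB of weights."""
    return params_b * 2

def decode_tokens_per_sec(bandwidth_gb_s, active_params_b):
    """Bandwidth-bound decoding estimate: each generated token reads all
    active weights once, so tok/s ≈ memory bandwidth / active-weight bytes."""
    return bandwidth_gb_s / fp16_weights_gb(active_params_b)

# A dense 7B fp16 model on a ~900 GB/s GPU (assumed V100-class bandwidth):
print(fp16_weights_gb(7))                    # 14 (GB), fits in 16 GB with KV cache headroom
print(round(decode_tokens_per_sec(900, 7)))  # 64 tok/s, an upper bound
```

Real throughput lands below this bound once kernel overheads, KV-cache reads, and communication are counted, but the ratio explains why active parameter count dominates decode speed.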
Jannis
Jannis@basement_agi·
@0xzak Hm, this is so slow. Volta doesn't support int8 with tensor cores, so you have to run it in fp16. But then Qwen3 30 billion is 60 GB, which doesn't fit in one GPU. So you have to split and use NVLink, which is possibly super slow (with this NVLink generation). Choose a 10B model.
zak.eth
zak.eth@0xzak·
Ok, what up y'all. I'm back in action. Took some time off, about a week and a half, first real break in a while. Spent the entire thing building my own AI inference cluster, because apparently that's what I do for fun now.

Picked up a pair of used NVIDIA DGX Stations. Each one has 4x Tesla V100 GPUs, 64GB VRAM, NVLink, 256GB RAM. Originally $69,000 each. Got both used for less than $10k total. Connected the first one to my main workstation (Threadripper 3970X, RTX 3090) over a direct 10GbE ethernet link with jumbo frames. 8 V100s, one RTX 3090, 160GB of combined GPU memory across three machines sitting on my desk. Two of them are former enterprise servers that a Fortune 500 company probably depreciated off their books years ago.

The model lineup running right now:
- Qwen3 30B (MoE, 3B active params): 21 tokens/sec
- GPT-OSS 20B (MoE, 3.6B active): 13 tokens/sec
- Qwen2.5 Coder 32B: 3 tok/s, best open-source coding model at this size
- DeepSeek R1 32B: 3 tok/s, reasoning model that benchmarks above o1-mini
- Llama 3.3 70B: 1.5 tok/s, strongest overall open model

MoE architectures are the move for this hardware. Only 3B parameters active per forward pass means you get 21 tok/s out of a 30B model. Dense 32B models crawl at 3 tok/s on V100s. The 70B is usable for batch work, but you're not having a conversation at 1.5 tok/s.

Biggest lesson I learned the hard way: vLLM (the standard high-performance inference engine) straight up does not work on V100 GPUs. AWQ quantization requires compute capability 7.5+; V100 is 7.0. The GPTQ kernel is documented as "buggy" and hung indefinitely. Marlin needs Ampere or newer. Three quantization backends, three failures. Spent days debugging before pivoting entirely to Ollama with GGUF models; llama.cpp just works on everything. NVIDIA also dropped V100 driver support starting at version 550, so I'm locked to the legacy R535 branch forever. Enterprise hardware depreciates like a brick and the software ecosystem moves on without it.

For comparison, a Mac Mini M4 Pro with 48GB unified memory ($2,000) runs 30B models at 12-18 tok/s and MoE models at up to 83 tok/s via MLX. Faster than a single DGX for single-model inference, and it uses 30 watts instead of 1,200. The Mac wins on efficiency, but it can't run 70B models, which is what I need for the research I'm doing. A 70B Q4 needs 42GB, leaving almost nothing for context in 48GB. One DGX handles it with room to spare. Two of them with 128GB of combined VRAM opens up models the Mac can't even attempt.

Cost math:
- Claude Max: $100-200/mo
- ChatGPT Pro: $200/mo
- Heavy API usage (Opus): can be thousands/mo
- Two DGX Stations' electricity at full load: $180-300/mo for unlimited local inference across both

Unlimited requests, complete data privacy, zero dependence on third-party APIs. The hardware pays for itself within a year, and then it's just electricity forever. Local models do not replace Claude or GPT-4 (yet) for complex multi-file agentic coding; that gap is still massive and I'll be real about it. I use local for code completion, quick questions, reasoning tasks, RAG pipelines, and anything I don't want touching a third-party server. Cloud for the heavy agentic loops.

Optimization stack, for anyone building something similar:
- CPU governor locked to performance mode
- Transparent hugepages enabled, swap disabled, vm.swappiness at 1
- NCCL for NVLink GPU communication
- Models pinned in VRAM for 24 hours (no cold-start penalty)
- 10GbE jumbo frames (MTU 9000) between machines
- UFW firewall locked to only accept requests from my workstation

Every knob I could find, turned. Next step is clustering the two DGX Stations together: 128GB VRAM across 8 GPUs. 10GbE is too slow for tensor parallelism across machines (you need InfiniBand), but pipeline parallelism and model routing work fine over ethernet. Run different models on different boxes, route requests to whichever one has capacity. This is me getting back to the grind.

I'm diving back into agentic research, specifically the intersection of decentralized AI and Ethereum. If individuals can assemble competitive inference clusters from depreciated enterprise hardware, the economics of AI access change permanently. You don't need a data center. You need a couple of used servers, some ethernet cable, and about a week of fighting NVIDIA drivers. The models are open, the hardware is cheap, and the software exists. Sovereign compute is the play. Decentralized AI starts in your garage. We just have to build it. Will be sharing some cool new stuff here in the coming days, so stay tuned!
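The $180-300/mo electricity figure checks out arithmetically. A quick sketch (helper name is mine; the $0.21-0.35/kWh rates are illustrative assumptions that reproduce the quoted range):

```python
def monthly_cost_usd(watts, usd_per_kwh, hours=24 * 30):
    """Electricity cost of a box running at constant draw for a 30-day month:
    kW * hours = kWh, times the utility rate."""
    return watts / 1000 * hours * usd_per_kwh

# Two DGX Stations at ~1200 W combined full load, running 24/7:
print(round(monthly_cost_usd(1200, 0.21)))  # 181 USD/mo at a cheap rate
print(round(monthly_cost_usd(1200, 0.35)))  # 302 USD/mo at an expensive rate
```

That is 864 kWh/month, which is the real comparison point against flat-rate subscriptions: the cost is fixed per hour of uptime, not per request.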
Jannis
Jannis@basement_agi·
@livinoffwater 🤣🤣 Also OCT is worth visiting (Penny Black Jazz Bar)
Natalie
Natalie@livinoffwater·
Flying to Shenzhen tonight. Documenting my journey through Huaqiangbei and the streets of Nanshan Who should I meet out there?
Jannis
Jannis@basement_agi·
My toxic trait is that I think the brain isn't some impossibly powerful supercomputer; it's actually doing absurdly little compute relative to its width. A GPU is ~100 million times more efficient.
Jannis
Jannis@basement_agi·
@dunik_7 Everyone wants to become a quant. Nobody wants to become a hardcore low-level C++ engineer.
dunik
dunik@dunik_7·
$1.4m average comp at Jane Street. here's why you won't get it

everyone talks about "becoming a quant" like it's some 30-day challenge

reality: the math alone is an 18-month grind through 5 levels, and each one gates the next

level 1. you simulate 10,000 coin flips just to learn that probability is not intuition, it's conditional math
level 2. you realize your first 10 strategies are noise. bonferroni correction exists because your brain wants to see patterns where there are none
level 3. a 500×500 covariance matrix, and the first 5 eigenvectors explain 70% of everything. the rest is garbage
level 4. gradient descent from scratch. no importing sklearn. you write the optimizer yourself
level 5. you derive black-scholes through the delta-hedging argument and realize that drift μ disappears completely. the option doesn't care about your conviction. risk-neutral pricing permanently breaks your worldview

then polymarket shows up, and the cost function behind prediction markets is literally softmax, the same function sitting behind every neural net classifier

the article gives you every formula, every textbook, every library, but the line that hit hardest was this:

"ai can write code but being able to derive why ito's lemma has that extra term is what separates quants who create edge from quants who borrow it and borrowed edge has an expiration date"

the tools are free. quantlib is free. pytorch is free. mathematical literacy isn't
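The softmax claim about prediction markets refers to Hanson's LMSR market maker, whose price function really is a softmax of the outstanding shares. A minimal sketch (function names and the liquidity parameter b=100 are mine):

```python
import numpy as np

def lmsr_cost(q, b=100.0):
    """Hanson's LMSR market maker: cost function C(q) = b * log(sum(exp(q_i / b)))."""
    return b * np.log(np.sum(np.exp(np.asarray(q) / b)))

def lmsr_prices(q, b=100.0):
    """Instantaneous prices are the gradient of C, i.e. softmax(q / b),
    the same form as a classifier's output layer."""
    z = np.exp(np.asarray(q) / b)
    return z / z.sum()

q = [30.0, 10.0]      # shares outstanding on two outcomes (toy numbers)
p = lmsr_prices(q)
print(p)              # prices sum to 1, like class probabilities

# finite-difference check that the price really is dC/dq:
eps = 1e-6
grad0 = (lmsr_cost([q[0] + eps, q[1]]) - lmsr_cost(q)) / eps
print(abs(grad0 - p[0]) < 1e-4)  # True
```

The liquidity parameter b plays the role of a softmax temperature: larger b means trades move prices less.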
gemchanger@gemchange_ltd

x.com/i/article/2028…

Jannis
Jannis@basement_agi·
@tsungxu (9) looks like lorentz transformation
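For reference, the standard 1+1D Lorentz boost the reply is alluding to (the textbook formula, not read off the pictured equation):

```latex
x' = \gamma\,(x - vt), \qquad
t' = \gamma\left(t - \frac{vx}{c^{2}}\right), \qquad
\gamma = \frac{1}{\sqrt{1 - v^{2}/c^{2}}}
```

The telltale signature is the shared factor \(\gamma\) mixing the space and time coordinates symmetrically.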
Tsung Xu
Tsung Xu@tsungxu·
If you know what this is, dm me I will hire you
Jannis retweeted
Davis Blalock
Davis Blalock@davisblalock·
🚀 Today we’re releasing FlashOptim: better implementations of Adam, SGD, etc, that compute the same updates but save tons of memory. You can use it right now via `pip install flashoptim`. 🚀 arxiv.org/abs/2602.23349 A bunch of cool ideas make this possible: [1/n]
blue
blue@bluewmist·
If you were given $1,000,000 in cash right now, what is the very first thing you would buy or do?
Jannis
Jannis@basement_agi·
@prajdabre You can use one of the 4 models as your base model. And the other 3 models as "teacher" and distill into your base model. Like in the paper "Pre-training under infinite compute"
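For context on the setup being discussed, here is a minimal sketch of naive uniform parameter averaging (the helper name and toy dict-of-arrays "state_dicts" are mine). The interview-question failure mode: independently trained networks are permutation-symmetric, so their weights sit in different loss basins and the elementwise mean can land nowhere useful.

```python
import numpy as np

def average_params(models):
    """Uniform "model soup": elementwise mean of aligned weight tensors.
    Works only when the checkpoints share a loss basin (e.g. a common
    pretrained init); fails for independently trained networks."""
    return {k: np.mean([m[k] for m in models], axis=0) for k in models[0]}

# Four same-architecture "models" trained on four tasks (toy random weights):
rng = np.random.default_rng(0)
models = [{"w": rng.standard_normal((4, 4))} for _ in range(4)]
merged = average_params(models)
print(merged["w"].shape)  # (4, 4): shapes align, but quality is not guaranteed
```

Distillation, as suggested in the reply, sidesteps this entirely: the teachers only need to agree in output space, not in weight space.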
Raj Dabre
Raj Dabre@prajdabre·
Basic ML interview question: You learned about model merging and decided to test it out. You trained 4 models on 4 tasks with the same architecture and vocabulary. You then averaged the parameters of the 4 models. You expected slightly poorer performance but the model just gave terrible performance. What went wrong?
Jannis
Jannis@basement_agi·
@ns123abc Mhhhhhm. I don't like Anthropic's position regarding open source, but they have integrity.
NIK
NIK@ns123abc·
🚨 Anthropic CEO Tells Pentagon “NO.”

>pentagon: “use claude for ALL lawful purposes”
>dario: no
>pentagon: do as we say or you’re blacklisted
>dario: “these threats do not change our position”

Anthropic CEO final message to Department of War:
>no fully autonomous weapons without humans
>no mass domestic surveillance for Americans

Pentagon official calls Dario a “liar with a God-complex” who “wants to personally control the US Military” and is “ok putting our nation’s safety at risk.”

>xAI, Google & OpenAI all agreed to the Pentagon’s terms

Anthropic: “Regardless, we cannot in good conscience accede.”
ₕₐₘₚₜₒₙ
ₕₐₘₚₜₒₙ@hamptonism·
You could literally just get a PhD in Meteorological Applications in Quantitative Finance, and work at 幻方量化 (DeepSeek Quant Fund) and buy this penthouse in Hong Kong, then dm that influencer with 3.5M followers on 抖音 (Chinese TikTok) who keeps liking all your pics, first dates at a cha chaan teng milk tea, then promenade at Victoria Harbour, Courtship at Rosary Church youth group or Sunday mass, start your own quant fund with their families funding, - but you will not.
Casey B. Head@CaseyBHead

You could get a job at the Littleton Coin Company and buy this house just off Main Street. Then meet a nice girl at St. Rose of Lima and take her on a date to Schilling Beer Company. Get married and send your kids to Above The Notch Community School. But you will not.
