Jannis

2.5K posts

@basement_agi

Deep Learner

Joined December 2013
1.6K Following · 521 Followers
Jannis
Jannis@basement_agi·
@InterstellarUAP @PeterMcCormack Already wrong. They are waves AND particles at the same time. Some operators are compatible with each other and some are not, and momentum and position are not.
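The "compatible operators" point can be sketched numerically. A minimal illustration (the grid, names, and discretization are mine, not from the thread): a position matrix and a finite-difference momentum matrix fail to commute, while two functions of position commute exactly.

```python
import numpy as np

# Toy discretization of quantum operators on a 1D grid (illustrative only):
# position X = diag(x), momentum P = -i * hbar * d/dx via central differences.
n = 200
x = np.linspace(-5, 5, n)
dx = x[1] - x[0]
hbar = 1.0

X = np.diag(x)
D = (np.diag(np.ones(n - 1), 1) - np.diag(np.ones(n - 1), -1)) / (2 * dx)
P = -1j * hbar * D

# Incompatible observables: [X, P] = XP - PX is nonzero (entries ~ i*hbar/2
# on the off-diagonals, approximating i*hbar on smooth states).
comm_xp = X @ P - P @ X

# Compatible observables: any two functions of X commute exactly.
A, B = np.diag(x**2), np.diag(np.cos(x))
comm_ab = A @ B - B @ A

print(np.linalg.norm(comm_xp) > 1e-6)  # True: x and p are incompatible
print(np.linalg.norm(comm_ab))         # 0.0: compatible operators commute
```

The nonzero commutator is exactly why position and momentum admit no common eigenbasis, i.e. no simultaneous sharp values.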
Interstellar
Interstellar@InterstellarUAP·
🚨 Simulation Theory: The Double Slit Experiment proves particles act like waves until observed, then they snap into particles. What if our reality only "renders" when we're looking, just like a video game optimizing resources? Check out this episode from The Why Files breaking it down and tying it to Simulation Theory. Are we in a sim? This could be the key to unlocking the true nature of existence! The Why Files video did a great job of explaining the Double Slit Experiment & Simulation Theory. What do YOU think, real or rendered? Drop your thoughts below!
Rohan Paul
Rohan Paul@rohanpaul_ai·
Tinder is launching AI tools to fix dating app burnout by using computer vision to scan your camera roll and LLMs to improve safety. The engine, Chemistry, scans your camera roll to understand your personality through photo patterns. This tool uses computer vision to build a profile without you having to type out interests. Learning Mode tracks real-time activity to adjust profile suggestions while you are active. Internal tests on 14mn users showed these adjustments increased engagement for new users. --- ibtimes.co.uk/tinder-ai-features-dating-app-fatigue-1785489
Jannis
Jannis@basement_agi·
@art_zucker Coool! But still not a fan of GRPO. It feels like finetuning on rollouts instead of real reinforcement learning.
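The "finetuning on rollouts" feel comes from how GRPO scores its samples: instead of a learned value function, each rollout is ranked against the other rollouts for the same prompt. A minimal sketch of the group-relative advantage (function name and numbers are mine):

```python
import numpy as np

def grpo_advantages(rewards):
    """Group-relative advantage, the core trick in GRPO: score each sampled
    rollout against the mean/std of its own group, with no learned critic."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

# One prompt, four sampled rollouts with scalar rewards (toy values):
adv = grpo_advantages([1.0, 0.0, 0.0, 1.0])
print(adv)  # above-mean rollouts get positive advantage, below-mean negative
```

These advantages then weight a clipped policy-gradient loss on the rollout tokens, which is why it can read as supervised finetuning on the group's better completions.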
Arthur Zucker
Arthur Zucker@art_zucker·
If you don't realize what that means: for easy dev/eval, but mostly for GRPO, this is kinda a game changer! No weight synchronization. No accuracy drop. You just use the exact same codepath for training and generating. Once the model is trained, you put it into prod in vLLM / SGLang, same code.
Rémi Ouazan@remi_or_

The inference stack just got simpler. PagedAttention, the kernel that made vLLM fast, now ships natively in 🤗 Transformers CB. Result: 84% of vLLM throughput on a single GPU. Near SOTA with no extra runtime. The gap is closing 📈

R A W S A L E R T S
R A W S A L E R T S@rawsalerts·
🚨#BREAKING: Tinder has announced a new AI feature that would scan users’ camera rolls to help find better matches. The company says the tool would use artificial intelligence to analyze photos and better understand users’ interests and preferences, saying, “We are using AI to better understand what you’re into.”
Ahmad
Ahmad@TheAhmadOsman·
BREAKING Elon Musk endorsed my Top 26 Essential Papers for Mastering LLMs and Transformers

Implement those and you’ve captured ~90% of the alpha behind modern LLMs. Everything else is garnish. This list bridges the Transformer foundations with the reasoning, MoE, and agentic shift.

Recommended Reading Order

1. Attention Is All You Need (Vaswani et al., 2017)
> The original Transformer paper. Covers self-attention, multi-head attention, and the encoder-decoder structure (even though most modern LLMs are decoder-only).
2. The Illustrated Transformer (Jay Alammar, 2018)
> Great intuition builder for understanding attention and tensor flow before diving into implementations.
3. BERT: Pre-training of Deep Bidirectional Transformers (Devlin et al., 2018)
> Encoder-side fundamentals, masked language modeling, and representation learning that still shape modern architectures.
4. Language Models are Few-Shot Learners (GPT-3) (Brown et al., 2020)
> Established in-context learning as a real capability and shifted how prompting is understood.
5. Scaling Laws for Neural Language Models (Kaplan et al., 2020)
> First clean empirical scaling framework for parameters, data, and compute. Read alongside Chinchilla to understand why most models were undertrained.
6. Training Compute-Optimal Large Language Models (Chinchilla) (Hoffmann et al., 2022)
> Demonstrated that token count matters more than parameter count for a fixed compute budget.
7. LLaMA: Open and Efficient Foundation Language Models (Touvron et al., 2023)
> The paper that triggered the open-weight era. Introduced architectural defaults like RMSNorm, SwiGLU, and RoPE as standard practice.
8. RoFormer: Rotary Position Embedding (Su et al., 2021)
> Positional encoding that became the modern default for long-context LLMs.
9. FlashAttention (Dao et al., 2022)
> Memory-efficient attention that enabled long context windows and high-throughput inference by optimizing GPU memory access.
10. Retrieval-Augmented Generation (RAG) (Lewis et al., 2020)
> Combines parametric models with external knowledge sources. Foundational for grounded and enterprise systems.
11. Training Language Models to Follow Instructions with Human Feedback (InstructGPT) (Ouyang et al., 2022)
> The modern post-training and alignment blueprint that instruction-tuned models follow.
12. Direct Preference Optimization (DPO) (Rafailov et al., 2023)
> A simpler and more stable alternative to PPO-based RLHF. Preference alignment via the loss function.
13. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models (Wei et al., 2022)
> Demonstrated that reasoning can be elicited through prompting alone and laid the groundwork for later reasoning-focused training.
14. ReAct: Reasoning and Acting (Yao et al., 2022 / ICLR 2023)
> The foundation of agentic systems. Combines reasoning traces with tool use and environment interaction.
15. DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning (Guo et al., 2025)
> The R1 paper. Proved that large-scale reinforcement learning without supervised data can induce self-verification and structured reasoning behavior.
16. Qwen3 Technical Report (Yang et al., 2025)
> A lightweight overview of a modern architecture. Introduced unified MoE with Thinking Mode and Non-Thinking Mode to dynamically trade off cost and reasoning depth.
17. Outrageously Large Neural Networks: Sparsely-Gated Mixture of Experts (Shazeer et al., 2017)
> The modern MoE ignition point. Conditional computation at scale.
18. Switch Transformers (Fedus et al., 2021)
> Simplified MoE routing using single-expert activation. Key to stabilizing trillion-parameter training.
19. Mixtral of Experts (Mistral AI, 2024)
> Open-weight MoE that proved sparse models can match dense quality while running at small-model inference cost.
20. Sparse Upcycling: Training Mixture-of-Experts from Dense Checkpoints (Komatsuzaki et al., 2022 / ICLR 2023)
> Practical technique for converting dense checkpoints into MoE models. Critical for compute reuse and iterative scaling.
21. The Platonic Representation Hypothesis (Huh et al., 2024)
> Evidence that scaled models converge toward shared internal representations across modalities.
22. Textbooks Are All You Need (Gunasekar et al., 2023)
> Demonstrated that high-quality synthetic data allows small models to outperform much larger ones.
23. Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet (Templeton et al., 2024)
> The biggest leap in mechanistic interpretability. Decomposes neural networks into millions of interpretable features.
24. PaLM: Scaling Language Modeling with Pathways (Chowdhery et al., 2022)
> A masterclass in large-scale training orchestration across thousands of accelerators.
25. GLaM: Generalist Language Model (Du et al., 2022)
> Validated MoE scaling economics with massive total parameters but small active parameter counts.
26. The Smol Training Playbook (Hugging Face, 2025)
> Practical end-to-end handbook for efficiently training language models.

Bonus Material
> T5: Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer (Raffel et al., 2019)
> Toolformer (Schick et al., 2023)
> GShard (Lepikhin et al., 2020)
> Adaptive Mixtures of Local Experts (Jacobs et al., 1991)
> Hierarchical Mixtures of Experts (Jordan and Jacobs, 1994)

If you deeply understand these fundamentals (Transformer core, scaling laws, FlashAttention, instruction tuning, R1-style reasoning, and MoE upcycling), you already understand LLMs better than most.

Time to lock in, good luck!
Jannis
Jannis@basement_agi·
@0xzak I just saw total VRAM is 64 GB, so it's the 16 GB variant. Then you have to choose a 6 or 7 billion parameter model, so it will be 12 to 14 GB in fp16 and you have something left for KV cache. You will get roughly memory bandwidth divided by active-parameter bytes in tokens per second.
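The arithmetic in that reply can be written out. A back-of-envelope sketch (function names are mine; the ~900 GB/s V100 bandwidth figure is an assumption for illustration): fp16 weight size is 2 bytes per parameter, and single-stream decoding is roughly bandwidth-bound because every generated token streams all active weights once.

```python
def fp16_weights_gb(params_b):
    """fp16 = 2 bytes/param, so a 7B-parameter model is ~14 GB of weights."""
    return params_b * 2

def decode_tokens_per_sec(bandwidth_gb_s, active_params_b):
    """Bandwidth-bound decoding estimate: each generated token reads all
    active weights once, so tok/s ≈ memory bandwidth / active-weight bytes."""
    return bandwidth_gb_s / fp16_weights_gb(active_params_b)

# A dense 7B fp16 model on a ~900 GB/s GPU (assumed V100-class bandwidth):
print(fp16_weights_gb(7))                    # 14 (GB), fits in 16 GB with KV cache headroom
print(round(decode_tokens_per_sec(900, 7)))  # 64 tok/s, an upper bound
```

Real throughput lands below this bound once kernel overheads, KV-cache reads, and communication are counted, but the ratio explains why active parameter count dominates decode speed.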
Jannis
Jannis@basement_agi·
@0xzak Hm, this is so slow. Volta doesn't support int8 with tensor cores, so you have to run it in fp16. But then Qwen3 30 billion is 60 GB, which doesn't fit in one GPU. So you have to split and use NVLink, which is possibly super slow (with this NVLink generation). Choose a 10B model.
zak.eth
zak.eth@0xzak·
Ok, what up y'all. I'm back in action. Took some time off, about a week and a half, first real break in a while. Spent the entire thing building my own AI inference cluster, because apparently that's what I do for fun now.

Picked up a pair of used NVIDIA DGX Stations. Each one has 4x Tesla V100 GPUs, 64GB VRAM, NVLink, 256GB RAM. Originally $69,000 each. Got both used for less than $10k total. Connected the first one to my main workstation (Threadripper 3970X, RTX 3090) over a direct 10GbE ethernet link with jumbo frames. 8 V100s, one RTX 3090, 160GB of combined GPU memory across three machines sitting on my desk. Two of them are former enterprise servers that a Fortune 500 company probably depreciated off their books years ago.

The model lineup running right now:
- Qwen3 30B (MoE, 3B active params): 21 tokens/sec
- GPT-OSS 20B (MoE, 3.6B active): 13 tokens/sec
- Qwen2.5 Coder 32B: 3 tok/s, best open-source coding model at this size
- DeepSeek R1 32B: 3 tok/s, reasoning model that benchmarks above o1-mini
- Llama 3.3 70B: 1.5 tok/s, strongest overall open model

MoE architectures are the move for this hardware. Only 3B parameters active per forward pass means you get 21 tok/s out of a 30B model. Dense 32B models crawl at 3 tok/s on V100s. The 70B is usable for batch work, but you're not having a conversation at 1.5 tok/s.

Biggest lesson I learned the hard way: vLLM (the standard high-performance inference engine) straight up does not work on V100 GPUs. AWQ quantization requires compute capability 7.5+; V100 is 7.0. The GPTQ kernel is documented as "buggy" and hung indefinitely. Marlin needs Ampere or newer. Three quantization backends, three failures. Spent days debugging before pivoting entirely to Ollama with GGUF models; llama.cpp just works on everything. NVIDIA also dropped V100 driver support starting at version 550, so I'm locked to the legacy R535 branch forever. Enterprise hardware depreciates like a brick and the software ecosystem moves on without it.

For comparison, a Mac Mini M4 Pro with 48GB unified memory ($2,000) runs 30B models at 12-18 tok/s and MoE models at up to 83 tok/s via MLX. Faster than a single DGX for single-model inference, and it uses 30 watts instead of 1,200. The Mac wins on efficiency, but it can't run 70B models, which is what I need for the research I'm doing. A 70B Q4 needs 42GB, leaving almost nothing for context in 48GB. One DGX handles it with room to spare. Two of them with 128GB of combined VRAM opens up models the Mac can't even attempt.

Cost math:
- Claude Max: $100-200/mo
- ChatGPT Pro: $200/mo
- Heavy API usage (Opus): can be thousands/mo
- Two DGX Stations' electricity at full load: $180-300/mo for unlimited local inference across both

Unlimited requests, complete data privacy, zero dependence on third-party APIs. The hardware pays for itself within a year, and then it's just electricity forever. Local models do not replace Claude or GPT-4 (yet) for complex multi-file agentic coding; that gap is still massive and I'll be real about it. I use local for code completion, quick questions, reasoning tasks, RAG pipelines, and anything I don't want touching a third-party server. Cloud for the heavy agentic loops.

Optimization stack, for anyone building something similar:
- CPU governor locked to performance mode
- Transparent hugepages enabled, swap disabled, vm.swappiness at 1
- NCCL for NVLink GPU communication
- Models pinned in VRAM for 24 hours (no cold-start penalty)
- 10GbE jumbo frames (MTU 9000) between machines
- UFW firewall locked to only accept requests from my workstation

Every knob I could find, turned. Next step is clustering the two DGX Stations together: 128GB VRAM across 8 GPUs. 10GbE is too slow for tensor parallelism across machines (you need InfiniBand), but pipeline parallelism and model routing work fine over ethernet. Run different models on different boxes, route requests to whichever one has capacity. This is me getting back to the grind.

I'm diving back into agentic research, specifically the intersection of decentralized AI and Ethereum. If individuals can assemble competitive inference clusters from depreciated enterprise hardware, the economics of AI access change permanently. You don't need a data center. You need a couple of used servers, some ethernet cable, and about a week of fighting NVIDIA drivers. The models are open, the hardware is cheap, and the software exists. Sovereign compute is the play. Decentralized AI starts in your garage. We just have to build it. Will be sharing some cool new stuff here in the coming days, so stay tuned!
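The $180-300/mo electricity figure checks out arithmetically. A quick sketch (helper name is mine; the $0.21-0.35/kWh rates are illustrative assumptions that reproduce the quoted range):

```python
def monthly_cost_usd(watts, usd_per_kwh, hours=24 * 30):
    """Electricity cost of a box running at constant draw for a 30-day month:
    kW * hours = kWh, times the utility rate."""
    return watts / 1000 * hours * usd_per_kwh

# Two DGX Stations at ~1200 W combined full load, running 24/7:
print(round(monthly_cost_usd(1200, 0.21)))  # 181 USD/mo at a cheap rate
print(round(monthly_cost_usd(1200, 0.35)))  # 302 USD/mo at an expensive rate
```

That is 864 kWh/month, which is the real comparison point against flat-rate subscriptions: the cost is fixed per hour of uptime, not per request.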
Jannis
Jannis@basement_agi·
@livinoffwater 🤣🤣 Also OCT is worth visiting (Penny Black Jazz Bar)
Natalie
Natalie@livinoffwater·
Flying to Shenzhen tonight. Documenting my journey through Huaqiangbei and the streets of Nanshan Who should I meet out there?
Jannis
Jannis@basement_agi·
My toxic trait is that I think the brain isn't some impossibly powerful supercomputer; it's actually doing absurdly little compute relative to its width. A GPU is ~100 million times more efficient.
Jannis
Jannis@basement_agi·
@dunik_7 Everyone wants to become a quant. Nobody wants to become a hardcore low-level C++ engineer.
dunik
dunik@dunik_7·
$1.4m average comp at Jane Street. here's why you won't get it

everyone talks about "becoming a quant" like it's some 30-day challenge

reality: the math alone is an 18-month grind through 5 levels, and each one gates the next

level 1. you simulate 10,000 coin flips just to learn that probability is not intuition, it's conditional math
level 2. you realize your first 10 strategies are noise. bonferroni correction exists because your brain wants to see patterns where there are none
level 3. a 500×500 covariance matrix, and the first 5 eigenvectors explain 70% of everything. the rest is garbage
level 4. gradient descent from scratch. no importing sklearn. you write the optimizer yourself
level 5. you derive black-scholes through the delta-hedging argument and realize that drift μ disappears completely. the option doesn't care about your conviction. risk-neutral pricing permanently breaks your worldview

then polymarket shows up, and the cost function behind prediction markets is literally softmax, the same function sitting behind every neural net classifier

the article gives you every formula, every textbook, every library, but the line that hit hardest was this:

"ai can write code but being able to derive why ito's lemma has that extra term is what separates quants who create edge from quants who borrow it and borrowed edge has an expiration date"

the tools are free. quantlib is free. pytorch is free. mathematical literacy isn't
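The softmax claim about prediction markets refers to Hanson's LMSR market maker, whose price function really is a softmax of the outstanding shares. A minimal sketch (function names and the liquidity parameter b=100 are mine):

```python
import numpy as np

def lmsr_cost(q, b=100.0):
    """Hanson's LMSR market maker: cost function C(q) = b * log(sum(exp(q_i / b)))."""
    return b * np.log(np.sum(np.exp(np.asarray(q) / b)))

def lmsr_prices(q, b=100.0):
    """Instantaneous prices are the gradient of C, i.e. softmax(q / b),
    the same form as a classifier's output layer."""
    z = np.exp(np.asarray(q) / b)
    return z / z.sum()

q = [30.0, 10.0]      # shares outstanding on two outcomes (toy numbers)
p = lmsr_prices(q)
print(p)              # prices sum to 1, like class probabilities

# finite-difference check that the price really is dC/dq:
eps = 1e-6
grad0 = (lmsr_cost([q[0] + eps, q[1]]) - lmsr_cost(q)) / eps
print(abs(grad0 - p[0]) < 1e-4)  # True
```

The liquidity parameter b plays the role of a softmax temperature: larger b means trades move prices less.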
gemchanger@gemchange_ltd

x.com/i/article/2028…

Jannis
Jannis@basement_agi·
@tsungxu (9) looks like lorentz transformation
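For reference, the standard 1+1D Lorentz boost the reply is alluding to (the textbook formula, not read off the pictured equation):

```latex
x' = \gamma\,(x - vt), \qquad
t' = \gamma\left(t - \frac{vx}{c^{2}}\right), \qquad
\gamma = \frac{1}{\sqrt{1 - v^{2}/c^{2}}}
```

The telltale signature is the shared factor \(\gamma\) mixing the space and time coordinates symmetrically.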
Tsung Xu
Tsung Xu@tsungxu·
If you know what this is, dm me I will hire you
Jannis retweeted
Davis Blalock
Davis Blalock@davisblalock·
🚀 Today we’re releasing FlashOptim: better implementations of Adam, SGD, etc, that compute the same updates but save tons of memory. You can use it right now via `pip install flashoptim`. 🚀 arxiv.org/abs/2602.23349 A bunch of cool ideas make this possible: [1/n]
blue
blue@bluewmist·
If you were given $1,000,000 in cash right now, what is the very first thing you would buy or do?
Jannis
Jannis@basement_agi·
@prajdabre You can use one of the 4 models as your base model. And the other 3 models as "teacher" and distill into your base model. Like in the paper "Pre-training under infinite compute"
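For context on the setup being discussed, here is a minimal sketch of naive uniform parameter averaging (the helper name and toy dict-of-arrays "state_dicts" are mine). The interview-question failure mode: independently trained networks are permutation-symmetric, so their weights sit in different loss basins and the elementwise mean can land nowhere useful.

```python
import numpy as np

def average_params(models):
    """Uniform "model soup": elementwise mean of aligned weight tensors.
    Works only when the checkpoints share a loss basin (e.g. a common
    pretrained init); fails for independently trained networks."""
    return {k: np.mean([m[k] for m in models], axis=0) for k in models[0]}

# Four same-architecture "models" trained on four tasks (toy random weights):
rng = np.random.default_rng(0)
models = [{"w": rng.standard_normal((4, 4))} for _ in range(4)]
merged = average_params(models)
print(merged["w"].shape)  # (4, 4): shapes align, but quality is not guaranteed
```

Distillation, as suggested in the reply, sidesteps this entirely: the teachers only need to agree in output space, not in weight space.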
Raj Dabre
Raj Dabre@prajdabre·
Basic ML interview question: You learned about model merging and decided to test it out. You trained 4 models on 4 tasks with the same architecture and vocabulary. You then averaged the parameters of the 4 models. You expected slightly poorer performance but the model just gave terrible performance. What went wrong?
Jannis
Jannis@basement_agi·
@ns123abc Mhhhhhm. I don't like Anthropic's position regarding open source, but they have integrity.
NIK
NIK@ns123abc·
🚨 Anthropic CEO Tells Pentagon “NO.”

>pentagon: “use claude for ALL lawful purposes”
>dario: no
>pentagon: do as we say or you’re blacklisted
>dario: “these threats do not change our position”

Anthropic CEO final message to Department of War:
>no fully autonomous weapons without humans
>no mass domestic surveillance for Americans

Pentagon official calls Dario a “liar with a God-complex” who “wants to personally control the US Military” and is “ok putting our nation’s safety at risk.”

>xAI, Google & OpenAI all agreed to the Pentagon’s terms

Anthropic: “Regardless, we cannot in good conscience accede.”
ₕₐₘₚₜₒₙ
ₕₐₘₚₜₒₙ@hamptonism·
You could literally just get a PhD in Meteorological Applications in Quantitative Finance, and work at 幻方量化 (DeepSeek Quant Fund) and buy this penthouse in Hong Kong, then dm that influencer with 3.5M followers on 抖音 (Chinese TikTok) who keeps liking all your pics, first dates at a cha chaan teng milk tea, then promenade at Victoria Harbour, Courtship at Rosary Church youth group or Sunday mass, start your own quant fund with their families funding, - but you will not.
Casey B. Head@CaseyBHead

You could get a job at the Littleton Coin Company and buy this house just off Main Street. Then meet a nice girl at St. Rose of Lima and take her on a date to Schilling Beer Company. Get married and send your kids to Above The Notch Community School. But you will not.
