varchasvi

2.4K posts

@varchasvee_

Building weird DL experiments from scratch + shipping MVPs. ML x full-stack.

Joined November 2024
304 Following · 337 Followers
Pinned Tweet
varchasvi@varchasvee_·
@karpathy said we should delete tokenizers. So I did, and it worked! I read @karpathy's post on OCR, where he talks about models that read text straight from pixels instead of using a tokenizer (and how having a tokenizer is a problem), and I wanted to build one such model from scratch right away. Took time this Saturday and built this small model, which works well (everything is done from scratch).

The dataset:
--> Built a synthetic dataset of 2k simple math questions like “4+7=”, “3+5=” (small, yes, but this is just a simple PoC, so it works).
--> Rendered each into 128x128 grayscale with a bold font so the characters are clear.
--> No OCR, no preprocessing, no tokenized steps. Just raw pixels in, answer out.

The architecture:
--> Tiny vision encoder feeds into a mini GPT-style decoder.
--> Encoder: splits the image into patches, 2D positional embeddings, transformer blocks.
--> Decoder: autoregressively generates the answer characters.
--> MLP hidden dim = 2x embedding dim.
--> GELU activations, dropout ~0.1/0.2 (can be removed with a large enough dataset), tied embeddings.
--> Trained from scratch. No pretrained anything. The model learns numbers directly from pixel shapes.

Problems I faced: Initial runs were pure chaos; the model just hallucinated endless nonsense like “bbtbbbbbrrr…”.
Bug 1: The gpt2 tokenizer uses the same id for PAD and EOS → the model didn’t know when to stop. Fix: added a real PAD token and masked the loss correctly.
Bug 2: The font was too thin → the ViT couldn’t identify digits. Fix: bumped the font weight + size. Perfectly readable at 128x128 (48-60 works well).
Bug 3: My validation compared raw token IDs instead of decoded answers. Fix: decode before comparing (should’ve been obvious, but yeah).
After the fixes, the model just clicked around ~800 steps.

The results:
Learning rate (cosine decay): starts at ~5e-4, smooth ramp → falls to almost zero by 5k steps. Very stable, no weird plateaus.
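Bug 1 above comes down to loss masking: once PAD has its own id (distinct from EOS), padded positions must be excluded from the loss so the model actually learns to emit EOS. A minimal numpy sketch of that fix; the ids and function name are illustrative, not taken from the colab:

```python
import numpy as np

PAD_ID = 12  # hypothetical: a dedicated PAD id, distinct from EOS

def masked_ce_loss(logits, targets, pad_id=PAD_ID):
    # logits: (T, V), targets: (T,) int ids.
    # Cross-entropy averaged only over non-PAD positions.
    logits = logits - logits.max(axis=-1, keepdims=True)      # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    token_loss = -log_probs[np.arange(len(targets)), targets]  # per-position NLL
    mask = (targets != pad_id).astype(np.float64)              # 0 where target is PAD
    return (token_loss * mask).sum() / mask.sum()
```

With the buggy setup (PAD == EOS, no mask), the model is trained to predict padding forever, which is exactly the endless-nonsense failure mode described above.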
Training loss: drops from ~11 down to ~1-ish rapidly in the first few hundred steps, then slides to <1 with tiny jitter. No collapse, no exploding gradients.
Validation accuracy: starts around ~10% (random), hits ~50% by ~200 steps, ~70%+ around ~1k steps, and settles around ~80% with a small wobble. The model reliably reads digits from raw pixels and performs addition with no tokenizer at all.

What can be improved:
--> More fonts, distortions, handwriting-style noise (to generalize beyond “printed”).
--> Try multi-step arithmetic (carry operations + multi-digit inputs).
--> Better optimizer param grouping (don’t decay layernorm/biases).
--> Larger batch via grad accumulation to stabilize the early phase even more.
--> If scaling to bigger models or more diverse data, add a short learning-rate warmup before cosine decay + more hidden layers.

Final thoughts: Tbh, this small side project taught me more about deep learning than all the time I spent studying it. The PAD token bug almost made me quit; I really thought the model was fundamentally broken when it was just my masking logic being wrong (thanks to Kimi for pointing out the most obvious amateur slip). Doing this made me actually see what the model sees. No text, no vocab, no symbolic shortcuts. Just pixel shapes → patterns → meaning. The model has to learn what digits look like, how they combine into equations, and how addition works, all from raw visual input.

Imo, that’s the power of going tokenizer-free. The input space becomes continuous and expressive. Text, handwriting, emojis, weird fonts, equations, mixed languages: it’s all just pixels. The model doesn’t care about “language”; it learns directly from appearance. But yeah, removing tokenizers also means training is heavier (more compute, more data, more patience, more money). Pixel inputs are high-dimensional (and very dynamic), so you do pay in GPU time (if you’re GPU-poor like me, you’ll feel the pinch immediately).

Still, once you watch the model do math straight from pixels (with no tokenizer anywhere in the pipeline), it feels like magic! Link to colab (complete code): colab.research.google.com/drive/1PT9dNaT…
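The encoder front end described above (split the image into patches, add 2D positional embeddings) can be sketched in a few lines of numpy. Patch size 16 and embedding dim 64 are illustrative guesses, not the colab's actual hyperparameters:

```python
import numpy as np

def patchify(img, patch=16):
    # img: (H, W) grayscale -> (N, patch*patch) row of flattened patches.
    # 128x128 with patch=16 gives an 8x8 grid = 64 patches of 256 pixels.
    H, W = img.shape
    p = img.reshape(H // patch, patch, W // patch, patch)
    return p.transpose(0, 2, 1, 3).reshape(-1, patch * patch)

def pos_embed_2d(n_rows, n_cols, dim):
    # Fixed sin/cos embedding per axis, halves concatenated: the first
    # dim/2 channels encode the patch row, the rest encode the column.
    def axis_embed(n, d):
        pos = np.arange(n)[:, None]
        i = np.arange(d // 2)[None, :]
        angles = pos / (10000 ** (2 * i / d))
        return np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    rows = axis_embed(n_rows, dim // 2)                      # (n_rows, dim//2)
    cols = axis_embed(n_cols, dim // 2)                      # (n_cols, dim//2)
    return np.concatenate(
        [np.repeat(rows, n_cols, axis=0), np.tile(cols, (n_rows, 1))],
        axis=-1)                                             # (n_rows*n_cols, dim)
```

In the real model the flattened patches would go through a learned linear projection before the positional embedding is added; this sketch only shows the tensor bookkeeping.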
varchasvi@varchasvee_·
@0xastro98 Did the same over the weekend, and now I'm onto coding a neural network to map the Waddington gap! Applied ML is way more interesting!
varchasvi@varchasvee_·
Spent the weekend diving into applied ML/DL and came across the concept of the Waddington gap in bio. It’s basically the gap between a stem cell and a specialised cell. The idea is to model this transition using NNs. I’m trying out a combo of CNV + SSM + flow matching to see if it can capture how cells move through this process. Let’s see how it goes!
varchasvi@varchasvee_·
@Abhindas1 Is that real? The rotation speed is too low and the mass it's trying to lift is too high!
varchasvi@varchasvee_·
@FIR31415 Think they should first start by revamping the syllabus. I mean, robotics is far away; most colleges still teach COBOL, C, etc. (not that it's wrong, but those haven't changed in a long, long time).
varchasvi@varchasvee_·
@madsf88 It's hardest for small accounts tbh. Even with good original content, small accounts rarely get traction so yeah! Definitely not a smooth ride
Mads@madsf88·
tech twitter is the most slept on job board on the internet every time ppl ask me how i got a job in nyc i tell them to delete linkedin & start tweeting
varchasvi@varchasvee_·
@ThedatagGuy It's like a never ending ride. You study one, you end up wanting to study 10 more!
Data guy@ThedatagGuy·
@varchasvee_ 💯💯 😭 It takes a lot of time to deep-dive into a topic. But once we understand it deeply, it drives more curiosity
varchasvi@varchasvee_·
I thought getting into ML/DL and diving into diffusion, transformers, etc. would satisfy my curiosity. Instead, it’s keeping me up at night. There’s always another idea, another model to train, another rabbit hole like robotics to explore. How do people even get bored?
varchasvi@varchasvee_·
@tanaylohia I don’t have a background in biology, but I’m working on a really interesting problem right now. Biology will benefit a lot from more engineers getting involved (especially the ML/DL lot).
varchasvi@varchasvee_·
@code_kartik ay, teach us small accounts the art of becoming big accounts, please! Growing tired of posting original content and receiving no traction (not complaining, though; just a thought)
Kartik@code_kartik·
there are so many things i want to discuss, so many takes which i want to post, but this account is so big that if i post something useless, good people who have followed me will be disappointed in me, and i don't want that to happen. so expect less shitposting from this account
varchasvi@varchasvee_·
@w2sgarnav and I've 1000000000 more. It's getting super difficult to decide what to work on. All my AI chats are literally me checking if my ideas are workable.
arnav sonavane@w2sgarnav·
i have like 150+ ideas in ml, need to make an academia group ig
arnav sonavane@w2sgarnav

aiming for 5 research topics for the upcoming few months; if y'all want to join in, pls do so. GPU shortage won't be there (hopefully) (worked on these problem statements a bit previously, and have run a few experiments on each). find them below:

ps 1 : Process Reward Models Beyond Outcome Supervision Without the need for human-labeled trajectories, we provide a completely automated approach for training Process Reward Models (PRMs) that either meet or surpass the quality of gold step-level annotations. We create dense Monte-Carlo Tree Search (MCTS) rollouts with depth d ≥ 32 and branching factor b = 8, starting from a base policy π_θ trained via SFT on chain-of-thought data. Each intermediate step is scored using an ensemble of outcome verifiers (ORMs) bootstrapped from self-consistency and LLM-as-judge signals under temperature T = 0.7. A process-DPO variation with step-wise Bradley-Terry losses, weighted by MCTS visit counts and calibrated via Platt scaling on a short held-out verification set, is introduced to reduce verifier noise. By simultaneously optimising the PRM and policy under a single RLVR goal that alternates between process-level preference optimisation and outcome-level PPO updates, with adaptive mixing ratio λ_t planned via cosine annealing, our method closes the annotation gap. Our auto-annotated PRM delivers +14.7% pass@1 over outcome-only RM baselines at 7B scale and transfers to code and scientific reasoning domains with 3% deterioration following LoRA adaptation on 2k domain-specific trajectories, according to extensive ablation on GSM8K, MATH, and HumanEval. We present the multi-domain PRM benchmark, the distilled verifier weights, and the whole MCTS annotation program, offering the first production-ready recipe for frontier-scale process supervision.
ps 2 : Computer-Use Agents and GUI Grounding In addition to introducing a large-scale synthetic data engine that uses Playwright + Android Emulator instrumentation to generate 500k grounded interaction traces across web, mobile, and desktop environments, we formalise GUI grounding failures through a tripartite decomposition: perception (pixel-to-semantic mapping), planning (high-level action sequence), and execution (low-level mouse/keyboard trajectories). Pixel-level segmentation masks, accessibility tree annotations, and oracle action sequences obtained via deterministic UI state diffing are linked with each trace. Using a hybrid loss that combines contrastive screen embedding alignment (using InfoNCE on cropped UI elements), autoregressive action token prediction, and auxiliary bounding-box regression heads that function at 4× downsampled resolution to maintain fine-grained OCR and icon semantics, we train a multimodal VLA policy on top of a Qwen2-VL-7B backbone. A domain-adversarial training objective that aligns screen embeddings across platforms while maintaining task-specific action distributions is combined with test-time adaptation using a lightweight 256M adapter that conditions on platform-specific accessibility trees to achieve cross-platform zero-shot transfer. Our model decreases end-to-end grounding error from 48% (Claude-3.5 baseline) to 19% on the recently released GUI-Grounding-Bench (which includes 12k actual jobs from WebArena, AndroidWorld, and OSWorld), with the biggest improvements in perception-heavy mobile UIs. We provide the cross-platform VLA checkpoint, the failure atlas taxonomy, and the complete synthetic trace generator, creating the first reproducible benchmark and recipe for reliable computer-use agents. 
ps 3 : Agent Memory Architectures Beyond RAG We present TypedAgentMemory, a modular memory substrate controlled by a differentiable memory controller trained end-to-end with the agent policy, which explicitly distinguishes episodic, semantic (dense vector summaries with SAE-derived concept tags), procedural, and working (short-term KV cache compression) memories. A 128-dim uncertainty head that thresholds epistemic uncertainty from an ensemble of forward passes gates memory writes. The controller uses a hierarchical policy over four memory operations: write, consolidate (graph-based merging with GNN message passing), forget (learned eviction via eligibility traces and recency + relevance scores), and retrieve (hybrid dense + symbolic query routing). Explicit memory consolidation every 50 steps is used to evaluate long-horizon tasks on τ-bench, WebArena, and GAIA. This results in a 2.3× decrease in context length and a 31% improvement in success rate over flat vector-store RAG baselines. Per-memory-type differential privacy approaches, such as homomorphic encryption for procedural skill graphs, concept-level k-anonymity on semantic features, and ε = 0.5 noise injection on episodic writes, are used to ensure privacy. Ablations show that typed memory facilitates effective cross-task transfer through procedural memory reuse and prevents catastrophic forgetting on 200-step agent trajectories. We provide the first rational substitute for monolithic RAG for production-grade autonomous agents by making the whole TypedAgentMemory library (based on LangGraph + FAISS + Neo4j), the long-horizon evaluation harness, and pretrained memory controllers for Llama-3.1-8B and Qwen2.5-72B open-source.
ps 4: SAE Universality Across Model Families By training 128k-feature JumpReLU SAEs (expansion factor 64, k = 32) on residual streams of Llama-3.1-8B, Qwen2.5-72B, Gemma-2-27B, Mistral-Large-2, and DeepSeek-V3 with the same hyperparameters and reconstruction aims, we perform the first extensive cross-family SAE universality investigation. A bipartite matching that quantifies pairwise overlap at both neuron-level (cosine similarity > 0.85) and concept-level (via automated interpretation pipelines using 512 probe prompts per feature) is obtained by performing feature matching via optimal transport with Sinkhorn algorithm on normalised decoder weight matrices. By grouping similar features from different families into 4.2k platonic ideas and annotating each concept with activation data, downstream steering efficacy, and causal mediation scores calculated via route patching, we further build a universal feature library. Steering vectors created from the universal library outperform within-family SAEs on out-of-distribution tasks and enhance zero-shot generalisation on MMLU-Pro, GPQA, and LiveCodeBench by an average of 9.4% when transferred between families, according to downstream transfer studies. We make available the whole SAE training software, the universal concept library with 4.2k interpreted features, the cross-family matching dataset (which includes optimum transport plans), and a plug-and-play steering toolkit that works with Hugging Face Transformers and vLLM. In order to facilitate transfer learning, model merging, and safety interventions within the existing frontier model ecosystem, this study offers the first rigorous atlas and infrastructure for mechanistic universality. 
ps 5 : Synthetic Data Generation Without Mode Collapse We provide an iterated synthetic data pipeline that explicitly characterises the collapse threshold ρ*(q) as a function of generator quality q (as determined by the activation entropy of the SAE feature and the entropy of the output distribution H_π). Using temperature-annealed sampling (T=1.0 → 0.7) supplemented with SAE-guided rejection sampling, we create synthetic corpora at different mixing ratios ρ ∈ {0, 0.1,…, 1.0} starting from a 7B base policy π_θ trained on 200B tokens of FineWeb-Edu. At each generation, we train a 128k-feature JumpReLU SAE (expansion factor 64, k=32) on the residual stream of the current model and filter synthetic samples whose top-activating features show activation entropy below a calibrated threshold τ derived from the real-data reference distribution. Our experiments provide the first empirical collapse-threshold map ρ*(q) at 1.3B–7B scale, demonstrating that SAE-guided diversity sampling extends the safe mixing ratio by 2.3× compared to persona-conditioned or temperature-only baselines, while generator entropy H_π ≥ 4.2 nats delays the onset of measurable perplexity degradation on a held-out real validation set until generation 7 under accumulation (versus generation 3 under pure replacement). A closed-form constraint on variance contraction rate under synthetic mixing is derived theoretically, connecting the number of safe iterations before tail probability mass falls below 10^{-3} to the spectral gap of the generator's transition kernel.
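For context on the JumpReLU SAEs in ps 4 and ps 5: a JumpReLU SAE is an overcomplete autoencoder whose encoder activation zeroes any pre-activation below a learned per-feature threshold. A toy numpy sketch with tiny illustrative dimensions (the posts above describe 128k features at expansion factor 64; names and sizes here are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def jumprelu(z, theta):
    # JumpReLU: pass the pre-activation through only where it exceeds
    # a per-feature threshold theta; everything else is exactly zero.
    return np.where(z > theta, z, 0.0)

class TinySAE:
    # Toy stand-in: d_model=16 residual stream, 64 dictionary features.
    def __init__(self, d_model=16, n_feat=64):
        self.W_enc = rng.normal(0, 0.1, (d_model, n_feat))
        self.b_enc = np.zeros(n_feat)
        self.W_dec = rng.normal(0, 0.1, (n_feat, d_model))
        self.b_dec = np.zeros(d_model)
        self.theta = np.full(n_feat, 0.05)  # learned per-feature in practice

    def forward(self, x):
        # x: (batch, d_model) residual-stream activations.
        f = jumprelu(x @ self.W_enc + self.b_enc, self.theta)  # sparse codes
        return f @ self.W_dec + self.b_dec, f                  # recon, features
```

Training would minimize reconstruction error plus a sparsity penalty on the feature activations; the cross-family comparisons in ps 4 operate on the learned `W_dec` rows of such models.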

varchasvi@varchasvee_·
@Anoyroyc @Deforge_io @autobattle_fun I’ve stopped watching X analytics altogether and I’m just focusing on posting my work now. Thinking about traction all the time just ends in disappointment.
Anoy@Anoyroyc·
This week's Changelog > X Impressions - 🔼94% > LinkedIn Growth - 🔽 52% > @Deforge_io visits - 🔽 2% Working on > @autobattle_fun > @Deforge_io Reading > Bhagavad Gita (350% Completed) > Srimad Bhagavatam (71% Completed)
varchasvi@varchasvee_·
@teortaxesTex What’s surprising is that DeepSeek has no ads, no built-in ecosystem, and no existing user base to lean on, yet it’s still at the top. It managed to do what Facebook, Insta, and Threads couldn’t pull off together (i.e., pushing Meta AI).
varchasvi@varchasvee_·
@TheAhmadOsman The sad part about open source is that as models get better and more people start using them, the incentive to stay open begins to drop. The better the model and the bigger the user base, the stronger the incentive to go closed.
Ahmad@TheAhmadOsman·
Please ignore these accounts that are trying to tell you that opensource is lagging behind / dying and things are dire lol They’re not being faithful, some of them might even be trying to secure a job at Anthropic / OpenAI or whatever
varchasvi@varchasvee_·
@tunguz If everything does get automated (resulting in job losses, etc.), it would be a rare kind of disruption, driven by progress rather than harm. Bizarre but true
Bojan Tunguz@tunguz·
Accurate, except for two caveats: 1. Jobs and tasks are two different things. The former will persist even long after the latter is completely overhauled. 2. Humans have, at best, a very limited capacity to adapt over such short (and getting shorter) time scales.
Riley Goodside@goodside

AI will take some jobs, but it will create countless new jobs too—exciting jobs we can’t even imagine yet. A year later those will also be done by AI, but there will be new jobs—exciting jobs we can’t even imagine yet. Six months later those too will be done by AI, but

varchasvi@varchasvee_·
@Em_Nomadic The real breakthrough in robotics will come when we have a solid, reliable open-source training dataset. Hardware (basic robot kits) is already cheap and accessible, but good training data isn’t. Hopefully we get there soon (Nvidia is already making strong progress on this).
Emerson S@Em_Nomadic·
You can just build them now. Open source robotics now feels like when people first realized you could just build a website. Suddenly everyone could participate and nobody had any idea what was about to get built. We’re at the beginning of that same thing but for physical AI.
varchasvi@varchasvee_·
@TheAhmadOsman I understand all the internals of how LLMs work and I still consider them magic! The fascination never stops really!
Ahmad@TheAhmadOsman·
Who cares, this thing is magic and we get to play with it
varchasvi@varchasvee_·
@ponnappa Most discussions about AI consciousness miss a key point. We don’t have a clear, shared understanding/definition of what consciousness actually is. What draws people in is how convincingly these systems can imitate it rather than whether they truly possess it.
varchasvi@varchasvee_·
@amritwt DeepSeek imo is way better tbh. At least with DeepSeek we don't have to deal with all the constant nerfing. True, it isn't as capable, but someone who knows what they're doing can do well with DeepSeek!
amrit@amritwt·
I think they nerfed Opus 4.7 again
varchasvi@varchasvee_·
@original_ngv Hopefully we get there soon, once the AI hype settles and people realize you can’t prompt your way out of every problem.
varchasvi@varchasvee_·
@Prityush Every single person I know who doesn't even know the ABCs of computers now lists themselves as an AI engineer. Funny how they think they can vibe-code their way into doing everything in the universe!
Prityush bansal@Prityush·
A lot of people do not want to build AI. They want the social status of understanding it