varchasvi

2.4K posts

varchasvi

@varchasvee_

Building weird DL experiments from scratch + shipping MVPs ML x Full-stack

Katılım Kasım 2024

304 Takip Edilen337 Takipçiler

Sabitlenmiş Tweet

varchasvi@varchasvee_·7 Kas

@karpathy said we should delete tokenizers. So I did, and it worked! I read @karpathy 's post on OCR where he talks about models that read text straight from pixels instead of using a tokenizer (and how having a tokenizer is a problem), and I wanted to build one such model from scratch right away. Took time this saturday and built this small model which works good(everything is done from scratch). The dataset: --> Built a synthetic dataset of 2k simple math questions like “4+7=”, “3+5=”.(small yes, but this is just a simple poc so it works) --> Rendered each into 128x128 grayscale with a bold font so characters are clear. --> No OCR, no preprocessing, no tokenized steps. Just raw pixels in, answer out. The architecture: --> Tiny vision encoder feeds into a mini gpt-style decoder. --> Encoder: splits image into patches, positional embeddings(2d), transformer blocks. --> Decoder: autoregressively generates answer characters. --> MLP hidden dim = 2x embedding dim. --> GELU activations, dropout ~0.1/0.2 (can be removed for large enough dataset), tied embeddings. --> Trained from scratch. No pretrained anything. Model learns numbers directly from pixel shapes. Problems I faced: Initial runs = pure chaos, the model just hallucinated endless nonsense like “bbtbbbbbrrr…”. Bug 1: gpt2 tokenizer uses same id for PAD and EOS → model didn’t know when to stop. Fix: added a real PAD token, masked loss correctly. Bug 2: Font too thin → ViT couldn’t identify digits. Fix: bumped font weight + size. Perfectly readable at 128x128(48-60 works good). Bug 3: My validation compared raw token IDs instead of decoded answers. Fix: decode before compare (should’ve been obvious but yeah). After fixes, the model just clicked around ~800 steps. The results: Learning rate (cosine decay): Starts at ~5e-4, smooth ramp → fall to almost zero by 5k steps. Very stable, no weird plateaus. Training loss: Drops from ~11 down to ~1 ish rapidly in the first few hundred steps, then slides down to <1 with tiny jitter. No collapse, no exploding gradients. Validation accuracy: Starts around ~10% (random), hits ~50% by ~200 steps, ~70%+ around ~1k steps, and settles around ~80% with small wobble. Model is reliably reading digits from raw pixels and performing addition with no tokenizer at all. What can be improved: --> More fonts, distortions, handwriting style noise (to generalize beyond “printed”). --> Try multi step arithmetic (carry operations + multi digit inputs). --> Better optimizer param grouping (don’t decay layernorm/biases). --> Larger batch via grad accumulation to stabilize the early phase even more. --> If scaling to bigger models or more diverse data, add a short learning rate warmup before cosine decay + more hidden layers. Final thoughts: Tbh, this small side project taught me more about deep learning than all the time i spent studying it. The PAD token bug almost made me quit. Really thought the model was fundamentally broken when it was just my masking logic being wrong (thanks to kimi for pointing out the most obvious amateur slip). Doing this made me actually see what the model sees. No text, no vocab, no symbolic shortcuts. Just pixel shapes → patterns → meaning. The model has to learn what digits look like, how they combine into equations, and how addition works, all from raw visual input. Imo, that’s the power of going tokenizer free. The input space becomes continuous and expressive. Text, handwriting, emojis, weird fonts, equations, mixed languages, it’s all just pixels. The model doesn’t care about “language,” it learns directly from appearance. But yeah, removing tokenizers also means training is heavier (more compute, more data, more patience, more money). Pixel inputs are high dimensional(and very dynamic), so you do pay in gpu time(if you’re gpu poor like me, you'll feel the pinch immediately). Still, once you watch the model do math straight from pixels(with no tokenizer anywhere in the pipeline), it feels like magic! Link to colab(complete code): colab.research.google.com/drive/1PT9dNaT…

GIF

English

335

79.2K

varchasvi@varchasvee_·58m

@0xastro98 Did the same over the weekend and now onto coding a neural network to map the waddington gap! Applied ML is way more interesting!

English

Amrit Shenava@0xastro98·10h

I am a coder however I am starting to fall in love with Bio. And Ravi’s work is so inspiring that I am spending a lot of time reading about Biology.

Ravi Sharma@ravishar313

Biology will soon be an engineering subdomain

English

226

varchasvi@varchasvee_·1h

Spent the weekend diving into applied ML/DL and came across the concept kf waddington gap in bio. It’s basically the gap b/w a stem cell and a specialised cell. The idea is to model this transition using NNs. I’m trying out a combo of CNV + SSM + flow matching, to see if it can capture how cells move through this process. Let’s see how it goes!

English

varchasvi@varchasvee_·1h

@Abhindas1 Is that real? The rotation speed is too low and the mass it's trying to lift is too high!

English

Abhinav Das@Abhindas1·6h

Vibe coded hardware founder.

Nick Khami@skeptrune

trying to code with open source models

Cherryland, CA 🇺🇸 English

1.4K

varchasvi@varchasvee_·1h

@FIR31415 Think they should first start by revamping the syllabus. I mean robotics is far away, most of the college's still teach COBOL, c and etc(not that it's wrong, but they haven't changed in a long long time).

English

Rashree@FIR31415·9h

every engineering college in India should quadruple the funds disbursement for the robotics, hardware labs/clubs! we will see magic!!!

Rahul Raj@x_rahulraj

IIT Madras is creating great hardware startups one after another; other engineering colleges should catch up

English

varchasvi@varchasvee_·11h

@madsf88 It's hardest for small accounts tbh. Even with good original content, small accounts rarely get traction so yeah! Definitely not a smooth ride

English

Mads@madsf88·13h

tech twitter is the most slept on job board on the internet every time ppl ask me how i got a job in nyc i tell them to delete linkedin & start tweeting

English

275

10.9K

varchasvi@varchasvee_·11h

@ThedatagGuy It's like a never ending ride. You study one, you end up wanting to study 10 more!

English

Data guy@ThedatagGuy·12h

@varchasvee_ 💯💯 😭 It takes a lot of time to deep dive into a topic But once we understand deeply it drives more curiosity

English

varchasvi@varchasvee_·12h

I thought getting into ML/DL and diving into, diffusion, transformers and etc would satisfy my curiosity. Instead, it’s keeping me up at night. There’s always another idea, another model to train, another rabbit hole like robotics to explore. How do people even get bored?

English

varchasvi@varchasvee_·11h

@tanaylohia I don’t have a background in biology, but I’m working on a really interesting problem right now. Biology will benefit a lot from more engineers getting involved (especially the ML/DL lot).

English

Tanay Lohia@tanaylohia·13h

We need more engineers in Biology :))

Ravi Sharma@ravishar313

Biology will soon be an engineering subdomain

English

6.1K

varchasvi@varchasvee_·12h

@code_kartik ay, teach us small accounts the art of becoming big accounts please! Growing tired of posting original content and receiving no traction(not complaining though, just a thought)

English

Kartik@code_kartik·15h

there are so many things i want to discuss so many takes which i want to post but this account is so big that if i post something useless good people who have followed me will be dissappointed in me and i don't want that to happen. so expect less shitposting from this account

English

1.7K

varchasvi@varchasvee_·12h

@w2sgarnav and i've 1000000000 more. Getting super difficult to decide what to work on. All my AI chats are literally me checking if my ideas are workable.

English

arnav sonavane@w2sgarnav·14h

i have like 150+ ideas in ml, need to make an academia group ig

arnav sonavane@w2sgarnav

aiming for 5 research topics for the upcoming few months, if yall want to join in pls do so, GPU shortage wont be there (hopefully) (worked on these problem statements a bit previously, and have ran a few experiments on each) find them below: ps 1 : Process Reward Models Beyond Outcome Supervision Without the need for human-labeled trajectories, we provide a completely automated approach for training Process Reward Models (PRMs) that either meet or surpass the quality of gold step-level annotations. We create dense Monte-Carlo Tree Search (MCTS) rollouts with depth d ≥ 32 and branching factor b = 8, starting from a base policy π_θ trained via SFT on chain-of-thought data. Each intermediate step is scored using an ensemble of outcome verifiers (ORMs) bootstrapped from self-consistency and LLM-as-judge signals under temperature T = 0.7. A process-DPO variation with step-wise Bradley-Terry losses weighted by MCTS visit counts and calibrated via Platt scaling on a short held-out verification set is introduced to reduce verifier noise. By simultaneously optimising the PRM and policy under a single RLVR goal that alternates between process-level preference optimisation and outcome-level PPO updates, with adaptive mixing ratio λ_t planned via cosine annealing, our method closes the annotation gap. Our auto-annotated PRM delivers +14.7% pass @1 over outcome-only RM baselines at 7B scale and transfers to code and scientific reasoning domains with 3% deterioration following LoRA adaptation on 2k domain-specific trajectories, according to extensive ablation on GSM8K, MATH, and HumanEval. We present the multi-domain PRM benchmark, the distilled verifier weights, and the whole MCTS annotation program, offering the first production-ready recipe for frontier-scale process supervision. ps 2 : Computer-Use Agents and GUI Grounding In addition to introducing a large-scale synthetic data engine that uses Playwright + Android Emulator instrumentation to generate 500k grounded interaction traces across web, mobile, and desktop environments, we formalise GUI grounding failures through a tripartite decomposition: perception (pixel-to-semantic mapping), planning (high-level action sequence), and execution (low-level mouse/keyboard trajectories). Pixel-level segmentation masks, accessibility tree annotations, and oracle action sequences obtained via deterministic UI state diffing are linked with each trace. Using a hybrid loss that combines contrastive screen embedding alignment (using InfoNCE on cropped UI elements), autoregressive action token prediction, and auxiliary bounding-box regression heads that function at 4× downsampled resolution to maintain fine-grained OCR and icon semantics, we train a multimodal VLA policy on top of a Qwen2-VL-7B backbone. A domain-adversarial training objective that aligns screen embeddings across platforms while maintaining task-specific action distributions is combined with test-time adaptation using a lightweight 256M adapter that conditions on platform-specific accessibility trees to achieve cross-platform zero-shot transfer. Our model decreases end-to-end grounding error from 48% (Claude-3.5 baseline) to 19% on the recently released GUI-Grounding-Bench (which includes 12k actual jobs from WebArena, AndroidWorld, and OSWorld), with the biggest improvements in perception-heavy mobile UIs. We provide the cross-platform VLA checkpoint, the failure atlas taxonomy, and the complete synthetic trace generator, creating the first reproducible benchmark and recipe for reliable computer-use agents. ps 3 : Agent Memory Architectures Beyond RAG We present TypedAgentMemory, a modular memory substrate controlled by a differentiable memory controller trained end-to-end with the agent policy that explicitly distinguishes episodic semantic (dense vector summaries with SAE-derived concept tags), procedural, and working (short-term KV cache compression) memories. A 128-dim uncertainty head that thresholds epistemic uncertainty from an ensemble of forward passes gates memory writes. The controller uses a hierarchical policy over four memory operations: write, consolidate (graph-based merging with GNN message passing), forget (learned eviction via eligibility traces and recency + relevance scores), and retrieve (hybrid dense + symbolic query routing). Explicit memory consolidation every 50 steps is used to evaluate long-horizon tasks on τ-bench, WebArena, and GAIA. This results in a 2.3× decrease in context length and a 31% improvement in success rate over flat vector-store RAG baselines. Per-memory-type differential privacy approaches, such as homomorphic encryption for procedural skill graphs, concept-level k-anonymity on semantic features, and ε = 0.5 noise injection on episodic writing, are used to ensure privacy. Ablations show that typed memory facilitates effective cross-task transfer through procedural memory reuse and prevents catastrophic forgetting on 200-step agent trajectories. We provide the first rational substitute for monolithic RAG for production-grade autonomous agents by making the whole TypedAgentMemory library (based on LangGraph + FAISS + Neo4j), the long-horizon evaluation harness, and pretrained memory controllers for Llama-3.1-8B and Qwen2.5-72B open-source. ps 4: SAE Universality Across Model Families By training 128k-feature JumpReLU SAEs (expansion factor 64, k = 32) on residual streams of Llama-3.1-8B, Qwen2.5-72B, Gemma-2-27B, Mistral-Large-2, and DeepSeek-V3 with the same hyperparameters and reconstruction aims, we perform the first extensive cross-family SAE universality investigation. A bipartite matching that quantifies pairwise overlap at both neuron-level (cosine similarity > 0.85) and concept-level (via automated interpretation pipelines using 512 probe prompts per feature) is obtained by performing feature matching via optimal transport with Sinkhorn algorithm on normalised decoder weight matrices. By grouping similar features from different families into 4.2k platonic ideas and annotating each concept with activation data, downstream steering efficacy, and causal mediation scores calculated via route patching, we further build a universal feature library. Steering vectors created from the universal library outperform within-family SAEs on out-of-distribution tasks and enhance zero-shot generalisation on MMLU-Pro, GPQA, and LiveCodeBench by an average of 9.4% when transferred between families, according to downstream transfer studies. We make available the whole SAE training software, the universal concept library with 4.2k interpreted features, the cross-family matching dataset (which includes optimum transport plans), and a plug-and-play steering toolkit that works with Hugging Face Transformers and vLLM. In order to facilitate transfer learning, model merging, and safety interventions within the existing frontier model ecosystem, this study offers the first rigorous atlas and infrastructure for mechanistic universality. ps 5 : Synthetic Data Generation Without Mode Collapse We provide an iterated synthetic data pipeline that explicitly characterises the collapse threshold ρ*(q) as a function of generator quality q (as determined by the activation entropy of the SAE feature and the entropy of the output distribution H_π). Using temperature-annealed sampling (T=1.0 → 0.7) supplemented with SAE-guided rejection sampling, we create synthetic corpora at different mixing ratios ρ ∈ {0, 0.1,…, 1.0} starting from a 7B base policy π_θ trained on 200B tokens of FineWeb-Edu. At each generation, we train a 128k-feature JumpReLU SAE (expansion factor 64, k=32) on the residual stream of the current model and filter synthetic samples whose top-activating features show activation entropy below a calibrated threshold τ derived from the real-data reference distribution. Our experiments provide the first empirical collapse-threshold map ρ*(q) at 1.3B–7B scale, demonstrating that SAE-guided diversity sampling extends the safe mixing ratio by 2.3× compared to persona-conditioned or temperature-only baselines, while generator entropy H_π ≥ 4.2 nats delays the onset of measurable perplexity degradation on a held-out real validation set until generation 7 under accumulation (versus generation 3 under pure replacement). A closed-form constraint on variance contraction rate under synthetic mixing is derived theoretically, connecting the number of safe iterations before tail probability mass falls below 10^{-3} to the spectral gap of the generator's transition kernel.

English

5.2K

varchasvi@varchasvee_·12h

@Anoyroyc @Deforge_io @autobattle_fun I’ve stopped watching x analytics altogether and I’m just focusing on posting my work now. Thinking about traction all the time just ends in disappointment.

English

Anoy@Anoyroyc·13h

This week's Changelog > X Impressions - 🔼94% > LinkedIn Growth - 🔽 52% > @Deforge_io visits - 🔽 2% Working on > @autobattle_fun > @Deforge_io Reading > Bhagavad Gita (350% Completed) > Srimad Bhagavatam (71% Completed)

English

114

varchasvi@varchasvee_·16h

@teortaxesTex What’s surprising is that deepseek has no ads, no built in ecosystem, and no existing user base to lean on, yet it’s still at the top. It managed to do what facebook, insta, and threads couldn’t pull off together(pushing meta AI ie).

English

442

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex·16h

Grok.

Similarweb@Similarweb

ChatGPT, Gemini, and Perplexity held steady in April. Everyone behind them either surged or dropped.

English

105

9.9K

varchasvi@varchasvee_·16h

@TheAhmadOsman The sad part about open source is that as models get better and more people start using them, the incentive to stay open begins to drop. The better the model and the bigger the user base, the stronger the incentive to go closed.

English

Ahmad@TheAhmadOsman·1d

Please ignore these accounts that are trying to tell you that opensource is lagging behind / dying and things are dire lol They’re not being faithful, some of them might even be trying to secure a job at Anthropic / OpenAI or whatever

English

269

24.7K

varchasvi@varchasvee_·16h

@tunguz If everything does get automated(resulting in job loss and etc), it would be a rare kind of disruption, driven by progress rather than harm. Bizzare but true

English

Bojan Tunguz@tunguz·16h

Accurate, except for a two caveats: 1. Jobs and tasks are two different things. The former will persist even long after the latter is completely overhauled. 2. Humans have, at best, a very limited capacity to adapt over such short (and getting shorter) time scales.

Riley Goodside@goodside

AI will take some jobs, but it will create countless new jobs too—exciting jobs we can’t even imagine yet. A year later those will also be done by AI, but there will be new jobs—exciting jobs we can’t even imagine yet. Six months later those too will be done by AI, but

English

5.4K

varchasvi@varchasvee_·17h

@Em_Nomadic The real breakthrough in robotics will come when we have a solid, reliable open-source training dataset. Hardware(basic robot kits) is already cheap and accessible, but good training data isn’t. Hopefully we get there soon (nvidia is already making strong progress on this).

English

Emerson S@Em_Nomadic·1d

You can just build them now. Open source robotics now feels like when people first realized you could just build a website. Suddenly everyone could participate and nobody had any idea what was about to get built. We’re at the beginning of that same thing but for physical AI.

English

1.5K

varchasvi@varchasvee_·17h

@TheAhmadOsman I understand all the internals of how LLMs work and I still consider them magic! The fascination never stops really!

English

Ahmad@TheAhmadOsman·1d

Who cares, this thing is magic and we get to play with it

English

750

21.3K

varchasvi@varchasvee_·17h

@ponnappa Most discussions about AI consciousness miss a key point. We don’t have a clear, shared understanding/definition of what consciousness actually is. What draws people in is how convincingly these systems can imitate it rather than whether they truly possess it.

English

Sidu Ponnappa@ponnappa·21h

ZXX

1.2K

varchasvi@varchasvee_·17h

@amritwt deepseek imo is way better tbh. Atleast with deepseek we don't have to deal with all the constant nerfing. True is isn't as capable, but someone who knows what they're doing can do good with deepseek!

English

580

amrit@amritwt·19h

I think they nerfed Opus 4.7 again

English

327

29.7K

varchasvi@varchasvee_·17h

@original_ngv Hopefully we get there soon. Once the AI hype settles and people realize you can’t prompt your way out of every problem.

English

enji vi@original_ngv·18h

It is but people aren't really ready to have this conversation. As a matter of fact very soon doing CSE and some AI will become a prerequisite skill for almost every job.

Anushka Singh@nush_1320

According to Indian parents pursuing B.TECh in CSE( Ai/ML) is the solution to every career problem.

English

985

varchasvi@varchasvee_·17h

@Prityush Every single person I know who doesn't even know ABC of computer now lists themselves as AI engineers. Funny how they think they can vibe code their way into doing everything in the universe!

English

Prityush bansal@Prityush·1d

A lot of people do not want to build AI They want the social status of understanding it

English

225

Keşfet

@0xastro98 @Abhindas1 @FIR31415 @madsf88 @ThedatagGuy @tanaylohia @code_kartik @w2sgarnav