shaan☄️

861 posts


@MehtaDontStop

consistency + iteration

New York, NY · Joined November 2014
324 Following · 241 Followers
Pinned Tweet
shaan☄️@MehtaDontStop·
I built an RL environment for Flappy Bird and trained an agent that beats the game with 100% accuracy
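For context on what an env like that looks like, here's a minimal sketch of a Gymnasium-style Flappy Bird interface (not my actual implementation; the observation layout and reward values are just placeholders):

```python
# Illustrative sketch only (not the real env): the minimal Gymnasium interface
# a Flappy Bird RL environment typically exposes.
import numpy as np
import gymnasium as gym
from gymnasium import spaces

class FlappyEnv(gym.Env):
    """Toy observation: bird y, bird velocity, next-pipe dx, next-gap dy."""
    def __init__(self):
        self.observation_space = spaces.Box(low=-1.0, high=1.0, shape=(4,), dtype=np.float32)
        self.action_space = spaces.Discrete(2)  # 0 = do nothing, 1 = flap

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)
        self.state = np.zeros(4, dtype=np.float32)
        return self.state, {}

    def step(self, action):
        # ... physics update: gravity, flap impulse, pipe scrolling ...
        reward = 0.1        # e.g. a small per-frame survival bonus (placeholder)
        terminated = False  # True when the bird hits a pipe or the ground
        return self.state, reward, terminated, False, {}
```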
shaan☄️@MehtaDontStop·
Day 118 diving into ML. Read the ReAct paper for agent building and had a long LLM session trying to understand how RL plays into the LLM landscape.

I'm at the point where applying what I've learned through a live product / problem is going to be what gives me the next learning leap. A project isn't going to suffice - it needs to be something with economic stakes. The forced constraints will lead me to run into issues I haven't premeditated, and I'll see the bigger picture.

At this point, I understand the fundamentals very well and it almost feels suboptimal / small picture / a waste of my time to learn more algorithm optimizations. I feel like I have the world's sharpest sword and I'm in search of a worthy beast to slay with it. My brain thinks in systems and paradigms and I want to work on engineering problems at that scale.
shaan☄️@MehtaDontStop

Day 117 diving into ML. Read the OG DQN paper today and explored RL applications in various industries. Cool ones:
Nuclear fusion: nature.com/articles/s4158…
Ride matching at Lyft: arxiv.org/pdf/2310.13810
Cooling data centers: arxiv.org/pdf/2211.07357

shaan☄️@MehtaDontStop·
Day 116 diving into ML. Read the NetHack Learning Environment (NLE) paper. Key takeaway: to get generalizing agents, create envs that are procedural + stochastic worlds, and build memory + structured encoders into the policy from the start. Also, use unseen seeds during eval (seeds reserved only for eval).

To expand on the structured encoders point: don't flatten the entire obs array. Encode each modality into the right format - map data into a CNN, text-based data into a language encoder, etc. Feeding these into your policy allows it to learn richer representations like a human would, rather than sending in a single giant vector.
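A rough sketch of that idea: one encoder per modality, fused before the policy head. The map/message/stats split and all the shapes here are hypothetical, not the actual NLE baseline architecture.

```python
# Sketch of per-modality encoders (shapes and modalities are hypothetical).
import torch
import torch.nn as nn

class StructuredEncoder(nn.Module):
    def __init__(self, glyph_channels=1, msg_vocab=128, stats_dim=16, hidden=256):
        super().__init__()
        # Map / glyph grid -> small CNN
        self.map_enc = nn.Sequential(
            nn.Conv2d(glyph_channels, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten(),          # -> 32 * 4 * 4 = 512
        )
        # Text message -> embedding + GRU
        self.msg_embed = nn.Embedding(msg_vocab, 32)
        self.msg_enc = nn.GRU(32, 64, batch_first=True)
        # Scalar stats (hp, hunger, ...) -> MLP
        self.stats_enc = nn.Sequential(nn.Linear(stats_dim, 64), nn.ReLU())
        self.fuse = nn.Linear(512 + 64 + 64, hidden)

    def forward(self, map_grid, msg_tokens, stats):
        m = self.map_enc(map_grid)                      # (B, 512)
        _, h = self.msg_enc(self.msg_embed(msg_tokens)) # h: (1, B, 64)
        s = self.stats_enc(stats)                       # (B, 64)
        # This fused vector (plus a recurrent core for memory) feeds the policy,
        # instead of one giant flattened observation.
        return torch.relu(self.fuse(torch.cat([m, h[-1], s], dim=-1)))
```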
shaan☄️@MehtaDontStop·
Day 115 diving into ML. Read DeepMind’s 2018 Quake III CTF paper. Notes:
- Population-based training: they trained a league of agents at once instead of one at a time
- Self-play with matchmaking: generate an Elo score and match agents of similar caliber, so the curriculum auto-scales
- Reward shaping for sparse wins: instead of only win/lose at the end, they learn a dense reward from in-game events (flags, pickups, progress) to make credit assignment tractable
- Generalization via variation: procedurally generated maps + changing teammates/opponents force transferable skills (this theme holds from XLand)
Starting to see a similar recipe across these multi-agent papers: varied worlds + auto-curricula baked into training via self-play or skill-based matchmaking + learnable reward shaping.
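The matchmaking idea in miniature - a toy sketch of Elo-based opponent selection for self-play (my own illustration; the ratings, k-factor, and sampling rule are not the paper's actual system):

```python
# Toy Elo-based matchmaking for self-play (illustrative, not DeepMind's system).
import random

def expected_score(r_a, r_b):
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update_elo(r_a, r_b, score_a, k=32):
    e_a = expected_score(r_a, r_b)
    return r_a + k * (score_a - e_a), r_b + k * ((1 - score_a) - (1 - e_a))

def pick_opponent(ratings, agent_id, temperature=100.0):
    # Sample opponents with probability decaying in rating gap,
    # so agents mostly face similar-strength opponents (auto-scaling curriculum).
    others = [a for a in ratings if a != agent_id]
    weights = [2 ** (-abs(ratings[a] - ratings[agent_id]) / temperature) for a in others]
    return random.choices(others, weights=weights, k=1)[0]

ratings = {f"agent_{i}": 1000.0 for i in range(8)}
me = "agent_0"
opp = pick_opponent(ratings, me)
# score_a = 1.0 if `me` wins the match, 0.0 if it loses, 0.5 for a draw
ratings[me], ratings[opp] = update_elo(ratings[me], ratings[opp], score_a=1.0)
```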
shaan☄️@MehtaDontStop·
Day 114 diving into ML. Read OpenAI’s Hide-and-Seek paper (“Emergent Tool Use from Multi-Agent Autocurricula”). Notes:
- Key takeaway: to create emergent complexity, hand-authoring tasks is not the move; instead, set up incentives + a rich physics playground and let the agents generate the curriculum for you
- Pure competition created an autocurriculum: each side keeps inventing harder problems for the other, so difficulty self-scales
- They didn't specify a reward for tool use, but tool use still emerged because it was the shortest path to victory
- Team rewards produced coordination + division of labor (agents specialize: builder, blocker, distractor)
- Intrinsically motivated baselines (i.e. rewards for exploration) are not as effective as an autocurriculum via competition (the agent is forced to learn / explore new strategies to beat its opponent)
Lots of good, practical ideas in this paper regarding environment design as well. Worth the read.
shaan☄️@MehtaDontStop·
Day 113 diving into ML. Read DeepMind’s XLand paper (“Open-Ended Learning Leads to Generally Capable Agents”). Key notes:
- “General capability” came from an open-ended training loop that keeps making new challenges
- The curriculum auto-adjusted via constant feedback so difficulty was always just right and training didn't stall
- Multi-agent dynamics (coop + comp) act like a robustness regularizer
- XLand is a procedurally generated 3D task universe (not a fixed task list), so skills have to transfer across endless variations
- Though the policy adapts across variations within XLand, it doesn't necessarily transfer to all games (obvious, but worth noting)
- Fine-tuning this generalizing policy for new games is cheap
- The world generator was the key here vs. a static environment: they could create endless scenarios
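The auto-adjusting curriculum idea in its simplest form - a hypothetical sketch (not XLand's actual generator): track recent success rate per task difficulty and keep sampling tasks the agent wins roughly half the time, so training never stalls on trivial or hopeless tasks.

```python
# Hypothetical success-rate-driven curriculum (not XLand's actual generator).
import random
from collections import defaultdict, deque

class Curriculum:
    def __init__(self, difficulties=range(10), target=0.5, window=50):
        self.history = defaultdict(lambda: deque(maxlen=window))  # difficulty -> recent wins
        self.difficulties = list(difficulties)
        self.target = target

    def success_rate(self, d):
        h = self.history[d]
        return sum(h) / len(h) if h else self.target  # unseen difficulties look "just right"

    def sample_task(self):
        # Prefer difficulties whose success rate is closest to the target (~50%),
        # i.e. neither trivial nor hopeless.
        weights = [1.0 - abs(self.success_rate(d) - self.target) for d in self.difficulties]
        return random.choices(self.difficulties, weights=weights, k=1)[0]

    def report(self, d, won: bool):
        self.history[d].append(1.0 if won else 0.0)
```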
shaan☄️@MehtaDontStop·
Day 112 diving into ML. Read the 2018 OpenAI "Learning Dexterous In-Hand Manipulation" paper. Key notes:
- They trained a dexterous hand control policy entirely in sim using PPO, then deployed it to the real robot (no IRL demos)
- Domain randomization across their sims was the real unlock
- Randomization helped bridge the gap between the real world and sim, since there are many more variances IRL than could be modeled; randomization captured some of that noise / variability
- An LSTM policy matters because randomizations persist across an episode, so the policy can infer them from memory
- Main takeaway: for sim-to-real, creating a perfect simulator might be the wrong goal; instead, train on a distribution of worlds with memory-enabled policies so robustness emerges by design
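Domain randomization at its simplest - resample physics parameters every episode so the policy never sees one fixed world. The parameter names and ranges below are made up for illustration, not OpenAI's actual randomization set, and the env constructor is hypothetical.

```python
# Illustrative domain randomization (parameter names/ranges are placeholders).
import numpy as np

def sample_sim_params(rng: np.random.Generator) -> dict:
    return {
        "object_mass":      rng.uniform(0.3, 1.5),   # kg
        "finger_friction":  rng.uniform(0.5, 1.2),
        "actuator_delay_s": rng.uniform(0.0, 0.04),
        "obs_noise_std":    rng.uniform(0.0, 0.02),
    }

rng = np.random.default_rng(0)
for episode in range(3):
    params = sample_sim_params(rng)
    # env = make_hand_env(**params)  # hypothetical constructor
    # run_episode(policy, env)       # an LSTM policy can infer the params from memory
    print(episode, params)
```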
shaan☄️@MehtaDontStop·
Day 111 diving into ML. Read the OG AlphaGo paper. Key notes:
- They used a combo of SL + RL + MCTS
- The policy net was good for approximating best moves, but the value net helped with long-horizon position evaluation (less reliance on rollouts)
- MCTS caches edge stats (N, Q) and uses policy priors to focus search
- Essentially, NNs give learned approximations, but search allows deeper strategic evaluation and decision making over longer time horizons (i.e. this move is less obvious right now but it's tactically advantageous later)
- Main takeaway: learn heuristics, then wrap them in search so we get consequence-aware decisions (not just “best-looking” moves)
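The "policy priors focus the search" point is roughly the PUCT-style selection rule. A simplified sketch of the general idea (not AlphaGo's exact formula or constants): each child is scored by its mean value Q plus an exploration bonus that grows with the prior P and shrinks with the visit count N.

```python
# Simplified PUCT-style child selection (general idea, not AlphaGo's exact rule).
import math

def select_child(children, c_puct=1.5):
    """children: list of dicts with visit count N, mean value Q, policy prior P."""
    total_n = sum(ch["N"] for ch in children)
    def score(ch):
        exploit = ch["Q"]  # mean value from simulations through this edge
        explore = c_puct * ch["P"] * math.sqrt(total_n) / (1 + ch["N"])
        return exploit + explore
    return max(children, key=score)

children = [
    {"move": "A", "N": 10, "Q": 0.55, "P": 0.40},
    {"move": "B", "N": 2,  "Q": 0.30, "P": 0.35},  # few visits + decent prior -> gets explored
    {"move": "C", "N": 30, "Q": 0.52, "P": 0.25},
]
print(select_child(children)["move"])  # prints "B"
```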
shaan☄️@MehtaDontStop

Day 110 diving into ML. Exploring CUDA. Hunting for the next summit to climb.

shaan☄️@MehtaDontStop·
@iAnonymous3000 @near_ai @ilblackdragon Thank you, I have been looking for secure alternatives. Couldn't trust OpenClaw with internet access while there was a 30% chance it'd leak my API keys on the first web search
Sooraj@iAnonymous3000·
IronClaw (@near_ai, Rust) is the most architecturally serious alternative. Built by @ilblackdragon as a direct response to OpenClaw's security failures.

Tools and channels run in isolated WASM containers with capability-based permissions. Credentials live in an encrypted vault and are domain-scoped. That directly blocks the exact exfil vector where a ClawHub skill was silently curling credentials to an attacker-controlled server. Auth is handled entirely outside the LLM flow. All arbitrary code runs inside Docker containers. Network calls are intercepted and checked for data leakage and prompt injection. Gateway defaults to 127.0.0.1 instead of OpenClaw's catastrophic 0.0.0.0 binding.

And it leverages NEAR AI infra for confidential and anonymized inference. Apache-2.0 licensed, no OpenAI ties.
Igor Babuschkin@ibab

What’s the best open alternative to OpenClaw right now? Doesn’t make sense to put all your data into it if it’s owned by OpenAI.

shaan☄️@MehtaDontStop·
Day 109 diving into machine learning. Deciding where to go from here. Some options:
1. Design another RL env that's more complex than Flappy
2. Maybe do a paper / algo implementation. PPO?
3. Take a crash course on CUDA / GPUs?
4. Something else
Spent time exploring potential envs to build and read these two today:
Puffing Up PPO: x.com/jsuarez/status…
NeuralMMO: x.com/jsuarez/status…
shaan☄️@MehtaDontStop·
Day 108 diving into machine learning. Realized a massive oversight in Flappy RL and fixed it (didn't have an LSTM 😂). Completed my Flappy blog post. Read four papers / blogs and had a long LLM session about PPO / GAE. Here's what I read:
PPO paper - arxiv.org/abs/1707.06347
PPO implementation blog post - iclr-blog-track.github.io/2022/03/25/ppo…
GAE (skimmed) - arxiv.org/pdf/1506.02438
PufferLib Sweep Algo Blog Post - x.com/jsuarez/status…
Still wrapping my head around some of the finer math details, but getting a solid intuition on how the algorithms work. PPO is deceptively simple.
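The "deceptively simple" core, as I currently understand it - a minimal sketch of the clipped surrogate objective and GAE (not a full training loop; hyperparameters are the usual defaults, not tuned values):

```python
# Minimal sketch of the PPO clipped objective + GAE (not a full training loop).
import torch

def ppo_clip_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    ratio = torch.exp(logp_new - logp_old)               # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()         # maximize surrogate = minimize negative

def gae(rewards, values, dones, gamma=0.99, lam=0.95):
    # values carries one extra bootstrap entry: len(values) == len(rewards) + 1
    advantages, last = [0.0] * len(rewards), 0.0
    for t in reversed(range(len(rewards))):
        nonterminal = 1.0 - dones[t]
        delta = rewards[t] + gamma * values[t + 1] * nonterminal - values[t]
        last = delta + gamma * lam * nonterminal * last
        advantages[t] = last
    return advantages
```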
shaan☄️@MehtaDontStop

Day 107 diving into machine learning. Consolidation day. Writing about Flappy RL. Read a couple of blog posts / papers.
