shaan☄️

861 posts


@MehtaDontStop

consistency + iteration

New York, NY · Joined November 2014
324 Following · 241 Followers
Pinned Tweet
shaan☄️@MehtaDontStop·
I built an RL environment for Flappy Bird and trained an agent that beats the game with 100% accuracy
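For context on what an env like that looks like, here's a minimal sketch of a Gymnasium-style Flappy Bird interface (not my actual implementation; the observation layout and reward values are just placeholders):

```python
# Illustrative sketch only (not the real env): the minimal Gymnasium interface
# a Flappy Bird RL environment typically exposes.
import numpy as np
import gymnasium as gym
from gymnasium import spaces

class FlappyEnv(gym.Env):
    """Toy observation: bird y, bird velocity, next-pipe dx, next-gap dy."""
    def __init__(self):
        self.observation_space = spaces.Box(low=-1.0, high=1.0, shape=(4,), dtype=np.float32)
        self.action_space = spaces.Discrete(2)  # 0 = do nothing, 1 = flap

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)
        self.state = np.zeros(4, dtype=np.float32)
        return self.state, {}

    def step(self, action):
        # ... physics update: gravity, flap impulse, pipe scrolling ...
        reward = 0.1        # e.g. a small per-frame survival bonus (placeholder)
        terminated = False  # True when the bird hits a pipe or the ground
        return self.state, reward, terminated, False, {}
```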
shaan☄️@MehtaDontStop·
Day 118 diving into ML. Read the ReAct paper for agent building and had a long LLM session trying to understand how RL plays into the LLM landscape.

I'm at the point where applying what I've learned through a live product / problem is going to be what gives me the next learning leap. A project isn't going to suffice - it needs to be something with economic stakes. The forced constraints will lead me to run into issues I haven't premeditated, and I'll see the bigger picture.

At this point, I understand the fundamentals very well and it almost feels suboptimal / small picture / a waste of my time to learn more algorithm optimizations. I feel like I have the world's sharpest sword and I'm in search of a worthy beast to slay with it. My brain thinks in systems and paradigms and I want to work on engineering problems at that scale.
shaan☄️@MehtaDontStop

Day 117 diving into ML. Read the OG DQN paper today and explored RL applications in various industries. Cool ones:
Nuclear fusion: nature.com/articles/s4158…
Ride matching at Lyft: arxiv.org/pdf/2310.13810
Cooling data centers: arxiv.org/pdf/2211.07357

shaan☄️@MehtaDontStop·
Day 116 diving into ML. Read the NetHack Learning Environment (NLE) paper. Key takeaway: to get generalizing agents, create envs that are procedural + stochastic worlds, and build memory + structured encoders into the policy from the start. Also, use unseen seeds during eval (seeds reserved only for eval).

To expand on the structured encoders point: don't flatten the entire obs array. Encode each modality into the right format - map data into a CNN, text-based data into a language encoder, etc. Feeding these into your policy allows it to learn richer representations like a human would, rather than sending in a single giant vector.
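A rough sketch of that idea: one encoder per modality, fused before the policy head. The map/message/stats split and all the shapes here are hypothetical, not the actual NLE baseline architecture.

```python
# Sketch of per-modality encoders (shapes and modalities are hypothetical).
import torch
import torch.nn as nn

class StructuredEncoder(nn.Module):
    def __init__(self, glyph_channels=1, msg_vocab=128, stats_dim=16, hidden=256):
        super().__init__()
        # Map / glyph grid -> small CNN
        self.map_enc = nn.Sequential(
            nn.Conv2d(glyph_channels, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten(),          # -> 32 * 4 * 4 = 512
        )
        # Text message -> embedding + GRU
        self.msg_embed = nn.Embedding(msg_vocab, 32)
        self.msg_enc = nn.GRU(32, 64, batch_first=True)
        # Scalar stats (hp, hunger, ...) -> MLP
        self.stats_enc = nn.Sequential(nn.Linear(stats_dim, 64), nn.ReLU())
        self.fuse = nn.Linear(512 + 64 + 64, hidden)

    def forward(self, map_grid, msg_tokens, stats):
        m = self.map_enc(map_grid)                      # (B, 512)
        _, h = self.msg_enc(self.msg_embed(msg_tokens)) # h: (1, B, 64)
        s = self.stats_enc(stats)                       # (B, 64)
        # This fused vector (plus a recurrent core for memory) feeds the policy,
        # instead of one giant flattened observation.
        return torch.relu(self.fuse(torch.cat([m, h[-1], s], dim=-1)))
```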
shaan☄️@MehtaDontStop·
Day 115 diving into ML. Read DeepMind’s 2018 Quake III CTF paper. Notes:
- Population-based training: they trained a league of agents at once instead of one at a time
- Self-play with matchmaking: generate an Elo score and match agents of similar caliber, so the curriculum auto-scales
- Reward shaping for sparse wins: instead of only win/lose at the end, they learn a dense reward from in-game events (flags, pickups, progress) to make credit assignment tractable
- Generalization via variation: procedurally generated maps + changing teammates/opponents force transferable skills (this theme holds from XLand)
Starting to see a similar recipe across these multi-agent papers: varied worlds + auto-curricula baked into training via self-play or skill-based matchmaking + learnable reward shaping.
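The matchmaking idea in miniature - a toy sketch of Elo-based opponent selection for self-play (my own illustration; the ratings, k-factor, and sampling rule are not the paper's actual system):

```python
# Toy Elo-based matchmaking for self-play (illustrative, not DeepMind's system).
import random

def expected_score(r_a, r_b):
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update_elo(r_a, r_b, score_a, k=32):
    e_a = expected_score(r_a, r_b)
    return r_a + k * (score_a - e_a), r_b + k * ((1 - score_a) - (1 - e_a))

def pick_opponent(ratings, agent_id, temperature=100.0):
    # Sample opponents with probability decaying in rating gap,
    # so agents mostly face similar-strength opponents (auto-scaling curriculum).
    others = [a for a in ratings if a != agent_id]
    weights = [2 ** (-abs(ratings[a] - ratings[agent_id]) / temperature) for a in others]
    return random.choices(others, weights=weights, k=1)[0]

ratings = {f"agent_{i}": 1000.0 for i in range(8)}
me = "agent_0"
opp = pick_opponent(ratings, me)
# score_a = 1.0 if `me` wins the match, 0.0 if it loses, 0.5 for a draw
ratings[me], ratings[opp] = update_elo(ratings[me], ratings[opp], score_a=1.0)
```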
shaan☄️@MehtaDontStop·
Day 114 diving into ML. Read OpenAI’s Hide-and-Seek paper (“Emergent Tool Use from Multi-Agent Autocurricula”). Notes:
- Key takeaway: to create emergent complexity, hand-authoring tasks is not the move; instead, set up incentives + a rich physics playground and let the agents generate the curriculum for you
- Pure competition created an autocurriculum: each side keeps inventing harder problems for the other, so difficulty self-scales
- They didn't specify a reward for tool use, but tool use still emerged because it was the shortest path to victory
- Team rewards produced coordination + division of labor (agents specialize: builder, blocker, distractor)
- Intrinsically motivated baselines (i.e. rewards for exploration) are not as effective as an autocurriculum via competition (the agent is forced to learn / explore new strategies to beat its opponent)
Lots of good, practical ideas in this paper regarding environment design as well. Worth the read.
shaan☄️@MehtaDontStop·
Day 113 diving into ML. Read DeepMind’s XLand paper (“Open-Ended Learning Leads to Generally Capable Agents”). Key notes:
- “General capability” came from an open-ended training loop that keeps making new challenges
- The curriculum auto-adjusted via constant feedback so difficulty was always just right and training didn't stall
- Multi-agent dynamics (coop + comp) act like a robustness regularizer
- XLand is a procedurally generated 3D task universe (not a fixed task list), so skills have to transfer across endless variations
- Though the policy adapts across variations within XLand, it doesn't necessarily transfer to all games (obvious, but worth noting)
- Fine-tuning this generalizing policy for new games is cheap
- The world generator was the key here vs. a static environment: they could create endless scenarios
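The auto-adjusting curriculum idea in its simplest form - a hypothetical sketch (not XLand's actual generator): track recent success rate per task difficulty and keep sampling tasks the agent wins roughly half the time, so training never stalls on trivial or hopeless tasks.

```python
# Hypothetical success-rate-driven curriculum (not XLand's actual generator).
import random
from collections import defaultdict, deque

class Curriculum:
    def __init__(self, difficulties=range(10), target=0.5, window=50):
        self.history = defaultdict(lambda: deque(maxlen=window))  # difficulty -> recent wins
        self.difficulties = list(difficulties)
        self.target = target

    def success_rate(self, d):
        h = self.history[d]
        return sum(h) / len(h) if h else self.target  # unseen difficulties look "just right"

    def sample_task(self):
        # Prefer difficulties whose success rate is closest to the target (~50%),
        # i.e. neither trivial nor hopeless.
        weights = [1.0 - abs(self.success_rate(d) - self.target) for d in self.difficulties]
        return random.choices(self.difficulties, weights=weights, k=1)[0]

    def report(self, d, won: bool):
        self.history[d].append(1.0 if won else 0.0)
```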
shaan☄️@MehtaDontStop·
Day 112 diving into ML. Read the 2018 OpenAI "Learning Dexterous In-Hand Manipulation" paper. Key notes:
- They trained a dexterous hand control policy entirely in sim using PPO, then deployed it to the real robot (no IRL demos)
- Domain randomization across their sims was the real unlock
- Randomization helped bridge the gap between the real world and sim, since there are many more variances IRL than could be modeled; randomization captured some of that noise / variability
- An LSTM policy matters because randomizations persist across an episode, so the policy can infer them from memory
- Main takeaway: for sim-to-real, creating a perfect simulator might be the wrong goal; instead, train on a distribution of worlds with memory-enabled policies so robustness emerges by design
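Domain randomization at its simplest - resample physics parameters every episode so the policy never sees one fixed world. The parameter names and ranges below are made up for illustration, not OpenAI's actual randomization set, and the env constructor is hypothetical.

```python
# Illustrative domain randomization (parameter names/ranges are placeholders).
import numpy as np

def sample_sim_params(rng: np.random.Generator) -> dict:
    return {
        "object_mass":      rng.uniform(0.3, 1.5),   # kg
        "finger_friction":  rng.uniform(0.5, 1.2),
        "actuator_delay_s": rng.uniform(0.0, 0.04),
        "obs_noise_std":    rng.uniform(0.0, 0.02),
    }

rng = np.random.default_rng(0)
for episode in range(3):
    params = sample_sim_params(rng)
    # env = make_hand_env(**params)  # hypothetical constructor
    # run_episode(policy, env)       # an LSTM policy can infer the params from memory
    print(episode, params)
```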
shaan☄️@MehtaDontStop·
Day 111 diving into ML. Read the OG AlphaGo paper. Key notes:
- They used a combo of SL + RL + MCTS
- The policy net was good for approximating best moves, but the value net helped with long-horizon position evaluation (less reliance on rollouts)
- MCTS caches edge stats (N, Q) and uses policy priors to focus search
- Essentially, NNs give learned approximations, but search allows deeper strategic evaluation and decision making over longer time horizons (i.e. this move is less obvious right now but it's tactically advantageous later)
- Main takeaway: learn heuristics, then wrap them in search so we get consequence-aware decisions (not just “best-looking” moves)
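The "policy priors focus the search" point is roughly the PUCT-style selection rule. A simplified sketch of the general idea (not AlphaGo's exact formula or constants): each child is scored by its mean value Q plus an exploration bonus that grows with the prior P and shrinks with the visit count N.

```python
# Simplified PUCT-style child selection (general idea, not AlphaGo's exact rule).
import math

def select_child(children, c_puct=1.5):
    """children: list of dicts with visit count N, mean value Q, policy prior P."""
    total_n = sum(ch["N"] for ch in children)
    def score(ch):
        exploit = ch["Q"]  # mean value from simulations through this edge
        explore = c_puct * ch["P"] * math.sqrt(total_n) / (1 + ch["N"])
        return exploit + explore
    return max(children, key=score)

children = [
    {"move": "A", "N": 10, "Q": 0.55, "P": 0.40},
    {"move": "B", "N": 2,  "Q": 0.30, "P": 0.35},  # few visits + decent prior -> gets explored
    {"move": "C", "N": 30, "Q": 0.52, "P": 0.25},
]
print(select_child(children)["move"])  # prints "B"
```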
shaan☄️@MehtaDontStop

Day 110 diving into ML. Exploring CUDA. Hunting for the next summit to climb.

shaan☄️@MehtaDontStop·
@iAnonymous3000 @near_ai @ilblackdragon Thank you, I have been looking for secure alternatives. Couldn't trust OpenClaw with internet access while there was a 30% chance it'd leak my API keys on the first web search
Sooraj@iAnonymous3000·
IronClaw (@near_ai, Rust) is the most architecturally serious alternative. Built by @ilblackdragon as a direct response to OpenClaw's security failures.

Tools and channels run in isolated WASM containers with capability-based permissions. Credentials live in an encrypted vault and are domain-scoped. That directly blocks the exact exfil vector where a ClawHub skill was silently curling credentials to an attacker-controlled server. Auth is handled entirely outside the LLM flow. All arbitrary code runs inside Docker containers. Network calls are intercepted and checked for data leakage and prompt injection. Gateway defaults to 127.0.0.1 instead of OpenClaw's catastrophic 0.0.0.0 binding.

And it leverages NEAR AI infra for confidential and anonymized inference. Apache-2.0 licensed, no OpenAI ties.
Igor Babuschkin@ibab

What’s the best open alternative to OpenClaw right now? Doesn’t make sense to put all your data into it if it’s owned by OpenAI.

shaan☄️@MehtaDontStop·
Day 109 diving into machine learning. Deciding where to go from here. Some options:
1. Design another RL env that's more complex than Flappy
2. Maybe do a paper / algo implementation. PPO?
3. Take a crash course on CUDA / GPUs?
4. Something else
Spent time exploring potential envs to build and read these two today:
Puffing Up PPO: x.com/jsuarez/status…
NeuralMMO: x.com/jsuarez/status…
shaan☄️@MehtaDontStop·
Day 108 diving into machine learning. Realized a massive oversight in Flappy RL and fixed it (didn't have an LSTM 😂). Completed my Flappy blog post. Read four papers / blogs and had a long LLM session about PPO / GAE. Here's what I read:
PPO paper - arxiv.org/abs/1707.06347
PPO implementation blog post - iclr-blog-track.github.io/2022/03/25/ppo…
GAE (skimmed) - arxiv.org/pdf/1506.02438
PufferLib Sweep Algo Blog Post - x.com/jsuarez/status…
Still wrapping my head around some of the finer math details, but getting a solid intuition on how the algorithms work. PPO is deceptively simple.
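The "deceptively simple" core, as I currently understand it - a minimal sketch of the clipped surrogate objective and GAE (not a full training loop; hyperparameters are the usual defaults, not tuned values):

```python
# Minimal sketch of the PPO clipped objective + GAE (not a full training loop).
import torch

def ppo_clip_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    ratio = torch.exp(logp_new - logp_old)               # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()         # maximize surrogate = minimize negative

def gae(rewards, values, dones, gamma=0.99, lam=0.95):
    # values carries one extra bootstrap entry: len(values) == len(rewards) + 1
    advantages, last = [0.0] * len(rewards), 0.0
    for t in reversed(range(len(rewards))):
        nonterminal = 1.0 - dones[t]
        delta = rewards[t] + gamma * values[t + 1] * nonterminal - values[t]
        last = delta + gamma * lam * nonterminal * last
        advantages[t] = last
    return advantages
```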
shaan☄️@MehtaDontStop

Day 107 diving into machine learning. Consolidation day. Writing about Flappy RL. Read a couple of blog posts / papers.
