
💻近端策略優化PPO的十四行詩💻
A Sonnet of Proximal Policy Optimization
舊我站在原地,新我急欲前行,
The old self stands its ground, the new one strains to go,
比值超過限制,截斷之手拉回,
When the ratio climbs too high, the clip will pull it low.
不要學得太快,每步細算代價,
Do not learn too fast; each step must count its cost,
ε 是繩,逾越便是損失。
Epsilon draws the leash — beyond it lies the loss.
十四行至此轉折,策略開始猶豫,
Here at the sonnet’s turn, the policy hesitates,
並非懦弱,是智慧避免崩潰。
Not cowardice, but wisdom: leap too far and it breaks.
舊我非敵,是穩固港灣,
The old self is no foe — it is the anchor, the safe shore,
新我在旁,寸步挪移,緩慢向前。
The new self stays nearby, inching just a little more.
截斷之手溫和,ε 守護,
The clipping hand is gentle, Epsilon holds true,
克制之中深藏真理。
For deep within restraint, a hidden truth comes through.
