
A novel Transformer-variant architecture. I used Gemini 3.1 to brainstorm some unusual ideas, including multi-objective loss functions and a few other things, then tried them out. Surprisingly, 8M- and 14M-parameter models train quite well on TinyStories V2, with smooth loss curves over 40 epochs.





















