Jan Tempus

@Jan55028368

Doing some research on tokenisation.

Joined September 2021

7 Following · 66 Followers

Jan Tempus @Jan55028368 · 17h

Interestingly, Craig W. Schmidt has a second paper using LPs for tokenisation hitting arXiv today as well! Check it out: arxiv.org/abs/2605.22705

Jan Tempus @Jan55028368 · 17h

In our new paper, we reinterpret tokenisation as a problem in high-dimensional geometry (100M dims, to be precise!), which we can solve efficiently to get a globally near-optimal tokeniser! Our method consistently improves language models over BPE. See 🧵 for details.

Jan Tempus @Jan55028368 · 17h

As a bonus, by solving a relaxed linear program, our method lets you upper-bound how far from optimal any tokeniser is. w/ @tpimentelms Paper: arxiv.org/pdf/2605.22821 Code: github.com/JanTempus/toke…
