Ido Amos

13 posts

@AmosaurusRex

MSc student at Tel-Aviv University working on ML/DL

Joined January 2022
263 Following · 97 Followers

Ido Amos @AmosaurusRex
@JentseHuang Thanks @JentseHuang! Sounds very interesting. We mostly used Thinking States to represent reasoning in our work, but treating them as an internal memory indeed sounds very natural. I'll have a look at your experiments.

J Huang @JentseHuang
@AmosaurusRex Very cool work Ido! I am doing similar things: arxiv.org/abs/2505.10571 In this project we design three simple & effective experiments to show that current LLMs lack such internal memory & thinking. I think it’s worth trying Thinking States on our experiments.

Ido Amos @AmosaurusRex
Can LLMs reason internally while processing their inputs, similar to how humans think ahead as they process information? Our latest work introduces Thinking States, a novel architectural adaptation that transforms reasoning into an internal recurrent process. By training models to maintain a dynamic thinking state, we achieve significant inference speedups over Chain-of-Thought while substantially outperforming existing latent reasoning methods. Paper: arxiv.org/abs/2602.08332
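
Below is a minimal sketch of the idea as described above, assuming a PyTorch setup: the model keeps a small set of latent "thinking state" vectors and recurrently updates them as it reads each chunk of the input, so reasoning happens in the hidden state rather than in generated tokens. The names and the GRU-based update (ThinkingStateLM, n_state, chunk_size) are illustrative assumptions, not the paper's architecture.

```python
# Hedged sketch, not the paper's implementation: a model that maintains a
# recurrent "thinking state" while processing its input chunk by chunk.
import torch
import torch.nn as nn

class ThinkingStateLM(nn.Module):
    def __init__(self, vocab_size: int, d_model: int = 256, n_state: int = 8):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        # Learned initial thinking state: n_state latent "thought" vectors.
        self.init_state = nn.Parameter(torch.zeros(n_state, d_model))
        self.block = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        # Recurrent update of each thought vector from a summary of the chunk.
        self.state_update = nn.GRUCell(d_model, d_model)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, input_ids: torch.Tensor, chunk_size: int = 16):
        batch = input_ids.size(0)
        state = self.init_state.unsqueeze(0).expand(batch, -1, -1)
        logits = []
        for chunk in input_ids.split(chunk_size, dim=1):
            x = self.embed(chunk)
            # Let the chunk attend to the current thinking state.
            h = self.block(torch.cat([state, x], dim=1))
            state_h, tok_h = h[:, :state.size(1)], h[:, state.size(1):]
            # Internal reasoning step: update every thought vector recurrently.
            summary = tok_h.mean(dim=1, keepdim=True).expand_as(state_h)
            updated = self.state_update(
                summary.reshape(-1, summary.size(-1)),
                state_h.reshape(-1, state_h.size(-1)),
            )
            state = updated.view(batch, -1, updated.size(-1))
            logits.append(self.lm_head(tok_h))
        return torch.cat(logits, dim=1), state
```

The only point of the sketch is the contrast with Chain-of-Thought: the "thinking" lives in a state that evolves while the input is read, instead of in explicitly generated reasoning tokens.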

Ido Amos @AmosaurusRex
Thinking States outperforms existing latent reasoning methods on multiple benchmarks and matches Chain-of-Thought performance on multi-hop QA, while leading to faster inference times. Furthermore, Thinking States exhibits superior length generalization in state-tracking tasks, successfully extrapolating to sequences significantly longer than those seen during training. This work was done during an internship at Google Research with an incredible team of collaborators: @clu_avi @megamor2 @amirgloberson @jonherzig @LiorShani286867 @ISzpektor. Read the full paper and explore our findings here: arxiv.org/abs/2602.08332

Ido Amos @AmosaurusRex
A major challenge in latent reasoning is finding effective supervision for the reasoning process. Since thinking states are represented in natural language, we can leverage existing Chain-of-Thought data for supervision. Furthermore, as this supervision is available in advance, we use it to teacher-force the thinking states themselves. This circumvents the need for costly recurrent optimization via backpropagation through time (BPTT), enabling fully parallel training and maintaining nearly constant training costs regardless of reasoning depth.
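
A hedged sketch of what teacher-forcing the thinking states could look like, assuming each Chain-of-Thought step has already been encoded offline into a target state vector (gold_states below); the helper name and the MSE objective are illustrative assumptions, not the paper's recipe. Because every step receives the gold previous state as its input, no gradient crosses step boundaries, so all steps train in one parallel pass and BPTT is never needed.

```python
# Hedged sketch: teacher-forcing thinking states from precomputed CoT targets.
# gold_states is assumed to come from existing Chain-of-Thought text, with each
# reasoning step encoded into a vector ahead of training.
import torch
import torch.nn.functional as F

def teacher_forced_state_loss(step_fn, gold_states: torch.Tensor) -> torch.Tensor:
    """gold_states: (num_steps + 1, batch, d); step_fn maps state t -> state t+1."""
    prev = gold_states[:-1]     # gold input state for every reasoning step
    target = gold_states[1:]    # gold target state for every reasoning step
    # One batched call covers all steps: fully parallel, no backprop through
    # time, so training cost stays nearly constant in the reasoning depth.
    pred = step_fn(prev.reshape(-1, prev.size(-1)))
    return F.mse_loss(pred, target.reshape(-1, target.size(-1)))

# Toy usage (shapes only): 4 reasoning steps, batch of 8, state dimension 64.
step_fn = torch.nn.Sequential(torch.nn.Linear(64, 64), torch.nn.GELU(), torch.nn.Linear(64, 64))
loss = teacher_forced_state_loss(step_fn, torch.randn(5, 8, 64))
loss.backward()
```

At inference the model would feed its own predicted state back in rather than a gold one, which is the usual trade-off that comes with teacher forcing.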

Ido Amos @AmosaurusRex
@lovodkin93 Good luck on your exciting new journey!!

Aviv Slobodkin @NeurIPS @lovodkin93
I’m excited to share that I’ve started a full-time position as a Research Scientist at Google! 🚀 I’ve also moved to the Bay Area 🌉, so if you are around please text me and we can meet for coffee! To new beginnings!

Ido Amos @AmosaurusRex
Honestly cannot believe that our work got the BEST PAPER award @iclr_conf!!! This was an amazing experience with my collaborators @JonathanBerant @ankgup2, looking forward to sharing it with everyone at the conference. Reach out if you want to chat!
Ido Amos @AmosaurusRex

Excited to share my work with @JonathanBerant @ankgup2! We show pretraining on task data alone suffices to bridge the gap between state space models and transformers on Long Range Arena, leading to a significantly better estimate of model capabilities. arxiv.org/abs/2310.02980 🧵

Ido Amos @AmosaurusRex
@ibomohsin A really interesting point of view on LLMs and language in general! Can you expand on what you think fractal dimension means for language?

Ido Amos @AmosaurusRex
[4/4] Investigating the effects of data scale, we find self-pretraining is most effective in low-data regimes, underscoring its importance for evaluation across all dataset sizes. We further show that self-pretraining is effective across model sizes and when compute is limited.

Ido Amos @AmosaurusRex
[3/4] The marked effect of self-pretraining on long-sequence tasks leads us to rethink the necessity of complex designs, with Diagonal Linear RNNs (DLR) as a specific example. Our findings indicate that, when pretrained, simple architectures can be as effective as complex designs.

Ido Amos @AmosaurusRex
Excited to share my work with @JonathanBerant @ankgup2! We show pretraining on task data alone suffices to bridge the gap between state space models and transformers on Long Range Arena, leading to a significantly better estimate of model capabilities. arxiv.org/abs/2310.02980 🧵
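
For readers who want the gist as code, here is a minimal sketch of self-pretraining under stated assumptions: the same model is first trained with a generic next-token objective on the task's own inputs (labels untouched), then fine-tuned on the labelled task. The function name, the heads, and the shapes are illustrative, not the paper's exact setup; model(ids) is assumed to return hidden states of shape (batch, seq, d).

```python
# Hedged sketch of self-pretraining: pretrain on the task's own inputs with a
# next-token objective, then fine-tune the same weights on the task labels.
import torch
import torch.nn as nn

def self_pretrain_then_finetune(model, lm_head, cls_head, task_inputs, task_labels,
                                pretrain_steps=1000, finetune_steps=1000, lr=3e-4):
    params = list(model.parameters()) + list(lm_head.parameters()) + list(cls_head.parameters())
    opt = torch.optim.AdamW(params, lr=lr)
    ce = nn.CrossEntropyLoss()

    # Phase 1: self-pretraining on task inputs only; labels are never used here.
    for step in range(pretrain_steps):
        ids = task_inputs[step % len(task_inputs)]               # (batch, seq_len) token ids
        hidden = model(ids[:, :-1])
        loss = ce(lm_head(hidden).transpose(1, 2), ids[:, 1:])   # next-token prediction
        opt.zero_grad(); loss.backward(); opt.step()

    # Phase 2: standard supervised fine-tuning on the labelled task.
    for step in range(finetune_steps):
        ids = task_inputs[step % len(task_inputs)]
        labels = task_labels[step % len(task_labels)]            # (batch,) class labels
        hidden = model(ids)
        loss = ce(cls_head(hidden.mean(dim=1)), labels)          # sequence classification
        opt.zero_grad(); loss.backward(); opt.step()
```

The claim in the thread is that this extra phase alone, using no data beyond the task inputs themselves, closes much of the reported gap between state space models and transformers on Long Range Arena.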