Edo Dotan (@DotanEdo) - Twitter Profile

Edo Dotan@DotanEdo·3 Jan

@razoralign Also described in: "Multiple sequence alignment as a sequence-to-sequence learning problem" (ICLR, 2023) openreview.net/forum?id=8efJY…

antisense.@razoralign·9 Jan

BetaAlign: a deep learning approach for multiple sequence alignment academic.oup.com/bioinformatics…

Edo Dotan@DotanEdo·9 Dec

@PastelBio Github link is now available!

Pastel BioScience@PastelBio·9 Dec

Github link doesn't appear to work :-( and there exists another called Prot2text which is confusing ... Protein2Text: Providing Rich Descriptions for Protein Sequences biorxiv.org/content/10.110… --- #proteomics #prot-preprint

Edo Dotan retweeted

Yonatan Belinkov@boknilev·8 Dec

New work led by Edo Dotan, with Iris Lyubman, Eran Bacharach and Tal Pupko @TelAvivUni @TechnionLive

Biology+AI Daily@BiologyAIDaily

Protein2Text: Providing Rich Descriptions for Protein Sequences
1. Introducing Protein2Text, a novel model that bridges protein sequences and natural language, generating rich textual descriptions of protein properties, functions, and roles.
2. The system leverages BetaDescribe, a model derived from LLAMA2, trained on over 120 billion biological and English tokens, enabling seamless integration of biological insights into generative language capabilities.
3. Protein2Text excels at describing proteins with low sequence similarity to training data, outperforming traditional methods like BlastP, especially when homologous sequences are unavailable.
4. The model comprises a generator for creating descriptions, validators for property prediction, and a judge to assess accuracy, offering robust multi-perspective outputs.
5. Key innovations include its ability to identify functionally important regions via in-silico mutagenesis, revealing biologically meaningful domains without experimental mutagenesis.
6. Compared to public large language models like GPT4, BetaDescribe demonstrates superior performance in protein-specific contexts, with higher accuracy and contextual relevance.
7. This tool advances functional protein annotation, with implications for medicine, agriculture, and protein engineering, and suggests potential for reverse application in protein design.
@boknilev 💻Code: github.com/technion-cs-nl… 📜Paper: biorxiv.org/content/10.110…
#ProteinBiology #GenerativeAI #Bioinformatics #ProteinFunction #NLP

Edo Dotan retweeted

Biology+AI Daily@BiologyAIDaily·8 Dec

Protein2Text: Providing Rich Descriptions for Protein Sequences
1. Introducing Protein2Text, a novel model that bridges protein sequences and natural language, generating rich textual descriptions of protein properties, functions, and roles.
2. The system leverages BetaDescribe, a model derived from LLAMA2, trained on over 120 billion biological and English tokens, enabling seamless integration of biological insights into generative language capabilities.
3. Protein2Text excels at describing proteins with low sequence similarity to training data, outperforming traditional methods like BlastP, especially when homologous sequences are unavailable.
4. The model comprises a generator for creating descriptions, validators for property prediction, and a judge to assess accuracy, offering robust multi-perspective outputs.
5. Key innovations include its ability to identify functionally important regions via in-silico mutagenesis, revealing biologically meaningful domains without experimental mutagenesis.
6. Compared to public large language models like GPT4, BetaDescribe demonstrates superior performance in protein-specific contexts, with higher accuracy and contextual relevance.
7. This tool advances functional protein annotation, with implications for medicine, agriculture, and protein engineering, and suggests potential for reverse application in protein design.
@boknilev 💻Code: github.com/technion-cs-nl… 📜Paper: biorxiv.org/content/10.110…
#ProteinBiology #GenerativeAI #Bioinformatics #ProteinFunction #NLP

Edo Dotan retweeted

bioRxiv Bioinfo@biorxiv_bioinfo·8 Dec

Protein2Text: Providing Rich Descriptions for Protein Sequences biorxiv.org/cgi/content/sh… #biorxiv_bioinfo

Edo Dotan@DotanEdo·21 Jul

@sivil_taram We report a significant performance boost on biological tasks when increasing the size of the tokenizer, as detailed in "Effect of Tokenization on Transformers for Biological Sequences" (Bioinformatics, 2024). academic.oup.com/bioinformatics…

Qian Liu@sivil_taram·19 Jul

🔥 Llama3 quadruples Llama2's vocab size from 32K to 128K, but our research says that's just the beginning! 🚀 New paper: "Scaling Laws with Vocabulary: Larger Models Deserve Larger Vocabularies"
🌟 Key findings:
1⃣ The optimal vocab sizes of LLMs depend on the computational budget, and larger models deserve larger vocabularies.
2⃣ 3B model: 32K → 43K vocab size = ARC-Challenge score 29.1 → 32.0 (same training FLOPs)!
3⃣ Llama3-70B should have 212K as the vocabulary size.
4⃣ Future Llama3-400B? We predict it needs 487K vocabulary size!
📖 Paper: huggingface.co/papers/2407.13… 🖥️ Code: github.com/sail-sg/scalin… Details in thread 🧵

Edo Dotan@DotanEdo·19 May

Special thanks to Gal Jaschek for their invaluable contributions, as well as to my supervisors Prof. Tal Pupko and Dr. @boknilev for their guidance and support throughout the project. Code: github.com/technion-cs-nl… Paper: academic.oup.com/bioinformatics…

Edo Dotan@DotanEdo·19 May

TL;DR: our findings suggest that integrating a #tokenizer into the training process of a #deep-learning model for #biological #sequences can significantly enhance performance.

Edo Dotan@DotanEdo·19 May

I'm pleased to announce that our paper, "Effect of #tokenization on transformers for biological sequences," has been accepted for publication in #Bioinformatics.

Edo Dotan retweeted

antisense.@razoralign·13 Apr

Effect of tokenization on transformers for biological sequences academic.oup.com/bioinformatics…

Edo Dotan retweeted

bioRxiv Bioinfo@biorxiv_bioinfo·28 Mar

BetaAlign: a deep learning approach for multiple sequence alignment biorxiv.org/cgi/content/sh… #biorxiv_bioinfo

Edo Dotan@DotanEdo·28 Mar

BetaAlign: a deep learning approach for multiple sequence alignment biorxiv.org/content/10.110…

Edo Dotan retweeted

Hadas Orgad@OrgadHadas·5 Sep

In exactly one month - we'll be presenting our model editing method -- TIME -- in #ICCV2023! Text-to-image diffusion models encode a lot of assumptions about the world, which is what allows them to generate beautiful images even with simple prompts. BUT >>> 🧵

Edo Dotan retweeted

Shmunis School of Biomedicine and Cancer Research@ShmunisR·9 Feb

1/7 Can you imagine translating genomic data like you would with a foreign language? Our latest research paper was accepted to ICLR utilizing seq2seq (translation) methods for a bioinformatics task! 🧵 👇 read more below #Bioinformatics #SequenceAlignment #NLProc #ICLR2023

Edo Dotan retweeted

Shmunis School of Biomedicine and Cancer Research@ShmunisR·29 May

📢 Exciting News! 🧬🌐 We are thrilled to announce the launch of GenomeFLTR, our webserver that simplifies the process of filtering reads! Say goodbye to the complexities of identifying contaminants and embrace the power of GenomeFLTR. 🧵1/4