Edo Dotan

18 posts

@DotanEdo

Joined August 2023
16 Following · 15 Followers
Edo Dotan retweeted
Yonatan Belinkov (@boknilev)
New work led by Edo Dotan, with Iris Lyubman, Eran Bacharach and Tal Pupko @TelAvivUni @TechnionLive
Biology+AI Daily (@BiologyAIDaily)

Protein2Text: Providing Rich Descriptions for Protein Sequences

1. Introducing Protein2Text, a novel model that bridges protein sequences and natural language, generating rich textual descriptions of protein properties, functions, and roles.
2. The system leverages BetaDescribe, a model derived from LLAMA2, trained on over 120 billion biological and English tokens, enabling seamless integration of biological insights into generative language capabilities.
3. Protein2Text excels at describing proteins with low sequence similarity to training data, outperforming traditional methods like BlastP, especially when homologous sequences are unavailable.
4. The model comprises a generator for creating descriptions, validators for property prediction, and a judge to assess accuracy, offering robust multi-perspective outputs.
5. Key innovations include its ability to identify functionally important regions via in-silico mutagenesis, revealing biologically meaningful domains without experimental mutagenesis.
6. Compared to public large language models like GPT4, BetaDescribe demonstrates superior performance in protein-specific contexts, with higher accuracy and contextual relevance.
7. This tool advances functional protein annotation, with implications for medicine, agriculture, and protein engineering, and suggests potential for reverse application in protein design.

@boknilev
💻 Code: github.com/technion-cs-nl…
📜 Paper: biorxiv.org/content/10.110…
#ProteinBiology #GenerativeAI #Bioinformatics #ProteinFunction #NLP

Edo Dotan (@DotanEdo)
@sivil_taram We report a significant performance boost on biological tasks when increasing the size of the tokenizer, as detailed in "Effect of Tokenization on Transformers for Biological Sequences" (Bioinformatics, 2024). academic.oup.com/bioinformatics…
Qian Liu (@sivil_taram)
🔥 Llama3 quadruples Llama2's vocab size from 32K to 128K, but our research says that's just the beginning! 🚀 New paper: "Scaling Laws with Vocabulary: Larger Models Deserve Larger Vocabularies"

🌟 Key findings:
1. The optimal vocab sizes of LLMs depend on the computational budget, and larger models deserve larger vocabularies.
2. 3B model: 32K → 43K vocab size = ARC-Challenge score 29.1 → 32.0 (same training FLOPs)!
3. Llama3-70B should have 212K as the vocabulary size.
4. Future Llama3-400B? We predict it needs 487K vocabulary size!

📖 Paper: huggingface.co/papers/2407.13…
🖥️ Code: github.com/sail-sg/scalin…
Details in thread 🧵
Edo Dotan (@DotanEdo)
I'm pleased to announce that our paper, "Effect of #tokenization on transformers for biological sequences" has been accepted for publication in #Bioinformatics.
Edo Dotan retweeted
Hadas Orgad (@OrgadHadas)
In exactly one month - we'll be presenting our model editing method -- TIME -- in #ICCV2023! Text-to-image diffusion models encode a lot of assumptions about the world, which is what allows them to generate beautiful images even with simple prompts. BUT >>> 🧵
Edo Dotan retweeted
Shmunis School of Biomedicine and Cancer Research
📢 Exciting News! 🧬🌐 We are thrilled to announce the launch of GenomeFLTR, our webserver that simplifies the process of filtering reads! Say goodbye to the complexities of identifying contaminants and embrace the power of GenomeFLTR. 🧵1/4