Fundamental limitations of genomic language models for realistic sequence generation
1. A new study evaluates how well genomic language models (gLMs) generate realistic genomic sequences and finds that they fail to capture essential genomic features. The research highlights the need for specialized architectures to better model biological constraints.
2. The study focuses on Evo 2, a state-of-the-art gLM with 40 billion parameters, and tests its performance on diverse prokaryotic, eukaryotic, and viral genomes. Results show that while synthetic sequences capture local statistics, they fail to preserve long-range genomic organization and other key biological features.
3. Synthetic genomes generated by Evo 2 consistently fail to replicate natural k-mer spectra, showing systematic distortions in frequency chaos game representations (FCGRs). This indicates a lack of species-specific higher-order k-mer organization in the generated sequences.
4. The research also finds that synthetic genomes deviate significantly in nullomer content (nullomers are short k-mers absent from a genome): eukaryotic genomes show depletion of nullomers, while viral and prokaryotic genomes show enrichment. This highlights Evo 2's inability to capture domain-specific evolutionary constraints.
5. Non-B DNA motifs, which are crucial for genomic processes, are systematically distorted in synthetic genomes. Eukaryotic sequences show depletion of these motifs, while viral genomes show enrichment, indicating a failure to replicate the density and distribution of non-B DNA structures.
6. Transcription factor binding sites (TFBS) are systematically enriched in synthetic human sequences, which also lose native clustering and hotspot organization. This suggests that gLMs like Evo 2 reshape the regulatory motif landscape in a way that diverges from natural genomic patterns.
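
For readers who want a feel for the k-mer and nullomer comparisons in points 3 and 4, here is a minimal Python sketch. It is not the paper's pipeline: the function names (`kmer_counts`, `nullomers`, `spectrum_distance`), the single-strand counting, and the choice of total variation distance are all assumptions made for illustration.

```python
from collections import Counter
from itertools import product

ALPHABET = "ACGT"  # assumption: ambiguous bases such as N are skipped


def kmer_counts(seq, k):
    """Count overlapping k-mers on one strand, skipping non-ACGT windows."""
    seq = seq.upper()
    valid = set(ALPHABET)
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= valid:
            counts[kmer] += 1
    return counts


def nullomers(seq, k):
    """Return the sorted k-mers of length k that never occur in seq."""
    present = kmer_counts(seq, k)
    all_kmers = ("".join(p) for p in product(ALPHABET, repeat=k))
    return sorted(kmer for kmer in all_kmers if kmer not in present)


def spectrum_distance(seq_a, seq_b, k):
    """Total variation distance between normalized k-mer frequencies.

    Illustrative metric only; 0.0 means identical spectra, 1.0 means
    the two sequences share no k-mers.
    """
    a, b = kmer_counts(seq_a, k), kmer_counts(seq_b, k)
    total_a = sum(a.values()) or 1  # avoid division by zero on short input
    total_b = sum(b.values()) or 1
    keys = set(a) | set(b)
    return 0.5 * sum(abs(a[x] / total_a - b[x] / total_b) for x in keys)
```

Comparing `len(nullomers(natural, k))` with `len(nullomers(synthetic, k))` at a fixed `k` and matched sequence length gives a rough depletion or enrichment signal of the kind the thread describes; a real analysis would also need to account for reverse complements and genome size.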
📜Paper: biorxiv.org/content/10.648…
#Genomics #LanguageModels #ComputationalBiology #SyntheticGenomes #Bioinformatics
