
F Cancer: Why has AI had so little impact on Cancer? New essay, link below.

Ross (@rssrwn)
PhD student at AstraZeneca and Chalmers University working on generative models for molecules. Computer Science undergrad at Imperial College.
LAGOM: A Transformer-Based Chemical Language Model for Drug Metabolite Prediction

1. LAGOM is a Transformer-based model designed to predict drug metabolites directly from SMILES, offering an end-to-end alternative to traditional rule-based and two-step ML approaches.
2. Unlike conventional tools that rely heavily on manually curated transformation rules and intermediate site-of-metabolism (SoM) predictions, LAGOM leverages the Chemformer architecture to perform direct sequence-to-sequence translation from parent drugs to metabolites.
3. The model is trained with a curriculum-style transfer-learning pipeline: general chemical pretraining (Virtual Analogs), followed by metabolite-specific pretraining (MetaTrans), and fine-tuning on a rigorously curated dataset (the LAGOM dataset) built from DrugBank and MetXBioDB entries.
4. A key innovation is LAGOM's single-model approach, which outperforms both the rule-based GLORYx and SyGMa tools and the previous Transformer-based MetaTrans model on the standard GLORYx benchmark dataset.
5. SMILES randomisation during fine-tuning significantly boosts performance; other augmentation strategies (e.g. parent-grandchild reactions, property annotations) had limited or even negative effects.
6. The curated LAGOM dataset includes over 4,000 parent-metabolite pairs, with strict filtering on atom types, molecular weight, and Tanimoto similarity to ensure quality and eliminate overlap with test sets.
7. LAGOM achieves higher precision (0.18 vs 0.11) and F1 score (0.25 vs 0.17) than MetaTrans while maintaining comparable recall, reflecting a better balance between identifying correct metabolites and avoiding false positives.
8. An ensemble of multiple LAGOM models trained on different data splits further improves recall, though precision tends to drop as prediction diversity increases.
9. Evaluation goes beyond simple accuracy, reporting recall, precision, and F1 score over the top-k predictions per drug. All models maintain over 95% validity of generated SMILES strings.
10. The work also contributes a reproducible data-curation and training pipeline, enabling future research on metabolite prediction with chemical language models.
11. Despite these advances, the authors acknowledge that low-data regimes, high chemical diversity, and the one-to-many nature of metabolism remain challenges for model generalisation and evaluation.
12. Future directions include expanding the dataset with richer metabolic transformations and exploring model-selection strategies that better reflect external benchmark performance.

💻 Code: github.com/tsofiac/LAGOM
📜 Paper: doi.org/10.26434/chemr…

#DrugDiscovery #Chemoinformatics #AI4Science #MetabolitePrediction #TransformerModel #ComputationalPharmacology #SMILES #DeepLearning
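The Tanimoto-similarity filter in point 6 is easy to sketch in plain Python. This is a toy illustration, not the paper's actual pipeline: fingerprints are represented here as sets of "on" bit indices, and the overlap threshold of 0.9 is an assumption, not a value from the paper.

```python
def tanimoto(fp_a: set, fp_b: set) -> float:
    """Tanimoto (Jaccard) similarity between two fingerprint bit sets."""
    if not fp_a and not fp_b:
        return 0.0
    return len(fp_a & fp_b) / len(fp_a | fp_b)

def filter_pairs(pairs, test_fps, threshold=0.9):
    """Drop parent-metabolite pairs whose parent fingerprint is too
    similar to any test-set fingerprint (assumed overlap criterion)."""
    kept = []
    for parent_fp, metabolite_fp in pairs:
        if all(tanimoto(parent_fp, t) < threshold for t in test_fps):
            kept.append((parent_fp, metabolite_fp))
    return kept

# Toy fingerprints: the first parent is identical to a test molecule
# (similarity 1.0) and is dropped; the second shares no bits and is kept.
train_pairs = [({1, 2, 3}, {1, 2}), ({7, 8}, {8, 9})]
test_fps = [{1, 2, 3}]
print(filter_pairs(train_pairs, test_fps))
```

In practice the fingerprints would come from a cheminformatics toolkit rather than hand-written bit sets, but the set-intersection-over-union arithmetic is the same.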
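The top-k precision/recall/F1 evaluation described in points 7 and 9 can be sketched as below. Counts are pooled over all drugs (micro-averaged) purely for illustration; the paper's exact aggregation scheme is not specified in this summary, so treat this as an assumed convention.

```python
def topk_metrics(predictions, true_metabolites, k=5):
    """Precision/recall/F1 over the top-k predicted metabolites per drug.

    predictions: dict mapping drug -> ranked list of predicted SMILES
    true_metabolites: dict mapping drug -> set of known metabolite SMILES
    """
    tp = pred_total = true_total = 0
    for drug, ranked in predictions.items():
        top = set(ranked[:k])           # keep only the k highest-ranked guesses
        truth = true_metabolites[drug]
        tp += len(top & truth)          # correctly predicted metabolites
        pred_total += len(top)
        true_total += len(truth)
    precision = tp / pred_total if pred_total else 0.0
    recall = tp / true_total if true_total else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# One drug, two known metabolites, one recovered in the top 2:
p, r, f = topk_metrics({"d1": ["m1", "m2", "m3"]}, {"d1": {"m1", "m4"}}, k=2)
print(p, r, f)  # 0.5 0.5 0.5
```

The one-to-many nature of metabolism (point 11) is exactly why both precision and recall matter here: predicting many candidates lifts recall but dilutes precision.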
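The ensemble behaviour in point 8 can also be made concrete. A minimal sketch, assuming a simple vote-then-rank merging rule (the paper's actual combination strategy is not given in this summary): pooling top-k lists from several models enlarges the candidate set, which raises recall but tends to lower precision.

```python
from collections import Counter

def ensemble_topk(model_predictions, k=5):
    """Merge ranked prediction lists from several models.

    Candidates are ordered by vote count (how many models proposed them),
    with ties broken by the best rank any model gave them. The scoring
    rule is illustrative, not taken from the paper.
    """
    votes = Counter()
    best_rank = {}
    for ranked in model_predictions:
        for rank, smi in enumerate(ranked[:k]):
            votes[smi] += 1
            best_rank[smi] = min(best_rank.get(smi, rank), rank)
    merged = sorted(votes, key=lambda s: (-votes[s], best_rank[s]))
    return merged[:k]

# Three hypothetical models trained on different data splits:
models = [["A", "B", "C"], ["B", "D", "A"], ["B", "C", "E"]]
print(ensemble_topk(models, k=3))  # ['B', 'A', 'C']
```

Note that the union of all candidates here ({A, B, C, D, E}) is larger than any single model's list, which is the mechanism behind the recall gain and precision drop the summary describes.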