
F Cancer: Why has AI had so little impact on Cancer? New essay, link below.

Ross (@rssrwn)
PhD student at AstraZeneca and Chalmers University working on generative models for molecules. Computer Science undergrad at Imperial College.
LAGOM: A Transformer-Based Chemical Language Model for Drug Metabolite Prediction

1. LAGOM is a Transformer-based model designed to predict drug metabolites directly from SMILES, offering an end-to-end alternative to traditional rule-based and two-step ML approaches.
2. Unlike conventional tools that rely heavily on manually curated transformation rules and intermediate site-of-metabolism (SoM) predictions, LAGOM leverages the Chemformer architecture to perform direct sequence-to-sequence translation from parent drugs to metabolites.
3. The model is trained with a curriculum-style transfer-learning pipeline: general chemical pretraining (Virtual Analogs), followed by metabolite-specific pretraining (MetaTrans), and fine-tuning on a rigorously curated dataset (the LAGOM dataset) built from DrugBank and MetXBioDB entries.
4. A key innovation is LAGOM's single-model approach, which outperforms both the rule-based GLORYx and SyGMa tools and the previous Transformer-based MetaTrans model on the standard GLORYx benchmark dataset.
5. SMILES randomisation during fine-tuning significantly boosts performance; other augmentation strategies (e.g. parent-grandchild reactions, property annotations) had limited or even negative effects.
6. The curated LAGOM dataset includes over 4,000 parent-metabolite pairs, with strict filtering on atom types, molecular weight, and Tanimoto similarity to ensure quality and eliminate overlap with test sets.
7. LAGOM achieves higher precision (0.18 vs 0.11) and F1 score (0.25 vs 0.17) than MetaTrans while maintaining comparable recall, reflecting a better balance between identifying correct metabolites and avoiding false positives.
8. An ensemble of multiple LAGOM models trained on different data splits further improves recall, though precision tends to drop as prediction diversity increases.
9. Evaluation goes beyond simple accuracy, reporting recall, precision, and F1 score over the top-k predictions per drug. All models maintain over 95% validity of generated SMILES strings.
10. The work also contributes a reproducible data-curation and training pipeline, enabling future research on metabolite prediction with chemical language models.
11. Despite these advances, the authors acknowledge that low-data regimes, high chemical diversity, and the one-to-many nature of metabolism remain challenges for model generalisation and evaluation.
12. Future directions include expanding the dataset with richer metabolic transformations and exploring model-selection strategies that better reflect external benchmark performance.

💻 Code: github.com/tsofiac/LAGOM
📜 Paper: doi.org/10.26434/chemr…

#DrugDiscovery #Chemoinformatics #AI4Science #MetabolitePrediction #TransformerModel #ComputationalPharmacology #SMILES #DeepLearning
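The Tanimoto-similarity filter in point 6 is easy to sketch in plain Python. This is a toy illustration, not the paper's actual pipeline: fingerprints are represented here as sets of "on" bit indices, and the overlap threshold of 0.9 is an assumption, not a value from the paper.

```python
def tanimoto(fp_a: set, fp_b: set) -> float:
    """Tanimoto (Jaccard) similarity between two fingerprint bit sets."""
    if not fp_a and not fp_b:
        return 0.0
    return len(fp_a & fp_b) / len(fp_a | fp_b)

def filter_pairs(pairs, test_fps, threshold=0.9):
    """Drop parent-metabolite pairs whose parent fingerprint is too
    similar to any test-set fingerprint (assumed overlap criterion)."""
    kept = []
    for parent_fp, metabolite_fp in pairs:
        if all(tanimoto(parent_fp, t) < threshold for t in test_fps):
            kept.append((parent_fp, metabolite_fp))
    return kept

# Toy fingerprints: the first parent is identical to a test molecule
# (similarity 1.0) and is dropped; the second shares no bits and is kept.
train_pairs = [({1, 2, 3}, {1, 2}), ({7, 8}, {8, 9})]
test_fps = [{1, 2, 3}]
print(filter_pairs(train_pairs, test_fps))
```

In practice the fingerprints would come from a cheminformatics toolkit rather than hand-written bit sets, but the set-intersection-over-union arithmetic is the same.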
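The top-k precision/recall/F1 evaluation described in points 7 and 9 can be sketched as below. Counts are pooled over all drugs (micro-averaged) purely for illustration; the paper's exact aggregation scheme is not specified in this summary, so treat this as an assumed convention.

```python
def topk_metrics(predictions, true_metabolites, k=5):
    """Precision/recall/F1 over the top-k predicted metabolites per drug.

    predictions: dict mapping drug -> ranked list of predicted SMILES
    true_metabolites: dict mapping drug -> set of known metabolite SMILES
    """
    tp = pred_total = true_total = 0
    for drug, ranked in predictions.items():
        top = set(ranked[:k])           # keep only the k highest-ranked guesses
        truth = true_metabolites[drug]
        tp += len(top & truth)          # correctly predicted metabolites
        pred_total += len(top)
        true_total += len(truth)
    precision = tp / pred_total if pred_total else 0.0
    recall = tp / true_total if true_total else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# One drug, two known metabolites, one recovered in the top 2:
p, r, f = topk_metrics({"d1": ["m1", "m2", "m3"]}, {"d1": {"m1", "m4"}}, k=2)
print(p, r, f)  # 0.5 0.5 0.5
```

The one-to-many nature of metabolism (point 11) is exactly why both precision and recall matter here: predicting many candidates lifts recall but dilutes precision.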
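The ensemble behaviour in point 8 can also be made concrete. A minimal sketch, assuming a simple vote-then-rank merging rule (the paper's actual combination strategy is not given in this summary): pooling top-k lists from several models enlarges the candidate set, which raises recall but tends to lower precision.

```python
from collections import Counter

def ensemble_topk(model_predictions, k=5):
    """Merge ranked prediction lists from several models.

    Candidates are ordered by vote count (how many models proposed them),
    with ties broken by the best rank any model gave them. The scoring
    rule is illustrative, not taken from the paper.
    """
    votes = Counter()
    best_rank = {}
    for ranked in model_predictions:
        for rank, smi in enumerate(ranked[:k]):
            votes[smi] += 1
            best_rank[smi] = min(best_rank.get(smi, rank), rank)
    merged = sorted(votes, key=lambda s: (-votes[s], best_rank[s]))
    return merged[:k]

# Three hypothetical models trained on different data splits:
models = [["A", "B", "C"], ["B", "D", "A"], ["B", "C", "E"]]
print(ensemble_topk(models, k=3))  # ['B', 'A', 'C']
```

Note that the union of all candidates here ({A, B, C, D, E}) is larger than any single model's list, which is the mechanism behind the recall gain and precision drop the summary describes.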