Alexey Orlov

1K posts


@AlexeyAOrlov

Charting low-dimensional manifolds of chemistry with #AI. PhD in #MedChem | Asst. Prof. in #Cheminformatics @ UniStra | Opinions = my own

Chemical multiverse · Joined April 2018
1.3K Following · 408 Followers
Alexey Orlov reposted
Jorge Bravo Abad @bravo_abad
Knowledge graphs as the backbone of digital twins for chemical processes

Building a digital twin of a chemical reactor sounds simple in principle: connect a virtual model to the plant, feed it data, let it predict. In practice, every unit operation needs its own bespoke model, and the equations, parameters and process descriptions live scattered across papers, software and lab notebooks. Scaling this to hundreds of processes is the kind of problem where ontologies and graphs shine.

Shuyuan Zhang and coauthors propose a knowledge graph that organizes process model building blocks (variables, laws, formulas, phenomena, context) into two ontologies, OntoModel and OntoProcess. Formulas are stored in MathML and parse automatically into code for SciPy, Pyomo or Julia. Autonomous agents handle assembly, calibration, SPARQL rule inference, database queries, AI property prediction, and chemistry queries via an LLM.

Two workflows emerge. A bottom-up agent assembles models when phenomena are explicit, tested on an annular microreactor where Villermaux–Dushman calibration reveals tunable mixing times down to 0.1 ms. A top-down agent screens candidates when phenomena are ambiguous, applied to a ribbed Taylor–Couette reactor where the best dispersion law shifts with rotation speed and solvent. It then drives multi-objective optimization of a flow amidation, finding Pareto-optimal trade-offs between space-time yield and E-factor, and beating Bayesian optimization on a benchmark.

What I find compelling is the philosophy. Rather than training one black-box model per process, the authors treat models as structured, reusable knowledge objects, with LLMs and AI predictors as supporting agents. A clean answer to a familiar frustration: predictive science gets stuck not on math, but on the lack of shared semantics across teams and tools.

For groups in pharma, specialty chemicals or battery electrolytes, this points to digital twins that actually scale. Process knowledge becomes queryable infrastructure rather than tribal memory, and new reactors can be onboarded by adding instances to the graph rather than rebuilding from scratch.

Paper: Zhang et al., Nature Chemical Engineering (2026) — CC BY 4.0 | doi.org/10.1038/s44286…
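To make the "formulas parse automatically into code" idea concrete, here is a minimal sketch, assuming a dict-based stand-in for a graph node and a hypothetical `assemble_model` helper (neither is from the paper): a kinetic law stored as structured knowledge is bound to context parameters and handed to SciPy.

```python
# Minimal sketch, not the authors' code: a rate law stored as structured
# knowledge (analogous to a MathML formula in OntoModel) is assembled into
# a SciPy-solvable model. All names here are hypothetical.
from scipy.integrate import solve_ivp

first_order_decay = {
    "variables": {"C": "concentration [mol/L]", "k": "rate constant [1/s]"},
    "law": "dC/dt = -k * C",         # human-readable form
    "rhs": lambda t, C, k: -k * C,   # machine-usable compiled form
}

def assemble_model(block, params):
    """Bind a knowledge block to context parameters, yielding an ODE rhs."""
    return lambda t, y: block["rhs"](t, y, **params)

rhs = assemble_model(first_order_decay, {"k": 0.5})
sol = solve_ivp(rhs, (0.0, 10.0), [1.0], t_eval=[0, 2, 4, 6, 8, 10])
print(sol.y[0])                      # concentration profile over time
```

The design point the thread highlights is that the same stored law can be re-assembled for any reactor instance in the graph, instead of being re-coded per process.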
0 replies · 6 reposts · 24 likes · 983 views
Alexey Orlov reposted
Michael Bronstein @mmbronstein
Multiple postdoc positions in geometric ML and generative modeling are available at Oxford in collaboration with Aithyra and Imperial @bose_joey lnkd.in/eCMYpxzT
1 reply · 25 reposts · 83 likes · 7.8K views
Alexey Orlov reposted
Nicholas Runcie @NicholasRuncie
Excited to share our preprint: Molecular Representations for Large Language Models. We show that LLMs struggle with existing chemical formats, and that our new MolJSON representation substantially improves performance.
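The preprint defines the actual MolJSON schema; purely as a hedged illustration of the general idea (atoms and bonds spelled out as JSON rather than packed into a SMILES string), here is an RDKit sketch with field names that are my own invention, not the paper's:

```python
# Illustrative only: the field names below are hypothetical, not the
# paper's MolJSON schema. The idea: make atoms and bonds explicit for
# an LLM instead of forcing it to parse a dense SMILES string.
import json
from rdkit import Chem

def mol_to_json(smiles: str) -> str:
    mol = Chem.MolFromSmiles(smiles)
    doc = {
        "smiles": smiles,
        "atoms": [{"idx": a.GetIdx(), "element": a.GetSymbol(),
                   "charge": a.GetFormalCharge(), "aromatic": a.GetIsAromatic()}
                  for a in mol.GetAtoms()],
        "bonds": [{"from": b.GetBeginAtomIdx(), "to": b.GetEndAtomIdx(),
                   "order": str(b.GetBondType())}
                  for b in mol.GetBonds()],
    }
    return json.dumps(doc, indent=2)

print(mol_to_json("c1ccccc1O"))  # phenol, atom by atom
```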
6 replies · 6 reposts · 61 likes · 3.2K views
Alexey Orlov reposted
Biology+AI Daily @BiologyAIDaily
A 37-million-particle dataset from over 250 experiments to accelerate data-driven cryo-EM analysis

1. The paper introduces cryoPANDA (cryo-EM Particles ANnotated DAtaset): 37,623,123 curated experimental particle images from 252 cryo-EM experiments, designed to remove the main bottleneck for particle-level foundation models in cryo-EM: lack of large, diverse, richly annotated real data.

2. Scale and diversity are key: cryoPANDA spans 16 function-based protein classes and broad molecular-weight ranges (mean ~600 kDa; min 21 kDa; max 200,000 kDa), aiming to support models that generalize across targets and imaging conditions rather than being retrained per experiment.

3. Rich per-particle annotations go far beyond picking coordinates, covering acquisition parameters (e.g., voltage, dose, Cs), CTF estimates (defocus U/V, astigmatism angle), 2D classification statistics (class, alignment resolution, ESS, ECA), and 3D reconstruction metadata (Euler angles, translations, alignment error), plus links to EMDB maps and (when available) PDB models.

4. Dataset construction is not a simple scrape: the authors examined 495 EMPIAR entries, used sequence similarity (>30%) to cluster entries and reduce redundancy, then selected up to four representatives per cluster with manual curation for data quality and documentation, yielding 252 final experiments (mostly EMPIAR + 5 in-house).

5. A standardized cryoSPARC v4.6 processing pipeline is used to curate particles and attempt reconstructions: CTF estimation (when starting from micrographs), picking (blob picker or author coordinates), multiple rounds of 2D classification/selection with recovery of mistakenly rejected classes, duplicate removal using estimated particle diameter, and typical ab initio + refinement steps for 3D maps.

6. Reconstruction quality is validated against published EMDB maps (for cases with reported reconstructions): among 214 experiments with cryoPANDA reconstructions, 75 (35%) achieve better reported resolution than the published map and 139 (65%) are worse; differences are often explained by cryoPANDA using smaller particle subsets, with results becoming broadly comparable when particle fractions match.

7. A major contribution is demonstrating foundation-model readiness: the authors train a DINOv2 ViT-L/16 model from scratch on ~32M particles (215 experiments) and test generalization on 37 held-out experiments (~5M particles), using an experiment-level split to avoid leakage across near-identical acquisition settings or targets.

8. Without task-specific fine-tuning, the pretrained model yields micrograph-level representations that separate particle regions from background via sliding-window feature extraction and PCA-to-RGB visualization, despite the model being trained only on cropped particle images (not full micrographs).

9. The paper also shows a fully unsupervised particle-picking pipeline built on frozen DINOv2 features, evaluated on held-out EMPIAR-10017 with Henderson’s manual annotations: 91.5% recall, 45.5% precision (F1 60.8%). After downstream cryoSPARC cleanup, the picked particles support a 3D reconstruction at 4.38 Å, close to the published 4.20 Å and the cryoPANDA pipeline’s 4.29 Å for the same dataset.

10. Using cryoPANDA’s metadata, linear probes on frozen DINOv2 features can predict multiple particle properties (symmetry, pixel size, molecular weight, max diameter, EMDB resolution, defocus). Cross-experiment performance drops vs in-distribution, and the authors quantify that part of this gap comes from acquisition-parameter entanglement; regressing out acquisition parameters improves OOD accuracy across tasks, illustrating how the dataset enables mechanistic analysis of generalization failures.

💻 Code: github.com/azamanos/cryoP…
📜 Paper: biorxiv.org/content/10.648…

#cryoEM #StructuralBiology #DeepLearning #FoundationModels #SelfSupervisedLearning #Datasets #Bioinformatics #ComputationalBiology #EMPIAR #EMDB
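A minimal sketch of the residualization idea in point 10, using synthetic stand-ins for the frozen features, acquisition covariates, and labels (my own toy version, not the authors' code):

```python
# Toy sketch of 'regressing out acquisition parameters' before linear
# probing. All data here is synthetic; the real pipeline uses frozen
# DINOv2 particle features and cryoPANDA's per-particle metadata.
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(0)
feats = rng.normal(size=(1000, 64))      # stand-in for frozen ViT features
acq = rng.normal(size=(1000, 3))         # stand-in for voltage, dose, defocus
labels = rng.integers(0, 2, size=1000)   # stand-in particle property

# Remove the component of every feature that is linearly predictable
# from the acquisition parameters, keeping only the residual.
resid = feats - LinearRegression().fit(acq, feats).predict(acq)

# Fit the linear probe on the residualized features.
probe = LogisticRegression(max_iter=1000).fit(resid, labels)
print("probe accuracy:", probe.score(resid, labels))
```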
0 replies · 3 reposts · 23 likes · 2K views
Alexey Orlov reposted
Alex Rives @alexrives
Scaling laws are powering AI. It’s time to scale biology.

Today we’re launching the Virtual Biology Initiative to generate the data to unlock scaling laws in biology and build accurate predictive models of the cell.

Digital representations of proteins are already expanding our understanding of life at the molecular level, and accelerating the design of molecules and medicines. Accurate digital representations of the cell could reveal the mechanisms that are responsible for disease, and show how to reverse them.

The Protein Data Bank and worldwide repositories of protein sequence biodiversity were created through decades of work by the scientific community. The advances in artificial intelligence for proteins would not have been possible without them. The cell is orders of magnitude more complex, and we will need to create the data in just a few years rather than decades. This will require a coordinated global effort.

We're partnering with Broad, Wellcome Sanger, Arc, Allen, Human Cell Atlas, Human Protein Atlas, NVIDIA, and Renaissance Philanthropy. Biohub is contributing to this effort as both a funder and a builder. We are developing microscopy to observe millions of cells in living organisms, and cryo-ET to resolve the cell in atomic detail. We're building instruments that expand the range of modalities and parameters that can be simultaneously measured. We’re developing molecular, cellular, and tissue engineering to create models of disease and design interventions.

The data we generate will be available to the worldwide scientific community. We’re also committing $100M over the next five years to support work beyond Biohub. We invite other scientific teams and funders to join.

Link: biohub.org/news/virtual-b…
37 replies · 137 reposts · 733 likes · 124.9K views
Alexey Orlov reposted
Anthropic @AnthropicAI
We're launching the Anthropic STEM Fellows Program. AI will accelerate progress in science and engineering. We're looking for experts across these fields to work alongside our research teams on specific projects over a few months. Learn more and apply: job-boards.greenhouse.io/anthropic/jobs…
235 replies · 630 reposts · 6.2K likes · 967.1K views
Alexey Orlov reposted
Luca Naef @NaefLuca
📜 New paper with @mmbronstein: most data needed for AI4Science breakthroughs doesn't exist yet. And it won't - unless we fundamentally rethink data generation. Scaling up isn't enough. We need to stop generating data for humans and start generating for black-box models. We need black-box data 🤖 - 🧵pubs.rsc.org/en/content/art…
3 replies · 44 reposts · 153 likes · 41K views
Alexey Orlov reposted
Markus J. Buehler @ProfBuehlerMIT
Unreasonable Labs exists to build a different kind of machine for discovery: an AI designed to help close the gap between the known and the unknown. Our goal is not simply to generate plausible language about the world, but to reason about the world itself: to compress complexity into transferable principles, recompose them across domains, test them against physics rather than token statistics, and evolve with every hypothesis it examines. We design for the human as co-reasoner, contributing tacit knowledge, judgment, and cross-domain intuition that remain the deepest sources of leverage in discovery.

Unreasonable Labs is built on a simple conviction: reasoning grounded in first principles can take us beyond the limits of statistical prediction. Conventional AI optimizes for plausibility within distributions it has already seen - it doesn't reason about what lies outside them. If we want to invent a novel composite resin, discover a new chemical compound, or design a bio-inspired material, probabilistic fluency isn't enough. Our dynamic world model continuously evolves by integrating data with physics engines and experiments, updating its structured understanding of physical reality with every new hypothesis it tests.

In one of our example use cases, a materials engineer searching for structures that are simultaneously impact-resistant, flexible, and lightweight discovers unexpected inspiration in butterflies. This is the sort of cross-domain, unreasonable connection that standard AI misses, and that our platform is designed to surface, validate, and ground in physics. This approach extends the range of discoverable designs by making hidden cross-domain principles visible and testable, and brings AI to the physical world.

We are building a transparent, visual workspace where scientists can see the AI's logical steps, verify its connections, and steer the process using their own expertise - choosing the degree of autonomy appropriate to the task, from fully autonomous to deeply collaborative.

We wrote more about why we are building Unreasonable Labs, and where our name comes from (link below). The hint is in the word itself: it often takes unreasonable people to change the world.

“The reasonable man adapts himself to the world: the unreasonable one persists in trying to adapt the world to himself. Therefore all progress depends upon the unreasonable man.” (George Bernard Shaw)

@unreasonable_ai @caoyuan33, Andrew Lew, Haiqian Yang, Jennifer Kang, Matt Insler, @ProfBuehlerMIT
7 replies · 20 reposts · 95 likes · 7.1K views
Alexey Orlov reposted
Jorge Bravo Abad @bravo_abad
Self-driving labs you can actually afford: Bayesian optimization meets $5,000 hardware

Self-driving laboratories (SDLs) promise to revolutionize chemical discovery by closing the loop between experiment design, execution, and analysis. But there's a catch: most existing platforms cost upwards of $100,000—often much more once you add inline analytics like NMR or HPLC. That price tag has walled off autonomous experimentation to a handful of well-funded groups, amplifying the Matthew effect in chemical research.

Simone Pilon and coauthors tackle this with RoboChem-Flex, a modular SDL built largely from 3D-printed parts, Arduino microcontrollers, and aluminum profiles, with a human-in-the-loop entry configuration that brings the total cost to around $5,000. But the more interesting story is the ML stack on top. At its core sits "RoBrains," a Bayesian optimization engine built on BoTorch supporting a broad toolkit: single- and multi-objective acquisition functions (UCB, qEHVI, qLogNEHVI, qLogNParEGO), GP and random forest surrogates, transfer learning via multi-task GPs, hybrid batching, and heteroskedastic noise modeling for low-SNR analytics.

The team validates the platform across six case studies that each stress-test a different ML capability: an adaptive UCB that flips between exploitation and exploration when yields plateau (photocatalytic trifluoromethylation, 70% in 2 min); hypervolume optimization for selectivity trade-offs (deoxygenative C–H alkylation); noise-aware optimization compensating for a homemade Raman setup (H/D exchange, 64% D incorporation vs. 0–38% in prior reports); transfer learning across two Buchwald–Hartwig couplings using UMAP-projected DFT ligand descriptors, where the second substrate converged in just 4 extra experiments after learning from the first; and a three-objective enantioselective [2+2] cycloaddition (>99% ee, 80% yield).

A recurring theme: featurization and acquisition function choice matter as much as the surrogate model, and modular frameworks let chemists match the algorithm to the problem rather than the other way around.

This lowers two barriers at once: the capital cost of automation and the data-efficiency cost of exploring large condition spaces. Smaller teams can now run transfer-learning-driven catalyst screens or impurity-aware multi-objective optimizations without dedicated HTE, and the open-source code and hardware files mean methodology developed in academic labs can transfer directly into process development.

Paper: Pilon et al., Nature Synthesis (2026) — CC BY 4.0 | nature.com/articles/s4416…
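For flavor, here is a minimal single-objective BoTorch loop of the general kind RoBrains builds on: a made-up "yield" objective and plain UCB, not the paper's code and none of its multi-objective or transfer-learning machinery.

```python
# Minimal closed-loop Bayesian optimization with BoTorch (toy objective,
# illustrating the model -> acquire -> experiment cycle described above).
import torch
from botorch.models import SingleTaskGP
from botorch.fit import fit_gpytorch_mll
from botorch.acquisition import UpperConfidenceBound
from botorch.optim import optimize_acqf
from gpytorch.mlls import ExactMarginalLogLikelihood

def yield_fn(x):                     # stand-in for running one experiment
    return torch.sin(6 * x) * x + 0.05 * torch.randn_like(x)

bounds = torch.tensor([[0.0], [1.0]], dtype=torch.double)
X = torch.rand(4, 1, dtype=torch.double)         # initial experiments
Y = yield_fn(X)

for _ in range(10):                  # closed loop
    gp = SingleTaskGP(X, Y)
    fit_gpytorch_mll(ExactMarginalLogLikelihood(gp.likelihood, gp))
    ucb = UpperConfidenceBound(gp, beta=2.0)     # exploration knob
    cand, _ = optimize_acqf(acq_function=ucb, bounds=bounds, q=1,
                            num_restarts=5, raw_samples=32)
    X = torch.cat([X, cand])
    Y = torch.cat([Y, yield_fn(cand)])

print("best observed yield:", Y.max().item())
```

Swapping `UpperConfidenceBound` for a multi-objective acquisition function such as qLogNEHVI is the kind of modularity the thread credits for matching the algorithm to the problem.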
0 replies · 14 reposts · 22 likes · 1.7K views
Alexey Orlov reposted
Jorge Bravo Abad @bravo_abad
Incrementally trained language models design drug molecules more potent than their training data

Chemical language models (CLMs) are now standard for de novo molecular design: pretrain on millions of SMILES, fine-tune on ligands for a target of interest, sample candidates. But one task has stubbornly resisted them—structural optimization, meaning taking a known active scaffold and squeezing more potency from it without leaning on external oracles like docking or QSAR predictors.

Tim Hörmann and coauthors close this gap with a deceptively simple idea: mimic how a medicinal chemist actually learns. Instead of dumping all known analogs into one fine-tuning pass, they split the structure-activity relationship (SAR) series into subsets of increasing potency and fine-tune an LSTM-based CLM incrementally—one potency tier at a time. The model walks up the activity ladder, refining what it learned at each step.

Retrospectively, across 27 PPARγ agonist SAR series, the incremental strategy consistently beats one-shot fine-tuning at rediscovering held-out high-potency molecules. The perplexity-potency correlation also becomes positive and sharper, meaning the model internally "knows" which samples should be most active—without any external scorer.

Then comes the prospective test. Applied to a benzimidazole PPARγ agonist scaffold, 9 synthesized top-ranked designs all outperform the best known training molecule. Five benzoic acid derivatives hit EC50 values of 0.6–3.1 nM, 12 to 62× more potent than the reference. A second campaign on RORγ inverse agonists produces a 30 nM sulfonamide 20× more potent than its closest training neighbor—and the model correctly picked a rare but potency-driving N-tert-butyl motif present in just one training molecule, capturing long-range SAR dependencies.

Two points stand out from an ML angle: the added potency dimension in the training schedule creates a smoother gradient landscape that LSTMs exploit well, and perplexity alone—a model-intrinsic quantity—becomes a reliable ranking signal, removing the external-oracle bottleneck that has constrained generative drug design.

For industrial drug discovery pipelines, this is a practical shift: hit-to-lead optimization can be driven directly from in-house SAR tables without building a bespoke scoring model for every target, which matters especially where docking or QSAR predictors are unreliable. It brings generative chemistry meaningfully closer to the daily workflow of medicinal chemistry teams.

Paper: Hörmann et al., Nature Communications (2026) — CC BY 4.0 | doi.org/10.1038/s41467…
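A schematic of the incremental schedule, assuming a toy character-level LSTM and made-up SMILES tiers (the real model, data splits, and hyperparameters are in the paper):

```python
# Toy sketch of incremental fine-tuning by potency tier, not the authors'
# code: the same model is fine-tuned tier by tier, weakest first, and
# perplexity then serves as a model-intrinsic ranking signal.
import math
import torch
import torch.nn as nn

vocab = {ch: i for i, ch in enumerate("^$CNOc1=()#")}  # tiny SMILES alphabet

def encode(s):
    return torch.tensor([[vocab[c] for c in "^" + s + "$"]])

class CLM(nn.Module):                      # minimal character-level LSTM CLM
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(len(vocab), 32)
        self.lstm = nn.LSTM(32, 64, batch_first=True)
        self.out = nn.Linear(64, len(vocab))
    def forward(self, x):
        h, _ = self.lstm(self.emb(x))
        return self.out(h)

def fine_tune(model, smiles, epochs=50):   # one pass on one potency tier
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for s in smiles:
            x = encode(s)
            logits = model(x[:, :-1])      # predict each next character
            loss = loss_fn(logits.reshape(-1, len(vocab)),
                           x[:, 1:].reshape(-1))
            opt.zero_grad(); loss.backward(); opt.step()

def perplexity(model, s):                  # model-intrinsic ranking signal
    with torch.no_grad():
        x = encode(s)
        logits = model(x[:, :-1])
        nll = nn.CrossEntropyLoss()(logits.reshape(-1, len(vocab)),
                                    x[:, 1:].reshape(-1))
    return math.exp(nll.item())

# Potency tiers, weakest to strongest (toy stand-ins for a real SAR series).
tiers = [["CCO", "CCC"], ["CC=O", "CCN"], ["c1ccccc1N"]]
model = CLM()
for tier in tiers:                         # walk up the activity ladder
    fine_tune(model, tier)
print(perplexity(model, "c1ccccc1N"))      # lower = deemed more 'on-trend'
```

The key move is that `fine_tune` is called repeatedly on the same model rather than once on the pooled data, so later tiers refine what earlier tiers taught it.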
8 replies · 24 reposts · 121 likes · 6.8K views
Alexey Orlov reposted
Scholarships Corner @scholar_corner
Humboldt Research Fellowship 2027 in Germany | Fully Funded

Applications are now open for the Humboldt Research Fellowship 2027. This is a great opportunity for international researchers to conduct research in Germany with full financial support and access to top institutions.

Benefits include: Monthly stipend (€3,000–€3,600), travel allowance, health insurance support, free German language course, family allowances, and strong academic support.

📍 Duration: 6–24 months
🌍 Open to applicants from all countries

This fellowship allows you to work with leading researchers and gain international experience in your field.

🔗 Visit: scholarshipscorner.website/humboldt-resea…

Credit: Alexander von Humboldt Foundation

📌 Disclaimer: This post is for informational purposes only. Scholarships Corner does not own or manage this opportunity. Please verify details on the official website before applying.

#ScholarshipsCorner #fellowship #ResearchFellowship #studyingermany #scholarship
9 replies · 378 reposts · 1.4K likes · 429.4K views
Alexey Orlov reposted
Markus J. Buehler @ProfBuehlerMIT
The next frontier in protein design will not be defined by structure alone, but by the capacity to engineer motion as a first-class principle of function. This is because dynamics is where the real biology lives.

Foundational work by Karplus, Levitt & Warshel made clear that chemistry cannot be understood without motion, mechanism, and scale. Gō, Brooks & others showed that proteins possess characteristic collective motions - low-frequency normal modes that capture how whole molecules bend, breathe, and fluctuate. Frauenfelder then sharpened the picture further: proteins are not static objects occupying a single minimum, but dynamic ensembles traversing rugged energy landscapes. And yet the modern AI revolution in protein science has been, above all, a revolution in structure.

In our new paper in Matter, @_Bo_Ni and I ask a different question: not what structure will this sequence adopt? but what sequence will realize a prescribed pattern of motion?

VibeGen inverts the conventional design paradigm. Rather than treating dynamics as a consequence to be analyzed after the fact, it makes dynamics the design objective from the outset. Using a language diffusion model with two cooperating agents - a designer that proposes sequences and a predictor that critiques them against the target motion profile - the system converges on de novo proteins with tailored vibrational behavior.

One of the most intriguing results is a form of functional degeneracy - distinct sequences and distinct folds can satisfy the same target dynamical specification. For a given functional pattern of motion, evolution may have sampled only a small region of the physically realizable design space. The space of viable molecular mechanics may be far larger than the repertoire biology happened to discover.

We have made "vibe" into a cultural metaphor - something intuitive, affective, subjective. But at the molecular scale, vibe is not metaphor: It is physics. For a protein, the vibe is the pattern of motion itself; the fluctuations, resonances, and collective displacements that determine what the molecule can do.
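The "pattern of motion" being designed for can be made concrete with the classic coarse-grained tool the post alludes to: normal modes of an elastic network. Below is a minimal Gaussian network model sketch with made-up coordinates, an illustration of normal-mode analysis rather than anything from VibeGen itself:

```python
# Minimal Gaussian network model: low-frequency modes and per-residue
# fluctuations from a (fake) C-alpha trace. Illustration only.
import numpy as np

rng = np.random.default_rng(1)
coords = np.cumsum(rng.normal(scale=2.0, size=(30, 3)), axis=0)  # fake trace

# Kirchhoff matrix: residues within a cutoff are coupled by unit springs.
# (Assumes the resulting contact graph is connected.)
cutoff = 7.0
d = np.linalg.norm(coords[:, None] - coords[None, :], axis=-1)
K = -(d < cutoff).astype(float)
np.fill_diagonal(K, 0.0)
np.fill_diagonal(K, -K.sum(axis=1))       # diagonal = contact degree

# Eigen-decomposition: small nonzero eigenvalues are the slow collective
# modes; fluctuations follow from the pseudo-inverse (skip the zero mode).
vals, vecs = np.linalg.eigh(K)
slowest = vecs[:, 1]                                 # slowest real mode
msf = np.sum(vecs[:, 1:] ** 2 / vals[1:], axis=1)    # ~ B-factor profile
print("slowest-mode shape:", np.round(slowest, 2))
print("most mobile residue:", int(np.argmax(msf)))
```

In these terms, the inverse problem the paper tackles is: given a target profile like `slowest`, find sequences whose networks reproduce it.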
33 replies · 146 reposts · 810 likes · 88.8K views
Alexey Orlov reposted
EMBL @embl
Biology is built on patterns too complex for us to easily interpret. AI can help bridge the gap, but it’s a two-way exchange: biology inspires AI, and AI delivers back. Here are 7 key takeaways from a recent EMBO | EMBL symposium on AI in life sciences: embl.org/news/science-t…
0 replies · 29 reposts · 69 likes · 6.8K views
Alexey Orlov reposted
alphaXiv @askalphaxiv
The best way to learn frontier research is to replicate it yourself. And now, you can also win prizes for that!

We are excited to announce our partnership with @marimo_io for a competition to bring research to life. All you have to do is pick a paper, build a marimo notebook that brings the core idea to life, and experiment with the research topic.

Prizes: Mac Mini + $500! 👀
Deadline: April 26, 11:59 PM PST
Individual and team submissions are all welcome

Full details found below 👇
7 replies · 97 reposts · 914 likes · 58.2K views
Alexey Orlov reposted
Demis Hassabis @demishassabis
Excited to launch Gemma 4: the best open models in the world for their respective sizes. Available in 4 sizes that can be fine-tuned for your specific task: 31B dense for great raw performance, 26B MoE for low latency, and effective 2B & 4B for edge device use - happy building!
327 replies · 883 reposts · 8K likes · 986.8K views
Alexey Orlov reposted
Georgia Channing @cgeorgiaw
WE ARE LIVE. The 🧬 OpenADMET PXR Induction Challenge 🧬 is officially open and taking submissions RIGHT NOW. Predict which drug candidates will cause dangerous drug-drug interactions before they reach a patient. More in 🧵 (+ new data just for you)
2 replies · 12 reposts · 46 likes · 4.4K views