Carlos Acevedo-Rocha

3.6K posts

@CaGuAcRo1

Group leader of the Computational Protein Engineering at DTU Biosustain, Denmark. All views my own.

Denmark · Joined July 2013

755 Following · 418 Followers

Carlos Acevedo-Rocha reposted

Jorge Bravo Abad@bravo_abad·21h

Compressing the collective knowledge of ESM into a single protein language model

Predicting whether a genetic variant is harmful or benign is one of the most consequential tasks in computational biology. A mutation in BRCA1 or PCSK9 can mean the difference between a healthy carrier and a serious disease. Most top-performing variant effect prediction (VEP) methods get their edge by combining protein language models (PLMs) with 3D structure, multiple sequence alignments (MSAs), or population genetics: extra information that is expensive, incomplete, or potentially circular in clinical settings.

Tuan Dinh and coauthors ask a sharp question: are sequence-only PLMs fundamentally limited, or just under-exploited? The key insight is that different ESM models, despite nearly identical architectures, have complementary blind spots. ESM2 reliably detects KRAB domains; ESM1b detects BRICHOS domains; each catches what the other misses. Rather than averaging predictions (which dilutes rare signals), the authors select the minimum log-likelihood ratio across all models, that is, the prediction most confident that a residue is mutationally sensitive. This signal then drives co-distillation of the entire ESM family into improved single models (VESM), through iterative rounds where models alternately teach and learn from each other.

The results are remarkable. VESM-3B, trained exclusively on unaligned sequences, matches or surpasses SaProt, PoET, TranceptEVE, and even AlphaMissense, a closed-source model trained on 3D structure, MSAs, and population allele frequencies. Critically, VESM maintains consistent performance across all allele frequencies, outperforming AlphaMissense precisely on the rare variants where clinical interpretation matters most. VESM scores also correlate quantitatively with continuous phenotypes in UK Biobank data, extending VEP from binary pathogenicity to quantitative trait prediction.

This is directly actionable: a sequence-only model at state-of-the-art accuracy removes hard dependencies on structural data or population databases, enabling scalable proteome-wide variant scoring, including for targets with no known structure, no deep alignment, and no prior clinical annotation. Paper: Dinh et al., Nature Methods (2026) | CC BY 4.0 | nature.com/articles/s4159…
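To make the "minimum versus average" idea in the post above concrete, here is a minimal sketch. It assumes each model reports a log-likelihood ratio (LLR) per variant, with more negative values meaning "more likely damaging"; the function name `aggregate_llr` and the toy numbers are illustrative assumptions, not values or code from the paper.

```python
import numpy as np

def aggregate_llr(llr_by_model: np.ndarray) -> dict:
    """Combine per-model LLRs of shape (n_models, n_variants).

    'mean' is the usual ensemble average; 'min' keeps, for each variant,
    the most negative (most confident "damaging") score among the models.
    """
    return {
        "mean": llr_by_model.mean(axis=0),
        "min": llr_by_model.min(axis=0),
    }

# Toy scores (hypothetical): only one model flags variant 1, another flags variant 2.
llr = np.array([
    [-0.5, -9.0, -0.2],  # model A
    [-0.4, -1.0, -0.3],  # model B
    [-0.6, -0.8, -7.5],  # model C
])
scores = aggregate_llr(llr)
print("min :", scores["min"])
print("mean:", scores["mean"])
```

In this toy case the minimum keeps the strong signal that a single model found for variants 1 and 2, while the average pulls both toward the uninformative scores of the other two models.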

Carlos Acevedo-Rocha reposted

Jorge Bravo Abad@bravo_abad·20h

Generating protein geometry with diffusion, no homology needed

Predicting a protein's 3D structure from its sequence is one of biology's hardest problems. AlphaFold2 and RoseTTAFold solved it impressively, but with a catch: they rely heavily on multiple sequence alignments and structural templates mined from large databases. For proteins with few or no known relatives (orphan proteins, rapidly evolving viral proteins), that information simply doesn't exist, and prediction accuracy collapses. Protein language models like ESMFold sidestep database searches, but carry their own cost: transformer architectures that scale as O(n³) with sequence length, making them memory-hungry and slow for long sequences.

Xudong Wang and coauthors propose TDFold, which takes a different path entirely. Rather than searching for homologous structures, it generates inter-residue geometries (distance and orientation matrices encoding how residues relate in 3D space) directly from sequence, using a fine-tuned Stable Diffusion model. The amino acid sequence becomes the text prompt; the geometry matrices become the "image" to generate. Two LoRA branches are attached to the SD model's text encoder and UNet: one aligns sequence embeddings with geometric image features, the other learns the distribution of inter-residue distances and orientations. These generated 2D templates then feed a lightweight graph network that predicts full 3D atomic coordinates.

On orphan protein benchmarks, TDFold outperforms ESMFold (TM-score gains of 0.04–0.07) and even AlphaFold2 running with full MSA and templates. It handles virus-related proteins with scarce homology, including SARS-CoV-2 accessory proteins, with markedly higher accuracy than competing methods. And it does this in ~10 seconds per 500-residue protein versus ~100 s for ESMFold and ~1,000 s for AlphaFold2, requiring only 7 GB of GPU memory and trainable on a single consumer GPU.

This is a practical shift: rapid, low-cost structure prediction for poorly characterized targets (viral antigens, orphan disease proteins, novel enzyme scaffolds) no longer requires expensive compute or homology-rich databases. High-throughput structural screening becomes accessible even to resource-limited teams. Paper: Wang et al., Nature Machine Intelligence (2026) | Journal copyright | nature.com/articles/s4225…
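The "geometry as image" idea is easier to picture with a small example. The sketch below computes an inter-residue distance matrix from 3D coordinates, which is the kind of 2D array the post describes as the generation target. It is only an illustration: the function name `distance_map` and the toy coordinates are assumptions, and TDFold's actual matrix definitions (including the orientation channels) are not reproduced here.

```python
import numpy as np

def distance_map(coords: np.ndarray) -> np.ndarray:
    """Pairwise Euclidean distances for coordinates of shape (n_residues, 3)."""
    diff = coords[:, None, :] - coords[None, :, :]  # (n, n, 3) displacement vectors
    return np.sqrt((diff ** 2).sum(axis=-1))        # (n, n) symmetric matrix

# Three hypothetical C-alpha positions, 3.8 Å apart along a right-angle path.
coords = np.array([
    [0.0, 0.0, 0.0],
    [3.8, 0.0, 0.0],
    [3.8, 3.8, 0.0],
])
d = distance_map(coords)
print(np.round(d, 2))
```

The result is a symmetric n × n array with a zero diagonal, so it can be handled like a single-channel image whose side length equals the number of residues.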

Carlos Acevedo-Rocha reposted

Biology+AI Daily@BiologyAIDaily·1d

Compressing the collective knowledge of ESM into a single protein language model @naturemethods
1. The paper argues that “sequence-only” protein language models (PLMs) are not intrinsically capped for variant-effect prediction (VEP); instead, their evolutionary signals are fragmented across model families and can be recovered by making models learn from each other.
2. Key observation: closely related ESM models have complementary blind spots. For example, ESM2 models systematically miss KRAB-domain conservation signals, while ESM1b/ESM1v can miss BRICHOS-domain signals; yet at least one model in the family captures each domain’s mutational sensitivity.
3. They introduce a simple but effective ensemble rule: for each missense mutation, take the minimum log-likelihood ratio (LLR) across models (ESMIN), i.e., “maximum confidence” scoring. This can amplify subtle evolutionary constraints that averaging would dilute.
4. A theoretical analysis explains when min-LLR beats averaging: if pathogenic-variant LLRs are more dispersed across models than benign-variant LLRs (variance asymmetry). The ESM family empirically shows this property, making maximum-confidence aggregation advantageous.
5. ESMIN is evaluated using 11 sequence-only ESM models (ESM1b, five ESM1v, five ESM2; excluding ESM2-15B). It outperforms averaging-based ensembles and improves ProteinGym DMS correlations, with gains occurring in ~50% of assays (versus ~20% for typical ensembles).
6. Main methodological contribution: “maximum-confidence co-distillation.” For each protein, all models score all mutations; the element-wise minimum LLR matrix becomes a teacher signal, and each model is trained (variant-level MSE) to match these confident targets, without MSAs, structures, or population genetics features.
7. Co-distillation substantially improves every participating model, including small ones: ESM2-8M improves on ClinVar AUC from ~0.65 to ~0.88. Several co-distilled single models (e.g., ESM2-3B, ESM1b, ESM2-650M) can even surpass the ESMIN teacher signal (“student surpasses teacher”).
8. Robustness/ablation: improvements persist when training data are heavily reduced and de-homologized. With only ~1% of human proteins (~200 sequences; <30% identity to benchmark proteins), ESM2-35M reaches ~97% (ClinVar) and ~94% (DMS) of its peak co-distilled performance.
9. Iterative procedure: after round 1 (min-LLR co-distillation), additional rounds switch to average-aggregation co-distillation. As models improve, class-conditional variances become more symmetric, making averaging slightly better; after 3 rounds, a single 3B model matches the ensemble and is named VESM-3B.
10. Practical compression: VESM-3B is distilled into smaller models (650M, 150M, 35M) that retain most performance (reported as >98% on Balanced ClinVar and >93% on ProteinGym DMS relative to VESM-3B), enabling high-throughput VEP under limited compute.
11. Clinical benchmark (ProteinGym ClinVar, 2,227 genes): sequence-only VESM models outperform other sequence-only PLMs (including ESM-C) and compete with or surpass methods using MSA/structure/population priors. VESM-3B shows balanced ROC behavior across specificity and sensitivity regimes.
12. AlphaMissense comparison: VESM-3B performance is stable across allele-frequency strata, while AlphaMissense shows strong dependence on MAF (consistent with circularity risks when population frequency informs clinical labels). After excluding variants overlapping AlphaMissense training (gnomAD v2 MAF > 1e-5), all VESM sizes outperform AlphaMissense on AUC and multiple calibrated metrics.
13. Modular use of structure: rather than retraining a joint model, they fine-tune the sequence component of ESM3 using VESM-style sequence-based loss to create VESM3, and combine VESM3 with VESM-3B into a structure-aware ensemble (VESM++). This improves performance on structure-dependent DMS assays (binding/stability/expression) while maintaining strong fitness/activity performance.
14. Cross-domain generalization: despite co-distillation being trained on human proteins, gains transfer strongly to nonhuman DMS assays, with disproportionately large improvements reported for viral proteins, even though ESM3’s released training data excluded viral sequences.
15. Beyond binary pathogenicity: using UK Biobank/Genebass summary statistics for 332 gene–phenotype pairs (blood biochemistry biomarkers), variant-level VESM scores correlate with single-variant effect sizes (β). VESM++ and VESM-3B yield the strongest gene–trait association signals across tested models.
16. Notably, VESM-3B recovers the correct pLoF direction of effect in 98.8% of significant gene–phenotype pairs and identifies many associations not detected by missense burden tests, suggesting utility for quantitative trait interpretation from summary statistics.
📜Paper: doi.org/10.1038/s41592…
#ProteinLanguageModels #VariantEffectPrediction #ComputationalBiology #HumanGenetics #ESM #ClinVar #ProteinGym #DeepMutationalScanning #UKBiobank #MachineLearning
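Point 6 of the thread above (element-wise minimum LLR matrix as teacher, variant-level MSE as the training signal) can be sketched in a few lines. This is an assumption-laden illustration: the random matrices stand in for real model outputs, the helper names `min_llr_teacher` and `distillation_loss` are hypothetical, and no actual fine-tuning step is shown.

```python
import numpy as np

def min_llr_teacher(llr_matrices):
    """Element-wise minimum over per-model LLR matrices of shape (length, 20)."""
    return np.minimum.reduce(llr_matrices)

def distillation_loss(student_llr, teacher_llr):
    """Variant-level mean squared error between a student's LLRs and the teacher's."""
    return float(np.mean((student_llr - teacher_llr) ** 2))

# Stand-in scores for one protein of 5 residues from 3 hypothetical models.
rng = np.random.default_rng(0)
length, n_amino_acids = 5, 20
models = [rng.normal(-2.0, 1.0, size=(length, n_amino_acids)) for _ in range(3)]

teacher = min_llr_teacher(models)
losses = [distillation_loss(m, teacher) for m in models]
print(losses)
```

In a real setup each loss would be backpropagated through its own model, so every model is pulled toward the most confident score that any member of the family produced for each variant.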

Carlos Acevedo-Rocha reposted

Nainsi Dwivedi@NainsiDwiv50980·2d

Stop telling Claude: "build this"
Stop telling Claude: "write code"
Stop telling Claude: "fix this bug"
You're using a staff-level AI like a junior intern.
Claude performs best when you give:
• role
• constraints
• architecture expectations
• output format
• real-world context
Here are 10 production-grade Claude prompts you can copy-paste:

Carlos Acevedo-Rocha reposted

Jorge Bravo Abad@bravo_abad·1d

Tensor abstraction meets biomolecular electrostatics

Every time a drug binds to a protein or an enzyme catalyzes a reaction, electrostatic forces are at play. The Poisson-Boltzmann (PB) equation is the workhorse for computing these forces in ionic environments, essential for drug discovery, binding affinity prediction, and molecular simulation. Decades of dedicated PB solver development have produced robust, well-validated tools like AMBER PBSA, DelPhi, and APBS. The challenge is not that these solvers lack accuracy; it's that their independently developed codebases make it difficult to port them systematically to modern GPUs, benchmark them under unified conditions, or adapt them to heterogeneous HPC architectures without significant engineering effort.

Yongxian Wu and coauthors introduce AmberTorchPB, a unified PB solver framework built on LibTorch, PyTorch's C++ backend. The core idea: instead of separate implementations per hardware platform and numerical precision, use tensor abstraction to write once and deploy everywhere. The framework integrates five iterative solvers (CG, BiCG, GMRES, SOR, RB-SOR), three preconditioners (block Jacobi, incomplete Cholesky, AMG), and a matrix-free stencil layout that avoids materializing the full sparse matrix.

Tested on 570 proteins and 353 nucleic acids from the AMBER PBSA benchmark, AmberTorchPB matches reference energies with R² = 1.00 across all solvers, achieves more than 2× speedup on both CPU and GPU relative to AMBER PBSA, and cuts GPU memory usage by over 30% for large systems. Even bfloat16 precision converges for the CG solver, opening avenues for memory-efficient large-scale runs.

This directly lowers the barrier to high-fidelity electrostatics at scale. Binding affinity estimation and lead optimization workflows that currently stall on large macromolecular assemblies can now run across heterogeneous HPC hardware without platform-specific reimplementation, reducing both computational cost and engineering overhead in molecular simulation pipelines. Paper: Wu et al., J. Chem. Theory Comput. (2026) | CC BY-NC-ND 4.0 | pubs.acs.org/doi/full/10.10…
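The "matrix-free stencil" idea mentioned above can be shown with a much simpler problem than the full PB equation. The sketch below solves the plain Poisson equation, -∇²φ = ρ, on a small grid with zero boundary values, using conjugate gradient and a 7-point stencil applied directly to the array. Everything here is an assumption for illustration: it uses NumPy rather than LibTorch, it ignores dielectric boundaries and ionic terms, and the function names are hypothetical, not AmberTorchPB's API.

```python
import numpy as np

def apply_neg_laplacian(u, h=1.0):
    """Apply the 7-point stencil for -∇² with zero Dirichlet boundaries.

    The operator is evaluated by shifting the padded array, so the sparse
    matrix is never built or stored.
    """
    p = np.pad(u, 1)  # zero padding supplies the boundary values
    return (6.0 * u
            - p[:-2, 1:-1, 1:-1] - p[2:, 1:-1, 1:-1]
            - p[1:-1, :-2, 1:-1] - p[1:-1, 2:, 1:-1]
            - p[1:-1, 1:-1, :-2] - p[1:-1, 1:-1, 2:]) / h**2

def conjugate_gradient(apply_A, b, tol=1e-8, max_iter=500):
    """Unpreconditioned CG for a symmetric positive-definite operator."""
    x = np.zeros_like(b)
    r = b - apply_A(x)
    p = r.copy()
    rs = float(np.vdot(r, r))
    for _ in range(max_iter):
        if np.sqrt(rs) < tol:
            break
        Ap = apply_A(p)
        alpha = rs / float(np.vdot(p, Ap))
        x += alpha * p
        r -= alpha * Ap
        rs_new = float(np.vdot(r, r))
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

# A single unit point charge placed inside a 12 x 12 x 12 grid.
rho = np.zeros((12, 12, 12))
rho[6, 6, 6] = 1.0
phi = conjugate_gradient(apply_neg_laplacian, rho)
residual = np.linalg.norm(apply_neg_laplacian(phi) - rho)
print("residual norm:", residual)
```

Because the solver only needs a function that applies the operator, the same loop could in principle run on any array backend that supports slicing and reductions, which is the portability argument the post makes for a tensor-based design.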

Carlos Acevedo-Rocha reposted

Vasilis Ntranos@vntranos·1d

Excited to share that our latest work building on ESM is now published in @NatureMethods: A single, sequence-only protein language model achieves state-of-the-art variant effect prediction, surpassing hybrid approaches that use MSA, 3D structure, or population genetics data. nature.com/articles/s4159…

Carlos Acevedo-Rocha reposted

Biology+AI Daily@BiologyAIDaily·1d

Small-molecule binding and sensing with a designed protein family @NatureComms
🚀 New paper from David Baker!🚀
1 Researchers have developed a powerful new computational strategy that combines deep learning with physics-based methods to design a family of proteins capable of binding diverse small molecules with high affinity and atomic-level precision.
2 The core innovation lies in the diversification of the NTF2-like fold using structure generation algorithms to create over 10,000 unique scaffolds with varied internal pocket geometries tailored for ligand docking.
3 This approach successfully produced functional binders for six chemically distinct targets, including hormones like cortisol and drugs such as the anticoagulant apixaban, with binding affinities reaching the nanomolar range.
4 The design process utilized LigandMPNN, a deep learning model trained specifically on protein-ligand complexes, which proved highly effective at generating accurate protein-ligand interactions that were later confirmed by crystal structures.
5 Beyond simple binding, the study demonstrates a modular platform for sensor development by creating a cortisol-induced heterodimerization system that functions as a bioluminescent biosensor at physiologically relevant concentrations.
6 High-resolution structural analysis revealed that the designed proteins match their computational models with sub-angstrom accuracy, validating the reliability of these new deep learning-integrated design pipelines.
7 This work establishes a versatile foundation for designing custom sensors and binders for a wide array of applications across environmental monitoring, diagnostics, and therapeutics.
📜Paper: nature.com/articles/s4146…
#ProteinDesign #DeepLearning #Biosensors #ComputationalBiology #Biochemistry #NatureCommunications

Carlos Acevedo-Rocha reposted

Pranam Chatterjee@pranamanam·2d

Now, THIS is something I am VERY excited about. 🤩 The farther out we can predict clinically, the better we can guide strong molecular generators to design therapeutically-ready molecules! 💊 Super proud of @kalyanmpalepu (one of my first students!) for building this!! ☺️ My vision has always been a drug development paradigm where a model (like Warpseed) could guide and/or tilt a multi-objective discrete generator (i.e. discrete diffusion/flow matching) to enforce clinical success when generating peptides or small molecules, alongside other ADMET and developability properties. 🧪 Best of luck to the team and excited to see where this goes! 🤗

Rohil Badkundri@rohilbadkundri

We used AI to predict the failure of a Phase 3 trial before the results were announced. Today, we're publishing 10 more predictions for the future. Thread 🧵

Carlos Acevedo-Rocha reposted

Jorge Bravo Abad@bravo_abad·2d

Evolving enzymes virtually: 7× activity gain guided by deep learning and generative AI

Enzymes are not merely binders. Their catalytic efficiency depends on subtle intramolecular geometries that shift with each amino acid substitution, and predicting those shifts has so far defeated even the most powerful protein language models. CYP2C9, the cytochrome P450 enzyme that metabolizes roughly a quarter of all prescribed drugs, is a particularly demanding case: a single variant can halve warfarin clearance or accelerate it to toxic levels. The clinical stakes are high, and the variant space is vast.

Chang Li and coauthors address this with VERnet, a 2D convolutional neural network built around a key architectural choice: rather than predicting generic pathogenicity, it learns to predict specific enzyme activity. The model takes AlphaFold2-predicted structures, converts them into amino acid networks (AANs), weighted graphs encoding interatomic contacts, hydrogen bonds, and overlap interactions, and trains on deep mutational scanning data from 6,142 CYP2C9 variants. Self-distillation removes ambiguous training examples; EasyEnsemble handles severe class imbalance. VERnet reaches 93.5% accuracy on 276 held-out variants, with ROC-AUC of 0.971, outperforming AlphaMissense and ESM-1b.

The more striking result comes from pairing VERnet with a variational autoencoder: by imputing a complete activity landscape, including mutation cold spots absent from natural variation, the pipeline identifies six sites invisible to prior structural studies. Virtual saturation mutagenesis at those sites yields N218A, confirmed in vitro with metabolic activity seven times higher than wild-type, unprecedented among all previously characterized CYP2C9 alleles. A MaxEnt statistical framework further validates that these computationally designed sequences remain within evolutionarily plausible space.

This represents a concrete shift: function-targeted, structure-aware models can map enzyme activity landscapes computationally before wet-lab synthesis begins, enabling rational prioritization of gain- or loss-of-function variants. The same framework (train on activity data, not disease labels) transfers directly to any enzyme where deep mutational scanning data exist. Paper: Li et al., ACS Catalysis (2026) | CC BY-NC-ND 4.0 | pubs.acs.org/doi/10.1021/ac…
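"Virtual saturation mutagenesis" simply means enumerating every single-residue substitution at the chosen sites and ranking them with a predictor. The sketch below shows that bookkeeping only. The toy sequence, the function names, and the placeholder scoring function are all assumptions; a real pipeline would plug in a trained activity model such as the one described in the post.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def saturation_variants(sequence, positions):
    """List all single substitutions at 1-based positions, named like 'N218A'."""
    variants = []
    for pos in positions:
        wild_type = sequence[pos - 1]
        for aa in AMINO_ACIDS:
            if aa != wild_type:
                variants.append(f"{wild_type}{pos}{aa}")
    return variants

def rank_variants(variants, score_fn):
    """Sort variants from highest to lowest predicted score."""
    return sorted(variants, key=score_fn, reverse=True)

# Toy sequence and a placeholder scorer (hypothetical: it just favors alanine).
toy_sequence = "MNKAL"
variants = saturation_variants(toy_sequence, positions=[2])
ranked = rank_variants(variants, score_fn=lambda v: 1.0 if v.endswith("A") else 0.0)
print(len(variants), ranked[0])
```

Each site contributes 19 candidates, so six sites give only 114 single mutants to score, which is why this step is cheap compared with synthesizing variants in the lab.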

Carlos Acevedo-Rocha reposted

Biology+AI Daily@BiologyAIDaily·2d

Structure-informed direct coupling analysis improves protein mutational landscape predictions
1 Direct Coupling Analysis (DCA) often brings only small gains over independent-site models for mutation-effect prediction, likely because fully connected Potts models are noisy and heavily parameterized relative to typical MSA depth. This work flips the usual DCA workflow: instead of using coevolution to predict contacts, it uses known 3D contacts to constrain the coevolution model.
2 The paper introduces StructureDCA: a sparse Potts/DCA model where pairwise couplings Jij are kept only for residue pairs that are in spatial contact in a provided structure (contact map defined by a distance cutoff). Parameters are optimized directly in this restricted space (not “fit full DCA then zero-out”), aligning inference with the intended sparse model.
3 A second variant, StructureDCA[RSA], reweights contributions using per-residue relative solvent accessibility (RSA): buried residues get higher weight, and pairwise weights are averaged from the two residues’ burial. This aims to better reflect stability physics (core interactions matter more than surface ones).
4 On MegaScale stability data, performance peaks at intermediate sparsity: short-range physical contacts add signal, while adding many long-range/non-contact couplings gradually hurts (interpreted as noise). Reported average Spearman correlation improves from ~0.48 (independent-site or full DCA) to >0.54 with StructureDCA, and up to ~0.60 with StructureDCA[RSA].
5 The contact-based sparsity criterion beats other sparsification strategies at matched coupling counts: keeping strongest couplings by Frobenius norm, keeping only sequence-local couplings, or random subsampling all help somewhat, but remain clearly below distance/contact-guided selection. The main message: “which couplings” matter more than “how many”.
6 Sparsity also changes scaling: the number of parameters grows roughly linearly with protein length (vs quadratically for fully connected DCA). The authors report orders-of-magnitude speedups in inference and reduced memory, enabling much larger-scale analyses while keeping residue-level interpretability.
7 Across ProteinGym, StructureDCA and StructureDCA[RSA] rank among top unsupervised methods, trailing only a small set of very large protein language models (pLMs) that incorporate additional information and hundreds of millions/billions of parameters. RSA helps most on stability assays, modestly on expression/activity, and can slightly reduce performance on fitness, consistent with fitness involving solvent-exposed functional sites.
8 For stability prediction, StructureDCA[RSA] matches or outperforms several supervised structure-based ΔΔG predictors on MegaScale, and shows a stronger lead on HumanDomains. The paper notes an interesting benchmark contrast: supervised ΔΔG predictors do well on MegaScale but not on HumanDomains, suggesting HumanDomains’ assay may mix stability with fitness-like effects.
9 The method is particularly strong on epistasis and higher-order mutants: on ProteinGym datasets with 5+ simultaneous mutations, StructureDCA[RSA] is reported as top among compared methods. A detailed case study on two homologous metallo-β-lactamases (NDM1 vs VIM2) shows improved reproduction of background-dependent landscapes and mutational tolerance differences, with large computational savings versus Boltzmann-machine DCA.
10 Structural context matters for PPIs: for a toxin–antitoxin system (ParD–ParE), using a concatenated inter-protein MSA plus the complex structure dramatically boosts performance, and using only inter-chain couplings helps further. For SARS-CoV-2 RBD–ACE2 binding, using the bound complex structure (vs monomeric) improves correlations and yields best-in-benchmark performance for that entry when combined with RSA.
💻Code: github.com/3BioCompBio/St…
📜Paper: biorxiv.org/content/10.648…
#computationalbiology #bioinformatics #proteins #DCA #coevolution #epistasis #mutationalscanning #proteinengineering #proteindesign #structuralbiology
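The central idea in points 2 and 6 (keep couplings only for residue pairs in contact) can be sketched as follows. This is a toy under stated assumptions: the parameters are random rather than fitted to an MSA, the energy sign convention is one common choice rather than necessarily the paper's, the 8 Å cutoff is arbitrary, and `contact_pairs` and `potts_energy` are hypothetical names, not the StructureDCA package's API.

```python
import numpy as np

def contact_pairs(coords, cutoff=8.0):
    """Residue pairs (i < j) whose coordinates lie closer than the cutoff."""
    d = np.linalg.norm(coords[:, None] - coords[None, :], axis=-1)
    i, j = np.where(np.triu(d < cutoff, k=1))
    return list(zip(i.tolist(), j.tolist()))

def potts_energy(seq, h, J, pairs):
    """E(seq) = -sum_i h_i(a_i) - sum over contact pairs of J_ij(a_i, a_j)."""
    energy = -sum(h[i, a] for i, a in enumerate(seq))
    energy -= sum(J[(i, j)][seq[i], seq[j]] for i, j in pairs)
    return float(energy)

# Four residues on a line, 5 Å apart: only sequence neighbors are in contact.
coords = np.array([[0.0, 0, 0], [5.0, 0, 0], [10.0, 0, 0], [15.0, 0, 0]])
pairs = contact_pairs(coords, cutoff=8.0)

rng = np.random.default_rng(0)
q = 20  # alphabet size
h = rng.normal(size=(len(coords), q))              # site fields (random stand-ins)
J = {pair: rng.normal(size=(q, q)) for pair in pairs}  # couplings only for contacts

wild_type = [0, 5, 7, 19]
mutant = [0, 5, 3, 19]  # substitution at the third residue
delta_e = potts_energy(mutant, h, J, pairs) - potts_energy(wild_type, h, J, pairs)
print(pairs, round(delta_e, 3))
```

Here only 3 of the 6 possible pairs carry a coupling matrix. For a real protein the number of contacts per residue is roughly constant, which is where the near-linear parameter growth mentioned in point 6 comes from.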

Carlos Acevedo-Rocha reposted

Biology+AI Daily@BiologyAIDaily·2d

AlphaFold Database expands to proteome-scale quaternary structures
1 AFDB is extended beyond monomers by adding 1,754,242 high-confidence predicted protein complexes (primarily homodimers), enabling proteome-scale access to quaternary structure models with standardized formats, metrics, and metadata for search/visualization/bulk use.
2 The study runs a large-scale prediction campaign covering ~31M complexes from 4,777 proteomes: 23,441,822 homodimers derived from UniProt proteomes, plus 7,620,644 heterodimer candidates derived from STRING “physical interaction” links for 16 model organisms and 30 WHO-prioritized global health proteomes.
3 A key methodological contribution is confidence calibration for complex models using interface-focused signals. The authors evaluate multiple metrics and settle on a practical high-confidence rule: ipSAEmin ≥ 0.6, pLDDTavg ≥ 70, and backbone clashes ≤ 10, benchmarked against post-2021 PDB homodimers (positives) and PDB monomers (negatives).
4 Among interface metrics tested (ipTM, ipSAEmin, LISmin, pDockQ2), ipSAEmin provides the clearest separation between true homodimers and monomers and shows a stable F1 plateau up to the chosen cutoff (precision ~0.859, recall ~0.655, F1 ~0.744), motivating its use as the primary AFDB-facing filter.
5 The resulting high-confidence set retains ~7% of predicted homodimers (~1.8M out of ~23M), and AFDB further labels entries by ipSAEmin into “very high-confidence” (≥0.8), “confident” (0.7–<0.8), and “low-confidence” (0.6–<0.7) to help non-expert users interpret interfaces.
6 Compared to experimentally determined multimers in the PDB, the high-confidence predicted complexes increase structural coverage by 1–3 orders of magnitude for most organisms, with especially large gaps being bridged in Metazoa and Viridiplantae (exceptions are heavily studied species like human, E. coli, and yeast).
7 Dataset consistency is assessed at scale: after clustering sequences at 98% identity/95% coverage and aligning complexes within clusters, 95.9% of aligned complexes show complex TM-score > 0.8; additionally, chain A vs chain B within homodimers yields 98.81% with TM-score > 0.8, supporting internal coherence of the predictions.
8 Prediction yield varies across taxonomy: archaea and bacteria show >3× higher high-confidence homodimer rates than eukaryotes, consistent with shorter/more compact prokaryotic proteins and a higher prevalence of homo-oligomeric assemblies, while eukaryotic proteins are often longer, multi-domain, and more disordered and may preferentially form heteromers.
9 For heterodimers, applying the same thresholds produces 56,956 “tentatively high-confidence” models from 7,620,644 STRING-derived candidates. High-confidence likelihood correlates with STRING score, but also with homodimer-like properties (higher inter-chain sequence identity and smaller chain-length differences), motivating further calibration tailored to heteromer biology.
10 Structural clustering of 1,811,201 predicted complexes (high-confidence homodimers + tentative heterodimers) compresses the space ~8-fold into 224,862 clusters; complex topology is highly recurrent (top 1% of non-singleton representatives cover ~25% of entries; top 20% cover ~82%), and ~9% of non-singleton clusters span multiple superkingdoms, suggesting conserved complex building blocks across deep evolution.
11 Case studies highlight “emergent” structures only visible in oligomeric context: a domain-swapped fold in Dictyostelium Q55DI5 becomes high-confidence only as a dimer; a fungal membrane protein (Atg33) gains a more coherent assembly and clearer membrane boundaries as a dimer; multimer prediction can also refine inter-domain architecture even when monomer predictions are already confident, and can partially rescue low-confidence monomers (e.g., an HTH regulator consistent with dimeric function).
12 Engineering/scaling details matter for feasibility at this size: MMseqs2-GPU MSAs (best hit per taxon as an orthology-like filter), AlphaFold-Multimer inference via accelerated ColabFold/OpenFold (TensorRT + cuEquivariance), batching strategies to reduce recompilations, and a post-processing pipeline producing AFDB-compliant mmCIF/BCIF plus computed interface metrics and clash scores.
📜Paper: biorxiv.org/content/10.648…
#AlphaFold #ProteinComplexes #StructuralBioinformatics #Interactome #ComputationalBiology #Bioinformatics #ProteinStructure #AIforScience #OpenScience
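The filtering rule in points 3 to 5 is simple enough to write down directly. The sketch below encodes the thresholds exactly as quoted in the thread; the function names, argument names, and the label used for rejected models are assumptions for illustration, not AFDB's API or schema.

```python
def classify_complex(ipsae_min, plddt_avg, backbone_clashes):
    """Apply the quoted rule: ipSAEmin >= 0.6, pLDDTavg >= 70, clashes <= 10."""
    if ipsae_min < 0.6 or plddt_avg < 70 or backbone_clashes > 10:
        return "not high-confidence"  # hypothetical label for filtered-out models
    if ipsae_min >= 0.8:
        return "very high-confidence"
    if ipsae_min >= 0.7:
        return "confident"
    return "low-confidence"

def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

examples = [
    classify_complex(0.85, 90.0, 0),
    classify_complex(0.75, 80.0, 3),
    classify_complex(0.65, 72.0, 10),
    classify_complex(0.90, 60.0, 0),
]
print(examples)
print(round(f1_score(0.859, 0.655), 3))
```

Note that the "low-confidence" tier still passes the overall high-confidence filter; it is the lowest of the three retained bands. The F1 computed from the rounded precision and recall comes out near 0.743, in line with the ~0.744 quoted in point 4.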

Carlos Acevedo-Rocha reposted

Peter Ottsjö@peterottsjo·2d

A lot happened in the AI × bio world these past few days. Here's what you might have missed.
☑️ An AI agent from an ex-AlphaFold 2 developer is said to be “the world's first autonomous agent for drug design, lab-validated end to end”. Oh, and it engaged in some reward hacking shenanigans.
☑️ Eli Lilly just signed a massive $2.75 billion deal for AI-designed drugs.
☑️ Meta's new brain model could enable “in-silico neuroscience”.
☑️ Anthropic seems to be building something called Operon for Claude.
Let’s take a closer look.🧵

Carlos Acevedo-Rocha reposted

Nav Toor@heynavtoor·5d

🚨BREAKING: Every book you have ever read. Every novel that has ever been published. It is sitting inside ChatGPT right now. Word for word. Up to 90% of it. And OpenAI told a judge that was impossible. Researchers at Stony Brook University and Columbia Law School just proved it. They fine tuned GPT-4o, Gemini 2.5 Pro, and DeepSeek V3.1 on a simple task: expand a plot summary into full text. A normal use case. The kind of thing a writing assistant is built for. No hacking. No jailbreaking. No tricks. The models started reciting copyrighted books from memory. Not paraphrasing. Not summarizing. Entire pages reproduced verbatim. Single unbroken spans exceeding 460 words. Up to 85 to 90% of entire copyrighted novels. Word for word. Then it got worse. The researchers fine tuned the models on the works of only one author. Haruki Murakami. Just his novels. Nothing else. It unlocked verbatim recall of books from over 30 completely unrelated authors. One author's books opened the vault to everyone else's. The memorization was already inside the model the whole time. The fine tuning just removed the lock. Your book might be in there right now. You would never know it unless someone looked. Every safety measure the companies rely on failed. RLHF failed. System prompts failed. Output filters failed. The exact protections these companies cite in courtroom defenses did not stop a single page from being extracted. Then the researchers compared the three models. GPT-4o. Gemini. DeepSeek. Three different companies. Three different countries. They all memorized the same books in the same regions. The correlation was 0.90 or higher. That means they all trained on the same stolen data. The paper names the sources directly: LibGen and Books3. Over 190,000 copyrighted books obtained from pirated websites. Right now, authors and publishers have dozens of active lawsuits against OpenAI, Anthropic, Google, and Meta. These companies have argued in court that their models learn patterns. 
Not copies. That no book is stored inside the weights. This paper says that is a lie. The books are still inside. And researchers just pulled them out.

Carlos Acevedo-Rocha reposted

Oscar Arias@OACerebro·6d

I read it. And it sounds nice. Clean. Aseptic. Like a laboratory that doesn't smell of vinegar. A machine that has ideas, writes code, runs experiments, makes figures, drafts the paper… and on top of that reviews itself. A little silicon god playing scientist while we keep fighting with drunk reviewers and cold coffee. They say it is “the full cycle of science”. The wet dream of any editorial board. But science, the real kind, was never clean. Science is a guy at 3 a.m. doubting his own hypothesis. It is a stupid error in one line of code that ruins six months of your work. It is an obsession that won't let you have sex, sleep, or live. This… this is something else. A factory. Cheap ideas. Cheap papers. Maybe cheap truth. Because sure, it can generate hypotheses, review the literature, and spit out manuscripts faster than any exhausted human. But it doesn't bleed for them. There is no awkward silence in a discussion. There is no ego. There is no fear of being wrong. And without that… I don't know whether there is science or just production. The irony is that for years we dreamed of stripping science of its most human parts: error, bias, slowness. And now that we have managed it, we are starting to suspect that this was precisely where the value lay. Maybe this “AI Scientist” isn't coming to replace us. It is coming to expose us. To show that much of what we called research… was already automatable. What remains, the truly dangerous part, doesn't fit in a paper. And that part, for now, is still ours. nature.com/articles/s4158…

Carlos Acevedo-Rocha reposted

Guri Singh@heygurisingh·25 Mar

🚨BREAKING: Meta just built an AI that rewrites its own learning algorithm.
Not just getting better at tasks. Getting better at getting better.
It's called "Hyperagents" and the results are terrifying.
Here's what happened: They merged the task-solving AI and the self-improvement AI into one single editable program. The AI can now rewrite its own improvement procedure. Not metaphorically. Literally editing the code that controls how it evolves.
They tested it across 4 domains: coding, paper review, robotics, and Olympiad-level math grading.
→ In robotics: performance jumped from 0.060 to 0.372. The AI discovered that jumping was a better strategy than standing -- something no human programmed it to try.
→ In paper review: accuracy went from 0.0 to 0.710. It built multi-stage evaluation pipelines with checklists and decision rules on its own.
→ The wildest part: they transferred the "ability to improve" from robotics to math grading. Human-designed improvement agents scored 0.0 in the new domain. The Hyperagent scored 0.630.
But here's what should keep you up at night: Without anyone telling it to, the AI spontaneously developed:
- Persistent memory to store insights across generations
- Performance tracking to identify which changes actually worked
- Compute-aware planning to prioritize big changes early and small refinements late
It built its own R&D infrastructure from scratch.
The researchers call it "metacognitive self-modification." The rest of us should call it what it is: Recursive self-improvement is no longer theoretical.
Meta just open-sourced it on GitHub.


352

30.9K

Carlos Acevedo-Rocha retweeted

Nature Biotechnology@NatureBiotech·23 Mar

Voices of biotech leaders

Nature Biotechnology asks a selection of leaders from across biotech to look at the future of the sector and make some predictions for the coming years go.nature.com/4uR7pVa


145

20.4K

Carlos Acevedo-Rocha retweeted

Niko McCarty.@NikoMcCarty·24 Mar

Announcing the winners for the "Fast Biology Bounties."

I ended up giving away ~$15,000 for 20 projects after reading 430 submissions from 335 individuals. Many winners were "highly generative," meaning they sent me 3-5 excellent ideas and were glad to have them shared freely and openly.

There were some major failure modes, too. Some ideas surfaced repeatedly, but I didn't do a good job of connecting "like-minded" people. I'll fix this next time. Also, I managed everything manually using my personal email. This was tedious, and I'm working on building a platform that will automate a lot of this. I'd like to send feedback and scores for every submission in future contests.

Many more details in my blog post, which breaks down all the numbers, what I learned, and highlights some of the winners.

Some people who I gave money to:
- Sebastian Cocioba for a laser-based PCR thermocycler, in which infrared heating replaces aluminum blocks.
- Bryan Duoto for writing and publishing a colony-to-sequence cloning workflow that uses magnetic beads and Nanopore sequencers. Scientists can verify clones in 1–3 hours instead of waiting overnight.
- Jeff Nivala for an idea to synthesize proteins directly from DNA, without relying on any RNA intermediates.
- Sierra Bedwell for a clever automation system that uses off-the-shelf parts to screen thousands of environmental DNA samples in parallel.
- Xavier Bower for "IceCreamClone," an interactive cloning strategy ranker that looks at a scientist’s available “parts,” or sequences, and then determines whether they ought to use Gibson, Golden Gate, restriction digest, or another strategy to assemble them together. The software also catches likely cloning errors and estimates the cost and time required for each option.
- Andres Arango for multiple ideas, including using antifreeze to accelerate DNA ligation by 2-3 orders of magnitude, and an idea for computationally designed protein cradles for expressing membrane proteins in E. coli.


283

43.8K

Carlos Acevedo-Rocha retweeted

Biology+AI Daily@BiologyAIDaily·21 Mar

BioReason-Pro: Advancing Protein Function Prediction with Multimodal Biological Reasoning

@arcinstitute

1. BioReason-Pro introduces the first multimodal reasoning large language model specifically designed for protein function prediction, combining protein embeddings with biological context to generate interpretable reasoning traces rather than just classification labels.
2. The system integrates ESM3 protein embeddings, a GO graph encoder, and biological context including organism, domains, protein-protein interactions, and GO-GPT predictions to perform step-by-step biological reasoning from sequence to function.
3. GO-GPT, a key component, is the first autoregressive transformer for Gene Ontology prediction that captures hierarchical and cross-aspect dependencies between GO terms, achieving state-of-the-art Fwmax of 0.65-0.70 across inference strategies.
4. The model was trained on over 130,000 synthetic reasoning traces generated by GPT-5 and further optimized through reinforcement learning with Group Sequence Policy Optimization, achieving 73.6% Fmax on GO term prediction.
5. Human protein experts preferred BioReason-Pro annotations over ground truth UniProt annotations in 79% of evaluated cases, with an LLM judge score of 8/10 for functional summaries, substantially outperforming previous methods.
6. Remarkably, BioReason-Pro de novo predicted experimentally confirmed binding partners with per-residue attention localizing to exact contact residues resolved in cryo-EM structures, demonstrating genuine structural reasoning capabilities.
7. The model successfully performed structural reasoning that overrode misleading superfamily-level domain annotations, such as correctly identifying CFAP61 as a non-enzymatic scaffold despite its Rossmann-like fold that typically indicates catalytic activity.
8. For eEFSec, BioReason-Pro identified SECIS-binding protein 2 as the obligate functional partner from sequence alone, with attention concentrated on the RIFT domain surface that matches the experimentally resolved SECIS RNA binding interface in PDB 7ZJW.
9. The system maintains strong performance even for proteins with very low sequence similarity to training data, with performance degrading much more slowly than BLAST as sequence identity decreases, indicating learned generalizable reasoning rather than simple homology transfer.
10. All model weights, code, and curated datasets are released publicly, alongside precomputed predictions for over 240,000 proteins including the Human Protein Atlas, enabling broad adoption for functional annotation of uncharacterized proteins.

💻Code: bioreason.net/code
📜Paper: biorxiv.org/content/10.648…

#BioReasonPro #ProteinFunction #ComputationalBiology #Bioinformatics #MachineLearning #LLM #GeneOntology #ProteinStructure #FunctionalAnnotation #AIforScience
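
The post reports Fmax without defining it. As a rough illustration only, the sketch below assumes the protein-centric Fmax commonly used in GO-prediction benchmarks: precision is averaged over proteins with at least one prediction at a threshold, recall over all proteins, and the best F-score across thresholds is kept. The function name `fmax`, the input layout, and the toy GO identifiers are hypothetical, and this is not necessarily the exact evaluation protocol used for BioReason-Pro.

```python
def fmax(scores, truth, thresholds=None):
    """Protein-centric Fmax (assumed definition, see lead-in).

    scores: {protein: {go_term: score in [0, 1]}}
    truth:  {protein: set of true GO terms}; each set is assumed non-empty.
    """
    if thresholds is None:
        thresholds = [i / 100 for i in range(1, 101)]
    best = 0.0
    for t in thresholds:
        precisions, recalls = [], []
        for protein, true_terms in truth.items():
            predicted = {term for term, s in scores.get(protein, {}).items() if s >= t}
            hits = len(predicted & true_terms)
            # Precision only counts proteins that have a prediction at this threshold.
            if predicted:
                precisions.append(hits / len(predicted))
            # Recall is averaged over every protein in the ground truth.
            recalls.append(hits / len(true_terms))
        if not precisions:
            continue
        p = sum(precisions) / len(precisions)
        r = sum(recalls) / len(recalls)
        if p + r > 0:
            best = max(best, 2 * p * r / (p + r))
    return best


# Toy data with made-up identifiers, for illustration only.
toy_scores = {"P1": {"GO:a": 0.9, "GO:b": 0.4}, "P2": {"GO:c": 0.8}}
toy_truth = {"P1": {"GO:a"}, "P2": {"GO:c"}}
toy_fmax = fmax(toy_scores, toy_truth)
```

In the toy data, any threshold between 0.41 and 0.80 drops the one wrong term while keeping both correct ones, so the maximum F-score is reached there.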


5.4K

Carlos Acevedo-Rocha retweeted

Arc Institute@arcinstitute·20 Mar

Over 250 million protein sequences are known, but fewer than 0.1% have confirmed functions. Today, @genophoria, @BoWang87 & team introduce BioReason-Pro, a multimodal reasoning model that predicts protein function and explains its reasoning like an expert would.


125

527

61.1K

Carlos Acevedo-Rocha retweeted

Jorge Bravo Abad@bravo_abad·20 Mar

Variational synthesis: co-designing generative models and DNA chemistry to manufacture 10¹⁷ sequences for $1,000

Generative models for protein design are remarkably powerful. Trained on hundreds of millions of biological sequences, they learn evolutionary constraints and sample diverse, realistic candidates at will. The bottleneck is not generation; it is synthesis. Conventional oligosynthesis caps practical libraries at ~10⁵ individual sequences, leaving the overwhelming majority of model-designed candidates physically inaccessible.

Weinstein and coauthors address this by rethinking where the model ends and the laboratory begins. Their key idea is manufacturing-aware architecture: each parameter θ of the generative model maps directly to an experimentally controlled parameter in a DNA synthesis protocol (reagent concentrations, reaction timing, nucleotide mixture ratios). Model and synthesis protocol are co-designed from the start.

The result is variational synthesis, a framework where the inherent stochasticity of chemical reactions replaces the random number generator of conventional in silico sampling. Every DNA molecule produced in a synthesis reaction is an independent draw from the model distribution pθ*(x). At picomole-scale yields, that translates to ~10¹⁶ independent samples per run.

The authors train and validate variational synthesis models on three targets: 325 million human antibody CDRH3 sequences, HLA-A*02:01-binding T cell epitopes, and Taq DNA polymerase completions generated by ProGen2. Across all three, in vitro sample quality, assessed via nonparametric two-sample tests (BEAR and MMD), is comparable to state-of-the-art protein language models sampled in silico only. The antibody library of 9×10¹⁶ sequences cost ~$1,000; conventional synthesis of the same library would cost ~$10¹⁵.

For drug discovery, vaccine development, and enzyme engineering pipelines, this shifts the scale of experimental screening by twelve orders of magnitude, turning generative models from design tools into physical reagent generators, and making petascale functional screening a concrete near-term possibility.

Paper: Weinstein et al., Nature Biotechnology (2026) | nature.com/articles/s4158…
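
To make the "each molecule is an independent draw" idea concrete, here is a deliberately simplified sketch. It assumes the model is nothing more than a table of per-position nucleotide mixture ratios, with positions sampled independently; the post does not describe the paper's actual architecture, so `sample_library`, `theta`, and the three-position example are hypothetical stand-ins. The last line checks the cost gap quoted in the post (~$10¹⁵ versus ~$1,000).

```python
import math
import random

NUCLEOTIDES = "ACGT"


def sample_library(mix_ratios, n_molecules, seed=0):
    """Simulate a synthesis run: every molecule is an independent draw,
    and each position is sampled from that position's (A, C, G, T) ratios.

    This independent-per-position model is an assumption for illustration,
    not the architecture from the paper.
    """
    rng = random.Random(seed)
    return [
        "".join(rng.choices(NUCLEOTIDES, weights=w, k=1)[0] for w in mix_ratios)
        for _ in range(n_molecules)
    ]


# Hypothetical 3-position protocol: rows are (A, C, G, T) mixture ratios.
theta = [
    (0.7, 0.1, 0.1, 0.1),       # mostly A
    (0.25, 0.25, 0.25, 0.25),   # fully degenerate position
    (0.0, 0.0, 1.0, 0.0),       # fixed G
]
library = sample_library(theta, n_molecules=5)

# Cost figures quoted in the post: ~$10^15 conventional vs ~$1,000.
orders_of_magnitude = round(math.log10(1e15 / 1e3))
```

In the chemistry described in the post, the role of `rng` is played by the stochasticity of the reaction itself, which is why the number of samples scales with the number of molecules produced rather than with compute.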


4.1K
