sayan ghosal

24 posts

@SayanGhosal94

Research Scientist (AI/ML) at Chan Zuckerberg Initiative | ML for Genomics | ML Scientist | PhD@Johns Hopkins University

Joined June 2021

155 Following · 92 Followers

sayan ghosal retweeted

JHU Computer Science@JHUCompSci·Nov 26

Learn how @mike_schatz & other Hopkins scientists corrected 1000s of sequencing errors in the human genome, discovered over 100 new genes that can create proteins, & helped develop the advanced tools and programs that make sequencing faster and more accurate. #ResearchSavesLives

Johns Hopkins University@JohnsHopkins

For more than half a century, Johns Hopkins has been a leader in converting federal support into tangible benefits for the American people. In the latest issue of Johns Hopkins Magazine, explore how Hopkins researchers, scientists, professors, and even students have used or are using federal support in practical, fruitful, meaningful, and often world-changing ways. hub.jhu.edu/magazine/2025/…

sayan ghosal retweeted

Paolo Casale@fpcasale·Nov 24

🚀 In @genomeresearch! BayesRVAT integrates pathogenicity predictions, e.g. from AI sequence models, to model gene disruption-response in rare variant association studies. Collab with @caina89 & @Tkaraletsos. @HelmholtzMunich @PioneerCampus @czbiohub 🔗doi.org/10.1101/gr.280…

sayan ghosal retweeted

Stephen Turner 🦋 @stephenturner.us@strnr·Nov 17

Complex de novo structural variants are an underestimated cause of rare disorders nature.com/articles/s4146…

sayan ghosal retweeted

Surag Nair@suragnair·Nov 10

Excited to share Nona: a unifying multimodal masking framework for functional genomics. Models for DNA have evolved along separate paths: sequence-to-function (AlphaGenome), language models (Evo2), and generative models (DDSM). Can these be unified under a single paradigm? 1/15

sayan ghosal@SayanGhosal94·Nov 8

@rrastogi02 @anshulkundaje @Avsecz Our guess is somatic mutations. Most cell lines have a significant number of somatic mutations, which we embed in the DNA before passing them to any of the models.
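
One way to picture "embedding" somatic mutations in the DNA is to substitute each variant's alternate allele into the reference sequence before it is passed to a model. The Python sketch below is only an illustration under stated assumptions: it handles single-nucleotide variants with 0-based positions, and the name `apply_snvs` and the `(position, ref, alt)` tuple layout are hypothetical, not VariantFormer's actual preprocessing.

```python
from typing import Iterable, Tuple

# Hypothetical variant layout: (0-based position, reference base, alternate base)
Variant = Tuple[int, str, str]


def apply_snvs(reference: str, variants: Iterable[Variant]) -> str:
    """Return a copy of `reference` with each SNV's alternate base substituted in."""
    seq = list(reference)
    for pos, ref, alt in variants:
        if len(ref) != 1 or len(alt) != 1:
            raise ValueError("this sketch handles single-nucleotide variants only")
        if not 0 <= pos < len(seq):
            raise ValueError(f"position {pos} is outside the sequence")
        if seq[pos] != ref:
            # Guard against coordinate or genome-build mismatches.
            raise ValueError(f"expected {ref!r} at position {pos}, found {seq[pos]!r}")
        seq[pos] = alt
    return "".join(seq)


# Toy example: two substitutions on an 8-base reference.
mutated = apply_snvs("ACGTACGT", [(1, "C", "T"), (6, "G", "A")])
print(mutated)  # ATGTACAT
```

Indels would shift downstream coordinates, so a real pipeline would need to track offsets or apply variants from right to left.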

Ruchir Rastogi@rrastogi02·Nov 8

@SayanGhosal94 @anshulkundaje @Avsecz Do you have a sense for why you see much better prediction on chr19 in ENCODE cell lines compared to (fine-tuned) baseline models? It's a pretty sizable gap, especially compared to performance differences on held-out individuals.

Anshul Kundaje@anshulkundaje·Nov 3

Looks impressive on a quick read. I like the design principles of the architecture: hierarchical from chromatin to expression rather than the classical multi-task everything, plus variant-aware. Within- and cross-gene performance looks good, and the evals are useful. Congrats @Tkaraletsos + team

bioRxiv Genomics@biorxiv_genomic

VariantFormer: A hierarchical transformer integrating DNA sequences with genetic variations and regulatory landscapes for ... biorxiv.org/content/10.110… #biorxiv_genomic

sayan ghosal@SayanGhosal94·Nov 8

@anshulkundaje @Avsecz Also, we realized that each model has its own way of splitting the data for training, so finding a truly held-out chromosome is complicated. Instead, we focused on cross-donor prediction of attributes such as gene expression and disease risk.

sayan ghosal@SayanGhosal94·Nov 8

@anshulkundaje @Avsecz All the baseline models are either trained or fine-tuned using the same strategy.

sayan ghosal@SayanGhosal94·Nov 8

@Avsecz @anshulkundaje We also followed similar strategies to fine-tune Enformer and Borzoi with MLP heads and added them as baselines. For the genotype models, we used a random forest (RF) for gene expression prediction, which can also be extended to elastic net (similar to PrediXcan).

Žiga Avsec@Avsecz·Nov 8

@SayanGhosal94 @anshulkundaje Such as biorxiv.org/content/10.110… biorxiv.org/content/10.110…

sayan ghosal@SayanGhosal94·Nov 8

@Avsecz @anshulkundaje Appreciate the suggestions! In this work we primarily focused on performance generalization across held-out/unseen individuals.

Žiga Avsec@Avsecz·Nov 8

@SayanGhosal94 @anshulkundaje Whether a variant alters a single track or multiple tracks has nothing to do with the result you are seeing. It's whether the model has seen individual-level data or not. As I said, to test actual generalisation you need to test on variants on a held-out chromosome in this case.
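
The difference between holding out individuals and holding out a chromosome can be made concrete with a toy split. The Python sketch below is illustrative only: the record layout, the helper names `split_by_key` and `loci`, and the donor and gene labels are all hypothetical and do not come from any of the models discussed. It shows that a donor-level split leaves every test locus visible during training (on other donors), while a chromosome-level split does not.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical record layout: one training example per (donor, gene).
Record = Dict[str, str]


def split_by_key(
    records: List[Record], key: str, held_out: Set[str]
) -> Tuple[List[Record], List[Record]]:
    """Split records into (train, test) by holding out values of one field."""
    train = [r for r in records if r[key] not in held_out]
    test = [r for r in records if r[key] in held_out]
    return train, test


def loci(records: List[Record]) -> Set[Tuple[str, str]]:
    """Genomic loci covered by a set of records, as (chromosome, gene) pairs."""
    return {(r["chrom"], r["gene"]) for r in records}


records = [
    {"donor": "D1", "chrom": "chr1", "gene": "G1"},
    {"donor": "D1", "chrom": "chr19", "gene": "G2"},
    {"donor": "D2", "chrom": "chr1", "gene": "G1"},
    {"donor": "D2", "chrom": "chr19", "gene": "G2"},
]

# Held-out donor: test loci were all seen in training, just in other people.
donor_train, donor_test = split_by_key(records, "donor", {"D2"})
print(loci(donor_train) & loci(donor_test))

# Held-out chromosome: no test locus appears in training at all.
chrom_train, chrom_test = split_by_key(records, "chrom", {"chr19"})
print(loci(chrom_train) & loci(chrom_test))
```

The first overlap contains both loci and the second is empty, which is why the two evaluations measure different kinds of generalisation.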

sayan ghosal@SayanGhosal94·Nov 8

@Avsecz @anshulkundaje In comparison, AlphaGenome works better for variants that alter multiple tracks, since it's trained on the reference genome with multiple genome tracks as output.

sayan ghosal@SayanGhosal94·Nov 8

@Avsecz @anshulkundaje Thanks @Avsecz for explaining the underlying assumption. Now I understand that this result is an outcome of the different training paradigms. VF is trained on individual-level data, hence it captures LD and LD-associated SNPs, aligning with the distribution of the eQTL slope.

sayan ghosal@SayanGhosal94·Nov 6

@anshulkundaje Finally, I think it's such an interesting result that we will try to reach out to the AlphaGenome team to get to the root of it.

sayan ghosal@SayanGhosal94·Nov 6

@anshulkundaje In Supplementary Table 1 we have provided all the variants, their summary stats, and the model-specific scores. Additionally, we have provided a notebook to reproduce the results: github.com/czi-ai/variant…

sayan ghosal@SayanGhosal94·Nov 6

What’s new
• Variant-aware encoders for heterozygous and homozygous SNPs and indels
• Hierarchical design combining regulatory landscapes with transcriptional regions in a conditional framework
• Tissue-specific conditioning to capture tissue effects

sayan ghosal@SayanGhosal94·Nov 6

Trained on 21,000 paired WGS + bulk RNA-seq samples, VariantFormer delivers SOTA performance in (a) gene-expression prediction across donors and ancestries, (b) disease stratification, (c) capturing the effect of somatic mutations, and (d) eQTL prediction for low-frequency variants.

sayan ghosal@SayanGhosal94·Nov 6

Today we’re releasing VariantFormer — a 1.2B-parameter DNA language model built to encode and interpret human genetic variation. Blog: biohub.org/blog/variantfo… Model: virtualcellmodels.cziscience.com/model/variantf… Preprint: biorxiv.org/content/10.110…

sayan ghosal retweeted

Theofanis Karaletsos@Tkaraletsos·Nov 6

1. VariantFormer: from human genomes to populations

VariantFormer strives to elucidate how human cells differ as a consequence of individual variation based on personalized genomes.

What is VariantFormer? VariantFormer is the first genomic foundation model that bridges sequence-based modeling across genomes with population-based modeling across personalized genomes. It models cross-tissue bulk gene expression from whole genomes, trained on approximately 2,300 human donors.

Work led by @SayanGhosal94 and the VariantFormer team

Preprint: biorxiv.org/content/10.110…

sayan ghosal retweeted

Veera Rajagopal @doctorveera·Jun 13

A striking example of human genetics-driven molecular discovery. A GWAS of Yersinia pestis infection in 1,000 B cell lines identified a missense variant (rs2282284) in FCRL3—a B cell membrane protein—strongly associated with protection against infection. It turns out that FCRL3 is the receptor that Y. pestis uses to enter human cells (reminiscent of the CCR5 discovery from GWAS of HIV infection). Functional studies showed that overexpressing FCRL3 increases invasion, while knocking it out reduces it. Notably, the protein clusters at sites of bacterial contact on B cells, creating a niche where the bacterium can multiply and evade immune clearance. When the authors looked up this variant in the Japan Biobank, they found it was also protective against chronic hepatitis C infection, suggesting a shared mechanism in B cell targeting. Together, the work highlights how molecular phenotypes can power meaningful genetic discoveries, even with modest sample sizes. A reminder that sometimes the most revealing GWAS don’t come from tens of thousands of people—but from a clever “pandemic in a plate.” Paper: Keneer et al. Human genetic variation reveals FCRL3 is a lymphocyte receptor for Yersinia pestis. Cell Genom 2025
