sayan ghosal

24 posts

@SayanGhosal94

Research Scientist (AI/ML) at Chan Zuckerberg Initiative | ML for Genomics | ML Scientist | PhD@Johns Hopkins University

Joined June 2021
155 Following, 92 Followers
sayan ghosal retweeted
Surag Nair @suragnair
Excited to share Nona: a unifying multimodal masking framework for functional genomics. Models for DNA have evolved along separate paths: sequence-to-function (AlphaGenome), language models (Evo2), and generative models (DDSM). Can these be unified under a single paradigm? 1/15
sayan ghosal @SayanGhosal94
@rrastogi02 @anshulkundaje @Avsecz Our guess is somatic mutations. Most cell lines carry a significant number of somatic mutations, which we embed in the DNA sequence before passing it to any of the models.
Ruchir Rastogi @rrastogi02
@SayanGhosal94 @anshulkundaje @Avsecz Do you have a sense for why you see much better prediction on chr19 in ENCODE cell lines compared to (fine-tuned) baseline models? It's a pretty sizable gap, especially compared to performance differences on held-out individuals.
Anshul Kundaje @anshulkundaje
Looks impressive on a quick read. I like the design principles of the architecture - hierarchical from chromatin to expression rather than the classical multi-task everything + variant-aware. Within & cross-gene performance looks good + useful evals. Congrats @Tkaraletsos + team
bioRxiv Genomics @biorxiv_genomic

VariantFormer: A hierarchical transformer integrating DNA sequences with genetic variations and regulatory landscapes for ... biorxiv.org/content/10.110… #biorxiv_genomic

sayan ghosal @SayanGhosal94
@anshulkundaje @Avsecz Also, we realized that each model splits its training data in its own way, so finding a truly held-out chromosome is complicated. Instead, we focused on cross-donor prediction of attributes like gene expression and disease risk.
sayan ghosal @SayanGhosal94
@Avsecz @anshulkundaje We also followed similar strategies to fine-tune Enformer and Borzoi with MLP heads and added them as baselines. For genotype-based models we used a random forest (RF) for gene expression prediction, which can also be extended to elastic net (similar to PrediXcan).
sayan ghosal @SayanGhosal94
@Avsecz @anshulkundaje Appreciate the suggestions! In this work we primarily focused on performance generalization across held-out/unseen individuals.
Žiga Avsec @Avsecz
@SayanGhosal94 @anshulkundaje Whether variants alter a single track or multiple tracks has nothing to do with the result you are seeing. It's whether the model has seen individual-level data or not. As I said, to test actual generalisation you need to test on variants on a held-out chromosome in this case.
sayan ghosal @SayanGhosal94
@Avsecz @anshulkundaje In comparison, AlphaGenome works better for variants that alter multiple tracks, since it's trained on the reference genome with multiple genome tracks as output.
sayan ghosal @SayanGhosal94
@Avsecz @anshulkundaje Thanks @Avsecz for explaining the underlying assumption. Now I understand that this result is an outcome of the different training paradigms. VF is trained on individual-level data, hence it captures LD and LD-associated SNPs, aligning with the distribution of the eQTL slopes.
sayan ghosal @SayanGhosal94
@anshulkundaje Finally, I think it's such an interesting result that we will try to reach out to the AlphaGenome team to get to the root of it.
sayan ghosal @SayanGhosal94
@anshulkundaje In Supplementary Table 1 we have provided all the variants, their summary stats, and the model specific scores. Additionally, we have provided a notebook to reproduce the results: github.com/czi-ai/variant…
sayan ghosal @SayanGhosal94
What’s new
• Variant-aware encoders for het/homozygous SNPs and indels
• Hierarchical design combining regulatory landscapes with transcriptional regions in a conditional framework
• Tissue-specific conditioning to capture tissue effects
sayan ghosal @SayanGhosal94
Trained on 21,000 paired WGS + bulk RNA-seq samples, VariantFormer delivers SOTA performance in (a) gene-expression prediction across donors and ancestries, (b) disease stratification, (c) capturing the effect of somatic mutations, and (d) eQTL prediction for low-frequency variants.
sayan ghosal retweeted
Theofanis Karaletsos @Tkaraletsos
1. VariantFormer: from human genomes to populations
VariantFormer strives to elucidate how human cells differ as a consequence of individual variation based on personalized genomes.
What is VariantFormer?
VariantFormer is the first genomic foundation model that bridges sequence-based modeling across genomes with population-based modeling across personalized genomes. It models cross-tissue bulk gene expression from whole genomes, trained on approximately 2,300 human donors.
Work led by @SayanGhosal94 and the VariantFormer team
Preprint: biorxiv.org/content/10.110…
sayan ghosal retweeted
Veera Rajagopal @doctorveera
A striking example of human genetics-driven molecular discovery. A GWAS of Yersinia pestis infection in 1,000 B cell lines identified a missense variant (rs2282284) in FCRL3, a B cell membrane protein, strongly associated with protection against infection. It turns out that FCRL3 is the receptor that Y. pestis uses to enter human cells (reminiscent of the CCR5 discovery from GWAS of HIV infection). Functional studies showed that overexpressing FCRL3 increases invasion, while knocking it out reduces it. Notably, the protein clusters at sites of bacterial contact on B cells, creating a niche where the bacterium can multiply and evade immune clearance. When the authors looked up this variant in the Japan Biobank, they found it was also protective against chronic hepatitis C infection, suggesting a shared mechanism in B cell targeting. Together, the work highlights how molecular phenotypes can power meaningful genetic discoveries, even with modest sample sizes. A reminder that sometimes the most revealing GWAS don’t come from tens of thousands of people, but from a clever “pandemic in a plate.” Paper: Keener et al. Human genetic variation reveals FCRL3 is a lymphocyte receptor for Yersinia pestis. Cell Genom 2025