Judith Bernett

@judith_bernett

Bioinformatics PhD student at TUM, find me on bluesky: https://t.co/tJTwov7PMr

Munich, Germany · Joined November 2015

103 Following · 91 Followers

Judith Bernett@judith_bernett·12 Aug

Very proud to present our latest work: 7 guiding questions to avoid data leakage in biological machine learning applications ✨🔍 We hope that reflecting on these questions helps researchers to identify issues or shortcuts leading to overly optimistic performance estimates. 📈🧑‍🔬

Nature Methods@naturemethods

A Perspective from @itisalist @judith_bernett @RomanJoeres @ok55991 @FloHasee @dg_grimm @bit_tumcs & @dbblumenthal discusses the issue of data leakage in machine learning models and presents 7 questions to identify and avoid problems as a result. nature.com/articles/s4159…

Judith Bernett@judith_bernett·22 Mar

@lipido @itisalist @dbblumenthal @hlfernandez Yes, exactly, our gold standard dataset is balanced

Daniel Glez-Peña 🇪🇺@lipido·22 Mar

@judith_bernett @itisalist @dbblumenthal @hlfernandez Thanks! 65% Acc on balanced pos/neg dataset, right?

Judith Bernett@judith_bernett·7 Mar

So happy to announce that my paper with @itisalist and @dbblumenthal "Cracking the black box of deep sequence-based protein–protein interaction prediction" is finally published at Briefings in Bioinformatics doi.org/10.1093/bib/bb… ! So what is it about? 1/13 🧵

Judith Bernett@judith_bernett·22 Mar

@lipido @itisalist @dbblumenthal @hlfernandez Apparently, using ESM-2 embeddings helps a lot already!

Judith Bernett@judith_bernett·22 Mar

@lipido @itisalist @dbblumenthal @hlfernandez Thank you, that's great to hear! In our tests, Topsy-Turvy was the method with the highest performance on our gold standard dataset. Since then, some models have been published that beat its performance, e.g., 10.1101/2023.11.09.566187 or TUnA (10.1101/2024.02.19.581072, 65% Acc)

Judith Bernett@judith_bernett·7 Mar

What is the takeaway? 📈High acc. can be reached with simple methods for known proteins -> Know your prediction task and try baselines first! 🔮Current seq.-based methods aren't made for predicting the "dark interactome" ✅We made a leakage-free dataset for future development

Judith Bernett@judith_bernett·7 Mar

12/13 🧵Because this strategy rendered most datasets too small for proper DL, we designed a larger gold standard training (163,192)/val (59,260)/test (52,048) dataset using the same partitioning strategy. The best method achieved 56% accuracy on it. doi.org/10.6084/m9.fig…

Judith Bernett@judith_bernett·23 Jan

Happy and excited to finally share this project! We show conclusively that high accuracies of deep learning-based PPI prediction models are exclusively due to data leakage via sequence similarities and node degree information. biorxiv.org/content/10.110… @itisalist @dbblumenthal

Judith Bernett retweeted

David B. Blumenthal@dbblumenthal·6 Jan

Network-based disease module mining tools often yield non-robust modules and are prone to random bias. To address this problem, we’ve designed a new method using enumeration of diverse Steiner trees: track.smtpsendmail.com/9032119/c?p=Jw… @janbaumbach @KacprowskiTim @judith_bernett @itisalist
