
The data in 30.9% of genetics papers that publish gene lists in Excel is kind of trash because of Excel’s aggressive auto-formatting.
Until 2023, there was no global option to disable data conversion. For example, the human SEPT family (1-14) of genes is directly related to cell division and cancer research.
I’ll give you one guess as to what that auto-formats to. Yup…turns into a date.
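If you want to check your own gene lists before they ever touch a spreadsheet, here's a minimal sketch that flags symbols that look like a month plus a day number. The month list and the `looks_date_like` helper are my own assumptions for illustration; this only approximates what Excel does and is not its actual conversion rule.

```python
import re

# Month-like prefixes that spreadsheet date parsing tends to recognize.
# Illustrative list only, not Excel's real rule set.
_MONTHS = "JAN|FEB|MAR|MARCH|APR|APRIL|MAY|JUN|JUNE|JUL|JULY|AUG|SEP|SEPT|OCT|NOV|DEC"
_DATE_LIKE = re.compile(rf"^({_MONTHS})[-/ ]?(\d{{1,2}})$", re.IGNORECASE)


def looks_date_like(symbol: str) -> bool:
    """Return True if a gene symbol resembles 'month + day' (e.g. SEPT2)."""
    m = _DATE_LIKE.match(symbol.strip())
    return bool(m) and 1 <= int(m.group(2)) <= 31


symbols = ["SEPT2", "MARCH1", "TP53", "BRCA1", "DEC1"]
at_risk = [s for s in symbols if looks_date_like(s)]
print(at_risk)
```

Anything this flags is worth importing as explicit text rather than letting the spreadsheet guess the type.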
Oh, it gets worse though. Many labs use what are known as RIKEN identifiers. It’s a 10-character alphanumeric code, kind of like a barcode that identifies a gene sequence. Here’s one:
2310009E13
Uh oh. There’s an “E” in there. Guess what that turns into? A floating-point number!
Excel has a hard limit of 15 significant digits for floats. So, not only did your RIKEN identifier get formatted wrong, but it’s also rounded off to an unrecoverable state.
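You can see the same trap outside of Excel. Here's a rough sketch in Python showing that the identifier above is a perfectly valid number in scientific notation, plus a hypothetical `parses_as_float` check (my own helper, not from any library) for spotting IDs that any number-guessing importer might swallow.

```python
import re

# Digits, an "E", then more digits: reads as scientific notation.
SCI_NOTATION_LIKE = re.compile(r"^\d+(\.\d+)?[eE][+-]?\d+$")


def parses_as_float(identifier: str) -> bool:
    """Return True if an identifier would read as a number like 1E5."""
    return bool(SCI_NOTATION_LIKE.match(identifier))


riken_id = "2310009E13"

# Interpreted as a number, the ID means 2310009 x 10^13.
as_number = float(riken_id)

# A short scientific-notation rendering drops most of the original digits.
displayed = f"{as_number:.2E}"

print(parses_as_float(riken_id), as_number, displayed)
```

The safe pattern is to keep identifier columns as strings end to end and never let a tool infer their type.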
12.5% of the RIKEN database (Row E) is a disaster.
If you know anything about bioinformatics, you should be losing your mind. Remember, a huge amount of scientific research is meta-analysis. Good luck cross-referencing patterns when ~31% of the data has errors!
So basically there’s a giant data hole from 2004-2023, much of which has been standardized into national / official databases, and there’s no good way to fix it.

