The paper (arxiv.org/abs/2007.07399) interrogates how datasets in ML are made and what influence they exert. It motivates genealogical methods for datasets: tracing their histories so that users are aware of the biases datasets introduce into downstream applications.
"... the concerns with datasets go far beyond the statistical properties of who is represented, and that's what we're really trying to get at with this paper. The examination of ImageNet, from both the categorical and the distributional sides, is what sparked our research ..."
"The first question is trying to understand how dataset developers motivate the decisions that go into dataset creation. The idea was to read [the dataset artifacts] as texts and understand the values, motivations, and assumptions based on what is said and unsaid within those texts."
"Some interesting patterns, which are not too surprising but are a little disheartening: basically zero papers talk about IRB approval. The only papers that discuss IRB approval processes are review papers. I think only one paper discussed ethical considerations."
"The vast majority of dataset publications don't foreground the dataset as a core contribution. So even though datasets are really fundamental to machine learning, we don't value the construction of datasets like we value algorithmic and modeling contributions."
"There's a history of making these datasets. Well, what are the things that people bring to the table when they do that? If we can understand that, then we can see where the deficiencies are, which could lead to better approaches going forward."
"I would love somebody to take away from this paper that datasets are situated. It's not just the perspectives of the creators but also the socio-technical processes, like search engines, and the particulars of time and place that filter through in the act of creation."