Machine learning is everywhere in science and technology: powering facial recognition, picking your recommendations on Netflix, and controlling self-driving cars. But how reliable are machine learning techniques really? A statistician says that the answer is “not very,” arguing that questions about the accuracy and reproducibility of machine learning have not been fully addressed.
Dr Genevera Allen, associate professor of statistics, computer science, and electrical and computer engineering at Rice University in Houston, Texas, discussed this topic at a press briefing and at a scientific conference, the 2019 Annual Meeting of the American Association for the Advancement of Science (AAAS). She warned that machine learning researchers have spent so much time developing predictive models that they have not devoted enough attention to checking how accurate those models are, and that the field must develop systems that can assess the accuracy of their own findings.
“The question is, ‘Can we really trust the discoveries that are currently being made using machine-learning techniques applied to large data sets?’” Allen said in a statement. “The answer in many situations is probably, ‘Not without checking,’ but work is underway on next-generation machine-learning systems that will assess the uncertainty and reproducibility of their predictions.”
For example, machine learning has recently been used to study patients with cancer. Scientists use it to identify groups of genetically similar individuals so that drug therapies can be targeted to those specific genomes. But when the results are compared across different studies, the clusters identified by machine learning are completely different from each other.
The problem is that machine learning techniques have no way to say “I don’t know” or “It’s not clear.” They will almost always produce an answer (in the example of the cancer patients, they will always identify a group in some way), but this answer may not be as certain or accurate as it is believed to be. The techniques can find a pattern in a data set even when it is only dimly present, and such a pattern may not hold in the real world.
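To make the point concrete, here is a minimal sketch in Python, not drawn from Allen's work. It assumes a bare-bones k-means clustering routine (the hypothetical helper `kmeans_1d`) and feeds it uniformly random numbers that contain no real groups at all. The algorithm still assigns every point to one of the requested clusters, whatever number of clusters is asked for.

```python
import random


def kmeans_1d(values, k, iterations=50, seed=0):
    """Illustrative Lloyd-style k-means on a list of numbers.

    Returns (centers, labels). This is a teaching sketch, not a
    production clustering routine.
    """
    rng = random.Random(seed)
    centers = rng.sample(values, k)  # start from k randomly chosen data points

    def nearest(v):
        # Index of the center closest to v.
        return min(range(k), key=lambda j: abs(v - centers[j]))

    for _ in range(iterations):
        labels = [nearest(v) for v in values]
        for j in range(k):
            members = [v for v, lab in zip(values, labels) if lab == j]
            if members:  # leave a center where it is if it attracted no points
                centers[j] = sum(members) / len(members)

    return centers, [nearest(v) for v in values]


# Structureless input: 300 uniformly random numbers with no true groups.
noise_rng = random.Random(42)
noise = [noise_rng.random() for _ in range(300)]

for k in (2, 3, 5):
    centers, labels = kmeans_1d(noise, k)
    sizes = [labels.count(j) for j in range(k)]
    print(f"k={k}: group sizes {sizes}")
```

Nothing in the output signals that the groups are arbitrary. That judgment has to come from a separate check, such as rerunning the analysis on independent data and comparing the results.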
“There is general recognition of a reproducibility crisis in science right now,” Allen told BBC News. “I would venture to argue that a huge part of that does come from the use of machine learning techniques in science.”