By Omnia Gohar
SAN FRANCISCO—“Big data,” so central to modern physics and astronomy, offers the potential to revolutionize medicine as well. Vast archives of biomedical information could already include the data needed to diagnose and treat many diseases, said computational health scientist Atul Butte of the University of California, San Francisco, on 26 October at the World Conference of Science Journalists 2017.
From diagnosing breast cancer to detecting preeclampsia and beyond, big data “is a new soil just lying there waiting for seeds,” Butte told journalists in a session titled “Open Data and Global Drug Discovery.” Researchers just need to examine data available from large numbers of medical trials, analyze it, and let computerized algorithms sort out underlying patterns, he said.
In his talk, Butte pointed out the numerous, and underutilized, opportunities offered by today’s massive volumes of biomedical data.
Big data against cancer
For example, Butte highlighted the work of Brittany Wenger. In 2012, this 17-year-old developed “Cloud4Cancer,” a computer program for breast cancer diagnosis, using data from 7.6 million trials to detect patterns and identify malignant tumors. The program is available as a web service to doctors and hospitals all over the world.
Wenger’s technology proved to be 99.1% sensitive to malignancy—nearly 5% more accurate than available commercial breast cancer diagnostics. “If a high school kid can do it, every scientist now needs to learn what to do with data,” said Butte.
Carmenta Bioscience, a biotechnology company cofounded by Butte in Palo Alto, California, has also harnessed big data to develop a cutting-edge preeclampsia diagnostic tool. Preeclampsia is a blood pressure disorder associated with pregnancy. The condition is a primary cause of maternal mortality worldwide. Typically doctors have diagnosed preeclampsia using a non-specific test that looks for protein in urine. But such protein can indicate a wide range of diseases, such as diabetes, sickle cell anemia and chronic kidney disease.
In 2013, researchers analyzed data sets from seven previous studies and developed a test that identifies nine specific protein markers in a blood sample to diagnose preeclampsia. The team used this finding to launch Carmenta.
Marrying data with hypotheses
Butte argued that we need more researchers working on projects that start from a database, rather than from a hypothesis. But some scientists have questioned whether data-driven research is science.
Hypothesis-driven research is meant to advance the understanding of how things work. Data-driven research can uncover underlying patterns and trends quickly, which is useful for developing technologies. However, it reveals little about the mechanisms behind a disease.
When an audience member asked about this shortcoming of big data, Butte agreed that, in parallel with scouring data archives, researchers must keep working to understand the underlying science.
Analyzing big data is also cheap, Butte said: “You can buy an entire preclinical trial experiment off the internet with your credit card.”
The enthusiasm about big data is understandable: It is available, accessible, and usable. Yet it’s challenging to judge the quality of this great amount of information. One main concern is sampling error: Will any inherent sampling issues become magnified with ever-larger data sets? Butte explained that scientists can still control the quality of data by choosing certified labs.
Many researchers today only use publicly available data to validate results from their own laboratories. But when used only in that way, Butte said, “data is just as valueless as kitten videos.”
Omnia Gohar is a recent graduate of the Applied Chemistry and Industrial Microbiology program at the Faculty of Science, Alexandria University, in Egypt. She is now a freelance science writer and enjoys covering emerging science in the Middle East. Email her at firstname.lastname@example.org