What’s the Big Deal with Big Data?

Clinical trials, the largest of which may enroll a few thousand patients with hematology or oncology diagnoses, represent the gold standard of clinical research. But what if clinical decisions could be made, or research questions answered, using data from tens of thousands or even a million patients?

Initiatives are springing up across the country to examine the power and promise of big data – massive amounts of information that can be analyzed to provide an overview of trends or patterns – to revolutionize health care and transform how patients are diagnosed, treated, and even involved in their own care.

For instance, in 2012, the National Institutes of Health (NIH) established the Big Data to Knowledge (BD2K) initiative, an effort to promote research and development of tools and approaches that would accelerate the use of big data in biomedical research.1 This spring, IBM launched IBM Watson Health and the Watson Health Cloud platform, a new unit of the IBM Watson cognitive computing system that will analyze and extract large volumes of health data from structured and unstructured medical systems.2

While the promise of big data is immense, the movement is not without technical, legislative, and privacy concerns. We spoke with several experts who seemed to agree that, before big data becomes a regular part of clinical practice, questions about these concerns need to be answered.

What Exactly is Big Data?

The term “big data” is popping up more often in discussions of the future of health care, but what does it really mean?

At its core, big data refers to data collected from multiple internal and external sources that are too large and complex for traditional analytics to handle. In recent years, there have been a number of technological innovations to store, process, generate, and interpret these large datasets, such as analytical models and algorithms.

How big are these data? In 2011, IBM estimated that the global size of data in health care was 161 billion gigabytes. In 2012, 2.3 trillion gigabytes of data were created each day.3

Anne-Marie Meyer, PhD, assistant professor of epidemiology at the Gillings School of Global Public Health at the University of North Carolina, prefers to use “the four Vs” when describing big data:

  • Volume: the scale of data
  • Variety: the many sources and types of data, both structured and unstructured
  • Veracity: accuracy and certainty of data
  • Velocity: the pace at which data flows in from sources like electronic medical records and monitoring devices3

While this definition encompasses the size, complexity, and technology components, it also speaks to the challenges and opportunities that exist within big data in any field.

“Variety and veracity of data are the two biggest issues,” Dr. Meyer said, offering the retail industry as an example. “Companies can analyze big data to derive algorithms to predict shoppers’ behaviors, but it’s not always accurate. When these algorithms are wrong in health care, though, the stakes are a lot higher.”

Two developments have coincided to make big data in health care a viable reality: the advent of electronic medical records (EMRs) and increased computing capacity. With these advances, the health-care industry is now able to collect health information electronically, at a high volume, at a rapid pace, and at the exact moment of health-care delivery. The ultimate goal is to use the data from large patient populations to predict future outcomes and treatment responses for other patients.

“Our challenge is to figure out how to produce good data, and then to bring these data together and use them with integrity,” Dr. Meyer added.

For big data to serve as a useful tool in the clinical setting, it will need to be incorporated into the EMR system, rather than forcing the physician to log into and learn to use multiple software systems.

The Big Idea Behind Big Data

EMRs have been growing in prevalence over the last decade. According to the Centers for Disease Control and Prevention, 78.4 percent of office-based physicians were using some type of EMR in 2013.4 As these systems proliferate, so do the opportunities to collect more patient data.

Big data, whether clinical or genomic, has the potential to transform patient care by acting as a support tool for physicians – helping them to make real-time decisions about how to best treat the patient sitting in front of them.

In addition, big data can help tailor medical therapy by selecting a smaller set of similar patients from a large pool of data and comparing their responses to a particular treatment.

“For the clinical doctors, especially in hematology and medical oncology, I think it is incredibly challenging to predict whether or not someone is going to respond to a therapeutic agent,” Dr. Meyer said, adding that she views big data as “an immense opportunity” to collect large volumes of data, create predictive algorithms, and, ultimately, improve patient care.

For instance, if a patient has a specific genomic abnormality or mutated gene, clinicians may be able to use big data resources to identify others with a similar mutation and determine whether certain drug regimens outperformed others in those individuals.
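The cohort-matching idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records, field names, gene labels, and regimen names are all made up, and a real analysis would draw on far larger datasets and adjust for differences between patients.

```python
from collections import defaultdict

# Hypothetical, hand-made records; every field name and value is illustrative.
PATIENTS = [
    {"id": 1, "mutation": "GENE_A", "regimen": "regimen_x", "responded": True},
    {"id": 2, "mutation": "GENE_A", "regimen": "regimen_x", "responded": True},
    {"id": 3, "mutation": "GENE_A", "regimen": "regimen_y", "responded": False},
    {"id": 4, "mutation": "GENE_A", "regimen": "regimen_y", "responded": True},
    {"id": 5, "mutation": "GENE_B", "regimen": "regimen_x", "responded": False},
    {"id": 6, "mutation": "GENE_A", "regimen": "regimen_x", "responded": False},
]


def response_rates_by_regimen(patients, mutation):
    """Return {regimen: (responders, total, rate)} for patients with `mutation`."""
    counts = defaultdict(lambda: [0, 0])  # regimen -> [responders, total]
    for patient in patients:
        if patient["mutation"] != mutation:
            continue  # keep only the "similar" patients
        counts[patient["regimen"]][1] += 1
        if patient["responded"]:
            counts[patient["regimen"]][0] += 1
    return {reg: (r, n, r / n) for reg, (r, n) in counts.items()}


rates = response_rates_by_regimen(PATIENTS, "GENE_A")
for regimen, (responders, total, rate) in sorted(rates.items()):
    print(f"{regimen}: {responders}/{total} responded ({rate:.0%})")
```

With only a handful of records, the resulting rates mean very little; the comparison becomes informative only when the matched cohort is large.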

John Sweetenham, MD, executive medical director for the Huntsman Cancer Institute at the University of Utah, noted one area where big data could be especially valuable: relapse settings. Discovering which drugs work best in a relapsed patient is currently a matter of trial and error; big data could give clinicians more real-world evidence when making treatment decisions.

“The fact that we are going to be able to do that in a data-driven way means that we’re going to see higher response rates, longer survival, and fewer unnecessary treatments down the line,” he said.

In an analysis published in Health Affairs, David W. Bates, MD, MSc, and colleagues identified six areas where they believed big data would offer the greatest advantage: high-cost patients, readmissions, triage, decompensation, adverse events, and treatment optimization for diseases that affect more than one organ system.5

“Both predicting outcomes of patients – such as who will be a high-cost patient, will be readmitted, or will suffer an adverse event – and tailoring the management of patients should result in substantial savings for the health-care system,” Dr. Bates wrote in the paper.

Clinicians could also potentially use big data to support requests for payment or prior authorization from payers – in particular, for drugs that are not approved for a given indication but have shown evidence that they could be useful in it.

Isaac S. Kohane, MD, PhD, chair of the Department of Biomedical Informatics at Harvard Medical School, added that big data can be used to learn more about specific diseases. Dr. Kohane, for instance, has used entire health records from multiple institutions to study children with autism. “Access to the complete health record allows us to have a glance at the clinical landscape of all the other diseases that they have – not just the neurobehavioral diseases, but all of the other clinical diseases that they have, and across tens of thousands of patients.”

Dr. Kohane and colleagues discovered that patients with autism also had a high incidence of inflammatory bowel disease – a finding that may not have been apparent without a wider lens to study the population.

“You could have 1,000 kids with autism in your practice, which is a huge number, but you would likely only see about 10 of these kids with inflammatory bowel disease in the lifetime of your practice,” Dr. Kohane explained. “But, if you looked at tens of thousands of patients, it would become very clear that the association is statistically significant.”
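Why the same proportion can look like chance in one practice yet stand out across many institutions comes down to sample size. The sketch below is a rough illustration, not a reconstruction of Dr. Kohane’s analysis: it uses a simple one-sample z-test with a normal approximation, and the baseline rate of 0.7 percent is a made-up figure chosen only to show the effect.

```python
import math


def one_sample_proportion_z_test(cases, n, baseline_rate):
    """Two-sided z-test (normal approximation) of an observed rate vs. a baseline."""
    observed = cases / n
    std_err = math.sqrt(baseline_rate * (1 - baseline_rate) / n)
    z = (observed - baseline_rate) / std_err
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail probability
    return z, p_value


# ASSUMPTION: hypothetical comparison rate, not a published prevalence figure.
HYPOTHETICAL_BASELINE = 0.007

# Same observed proportion (1 percent) at two very different sample sizes.
for cases, n in [(10, 1_000), (300, 30_000)]:
    z, p = one_sample_proportion_z_test(cases, n, HYPOTHETICAL_BASELINE)
    print(f"n={n:>6}: observed rate={cases / n:.3f}, z={z:.2f}, p={p:.3g}")
```

Under these assumed numbers, the difference is not distinguishable from chance at 1,000 patients but is at 30,000, because the standard error shrinks with the square root of the sample size.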

Dr. Kohane said his research team was also able to identify three distinct subgroups of autism. “These three subgroups are quite distinct; they all have autism, yes, but they have three distinct clinical manifestations all strongly suggesting three different biologies.”

Dr. Kohane and others familiar with the potential of big data are optimistic that they have only begun to scratch the surface when it comes to using these resources to make new discoveries about diseases and how they affect patients.

Big Data Analysis Versus Clinical Trials

Randomized clinical trials are requisite for the rigorous study of new drugs and ultimately for approval of drugs by regulatory agencies. Big data may complement such studies by offering a more robust view of real-world patients.

Performing secondary analyses of big data can answer questions about the effectiveness of therapies that trials cannot, Dr. Meyer said, mainly because these records include data from individuals who may not have met the inclusion criteria for a trial.

“Trials are asking questions about efficacy,” she noted. “Comparative-effectiveness or secondary analyses of big data are asking questions of effectiveness at the level of the population. It’s a different question. Big data is asking, ‘How does this therapy interact in the real world?’”

Dr. Sweetenham said big data tools will allow clinicians to be more “granular” and “patient-specific” in their approach to directing treatments. “Within any large randomized clinical trial, large effects in a small proportion of the patients in the trial can be obscured. What might be potentially very beneficial treatments in a small subset of patients, would get lost in the overall results of the study,” he explained.
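The dilution Dr. Sweetenham describes is simple arithmetic: the overall result of a trial is a weighted average of the effects in its subgroups. The following sketch uses invented numbers (a 5 percent subgroup with a 40-point benefit and no benefit for everyone else) purely to show how a large effect can shrink in the pooled result.

```python
def overall_effect(subgroups):
    """Weighted average treatment effect across subgroups.

    `subgroups` is a list of (share_of_trial, effect) pairs whose shares sum to 1.
    """
    total_share = sum(share for share, _ in subgroups)
    if abs(total_share - 1.0) > 1e-9:
        raise ValueError("subgroup shares must sum to 1")
    return sum(share * effect for share, effect in subgroups)


# ASSUMPTION: hypothetical trial, not data from any real study.
# 5% of patients gain 40 percentage points in response rate; 95% gain nothing.
hypothetical_trial = [(0.05, 40.0), (0.95, 0.0)]
pooled = overall_effect(hypothetical_trial)
print(f"Pooled effect: {pooled:.1f} percentage points")
```

A pooled gain of about two percentage points could easily be dismissed as negligible, even though the benefit to the small subgroup is substantial.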

Big data analysis also can provide immediate feedback about how to treat patients, avoiding the years-long delays of getting practical results from clinical trials, and solve small-sample-size concerns that could possibly plague a trial.

EMRs are also one of the best sources of patient-reported outcomes, which are likely to become a more important focus as the move is made to value-based health care. “As these big data tools emerge, we are going to get much more information around well-being, health-related behaviors, and functional capabilities of patients once they have had a particular treatment. It’s going to be a different level of data from the standard quality-of-life assessments in clinical trials,” Dr. Sweetenham said.

The Bigger the Better?

While those who spoke with ASH Clinical News see the incredible value big data could provide, they also acknowledge that the big data movement comes with some big obstacles.

First, there’s the obvious problem: the size of these data.

D. Neil Hayes, MD, MPH, associate professor of medicine and director of clinical research for the cancer center at the University of North Carolina in Chapel Hill, North Carolina, who has focused much of his work on tumor sequencing, said that tumor sequencing currently covers anywhere from a couple dozen genes to a couple hundred genes – creating a large amount of data. “There are very few places for researchers to place genetic data at this point,” he said. “The size and scale of genetic data can be a barrier to even publishing papers.”

Dr. Kohane added that while genomic data are large, he believes clinical data dwarf the available genomic data. Clinical data encompass magnetic resonance imaging scans, pathology reports, and a patient’s medical history, all of which are necessary to fully understand genetic variance and what it means. “The clinical code is far more complex and requires literally hundreds of gigabytes per person to store,” he said. “To store all of the details of an MRI image itself from one patient is already tens if not hundreds of megabytes, and that’s just one [radiographic] study.”

Aside from the mere size of the data, data interoperability also remains elusive. The industry has yet to find a way to integrate multiple data systems – such as the EMR, claims data, and research data – or get various electronic medical record systems to “speak” to one another.

“It’s really hard to obtain these data,” Dr. Meyer said, partly due to the difficulty of operating the informatics systems that collect the data, but also “from a privacy and governance standpoint.”

Current rules on depositing sequencing data into a public database are difficult to work with. “They are very hard to be compliant with, and they are pretty labor-intensive in terms of getting our data into the research record so that we can even share our publication in a compliant way.”

Protecting patient privacy and upholding HIPAA (Health Insurance Portability and Accountability Act) laws to secure patient data are other challenges facing the industry. There are currently 75 different requirements that fall under the HIPAA umbrella, and every health-care organization is responsible for the protection of its patients’ data.

Privacy concerns, however, are nothing new and are already being dealt with in EMRs. To store these massive amounts of patient information, organizations are migrating from physical storage devices to the cloud – raising even more questions about security and privacy.

Cloud storage systems assign providers, research staff, and even patients different levels of permissions, allowing them to securely access EMRs and retrieve crucial patient health information while maintaining HIPAA compliance. However, this system does require extra training and vigilance on the part of clinicians to protect patient privacy.
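The idea of tiered permissions can be illustrated with a small role-based access check. This is a minimal sketch under stated assumptions: the role names, action names, and rules below are hypothetical, and they describe neither HIPAA’s actual requirements nor any vendor’s system.

```python
from enum import Enum


class Role(Enum):
    PROVIDER = "provider"
    RESEARCHER = "researcher"
    PATIENT = "patient"


# ASSUMPTION: illustrative permission sets, not a regulatory or vendor model.
PERMISSIONS = {
    Role.PROVIDER: {"read_identified", "write_record"},
    Role.RESEARCHER: {"read_deidentified"},
    Role.PATIENT: {"read_own"},
}


def can_access(role, action, *, requester_id=None, record_owner_id=None):
    """Return True if `role` may perform `action` on the record."""
    if action not in PERMISSIONS.get(role, set()):
        return False
    if action == "read_own":
        # Patients may only read the record that belongs to them.
        return requester_id is not None and requester_id == record_owner_id
    return True


print(can_access(Role.PROVIDER, "write_record"))
print(can_access(Role.RESEARCHER, "read_identified"))
print(can_access(Role.PATIENT, "read_own", requester_id=7, record_owner_id=7))
```

Keeping the rules in one table makes it easier to audit who can see what, which is part of the vigilance the approach demands.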

In some cases, transitioning to off-site storage may be more secure than traditional storage methods, according to Dr. Kohane. For instance, larger systems designed specifically to handle big datasets may offer more security than smaller systems, such as universities and individual hospitals, for storing research and patient data. “Compared to current practice, I think we can be far more secure and far more fastidious in the care of patients’ data than traditional research,” he said.

There is also the issue of exactly who will be using these data and for what reasons. While insurance companies and payers can use big data to prevent fraud or to identify which preventive measures would best benefit patients (and reduce costs later on), there is also the risk that big data will replace clinicians’ experience when deciding which treatments or procedures are necessary for patients – and which treatments or procedures insurance companies will pay for.

What Is Currently Being Done?

Despite these challenges, new initiatives are being created across the country to advance the role of big data systems in health care and improve the collection and integration of such datasets.

The amount of data available has grown in recent years thanks to The Health Data Initiative,6 an effort by the U.S. Health and Human Services Department (HHS) designed to make health data more openly available. Launched in 2010, The Health Data Initiative changed the default setting of HHS data from “closed” to “open,” opening the more than 1,000 datasets created since the program began. Private-sector entrepreneurs and investigators are now able to harness these newly released HHS data to create health-related services, applications, and products.

These datasets are housed at HealthData.gov, a central website that was designed to be easily accessible and searchable.

At NIH, the Big Data to Knowledge (BD2K) Initiative was established in 2012 to facilitate biomedical research. According to Philip E. Bourne, PhD, associate director for Data Science, the BD2K initiative is a $110-million-a-year program designed to support the research and development of tools that will “accelerate the integration of big data and data science into biomedical research.”1

So far, 12 data centers of excellence across the country have been created, all working on various aspects of enhancing the efficiency of health-care big data: whether it is complete genotype-to-phenotype information, social networking data, or mobility data.

“Basically we are working on this notion of FAIR principles,” Dr. Bourne said. “We want things to be findable, accessible, interoperable, and reusable, and we are doing what we can to foster that within the program.”

The NIH is also funding BioCADDIE (Biomedical and Healthcare Data Discovery Index Ecosystem), a data discovery project that would help index data that are stored in other places.7 “About 88 percent of the data that are referenced in papers are not in a recognized resource – they are in someone’s lab and over time sort of atrophy away. We want to be able to find those datasets,” he noted.

IBM has thrown its hat into the health-care big data ring, as well. Earlier this year, IBM’s Watson Health division announced that it will collaborate with Epic (an electronic medical records software system) and the Mayo Clinic to apply the cognitive computing capabilities of the Watson supercomputer to patient records. The goal is to use big data to develop patient treatment protocols, personalize the management of chronic conditions, and provide relevant medical evidence to doctors and nurses.8

In one example of bringing precision cancer treatments to more patients, the Watson supercomputer will apply a “big data” approach to genomic analysis. Drawing on its database of clinical trials and research papers on particular cancers and potential therapies, it will sift through thousands of mutations in a patient’s DNA fingerprint and use a scoring system to distinguish the mutations that drive the malignancy from the others. The computer system will then match the actionable targets to approved and experimental drugs.

To answer the question of where all those health data can be stored, the Watson Health Cloud platform was created to store the large quantity of personal health data created each day.

The Cloud enables information to be shared and combined with research, clinical, and social health data in an anonymous way.

What’s the Future for Big Data?

In the years ahead, Dr. Kohane says he believes the boundaries will blur between what has traditionally been considered research data and clinical data. That doesn’t mean, however, that the role of the randomized clinical trial is endangered.

“Big data is already being used as a tool, but I don’t believe that secondary analysis of big data – population-level comparative-effectiveness data – will ever completely replace trials,” Dr. Meyer said. “By definition, we are not controlling the experiment. We are not controlling the therapy.”

Instead, the experts who spoke with ASH Clinical News believe big data will give researchers and clinicians a deeper wealth of information to draw from as they make new discoveries and make better-informed clinical decisions.

Dr. Kohane also believes that patients will play a greater role in who is able to see their personal health data. “I think it’s going to become very clear in short order that not only do patients have a right to their data – which everybody has understood for decades, in principle – but, practically speaking, they are going to now have direct or proxy control of who can see their data. In the end, as we all eventually become patients, we will have the most to gain and the most to lose from who sees our data.”

By Jill Sederstrom


  1. National Institutes of Health. “About Big Data to Knowledge (BD2K).” Accessed September 10, 2015 from https://datascience.nih.gov/bd2k/about.
  2. IBM. “Watson to gain ability to ‘see’ with planned $1B acquisition of Merge Healthcare.” Accessed September 11, 2015 from www.03.ibm.com/press/us/en/pressrelease/47435.wss.
  3. IBM. “The four V’s of big data.” Accessed September 10, 2015 from www.ibmbigdatahub.com/infographic/four-vs-big-data.
  4. Centers for Disease Control and Prevention. “Electronic medical records/electronic health records.” Accessed September 13, 2015 from www.cdc.gov/nchs/fastats/electronic-medical-records.htm.
  5. Bates DW, Saria S, Ohno-Machado L, et al. Big data in health care: using analytics to identify and manage high-risk and high-cost patients. Health Aff (Millwood). 2014;33:1123-31.
  6. HealthData.gov. “Unleashing the power of data and innovation to improve health.” Accessed September 13, 2015 from www.healthdata.gov/content/about.
  7. National Institutes of Health. “Resource indexing.” Accessed September 21, 2015 from https://datascience.nih.gov/bd2k/funded-programs/resource-indexing.
  8. IBM. “IBM Watson Health, Epic and Mayo Clinic to unlock new insights from electronic health records.” Accessed September 13, 2015 from www.03.ibm.com/press/us/en/pressrelease/46768.wss.