Goethe University Frankfurt


We continue working on the research questions originally defined by our Associated Faculty member Nicholas Marko, MD, and his team during our joint work with the Geisinger Health Collider Project.

We will work on blending clinical data with additional, novel types and sources of data to improve the quality of patient-centric healthcare analytics. We have selected three challenges, each of which has traditionally been approached using a core set of clinical data in conjunction with a standard, conservative analytic strategy. Arguably, each of these analyses has fallen short of its maximal potential because it has not adequately addressed the traditionally non-clinical (molecular, socioeconomic, behavioral, etc.) dimensions of these complex problems. Our goal is to improve the quality of these analytics by blending traditional and nontraditional data sources and by adopting an analytic strategy that incorporates novel approaches.

For each challenge we will have a core set of clinical data derived from the EMR. We will look for additional, nonclinical data to supplement our core data set, work to resolve the fundamental and technical difficulties of data blending, and aim to demonstrate the effectiveness of multi-disciplinary data use and analysis. This may result in “obtaining a better answer” or “answering a better question.” The final product will be an academic report accompanied by a body of transformed/blended data.


Data integration, or “blending,” is defined as combining data from multiple sources into a unified view of the transformed data. Blended data sets can be largely analogous (same patient features, different clinics), or come from very different domains (EMRs, financial transactions, extracts from social media, academic records, criminal records, voter turnout, etc.). The sets will not overlap at the level of individuals, but they should relate to the same general population or phenomena.

The technical process of data integration can be laborious, but it is relatively straightforward to implement. It may involve conversion between formats, filtering, recovery of missing portions of data, and elements of mathematical modelling to formalize mappings between the data sets. However, the questions of what data to look for, and how best to use it, depend on the applied problem and cannot be formalized. Resolving these issues requires a good understanding of the disciplines involved and a creative effort to construct a novel solution. Establishing feedback is also important: in multi-source data studies, one may have to restate the problem and re-acquire data as the analysis evolves.
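The technical steps named above (format conversion, recovery of missing values, mapping between data sets, and filtering) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the field names, the toy records, the choice of ZIP code as the shared key, and median imputation as the recovery method are hypothetical and are not taken from the project's data or methods. Because the sources do not share individuals, the sketch attaches area-level context to each patient through a common regional key rather than matching person to person.

```python
from statistics import median

# Hypothetical core clinical records, as if extracted from an EMR (all text fields).
emr_rows = [
    {"patient_id": "p1", "zip": "17821", "age": "64", "bmi": "31.2"},
    {"patient_id": "p2", "zip": "17822", "age": "58", "bmi": ""},      # missing BMI
    {"patient_id": "p3", "zip": "17821", "age": "", "bmi": "27.5"},    # missing age
    {"patient_id": "p4", "zip": "99999", "age": "71", "bmi": "24.9"},  # no area data
]

# Hypothetical non-clinical source: area-level indicators keyed by ZIP code.
area_rows = [
    {"zip": "17821", "median_income": 52000, "unemployment_pct": 5.1},
    {"zip": "17822", "median_income": 47000, "unemployment_pct": 6.3},
]


def to_float(text):
    """Format conversion: parse a text field as a float, or None if it is empty."""
    text = text.strip()
    return float(text) if text else None


def blend(emr, area):
    """Return one unified record per patient, joined to area data on ZIP code."""
    area_by_zip = {row["zip"]: row for row in area}

    # Conversion: turn numeric text fields into numbers.
    patients = [
        {**row, "age": to_float(row["age"]), "bmi": to_float(row["bmi"])}
        for row in emr
    ]

    # Recovery: fill gaps with the median of the observed values and flag them,
    # so that later analysis can tell measured values from imputed ones.
    for field in ("age", "bmi"):
        fill = median(p[field] for p in patients if p[field] is not None)
        for p in patients:
            p[field + "_imputed"] = p[field] is None
            if p[field] is None:
                p[field] = fill

    # Mapping and filtering: attach area context, drop patients without a match.
    blended = []
    for p in patients:
        context = area_by_zip.get(p["zip"])
        if context is not None:
            blended.append({**p, **context})
    return blended


for record in blend(emr_rows, area_rows):
    print(record)
```

In this sketch the patient with the unmatched ZIP code is dropped from the blended view. Whether to drop, impute, or go back and acquire more data for such records is exactly the kind of decision that depends on the applied problem.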

Modern data science is driven by looking for evidence in new ways and from new sources. But does using a larger volume and variety of data result in an improved understanding of the underlying questions? This question remains to be answered in detail, and establishing better practices of data blending and then studying the downstream consequences will be beneficial for medical informatics and data science in general. The hunt for missing data is still an art rather than a science. Many professionals don’t know what is available just outside of their scope of interest, and many academic data science programs have limited access to proprietary and sensitive “real-world” data. Instead, they rely on the instructors’ and students’ ability to hunt for public data sources.

This leads to very uneven performance of educational programs: in fact, many early-career professionals are completely unprepared for dealing with data of industrial size and complexity.

Three Challenges:

1. It’s not all in your head: multi-disciplinary data analysis of common psychological conditions

Psychological mood disorders are often described as both social and medical phenomena. Recent studies in suicide prevention make connections between mood disorders and patterns in residential power use, logs of phone calls, and purchasing history, while older studies identify certain demographic groups as being more at risk. The problem of screening for and predicting the risk of mood disorders in the general population could have a major impact on population health. Can a combination of EMRs (containing clinical data) and other data sources improve upon current strategies for predicting the individualized risk for developing a mood disorder?

2. Integrated data analysis for early warning of heart/lung failure

The advantages of early diagnosis of serious medical conditions are obvious. Early diagnosis is particularly important for congestive heart failure (CHF) and chronic obstructive pulmonary disease (COPD), which are among the most common causes of death in the US. They may result from multiple causes and exist together with other complications: in the absence of an early warning model, it is difficult to prioritize testing and allocate resources in the diagnostic process. Both conditions affect millions of patients and are associated with socioeconomic factors: occupational hazards, lifestyle choices, and environmental factors. Using statistical inference on EMRs, we can estimate the chances of a CHF/COPD diagnosis pre-emptively, before it is confirmed by a medical specialist. Can we make our prediction better by using additional information?

3. Indirect data collection to support anti-obesity efforts in healthcare and society

Obesity is a primary population health concern in the US. The contributing factors (diet, inactive lifestyle, role of food in social interactions) are fairly well-described, but their role in creating a successful prevention strategy is not fully understood. Analytic efforts establishing links between isolated social and medical factors and obesity are often inconclusive or ineffective. New hope lies in integrated analysis of complete medical and social histories. Using EMRs, we can see patterns that obese patients have in common, infer risk of obesity from other medical conditions, and also find new ways to characterize patients who could be successfully treated. Can we use even more data in the analysis by including non-clinical events? What non-traditional and indirect information about a patient’s background and multi-faceted behavior can be collected to contribute to anti-obesity studies?
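Each of the three challenges closes with the same question: does adding non-clinical data improve the prediction? One possible way to check this, sketched below in Python, is to fit the same simple risk estimate twice, once on a clinical feature alone and once on the blended feature set, and to compare both on held-out records. The sketch rests on stated assumptions: the data are entirely synthetic, the feature names `clinical_flag` and `exposure_flag` are hypothetical, the per-group outcome frequency is only a stand-in for whatever statistical model a study would actually use, and the Brier score is one of several reasonable metrics.

```python
import random


def brier_score(predictions, outcomes):
    """Mean squared error between predicted risks and 0/1 outcomes (lower is better)."""
    return sum((p - y) ** 2 for p, y in zip(predictions, outcomes)) / len(outcomes)


def fit_group_rates(rows, features):
    """Estimate the outcome frequency within each combination of feature values."""
    totals = {}
    for row in rows:
        key = tuple(row[f] for f in features)
        count, positives = totals.get(key, (0, 0))
        totals[key] = (count + 1, positives + row["outcome"])
    return {key: positives / count for key, (count, positives) in totals.items()}


def evaluate(train, test, features):
    """Fit on the training rows, then return the Brier score on held-out rows."""
    rates = fit_group_rates(train, features)
    # Fall back to the overall training rate for unseen feature combinations.
    fallback = sum(row["outcome"] for row in train) / len(train)
    predictions = [
        rates.get(tuple(row[f] for f in features), fallback) for row in test
    ]
    return brier_score(predictions, [row["outcome"] for row in test])


def make_synthetic_rows(n, rng):
    """Synthetic patients only: both features and the risk formula are invented."""
    rows = []
    for _ in range(n):
        clinical_flag = rng.random() < 0.5   # stands in for an EMR-derived feature
        exposure_flag = rng.random() < 0.5   # stands in for a non-clinical feature
        risk = 0.05 + 0.15 * clinical_flag + 0.25 * exposure_flag
        rows.append({
            "clinical_flag": clinical_flag,
            "exposure_flag": exposure_flag,
            "outcome": int(rng.random() < risk),
        })
    return rows


rng = random.Random(0)  # fixed seed so the comparison is repeatable
train = make_synthetic_rows(5000, rng)
test = make_synthetic_rows(5000, rng)

clinical_only = evaluate(train, test, ["clinical_flag"])
blended = evaluate(train, test, ["clinical_flag", "exposure_flag"])
print(f"Brier score, clinical only: {clinical_only:.4f}")
print(f"Brier score, blended:       {blended:.4f}")
```

The synthetic outcome is constructed to depend on both flags, so the blended estimate scores better here by design. With real data there is no such guarantee, which is why the comparison has to be made on records that were not used for fitting.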


Contact Person: Prof. Roberto V. Zicari

Frankfurt Big Data Lab: http://www.bigdata.uni-frankfurt.de