We continue working on the research questions originally defined by our Associated Faculty Nicholas Marko, MD, and his team during our joint work with the Geisinger Health Collider Project.

We will work on blending clinical data with additional, novel types and sources of data to improve the quality of patient-centric healthcare analytics. We have selected three challenges, each of which has traditionally been approached using a core set of clinical data in conjunction with a standard, conservative analytic strategy. Arguably, each of these analyses has fallen short of its full potential because it has not adequately addressed the traditionally non-clinical (molecular, socioeconomic, behavioral, etc.) dimensions of these complex problems. Our goal is to improve the quality of these analytics by blending traditional and nontraditional data sources and by adopting an analytic strategy that incorporates novel approaches.

For each challenge we will have a core set of clinical data derived from the EMR. We will look for additional, nonclinical data to supplement our core data set, resolve the fundamental and technical difficulties of data blending, and demonstrate the effectiveness of multi-disciplinary data use and analysis. This may result in “obtaining a better answer” or “answering a better question.” The final product will be an academic report accompanied by a body of transformed/blended data.


Data integration, or “blending,” is defined as combining data from multiple sources into a unified view of the transformed data. Blended data sets can be largely analogous (same patient features, different clinics), or come from very different domains (EMRs, financial transactions, extracts from social media, academic records, criminal records, voter turnout, etc.). The sets will not overlap at the level of individuals, but they should relate to the same general population or phenomena.

The technical process of data integration can be laborious, but it is relatively straightforward to implement. It may involve conversion between formats, filtering, recovery of missing portions of data, and elements of mathematical modelling to formalize mappings between the data sets. However, the questions of what data to look for, and how best to use it, depend on the applied problem and cannot be formalized. Resolving these issues requires a good understanding of the disciplines involved and a creative effort to construct a novel solution. Establishing feedback is also important: in multi-source data studies, one may have to restate the problem and re-acquire data as the analysis evolves.
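The mechanical steps named above (format conversion, filtering, recovery of missing values, and a mapping between data sets) can be sketched in a few lines. This is only an illustration under stated assumptions, not the project's actual pipeline: the field names, the `region` linking key, the `deprivation_index` attribute, and all values are invented for the example, and median imputation stands in for whatever recovery method a real study would choose.

```python
from statistics import median

# Hypothetical patient-level EMR extract (all values invented for illustration).
emr_rows = [
    {"patient_id": "p1", "region": "R1", "age": "64", "bmi": "31.2"},
    {"patient_id": "p2", "region": "R1", "age": "58", "bmi": ""},      # BMI missing
    {"patient_id": "p3", "region": "R2", "age": "71", "bmi": "27.5"},
    {"patient_id": "p4", "region": "",   "age": "45", "bmi": "24.0"},  # no linking key
]

# Hypothetical non-clinical source describing regions, not individuals.
area_rows = [
    {"region": "R1", "deprivation_index": 0.7},
    {"region": "R2", "deprivation_index": 0.3},
]


def to_float(text):
    """Convert a string field to float; empty strings become None."""
    text = text.strip()
    return float(text) if text else None


def blend(emr, area):
    """Return one unified record per patient that can be linked to a region."""
    # Mapping between the data sets: a shared region code.
    area_by_region = {row["region"]: row for row in area}

    # Format conversion: EMR fields arrive as strings.
    converted = [
        {
            "patient_id": r["patient_id"],
            "region": r["region"].strip(),
            "age": to_float(r["age"]),
            "bmi": to_float(r["bmi"]),
        }
        for r in emr
    ]

    # Filtering: keep only patients whose region exists in the second source.
    linked = [r for r in converted if r["region"] in area_by_region]

    # Recovery of missing data: fill missing BMI with the median of observed values.
    observed = [r["bmi"] for r in linked if r["bmi"] is not None]
    fill = median(observed) if observed else None

    blended = []
    for r in linked:
        blended.append({
            **r,
            "bmi": r["bmi"] if r["bmi"] is not None else fill,
            "bmi_imputed": r["bmi"] is None,  # keep track of what was filled in
            "deprivation_index": area_by_region[r["region"]]["deprivation_index"],
        })
    return blended


blended_rows = blend(emr_rows, area_rows)
```

Note that the two sources never share an individual: the link runs through the region, which is one way data sets can relate to the same population without overlapping on patients. Flagging imputed values (`bmi_imputed`) keeps the downstream analysis honest about which numbers were observed.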

Modern data science is driven by looking for evidence in new ways and from new sources. But does using a larger volume and variety of data result in an improved understanding of the underlying questions? This question remains to be answered in detail, and establishing better practices of data blending and then studying the downstream consequences will be beneficial for medical informatics and data science in general. The hunt for missing data is still an art rather than a science. Many professionals don’t know what is available just outside of their scope of interest, and many academic data science programs have limited access to proprietary and sensitive “real-world” data. Instead, they rely on the instructors’ and students’ ability to hunt for public data sources.

This leads to very uneven performance among educational programs: in fact, many early-career professionals are completely unprepared to deal with data of industrial size and complexity.

Three Challenges:

1. It’s not all in your head: multi-disciplinary data analysis of common psychological conditions

Psychological mood disorders are often described as both social and medical phenomena. Recent studies in suicide prevention make connections between mood disorders and patterns in residential power use, logs of phone calls, and purchasing history, while older studies identify certain demographic groups as being more at risk. The problem of screening for and predicting the risk of mood disorders in the general population could have a major impact on population health. Can a combination of EMRs (containing clinical data) and other data sources improve upon current strategies for predicting the individualized risk for developing a mood disorder?

2. Integrated data analysis for early warning of heart/lung failure

The advantages of early diagnosis of serious medical conditions are obvious. Early diagnosis is particularly important for congestive heart failure (CHF) and chronic obstructive pulmonary disease (COPD), as they are among the most common causes of death in the US. They may result from multiple causes and coexist with other complications: in the absence of an early warning model, it is difficult to prioritize testing and allocate resources in the diagnostic process. Both conditions affect millions of patients and are associated with socioeconomic factors: occupational hazards, lifestyle choices, and environmental factors. Using statistical inference on EMRs, we can estimate the chances of a CHF/COPD diagnosis pre-emptively, before it is confirmed by a medical specialist. Can we make our prediction better by using additional information?

3. Indirect data collection to support anti-obesity efforts in healthcare and society

Obesity is a primary population health concern in the US. The contributing factors (diet, inactive lifestyle, role of food in social interactions) are fairly well described, but their role in creating a successful prevention strategy is not fully understood. Analytic efforts to establish links between isolated social and medical factors and obesity are often inconclusive or ineffective. New hope lies in integrated analysis of complete medical and social histories. Using EMRs, we can see patterns that obese patients have in common, infer the risk of obesity from other medical conditions, and find new ways to characterize patients who could be successfully treated. Can we use even more data in the analysis by including non-clinical events? What non-traditional and indirect information about a patient’s background and multi-faceted behavior can be collected to contribute to anti-obesity studies?
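Each of the three challenges ends with the same question: does adding non-clinical data improve a risk estimate built from clinical data alone? A minimal way to pose that comparison is sketched below. It is an illustration under loud assumptions, not the project's method: the records are invented toy tuples, risk is estimated as the observed frequency within each stratum, and the Brier score (mean squared error between predicted risk and outcome) is used as the yardstick. The function names are hypothetical.

```python
from collections import defaultdict

# Invented toy records: (clinical_flag, nonclinical_flag, diagnosed).
records = [
    (1, 1, 1), (1, 1, 1), (1, 0, 0), (1, 0, 1),
    (0, 1, 1), (0, 1, 0), (0, 0, 0), (0, 0, 0),
]


def clinical_only(row):
    """Stratify patients by the clinical feature alone."""
    return (row[0],)


def clinical_plus_nonclinical(row):
    """Stratify patients by the clinical and the non-clinical feature."""
    return (row[0], row[1])


def stratified_risk(rows, key):
    """Estimate P(diagnosed | stratum) as the observed frequency per stratum."""
    counts = defaultdict(lambda: [0, 0])  # stratum -> [diagnosed, total]
    for row in rows:
        stratum = key(row)
        counts[stratum][0] += row[2]
        counts[stratum][1] += 1
    return {s: diagnosed / total for s, (diagnosed, total) in counts.items()}


def brier_score(rows, key, risk):
    """Mean squared difference between predicted risk and outcome (lower is better)."""
    return sum((risk[key(r)] - r[2]) ** 2 for r in rows) / len(rows)


risk_clinical = stratified_risk(records, clinical_only)
risk_blended = stratified_risk(records, clinical_plus_nonclinical)

score_clinical = brier_score(records, clinical_only, risk_clinical)
score_blended = brier_score(records, clinical_plus_nonclinical, risk_blended)
```

One caveat matters for interpreting such a comparison: when both models are scored on the same records used to fit them, the finer stratification can never score worse, so an in-sample improvement says nothing by itself. A real evaluation would score both models on held-out patients and would likely use a proper statistical model rather than raw frequencies.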


Concerns and Strategies

Three main concerns can be raised with this new research approach:


– What about protection of personal data?

The project will take this aspect into serious consideration, following the rules that apply in each relevant jurisdiction.

For example, in Europe, we will comply with the new EU data protection rules, proposed in January 2012 and officially published in the EU Official Journal in May 2016. The Directive that forms part of these rules entered into force on 5 May 2016, and EU Member States will have to transpose it into their national law by 6 May 2018.

We will also follow the ethical principles defined by the “Data for Humanity” initiative, started by Professor Roberto V. Zicari (Goethe University Frankfurt) and Professor Andrej Zwitter (University of Groningen) and signed by more than 1,000 signatories around the world, which encourages people and institutions to use data according to sound principles that serve humanity. More info:


– What if data blending is not producing any significant results?

As with any new research direction, there is no absolute guarantee that we will be able to derive useful insights with data blending. The key to mitigating this risk is to involve domain experts in the medical area of mood disorders in our work from the very start. To that end, we have formed an international Advisory Board composed of domain experts in the medical field and senior experts in data science applied to healthcare.


– What if the results obtained are misused?

Insights obtained by analysing patient data combined with additional data sources could potentially be used (or even misused) in ways that do not conform to ethical principles. To mitigate this risk, we have augmented our Advisory Board with members who are experts in the social, political, and ethical implications of using data. They will provide an effective monitoring and feedback mechanism throughout the entire duration of the project.


Advisory Board

Publications and Reports



Contact Person: Prof. Roberto V. Zicari

Frankfurt Big Data Lab: