Geisinger Health System teams up with the Sutardja Center for Entrepreneurship & Technology (SCET) at UC Berkeley to offer the Collider Project.
(Project Scientific Mentor: Roberto V. Zicari)
Twenty PhD and master's students from UC Berkeley, Stanford University and Goethe University Frankfurt committed to participate in the project in teams of two.

Teams have been asked to address one of three problems related to multidisciplinary data analysis of obesity, heart/lung failure and mood disorders.

Berkeley team wins Geisinger Health Collider: link to the news



WHO WE ARE: Geisinger Health System is known as an early adopter of modern paradigms of healthcare and medical informatics. The Data Science team at Geisinger combines the best practices of machine learning to support decision-making in healthcare. In 2014 we completed over 20 research projects using Electronic Medical Records (EMRs) to predict personalized treatment outcomes, to better allocate healthcare resources, and to obtain early warnings in crisis scenarios. The team is focused on research that lies at the interface of clinical medicine and applied mathematics & computer science but is also interested in all forms of multidisciplinary studies centered on effective utilization of data in healthcare. Academic outreach is a part of our mission: we maintain collaborations with university researchers at all levels. For students and early-career professionals, we believe in providing opportunities for immersive training experiences using real-world, unrefined data to address practical, patient-centric problems.

PROPOSAL SUMMARY: Through the Collider Project, we invite teams of college students to participate in a competition of projects focused on blending clinical data with additional, novel types and sources of data to improve the quality of patient-centric healthcare analytics. Each team will have 2 participants. A team will select one of several proposed projects aimed at improving some dimension of the quality of healthcare delivery (e.g. improving the well-being, safety and informed freedom of choice in the interactions between society, individual and a healthcare provider). We have selected three challenges from which participants can choose, each of which traditionally has been approached using a core set of clinical data in conjunction with a standard, conservative analytic strategy. Arguably, each of these analyses has fallen short of its maximal potential because it has not adequately addressed the traditionally non-clinical (molecular, socioeconomic, behavioral, etc.) dimensions of these complex problems. Our goal is to encourage the participants in this project to improve the quality of these analytics by blending traditional and nontraditional data sources and by adopting an analytic strategy that incorporates novel approaches.

For each challenge we will provide a core set of clinical data derived from the EMR. We challenge the teams to find additional, nonclinical data to supplement our core data set, to resolve fundamental and technical difficulties of data blending, and to demonstrate the effectiveness of multi-disciplinary data use and analysis. This may result in “obtaining a better answer” or “answering a better question,” depending on each team’s individual take on the question being asked. Teams will have multiple opportunities to interact with Geisinger data scientists and will receive access to appropriate data sets and software tools. The final product will be an academic report accompanied by a body of transformed/blended data.

We expect that for the basic level (BS degrees, independent studies) this will be mostly an effort in data acquisition, cleanup, and blending followed by more traditional analytics. For intermediate students (early graduate studies) and advanced students (MS degree or higher) we expect increasingly more complex strategies for data integration and analysis that produce study results at a higher level of scientific rigor. The reports will be evaluated based on creativity, scientific rigor, and on how well the achieved results match the stated objectives. Different teams can address the same question and cite each other’s work if they wish: we will evaluate each effort separately. The goal of this exercise is not to determine who is the best coder or the best mathematician. While some coding will be required, some guidance and technical assistance will also be provided by our team. Ultimately our goal is to allow students to apply their individual skills, backgrounds, interests, and talents to the task of data blending, hypothesis generation, and data analysis. The analyses will be intended to answer specific medical questions, and so the results may have immediate tangible impact on this field.

REWARD: The winning team will be offered 3-month summer internships at Geisinger Data Science, where they will work on a theme of their choosing and will be closely supported by the department faculty.

ELIGIBILITY: The competition is open to students of all fields, and is not limited to mathematics and computer science majors. Arguably, every modern profession has elements of data analysis, and we expect the participants to use insights from their fields to look for additional data and suggest new analysis strategies. Seniority/academic standing is also not a limiting factor for participation.

BACKGROUND AND IMPACT: Data integration, or “blending,” is defined as combining data from multiple sources resulting in a unified view of the transformed data. Blended data sets can be largely analogous (same patient features, different clinics), or come from very different domains (EMRs, financial transactions, extracts from social media, academic records, criminal records, voting turnouts, etc). The sets will not overlap over individuals, but they should be related to the same general population or phenomena.

The technical process of data integration can be laborious, but it is relatively straightforward in its implementation. It may involve conversion between formats, filtering, recovery of missing portions of data, and elements of mathematical modelling to formalize mappings between the data sets. However, the questions of what data to look for, and how best to use it, depend on the applied problem and cannot be formalized. Resolving these issues requires a good understanding of the involved disciplines and a creative effort to construct a novel solution. Establishing feedback is also important: in multi-source data studies, one may have to restate the problem and re-acquire data as the analysis evolves.
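As an illustration of the mechanical side of blending described above, the following is a minimal sketch in Python, not a method prescribed by the project. Because the data sets do not overlap over individuals, it assumes they can be linked through a shared region code. All records, field names (zip3, bmi, median_income) and values are invented for the example.

```python
from statistics import median

# Hypothetical clinical rows (invented values), keyed by a region code.
clinical = [
    {"patient_id": 1, "zip3": "178", "bmi": 31.2},
    {"patient_id": 2, "zip3": "178", "bmi": 24.8},
    {"patient_id": 3, "zip3": "947", "bmi": 27.5},
    {"patient_id": 4, "zip3": "100", "bmi": 29.1},
]

# Hypothetical public, area-level data; income is stored as formatted text.
public = [
    {"zip3": "178", "median_income": "48,500"},
    {"zip3": "947", "median_income": "81,000"},
]


def parse_income(text):
    """Format conversion: turn a string such as '48,500' into 48500.0."""
    return float(text.replace(",", ""))


def blend(clinical_rows, public_rows):
    """Attach area-level income to each patient row by region code.

    Regions missing from the public data receive the median of the known
    regions, and the row is flagged so the imputation stays visible.
    """
    income_by_region = {
        row["zip3"]: parse_income(row["median_income"]) for row in public_rows
    }
    fallback = median(income_by_region.values())
    blended = []
    for row in clinical_rows:
        known = row["zip3"] in income_by_region
        blended.append({
            **row,
            "median_income": income_by_region.get(row["zip3"], fallback),
            "income_imputed": not known,
        })
    return blended


blended = blend(clinical, public)
for row in blended:
    print(row)
```

Median imputation is only one possible way to recover missing portions of data; keeping an explicit flag lets a later analysis exclude or down-weight the imputed rows.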

Modern data science is driven by looking for evidence in new ways and from new sources. But does using a larger volume and variety of data result in an improved understanding of the underlying questions? This question remains to be answered in detail, and establishing better practices of data blending and then studying the downstream consequences will be beneficial for medical informatics and data science in general. The hunt for missing data is still an art rather than a science. Many professionals don’t know what is available just outside of their scope of interest, and many academic data science programs have limited access to proprietary and sensitive “real-world” data. Instead, they rely on the instructors’ and students’ ability to hunt for public data sources. This leads to very uneven performance of educational programs: in fact, many young career professionals are completely unprepared for dealing with data of industrial size and complexity. The proposed healthcare-academia immersive competition will contribute to the quality of modern data science education and generate a pool of talent in data acquisition and integration for multidisciplinary research.

TIMELINE: The project will have two phases: Phase 1 will take place during the Fall 2015 academic semester and Phase 2 will take place during the Spring 2016 term. The proposed timeline and details of these stages are:

Phase 1:
A) Overview: Teams will have a choice of one of three problems. These will be introduced and the project details will be reviewed in an introductory lecture. Teams will then work independently to identify novel data sources, to refine the hypothesis, and to formulate a tentative strategy for data blending and subsequent analysis. At this stage, a premium is placed on creativity, although the practical achievability of the project must also be considered. A data dictionary for the clinical data set will be provided. Teams are reminded that the main body of work must be based on data that is tangible and publicly available.

B) Deliverable: a 3-5 page summary of proposed work that contains the following information:

a. Required

i. clearly stated objectives, including a well-formulated hypothesis or research question
ii. a list of the additional data sets that will be combined with the clinical data
iii. an explanation of relevance of the additional data
iv. a general description of data integration process
v. a general description of the process for analyzing the integrated data set

b. Encouraged

i. references to similar multidisciplinary efforts
ii. justification of used metrics of distance and quality (i.e. “how will you know that your outcome is ‘better’?”)
iii. an in-depth description of data sets [this part does not count for the page limit].

C) Timeline:

a. Kickoff Lecture: October 15, 2015
b. Duration: 4-6 weeks
c. Due Date: Approximately Nov 13

Between Phases: The Geisinger Data Science team will review the proposals, compile data resources, and build some preliminary code bases as needed to help students get started on Phase 2.

Phase 2:
A) Overview: This phase is modeled after several successful Geisinger-university collaborations. Students will spend most of the time actually doing the work that they proposed and outlined in Phase 1. They will gather their data sets and integrate them with the base clinical data set provided by the data science team. They will have some standard code bases provided by the data science team from which to build their integration and analysis strategies. They may refine their objectives and strategies of analysis and implement them using easy-to-acquire, easy-to-use methods and software. Throughout these 8 weeks they can communicate with the data scientists from Geisinger and/or with whatever other data scientists, mathematicians, computer scientists, and domain experts they like. The Geisinger team will provide some guidance and support, including: assistance in acquisition and interpretation of public data sets, access to additional code bases, and, most importantly, access to core clinical data sets relevant for the project. Additional clinical data pulls may be possible, but these must be requested early to allow us time to process them.

Phase 2 will last approximately 8 weeks. If there is interest, the data science team can be available for 1-2 days on site at Berkeley around the middle of this timeline to provide hands-on assistance with the technical components as needed. The challenge will conclude after 8 total weeks with another opportunity for in-person, last-minute assistance, immediately followed by closing of the challenge and final submission.

B) Final deliverable: A 5-15 page document accompanied by additional data used in analysis. It must contain:

i. A discussion of final objectives, including rationale for any revisions
ii. An explanation of the relevance of the data added by the group
iii. A discussion of the methods of data blending
iv. A discussion of the methods of data analysis
v. Numerical or narrative evidence for the improvement in quality or understanding provided by using additional multidisciplinary data.

C) Timeline:

a. Kickoff lecture & live assistance workshop: Feb 12
b. Mid-point live assistance workshop: March 11
c. Final assistance session: April 7
d. Final due date: April 8
e. Winner notified / feedback to groups: April 15


Nicholas Marko, MD: Project lead, lecturer, responding to inquiries, workshop coordinator

Oleg Roderick, PhD: Mathematics lead, responding to inquiries, workshop participant

David Sanchez, MS: Programming lead, responding to inquiries, workshop participant

Joe Klobusicky, PhD: Mathematician, responding to inquiries, workshop participant

Arun Aryasomayajula: Programmer, responding to inquiries, workshop participant


1. Integrated data analysis for early warning of heart/lung failure

The advantages of early diagnostics of serious medical conditions are obvious. This is particularly important for common conditions such as congestive heart failure (CHF) and chronic obstructive pulmonary disease (COPD), as they are among the most common causes of death in the US. They may result from multiple causes and exist together with other complications: in the absence of an early warning model, it is difficult to prioritize testing and allocate resources in the diagnostic process. Both conditions affect millions of patients, and are associated with socioeconomic factors: occupational hazards, lifestyle choices, environmental factors. Using statistical inference on EMRs, we can estimate the chances of a CHF/COPD diagnosis pre-emptively, before it is confirmed by a medical specialist. Can we make our prediction better by using additional information?
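One way to make the question "is the prediction better?" concrete is to score a baseline model and a blended-data model on the same held-out patients and compare a ranking metric. The sketch below computes the area under the ROC curve (AUC) in plain Python. AUC is only one possible choice of quality metric, not one prescribed by the challenge, and the labels and risk scores are invented for illustration.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    Returns the fraction of (positive, negative) pairs in which the positive
    case receives the higher score; ties count as half a win.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Invented held-out outcomes: 1 = later diagnosed with CHF/COPD, 0 = not.
labels = [1, 0, 1, 0, 0, 1]

# Invented risk scores from a hypothetical EMR-only model and a hypothetical
# model that also uses blended non-clinical data.
baseline_scores = [0.6, 0.5, 0.4, 0.3, 0.7, 0.8]
blended_scores = [0.7, 0.4, 0.6, 0.2, 0.5, 0.9]

baseline_auc = auc(baseline_scores, labels)
blended_auc = auc(blended_scores, labels)
print(f"baseline AUC: {baseline_auc:.3f}")
print(f"blended AUC:  {blended_auc:.3f}")
print(f"improvement:  {blended_auc - baseline_auc:.3f}")
```

A real comparison would need far more patients, scores produced on data not used for fitting, and some estimate of uncertainty (for example, a bootstrap interval) before concluding that the additional data helped.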

2. Indirect data collection to support anti-obesity efforts in healthcare and society

Obesity is a primary population health concern in the US. The contributing factors (diet, inactive lifestyle, role of food in social interactions) are fairly well-described, but their role in creating a successful prevention strategy is not fully understood. Analytic efforts establishing links between isolated social and medical factors and obesity are often inconclusive or ineffective. New hope lies in integrated analysis of complete medical and social histories. Using EMRs, we can see patterns that obese patients have in common, infer risk of obesity from other medical conditions, and also find new ways to characterize patients who could be successfully treated. Can we use even more data in analysis by including non-clinical events? What non-traditional and indirect information about a patient’s background and multi-faceted behavior can be collected to contribute to anti-obesity studies?

3. It’s not all in your head: multi-disciplinary data analysis of common psychological conditions

Psychological mood disorders are often described as both social and medical phenomena. Recent studies in suicide prevention make connections between mood disorders and patterns in residential power use, logs of phone calls, and purchasing history, while older studies identify certain demographic groups as being more at risk. The problem of screening for and predicting the risk of mood disorders in the general population could have a major impact on population health. Can a combination of EMRs (containing clinical data) and other data sources improve upon current strategies for predicting the individualized risk for developing a mood disorder?


Project Kick-off Date: October 15, 2015

Geisinger Health System and UC Berkeley Collide

Collider Project allows students to collaborate with an industry partner on data blending.

Published in Clinical Informatics News: