Lecture: Introduction to Data Science

(Summer Semester 2017)


Lecturers: Dr. Jochen L. Leidner and Kim Hee

Course start: Monday, 29 May 2017

Time and Location:

Lecture week 1: May 29-31, 2017, 13:00-16:00; Hörsaaltrakt Bockenheim – H IV
Exercise week 1: June 6-7, 2017, 13:00-16:00; Hörsaaltrakt Bockenheim – H IV
Lecture week 2: June 12-14, 2017, 13:00-16:00; Hörsaaltrakt Bockenheim – H IV
Exercise week 2: June 19-20, 2017, 13:00-16:00; Hörsaaltrakt Bockenheim – H IV
Exam: June 30, 14:00-15:30, Hörsaaltrakt Bockenheim – H III
Resit exam (Nachklausur): September 15, 08:30-10:00, Room 501, Robert-Mayer-Str. 10

Language: The lecture is held in English.

Credit Points: Students can receive 5 credit points (CP). Link in QIS/LFS

Assessment: by written exam.

Eligibility: Master's students in Computer Science, Bioinformatics, and Business Informatics (Wirtschaftsinformatik, Vertiefungsbereich Informatik) are encouraged to attend.

Prerequisites: programming skills, knowledge of Python, algorithms and data structures

Course Description: The goal of this compact course is to give participants a first gentle introduction to, and a solid conceptual grounding in, what has been called 'data science', i.e. experimental work that is data-driven and empirical. The focus is on methodology (defining an experimental protocol, devising hypotheses, and thinking about how to measure success), but the course also covers more practical approaches such as basic machine learning methods (both supervised and unsupervised) and natural language processing (such as part-of-speech tagging, named entity recognition/classification/resolution, and parsing), and introduces popular tools. The course also demonstrates some practical applications of the techniques shown and deepens the students' skills via practical exercises.

The lecture is delivered over 4 weeks of calendar time and consists of 2 three-day blocks of 3 hours of lectures per day, each block followed by 2 days of 2.5 hours of exercises/tutorials. It targets Master's-level students. By the end of the course, participants will be able to analyze data sets and to create their own predictive classifiers and visualizations.

Course Schedule (preliminary)


29.05.2017 – 13:00-16:00
  Topics:
    structured and unstructured data
    profiling data sets
  Materials: Notes will be available after the lecture day

30.05.2017 – 13:00-16:00
  Topics:
    hypothesis testing
    descriptive v. predictive analytics
    machine learning I: clustering
  Materials: Notes will be available after the lecture day

31.05.2017 – 13:00-16:00
  Topics:
    machine learning II: classification
    machine learning III: regression
    Web crawling & mining
  Materials: Notes will be available after the lecture day

06.06.2017 – 13:00-16:00
  Topics:
    Exercise 1. getting started
    Exercise 2. profiling data
    Exercise 3. pre-processing data
    Exercise 4. clustering data
    Exercise 5. visualizing data
  Materials: Notes will be available after the lecture day

07.06.2017 – 13:00-16:00
  Topics:
    Exercise 1. classifying data
    Exercise 2. annotating textual data
    Exercise 3. rule-based extraction
    Exercise 4. market basket analysis
  Materials: Notes will be available after the lecture day

12.06.2017 – 13:00-16:00
  Topics:
    experimental protocol
    evaluation measures
    data science tools
  Materials: Notes will be available after the lecture day

13.06.2017 – 13:00-16:00
  Topics:
    inter-rater agreement
    data science economics: value creation
  Materials: Notes will be available after the lecture day

14.06.2017 – 13:00-16:00
  Topics:
    visualization & presentation
    planning your data science project
    data science & ethics
  Materials: Notes will be available after the lecture day

19.06.2017 – 13:00-16:00
  Topics:
    Exercise 1. Project.
    Form groups of at most 3 team members. Think of a group name and register your team on kaggle.com. Your team's challenge is to predict house prices. Build a predictive model and evaluate it using 10-fold cross-validation.
    Exercise 2. Documentation.
    Document your work in a report using the template documentation/idsreport.tex (compile it into a PDF report using the make or pdflatex idsreport command).
  Materials: Notes will be available after the lecture day

20.06.2017 – 13:00-16:00
  Topics:
    Presentation. Present the results, strategy, and lessons learned from your team project.
    (Hint: Start by devising an outline for a 20-minute presentation, and budget your presentation time carefully while you are still authoring your slides.)
  Materials: Notes will be available after the lecture day
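
The project exercise on 19.06.2017 asks teams to evaluate a predictive model using 10-fold cross-validation. The following is a minimal sketch of that procedure in plain Python, not the course's reference solution. It rests on several assumptions: the data is synthetic (one made-up feature, floor area, with noisy prices) rather than the Kaggle data, the model is a single-feature least-squares line, and the helper names k_fold_indices, fit_line and cross_validate are hypothetical. In practice a library such as scikit-learn would probably be used instead.

```python
import random
import statistics


def k_fold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and split them into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x with a single feature."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b


def cross_validate(xs, ys, k=10):
    """Train on k-1 folds, test on the held-out fold; return one RMSE per fold."""
    scores = []
    for fold in k_fold_indices(len(xs), k):
        held_out = set(fold)
        train = [i for i in range(len(xs)) if i not in held_out]
        a, b = fit_line([xs[i] for i in train], [ys[i] for i in train])
        sq_err = [(ys[i] - (a + b * xs[i])) ** 2 for i in fold]
        scores.append(statistics.fmean(sq_err) ** 0.5)
    return scores


# Synthetic stand-in data (assumption): price grows linearly with floor area.
rng = random.Random(42)
areas = [rng.uniform(50, 250) for _ in range(200)]
prices = [50_000 + 2_000 * a + rng.gauss(0, 20_000) for a in areas]

rmse_per_fold = cross_validate(areas, prices, k=10)
mean_rmse = statistics.fmean(rmse_per_fold)
print(f"mean RMSE over {len(rmse_per_fold)} folds: {mean_rmse:.0f}")
```

Each observation is used for testing exactly once and for training nine times, so the mean of the per-fold errors is a less optimistic estimate than the error measured on the training data itself.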


– MOOC on Coursera: Introduction to Data Science in Python by the University of Michigan

– MOOC in Data Science Teaching Initiative

– A good article for beginners: http://dl.acm.org/citation.cfm?id=2347755