Big Data Analytics – Course (Summer Semester 2014)

Lecturers

Dr. Nikolaos Korfiatis, Todor Ivanov, Sead Izberovic

Announcements

05.05.2014 – From this Wednesday (07.05) onwards, the class will meet in Room 308, Robert-Mayer-Str. 6-8 (3rd floor, entrance from Robert-Mayer-Str. 6 only).

Target Group:

Students who want to learn how to derive insights from vast amounts of data, build innovative tools, and integrate various data sources.

Prerequisites

Although a lot of introductory material will be provided, students need basic knowledge of stochastics and database technology.

Goals:

The objectives of this course are:

To present the basic techniques for extracting information from large datasets such as the web, social-network graphs, and large document repositories.
To introduce students to the theoretical and practical tools and techniques for mining massive datasets through practical applications in predictive analytics.
To help students become familiar with modern data science toolkits and platforms and the “big data” ecosystem.

Course Material:

Course Book

Rajaraman, Anand, and Jeffrey David Ullman. Mining of massive datasets. Cambridge University Press, 2012.

Articles

Lee, K. C., Orten, B., Dasdan, A., & Li, W. (2012). Estimating conversion rate in display advertising from past performance data. In Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 768-776). ACM.
Tagami, Y., Ono, S., Yamamoto, K., Tsukamoto, K., & Tajima, A. (2013). CTR prediction for contextual advertising: learning-to-rank approach. In Proceedings of the Seventh International Workshop on Data Mining for Online Advertising (p. 4). ACM.

Links

You are encouraged to study the resources on ODBMS.org.

Particulars

The class meets every Wednesday at 16.00 in Room 308 (Robert-Mayer-Str. 6-8, 3rd floor, entrance from Robert-Mayer-Str. 6 only!). Note that the schedule content is subject to change.

Exam

A written exam will take place on 16.07.2014, 16.00-17.00, in Room 308. You must register with the Prüfungsamt (examination office); registration can be done online through the QIS system.

Course Schedule

Date	Title	Material
23.04.2014	Introductory concepts: Hadoop, data mining, statistical techniques, predictive analytics, text mining	SoSe 2014 – Lecture 1 General Introduction to Hadoop Link: Jure Leskovec’s Slides (Stanford): Chapters 1 and 2 Video: Professor Zicari’s talk on Big Data Challenges
30.04.2014	Map-Reduce and Distributed File Systems. Distributed file systems: introduction to Hadoop, compute and data nodes, large-scale file system organization. Map-Reduce: Mappers, Reducers and Combiners	Map Reduce Example in Python (From Wakari) SoSe 2014 – Lecture 2
07.05.2014	Practical Hands-on Lab with Hadoop. Your own laptop is required in order to attend. You need to form a group of 3 before this hands-on lab.	Cloudera QuickStart VM Cloudera CDH Documentation Lab Preparation Material (Contains Links to videos and resources)
14.05.2014	Similarity Mining. Applications of nearest neighbor search, k-item sets	SoSe 2014 – Lecture 3 IPython Notebook with the examples Link: Rajaraman’s slides (Stanford) – Clustering
21.05.2014	Similarity Mining. Practical hands-on lab with Hadoop and Apache Mahout. Application to document mining, application to financial news using the Reuters NIST corpus	SoSe 2014 – Lecture 4 Link: Mahout in Action – Working with the Reuters data Link: Apache Mahout (Apache Foundation) Note: Apache Mahout is already included in the Lab Virtual Machine
28.05.2014	Frequent Itemsets. The market-basket model and frequent item sets, association rule mining, the market-basket problem on big data sets, memory-based and limited-pass algorithms	SoSe 2014 – Lecture 5 Link: IPython Notebook
04.06.2014		Apache Hive: Language Manual SoSe 2014 – Lecture 6 – Hive
11.06.2014	Link Analysis and PageRank – 1. Search engine essentials, PageRank computation, topic-specific models for PageRank	Link: Jure Leskovec’s Slides-Pagerank 1 IPython Notebook – Power Iteration
18.06.2014	Link Analysis and PageRank – 2. Link spam and TrustRank, Hubs and Authorities (HITS)	Link: Jure Leskovec’s Slides-Pagerank 2 SoSe 2014 – Lecture 7
25.06.2014	Hands-on Session: Gephi
02.07.2014	Application workshop 1: Recommender Systems. The recommendation problem: utility matrix and the long tail, content-based recommendations, collaborative filtering and dimensionality reduction, applications on the Netflix challenge corpus.
09.07.2014	Application workshop 2: Attribution in Advertising. The problem of channel and referral attribution, the TURN model, direct and indirect placement of ads, CPC computation and bidding models, matching algorithms for displaying ads, AdWords and competitive ratio for balance, application with the Amazon Ad-attribution corpus.
16.07.2014	Exam

Additional Material (in German) with exercises and Slides (Reading is Mandatory!)

Big Data Engineering Lecture – Lars George (EMEA Chief Architect @ Cloudera)

Books on Hadoop