Carnegie Mellon University

Algorithms for Massive Data Science

Course Number: 47842

Data sets have grown to an astonishing size, and companies like Yahoo, Google and Twitter process up to a petabyte of data every day. Processing large data sets offers an opportunity to discover more and better information, but current computational methods frequently fail to scale to data of this size. As a result, new methods are being developed to handle the ever-increasing size of data sets.

This course will focus on a subset of recent algorithmic developments for handling large data. The course will introduce algorithms for massively distributed models of computation. These models are designed to capture frameworks such as Spark, Hadoop and MapReduce, which are used to deploy algorithms in a data center. The course will also cover streaming algorithms, an alternative approach in which the methods do not store the entire data set in memory.
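
As an illustration of the streaming idea, the sketch below uses reservoir sampling, a standard textbook technique that keeps a uniform random sample of k items from a stream while storing only k items at a time. It is an assumed example and is not taken from the course syllabus; the function name `reservoir_sample` and its `seed` parameter are hypothetical choices made for this illustration.

```python
import random
from typing import Iterable, List, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], k: int, seed: int = 0) -> List[T]:
    """Return a uniform random sample of up to k items from a stream.

    Only k items are held in memory, however long the stream is.
    The name and the seed parameter are illustrative assumptions.
    """
    rng = random.Random(seed)
    reservoir: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Keep the new item with probability k / (i + 1),
            # replacing a uniformly chosen slot.
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


# The stream is consumed one item at a time and never materialized as a list.
sample = reservoir_sample(iter(range(100_000)), k=5)
```

A single pass over the data with memory proportional to k rather than to the stream length is the kind of trade-off streaming algorithms aim for.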

This will be a primarily theoretical course focusing on algorithm design. Applications discussed will be motivated by real-world challenges.

Degree: PhD
Concentration: Operations Research
Academic Year: 2020-2021
Semester(s): Mini 4
Required/Elective: Elective
Units: 6


Lecture: 100 min/wk; Recitation: 50 min/wk