Carnegie Mellon University

Foundations of Data Science

Course Number: 47843

Modern data in many Machine Learning applications, Recommendation Systems, Neural nets, Ranking, as well as other areas consists of vectors with many components. The abstraction of data to high dimensional vectors is not just a book-keeping device, but is fundamental to our understanding of the structure, properties and uses of the data. There are three types of lenses with which properties of a large collection of high dimensional vectors can be studied and understood – Geometric, Linear Algebraic and Statistical. The purpose of the course is to introduce each of these three. While applications will motivate our study, the focus is on mathematical properties and not specific applications. We will start with the geometric view. We will look at standard geometric measures like volume and surface area and see that they have very interesting properties in high dimensions, rather different from our understanding from two or three dimensions. An important tool here will be the multi-variate Gaussian distribution, which is a central modeling tool; we will introduce, analyze and use it. The Linear Algebraic view will be discussed mainly by the study (from first principles) of Principal Component Analysis which is a very versatile and widely used tool for data analysis. The algebraic properties of the singular vectors and values, the greedy algorithm for finding them and the basic power method will be discussed. The third main point we will see is the use of random sampling. Modern data, besides being multi-dimensional, also can be massive. This presents computational challenges. Random sampling is a standard technique to draw a sub-sample of the data and run algorithms only on the sample. We will study the uses of sampling for massive matrix algorithms, where one samples rows/columns using certain probabilities to get a smaller matrix and computes with it. The question is what probability distribution is both computationally efficient as well as provably faithful to the data. We will see an answer to this question.

Degree: PhD
Concentration: Operations Research
Academic Year: 2023-2024
Semester(s): Mini 2
Required/Elective: Elective
Units: 6

Format

Lecture: 100min/wk and Recitation: 50min/wk

Textbook(s):

Foundations of Data Science: Avrim Blum, John Hopcroft, Ravindran Kannan