Carnegie Mellon University

38614 - Large-Scale Computing in Data Science

This course introduces students to the techniques necessary for manipulating and analyzing big data as encountered in modern scientific computing. This Python-based course covers modern software engineering techniques and tools, as well as data science frameworks, with an emphasis on large-scale problems and computing platforms. It is hands-on and uses the Spark framework for mining large and complex scientific datasets. Students will progress to scalable data analytics and eventually to basic machine learning on scalable platforms such as supercomputers and clouds, culminating in an introduction to the TensorFlow framework for deep learning. Lower-level concepts such as performance optimization and concurrent programming techniques will be introduced along the way. Exercises will be motivated by relevant scientific community datasets. This course is required for students enrolled in the MS program in Data Analytics for Science.

The main topics will include:

  • Introduction to modern software engineering techniques
  • Intro to Big Data with Spark (databases and formats: JSON, HDF5, XML, graph)
  • Intro to Spark data analytics (Clustering)
  • Intro to Dimensionality Reduction with Spark
  • Spark Machine Learning (Recommender system)
  • Cloud Computing (including VMs and containers): AWS, Azure, NCCP
  • HPC Platforms (including GPUs)
  • Manual concurrency with Python MPI
  • Optimization and performance with Python (Cython, profilers, debugging)
  • Introduction to Python alternatives in the sciences (C, C++, Fortran, Java, Julia)