Carnegie Mellon University

38611 - Introduction to Large-Scale Computing in Science

This course builds upon the skills acquired in the preceding mini-semester, Modern Programming for Data Scientists. It introduces students to the techniques needed to manipulate and analyze the “big data” encountered in modern scientific computing, and to use the large computational platforms involved in those processes. The course uses the Python-friendly Spark framework to scale up the capabilities introduced in the previous mini. Starting with the mining of large and complex scientific datasets, students will progress to scalable data analytics and, eventually, very basic machine learning on large scalable platforms such as supercomputers and clouds. Lower-level concepts such as performance optimization and concurrent programming techniques will also be introduced along the way. Exercises will be motivated by relevant datasets from the scientific community. This course is required for students enrolled in the MS program in Data Analytics for Science.

The main topics will include:

  • Intro to Big Data with Spark (Databases and formats: JSON, HDF5, XML, graph)
  • Intro to Spark data analytics (Clustering)
  • Intro to Dimensionality Reduction with Spark
  • Spark Machine Learning (Recommender system)
  • Cloud Computing (including VMs and containers): AWS, Azure, NCCP
  • HPC Platforms (including GPUs)
  • Manual concurrency with Python MPI
  • Optimization and performance with Python (Cython, profilers, debugging)
  • Introduction to Python alternatives in the sciences (C, C++, Fortran, Java, Julia)