Carnegie Mellon University

38615 - Computational Modeling, Statistical Analysis and Machine Learning in Science

The purpose of this course is to provide a practical introduction to the core concepts and tools of machine learning (ML) in a manner that is intuitive and easily understood by STEM students. The course begins by covering fundamental concepts in ML, data science, and modern statistics, such as the bias-variance tradeoff, overfitting, regularization, and generalization, before moving on to more advanced topics in both supervised and unsupervised learning.

Students will choose a large dataset from a selection of biology, chemistry, math, or physics datasets hosted by PSC and use this dataset throughout the MS program. The course topics are taught as students analyze their chosen dataset. Extensive knowledge of Python or another programming language is not a prerequisite, since students will first be given simple scripts to work with and then expand upon. This course is required for students enrolled in the MS program in Data Analytics for Science.

Potential topics include:

  • Efficient data structures (arrays, stacks, queues, lists, trees, heaps, graphs)
  • Data storage, sorting, and searching (binary search trees, hash tables); efficient querying
  • Techniques for handling high-dimensional data (instances with many attributes), including variable selection and dimension reduction, ensemble methods (bagging and boosting)
  • Large-scale search algorithms, intro to databases
  • Model accuracy, prediction accuracy
  • Model selection, dimension reduction, and other high-dimensional considerations
  • Linear and nonlinear models
  • Classification, support vector machines (SVM), kernel methods
  • Decision trees and random forests (RF)
  • Probabilistic methods