Carnegie Mellon University

38610 - Modern Programming for Data Scientists

A hands-on introduction to the fundamentals of Python programming for data science, intended for students with minimal or no programming experience. Students will learn by working on scientific problems with scientific datasets. The Python data science ecosystem includes easy-to-use packages for working with data and is the foundation for most deep learning frameworks, which will be used in subsequent courses. Students will develop skills in object-oriented programming in Python3; using packages for working efficiently with scientific data; customizing their environment; using Anaconda; developing electronic notebooks for reusing and sharing code; reading data specific to the sciences (Biology, Chemistry, Math, or Physics); improving the efficiency of Python code; and visualizing data. At the end of the course, students will have the skills to design and deploy a Python-based data science solution for a small scientific challenge. This course is required for students enrolled in the MS program in Data Analytics for Science.

The main topics will include:
  • Brief introduction to Python3 and the development ecosystem
  • Introduction to object-oriented and procedural programming models and basic software architecture principles
  • Professional programming techniques for modern software development: version control and team development (Git and GitHub), coding standards, unit and regression testing (pytest), and continuous integration (Travis CI)
  • Introduction to R and RStudio
  • Developing reusable, sharable, and interactive electronic notebooks with Jupyter
  • Python environment management: virtualenv and Anaconda
  • Fundamentals of data structures and their implementation in Python
  • Python packages for science and data science: NumPy, SciPy, pandas, statsmodels
  • Data processing techniques for small, medium and large datasets
  • Manual and programmatic metadata standards
  • Data Analytics with Python: Optimization, Linear and Non-Linear Regression, Mathematical Modeling, Monte Carlo Sampling, Distributions, and Clustering
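
To give a flavor of two of the topics listed above, Monte Carlo sampling and unit testing, here is a minimal sketch. It is illustrative only and is not taken from the course materials: the function name `estimate_pi`, its parameters, and the test tolerance are all assumptions chosen for the example. It uses only the Python standard library.

```python
import math
import random


def estimate_pi(n_samples: int, seed: int = 0) -> float:
    """Estimate pi by sampling points uniformly in the unit square.

    Hypothetical example function; not part of any course material.
    """
    if n_samples <= 0:
        raise ValueError("n_samples must be positive")
    rng = random.Random(seed)  # seeded generator so runs are reproducible
    inside = 0
    for _ in range(n_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:  # point falls inside the quarter circle
            inside += 1
    # The quarter circle covers pi/4 of the unit square's area.
    return 4.0 * inside / n_samples


def test_estimate_pi_is_close():
    # pytest-style check with a loose tolerance, since the estimate is random
    assert abs(estimate_pi(100_000) - math.pi) < 0.05
```

Saved in a file such as `test_pi.py` (a hypothetical name), the test function would be discovered and run by the `pytest` command. Seeding the generator keeps the result repeatable from run to run, which is what makes a sampling-based function practical to test.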