Carnegie Mellon University

Core Ph.D. Requirements

Deviation from these requirements will only be allowed in exceptional cases, and with the approval of the Director of Graduate Studies. With a few exceptions, students in joint degree programs are also required to satisfy all of these requirements.

The Path to the Ph.D.

The following template shows the typical schedule a student follows to complete the coursework requirements and commence thesis research within the first two years.

Year 1

Fall

  • 36-699: Immigration to Statistics
  • 36-700: Probability and Mathematical Statistics OR 36-705: Intermediate Statistics
  • 36-707: Regression Analysis
  • 36-750: Statistical Computing
Spring

  • 36-709: Advanced Statistics I
  • 36-757: Advanced Data Analysis I
  • 36-708: Statistical Machine Learning or Methods Minis
  • Begin Work on ADA Project

Year 2

Fall

  • 36-758: Advanced Data Analysis II (Interdisciplinary Applied Research)
  • Complete oral and written presentations of ADA Project
  • Electives, e.g. 36-710: Advanced Statistical Theory / Probability Theory
Spring

  • Finalize Thesis Advisor and Topic
  • Begin Elective Coursework*

Year 3

Prepare and deliver your thesis proposal.

Year 4 and beyond

Dedicated to dissertation research.

*Note: After the first three semesters, students often take electives on advanced statistics, machine learning, or domain-specific topics. Each semester also features half-semester courses ("minis") that cover exciting topics in the field.

Program Requirements and Course Descriptions

Year 1 - Fall

Students are introduced to the faculty and their interests, the field of statistics, and the facilities at Carnegie Mellon.

Each faculty member gives at least one elementary lecture on some topic of his or her choice. In the past, topics have included: the field of statistics and its history, large-scale sample surveys, survival analysis, subjective probability, time series, robustness, multivariate analysis, psychiatric statistics, experimental design, consulting, decision-making, probability models, statistics and the law, and comparative inference.

Students are also given information about the libraries at Carnegie Mellon and current bibliographic tools. In addition, students are instructed in the use of the Departmental and University computational facilities and available statistical program packages.

 

This course covers the basics of statistics. We will first provide a quick introduction to probability theory, and then cover fundamental topics in mathematical statistics such as point estimation, hypothesis testing, asymptotic theory, and Bayesian inference. If time permits, we will also cover more advanced topics, including nonparametric inference, regression, and classification. Prerequisites: one- and two-variable calculus and matrix algebra.

This course covers the fundamentals of theoretical statistics. Topics include: probability inequalities, point and interval estimation, minimax theory, hypothesis testing, data reduction, convergence concepts, Bayesian inference, nonparametric statistics, bootstrap resampling, VC dimension, prediction and model selection.
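To give a concrete flavor of one listed topic, bootstrap resampling, here is a minimal sketch in Python. This is an illustration only, not course material; the function name and data are invented for this example.

```python
import random
import statistics

def bootstrap_ci(data, stat=statistics.mean, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for a statistic."""
    rng = random.Random(seed)
    # Recompute the statistic on many resamples drawn with replacement.
    reps = sorted(
        stat(rng.choices(data, k=len(data)))
        for _ in range(n_boot)
    )
    lo = reps[int((alpha / 2) * n_boot)]
    hi = reps[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

sample = [2.1, 2.4, 1.9, 2.8, 2.3, 2.6, 2.0, 2.5]
print(bootstrap_ci(sample))  # approximate 95% CI for the mean
```

The percentile method shown here is the simplest bootstrap interval; the course treats such procedures, and their theoretical justification, in far greater depth.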

This course covers:

  • The basic principles of causality.
  • Foundations of linear regression, including theory, computation, diagnostics, and generalized linear models.
  • Extensions to nonparametric regression, including splines, kernel regression, and generalized additive models.
  • Discussion of tools to compare statistical models, including hypothesis tests, cross-validation, and bootstrapping.
  • Topics in nonparametric regression and machine learning as time permits, such as regression trees, boosting, and random forests.
  • Emphasis on writing data analysis reports that answer substantive scientific questions with appropriate statistical tools.

Students will be equipped with the tools needed to explore a substantive scientific question with data, translate scientific questions into statistical questions, compare different modeling approaches rigorously, and write their results in a clear manner.
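The model-comparison theme above can be illustrated with a toy cross-validation in Python. This is a generic sketch with invented function names, not part of the course: it compares an intercept-only predictor against simple linear regression by 5-fold cross-validated mean squared error.

```python
import random

def fit_mean(xs, ys):
    """Intercept-only model: predict the training mean everywhere."""
    m = sum(ys) / len(ys)
    return lambda x: m

def fit_line(xs, ys):
    """Simple linear regression via the closed-form least-squares solution."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    b = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / \
        sum((x - xbar) ** 2 for x in xs)
    a = ybar - b * xbar
    return lambda x: a + b * x

def cv_mse(fit, xs, ys, k=5):
    """k-fold cross-validated mean squared error of a fitting procedure."""
    folds = [range(i, len(xs), k) for i in range(k)]
    total, count = 0.0, 0
    for fold in folds:
        hold = set(fold)
        tr_x = [x for i, x in enumerate(xs) if i not in hold]
        tr_y = [y for i, y in enumerate(ys) if i not in hold]
        f = fit(tr_x, tr_y)
        for i in hold:
            total += (ys[i] - f(xs[i])) ** 2
            count += 1
    return total / count

rng = random.Random(1)
xs = [i / 10 for i in range(50)]
ys = [1.0 + 2.0 * x + rng.gauss(0, 0.3) for x in xs]
print(cv_mse(fit_mean, xs, ys), cv_mse(fit_line, xs, ys))
```

On data with a genuine linear trend, the linear fit attains a much smaller cross-validated error, which is the kind of rigorous model comparison the course develops.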

A detailed introduction to elements of computing related to statistical modeling, targeted to Ph.D. students and master's students in Statistics & Data Science.

Topics include important data structures and algorithms; numerical methods; databases; parallelism and concurrency; and coding practices, program design, and testing.

Multiple programming languages will be supported (e.g., C, R, Python). Those with no previous programming experience are welcome but will be required to learn the basics of at least one language via self-study.

Year 1 - Spring

This is a core Ph.D. course in theoretical statistics. The class will cover a selection of modern topics in mathematical statistics, focusing on high-dimensional parametric models and non-parametric models. The main goal of the course is to provide students with adequate theoretical background and mathematical tools to read and understand the current statistical literature on high-dimensional models.

Topics will include: concentration inequalities, covariance estimation, principal component analysis, penalized linear regression, maximal inequalities for empirical processes, Rademacher and Gaussian complexities, non-parametric regression and minimax theory. This will be the first part of a two semester sequence.
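To give a flavor of the first listed topic, a canonical concentration inequality is Hoeffding's inequality: if X_1, ..., X_n are independent with a_i <= X_i <= b_i almost surely, then for every t > 0,

```latex
\Pr\left( \left| \frac{1}{n}\sum_{i=1}^{n} \bigl(X_i - \mathbb{E}[X_i]\bigr) \right| \ge t \right)
\;\le\; 2\exp\left( -\frac{2 n^2 t^2}{\sum_{i=1}^{n} (b_i - a_i)^2} \right)
```

Bounds of this type, which control how far sample averages stray from their expectations, underpin much of the high-dimensional theory the course covers.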

Advanced Data Analysis (ADA) is a Ph.D. level seminar on advanced methods in statistics, including computationally intensive smoothing, classification, variable selection and simulation techniques.

During 36-757, you work with the seminar instructor to identify an ADA project for yourself. The ADA project is an extended project in applied statistics, done in collaboration with an investigator from outside the Department, under the guidance of a faculty committee, culminating in a publishable paper that is presented orally and in writing in 36-758.

This course focuses on statistical methods for machine learning, a decades-old topic in statistics that now has a life of its own, intersecting with many other fields. While the core focus of this course is methodology (algorithms), the course will have some amount of formalization and rigor (theory/derivation/proof), and some amount of interacting with data (simulated and real). However, the primary way in which this course complements related courses in other departments is the joint ABCDE focus on (A) Algorithm design principles, (B) Bias-variance thinking, (C) Computational considerations, (D) Data analysis, and (E) Explainability and interpretability.


Year 2 - Fall

Advanced Data Analysis (ADA) is a Ph.D. level seminar on advanced methods in statistics, including computationally intensive smoothing, classification, variable selection, and simulation techniques.

In 36-758, students carry out the ADA project identified with the seminar instructor during 36-757: an extended project in applied statistics, done in collaboration with an investigator from outside the Department, under the guidance of a faculty committee, culminating in a publishable paper that is presented orally and in writing.

The project ends with a 15-page internal report describing the work, together with a 25-minute presentation to the Department, with additional time allotted for questions and answers. Developing the ADA report into a publishable paper is at the discretion of the ADA advisors and the student, and may extend up to one or two additional semesters depending on the scope and nature of the research.

Ph.D. students can customize their studies by choosing from a variety of elective courses. These options allow for deeper exploration of specific research areas or the development of skills that align with their academic and career goals.

Year 2 - Spring

The student determines the research topic they will pursue for their Ph.D. thesis and which faculty will serve as their advisor or co-advisor(s) to carry out this work.

 


Elective Coursework

Note: Electives can change from year to year.

The class will cover a selection of modern topics in mathematical statistics, focusing on high-dimensional parametric models and non-parametric models. The main goal of the course is to provide students with adequate theoretical background and mathematical tools to read and understand the current statistical literature on high-dimensional models. Topics will include: concentration inequalities, covariance estimation, principal component analysis, penalized linear regression, maximal inequalities for empirical processes, Rademacher and Gaussian complexities, non-parametric regression and minimax theory.

This course will include (a) fundamental statistical issues that arise frequently, and (b) timely responses to issues raised in student research and lab meetings. The first, (a), will involve particular topics with assigned reading. Prior to class, students will be required to post a comment or question on a course discussion board, and these posts will furnish the basis for class discussion. For (b), students will also have the opportunity to ask questions about anything that has come up in their reading and research.

Examples of topics in part (a) include:

  • Ten simple rules for effective statistical practice.
  • What we mean by "random."
  • The often-overlooked yet crucial statistical assumption of independence.
  • The reason the normal distribution is so important (it's not about data).
  • The enduring lessons of statistical theory.
  • How bias can invalidate conclusions, while sometimes being helpful.
  • The most common difficulty with reported p-values, and why many concerns about p-values are largely misplaced.
  • The fundamental distinction between effects of causes and causes of effects.
  • The ways that correlation is more subtle than is usually appreciated.
  • Why regression is the most important method in statistics.
  • The situations in which regression can be treacherous.
  • What Bayesian methods can and cannot achieve.
  • The goals, successes, and perils of machine learning.

Designed experiments are crucial to draw causal conclusions with minimum expense and maximum precision. This course introduces the basic principles and theory of experimental design, including randomized designs, blocking, analysis of covariance, factorial designs, and power analysis, with an emphasis on recent techniques often applied to the online experiments frequently used by tech companies. We will emphasize the importance of critical thinking about the goals and context of an experiment to choose the best design, and practice these skills through a course project. Coursework will primarily use R for the analysis of experimental data.
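Power analysis, one of the listed topics, can also be approached by simulation. The course itself uses R, but the idea translates directly; below is a minimal Python sketch (the function name and parameter values are invented for this illustration) that estimates the power of a two-sided two-sample z-test for a completely randomized two-arm experiment.

```python
import random
import statistics

def simulated_power(effect, n_per_arm, sd=1.0, alpha=0.05, n_sim=500, seed=0):
    """Monte Carlo estimate of the power of a two-sided two-sample z-test
    for a completely randomized two-arm experiment (a generic sketch)."""
    rng = random.Random(seed)
    z_crit = statistics.NormalDist().inv_cdf(1 - alpha / 2)  # ~1.96 for alpha=0.05
    se = sd * (2 / n_per_arm) ** 0.5  # standard error of the difference in means
    rejections = 0
    for _ in range(n_sim):
        control = [rng.gauss(0.0, sd) for _ in range(n_per_arm)]
        treated = [rng.gauss(effect, sd) for _ in range(n_per_arm)]
        z = (statistics.mean(treated) - statistics.mean(control)) / se
        rejections += abs(z) >= z_crit
    return rejections / n_sim

# Larger samples or larger effects push the estimated power toward 1.
print(simulated_power(effect=0.5, n_per_arm=64))
```

Simulation-based power calculations like this are flexible enough to handle the blocking and covariance-adjustment designs the course discusses, where closed-form power formulas may not exist.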

This course explores the role of stochastic models in scientific research, particularly in understanding neural systems. Through case studies, students will examine how statistical models, such as Poisson processes and Bayesian frameworks, are used to describe neural behavior in sensory processing, learning, and motor control.

The course emphasizes the importance of mathematical theories in making sense of complex neural functions, aiming to equip students with the tools to mathematically formalize and investigate neural computations.

The mini courses vary each year, but topics that have been covered include: Network Theory, the Central Limit Theorem, Bayesian Statistics, Concentration of Measure, Statistical Analysis, and Statistical Methodology.

Students are also encouraged to take courses in machine learning:

Machine learning is an interdisciplinary field that draws on aspects of computer science, statistics, probability, optimization, information theory, and more. Increasingly its applications include fields that span all of those represented at Carnegie Mellon.

This course is designed to give students a deep understanding of how and why these methods work and how they can be applied to new problems. This course covers the core concepts, theory, algorithms, and applications of machine learning. We cover supervised learning topics such as classification (naive Bayes, logistic regression, support vector machines, neural networks, decision trees, boosting) and regression (linear, nonlinear), unsupervised learning (MLE, MAP, clustering, PCA, dimensionality reduction), graphical models, reasoning under uncertainty, and ML theory.

 

This course will give students a thorough grounding in the algorithms, mathematics, theories, and insights needed to do in-depth research and applications in machine learning. The topics of this course will in part parallel those covered in the general PhD-level machine learning course (10-701), but with a greater emphasis on depth in theory. Students entering the class are expected to have a pre-existing strong working knowledge of linear algebra, probability, statistics, and algorithms.

 

This course is for students who have already taken introductory courses in machine learning and statistics, and who are interested in deeper theoretical foundations of machine learning, as well as advanced methods and frameworks used in modern machine learning. The course goals are to:

  1. Understand statistical and computational considerations in machine learning methods.
  2. Develop the skill of devising computationally efficient and yet statistically rigorous algorithms for solving machine learning problems.
  3. Understand the science of modern statistical analysis.
  4. Develop the skill of quantifying the statistical performance of any new machine learning method.