Carnegie Mellon University

Core Ph.D. Requirements

Deviation from these requirements will only be allowed in exceptional cases, and with the approval of the Director of Graduate Studies. With a few exceptions, students in joint degree programs are also required to satisfy all of these requirements.

The Path to the Ph.D.

The following template shows the typical schedule a student follows to complete the coursework requirements and commence thesis research within the first two years.

Year 1

Fall

  • 36-699: Immigration to Statistics
  • 36-700: Probability and Mathematical Statistics OR 36-705: Intermediate Statistics
  • 36-707: Regression Analysis
  • 36-750: Statistical Computing
Spring

  • 36-709: Advanced Statistics I
  • 36-757: Advanced Data Analysis I
  • 36-708: Statistical Machine Learning or Methods Minis
  • Begin Work on ADA Project

Year 2

Fall

  • 36-758: Advanced Data Analysis II (Interdisciplinary Applied Research)
  • Complete oral and written presentations of ADA Project
  • Electives, e.g. 36-710: Advanced Statistical Theory / Probability Theory
Spring

  • Finalize Thesis Advisor and Topic
  • Begin Elective Coursework*

Year 3

Prepare and deliver your thesis proposal.

Year 4 and beyond

Dedicated to dissertation research.

*Note: After the first three semesters, students often take electives on advanced statistics, machine learning, or domain-specific topics. Each semester also features half-semester courses ("minis") that cover exciting topics in the field.

Program Requirements and Course Descriptions

Year 1 - Fall

Students are introduced to the faculty and their interests, the field of statistics, and the facilities at Carnegie Mellon.

Each faculty member gives at least one elementary lecture on some topic of his or her choice. In the past, topics have included: the field of statistics and its history, large-scale sample surveys, survival analysis, subjective probability, time series, robustness, multivariate analysis, psychiatric statistics, experimental design, consulting, decision-making, probability models, statistics and the law, and comparative inference.

Students are also given information about the libraries at Carnegie Mellon and current bibliographic tools. In addition, students are instructed in the use of the Departmental and University computational facilities and available statistical program packages.

 

This course covers the basics of statistics. We will first provide a quick introduction to probability theory, and then cover fundamental topics in mathematical statistics such as point estimation, hypothesis testing, asymptotic theory, and Bayesian inference. If time permits, we will also cover more advanced topics, including nonparametric inference, regression, and classification. Prerequisites: one- and two-variable calculus and matrix algebra.

This course covers the fundamentals of theoretical statistics. Topics include: probability inequalities, point and interval estimation, minimax theory, hypothesis testing, data reduction, convergence concepts, Bayesian inference, nonparametric statistics, bootstrap resampling, VC dimension, prediction and model selection.
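To give a concrete flavor of one listed topic, bootstrap resampling, here is a minimal sketch in Python. This is an illustration only, not course material; the function name and data are invented for this example.

```python
import random
import statistics

def bootstrap_ci(data, stat=statistics.mean, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for a statistic."""
    rng = random.Random(seed)
    # Recompute the statistic on many resamples drawn with replacement.
    reps = sorted(
        stat(rng.choices(data, k=len(data)))
        for _ in range(n_boot)
    )
    lo = reps[int((alpha / 2) * n_boot)]
    hi = reps[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

sample = [2.1, 2.4, 1.9, 2.8, 2.3, 2.6, 2.0, 2.5]
print(bootstrap_ci(sample))  # approximate 95% CI for the mean
```

The percentile method shown here is the simplest bootstrap interval; the course treats such procedures, and their theoretical justification, in far greater depth.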

This course covers:

  • The basic principles of causality.
  • Foundations of linear regression, including theory, computation, diagnostics, and generalized linear models.
  • Extensions to nonparametric regression, including splines, kernel regression, and generalized additive models.
  • Discussion of tools to compare statistical models, including hypothesis tests, cross-validation, and bootstrapping.
  • Topics in nonparametric regression and machine learning as time permits, such as regression trees, boosting, and random forests.
  • Emphasis on writing data analysis reports that answer substantive scientific questions with appropriate statistical tools.

Students will be equipped with the tools needed to explore a substantive scientific question with data, translate scientific questions into statistical questions, compare different modeling approaches rigorously, and write their results in a clear manner.
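The model-comparison theme above can be illustrated with a toy cross-validation in Python. This is a generic sketch with invented function names, not part of the course: it compares an intercept-only predictor against simple linear regression by 5-fold cross-validated mean squared error.

```python
import random

def fit_mean(xs, ys):
    """Intercept-only model: predict the training mean everywhere."""
    m = sum(ys) / len(ys)
    return lambda x: m

def fit_line(xs, ys):
    """Simple linear regression via the closed-form least-squares solution."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    b = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / \
        sum((x - xbar) ** 2 for x in xs)
    a = ybar - b * xbar
    return lambda x: a + b * x

def cv_mse(fit, xs, ys, k=5):
    """k-fold cross-validated mean squared error of a fitting procedure."""
    folds = [range(i, len(xs), k) for i in range(k)]
    total, count = 0.0, 0
    for fold in folds:
        hold = set(fold)
        tr_x = [x for i, x in enumerate(xs) if i not in hold]
        tr_y = [y for i, y in enumerate(ys) if i not in hold]
        f = fit(tr_x, tr_y)
        for i in hold:
            total += (ys[i] - f(xs[i])) ** 2
            count += 1
    return total / count

rng = random.Random(1)
xs = [i / 10 for i in range(50)]
ys = [1.0 + 2.0 * x + rng.gauss(0, 0.3) for x in xs]
print(cv_mse(fit_mean, xs, ys), cv_mse(fit_line, xs, ys))
```

On data with a genuine linear trend, the linear fit attains a much smaller cross-validated error, which is the kind of rigorous model comparison the course develops.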

A detailed introduction to elements of computing related to statistical modeling, targeted to Ph.D. students and master's students in Statistics & Data Science.

Topics include important data structures and algorithms; numerical methods; databases; parallelism and concurrency; and coding practices, program design, and testing.

Multiple programming languages will be supported (e.g., C, R, Python). Those with no previous programming experience are welcome but will be required to learn the basics of at least one language via self-study.

Year 1 - Spring

This is a core Ph.D. course in theoretical statistics. The class will cover a selection of modern topics in mathematical statistics, focusing on high-dimensional parametric models and non-parametric models. The main goal of the course is to provide students with adequate theoretical background and mathematical tools to read and understand the current statistical literature on high-dimensional models.

Topics will include: concentration inequalities, covariance estimation, principal component analysis, penalized linear regression, maximal inequalities for empirical processes, Rademacher and Gaussian complexities, non-parametric regression and minimax theory. This will be the first part of a two semester sequence.
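To give a flavor of the first listed topic, a canonical concentration inequality is Hoeffding's inequality: if X_1, ..., X_n are independent with a_i <= X_i <= b_i almost surely, then for every t > 0,

```latex
\Pr\left( \left| \frac{1}{n}\sum_{i=1}^{n} \bigl(X_i - \mathbb{E}[X_i]\bigr) \right| \ge t \right)
\;\le\; 2\exp\left( -\frac{2 n^2 t^2}{\sum_{i=1}^{n} (b_i - a_i)^2} \right)
```

Bounds of this type, which control how far sample averages stray from their expectations, underpin much of the high-dimensional theory the course covers.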

Advanced Data Analysis (ADA) is a Ph.D. level seminar on advanced methods in statistics, including computationally intensive smoothing, classification, variable selection and simulation techniques.

During 36-757, you work with the seminar instructor to identify an ADA project for yourself. The ADA project is an extended project in applied statistics, done in collaboration with an investigator from outside the Department, under the guidance of a faculty committee, culminating in a publishable paper that is presented orally and in writing in 36-758.

This course focuses on statistical methods for machine learning, a decades-old topic in statistics that now has a life of its own, intersecting with many other fields. While the core focus of this course is methodology (algorithms), the course will have some amount of formalization and rigor (theory/derivation/proof), and some amount of interacting with data (simulated and real). However, the primary way in which this course complements related courses in other departments is the joint ABCDE focus on (A) Algorithm design principles, (B) Bias-variance thinking, (C) Computational considerations, (D) Data analysis, and (E) Explainability and interpretability.


Year 2 - Fall

Advanced Data Analysis (ADA) is a Ph.D. level seminar on advanced methods in statistics, including computationally intensive smoothing, classification, variable selection, and simulation techniques.

In 36-758, students carry out the ADA project identified with the seminar instructor during 36-757: an extended project in applied statistics, done in collaboration with an investigator from outside the Department, under the guidance of a faculty committee, culminating in a publishable paper that is presented orally and in writing.

The project ends with a 15-page internal report describing the work, together with a 25-minute presentation to the Department, with additional time allotted for questions and answers. Developing the ADA report into a publishable paper is at the discretion of the ADA advisors and the student, and may extend up to one or two additional semesters depending on the scope and nature of the research.

Ph.D. students can customize their studies by choosing from a variety of elective courses. These options allow for deeper exploration of specific research areas or the development of skills that align with their academic and career goals.

Year 2 - Spring

The student determines the research topic they will pursue for their Ph.D. thesis and which faculty will serve as their advisor or co-advisor(s) to carry out this work.

 


Elective Coursework

Note: Electives can change from year to year.

The class will cover a selection of modern topics in mathematical statistics, focusing on high-dimensional parametric models and non-parametric models. The main goal of the course is to provide students with adequate theoretical background and mathematical tools to read and understand the current statistical literature on high-dimensional models. Topics will include: concentration inequalities, covariance estimation, principal component analysis, penalized linear regression, maximal inequalities for empirical processes, Rademacher and Gaussian complexities, non-parametric regression and minimax theory.

This course will include (a) fundamental statistical issues that arise frequently, and (b) timely responses to issues raised in student research and lab meetings. The first, (a), will involve particular topics with assigned reading. Prior to class, students will be required to post a comment or question on a course discussion board, and these posts will furnish the basis for class discussion. For (b), students will also have the opportunity to ask questions about anything that has come up in their reading and research.

Examples of topics in part (a) include:

  • Ten simple rules for effective statistical practice.
  • What we mean by "random."
  • The often-overlooked yet crucial statistical assumption of independence.
  • The reason the normal distribution is so important (it's not about data).
  • The enduring lessons of statistical theory.
  • How bias can invalidate conclusions, while sometimes being helpful.
  • The most common difficulty with reported p-values, and why many concerns about p-values are largely misplaced.
  • The fundamental distinction between effects of causes and causes of effects.
  • The ways that correlation is more subtle than is usually appreciated.
  • Why regression is the most important method in statistics.
  • The situations in which regression can be treacherous.
  • What Bayesian methods can and cannot achieve.
  • The goals, successes, and perils of machine learning.

Designed experiments are crucial to draw causal conclusions with minimum expense and maximum precision. This course introduces the basic principles and theory of experimental design, including randomized designs, blocking, analysis of covariance, factorial designs, and power analysis, with an emphasis on recent techniques often applied to the online experiments frequently used by tech companies. We will emphasize the importance of critical thinking about the goals and context of an experiment to choose the best design, and practice these skills through a course project. Coursework will primarily use R for the analysis of experimental data.
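Power analysis, one of the listed topics, can also be approached by simulation. The course itself uses R, but the idea translates directly; below is a minimal Python sketch (the function name and parameter values are invented for this illustration) that estimates the power of a two-sided two-sample z-test for a completely randomized two-arm experiment.

```python
import random
import statistics

def simulated_power(effect, n_per_arm, sd=1.0, alpha=0.05, n_sim=500, seed=0):
    """Monte Carlo estimate of the power of a two-sided two-sample z-test
    for a completely randomized two-arm experiment (a generic sketch)."""
    rng = random.Random(seed)
    z_crit = statistics.NormalDist().inv_cdf(1 - alpha / 2)  # ~1.96 for alpha=0.05
    se = sd * (2 / n_per_arm) ** 0.5  # standard error of the difference in means
    rejections = 0
    for _ in range(n_sim):
        control = [rng.gauss(0.0, sd) for _ in range(n_per_arm)]
        treated = [rng.gauss(effect, sd) for _ in range(n_per_arm)]
        z = (statistics.mean(treated) - statistics.mean(control)) / se
        rejections += abs(z) >= z_crit
    return rejections / n_sim

# Larger samples or larger effects push the estimated power toward 1.
print(simulated_power(effect=0.5, n_per_arm=64))
```

Simulation-based power calculations like this are flexible enough to handle the blocking and covariance-adjustment designs the course discusses, where closed-form power formulas may not exist.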

This course explores the role of stochastic models in scientific research, particularly in understanding neural systems. Through case studies, students will examine how statistical models, such as Poisson processes and Bayesian frameworks, are used to describe neural behavior in sensory processing, learning, and motor control.

The course emphasizes the importance of mathematical theories in making sense of complex neural functions, aiming to equip students with the tools to mathematically formalize and investigate neural computations.

The mini courses vary each year, but topics that have been covered include: Network Theory, the Central Limit Theorem, Bayesian Statistics, Concentration of Measure, Statistical Analysis, and Statistical Methodology.

Students are also encouraged to take courses in machine learning:

Machine learning is an interdisciplinary field that draws on aspects of computer science, statistics, probability, optimization, information theory, and more. Increasingly its applications include fields that span all of those represented at Carnegie Mellon.

This course is designed to give students a deep understanding of how and why these methods work and how they can be applied to new problems. This course covers the core concepts, theory, algorithms, and applications of machine learning. We cover supervised learning topics such as classification (naive Bayes, logistic regression, support vector machines, neural networks, decision trees, boosting) and regression (linear, nonlinear), unsupervised learning (MLE, MAP, clustering, PCA, dimensionality reduction), graphical models, reasoning under uncertainty, and ML theory.

 

This course will give students a thorough grounding in the algorithms, mathematics, theories, and insights needed to do in-depth research and applications in machine learning. The topics of this course will in part parallel those covered in the general PhD-level machine learning course (10-701), but with a greater emphasis on depth in theory. Students entering the class are expected to have a pre-existing strong working knowledge of linear algebra, probability, statistics, and algorithms.

 

This course is for students who have already taken introductory courses in machine learning and statistics, and who are interested in deeper theoretical foundations of machine learning, as well as advanced methods and frameworks used in modern machine learning. The course goals are to:

  1. Understand statistical and computational considerations in machine learning methods.
  2. Develop the skill of devising computationally efficient and yet statistically rigorous algorithms for solving machine learning problems.
  3. Understand the science of modern statistical analysis.
  4. Develop the skill of quantifying the statistical performance of any new machine learning method.