Carnegie Mellon University

Upcoming Events

More events will be posted soon.

Past Events

April 22, 2024

Anderson Zhang
Spectral Clustering: Methodology and Statistical Analysis

Abstract: Spectral clustering is one of the most popular algorithms to group high-dimensional data. It is easy to implement, computationally efficient, and has achieved tremendous success in many applications. The idea behind spectral clustering is dimensionality reduction. It first performs a spectral decomposition on the dataset and only keeps the leading few spectral components to reduce the data's dimension. It then applies a standard method such as k-means in the low-dimensional space to perform the clustering. In this talk, we demystify the success of spectral clustering by providing a sharp statistical analysis of its performance under mixture models. For isotropic Gaussian mixture models, we show spectral clustering is optimal. For sub-Gaussian mixture models, we derive exponential error rates for spectral clustering. To establish these results, we develop a new spectral perturbation analysis for singular subspaces.

Bio: Anderson Ye Zhang is an Assistant Professor in the Department of Statistics and Data Science at the Wharton School, with a secondary appointment in the Department of Computer and Information Science, at the University of Pennsylvania. Before joining Wharton, he was a William H. Kruskal Instructor in the Department of Statistics at the University of Chicago. He obtained his Ph.D. degree from the Department of Statistics and Data Science at Yale University. His research interests include network analysis, clustering, spectral analysis, and ranking from pairwise comparisons.

April 12, 2024

Alicia Carriquiry
Statistics and its Applications in Forensic Science and the Criminal Justice System

Abstract: Steve Fienberg was a pioneer in highlighting the role of statistical thinking in the civil and criminal justice systems and was an early critic of many forensic methods that are still in use in US courts. One of his last achievements was the creation of the Center for Statistics and Applications in Forensic Evidence (CSAFE), a federally funded NIST Center of Excellence, with the mission to build the statistical foundation – where possible – for what is known as forensic pattern comparison disciplines and digital forensics.

Forensic applications present unique challenges for statisticians. For example, much of the data that arise in forensics are non-standard, so even defining analytical variables may require out-of-the-box thinking. As a result, the usual statistical approaches may not enable addressing the questions of interest to jurors, legal professionals and forensic practitioners.

Today’s presentation introduces some of the statistical and algorithmic methods proposed by CSAFE researchers that have the potential to impact forensic practice in the US. Two examples are used for illustration: the analysis of questioned handwritten documents and of marks imparted by firearms on bullets or cartridge cases. In both examples, the question we address is one of source: do two or more items have the same source? In the first case, we apply “traditional” statistical modeling methods, while in the second case, we resort to algorithmic approaches. Much of the research carried out in CSAFE is collaborative and, while mission-driven, also academically rigorous, which would have pleased Steve tremendously.

Bio: Alicia Carriquiry (NAM) is Professor of Statistics at Iowa State University. Between January of 2000 and July of 2004, she was Associate Provost at Iowa State. Her research interests are in Bayesian statistics and general methods. Her recent work focuses on nutrition and dietary assessment, as well as on problems in genomics, forensic sciences and traffic safety. Dr. Carriquiry is an elected member of the International Statistical Institute and a fellow of the American Statistical Association. She serves on the Executive Committee of the Institute of Mathematical Statistics and has been a member of the Board of Trustees of the National Institute of Statistical Sciences since 1997. She is also a past president of the International Society for Bayesian Analysis (ISBA) and a past member of the Board of the Plant Sciences Institute at Iowa State University.

April 5, 2024

Andrew Gelman
Bayesian Workflow: Some Progress and Open Questions

Abstract: The workflow of applied Bayesian statistics includes not just inference but also model building, model checking, confidence-building using fake data, troubleshooting problems with computation, model understanding, and model comparison. We would like to codify these steps in the realistic scenario in which researchers are fitting many models for a given problem. We discuss various issues including prior distributions, data models, and computation, in the context of ideas such as the Fail Fast Principle and the Folk Theorem of Statistical Computing. We also consider some examples of Bayesian models that give bad answers and see if we can develop a workflow that catches such problems.

Bio: Andrew Gelman is a professor of statistics and political science at Columbia University. He has received the Outstanding Statistical Application award three times from the American Statistical Association, the award for best article published in the American Political Science Review, the Mitchell and DeGroot prizes from the International Society for Bayesian Analysis, and the Council of Presidents of Statistical Societies award. His books include Bayesian Data Analysis (with John Carlin, Hal Stern, David Dunson, Aki Vehtari, and Donald Rubin), Teaching Statistics: A Bag of Tricks (with Deborah Nolan), Data Analysis Using Regression and Multilevel/Hierarchical Models (with Jennifer Hill), Red State, Blue State, Rich State, Poor State: Why Americans Vote the Way They Do (with David Park, Boris Shor, and Jeronimo Cortina), A Quantitative Tour of the Social Sciences (co-edited with Jeronimo Cortina), and Regression and Other Stories (with Jennifer Hill and Aki Vehtari).

Andrew has done research on a wide range of topics, including: why it is rational to vote; why campaign polls are so variable when elections are so predictable; the effects of incumbency and redistricting; reversals of death sentences; police stops in New York City; the statistical challenges of estimating small effects; the probability that your vote will be decisive; seats and votes in Congress; social network structure; arsenic in Bangladesh; radon in your basement; toxicology; medical imaging; and methods in surveys, experimental design, statistical inference, computation, and graphics.

March 25, 2024

Glenn Shafer
Modernizing Cournot’s Principle

Abstract: In everyday English, a forecast is something less than a prediction. It is more like an estimate. When an economist forecasts 3.5% inflation in the United States next year, or my weather app forecasts 0.55 inches of rain, these are not exactly predictions. When the forecaster gives rain a 30% probability, this too is not a prediction. A prediction is more definite about what is predicted and about predicting it.

We might say that a probability is a prediction when it is very close to one. But this formulation has a difficulty: there are too many high probabilities. There is a high probability against every ticket in a lottery, but we cannot predict that no ticket will win.

Game-theoretic statistics resolves this problem by showing how some high probabilities are simpler than others. The simpler ones qualify as predictions.

This story has roles for Cournot’s principle, Kolmogorov’s algorithmic complexity, and de Finetti’s previsione. See my two books on the topic with Vladimir Vovk.

Bio: In the 1970s, Glenn launched the “Dempster-Shafer” theory. The Belief Functions and Applications Society, devoted to this theory, has been holding international conferences since 2010.

During the past 25 years, Glenn and Vladimir Vovk launched game-theoretic probability and statistics. Their two books on the topic appeared in 2001 and 2019.

Glenn has published more than 20 papers on the history of probability and statistics. His most recent book, The Splendors and Miseries of Martingales: Their History from the Casino to Mathematics, co-edited with Laurent Mazliak, was published by Birkhäuser in 2022.

Glenn served in the Peace Corps in Afghanistan. At the Rutgers Business School, he served as director of the doctoral program for ten years and as dean for four years.

March 18, 2024

Martin Wainwright
Challenges with Covariate Shift: From Prediction to Causal Inference

Abstract: In many modern uses of predictive methods, there can be shifts between the distributional properties of the training data and those of the test data. Such mismatches can cause dramatic reductions in accuracy that remain mysterious. How can we find practical procedures that mitigate such effects in an optimal way? In this talk, we discuss the fundamental limits of problems with covariate shift, and simple procedures that achieve these fundamental limits. Our talk covers the challenges of covariate shift both in non-parametric regression and in semi-parametric problems that arise from causal inference and off-policy evaluation.

Based on joint works with: Peter Bartlett, Peng Ding, Cong Ma, Wenlong Mou, Reese Pathak and Lin Xiao.

Bio: Martin Wainwright is the Cecil H. Green Professor in Electrical Engineering and Computer Science and Mathematics at MIT, and affiliated with the Laboratory for Information and Decision Systems and Statistics and Data Science Center.

His main claim to fame is that he was the graduate advisor of Nihar Shah, and postdoc advisor of Aaditya Ramdas, Pradeep Ravikumar (all esteemed faculty at CMU), and Sivaraman Balakrishnan. He has also received a number of awards and recognition including an Alfred P. Sloan Foundation Fellowship, best paper awards from the IEEE Signal Processing Society, the IEEE Communications Society, and the IEEE Information Theory and Communication Societies, the Medallion Lectureship and Award from the Institute of Mathematical Statistics, and the COPSS Presidents’ Award from the Joint Statistical Societies. He was a Section Lecturer with the International Congress of Mathematicians in 2014 and received the Blackwell Award from the Institute of Mathematical Statistics in 2017.

February 7, 2024

Shuangning Li
Inference and Decision-Making amid Social Interactions

Abstract: From social media trends to family dynamics, social interactions shape our daily lives. In this talk, I will present tools I have developed for statistical inference and decision-making in light of these social interactions.

  1. Inference: I will talk about estimation of causal effects in the presence of interference. In causal inference, the term “interference” refers to a situation where, due to interactions between units, the treatment assigned to one unit affects the observed outcomes of others. I will discuss large-sample asymptotics for treatment effect estimation under network interference where the interference graph is a random draw from a graphon. When targeting the direct effect, we show that popular estimators in our setting are considerably more accurate than existing results suggest. Meanwhile, when targeting the indirect effect, we propose a consistent estimator in a setting where no other consistent estimators are currently available.
  2. Decision-Making: Turning to reinforcement learning amid social interactions, I will focus on a problem inspired by a specific class of mobile health trials involving both target individuals and their care partners. These trials feature two types of interventions: those targeting individuals directly and those aimed at improving the relationship between the individual and their care partner. I will present an online reinforcement learning algorithm designed to personalize the delivery of these interventions. The algorithm's effectiveness is demonstrated through simulation studies conducted on a realistic test bed, which was constructed using data from a prior mobile health study. The proposed algorithm will be implemented in the ADAPTS HCT clinical trial, which seeks to improve medication adherence among adolescents undergoing allogeneic hematopoietic stem cell transplantation.

Bio: Shuangning Li is a postdoctoral fellow working with Professor Susan Murphy in the Department of Statistics at Harvard University. Prior to this, she earned her Ph.D. from the Department of Statistics at Stanford University, where she was advised by Professors Emmanuel Candès and Stefan Wager.

February 5, 2024

Brian Trippe
Probabilistic methods for designing functional protein structures

Abstract: The biochemical functions of proteins, such as catalyzing a chemical reaction or binding to a virus, are typically conferred by the geometry of only a handful of atoms. This arrangement of atoms, known as a motif, is structurally supported by the rest of the protein, referred to as a scaffold. A central task in protein design is to identify a diverse set of stabilizing scaffolds to support a motif known or theorized to confer function. This long-standing challenge is known as the motif-scaffolding problem.

In this talk, I describe a statistical approach I have developed to address the motif-scaffolding problem. My approach involves (1) estimating a distribution supported on realizable protein structures and (2) sampling scaffolds from this distribution conditioned on a motif. For step (1) I adapt diffusion generative models to fit example protein structures from nature. For step (2) I develop sequential Monte Carlo algorithms to sample from the conditional distributions of these models. I finally describe how, with experimental and computational collaborators, I have generalized and scaled this approach to generate and experimentally validate hundreds of proteins with various functional specifications.

Bio: Brian Trippe is a postdoctoral fellow at Columbia University in the Department of Statistics, and a visiting researcher at the Institute for Protein Design at the University of Washington. He completed his Ph.D. in Computational and Systems Biology at the Massachusetts Institute of Technology where he worked on Bayesian methods for inference in hierarchical linear models. In his research, Brian develops statistical machine learning methods to address challenges in biotechnology and medicine, with a focus on generative modeling and inference algorithms for protein engineering.

January 31, 2024

Michael Celentano
Debiasing in the inconsistency regime

Abstract: In this talk, I will discuss semi-parametric estimation when nuisance parameters cannot be estimated consistently, focusing in particular on the estimation of average treatment effects, conditional correlations, and linear effects under high-dimensional GLM specifications. In this challenging regime, even standard doubly-robust estimators can be inconsistent. I describe novel approaches which enjoy consistency guarantees for low-dimensional target parameters even though standard approaches fail. For some target parameters, these guarantees can also be used for inference. Finally, I will provide my perspective on the broader implications of this work for designing methods which are less sensitive to biases from high-dimensional prediction models.

Bio: Michael Celentano is a Miller Fellow in the Statistics Department at the University of California, Berkeley, advised by Martin Wainwright and Ryan Tibshirani. He received his Ph.D. in Statistics from Stanford University in 2021, where he was advised by Andrea Montanari. Most of his work focuses on the high-dimensional asymptotics for regression, classification, and matrix estimation problems.

January 29, 2024

Ying Jin
Model-free selective inference: from calibrated uncertainty to trusted decisions

Abstract: AI has shown great potential in accelerating decision-making and scientific discovery pipelines such as drug discovery, marketing, and healthcare. In many applications, predictions from black-box models are used to shortlist candidates whose unknown outcomes satisfy a desired property, e.g., drugs with high binding affinities to a disease target. To ensure the reliability of high-stakes decisions, uncertainty quantification tools such as conformal prediction have been increasingly adopted to understand the variability in black-box predictions. However, we find that the on-average guarantee of conformal prediction can be insufficient for its deployment in decision making which usually has a selective nature.

In this talk, I will introduce a model-free selective inference framework that allows us to select reliable decisions with the assistance of any black-box prediction model. Our framework identifies candidates whose unobserved outcomes exceed user-specified values while controlling the average proportion of falsely selected units (FDR), without any modeling assumptions. Leveraging a set of exchangeable training data, our method constructs conformal p-values that quantify the confidence in large outcomes; it then determines a data-dependent threshold for the p-values as a criterion for drawing confident decisions. In addition, I will discuss new ideas to further deal with covariate shifts between training and new samples. We show that in several drug discovery tasks, our methods narrow down the drug candidates to a manageable size of promising ones while controlling the proportion of false discoveries. In a causal inference dataset, our methods identify students who benefit from an educational intervention, providing new insights for causal effects.

Bio: Ying Jin is a fifth-year Ph.D. student in the Department of Statistics, Stanford University, advised by Emmanuel Candès and Dominik Rothenhäusler. Prior to this, she obtained her B.S. in Mathematics from Tsinghua University. Her research focuses on devising modern statistical methodology that enables trusted inference and decisions with minimal assumptions, covering conformal inference, multiple testing, causal inference, distribution robustness, and data-driven decision-making.

January 24, 2024

Arkajyoti Saha
Inference for machine learning under dependence

Abstract: Recent interest has centered on uncertainty quantification for machine learning models. For the most part, this work has assumed independence of the observations. However, many of the most important problems arising across scientific fields, from genomics to climate science, involve systems where dependence cannot be ignored. In this talk, I will investigate inference on machine learning models in the presence of dependence.

In the first part of my talk, I will consider a common practice in the field of genomics in which researchers compute a correlation matrix between genes and threshold its elements in order to extract groups of independent genes. I will describe how to construct valid p-values associated with these discovered groups that properly account for the group selection process. While this is related to the literature on selective inference developed in the past decade, this work involves inference about the covariance matrix rather than the mean, and therefore requires an entirely new technical toolset. This same toolset can be applied to quantify the uncertainty associated with canonical correlation analysis after feature screening.

In the second part of my talk, I will turn to an important problem in the field of oceanography as it relates to climate science. Oceanographers have recently applied random forests to estimate carbon export production, a key quantity of interest, at a given location in the ocean; they then wish to sum the estimates across the world’s oceans to obtain an estimate of global export production. While quantifying uncertainty associated with a single estimate is relatively straightforward, quantifying uncertainty of the summed estimates is not, due to their complex dependence structure. I will adapt the theory of V-statistics to this dependent data setting in order to establish a central limit theorem for the summed estimates, which can be used to quantify the uncertainty associated with global export production across the world’s oceans.

This is joint work with my postdoctoral supervisors, Daniela Witten (University of Washington) and Jacob Bien (University of Southern California).

Bio: Arkajyoti Saha is a postdoctoral fellow in the Department of Statistics, University of Washington. He received his Ph.D. in Biostatistics from the Johns Hopkins Bloomberg School of Public Health. His research lies at the intersection of machine learning, selective inference, and spatial statistics, with a focus on machine learning under dependence with applications in genomics and oceanography.

January 22, 2024

Satarupa Bhattacharjee
Geodesic Mixed Effects Models for Repeatedly Observed/Longitudinal Random Objects

Abstract: Mixed effect modeling for longitudinal data is challenging when the observed data are random objects, which are complex data taking values in a general metric space without either global linear or local linear (Riemannian) structure. In such settings, the classical additive error model and distributional assumptions are unattainable. Due to the rapid advancement of technology, longitudinal data containing complex random objects, such as covariance matrices, data on Riemannian manifolds, and probability distributions are becoming more common. Addressing this challenge, we develop a mixed-effects regression for data in geodesic spaces, where the underlying mean response trajectories are geodesics in the metric space and the deviations of the observations from the model are quantified by perturbation maps or transports. A key finding is that the geodesic trajectories assumption for the case of random objects is a natural extension of the linearity assumption in the standard Euclidean scenario to the case of general geodesic metric spaces. Geodesics can be recovered from noisy observations by exploiting a connection between the geodesic path and the path obtained by global Fréchet regression for random objects. The effect of baseline Euclidean covariates on the geodesic paths is modeled by another Fréchet regression step. We study the asymptotic convergence of the proposed estimates and provide illustrations through simulations and real-data applications.

Bio: I am a Postdoctoral Scholar in the Department of Statistics at Pennsylvania State University, working with Prof. Bing Li and Prof. Lingzhou Xue. I received my Ph.D. in Statistics at UC Davis advised by Prof. Hans-Georg Müller in September 2022.

My primary research centers around analyzing functional and non-Euclidean data situated in general metric spaces, which we refer to as random objects, with examples in brain imaging data, networks, distribution valued data, and high-dimensional genetics data.

January 17, 2024

Christopher Harshaw
Algorithm Design for Randomized Experiments

Abstract: Randomized experiments are one of the most reliable causal inference methods and are used in a variety of disciplines, including clinical medicine, public policy, economics, and corporate A/B testing. Experiments in these disciplines provide empirical evidence which drives some of the most important decisions in our society: What drugs are prescribed? Which social programs are implemented? What corporate strategies to use? Technological advances in measurements and intervention -- including high dimensional data, network data, and mobile devices -- offer exciting opportunities to design new experiments to investigate a broader set of causal questions. In these more complex settings, standard experimental designs (e.g. independent assignment of treatment) are far from optimal. Designing experiments which yield the most precise estimates of causal effects in these complex settings is not only a statistical problem, but also an algorithmic one.

In this talk, I will present my recent work on designing algorithms for randomized experiments. I will begin by presenting Clip-OGD, a new algorithmic experimental design for adaptive sequential experiments. We show that under the Clip-OGD design, the variance of an adaptive version of the Horvitz-Thompson estimator converges to the optimal non-adaptive variance, resolving a 70-year-old problem posed by Robbins in 1952. Our results are facilitated by drawing connections to regret minimization in online convex optimization. Time permitting, I will describe a new unifying framework for investigating causal effects under interference, where treatment given to one subject can affect the outcomes of other subjects. Finally, I will conclude by highlighting open problems and reflecting on future work in these directions.

Bio: Christopher Harshaw is a FODSI postdoc at MIT and UC Berkeley. He received his Ph.D. from Yale University where he was advised by Dan Spielman and Amin Karbasi. His research lies at the interface of causal inference, machine learning, and algorithm design, with a particular focus on the design and analysis of randomized experiments. His work has appeared in the Journal of the American Statistical Association, Electronic Journal of Statistics, ICML, and NeurIPS, and won the Best Paper Award at the NeurIPS 2022 workshop, CML4Impact.

November 27, 2023

Yen-chi Chen
Pattern Graphs: a Graphical Approach to Nonmonotone Missing Data

Abstract: We introduce the concept of pattern graphs--directed acyclic graphs representing how response patterns are associated. A pattern graph represents an identifying restriction that is nonparametrically identified/saturated and is often a missing not at random restriction. We introduce a selection model and a pattern mixture model formulation using the pattern graphs and show that they are equivalent. A pattern graph leads to an inverse probability weighting estimator as well as an imputation-based estimator. We also study the semi-parametric efficiency theory and derive a multiply-robust estimator using pattern graphs.

Bio: Dr. Chen is an associate professor in the Department of Statistics and a data science fellow in the eScience Institute at the University of Washington. He also serves as a co-investigator and statistician at the National Alzheimer’s Coordinating Center. Dr. Chen has received several awards including NSF's CAREER award and ASA's Noether Young Scholar Award.

November 13, 2023

Yanjun Han
Covariance alignment: from MLE to Gromov-Wasserstein

Abstract: Feature alignment methods are used in many scientific disciplines for data pooling, annotation, and comparison. As an instance of a permutation learning problem, feature alignment presents significant statistical and computational challenges. In this talk, I will introduce the covariance alignment model to study and compare various alignment methods and establish a minimax lower bound for covariance alignment that has a non-standard dimension scaling because of the presence of a nuisance parameter. This lower bound is in fact minimax optimal and is achieved by a natural quasi MLE. However, this estimator involves a search over all permutations, which is computationally infeasible even when the problem has a moderate size. To overcome this limitation, I will show that the celebrated Gromov-Wasserstein algorithm from optimal transport, which is more amenable to fast implementation even on large-scale problems, is also minimax optimal. These results give the first statistical justification for the deployment of the Gromov-Wasserstein algorithm in practice. Finally, I will also discuss the connections to recent literature on statistical graph matching and orthogonal statistical learning. Based on a joint work with Philippe Rigollet and George Stepaniants.

Bio: Yanjun Han is an assistant professor of mathematics and data science at the Courant Institute of Mathematical Sciences and the Center for Data Science, New York University. He received his Ph.D. in Electrical Engineering from Stanford University in Aug 2021, under the supervision of Tsachy Weissman. After that, he spent one year as a postdoctoral scholar at the Simons Institute for the Theory of Computing, UC Berkeley, and another year as a Norbert Wiener postdoctoral associate in the Statistics and Data Science Center at MIT, mentored by Sasha Rakhlin and Philippe Rigollet. Honors on his past work include a best student paper finalist at ISIT 2016, a student paper award at ISITA 2016, and the Annals of Statistics Special Invited Session at JSM 2021. His research interests include high-dimensional and nonparametric statistics, bandits, and information theory.

November 10–11, 2023

Carnegie Mellon Sports Analytics Conference

The Carnegie Mellon Sports Analytics Conference is an annual event dedicated to highlighting the latest sports research from the statistics and data science community.


November 6, 2023

Elizabeth Slate
A Bayesian model for joint longitudinal and survival outcomes in the presence of subpopulation heterogeneity

Abstract: Biomedical studies often monitor subjects using a longitudinal marker that may be informative about a time-to-event outcome of interest. An example is periodic monitoring of CD4 cell count as the longitudinal marker and time to death from AIDS. By modeling these two outcomes jointly, there is potential to improve the precision of inference for each. Brown and Ibrahim (2003, Biometrics) adopted the common approach of incorporating the mean longitudinal trajectory as a predictor for the event time hazard, and enhanced flexibility by using a Dirichlet process prior for the coefficients of the trajectory. We generalize this model to accommodate non-Gaussian longitudinal outcomes and emphasize that the Dirichlet process enables discovery of subgroups of subjects with distinct behavior for the joint outcome. Our formulation, developed using the multivariate log-gamma distribution, offers greater flexibility in the longitudinal model and computational advantage for Markov chain Monte Carlo estimation. We illustrate with simulation and application. This is joint work with Pengpeng Wang and Jonathan Bradley.

Bio: Elizabeth joined the Department of Statistics at Florida State University in 2011 as the Duncan McLean and Pearl Levine Fairweather Professor of Statistics. She is honored to hold this named professorship, especially after having the opportunity to meet Dr. David Fairweather (Ph.D. FSU Statistics, 1970), whose generosity made the position possible.

Elizabeth received her Ph.D. in Statistics from Carnegie Mellon University in Pittsburgh, PA and joined the faculty at Cornell University in the Department of Operations Research and Industrial Engineering (now Operations Research and Information Engineering) in 1992. She visited the Biometry Research Group of the National Cancer Institute 1999-2000 as Visiting Mathematical Statistician, which enhanced her interest in biostatistics. In 2000, she joined the Medical University of South Carolina (MUSC), where she collaborated broadly with clinical and basic science researchers. She directed the Biostatistics Core for the MUSC Center for Oral Health Research (2002-2011) and created and directed the NIH-supported predoctoral training program "Biostatistics for Basic Biomedical Research" (2005-2011).

Elizabeth's recent research is in longitudinal data analysis, Bayesian modeling and recurrent events, with applications in oral health research, disease biomarkers and other health research areas. She has many publications in statistics and medical journals and has received several grants from the National Science Foundation, National Institutes of Health, Department of Defense and other sources. Elizabeth is a Fellow of the American Statistical Association, a Fellow of the Institute of Mathematical Statistics and an Elected Member of the International Statistical Institute. FSU recognized her with the title of Distinguished Research Professor in 2019 and with the Graduate Mentor Award in 2022. She received the Paul Minton Award from the Southern Regional Council on Statistics in 2022.

October 23, 2023

Kristian Lum
Let's get practical: measuring (algorithmic) bias in the real world

Kristian Lum headshot

Abstract: While measures and mitigations for algorithmic bias have proliferated in the academic literature over the last several years, implementing them in practice is frequently not straightforward. This talk delves into the practical implementation of algorithmic bias metrics and mitigation techniques, exploring the challenges and solutions through three distinct case studies. The first case study explores the difficulty of measuring and communicating measures of algorithmic bias when the measures are taken over many different groups. In this example, we develop a low-dimensional summary statistic of algorithmic bias and discuss its statistical properties. In the second example, we build upon existing fair clustering techniques and apply them in the context of polling location assignment. Here, we confront the challenges arising from measuring distance between voters and polling locations and demonstrate how "fair" clustering based on poorly designed distance metrics could exacerbate disparities in voter turnout. Finally, in the third case study, unsatisfied by existing measures of bias for LLMs, I explore the extent of bias present based on my own usage patterns of the models.

Bio: Kristian Lum is an Associate Research Professor at the University of Chicago Data Science Institute. She has previously held positions at Twitter as a Sr. Staff ML Researcher and Research Lead for the Machine Learning Ethics, Accountability, and Transparency group; the University of Pennsylvania Department of Computer and Information Science; and at the Human Rights Data Analysis Group, where she led work on the use of algorithms in the criminal justice system. She is also a co-founder of the ACM Conference on Fairness, Accountability, and Transparency (FAccT). Her current research focuses on algorithmic bias/fairness and social impact data science. 

October 9, 2023

Ziv Goldfeld
Gromov-Wasserstein Alignment: Statistical and Computational Advancements via Duality

Ziv Goldfeld headshot

Abstract: The Gromov-Wasserstein (GW) distance quantifies dissimilarity between metric measure (mm) spaces and provides a natural correspondence between them. As such, it serves as a figure of merit for applications involving alignment of heterogeneous datasets, including object matching, single-cell genomics, and matching language models. While various heuristic methods for approximately evaluating the GW distance from data have been developed, formal guarantees for such approaches (both statistical and computational) remained elusive. This work closes these gaps for the quadratic GW distance between Euclidean mm spaces of different dimensions. At the core of our proofs is a novel dual representation of the GW problem as an infimum of certain optimal transportation problems. The dual form enables deriving, for the first time, sharp empirical convergence rates for the GW distance by providing matching upper and lower bounds. For computational tractability, we consider the entropically regularized GW distance. We derive bounds on the entropic approximation gap, establish sufficient conditions for convexity of the objective, and devise efficient algorithms with global convergence guarantees. These advancements facilitate principled estimation and inference methods for GW alignment problems that are efficiently computable via these algorithms.

Bio: Ziv Goldfeld is an Assistant Professor in the School of Electrical and Computer Engineering, and a graduate field member in the Center of Applied Mathematics, Statistics and Data Science, Computer Science, and Operations Research and Information Engineering, at Cornell University. Before joining Cornell, he was a postdoctoral research fellow in LIDS at MIT. Ziv graduated with a B.Sc., M.Sc., and Ph.D. (all summa cum laude) in Electrical and Computer Engineering from Ben Gurion University, Israel. Ziv’s research interests include optimal transport theory, statistical learning theory, information theory, and mathematical statistics. Honors include the NSF CAREER Award, the IBM University Award, the Rothschild Fellowship, and the Michael Tien ’72 Excellence in Teaching Award. 

October 2, 2023

Min-ge Xie
Repro Samples Method for Addressing Irregular Inference Problems and for Unraveling Machine Learning Blackboxes 

Min-ge Xie headshot

Abstract: Rapid data science developments and the desire to have interpretable AI require us to have innovative frameworks to tackle frequently seen, but highly non-trivial “irregular inference problems,” e.g., those involving discrete or non-numerical parameters and those involving non-numerical data. This talk presents an effective and wide-reaching framework, called the repro samples method, to conduct statistical inference for the irregular problems and more. We develop theory to support the method and provide effective computing algorithms for problems in which explicit solutions are not available. The method is likelihood-free and is particularly effective for irregular inference problems. For commonly encountered irregular inference problems that involve discrete or non-numerical parameters, we propose a three-step procedure to make inferences for all parameters and develop a unique matching scheme that turns the disadvantage of lacking theoretical tools to handle discrete/non-numerical parameters into an advantage of improving computational efficiency. The effectiveness of the proposed method is illustrated through case studies by solving two highly non-trivial problems in statistics: a) how to quantify the uncertainty in the estimation of the unknown number of components and make inference for the associated parameters in a Gaussian mixture; b) how to quantify the uncertainty in model estimation and construct confidence sets for the unknown true model, the regression coefficients, or both true model and coefficients jointly in high-dimensional regression models. The method also has extensions to complex machine learning models, e.g., (ensemble) tree models, neural networks, graphical models, etc. It provides a new toolset to develop interpretable AI and to help address the blackbox issues in complex machine learning models.

Bio: Min-ge Xie, PhD, is a Distinguished Professor at Rutgers, The State University of New Jersey. Dr. Xie received his PhD in Statistics from the University of Illinois at Urbana-Champaign and his BS in Mathematics from the University of Science and Technology of China. He is the current Editor of The American Statistician and a co-founding Editor-in-Chief of The New England Journal of Statistics in Data Science. He is a fellow of the ASA and IMS and an elected member of the ISI. His research interests include theoretical foundations of statistical inference and data science, fusion learning, finite and large sample theories, and parametric and nonparametric methods. He is the Director of the Rutgers Office of Statistical Consulting and has rich interdisciplinary research experience collaborating with computer scientists, engineers, biomedical researchers, and scientists in other fields.

September 25, 2023

Walter Dempsey
Two challenges in time-varying causal effect moderation analysis in mobile health

Walter Dempsey headshot

Abstract: Twin revolutions in wearable technologies and smartphone-delivered digital health interventions have significantly expanded the accessibility and uptake of mobile health (mHealth) interventions in multiple domains of health sciences. Sequentially randomized experiments called micro-randomized trials (MRTs) have grown in popularity as a means to empirically evaluate the effectiveness of mHealth intervention components. MRTs have motivated a new class of causal estimands, termed “causal excursion effects”, that allow health scientists to answer important scientific questions about how intervention effectiveness may change over time or be moderated by individual characteristics, time-varying context, or past responses. In this talk, we present two new tools for causal effect moderation analysis. First, we consider a meta-learner perspective, where any supervised learning algorithm can be used to assist in the estimation of the causal excursion effect. We will present theoretical results and accompanying simulation experiments to demonstrate relative efficiency gains. Practical utility of the proposed methods is demonstrated by analyzing data from a multi-institution cohort of first-year medical residents in the United States. Second, we will consider effect moderation with tens or hundreds of potential moderators. In this setting, it becomes necessary to use the observed data to select a simpler model for effect moderation and then make valid statistical inference. We propose a two-stage procedure to solve this problem that leverages recent advances in post-selective inference using randomization. We will discuss asymptotic validity of the conditional selective inference procedure and the importance of randomization. Simulation studies verify the asymptotic results. We end with an analysis of an MRT for promoting physical activity in cardiac rehabilitation to demonstrate the utility of the method.

Bio: Walter Dempsey is an Assistant Professor in the Department of Biostatistics, Assistant Professor of Data Science at the Michigan Institute for Data Science (MIDAS), and an Assistant Research Professor in the d3lab located in the Institute for Social Research at the University of Michigan. His research focuses on statistical methods for digital and mobile health. Specifically, his current work involves three complementary research themes: (1) experimental design and data analytic methods to inform multi-stage decision making in health; (2) statistical modeling of complex longitudinal and survival data; and (3) statistical modeling of complex relational structures such as interaction networks.

September 18, 2023

Matias Cattaneo
Adaptive Decision Tree Method

Matias Cattaneo headshot

Abstract: The talk is based on two recent papers, "On the Pointwise Behavior of Recursive Partitioning and Its Implications for Heterogeneous Causal Effect Estimation", and "Convergence Rates of Oblique Regression Trees for Flexible Function Libraries".

Bio: Matias D. Cattaneo is a Professor of Operations Research and Financial Engineering (ORFE) at Princeton University, where he is also an Associated Faculty in the Department of Economics, the Center for Statistics and Machine Learning (CSML), and the Program in Latin American Studies (PLAS). His research spans econometrics, statistics, data science, and decision science, with applications to program evaluation and causal inference. Most of his work is interdisciplinary and motivated by quantitative problems in the social, behavioral, and biomedical sciences. As part of his main research agenda, he has developed novel nonparametric, semiparametric, high-dimensional, and machine learning estimation and inference procedures with demonstrably superior robustness to tuning parameter and other implementation choices.