November 27, 2023 | 4:00-5:00pm, Posner 153
Pattern Graphs: a Graphical Approach to Nonmonotone Missing Data
Abstract: We introduce the concept of pattern graphs: directed acyclic graphs representing how response patterns are associated. A pattern graph encodes an identifying restriction that is nonparametrically identified/saturated and is often a missing-not-at-random restriction. We introduce selection model and pattern mixture model formulations based on pattern graphs and show that they are equivalent. A pattern graph leads to an inverse probability weighting estimator as well as an imputation-based estimator. We also study the semiparametric efficiency theory and derive a multiply robust estimator using pattern graphs.
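The inverse probability weighting idea mentioned in the abstract can be illustrated in a minimal two-pattern setting (outcome either fully observed or missing, with missingness driven by an always-observed covariate). This is a generic IPW sketch, not the pattern-graph estimator from the talk, and the logistic response model is an assumption made purely for the illustration:

```python
import numpy as np

def ipw_mean(y, observed, x):
    """Hajek-style IPW estimate of E[y] when y is missing for some units.

    Assumes missingness depends only on the fully observed covariate x
    (a two-pattern setting, the simplest "pattern graph"); the response
    probability P(observed | x) is fit by logistic regression.
    """
    X = np.column_stack([np.ones_like(x), x])
    r = observed.astype(float)
    beta = np.zeros(2)
    for _ in range(25):  # Newton-Raphson for the logistic likelihood
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        grad = X.T @ (r - p)
        hess = X.T @ (X * (p * (1 - p))[:, None])
        beta += np.linalg.solve(hess, grad)
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    obs = observed.astype(bool)
    # weight each observed outcome by 1 / P(observed | x)
    return (y[obs] / p[obs]).sum() / (1.0 / p[obs]).sum()
```

On simulated data where both the outcome and the chance of observing it increase with x, the naive complete-case mean is biased upward while this weighted estimate recovers E[y]; the pattern-graph machinery in the talk generalizes the weighting to many nonmonotone response patterns.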
Bio: Dr. Chen is an associate professor in the Department of Statistics and a data science fellow in the eScience Institute at the University of Washington. He also serves as a co-investigator and statistician at the National Alzheimer’s Coordinating Center. Dr. Chen has received several awards, including the NSF CAREER Award and the ASA Noether Young Scholar Award.
Covariance alignment: from MLE to Gromov-Wasserstein
Abstract: Feature alignment methods are used in many scientific disciplines for data pooling, annotation, and comparison. As an instance of a permutation learning problem, feature alignment presents significant statistical and computational challenges. In this talk, I will introduce the covariance alignment model to study and compare various alignment methods, and establish a minimax lower bound for covariance alignment that has a non-standard dimension scaling because of the presence of a nuisance parameter. This lower bound is in fact minimax optimal and is achieved by a natural quasi-MLE. However, this estimator involves a search over all permutations, which is computationally infeasible even when the problem has a moderate size. To overcome this limitation, I will show that the celebrated Gromov-Wasserstein algorithm from optimal transport, which is more amenable to fast implementation even on large-scale problems, is also minimax optimal. These results give the first statistical justification for deploying the Gromov-Wasserstein algorithm in practice. Finally, I will also discuss connections to recent literature on statistical graph matching and orthogonal statistical learning. Based on joint work with Philippe Rigollet and George Stepaniants.
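To make the computational obstacle concrete, here is a brute-force version of the kind of search over permutations the abstract describes, cast as minimizing a Frobenius discrepancy between two covariance matrices; this is an illustrative simplification, not the paper's quasi-MLE:

```python
import itertools
import numpy as np

def align_covariances(Sx, Sy):
    """Search all d! permutations for the relabeling of Sy's coordinates
    that best matches Sx in Frobenius norm.

    Feasible only for small d: d = 10 already means 3,628,800 candidates,
    which is why a relaxation such as Gromov-Wasserstein is needed.
    """
    d = Sx.shape[0]
    best_perm, best_err = None, np.inf
    for perm in itertools.permutations(range(d)):
        idx = np.array(perm)
        # permute rows and columns of Sy simultaneously
        err = np.linalg.norm(Sx - Sy[np.ix_(idx, idx)])
        if err < best_err:
            best_perm, best_err = perm, err
    return best_perm, best_err
```

The Gromov-Wasserstein relaxation discussed in the talk replaces this exhaustive search with an optimization over couplings that remains tractable on large problems.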
Bio: Yanjun Han is an assistant professor of mathematics and data science at the Courant Institute of Mathematical Sciences and the Center for Data Science, New York University. He received his Ph.D. in Electrical Engineering from Stanford University in August 2021, under the supervision of Tsachy Weissman. After that, he spent one year as a postdoctoral scholar at the Simons Institute for the Theory of Computing, UC Berkeley, and another year as a Norbert Wiener postdoctoral associate in the Statistics and Data Science Center at MIT, mentored by Sasha Rakhlin and Philippe Rigollet. Honors for his past work include a best student paper finalist selection at ISIT 2016, a student paper award at ISITA 2016, and an Annals of Statistics Special Invited Session at JSM 2021. His research interests include high-dimensional and nonparametric statistics, bandits, and information theory.
The Carnegie Mellon Sports Analytics Conference is an annual event dedicated to highlighting the latest sports research from the statistics and data science community.
A Bayesian model for joint longitudinal and survival outcomes in the presence of subpopulation heterogeneity
Abstract: Biomedical studies often monitor subjects using a longitudinal marker that may be informative about a time-to-event outcome of interest. An example is periodic monitoring of CD4 cell count as the longitudinal marker and time to death from AIDS. By modeling these two outcomes jointly, there is potential to improve the precision of inference for each. Brown and Ibrahim (2003, Biometrics) adopted the common approach of incorporating the mean longitudinal trajectory as a predictor for the event time hazard, and enhanced flexibility by using a Dirichlet process prior for the coefficients of the trajectory. We generalize this model to accommodate non-Gaussian longitudinal outcomes and emphasize that the Dirichlet process enables discovery of subgroups of subjects with distinct behavior for the joint outcome. Our formulation, developed using the multivariate log-gamma distribution, offers greater flexibility in the longitudinal model and computational advantage for Markov chain Monte Carlo estimation. We illustrate with simulation and application. This is joint work with Pengpeng Wang and Jonathan Bradley.
Bio: Elizabeth joined the Department of Statistics at Florida State University in 2011 as the Duncan McLean and Pearl Levine Fairweather Professor of Statistics. She is honored to hold this named professorship, especially after having the opportunity to meet Dr. David Fairweather (Ph.D. FSU Statistics, 1970), whose generosity made the position possible.
Elizabeth received her Ph.D. in Statistics from Carnegie Mellon University in Pittsburgh, PA and joined the faculty at Cornell University in the Department of Operations Research and Industrial Engineering (now Operations Research and Information Engineering) in 1992. She visited the Biometry Research Group of the National Cancer Institute in 1999-2000 as a Visiting Mathematical Statistician, which enhanced her interest in biostatistics. In 2000, she joined the Medical University of South Carolina (MUSC), where she collaborated broadly with clinical and basic science researchers. She directed the Biostatistics Core for the MUSC Center for Oral Health Research (2002-2011) and created and directed the NIH-supported predoctoral training program "Biostatistics for Basic Biomedical Research" (2005-2011).
Elizabeth's recent research is in longitudinal data analysis, Bayesian modeling and recurrent events, with applications in oral health research, disease biomarkers and other health research areas. She has many publications in statistics and medical journals and has received several grants from the National Science Foundation, National Institutes of Health, Department of Defense and other sources. Elizabeth is a Fellow of the American Statistical Association, a Fellow of the Institute of Mathematical Statistics and an Elected Member of the International Statistical Institute. FSU recognized her with the title of Distinguished Research Professor in 2019 and with the Graduate Mentor Award in 2022. She received the Paul Minton Award from the Southern Regional Council on Statistics in 2022.
Let's get practical: measuring (algorithmic) bias in the real world
Abstract: While measures and mitigations for algorithmic bias have proliferated in the academic literature over the last several years, implementing them in practice is frequently not straightforward. This talk delves into the practical implementation of algorithmic bias metrics and mitigation techniques, exploring the challenges and solutions through three distinct case studies. The first case study explores the difficulty of measuring and communicating algorithmic bias when it is assessed across many different groups. In this example, we develop a low-dimensional summary statistic of algorithmic bias and discuss its statistical properties. In the second example, we build upon existing fair clustering techniques and apply them in the context of polling location assignment. Here, we confront the challenges arising from measuring distance between voters and polling locations and demonstrate how "fair" clustering based on poorly designed distance metrics could exacerbate disparities in voter turnout. Finally, in the third case study, unsatisfied with existing measures of bias for LLMs, I explore the extent of bias present based on my own usage patterns of the models.
Bio: Kristian Lum is an Associate Research Professor at the University of Chicago Data Science Institute. She has previously held positions at Twitter as a Sr. Staff ML Researcher and Research Lead for the Machine Learning Ethics, Accountability, and Transparency group; the University of Pennsylvania Department of Computer and Information Science; and at the Human Rights Data Analysis Group, where she led work on the use of algorithms in the criminal justice system. She is also a co-founder of the ACM Conference on Fairness, Accountability, and Transparency (FAccT). Her current research focuses on algorithmic bias/fairness and social impact data science.
Gromov-Wasserstein Alignment: Statistical and Computational Advancements via Duality
Abstract: The Gromov-Wasserstein (GW) distance quantifies dissimilarity between metric measure (mm) spaces and provides a natural correspondence between them. As such, it serves as a figure of merit for applications involving alignment of heterogeneous datasets, including object matching, single-cell genomics, and matching language models. While various heuristic methods for approximately evaluating the GW distance from data have been developed, formal guarantees for such approaches, both statistical and computational, remained elusive. This work closes these gaps for the quadratic GW distance between Euclidean mm spaces of different dimensions. At the core of our proofs is a novel dual representation of the GW problem as an infimum of certain optimal transportation problems. The dual form enables deriving, for the first time, sharp empirical convergence rates for the GW distance by providing matching upper and lower bounds. For computational tractability, we consider the entropically regularized GW distance. We derive bounds on the entropic approximation gap, establish sufficient conditions for convexity of the objective, and devise efficient algorithms with global convergence guarantees. These advancements facilitate principled estimation and inference methods for GW alignment problems that are efficiently computable via the derived algorithms.
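For reference, the entropically regularized GW problem is commonly attacked with an alternating scheme: linearize the quadratic objective at the current coupling, then solve an entropic optimal transport problem via Sinkhorn iterations (in the style of Peyré, Cuturi, and Solomon). The sketch below is that generic textbook iteration, not the dual-based algorithms of the talk:

```python
import numpy as np

def sinkhorn(cost, p, q, eps, n_iter=200):
    """Entropic OT: coupling with marginals p, q for the given cost matrix."""
    K = np.exp(-(cost - cost.min()) / eps)  # shift cost for numerical stability
    u = np.ones_like(p)
    for _ in range(n_iter):
        v = q / (K.T @ u)
        u = p / (K @ v)
    return u[:, None] * K * v[None, :]

def entropic_gw(C1, C2, p, q, eps=0.1, n_outer=50):
    """Entropic Gromov-Wasserstein with square loss between distance
    matrices C1 (on the p-space) and C2 (on the q-space)."""
    # constant part of the linearized cost for L(a, b) = (a - b)^2
    const = (C1**2 @ p)[:, None] + (C2**2 @ q)[None, :]
    T = np.outer(p, q)  # independent coupling as the initial point
    for _ in range(n_outer):
        cost = const - 2.0 * C1 @ T @ C2.T  # gradient of the quadratic objective
        T = sinkhorn(cost, p, q, eps)
    return T
```

Each outer step projects the linearized problem back onto the set of couplings; the talk's results speak to when such schemes provably converge and how well the regularized value approximates the unregularized GW distance.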
Bio: Ziv Goldfeld is an Assistant Professor in the School of Electrical and Computer Engineering, and a graduate field member in the Center for Applied Mathematics, Statistics and Data Science, Computer Science, and Operations Research and Information Engineering, at Cornell University. Before joining Cornell, he was a postdoctoral research fellow in LIDS at MIT. Ziv graduated with a B.Sc., M.Sc., and Ph.D. (all summa cum laude) in Electrical and Computer Engineering from Ben Gurion University, Israel. Ziv’s research interests include optimal transport theory, statistical learning theory, information theory, and mathematical statistics. Honors include the NSF CAREER Award, the IBM University Award, the Rothschild Fellowship, and the Michael Tien ’72 Excellence in Teaching Award.
Repro Samples Method for Addressing Irregular Inference Problems and for Unraveling Machine Learning Blackboxes
Abstract: Rapid developments in data science and the desire for interpretable AI require innovative frameworks to tackle frequently seen but highly non-trivial "irregular inference problems," e.g., those involving discrete or non-numerical parameters or non-numerical data. This talk presents an effective and wide-reaching framework, called the repro samples method, for conducting statistical inference on these irregular problems and more. We develop supporting theory and provide effective computing algorithms for problems in which explicit solutions are not available. The method is likelihood-free and is particularly effective for irregular inference problems. For commonly encountered irregular inference problems involving discrete or non-numerical parameters, we propose a three-step procedure to make inferences for all parameters and develop a unique matching scheme that turns the disadvantage of lacking theoretical tools for discrete or non-numerical parameters into an advantage of improved computational efficiency. The effectiveness of the proposed method is illustrated through case studies solving two highly nontrivial problems in statistics: a) how to quantify the uncertainty in the estimation of the unknown number of components and make inference for the associated parameters in a Gaussian mixture; b) how to quantify the uncertainty in model estimation and construct confidence sets for the unknown true model, the regression coefficients, or both jointly in high-dimensional regression models. The method also extends to complex machine learning models, e.g., (ensemble) tree models, neural networks, and graphical models. It provides a new toolset for developing interpretable AI and addressing black-box issues in complex machine learning models.
Bio: Min-ge Xie, PhD, is a Distinguished Professor at Rutgers, The State University of New Jersey. Dr. Xie received his PhD in Statistics from the University of Illinois at Urbana-Champaign and his BS in Mathematics from the University of Science and Technology of China. He is the current Editor of The American Statistician and a co-founding Editor-in-Chief of The New England Journal of Statistics in Data Science. He is a fellow of the ASA and IMS and an elected member of the ISI. His research interests include theoretical foundations of statistical inference and data science, fusion learning, finite- and large-sample theories, and parametric and nonparametric methods. He is the Director of the Rutgers Office of Statistical Consulting and has rich interdisciplinary research experience collaborating with computer scientists, engineers, biomedical researchers, and scientists in other fields.
Two challenges in time-varying causal effect moderation analysis in mobile health
Abstract: Twin revolutions in wearable technologies and smartphone-delivered digital health interventions have significantly expanded the accessibility and uptake of mobile health (mHealth) interventions in multiple domains of health sciences. Sequentially randomized experiments called micro-randomized trials (MRTs) have grown in popularity as a means to empirically evaluate the effectiveness of mHealth intervention components. MRTs have motivated a new class of causal estimands, termed “causal excursion effects,” which allow health scientists to answer important scientific questions about how intervention effectiveness may change over time or be moderated by individual characteristics, time-varying context, or past responses. In this talk, we present two new tools for causal effect moderation analysis. First, we consider a meta-learner perspective, where any supervised learning algorithm can be used to assist in the estimation of the causal excursion effect. We will present theoretical results and accompanying simulation experiments to demonstrate relative efficiency gains. Practical utility of the proposed methods is demonstrated by analyzing data from a multi-institution cohort of first-year medical residents in the United States. Second, we will consider effect moderation with tens or hundreds of potential moderators. In this setting, it becomes necessary to use the observed data to select a simpler model for effect moderation and then make valid statistical inference. We propose a two-stage procedure to solve this problem that leverages recent advances in post-selective inference using randomization. We will discuss asymptotic validity of the conditional selective inference procedure and the importance of randomization. Simulation studies verify the asymptotic results. We end with an analysis of an MRT for promoting physical activity in cardiac rehabilitation to demonstrate the utility of the method.
Bio: Walter Dempsey is an Assistant Professor in the Department of Biostatistics, Assistant Professor of Data Science at the Michigan Institute of Data Science (MIDAS), and an Assistant Research Professor in the d3lab located in the Institute of Social Research at the University of Michigan. His research focuses on Statistical Methods for Digital and Mobile Health. Specifically, his current work involves three complementary research themes: (1) experimental design and data analytic methods to inform multi-stage decision making in health; (2) statistical modeling of complex longitudinal and survival data; and (3) statistical modeling of complex relational structures such as interaction networks.
Adaptive Decision Tree Method
Abstract: The talk is based on two recent papers: "On the Pointwise Behavior of Recursive Partitioning and Its Implications for Heterogeneous Causal Effect Estimation" and "Convergence Rates of Oblique Regression Trees for Flexible Function Libraries".
Bio: Matias D. Cattaneo is a Professor of Operations Research and Financial Engineering (ORFE) at Princeton University, where he is also an Associated Faculty in the Department of Economics, the Center for Statistics and Machine Learning (CSML), and the Program in Latin American Studies (PLAS). His research spans econometrics, statistics, data science, and decision science, with applications to program evaluation and causal inference. Most of his work is interdisciplinary and motivated by quantitative problems in the social, behavioral, and biomedical sciences. As part of his main research agenda, he has developed novel nonparametric, semiparametric, high-dimensional, and machine learning estimation and inference procedures with demonstrably superior robustness to tuning parameter and other implementation choices.