Upcoming Events
December 6, 2024 | Wean 5409 | 1:15-2:15pm
David Shih
Rutgers University
Talk Title: Searching for the Unexpected from Colliders to Stars with Modern Machine Learning
This is a STAMPS hybrid event and will also be held via Zoom.
Abstract: Modern machine learning and generative AI are having an exciting impact on fundamental physics, allowing us to see deeper into the data and enabling new kinds of analyses that were not possible before. I will describe how we are using generative AI to develop powerful new model-agnostic methods for new physics searches at the Large Hadron Collider, and how these methods can also be applied to data from the Gaia Space Telescope to search for stellar streams. I will also describe how these same generative AI techniques can be used to perform a novel measurement of the local dark matter density using stars from Gaia as tracers of the Galactic potential.
Bio: David Shih is a Professor in the New High Energy Theory Center and the Department of Physics & Astronomy at Rutgers University. His current research focuses on developing new machine learning methods to tackle the major open questions in fundamental physics -- such as the nature of dark matter and new particles and forces beyond the Standard Model -- using big datasets from particle colliders and astronomy. His work has touched on many key topics at the intersection of ML and fundamental physics, including generative models, anomaly detection, AI fairness, feature selection, and interpretability. Shih is the recipient of a DOE Early Career Award, a Sloan Foundation Fellowship, the Macronix Prize, and the Humboldt Bessel Research Award.
Past Events
November 18, 2024
Paul Gustafson
University of British Columbia, Department of Statistics
Talk Title: Bayesian Inference when Parameter Identification is Lacking: A Narrative Arc across Applications, Methods, and Theory
Abstract: Partially identified models generally yield “in between” statistical behavior. As the sample size goes to infinity, the posterior distribution on the target parameter heads to a distribution narrower than the prior distribution but wider than a single point. Such models arise naturally in many areas, including the health sciences. They arise particularly when we own up to limitations in how data are acquired. I aim to highlight the narrative arc associated with partial identification. This runs from the applied (e.g., broaching the topic with subject-area scientists), to the methodological (e.g., implementing a Bayesian analysis without full identification), to the theoretical (e.g., characterizing what is going on as generally as possible). As in many areas of statistics, there is good scope to get involved across the whole arc, rather than just at one end or the other.
There will be a reception following the talk.
November 11, 2024
Xin Tong
University of Southern California, Data Sciences & Operations
Talk Title: Monoculture and Social Welfare of the Algorithmic Personalization Market under Competition
Abstract: Algorithmic personalization markets, where providers utilize algorithms to predict user types and personalize products or services, have become increasingly prevalent in our daily lives. The adoption of more accurate algorithms holds the promise of improving social welfare through enhanced predictive accuracy. However, concerns have been raised about algorithmic monoculture, where all providers adopt the same algorithms. The prevalence of a single algorithm can hinder social welfare due to the resulting homogeneity of available products or services. In this work, we address the emergence of algorithmic monoculture from the perspective of providers' behavior under competition in the algorithmic personalization market. We propose that competition among providers could mitigate monoculture, thereby enhancing social welfare in the algorithmic personalization market. By examining the impact of competition on algorithmic diversity, our study contributes to a deeper understanding of the dynamics within algorithmic personalization markets and offers insights into strategies for promoting social welfare in these contexts.
Bio: Xin Tong is an associate professor in the Department of Data Sciences and Operations at the University of Southern California. His current research focuses on learning with partial information and asymmetry, social and economic networks, and AI ethics. He is an associate editor for JASA and JBES.
October 28, 2024
Kaizheng Wang
Columbia University, Industrial Engineering and Operations Research
Talk Title: Adaptive Transfer Clustering
Abstract: We develop a transfer learning framework for clustering given a main dataset and an auxiliary one about the same subjects. The two datasets may reflect similar but different grouping structures. We propose an adaptive transfer clustering (ATC) algorithm that automatically leverages the commonality in the presence of unknown discrepancy, by optimizing an estimated bias-variance decomposition. It applies to a broad class of statistical models, including Gaussian mixture models, stochastic block models, and latent class models. A theoretical analysis proves the optimality of ATC under the Gaussian mixture model and explicitly quantifies the benefit of transfer. Extensive simulations and real data experiments confirm its effectiveness in various scenarios. The talk is based on joint work with Zhongyuan Lyu and Yuqi Gu.
Bio: Kaizheng Wang is an assistant professor of Industrial Engineering and Operations Research, and a member of the Data Science Institute at Columbia University. He works at the intersection of statistics, machine learning, and optimization. He obtained his Ph.D. from Princeton University in 2020 and B.S. from Peking University in 2015.
October 21, 2024
Gennady Samorodnitsky
Cornell University, Department of Statistics and Data Science
Talk Title: Kernel PCA for learning multivariate extremes
Abstract: We provide general insights into the kernel PCA algorithm that can effectively identify clusters of preimages when the data consist of a discrete signal with added noise. We then apply kernel PCA to describe the dependence structure of multivariate extremes. Kernel PCA has been motivated as a tool for denoising and clustering of the approximate preimages. The idea is that such structure should be captured by the first principal components in a suitable function space. We provide some simple insights that naturally lead to clustered preimages when the underlying data come from a discrete signal corrupted by noise. Specifically, we use the Davis-Kahan theory to give a perturbation bound on the performance of preimages that quantifies the impact of noise in clustering a discrete signal. We then propose kernel PCA as a method for analyzing the dependence structure of multivariate extremes and demonstrate that it can be a powerful tool for clustering and dimension reduction. In this case, kernel PCA is applied only to the extremal part of the sample, i.e., the angular part of random vectors for which the radius exceeds a large threshold. More specifically, we focus on the asymptotic dependence of multivariate extremes characterized by the angular or spectral measure in extreme value theory and provide a careful analysis in the case where the extremes are generated from a linear factor model.
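To make the extremal preprocessing concrete, here is a minimal illustrative sketch (not the speaker's code) that applies kernel PCA only to the angular part of observations whose radius exceeds a high threshold; the radial quantile, the RBF kernel, and the heavy-tailed toy data are assumptions made for this example.

```python
# Illustrative sketch: kernel PCA on the extremal (angular) part of a multivariate sample.
# The threshold quantile and kernel choice are arbitrary for this example.
import numpy as np
from sklearn.decomposition import KernelPCA

def extremal_kernel_pca(X, radial_quantile=0.95, n_components=2):
    radius = np.linalg.norm(X, axis=1)
    extreme = radius > np.quantile(radius, radial_quantile)   # keep only large-radius observations
    angles = X[extreme] / radius[extreme, None]                # angular part: project onto the unit sphere
    return KernelPCA(n_components=n_components, kernel="rbf").fit_transform(angles)

# Toy heavy-tailed data in two dimensions
rng = np.random.default_rng(0)
X = rng.standard_t(df=2, size=(5000, 2))
embedding = extremal_kernel_pca(X)
```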
Bio: Gennady Samorodnitsky is a professor of Operations Research and Information Engineering at Cornell University. His research interests range from machine learning and differential privacy to extreme value theory, phase transitions between short- and long-range dependence, the topology of random objects, and the interplay between probability and ergodic theory. He is a fellow of the Institute of Mathematical Statistics.
October 11, 2024
Gwendolyn Eadie
University of Toronto
Talk Title: Studying the Universe with Astrostatistics
Abstract: Astrostatistics is a growing interdisciplinary field at the interface of astronomy and statistics. Astronomy is a field rich with publicly available data, but inference using these data must acknowledge selection effects, measurement uncertainty, censoring, and missingness. In the Astrostatistics Research Team (ART) at the University of Toronto --- a joint team between the David A. Dunlap Department of Astronomy & Astrophysics and the Department of Statistical Sciences --- we take an interdisciplinary approach to analysing astronomical data from a range of objects such as stars, old clusters, and galaxies. In this talk, I will cover three ART projects that employ Bayesian inference techniques to: (1) find stellar flares in time series data from stars using hidden Markov models, (2) investigate the relationship between old star cluster populations and their host galaxies using hurdle models, and (3) discover potential "dark" galaxies within an inhomogeneous Poisson Process framework.
Bio: Gwendolyn Eadie is an Assistant Professor of Astrostatistics at the University of Toronto, jointly appointed between the Department of Statistical Sciences and the David A. Dunlap Department of Astronomy & Astrophysics. She is the founder and co-leader of UofT's Astrostatistics Research Team, and works on a range of projects that use hierarchical Bayesian inference to study galaxies, globular star clusters, stars, and fast radio bursts. She is also the current Chair of the Astrostatistics Interest Group of the American Statistical Association and the Chair of the Working Group on Astroinformatics & Astrostatistics of the American Astronomical Society.
October 7, 2024
Xiu Yang
Lehigh University, Systems Engineering
Talk Title: Challenges and Opportunities in Quantum Computing for Data Science
Abstract: Exploring the potential opportunities offered by quantum computing (QC) to speed up the solution of challenging application problems has attracted significant attention in recent years. A key barrier in developing QC methods is the error induced by noise in the hardware as well as the statistical error in the measurement. In this talk, we will first introduce implementations of statistical methods such as Bayesian inference for modeling and mitigating the error in prototype quantum circuits. Next, an alternative approach for a specific optimization problem will be presented to illustrate how uncertainty quantification can be conducted by identifying a new algorithm design. Finally, we will discuss the potential of QC to accelerate the development of AI/machine learning in training and implementing models.
Bio: Xiu Yang joined Lehigh from Pacific Northwest National Laboratory (PNNL), where he had been a scientist since 2016. His research has been centered around modern scientific computing, including uncertainty quantification, multi-scale modeling, physics-informed machine learning, and data-driven scientific discovery. Xiu has been applying his methods to various research areas such as fluid dynamics, hydrology, biochemistry, soft materials, energy storage, and power grid systems. Currently, he is focusing on uncertainty quantification in quantum computing algorithms and machine learning methods for scientific computing. He received a Faculty Early Career Development Program (CAREER) Award from NSF in 2022 and Outstanding Performance Awards from PNNL in 2015 and 2016. Xiu also served on the DOE applied mathematics visioning committee in 2019.
September 30, 2024
Kary Myers
Los Alamos National Laboratory, Department of Statistics
Talk Title: Community Detection (the life kind, not the network science kind)
Abstract: In this talk, I’ll share some behind-the-scenes career highs and lows that led to my current (very fun) role at Los Alamos National Laboratory. I’ll also offer thoughts on creating career resilience by expanding the definition of “communities” and bringing an intentionality to how you engage with them.
Bio: Kary Myers is a fellow of the American Statistical Association and currently leads a group of ~40 scientists and R&D engineers in the Space Remote Sensing and Data Science Group at Los Alamos National Laboratory (LANL). With support from an AT&T Labs Fellowship, she earned her Ph.D. from Carnegie Mellon’s Statistics and Data Science Department and her MS from their Machine Learning Department before joining LANL in 2006. She spent 15 years as a scientist in the Statistical Sciences Group at Los Alamos, including a few years as their deputy group leader and as the Deputy Director for Data Science in LANL’s Information Science and Technology Institute. She also served as LANL’s Intelligence and Emerging Threats Program Manager for Data Science. She’s been involved with a range of data-intensive projects, from analyzing electromagnetic measurements, to aiding large scale computer simulations, to developing analyses for chemical spectra from the Mars Science Laboratory Curiosity Rover. She served as an associate editor for the Annals of Applied Statistics and the Journal of Quantitative Analysis in Sports, and she created and organizes CoDA, the Conference on Data Analysis, to showcase data-driven research from across the Department of Energy.
September 20, 2024
STAMPS Research Center Launch Event
Tepper Quad, Simmons Auditorium B
Coffee & Refreshments at 3:30 PM, reception following the event.
STAMPS (STAtistical Methods for the Physical Sciences) is one of the few university research groups specializing in foundational statistics and AI research for physical science applications, ranging from particle physics and astrophysics to climate and environmental sciences. As of fall 2024, STAMPS is a CMU Research Center!
September 16, 2024
Maggie Niu
Preferential Latent Space Models for Networks with Textual Edges
Abstract: Many real-world networks contain rich textual information in the edges, such as email networks where an edge between two nodes is an email exchange. Other examples include co-author networks and social media networks. The useful textual information carried in the edges is often discarded in most network analyses, resulting in an incomplete view of the relationships between nodes. In this work, we propose to represent the text document between each pair of nodes as a vector counting the appearances of keywords extracted from the corpus, and introduce a new and flexible preferential latent space network model that can offer direct insights on how contents of the textual exchanges modulate the relationships between nodes. We establish identifiability conditions for the proposed model and tackle model estimation with a computationally efficient projected gradient descent algorithm. We further derive the non-asymptotic error bound of the estimator from each step of the algorithm. The efficacy of our proposed method is demonstrated through simulations and an analysis of the Enron email network.
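As a toy illustration of the edge-text representation described in the abstract (a hedged sketch, not the authors' code), each pair of nodes can be mapped to a vector counting keyword appearances in the text they exchange; the keyword list and example messages below are hypothetical.

```python
# Toy sketch: represent the text exchanged along one network edge as a keyword-count vector.
# The keyword list and example emails are hypothetical.
from collections import Counter

KEYWORDS = ["meeting", "budget", "contract"]

def edge_vector(documents):
    """Count appearances of each keyword across all documents exchanged on an edge."""
    words = (w.strip(".,!?").lower() for doc in documents for w in doc.split())
    counts = Counter(words)
    return [counts[k] for k in KEYWORDS]

# e.g., two emails between a pair of nodes
print(edge_vector(["Budget meeting at noon.", "Please review the contract budget."]))  # [1, 2, 1]
```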
Bio: Xiaoyue Maggie Niu is an Associate Professor of Statistics and Director of the Statistical Consulting Center at Penn State. Her research focuses on the development of statistical models that solve real world problems, especially with applications in health and social sciences. The methodological approaches she takes include Bayesian methods, social network models, and latent variable models. Another big part of her work is in the statistical consulting center. She collaborates with a variety of researchers on campus and mentors graduate students to work with them. Solving practically important problems and interacting with people from diverse backgrounds are the most enjoyable part of her work. She received her Ph.D. in Statistics from University of Washington and her B.S. in Applied Math from Peking University.
April 22, 2024
Anderson Zhang
Spectral Clustering: Methodology and Statistical Analysis
Abstract: Spectral clustering is one of the most popular algorithms for grouping high-dimensional data. It is easy to implement, computationally efficient, and has achieved tremendous success in many applications. The idea behind spectral clustering is dimensionality reduction. It first performs a spectral decomposition of the dataset and keeps only the leading few spectral components to reduce the data's dimension. It then applies a standard method such as k-means in the low-dimensional space to do the clustering. In this talk, we demystify the success of spectral clustering by providing a sharp statistical analysis of its performance under mixture models. For isotropic Gaussian mixture models, we show spectral clustering is optimal. For sub-Gaussian mixture models, we derive exponential error rates for spectral clustering. To establish these results, we develop a new spectral perturbation analysis for singular subspaces.
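For readers less familiar with the two-step recipe sketched in the abstract, here is a minimal illustrative example (not the speaker's code) using NumPy and scikit-learn: a rank-k spectral decomposition of the data matrix followed by k-means in the reduced space; the toy Gaussian mixture is an assumption for the example.

```python
# Minimal sketch of spectral clustering as described above: keep the leading k
# spectral components, then run k-means in the low-dimensional space.
import numpy as np
from sklearn.cluster import KMeans

def spectral_cluster(X, k):
    """Cluster the rows of X into k groups via a rank-k SVD followed by k-means."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    X_low = U[:, :k] * s[:k]                      # projection onto the top-k singular subspace
    return KMeans(n_clusters=k, n_init=10).fit_predict(X_low)

# Toy two-component Gaussian mixture
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2.0, 1.0, size=(100, 50)), rng.normal(2.0, 1.0, size=(100, 50))])
labels = spectral_cluster(X, k=2)
```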
Bio: Anderson Ye Zhang is an Assistant Professor in the Department of Statistics and Data Science at the Wharton School, with a secondary appointment in the Department of Computer and Information Science, at the University of Pennsylvania. Before joining Wharton, he was a William H. Kruskal Instructor in the Department of Statistics at the University of Chicago. He obtained his Ph.D. degree from the Department of Statistics and Data Science at Yale University. His research interests include network analysis, clustering, spectral analysis, and ranking from pairwise comparisons.
April 12, 2024
Alicia Carriquiry
Statistics and its Applications in Forensic Science and the Criminal Justice System
Abstract: Steve Fienberg was a pioneer in highlighting the role of statistical thinking in the civil and criminal justice systems and was an early critic of many forensic methods that are still in use in US courts. One of his last achievements was the creation of the Center for Statistics and Applications in Forensic Evidence (CSAFE), a federally funded NIST Center of Excellence, with the mission to build the statistical foundation – where possible – for what is known as forensic pattern comparison disciplines and digital forensics.
Forensic applications present unique challenges for statisticians. For example, much of the data that arise in forensics are non-standard, so even defining analytical variables may require out-of-the-box thinking. As a result, the usual statistical approaches may not enable addressing the questions of interest to jurors, legal professionals and forensic practitioners.
Today’s presentation introduces some of the statistical and algorithmic methods proposed by CSAFE researchers that have the potential to impact forensic practice in the US. Two examples are used for illustration: the analysis of questioned handwritten documents and of marks imparted by firearms on bullets or cartridge cases. In both examples, the question we address is one of source: do two or more items have the same source? In the first case, we apply “traditional” statistical modeling methods, while in the second case, we resort to algorithmic approaches. Much of the research carried out in CSAFE is collaborative and while mission-driven, also academically rigorous, which would have pleased Steve tremendously.
Bio: Alicia Carriquiry (NAM) is Professor of Statistics at Iowa State University. Between January of 2000 and July of 2004, she was Associate Provost at Iowa State. Her research interests are in Bayesian statistics and general methods. Her recent work focuses on nutrition and dietary assessment, as well as on problems in genomics, forensic sciences and traffic safety. Dr. Carriquiry is an elected member of the International Statistical Institute and a fellow of the American Statistical Association. She serves on the Executive Committee of the Institute of Mathematical Statistics and has been a member of the Board of Trustees of the National Institute of Statistical Sciences since 1997. She is also a past president of the International Society for Bayesian Analysis (ISBA) and a past member of the Board of the Plant Sciences Institute at Iowa State University.
April 5, 2024
Andrew Gelman
Bayesian Workflow: Some Progress and Open Questions
Abstract: The workflow of applied Bayesian statistics includes not just inference but also model building, model checking, confidence-building using fake data, troubleshooting problems with computation, model understanding, and model comparison. We would like to codify these steps in the realistic scenario in which researchers are fitting many models for a given problem. We discuss various issues including prior distributions, data models, and computation, in the context of ideas such as the Fail Fast Principle and the Folk Theorem of Statistical Computing. We also consider some examples of Bayesian models that give bad answers and see if we can develop a workflow that catches such problems.
Bio: Andrew Gelman is a professor of statistics and political science at Columbia University. He has received the Outstanding Statistical Application award three times from the American Statistical Association, the award for best article published in the American Political Science Review, the Mitchell and DeGroot prizes from the International Society for Bayesian Analysis, and the Council of Presidents of Statistical Societies award. His books include Bayesian Data Analysis (with John Carlin, Hal Stern, David Dunson, Aki Vehtari, and Donald Rubin), Teaching Statistics: A Bag of Tricks (with Deborah Nolan), Data Analysis Using Regression and Multilevel/Hierarchical Models (with Jennifer Hill), Red State, Blue State, Rich State, Poor State: Why Americans Vote the Way They Do (with David Park, Boris Shor, and Jeronimo Cortina), A Quantitative Tour of the Social Sciences (co-edited with Jeronimo Cortina), and Regression and Other Stories (with Jennifer Hill and Aki Vehtari).
Andrew has done research on a wide range of topics, including: why it is rational to vote; why campaign polls are so variable when elections are so predictable; the effects of incumbency and redistricting; reversals of death sentences; police stops in New York City; the statistical challenges of estimating small effects; the probability that your vote will be decisive; seats and votes in Congress; social network structure; arsenic in Bangladesh; radon in your basement; toxicology; medical imaging; and methods in surveys, experimental design, statistical inference, computation, and graphics.
March 25, 2024
Glenn Shafer
Modernizing Cournot’s Principle
Abstract: In everyday English, a forecast is something less than a prediction. It is more like an estimate. When an economist forecasts 3.5% inflation in the United States next year, or my weather app forecasts 0.55 inches of rain, these are not exactly predictions. When the forecaster gives rain a 30% probability, this too is not a prediction. A prediction is more definite about what is predicted and about predicting it.
We might say that a probability is a prediction when it is very close to one. But this formulation has a difficulty: there are too many high probabilities. There is a high probability against every ticket in a lottery, but we cannot predict that no ticket will win.
Game-theoretic statistics resolves this problem by showing how some high probabilities are simpler than others. The simpler ones qualify as predictions.
This story has roles for Cournot’s principle, Kolmogorov’s algorithmic complexity, and de Finetti’s previsione. See www.probabilityandfinance.com and my two books on the topic with Vladimir Vovk.
Bio: In the 1970s, Glenn launched the “Dempster-Shafer” theory. The Belief Functions and Applications Society, devoted to this theory, has been holding international conferences since 2010.
During the past 25 years, Glenn and Vladimir Vovk launched game-theoretic probability and statistics. Their two books on the topic appeared in 2001 and 2019.
Glenn has published more than 20 papers on the history of probability and statistics. His most recent book, The Splendors and Miseries of Martingales: Their History from the Casino to Mathematics, co-edited with Laurent Mazliak, was published by Birkhäuser in 2022.
Glenn served in the Peace Corps in Afghanistan. At the Rutgers Business School, he served as director of the doctoral program for ten years and as dean for four years.
March 18, 2024
Martin Wainwright
Challenges with Covariate Shift: From Prediction to Causal Inference
Abstract: In many modern uses of predictive methods, there can be shifts between the distributional properties of the training data and those of the test data. Such mismatches can cause dramatic reductions in accuracy that remain mysterious. How can we find practical procedures that mitigate such effects in an optimal way? In this talk, we discuss the fundamental limits of problems with covariate shift, and simple procedures that achieve these fundamental limits. Our talk covers both the challenges of covariate shift in non-parametric regression, and also in semi-parametric problems that arise from causal inference and off-policy evaluation.
Based on joint works with: Peter Bartlett, Peng Ding, Cong Ma, Wenlong Mou, Reese Pathak and Lin Xiao.
Bio: Martin Wainwright is the Cecil H. Green Professor in Electrical Engineering and Computer Science and Mathematics at MIT, and is affiliated with the Laboratory for Information and Decision Systems and the Statistics and Data Science Center.
His main claim to fame is that he was the graduate advisor of Nihar Shah and the postdoc advisor of Aaditya Ramdas, Pradeep Ravikumar (all esteemed faculty at CMU), and Sivaraman Balakrishnan. He has also received a number of awards and recognitions, including an Alfred P. Sloan Foundation Fellowship, best paper awards from the IEEE Signal Processing Society, the IEEE Communications Society, and the IEEE Information Theory and Communication Societies, the Medallion Lectureship and Award from the Institute of Mathematical Statistics, and the COPSS Presidents’ Award from the Joint Statistical Societies. He was a Section Lecturer at the International Congress of Mathematicians in 2014 and received the Blackwell Award from the Institute of Mathematical Statistics in 2017.
February 7, 2024
Shuangning Li
Inference and Decision-Making amid Social Interactions
Abstract: From social media trends to family dynamics, social interactions shape our daily lives. In this talk, I will present tools I have developed for statistical inference and decision-making in light of these social interactions.
- Inference: I will talk about estimation of causal effects in the presence of interference. In causal inference, the term “interference” refers to a situation where, due to interactions between units, the treatment assigned to one unit affects the observed outcomes of others. I will discuss large-sample asymptotics for treatment effect estimation under network interference where the interference graph is a random draw from a graphon. When targeting the direct effect, we show that popular estimators in our setting are considerably more accurate than existing results suggest. Meanwhile, when targeting the indirect effect, we propose a consistent estimator in a setting where no other consistent estimators are currently available.
- Decision-Making: Turning to reinforcement learning amid social interactions, I will focus on a problem inspired by a specific class of mobile health trials involving both target individuals and their care partners. These trials feature two types of interventions: those targeting individuals directly and those aimed at improving the relationship between the individual and their care partner. I will present an online reinforcement learning algorithm designed to personalize the delivery of these interventions. The algorithm's effectiveness is demonstrated through simulation studies conducted on a realistic test bed, which was constructed using data from a prior mobile health study. The proposed algorithm will be implemented in the ADAPTS HCT clinical trial, which seeks to improve medication adherence among adolescents undergoing allogeneic hematopoietic stem cell transplantation.
Bio: Shuangning Li is a postdoctoral fellow working with Professor Susan Murphy in the Department of Statistics at Harvard University. Prior to this, she earned her Ph.D. from the Department of Statistics at Stanford University, where she was advised by Professors Emmanuel Candès and Stefan Wager.
February 5, 2024
Brian Trippe
Probabilistic methods for designing functional protein structures
Abstract: The biochemical functions of proteins, such as catalyzing a chemical reaction or binding to a virus, are typically conferred by the geometry of only a handful of atoms. This arrangement of atoms, known as a motif, is structurally supported by the rest of the protein, referred to as a scaffold. A central task in protein design is to identify a diverse set of stabilizing scaffolds to support a motif known or theorized to confer function. This long-standing challenge is known as the motif-scaffolding problem.
In this talk, I describe a statistical approach I have developed to address the motif-scaffolding problem. My approach involves (1) estimating a distribution supported on realizable protein structures and (2) sampling scaffolds from this distribution conditioned on a motif. For step (1), I adapt diffusion generative models to fit example protein structures from nature. For step (2), I develop sequential Monte Carlo algorithms to sample from the conditional distributions of these models. I finally describe how, with experimental and computational collaborators, I have generalized and scaled this approach to generate and experimentally validate hundreds of proteins with various functional specifications.
Bio: Brian Trippe is a postdoctoral fellow in the Department of Statistics at Columbia University, and a visiting researcher at the Institute for Protein Design at the University of Washington. He completed his Ph.D. in Computational and Systems Biology at the Massachusetts Institute of Technology, where he worked on Bayesian methods for inference in hierarchical linear models. In his research, Brian develops statistical machine learning methods to address challenges in biotechnology and medicine, with a focus on generative modeling and inference algorithms for protein engineering.
January 31, 2024
Michael Celentano
Debiasing in the inconsistency regime
Abstract: In this talk, I will discuss semi-parametric estimation when nuisance parameters cannot be estimated consistently, focusing in particular on the estimation of average treatment effects, conditional correlations, and linear effects under high-dimensional GLM specifications. In this challenging regime, even standard doubly-robust estimators can be inconsistent. I describe novel approaches which enjoy consistency guarantees for low-dimensional target parameters even though standard approaches fail. For some target parameters, these guarantees can also be used for inference. Finally, I will provide my perspective on the broader implications of this work for designing methods which are less sensitive to biases from high-dimensional prediction models.
Bio: Michael Celentano is a Miller Fellow in the Statistics Department at the University of California, Berkeley, advised by Martin Wainwright and Ryan Tibshirani. He received his Ph.D. in Statistics from Stanford University in 2021, where he was advised by Andrea Montanari. Most of his work focuses on the high-dimensional asymptotics for regression, classification, and matrix estimation problems.
January 29, 2024
Ying Jin
Model-free selective inference: from calibrated uncertainty to trusted decisions
Abstract: AI has shown great potential in accelerating decision-making and scientific discovery pipelines such as drug discovery, marketing, and healthcare. In many applications, predictions from black-box models are used to shortlist candidates whose unknown outcomes satisfy a desired property, e.g., drugs with high binding affinities to a disease target. To ensure the reliability of high-stakes decisions, uncertainty quantification tools such as conformal prediction have been increasingly adopted to understand the variability in black-box predictions. However, we find that the on-average guarantee of conformal prediction can be insufficient for its deployment in decision-making, which usually has a selective nature.
In this talk, I will introduce a model-free selective inference framework that allows us to select reliable decisions with the assistance of any black-box prediction model. Our framework identifies candidates whose unobserved outcomes exceed user-specified values while controlling the average proportion of falsely selected units (FDR), without any modeling assumptions. Leveraging a set of exchangeable training data, our method constructs conformal p-values that quantify the confidence in large outcomes; it then determines a data-dependent threshold for the p-values as a criterion for drawing confident decisions. In addition, I will discuss new ideas to further deal with covariate shifts between training and new samples. We show that in several drug discovery tasks, our methods narrow down the drug candidates to a manageable size of promising ones while controlling the proportion of false discoveries. In a causal inference dataset, our methods identify students who benefit from an educational intervention, providing new insights into causal effects.
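As a rough illustration of the selection idea (a simplified sketch under stated assumptions, not the speaker's exact method): conformal p-values compare each candidate's predicted score with the scores of calibration units whose outcomes fall at or below a target level c, and a Benjamini-Hochberg step then supplies the data-dependent threshold; the score convention and the level c here are assumptions made for the example.

```python
# Simplified sketch: FDR-controlled selection of candidates with large outcomes
# using conformal p-values. Assumes higher scores indicate larger outcomes and that
# calibration units with outcome <= c are exchangeable with "null" test candidates.
import numpy as np

def conformal_select(calib_scores, calib_y, test_scores, c, alpha=0.1):
    """Return indices of test candidates selected as likely having outcome > c."""
    null_scores = calib_scores[calib_y <= c]                      # calibration units satisfying the null
    n0 = len(null_scores)
    pvals = np.array([(1 + np.sum(null_scores >= s)) / (n0 + 1)   # conformal p-value per candidate
                      for s in test_scores])
    m = len(pvals)
    order = np.argsort(pvals)                                     # Benjamini-Hochberg step-up rule
    passed = np.nonzero(pvals[order] <= alpha * np.arange(1, m + 1) / m)[0]
    return np.sort(order[:passed[-1] + 1]) if passed.size else np.array([], dtype=int)
```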
Bio: Ying Jin is a fifth-year Ph.D. student in the Department of Statistics at Stanford University, advised by Emmanuel Candès and Dominik Rothenhäusler. Prior to this, she obtained a B.S. in Mathematics from Tsinghua University. Her research focuses on devising modern statistical methodology that enables trusted inference and decisions with minimal assumptions, covering conformal inference, multiple testing, causal inference, distributional robustness, and data-driven decision-making.
January 24, 2024
Arkajyoti Saha
Inference for machine learning under dependence
Abstract: Recent interest has centered on uncertainty quantification for machine learning models. For the most part, this work has assumed independence of the observations. However, many of the most important problems arising across scientific fields, from genomics to climate science, involve systems where dependence cannot be ignored. In this talk, I will investigate inference on machine learning models in the presence of dependence.
In the first part of my talk, I will consider a common practice in the field of genomics in which researchers compute a correlation matrix between genes and threshold its elements in order to extract groups of independent genes. I will describe how to construct valid p-values associated with these discovered groups that properly account for the group selection process. While this is related to the literature on selective inference developed in the past decade, this work involves inference about the covariance matrix rather than the mean, and therefore requires an entirely new technical toolset. This same toolset can be applied to quantify the uncertainty associated with canonical correlation analysis after feature screening.
In the second part of my talk, I will turn to an important problem in the field of oceanography as it relates to climate science. Oceanographers have recently applied random forests to estimate carbon export production, a key quantity of interest, at a given location in the ocean; they then wish to sum the estimates across the world’s oceans to obtain an estimate of global export production. While quantifying uncertainty associated with a single estimate is relatively straightforward, quantifying uncertainty of the summed estimates is not, due to their complex dependence structure. I will adapt the theory of V-statistics to this dependent data setting in order to establish a central limit theorem for the summed estimates, which can be used to quantify the uncertainty associated with global export production across the world’s oceans.
This is joint work with my postdoctoral supervisors, Daniela Witten (University of Washington) and Jacob Bien (University of Southern California).
Bio: Arkajyoti Saha is a postdoctoral fellow in the Department of Statistics, University of Washington. He received his Ph.D. in Biostatistics from the Johns Hopkins Bloomberg School of Public Health. His research lies at the intersection of machine learning, selective inference, and spatial statistics, with a focus on machine learning under dependence with applications in genomics and oceanography.
January 22, 2024
Satarupa Bhattacharjee
Geodesic Mixed Effects Models for Repeatedly Observed/Longitudinal Random Objects
Abstract: Mixed effect modeling for longitudinal data is challenging when the observed data are random objects, which are complex data taking values in a general metric space without either global linear or local linear (Riemannian) structure. In such settings, the classical additive error model and distributional assumptions are unattainable. Due to the rapid advancement of technology, longitudinal data containing complex random objects, such as covariance matrices, data on Riemannian manifolds, and probability distributions, are becoming more common. Addressing this challenge, we develop a mixed-effects regression for data in geodesic spaces, where the underlying mean response trajectories are geodesics in the metric space and the deviations of the observations from the model are quantified by perturbation maps or transports. A key finding is that the geodesic trajectories assumption for the case of random objects is a natural extension of the linearity assumption in the standard Euclidean scenario to the case of general geodesic metric spaces. Geodesics can be recovered from noisy observations by exploiting a connection between the geodesic path and the path obtained by global Fréchet regression for random objects. The effect of baseline Euclidean covariates on the geodesic paths is modeled by another Fréchet regression step. We study the asymptotic convergence of the proposed estimates and provide illustrations through simulations and real-data applications.
Bio: I am a Postdoctoral Scholar in the Department of Statistics at Pennsylvania State University, working with Prof. Bing Li and Prof. Lingzhou Xue. I received my Ph.D. in Statistics at UC Davis advised by Prof. Hans-Georg Müller in September 2022.
My primary research centers around analyzing functional and non-Euclidean data situated in general metric spaces, which we refer to as random objects, with examples in brain imaging data, networks, distribution valued data, and high-dimensional genetics data.
January 17, 2024
Christopher Harshaw
Algorithm Design for Randomized Experiments
Abstract: Randomized experiments are one of the most reliable causal inference methods and are used in a variety of disciplines, from clinical medicine and public policy to economics and corporate A/B testing. Experiments in these disciplines provide empirical evidence which drives some of the most important decisions in our society: Which drugs are prescribed? Which social programs are implemented? Which corporate strategies are used? Technological advances in measurement and intervention -- including high dimensional data, network data, and mobile devices -- offer exciting opportunities to design new experiments to investigate a broader set of causal questions. In these more complex settings, standard experimental designs (e.g., independent assignment of treatment) are far from optimal. Designing experiments which yield the most precise estimates of causal effects in these complex settings is not only a statistical problem, but also an algorithmic one.
In this talk, I will present my recent work on designing algorithms for randomized experiments. I will begin by presenting Clip-OGD, a new algorithmic experimental design for adaptive sequential experiments. We show that under the Clip-OGD design, the variance of an adaptive version of the Horvitz-Thompson estimator converges to the optimal non-adaptive variance, resolving a 70-year-old problem posed by Robbins in 1952. Our results are facilitated by drawing connections to regret minimization in online convex optimization. Time permitting, I will describe a new unifying framework for investigating causal effects under interference, where treatment given to one subject can affect the outcomes of other subjects. Finally, I will conclude by highlighting open problems and reflecting on future work in these directions.
Bio: Christopher Harshaw is a FODSI postdoc at MIT and UC Berkeley. He received his Ph.D. from Yale University where he was advised by Dan Spielman and Amin Karbasi. His research lies at the interface of causal inference, machine learning, and algorithm design, with a particular focus on the design and analysis of randomized experiments. His work has appeared in the Journal of the American Statistical Association, Electronic Journal of Statistics, ICML, NeurIPS, and won Best Paper Award at the NeurIPS 2022 workshop, CML4Impact.