Former Project Details
Project Title: Discovery & Engineering Protein Artificial Intelligence (DEPr-AI) with Manifold-aware Protein Resuscitation (MaPR): Synthetic Bi
Vincenzo Carnevale, Temple University
Project Abstract: Though separated in some cases by over 1B years of evolutionary time, divergent protein families may share remarkable sequence and structural patterns. Although the patterns are complex, and although the evolutionary parameters that generated them are ambiguous, the patterns are detectable, given sufficient protein data. Therefore, a model trained on sufficient data could in principle extract the parameters from the patterns, and then parameterize itself to generate synthetic proteins that are statistically indistinguishable from those generated by natural evolutionary processes, but in a controllable way. For the first time, sufficient data are available to train such a model. We propose Discovery and Engineering Protein Artificial Intelligence (DEPr-AI), a BERT-like autoencoder neural network (AENN) generative protein model trained on evolutionary patterns in protein sequence and structure data. Until recently, sufficient volumes of protein structure data were unavailable, and so prior generative protein modeling methods focused primarily on sequence data. Here, we propose to leverage the rich information contained in the sequence-structure map that was previously unaccounted for. The recent release of AlphaFold2 (AF2) by Google's DeepMind, which can predict structures with atomic precision on par with experimental techniques, makes our proposed work newly feasible and extremely timely. The first part of this research proposal is to use Neocortex to generate hundreds of thousands of protein structures using AF2, a challenge that would take days or weeks of GPU time on current XSEDE resources such as Bridges2, but could take days, hours, perhaps even minutes on Neocortex. After AF2 structure generation on Neocortex, we will extend our prior generative protein sequence modeling efforts to characterize the relationship between protein sequence-structure and conformational dynamics using DEPr-AI, which employs a novel joint embedding approach that merges sequences with their corresponding structures into a paired representation. By embedding these joint sequence-structure protein entities into the latent space of an AENN during training, DEPr-AI can learn the sequence-structure-function continuum from the evolutionary patterns in the data, and encode the continuum into the topology of a fitness landscape with improved accuracy and interpretability over current methods. We propose another method, Manifold-aware Protein Resuscitation (MaPR), which DEPr-AI can use to interpolate new synthetic proteins from the latent space by "resuscitating" them along high-probability geodesic paths between known proteins. With MaPR, DEPr-AI, and AF2, all running on Neocortex, we will deliver breakthroughs in protein discovery, engineering, and analysis that were technologically infeasible until now. Further, we have already begun coordinating with experimental collaborators, who will verify that our synthetic proteins have the features predicted by DEPr-AI.
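To make the latent-space interpolation idea concrete, here is a minimal Python sketch; the straight-line path, the interpolate_latent helper, and the model.encode/model.decode interface are illustrative assumptions only, since MaPR as described follows high-probability geodesics on the learned manifold rather than straight lines.

    import numpy as np

    def interpolate_latent(z_a, z_b, n_steps=10):
        """Linear interpolation between two latent codes. MaPR as
        described would instead follow geodesics of the learned
        manifold; a straight line is the simplest baseline."""
        ts = np.linspace(0.0, 1.0, n_steps)
        return np.stack([(1 - t) * z_a + t * z_b for t in ts])

    # Hypothetical usage with a trained autoencoder `model`:
    # z_a, z_b = model.encode(protein_a), model.encode(protein_b)
    # for z in interpolate_latent(z_a, z_b):
    #     candidate = model.decode(z)  # synthetic protein along the path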
Project Title: ComputeCOVID19++: Accelerating Medical Diagnosis and Monitoring via High-Performance Deep Learning on CT Images
Wu Feng, Virginia Tech
Project Abstract: ComputeCOVID19++ builds on our existing work with ComputeCOVID19+, a CT-based framework that significantly enhances the speed and accuracy of diagnosing and monitoring COVID-19 (and its variants) via a deep-learning network for CT image enhancement called DDnet, short for DenseNet and Deconvolution network. In this work, we propose to create a new algorithm that is synergistically co-designed with the Cerebras CS-1 hardware and its associated software in the integrated HPE Superdome Flex and Cerebras CS-1 system. As such, we seek to improve the regularization and specificity of our DDnet in ComputeCOVID19+, which enhances the quality of any given CT scan, and then map the sparsity of our model onto the Cerebras CS-1 to reduce the training time of DDnet. In addition, we seek to propose and validate the efficacy of a new set of metrics that can then be used as a guide to quantify the sparsity of any deep-learning model for different types of layers, such as convolution and fully connected layers.
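For readers unfamiliar with the DDnet pattern, below is a toy DenseNet-plus-deconvolution network in PyTorch; layer counts and sizes are placeholders, not the published DDnet architecture.

    import torch
    import torch.nn as nn

    class TinyDDnet(nn.Module):
        """Toy stand-in for DDnet (DenseNet + Deconvolution): a dense
        block for feature reuse, then transposed convolutions that
        reconstruct an enhanced CT image of the same size."""
        def __init__(self, growth=16, layers=4):
            super().__init__()
            blocks, ch = [], 1
            for _ in range(layers):
                blocks.append(nn.Sequential(
                    nn.BatchNorm2d(ch), nn.ReLU(),
                    nn.Conv2d(ch, growth, 3, padding=1)))
                ch += growth  # dense connectivity: features accumulate
            self.blocks = nn.ModuleList(blocks)
            self.deconv = nn.Sequential(
                nn.ConvTranspose2d(ch, 32, 3, padding=1), nn.ReLU(),
                nn.ConvTranspose2d(32, 1, 3, padding=1))

        def forward(self, x):
            feats = [x]
            for b in self.blocks:
                feats.append(b(torch.cat(feats, dim=1)))
            return self.deconv(torch.cat(feats, dim=1))

    out = TinyDDnet()(torch.randn(1, 1, 64, 64))  # enhanced scan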
Project Title: Apply Machine Learning to Predict Antibody Drug Developability
PIN-KUANG LAI, Stevens Institute of Technology
Project Abstract: The number of monoclonal antibody (mAb) drugs in clinical trials or approved for use has increased rapidly in recent years, with 97 drugs approved by the U.S. Food and Drug Administration (FDA) or the European Medicines Agency (EMA) as of August 2020. In addition to successful antibody binding to the target to stimulate biological responses, the developability properties of mAbs, such as the feasibility of their manufacture, stability in storage, and absence of off-target stickiness, are essential to new drug development. In fact, attrition of therapeutic candidates during clinical development is the major factor in high development costs. However, the developability profiles of antibodies are difficult to assess during early-stage discovery and candidate screening due to the limited number of molecules, limited material availability, and a lack of physical understanding. Therefore, predictive tools that can evaluate the developability of antibodies, including low aggregation rates and low viscosity, as early as possible in the discovery/development process are desirable. Previously, we have developed computational tools based on molecular dynamics (MD) simulations, or that use features extracted from MD simulations, to build machine learning models that predict antibody aggregation and viscosity. Two of the key descriptors are called spatial aggregation propensity (SAP) and spatial charge map (SCM), respectively. Calculating SAP and SCM requires building a homology model from the antibody sequence and running MD simulations to obtain ensemble averages. This step is very time-consuming and requires supercomputers. The goal of this project is to train neural networks on MD simulation results using antibody sequences as inputs. The resulting ML model will speed up the calculation of SAP and SCM scores and facilitate antibody screening in early-stage design.
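A minimal sketch of the sequence-to-score idea: a small network maps one-hot antibody sequences to the SAP/SCM values that would come from prior MD simulations. The architecture, the one_hot helper, and all sizes are illustrative assumptions, not the authors' model.

    import torch
    import torch.nn as nn

    AA = "ACDEFGHIKLMNPQRSTVWY"  # 20 canonical amino acids

    def one_hot(seq, max_len=140):
        """One-hot encode an antibody-region sequence, zero-padded."""
        x = torch.zeros(max_len, len(AA))
        for i, a in enumerate(seq[:max_len]):
            x[i, AA.index(a)] = 1.0
        return x

    # Minimal regressor from sequence to a (SAP, SCM) score pair.
    model = nn.Sequential(
        nn.Conv1d(20, 64, 5, padding=2), nn.ReLU(),
        nn.AdaptiveAvgPool1d(1), nn.Flatten(),
        nn.Linear(64, 2))  # outputs: predicted SAP and SCM

    x = one_hot("EVQLVESGGGLVQPGGSLRLSCAAS").T.unsqueeze(0)  # (1, 20, L)
    sap_scm = model(x)  # training targets would come from MD runs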
Project Title: Wafer-Scale Geometric Deep Learning on the PSC Neocortex
Kenneth Chiu, Binghamton University
Project Abstract: Neural networks, especially deep neural networks, have seen remarkable success in the past decade on a variety of problems. Many problems, however, are naturally modeled as graph problems, for which traditional neural networks are not well-suited. This has led to the development of graph neural networks (GNNs). GNNs, or Message Passing Neural Networks, are a set of deep learning algorithms based on message passing or graph convolutions, and are designed for supervised and unsupervised learning on graph-structured data. The message passing or convolution operation is analogous to the filter operation in Convolutional Neural Networks (CNNs) over neighboring pixels. CNNs can be viewed as operating on grid-like or lattice graphs with a consistent number of neighbors. GNNs act over a more generalized set of operations, and can thus handle an arbitrary number of neighbors and varied kernel operations. As a result, kernel operations vary depending on the data itself and require generalized sparse scatter/gather communication over the data features. We will use a customized CSR/CSC format with custom kernels to perform efficient reduction across neighboring graph vertices. We will co-develop our implementation with three applications. The first application will be inverse molecular design. Molecules are naturally represented as graph structures, and deep learning on molecules has been hugely successful in domains such as materials science and drug discovery. Successful incorporation of deep learning in the molecular design loop can result in the development of exotic materials for energy storage, energy generation, and climate-change mitigation. Structure-property prediction is an important part of the design of new materials. Our second application will be predicting events in the world’s most popular multiplayer video game, League of Legends. Using high-resolution, large-scale data from thousands of played games, we will learn interactions in complex dynamic graphs that update in real time. Dynamic graphs such as these will be a case study in performing accelerated deep learning on real-time graphs. Our third application will be identifying state-sponsored disinformation from online interaction graphs.
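The following sketch shows the sparse gather/reduce that a CSR-based GNN kernel performs, written as serial numpy for clarity; the csr_neighbor_mean helper is an illustrative stand-in for the proposed custom kernels, not their implementation.

    import numpy as np

    def csr_neighbor_mean(indptr, indices, feats):
        """Mean-aggregate neighbor features with a CSR adjacency:
        the sparse gather/reduce at the heart of GNN message passing,
        shown as a serial reference version."""
        out = np.zeros_like(feats)
        for v in range(len(indptr) - 1):
            nbrs = indices[indptr[v]:indptr[v + 1]]  # gather
            if len(nbrs):
                out[v] = feats[nbrs].mean(axis=0)    # reduce
        return out

    # 3-node path graph: edges 0-1 and 1-2
    indptr = np.array([0, 1, 3, 4])
    indices = np.array([1, 0, 2, 1])
    feats = np.arange(6, dtype=float).reshape(3, 2)
    print(csr_neighbor_mean(indptr, indices, feats))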
Project Title: A novel deep learning method for discovering genetic mechanisms underlying differential gene regulation
Sreeskandarajan Sutharzan, Cincinnati Children's Hospital Medical Center
Project Abstract: Gene regulation is a fundamentally important molecular process that is required for all known forms of life. Gene regulation is defined as the processes underlying the activation or repression of gene expression levels. Transcription factors (TFs) are proteins that play key roles in gene regulation. The human genome encodes >1,600 TFs, each of which plays an important gene regulatory role in particular contexts. TFs act by recognizing short DNA sequences in the genome. Upon doing so, they recruit other proteins to ultimately influence gene expression levels. In this sense, TFs are the primary molecules responsible for interpreting the genome. Our lab and many others are currently engaged in understanding this complex “regulatory grammar”, with the ultimate goal of predicting gene expression levels from DNA sequence alone. Achieving this goal would enable a thorough understanding of genome function, and how genetic variation contributes to phenotypes and diseases. Recent advances in Deep Learning methodologies and capabilities are quickly enabling major progress towards this goal. In this study, we propose to leverage the power of Deep Learning to study a particularly important question in regulatory genomics – what DNA sequences underlie differential gene regulatory mechanisms that occur due to differential cellular conditions?
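As a sketch of how deep models read regulatory sequence, the toy 1D CNN below scans one-hot DNA with learnable filters that behave like motif detectors; the sizes and the encode helper are illustrative assumptions, not the lab's planned architecture.

    import torch
    import torch.nn as nn

    BASES = "ACGT"

    def encode(seq):
        """One-hot encode a DNA sequence as a (4, L) tensor."""
        x = torch.zeros(4, len(seq))
        for i, b in enumerate(seq):
            x[BASES.index(b), i] = 1.0
        return x

    # Convolution filters act like learnable TF-motif scanners.
    net = nn.Sequential(
        nn.Conv1d(4, 32, kernel_size=8), nn.ReLU(),
        nn.AdaptiveMaxPool1d(1), nn.Flatten(),
        nn.Linear(32, 1))  # e.g. a differential-regulation score

    score = net(encode("ACGTACGTGGGCCCTTAA").unsqueeze(0))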
Project Title: Earthquake Phase Association with Graph Neural Networks
Gregory Beroza, Stanford University
Project Abstract: In this work we propose a new Graph Neural Network (GNN) architecture for earthquake phase association, in which we process streaming datasets of estimated seismic wave arrival times (known as “picks”), determine the number and location of earthquakes in a time window, and associate picks to earthquake sources. We train the GNN through supervised learning with synthetic pick datasets, for which ground truth is known, and which have high variability and noise (false picks). The network is not trained for a particular configuration of stations; rather, it is trained to accommodate variable network geometry, numbers of stations, station qualities, and pick rates. By frequently including closely overlapping events in space and time in the training data, the GNN learns to untangle overlapping events. As a mathematical function, the GNN maps sets of sets (sets of discrete picks, on each station) to a continuous, smooth, bounded prediction of source likelihoods in space-time, similar to the traditional back-projection (BP) mapping; however, it greatly suppresses the side lobes that plague traditional approaches, and large and small earthquakes are mapped to a similar output value (in contrast to BP, where outputs scale with the number of observed picks). The technique has been tested on real data from the NC network of northern California, using machine-learning-produced picks as input, where we recover over 95% of previously reported earthquakes > M1 throughout the interval 2000–2020. Initial applications suggest that the GNN will reveal at least 5x more previously undetected earthquakes < M1, which will reveal active fault structure in unprecedented detail. By enabling more rapid training, the computing capabilities of Neocortex can help us enhance these results significantly. With the Neocortex computing platform, we will have the necessary capabilities to optimize the GNN more thoroughly over the hyperparameter space, tune the synthetic data generator to reduce the covariate shift between synthetic and real data, and add additional modules to the architecture, such as an initial full-waveform processing layer. We will also be able to perform an ablation analysis to assess the performance of the individual components of the GNN more thoroughly, which can help identify aspects of the architecture that can be improved and assist other researchers in adapting our GNN to their own applications.
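To illustrate the set-to-field mapping described above, here is a permutation-invariant toy encoder (DeepSets-style, not the proposed GNN) that pools over picks and scores a candidate source location; all names and dimensions are assumptions for illustration.

    import torch
    import torch.nn as nn

    class PickSetEncoder(nn.Module):
        """Permutation-invariant toy: embed each pick, pool over the
        set, then score source likelihood at a query space-time point.
        Output is bounded in (0, 1), like the described mapping."""
        def __init__(self, d=32):
            super().__init__()
            self.phi = nn.Sequential(nn.Linear(4, d), nn.ReLU(),
                                     nn.Linear(d, d))
            self.rho = nn.Sequential(nn.Linear(d + 4, d), nn.ReLU(),
                                     nn.Linear(d, 1), nn.Sigmoid())

        def forward(self, picks, query):
            # picks: (n, 4) = station x, y, z and arrival time
            # query: (4,)   = candidate source x, y, z, t
            pooled = self.phi(picks).mean(dim=0)  # order-invariant
            return self.rho(torch.cat([pooled, query]))

    likelihood = PickSetEncoder()(torch.randn(12, 4), torch.randn(4))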
Project Title: Deep learning analysis for single-molecule ligand-receptor interaction
Cheng Zhu, Georgia Institute of Technology
Project Abstract: The biophysical and biochemical characteristics of ligand-receptor interactions govern many biological processes, particularly cell signal transduction, in which extracellular molecular bindings relay messages through membranes to initiate intracellular responses. Our in situ nanotools, the micropipette adhesion frequency assay and the biomembrane force probe, have delivered cutting-edge knowledge about single-molecule ligand-receptor interactions in their native micro-environments. At the core of these nanotools, the ultra-sensitive kinetic measurements rely heavily on the one-dimensional deformation of a micropipette-aspirated red blood cell (RBC). Here, we propose to improve them with a convolutional neural network (CNN) for feature extraction followed by a recurrent neural network (RNN) for dynamic event detection, which can potentially lead to more precise quantification, more insightful interpretation, and more accurate decision-making. The unique opportunity created by Neocortex can ease the challenges and accelerate the progress of integrating these deep learning components into our current ligand-receptor kinetic analysis workflows.
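A minimal sketch of the CNN-then-RNN pipeline: per-frame convolutional features followed by an LSTM over frames to flag candidate binding events. The EdgeTracker class and its sizes are hypothetical placeholders, not the proposed model.

    import torch
    import torch.nn as nn

    class EdgeTracker(nn.Module):
        """Per-frame CNN features of the aspirated-RBC image, then an
        LSTM over frames to classify event / no-event per frame."""
        def __init__(self):
            super().__init__()
            self.cnn = nn.Sequential(
                nn.Conv2d(1, 8, 5, stride=2), nn.ReLU(),
                nn.AdaptiveAvgPool2d(4), nn.Flatten())  # 8*4*4 = 128
            self.rnn = nn.LSTM(128, 64, batch_first=True)
            self.head = nn.Linear(64, 2)

        def forward(self, frames):            # (batch, T, 1, H, W)
            b, t = frames.shape[:2]
            f = self.cnn(frames.flatten(0, 1)).view(b, t, -1)
            out, _ = self.rnn(f)
            return self.head(out)             # per-frame logits

    logits = EdgeTracker()(torch.randn(2, 30, 1, 64, 64))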
Project Title: Artificial Intelligence Framework to Predict Wall Stresses on Aneurysms
Timothy Chung, University of Pittsburgh
Project Abstract: Abdominal aortic aneurysm (AAA) is the progressive, degenerative dilation of the terminal aorta; without treatment, AAAs can undergo rupture, an often-fatal event that is the 13th most common cause of death in the US. Clinical intervention occurs when the maximum diameter exceeds 5.5 cm, a diameter beyond which the risk of rupture is thought to be greater than the risk of intervention. A biomechanical tool, the rupture potential index (RPI), was developed by our group [2,3] through computational finite element analysis (FEA) and experimental uniaxial extension testing. The RPI is defined as the ratio of transmural wall stress (driven by systolic pressure) to failure strength (the maximum stress the aneurysm wall can support). However, the RPI has not translated clinically due to its heavy computational requirements, the reliance on manual segmentation methods, and the relatively low number of patient images studied (the combined number of significant studies investigating peak wall stress is around 348, and the RPI was not always calculated [10]). We use a combination of machine learning techniques to automatically segment aneurysm geometries and perform predictive modeling of wall stresses based on many computational simulations. Comparisons of shape and biomechanical indices are quantified to determine the reliability of the automatically reconstructed AAA geometry. Preliminary results have shown that we are able to predict wall stresses within 0.34% based on shape indices, without the need for computational simulations. An increased sample size will allow us to further develop a clinically translatable tool to predict the biomechanical status of AAA.
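The surrogate-modeling idea can be sketched in a few lines: regress FEA-derived peak wall stress directly on shape indices, so the expensive simulation is skipped at prediction time. The data below are random stand-ins, and the model choice is illustrative, not the group's method.

    import numpy as np
    from sklearn.neural_network import MLPRegressor

    # Hypothetical table: rows are AAA geometries, columns are shape
    # indices (e.g. max diameter, tortuosity, sac volume); targets
    # are peak wall stresses from prior FEA runs.
    rng = np.random.default_rng(0)
    shape_indices = rng.normal(size=(300, 12))
    peak_stress = rng.normal(size=300)

    # Train a surrogate that maps shape indices to peak wall stress.
    surrogate = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=2000)
    surrogate.fit(shape_indices[:250], peak_stress[:250])
    print(surrogate.score(shape_indices[250:], peak_stress[250:]))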
Project Title: Improving predictability of anomalies in vitreous silica using uncertainty-based adversarial attack
Rafael Gomez-Bombarelli, Massachusetts Institute of Technology
Project Abstract: Understanding the structure of glassy materials represents a tremendous challenge for both experiments and computations. One of the most common glass materials, vitreous silica, has been used in a plethora of commercial and scientific applications, but is still not well understood despite decades of research. Sharing the same tetrahedral order as water, vitreous silica is known to exhibit several anomalous behaviors in its physical properties, including a temperature-dependent density minimum around 900 °C and a density maximum around 1500 °C. Due to such anomalies, many empirical force fields and machine learning interatomic potentials have been shown to be volatile in predicting physical properties that accurately reflect the mechanical and density anomalies in silica. Here, we exploit an automatic differentiation strategy in graph neural network (GNN) potentials to discover highly uncertain glass configurations, so that structural configurations responsible for the anomalies in vitreous silica are more likely to be learned. The strategy works by performing an adversarial attack on a differentiable uncertainty metric. When combined into an active learning loop, only a small number of expensive ab initio molecular dynamics trajectories are needed as the initial training dataset.
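A minimal sketch of an uncertainty-based adversarial attack: perturb atomic positions by gradient ascent on a differentiable ensemble-variance metric. The toy quadratic "potentials" stand in for a trained GNN ensemble, and all names are assumptions.

    import torch

    def adversarial_displacement(positions, ensemble, steps=20, lr=1e-2):
        """Perturb positions to maximize ensemble disagreement
        (variance of predicted energies), a differentiable
        uncertainty metric suitable for gradient ascent."""
        x = positions.clone().requires_grad_(True)
        opt = torch.optim.Adam([x], lr=lr)
        for _ in range(steps):
            energies = torch.stack([model(x) for model in ensemble])
            uncertainty = energies.var()
            opt.zero_grad()
            (-uncertainty).backward()  # ascend the uncertainty
            opt.step()
        return x.detach()

    # Toy ensemble: three random quadratic "potentials".
    ws = [torch.randn(3) for _ in range(3)]
    ensemble = [lambda x, w=w: ((x * w) ** 2).sum() for w in ws]
    x_adv = adversarial_displacement(torch.randn(8, 3), ensemble)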
Project Title: Analysis of differential dependency on large-scale RNA expression networks
Gil Speyer, Arizona State University
Project Abstract: The dependency between genes within a functional biological pathway can be contrasted between two conditions through the calculated divergence between distributions of dependency networks [1]. EDDY (Evaluation of Differential DependencY) is a statistical test that identifies gene sets, i.e., pathways, that are significantly “rewired” between conditions. By leveraging a probabilistic framework with resampling and permutation, aided by the incorporation of annotated gene sets, it demonstrates superior sensitivity compared to other methods. Further, the ample, independent computation and manageable memory footprint incurred by this statistical rigor position EDDY as an excellent subject for graphics processing unit (GPU) acceleration [2]. Custom kernels written in CUDA decompose the independence test loop, network construction, network enumeration, and Bayesian network scoring to accelerate the computation. The algorithm has recently been used to discover novel drugs for pulmonary hypertension, repurposed from small compounds designed for cancer treatments [3]. The Neocortex RFP provides an opportunity to pursue new directions with EDDY analysis, such as the interrogation of larger gene sets and the development of statistical sampling strategies for larger (e.g., single-cell) RNA expression sample sets. [1] Jung S, Kim S. EDDY: a novel statistical gene set test method to detect differential genetic dependencies. Nucleic Acids Res. 2014 Apr;42(7):e60. doi: 10.1093/nar/gku099. Epub 2014 Feb 5. PMID: 24500204; PMCID: PMC3985670. [2] G. Speyer, J. Rodriguez, T. Bencomo and S. Kim, “GPU-Accelerated Differential Dependency Network Analysis,” 2018 26th Euromicro International Conference on Parallel, Distributed and Network-based Processing (PDP), 2018, pp. 410-414, doi: 10.1109/PDP2018.2018.00072. [3] Negi V, Yang J, Speyer G, Pulgarin A, Handen A, Zhao J, Tai YY, Tang Y, Culley MK, Yu Q, Forsythe P, Gorelova A, Watson AM, Al Aaraj Y, Satoh T, Sharifi-Sanjani M, Rajaratnam A, Sembrat J, Provencher S, Yin X, Vargas SO, Rojas M, Bonnet S, Torrino S, Wagner BK, Schreiber SL, Dai M, Bertero T, Al Ghouleh I, Kim S, Chan SY. Computational repurposing of therapeutic small molecules from cancer to pulmonary hypertension. Sci Adv. 2021 Oct 22;7(43):eabh2794. doi: 10.1126/sciadv.abh2794. Epub 2021 Oct 20. PMID: 34669463.
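The divergence computation at the core of such a test can be sketched as a Jensen-Shannon divergence between condition-specific distributions over enumerated networks; the snippet below is a simplified illustration and omits EDDY's resampling and permutation machinery.

    import numpy as np

    def js_divergence(p, q, eps=1e-12):
        """Jensen-Shannon divergence between two distributions over
        the same set of candidate networks (one per condition)."""
        p, q = p + eps, q + eps
        p, q = p / p.sum(), q / q.sum()
        m = 0.5 * (p + q)
        kl = lambda a, b: (a * np.log(a / b)).sum()
        return 0.5 * kl(p, m) + 0.5 * kl(q, m)

    # Likelihoods of three enumerated networks under two conditions:
    print(js_divergence(np.array([0.7, 0.2, 0.1]),
                        np.array([0.1, 0.3, 0.6])))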
Project Title: High-throughput and data-mining search for new rare-earth-free permanent magnetic borides
Boniface FOKWA, University of California, Riverside
Project Abstract: The project will focus on applying machine learning to discover new metal borides with high magnetocrystalline anisotropy and high Curie temperatures, with the long-term goal of realizing rare-earth-free permanent magnets (PMs) that can compete with or surpass current PMs. The creation of DFT databases of predicted structures and of accumulated experimental data (e.g., the Materials Project and Citrine Informatics) has opened new avenues for designing new materials with targeted properties. In particular, machine learning techniques have provided the ability to use these data sets to rapidly predict the preferred crystal structure or physical properties of intermetallics. Specifically, we will use subsets of known and predicted structures as training sets for the machine learning algorithm, while the large databases available (Materials Project and ICSD) will be used to expand the training data sets, enabling the prediction of new candidate structures.
Project Title: Ocean Reanalysis Data-Driven Deep Learning Forecast
Ruoying He, North Carolina State University
Project Abstract: In this project, a hybrid model combining empirical orthogonal functions (EOF), complete ensemble empirical mode decomposition (CEEMD), and artificial neural networks (ANNs) will be developed to enable efficient and accurate ocean forecasts for the Northwest Atlantic Ocean. EOF analysis transforms the spatial-temporal prediction problem into a time series prediction problem. It reduces computational effort and dimensionality, captures spatial relationships, and accounts for correlations between different variables. CEEMD then improves the predictability of the nonlinear time series. ANNs are subsequently used to predict the CEEMD-derived time series from the principal components (PCs) corresponding to the EOFs. This work is expected to lay a solid foundation for AI research in oceanography and to provide a temporal-spatial prediction of ocean conditions that can be used for marine hazard forecasting and mitigation.
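The EOF step reduces to an SVD, as the sketch below shows with toy data; each retained principal-component time series would then be CEEMD-decomposed and forecast by an ANN before the field is rebuilt.

    import numpy as np

    # Rows are time snapshots, columns are grid points of an ocean
    # field (random data stands in here for illustration).
    field = np.random.default_rng(1).normal(size=(500, 200))
    field -= field.mean(axis=0)                  # remove the mean state
    U, S, Vt = np.linalg.svd(field, full_matrices=False)
    eofs = Vt[:10]                               # leading spatial modes
    pcs = U[:, :10] * S[:10]                     # their time series (PCs)

    # Each PC would be CEEMD-decomposed and forecast by an ANN;
    # the forecast field is then rebuilt from the modes:
    reconstruction = pcs @ eofs                  # rank-10 approximation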
Project Title: Characterizing DNN training on Neocortex
Xulong Tang, University of Pittsburgh
Project Abstract: This project aims to characterize Neocortex, the new hardware designed for AI applications. We aim to study hardware execution statistics, including execution bottlenecks. The results and observations will help develop better application mappings and improve architecture designs for executing AI applications on Neocortex.
Project Title: Interpretable Deep Modeling of SARS-CoV-2 Sequences
Gail Rosen, Drexel University
Project Abstract: We propose to use Neocortex to generate interpretable deep learning models of how SARS-CoV-2 (COVID-19) sequence variation affects viral phenotype, viral evolution, and host phenotype / clinical outcomes. To date, nearly 4.9 million viral genome sequences have been collected and submitted to the GISAID Initiative’s central database (http://www.gisaid.org). This volume of data represents an unprecedented opportunity to learn more about this novel disease and how it is evolving and changing. Building on our research group’s prior work on interpretable deep learning, we employ a Transformer architecture, with an optional CNN filter to reduce model complexity, and a distinct sequence-wide attention layer for interpretability. Our framework provides two levels of interpretability, generating both attention graphs that reveal important sequence features and embeddings that can be used to visualize underlying patterns in sequence variation. We will use the Neocortex architecture to analyze larger COVID-19 sequence data sets and improve our deep modeling framework.
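In the spirit of the described framework, the sketch below combines a CNN front-end, a Transformer layer, and a single sequence-wide attention pooling whose weights can be inspected for interpretability; all dimensions and the AttentiveEncoder name are illustrative assumptions, not the group's model.

    import torch
    import torch.nn as nn

    class AttentiveEncoder(nn.Module):
        """CNN shortens the one-hot genome sequence, a Transformer
        layer contextualizes it, and one sequence-wide attention
        layer pools it, exposing inspectable attention weights."""
        def __init__(self, d=64):
            super().__init__()
            self.cnn = nn.Conv1d(4, d, 9, stride=3)   # 3x shorter
            self.enc = nn.TransformerEncoderLayer(d, 4, batch_first=True)
            self.attn = nn.Linear(d, 1)               # sequence-wide

        def forward(self, x):                         # x: (B, 4, L)
            h = self.enc(self.cnn(x).transpose(1, 2)) # (B, L', d)
            w = torch.softmax(self.attn(h), dim=1)    # attention weights
            return (w * h).sum(dim=1), w              # embedding + weights

    emb, attn = AttentiveEncoder()(torch.randn(2, 4, 300))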
Project Title: Exploring Interpretable Deep Learning from Information Theoretic Perspective: Modeling and Applications.
Huajie Shao, The College of William & Mary
Project Abstract: Despite the great success of AI techniques in many applications, such as computer vision, self-driving cars, and robotics, it is still hard for humans to fully understand and interpret them. The goal of this proposal is to reason about and understand deep learning models by learning disentangled representations. Disentangled representation learning aims at learning a low-dimensional representation that consists of multiple interpretable latent factors of the observations. The semantically meaningful latent factors help us better explain which factor affects classification and prediction accuracy. However, learning disentangled representations with Variational Autoencoder (VAE) models poses two major challenges. First, many existing models require prior knowledge of some data generative factors from human annotation to train the model, which is labor-intensive. The second challenge is the trade-off between reconstruction and disentanglement learning. This proposal intends to address these two issues by applying control theory, the information bottleneck, self-supervised learning, and causal representation learning. Finally, we plan to apply the disentangled representations from our models to improve downstream tasks such as image generation, reinforcement learning, and text generation. The proposed solution requires high computing capability, on-device memory, and inter-device communication throughput. We believe the CS-1 WSE is a natural fit for our problem and expect it to significantly reduce the number of GPUs required to train the proposed model.
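One concrete way control theory enters this picture is a PI controller on the KL weight of a VAE loss (loosely in the spirit of the PI's prior ControlVAE work); the gains and functional form below are illustrative assumptions, not the proposal's exact formulation.

    import math

    def pi_beta(kl_value, target_kl, state, kp=0.1, ki=1e-3):
        """PI controller sketch: when the measured KL exceeds the
        target, raise beta to penalize it harder; when it
        undershoots, relax beta. Gains are illustrative."""
        error = kl_value - target_kl
        state["integral"] = min(max(state["integral"] + ki * error, 0.0), 1.0)
        beta = 1.0 / (1.0 + math.exp(-kp * error)) + state["integral"]
        return beta, state

    state = {"integral": 0.0}
    beta, state = pi_beta(kl_value=25.0, target_kl=18.0, state=state)
    # per step: total_loss = reconstruction_loss + beta * kl_divergence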
Project Title: Large-scale Pre-training for Natural Language to Code Generation
Graham Neubig, Carnegie Mellon University
Project Abstract: This project aims to create pre-trained models for natural language to code generation, the task of generating programs from natural language descriptions. This has the potential to make programming easier, and perhaps even allow command and control of computers by non-programmers. Our research team has extensive experience in this area but lacks the resources to scale models to very large datasets, such as training on the entirety of GitHub, which this proposal aims to address. We also plan to examine novel models for code generation based on non-parametric methods, which look up related examples in a training corpus; this is important for both performance and interpretability. All models we develop will be made available open source for the community to use.
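A minimal sketch of the non-parametric lookup: embed the natural-language query, retrieve the nearest training descriptions by cosine similarity, and hand their code snippets to the generator. Embeddings are assumed precomputed, and all names are placeholders.

    import numpy as np

    def retrieve(query_vec, corpus_vecs, corpus_snippets, k=3):
        """Nearest-neighbor lookup for a non-parametric generator:
        return the code snippets whose descriptions are closest to
        the query embedding (cosine similarity)."""
        sims = corpus_vecs @ query_vec / (
            np.linalg.norm(corpus_vecs, axis=1) * np.linalg.norm(query_vec))
        top = np.argsort(-sims)[:k]
        return [corpus_snippets[i] for i in top]

    vecs = np.random.default_rng(2).normal(size=(5, 16))
    snippets = [f"snippet_{i}" for i in range(5)]
    print(retrieve(vecs[0], vecs, snippets))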
Project Title: Automated sleep states classification for wide-field calcium imaging using deep learning
Mark Anastasio, University of Illinois Urbana-Champaign
Project Abstract: Wide-field calcium imaging (WFCI) with genetically encoded calcium indicators enables spatial-temporal recordings of neuronal depolarization in mice on a sub-second temporal scale, with simultaneous examination of neurovascular coupling and cell-type specificity. When applied to the study of sleep, it requires human experts to manually score hours of WFCI recordings by use of adjunct electroencephalogram (EEG) and electromyogram (EMG) signals. However, this process is tedious, time-consuming, invasive, and often suffers from low inter- and intra-rater reliability. Therefore, an automated sleep state classification method applied to sequential WFCI data is desired. Given that sleep is a cyclic process and that WFCI provides high temporal resolution, we are interested in investigating deep learning models that exploit temporal dependencies among events to classify sleep states on a large-scale dataset of spatial-temporal sequential WFCI recordings. In addition, uncovering the spatial-temporal features underlying calcium dynamics in mice by use of deep learning may enable future sleep-focused studies with WFCI.
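As one sketch of a temporal-context classifier, the dilated temporal convolutions below assign a sleep state to each time point of per-frame WFCI features; the feature dimension, class set, and sizes are illustrative assumptions, not the planned model.

    import torch
    import torch.nn as nn

    # Dilated temporal convolutions over per-frame WFCI features:
    # padding equals dilation so the sequence length is preserved,
    # and each output sees a growing temporal context.
    tcn = nn.Sequential(
        nn.Conv1d(16, 32, 3, padding=1, dilation=1), nn.ReLU(),
        nn.Conv1d(32, 32, 3, padding=2, dilation=2), nn.ReLU(),
        nn.Conv1d(32, 3, 1))            # 3 classes per time step
                                        # (e.g. wake / NREM / REM)
    features = torch.randn(1, 16, 600)  # (batch, feature, time)
    states = tcn(features).argmax(dim=1)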
Project Title: Impute cell free DNA fragmentation pattern from low-coverage whole-genome sequencing
Yaping Liu, Cincinnati Children's Hospital Medical Center
Project Abstract: TBA
Project Title: Large-scale spiking network models to explain dynamics of visual perception and working memory
Lyle Muller, Salk Institute for Biological Studies
Project Abstract: TBA
Project Title: An Integrated Machine Learning Platform of GWAS (Genome Wide Association Study) and Epigenetics for Personalized Bladder Cancer Clinical Applications
Zhiyong Zhang, Stanford University
Project Abstract: TBA
Project Title: Large scale Machine Learning force fields for metal hydride systems
Venkatasubramanian Viswanathan, Carnegie Mellon University
Project Abstract: Machine learning has enabled the prediction of various material properties, from formation energies and HOMO/LUMO levels to atomic energies and forces. The increasing number of available materials and molecule datasets presents an opportunity to train machine learning models on datasets larger than those typically used in materials science, with larger sets of descriptors and more model parameters. The computational cost of training typically limits dataset size and model size. In this work, we train machine learning models to predict scalar properties of metal hydrides, materials that have been shown to have high-temperature superconducting properties, as well as of molecular datasets such as QM9 that are important in various chemical processing industries. Neocortex will allow us to push the limits of training-set and model sizes at record training speeds in an attempt to beat state-of-the-art accuracy on scalar properties.
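A standard trick such force-field models rely on, sketched below: compute forces from a differentiable energy model via autograd as -dE/dx. The toy pairwise network is an illustrative stand-in for a trained ML potential.

    import torch

    # Toy "potential": a small network over pair distances; a trained
    # ML potential would replace it.
    net = torch.nn.Sequential(torch.nn.Linear(1, 16), torch.nn.Tanh(),
                              torch.nn.Linear(16, 1))

    pos = torch.randn(6, 3, requires_grad=True)      # 6 atoms
    dists = torch.pdist(pos).unsqueeze(1)            # pair distances
    energy = net(dists).sum()                        # total energy
    forces = -torch.autograd.grad(energy, pos)[0]    # (6, 3) forces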
Project Title: AI Understanding of Ultrasound Scans: Semantic Segmentation and Diagnosis Trained with Simulation and Genetic/Back-Prop Hybrid Training
John Galeotti, Carnegie Mellon University
Project Abstract: TBA
Project Title: Identifying Actor Characteristics in State-Linked Information Operations Using Twitter Data and Graph Based Neural Networks
John Wohlbier, Carnegie Mellon University
Project Abstract: TBA
Project Title: Voxel Pretraining for Few-Shot Learning
William Bradley, Mirabolic Consulting
Project Abstract: Because large, labelled medical imaging datasets can be difficult to collect, we are interested in few-shot learning problems related to medical imaging (MRI and CT scans). To that end, we are interested in pretraining a voxel-based transformer network with a masked language model (MLM) objective through modifications of BERT. Voxel count grows cubically with edge size and a standard transformer's memory usage grows quadratically with the number of input elements, so traditional models can only examine voxels within a very small radius of a target point. We hope that the greater memory bandwidth of Neocortex will allow us to raise this limit, and that the larger context will improve model performance.
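A minimal sketch of BERT-style masking extended to voxels: split a volume into cubic patches, hide a random fraction, and train a transformer to reconstruct them. The mask_voxel_patches helper and its parameters are illustrative assumptions, not the planned model.

    import torch

    def mask_voxel_patches(volume, patch=8, mask_frac=0.15):
        """Split a 3D volume into cubic patches, zero out a random
        fraction, and return the corrupted volume plus the mask so a
        model can be trained to reconstruct the hidden patches."""
        d, h, w = volume.shape
        grid = torch.rand(d // patch, h // patch, w // patch) < mask_frac
        mask = (grid.repeat_interleave(patch, 0)
                    .repeat_interleave(patch, 1)
                    .repeat_interleave(patch, 2))
        corrupted = volume.clone()
        corrupted[mask] = 0.0
        return corrupted, mask

    vol = torch.randn(64, 64, 64)       # e.g. an MRI sub-volume
    corrupted, mask = mask_voxel_patches(vol)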
Project Title: Training of conservative physics-informed neural networks (CPINN) to solve the incompressible Navier-Stokes equation at high Reynolds number
George Karniadakis, Brown University
Project Abstract: TBA
Project Title: Simulation and Benchmarking of Quantum Machine Learning
Jason Larkin, Carnegie Mellon University
Project Abstract: TBA