Carnegie Mellon University
June 30, 2023

Researchers Develop Active Learning Workflow to Optimize Drug Design

By Heidi Opdyke

Jocelyn Duffy
  • Associate Dean for Marketing and Communication, MCS
  • 412-268-9982

At the height of the COVID-19 pandemic, researchers from Carnegie Mellon University's Department of Chemistry built computer simulations of COVID-related protein inhibitors in an attempt to identify drug candidates that could treat the virus. In doing so, they developed an efficient automated workflow for identifying new compounds that could be used to develop future pharmaceutical treatments for a wide range of diseases or conditions.

The work by Evgeny "Eugene" Gutkin, a doctoral candidate in Professor of Chemistry Maria Kurnikova's research group, and Filipp Gusev, a graduate student in the joint CMU-Pitt Ph.D. program in Computational Biology and a member of Assistant Professor of Chemistry Olexandr Isayev's group, is described in a paper published in the American Chemical Society's Journal of Chemical Information and Modeling and highlighted on the journal's cover.

When looking for new drug candidates, experts often consider repurposing known molecules or modifying them, Gusev said. Computational techniques add a layer of computer-aided design that helps find those improvements faster.

"This approach requires expertise in the field to narrow down options but historically has been biased because it doesn't consider underexplored areas in chemical space," Gusev said. "The computational approach we developed and applied is agnostic to those biases because it's purely data-driven."

[Illustration: an active learning cycle that combines simulations and machine learning to develop pharmaceutical drugs]

The Carnegie Mellon researchers developed an efficient automated workflow for identifying compounds with high binding affinity to the target protein among thousands of congeneric ligands. Automated machine learning combined with molecular dynamics-based free energy calculations, orchestrated by active learning, allows an unbiased and efficient search for a small set of best-performing molecules. Starting from the structure of a known SARS-CoV-2 PLpro inhibitor (shown on the left), the researchers screened a library of 1.3 billion commercially available compounds with a similar substructure (highlighted in green) and identified several compounds with more than 100-fold improvement in predicted binding affinity (the best-performing molecule is shown on the right).

Gusev and Gutkin were looking for potent inhibitors of the SARS-CoV-2 papain-like protease, compounds that disrupt the replication of the coronavirus. To do so, they searched among thousands of molecules sharing a common substructure for those with the lowest protein-ligand binding free energy, which is a crucial indicator of drug potency.

Through an automated workflow that started with 1.6 billion commercially available molecules and narrowed to some 8,000 candidates, they were able to find 133 compounds that performed better than the known inhibitor. Of these, 16 showed more than 100-fold improvement in binding affinity, which in theory leads to significantly better inhibitory activity.
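To give a sense of scale, a fold improvement in binding affinity can be converted into a change in binding free energy. The short sketch below assumes the standard thermodynamic relation ΔΔG = -RT ln(fold), with "fold" taken as the ratio of dissociation constants and the temperature set to 298.15 K. Both are assumptions made for illustration; the article does not state how the researchers report these values. The function name `ddg_from_fold` is hypothetical.

```python
import math

# Gas constant in kcal/(mol*K).
R_KCAL = 1.987204e-3

def ddg_from_fold(fold, temperature=298.15):
    """Change in binding free energy (kcal/mol) for a given fold improvement.

    Assumes fold = Kd(reference) / Kd(new compound) and
    ddG = -R * T * ln(fold). Negative values mean tighter binding.
    """
    return -R_KCAL * temperature * math.log(fold)

ddg_100 = ddg_from_fold(100.0)
print(f"100-fold improvement ~ {ddg_100:.2f} kcal/mol")  # about -2.73 kcal/mol
```

Under these assumptions, a 100-fold improvement corresponds to a binding free energy roughly 2.7 kcal/mol lower than that of the reference inhibitor.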

"Our hit rate outperformed that expected of traditional expert medicinal chemist-guided campaigns," Gutkin said. By combining active learning with automated machine learning, the researchers obtained their results 20 times faster than a brute-force approach, in which calculations would be performed for every molecule in the focused set of 8,000.
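The general idea of an active learning loop can be sketched with a toy example. This is not the researchers' workflow: the random feature vectors, the `expensive_oracle` function (a hypothetical stand-in for a costly free energy calculation), the ridge-regression surrogate and the greedy selection rule are all assumptions chosen to keep the sketch small. The budget of 400 evaluations is picked arbitrarily to be one-twentieth of an 8,000-item pool.

```python
import numpy as np

rng = np.random.default_rng(0)
N_POOL, N_FEATURES = 8000, 16

# Toy data: random descriptors for each candidate and a hidden scoring rule.
pool = rng.normal(size=(N_POOL, N_FEATURES))
TRUE_WEIGHTS = rng.normal(size=N_FEATURES)

def expensive_oracle(x):
    """Hypothetical stand-in for a costly calculation (lower is better)."""
    return float(x @ TRUE_WEIGHTS + 0.1 * np.sin(x.sum()))

def fit_ridge(X, y, alpha=1.0):
    """Closed-form ridge regression used as a cheap surrogate model."""
    A = X.T @ X + alpha * np.eye(X.shape[1])
    return np.linalg.solve(A, X.T @ y)

def active_learning(pool, rng, n_init=40, batch=40, rounds=9):
    """Return {candidate index: oracle value} for every evaluated candidate."""
    labeled = {}
    # Start from a small random sample of the pool.
    for i in rng.choice(len(pool), size=n_init, replace=False):
        labeled[int(i)] = expensive_oracle(pool[i])
    for _ in range(rounds):
        idx = list(labeled)
        y = np.array([labeled[i] for i in idx])
        weights = fit_ridge(pool[idx], y)
        predicted = pool @ weights
        predicted[idx] = np.inf  # never re-evaluate a known candidate
        # Greedy step: evaluate the candidates the surrogate ranks best.
        for i in np.argsort(predicted)[:batch]:
            labeled[int(i)] = expensive_oracle(pool[i])
    return labeled

labeled = active_learning(pool, rng)
best_index = min(labeled, key=labeled.get)
print(f"oracle calls: {len(labeled)} of {N_POOL} candidates")
print(f"best candidate found: index {best_index}")
```

In this toy setting the loop evaluates 400 candidates instead of all 8,000. A real campaign would replace the oracle with actual simulations and would likely use a more careful selection strategy than pure greedy ranking.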

Identifying compounds and designing drug candidates from a known starting point, a process known as lead optimization, is a looming challenge for modern computational chemistry, Gusev said. Computationally intensive campaigns are limited by the availability of computational resources for molecular dynamics simulations as well as the difficulty of performing computations in a high-throughput manner. For this work, the researchers used the Pittsburgh Supercomputing Center's Extreme Science and Engineering Discovery Environment (XSEDE) and the Frontera computing project at the Texas Advanced Computing Center (TACC).

"A key outcome of this project is the development of the workflow that combines machine learning and molecular dynamics-based free energy calculations," Gutkin said. "We are planning to refine and optimize the workflow further and apply it to design potent inhibitors for other molecular targets."

This research was supported by the DSF Charitable Foundation, the COVID-19 HPC Consortium and the National Science Foundation.

"The beauty of the method is that it is transferable," Isayev said. "We applied it to COVID-19, and also we're testing it in a couple of other projects." 
