January 31, 2020
New Study Finds a Cost-Efficient Method for Active Search
In the sphere of machine learning there is a subfield known as active learning, which is when AI technology actively chooses data from a large, unlabeled pool. A paradigm of active learning is known as active search, through which the AI identifies a positive set of data points within this large pool. Since it is usually assumed that the positive points are rare within this pool, the search is challenging and the labeling process costly. Most active search efforts have focused on budget setting—an heuristic method wherein a user specifies the total number of observations an AI can make within a given budget. A new paper studies cost effective active search with the goal of finding a predefined number of positive points from a large unlabeled pool with minimum labeling cost. By conducting comprehensive experiments on drug and materials discovery datasets, the study found that this new method is more cost efficient than existing methods. The study also suggests that these findings could lead to superior learning software across numerous domains.
The study, by researchers at Carnegie Mellon University (CMU) and Washington University in St. Louis, appears in NeurIPS.
“This work considers a key problem in automated scientific discovery and improves the state-of-the-art learning algorithm. Frequently, scientists search for elements with a desired property, such as a drug addressing a particular disease or a metal with properties needed for an aircraft. This paper offers an automated process for selecting the elements to test that are most likely to have the desired property. This is exciting because it has the potential to impact many areas of science and engineering reliant on such methods.” says Ben Moseley, an Assistant Professor of Operations Research and Carnegie Bosch Junior Faculty Chair at CMU’s Tepper School of Business, who coauthored the study.
In the first study of its kind, the researchers applied a Bayesian decision-theoretic framework to study cost effective active search. First, they derived the Bayesian optimal policy and established that not only is the optimal policy difficult and inefficient to compute, but even using it to make an approximation is challenging. They then proposed an efficient strategy to approximate the optimal policy by introducing the negative Poisson binomial distribution, which they defined as the number of coins that need to be flipped in order to achieve a given number of “heads” results. Finally, they proposed efficient ways to approximate its expectation.
The researchers conducted experiments to compare their proposed strategy for approximating the optimal policy to three common baselines: the most widely used greedy policy, which chooses a point with the highest probability; a two-step lookahead policy, which considers the maximum probability a point would lead to if it is negative; and the recently developed nonmyopic search policy that has proven to be more efficient than the other two.
The authors chose to experiment using drug discovery as it is one of the main applications of active search. They simulated a virtual drug screening to find chemical compounds exhibiting binding activities with a certain biological target, ultimately conducting 270 experiments for each policy. On average, the researchers’ proposed strategy outperformed the others by a wide margin. An especially promising result was that their policy showed a 56-73 percent cost reduction on average compared to the standard greedy algorithm.
The researchers are especially encouraged since these results could lead to several more avenues of study. Determining how to adapt the study in a more principled way is an interesting future direction, as is extending the proposed method to batch setting, where multiple points are evaluated simultaneously.
“The community’s understanding of active search is still in its infancy. This work offers some of the first foundational work in the area and sets the path towards a broader understanding that will be developed. Already, the results are exciting and I look forward to seeing more breakthroughs in the area.” says Moseley.
Summarized from an article in NeurIPS, Cost Effective Active Search, Moseley, Benjamin (Carnegie Mellon University), Jiang, Shali (Washington University in St. Louis) and Garnett, Roman (Washington University in St. Louis). Copyright 2020. All rights reserved.