Carnegie Mellon University

Benjamin Moseley

November 27, 2018

Break Through Big Data With Strategic Clustering Algorithms

As “big data” becomes more pervasive across business functions, it becomes ever more important for businesses to have efficient methods for analyzing large sets of data. While data scientists are experts in mathematical analysis, the managers, marketers, sales representatives, and others who need to use the data may not be. With the right resources, these professionals can visualize large sets of data to make objective decisions.

Benjamin Moseley, Assistant Professor of Operations Research, together with Joshua Wang of Stanford University’s Department of Computer Science, sought an optimal mathematical framework to categorize data via machine learning algorithms. In their paper, “Approximation Bounds for Hierarchical Clustering: Average Linkage, Bisecting K-Means, and Local Search,” the co-authors analyze a common method of analysis known as “hierarchical clustering,” in which sets of data points are grouped into larger and larger clusters based on similar characteristics. 

For example, biologists organize species into a tree of nested groups, as in Linnaean taxonomy.

“Grouping data into clusters where data points are similar is used to discover the key features of fraudulent transactions, customer behavior, and sales transactions,” the team wrote. “Improving clustering methods will reveal more subtle and difficult-to-find connections in data. This can lead to faster detection of fraud or better forecasts of future customer behavior.”

As their paper describes, there are two primary strategies for hierarchical clustering: agglomerative and divisive. Agglomerative clustering works bottom-up: each data point starts in its own cluster, and the most similar clusters are repeatedly merged into larger ones. Divisive clustering works in reverse, from the top down: the set of all data points is recursively split into groups based on their differences.
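The bottom-up direction can be sketched in a few lines. This is an editorial illustration on one-dimensional toy data, not code from the paper; the function name and the closest-pair merge rule used here are illustrative choices.

```python
def agglomerative(points):
    """Bottom-up hierarchical clustering sketch: start with each
    point in its own cluster, then repeatedly merge the pair of
    clusters whose closest members are nearest to each other.
    Returns the sequence of merged clusters (the hierarchy)."""
    clusters = [[p] for p in points]
    merges = []
    while len(clusters) > 1:
        # Pick the pair of clusters with the smallest gap between
        # their closest members (a closest-pair merge rule).
        i, j = min(
            ((a, b) for a in range(len(clusters))
                    for b in range(a + 1, len(clusters))),
            key=lambda ab: min(abs(x - y)
                               for x in clusters[ab[0]]
                               for y in clusters[ab[1]]))
        merged = clusters[i] + clusters[j]
        merges.append(sorted(merged))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges
```

On the points 0, 1, and 10, the sketch first merges 0 with 1, then merges that pair with 10, producing the two-level hierarchy one would expect.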

The researchers found that an agglomerative method in which clusters are merged based on average similarities (as opposed to minimum or maximum similarities), known as average linkage, is close to being an optimal method of clustering. With this method, data analysts can build mathematical models more efficiently, helping an operations manager, for example, use software that categorizes customer data to inform strategic decisions.
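The three merge criteria contrasted above can be written out directly. This is a minimal sketch assuming a user-supplied pairwise similarity function `sim`; the function names and the toy similarity in the usage note are illustrative, not the authors' code.

```python
def average_linkage(A, B, sim):
    """Average pairwise similarity between clusters A and B --
    the criterion the article reports as close to optimal."""
    return sum(sim(a, b) for a in A for b in B) / (len(A) * len(B))

def single_linkage(A, B, sim):
    """Maximum pairwise similarity (the closest pair)."""
    return max(sim(a, b) for a in A for b in B)

def complete_linkage(A, B, sim):
    """Minimum pairwise similarity (the farthest pair)."""
    return min(sim(a, b) for a in A for b in B)
```

For instance, with a made-up similarity `sim(a, b) = 1 / (1 + |a - b|)` on numbers, the average-linkage score of the clusters {0, 1} and {3} falls strictly between the single-linkage and complete-linkage scores, which is what makes it less sensitive to a single unusually close or unusually distant pair.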

“This research is exciting because it can be used to guide the design of improved solutions,” Moseley said. “We are leveraging this discovery to build algorithms that can process billions or trillions of data points in seconds — orders of magnitude faster than software currently available.” 

Read more in the fall/winter 2018 Tepper Magazine.