Big Data-Silicon Valley Campus - Carnegie Mellon University

Big Data

Big data is currently one of the hottest areas in the software industry. This spring, three students in Carnegie Mellon's MS Software Engineering program, Andrew Chen, Brian Tran, and Raj Desai, are teaming up with associate research professor Ole Mengshoel on a semester-long practicum project applying machine learning techniques to social network datasets. The goal is to analyze social data records such as phone calls and mobile phone app check-ins to predict relationships and future behavior of the users.

Continuing price drops for storage and computing power have allowed a multitude of communications and web companies to gather and manipulate large amounts of data collected from their users. Advances in software have also greatly simplified storing and retrieving this data. Whereas large datasets traditionally had to be stored in expensive relational databases such as Oracle, open-source “NoSQL” solutions stemming from Apache’s Hadoop project have taken the industry by storm thanks to their low cost and tremendous scalability. Managing large datasets using these NoSQL solutions has come to be known as “big data”.

Two promising applications of big data are learning about social trends and mobility patterns. One particular area of interest is mobile call data records (CDRs). A CDR is logged for each call placed or received by a user and includes fields such as caller, call recipient, time, duration, and cell tower ID. Since the locations of cell towers are known to the telephone company, interesting mobility information, such as traffic patterns, can be mined from call records. Another popular domain is social networks such as Facebook, Twitter, and Foursquare. The information recorded by social networking sites can include unique users, social structure, and in some instances even the geographical location of the user. Many of these datasets are made publicly available for general use.
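As a minimal sketch of the mobility idea, one can tally calls routed through each tower by hour of day; busy (tower, hour) pairs hint at local traffic. The CDR fields and values below are hypothetical, since real schemas vary by carrier:

```python
from collections import Counter

# Hypothetical CDR tuples: (caller, recipient, hour_of_day, duration_sec, tower_id)
cdrs = [
    ("alice", "bob",   8, 120, "T1"),
    ("carol", "dave",  8,  45, "T1"),
    ("alice", "carol", 9, 300, "T2"),
    ("bob",   "dave",  8,  60, "T1"),
]

def tower_traffic(records):
    """Count calls per (tower, hour) as a rough proxy for local activity."""
    return Counter((tower, hour) for _, _, hour, _, tower in records)

traffic = tower_traffic(cdrs)
print(traffic[("T1", 8)])  # 3 calls went through tower T1 during hour 8
```

A real pipeline would aggregate billions of such records, but the per-key counting step looks the same at any scale.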

The goal of the project is to apply machine-learning algorithms to publicly available datasets in these two domains in the hopes of accurately predicting some aspect of the data. Examples include predicting whether a call will be placed between two parties, or whether two people in a social network are friends, a task known in the industry as link prediction. The team will focus in particular on datasets with location information and use that information for link prediction.
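A classic baseline for link prediction is the common-neighbors score: two users who share many friends are more likely to become (or already be) friends. The sketch below is only an illustration of that baseline, not the team's actual method:

```python
def common_neighbors_score(graph, u, v):
    """Number of shared neighbors; a higher score suggests a likelier link."""
    return len(graph.get(u, set()) & graph.get(v, set()))

# Toy friendship graph stored as adjacency sets
graph = {
    "a": {"b", "c"},
    "b": {"a", "c", "d"},
    "c": {"a", "b", "d"},
    "d": {"b", "c"},
}

print(common_neighbors_score(graph, "a", "d"))  # a and d share friends b and c -> 2
```

Scores like this are often fed as features into classifiers such as the decision trees and Naïve Bayes models mentioned later, rather than used alone.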

Two of these datasets are provided by Stanford University’s Network Analysis Project (SNAP). They come from the now-defunct online social networking companies Gowalla and Brightkite, whose products operated much like the popular Foursquare: users had “friends” in the network and could “check in” to physical locations using their mobile phones. Gowalla’s dataset has approximately 200 thousand users and over 6 million check-ins, while Brightkite’s has about 60 thousand users and almost 5 million check-ins.
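The check-in dumps are plain text, one check-in per line. The tab-separated layout assumed below (user id, timestamp, latitude, longitude, location id) is an assumption for illustration, and the sample values are made up; the actual files should be checked against SNAP's documentation:

```python
from collections import Counter

# Assumed line layout: user_id \t timestamp \t latitude \t longitude \t location_id
sample = """\
0\t2010-10-19T23:55:27Z\t30.2359\t-97.7951\t22847
0\t2010-10-18T22:17:43Z\t30.2691\t-97.7494\t420315
1\t2010-10-17T23:42:03Z\t30.2557\t-97.7633\t316637
"""

def checkins_per_user(text):
    """Tally check-in counts by user id from a raw text dump."""
    counts = Counter()
    for line in text.strip().splitlines():
        user, _ts, _lat, _lon, _loc = line.split("\t")
        counts[user] += 1
    return counts

counts = checkins_per_user(sample)
print(counts["0"], counts["1"])  # 2 1
```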

To process this data, the team is using a multitude of open-source technologies. Hadoop provides distributed file storage along with Map-Reduce functionality, a technique for parallelizing computation tasks across multiple servers. Hive, a technology built on top of Hadoop, provides a SQL interface for gathering analytics from the data. For machine learning, Weka supplies many built-in algorithms for quick experimentation; decision trees, Naïve Bayes, and support vector machines (SVMs) are among the algorithms being used in the project. Mahout, a machine learning library that runs on top of Hadoop, is also being explored.
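The essence of Map-Reduce is a map phase that emits key-value pairs, a shuffle that groups pairs by key, and a reduce phase that combines each group. A minimal single-process sketch of that flow (Hadoop runs the same three steps distributed across machines):

```python
from itertools import groupby
from operator import itemgetter

def map_phase(record):
    """Emit (key, 1) pairs; here the key is the check-in location."""
    user, location = record
    yield (location, 1)

def reduce_phase(key, values):
    """Combine all values for one key; here a simple sum."""
    return (key, sum(values))

records = [("alice", "cafe"), ("bob", "cafe"), ("alice", "park")]

# Shuffle/sort: group the mapped pairs by key, as the framework does between phases
pairs = sorted(p for r in records for p in map_phase(r))
result = dict(reduce_phase(k, (v for _, v in grp))
              for k, grp in groupby(pairs, key=itemgetter(0)))
print(result)  # {'cafe': 2, 'park': 1}
```

Because map and reduce touch only one record or one key group at a time, the framework can split the work across any number of servers without changing the user's code.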

The end goal of the project is a research paper detailing the team’s findings, to be submitted to an upcoming conference; possible venues include KDD and NIPS.