Systems Engineer, Electrical and Computer Engineering / Information Networking Institute
I joined Carnegie Mellon University in 2003 as a Systems Faculty member with the Electrical and Computer Engineering Department and the Information Networking Institute. I was previously a Research Staff Member with Motorola's Broadband Communications Division in San Diego, CA, where I was involved in the H.264 video-compression standardization activity. I received a Motorola Outstanding Performance award in 2002 in recognition of my contributions to global standardization activities. Prior to this, I received my Ph.D. in March 2000 from the University of California, Santa Barbara, and my B.Tech. degree from IIT Bombay in 1994.
My research interests are in the area of problem diagnosis, or fingerpointing, in large-scale distributed systems. Problem diagnosis involves instrumenting a given system to gather meaningful data, and then analyzing the collected data to detect the source, or even the root cause, of problems in the system. Fingerpointing is challenging because the distributed nature of processing/computation can cause a problem on one node to affect the behavior of all the nodes in the system. We are currently working on identifying performance problems in MapReduce systems such as Hadoop, and in file systems such as PVFS, Lustre, BFS and CoreFS. Our current fingerpointing algorithms use black-box data and/or white-box data to fingerpoint a faulty node in Hadoop and these file systems. My current research projects include the following:
- Problem Diagnosis in PVFS/Lustre: Automatically diagnosing performance problems in parallel file systems by identifying, gathering and analyzing either OS-level black-box performance metrics or system call attributes across parallel file systems.
- Kahuna: Diagnosing performance problems in Hadoop by comparing OS-level performance metrics and Hadoop's log statistics across all the nodes of a cluster to fingerpoint a faulty node.
- SALSA: Analyzing Logs as StAte machines: SALSA examines Hadoop logs to derive a state-machine view of the system's execution, along with control-flow and data-flow models and related statistics. The state-machine view of Hadoop is then used for failure diagnosis and for visualizing Hadoop's distributed behavior.
- Gumshoe: Failure diagnosis in distributed systems through the application of statistical anomaly-detection algorithms, machine-learning techniques such as clustering, etc.
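The peer-comparison idea underlying several of these projects can be sketched in a few lines: in a homogeneous cluster where all nodes do similar work, the node whose OS-level metric deviates most from its peers is a fingerpointing candidate. The sketch below is purely illustrative, with hypothetical data, metric names, and threshold; it is not the actual algorithm used in Kahuna or Gumshoe.

```python
# Illustrative peer-comparison fingerpointing sketch (not the actual
# Kahuna/Gumshoe algorithm): flag any node whose metric deviates from
# the peer median by more than `threshold` median absolute deviations.
from statistics import median

def fingerpoint(metrics, threshold=3.0):
    """metrics: {node_name: metric_value} sampled across peer nodes."""
    values = list(metrics.values())
    med = median(values)
    mad = median(abs(v - med) for v in values) or 1e-9  # guard div-by-zero
    return [node for node, v in metrics.items()
            if abs(v - med) / mad > threshold]

# Hypothetical per-node average I/O-wait times (arbitrary numbers).
io_wait = {"node1": 12.1, "node2": 11.8, "node3": 58.4, "node4": 12.3}
print(fingerpoint(io_wait))  # node3 stands out from its peers
```

In this sketch the anomalous node is identified from black-box data alone, without knowledge of the application; a white-box variant would compare application-level statistics (e.g., per-task durations from Hadoop's logs) in the same peer-wise fashion.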
I am fortunate to work with talented students such as Jiaqi Tan, Soila Kavulya, Michael Kasick and Xinghao Pan. I am also affiliated with the Center for Sensed Critical Infrastructure Research (CenSCIR) and the Parallel Data Lab (PDL) at CMU.