2004 NSF Participants-Department of Biological Sciences - Carnegie Mellon University

2004 National Science Foundation (NSF)-supported Participants

Julianna Conde-Adorno, Universidad Metropolitana
(Mentor: Dr. Robert Murphy)

Extraction of Subcellular Locations From Protein Databases
Methods have been created for the quantitative analysis of fluorescence microscope images for a large number of proteins to determine their subcellular location patterns. The goal of this project is to improve the algorithm used to extract the subcellular locations contained in structured protein databases. We therefore wished to use the cellular component portion of the Gene Ontology (GO) database to associate a protein name with a specific subcellular location. Since more than one GO term can be assigned to a protein, we chose to use the lowest common ancestor (LCA) of these terms as the location of that protein. We implemented LCA search on a local copy of the GO database to provide rapid location determination. We compared our new program with a previous approach to determining location from GO terms. The LCA algorithm is much faster and more accurate. With this program we can compare the subcellular location determined by image analysis with the location extracted from protein databases. This comparison can reveal whether the description obtained from the analysis of the images is consistent with the database annotation and, for example, can identify proteins that were mis-localized due to tagging artifacts.
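The LCA idea described above can be sketched in a few lines of Python. This is a minimal illustration, not the program from the project: the `PARENTS` table is a hypothetical toy fragment of the cellular component hierarchy, and because GO is a directed acyclic graph rather than a tree, the sketch assumes that "lowest" means the deepest term shared by all inputs, with ties broken by name.

```python
from collections import deque

# Hypothetical toy fragment of the GO cellular-component graph: child -> parents.
PARENTS = {
    "cellular_component": [],
    "cell": ["cellular_component"],
    "intracellular": ["cell"],
    "cytoplasm": ["intracellular"],
    "nucleus": ["intracellular"],
    "mitochondrion": ["cytoplasm"],
    "Golgi apparatus": ["cytoplasm"],
    "nucleolus": ["nucleus"],
}

def ancestors(term, parents):
    """Return the term itself plus every term reachable through parent links."""
    seen = {term}
    queue = deque([term])
    while queue:
        for parent in parents[queue.popleft()]:
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen

def depth(term, parents):
    """Length of the longest path from the term up to a root term."""
    if not parents[term]:
        return 0
    return 1 + max(depth(p, parents) for p in parents[term])

def lowest_common_ancestor(terms, parents):
    """Deepest term that is an ancestor of (or equal to) every input term."""
    common = set.intersection(*(ancestors(t, parents) for t in terms))
    # Assumption: ties in depth are broken alphabetically for determinism.
    return max(common, key=lambda t: (depth(t, parents), t))

print(lowest_common_ancestor(["mitochondrion", "Golgi apparatus"], PARENTS))
```

A protein annotated with both "mitochondrion" and "nucleolus" would be assigned "intracellular" under this scheme, since that is the deepest term the two annotations share in the toy graph.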

Alexander David, University of the Virgin Islands
(Mentor: Dr. Robert Murphy)

Comparing and Analyzing Clustering Algorithms When Applied to Proteomics
Proteomics is the term used to describe large-scale documentation and characterization of many or all proteins expressed in a given cell type. Knowledge of a protein's subcellular location is critical to the understanding of its function. Previous research in the Murphy Lab has shown that automated classifier and clustering methods could be trained to recognize and organize protein location patterns. The core of these approaches is the implementation and optimization of the Subcellular Location Features (SLFs), which are numerical descriptors of subcellular location patterns. The purpose of the current research is to compare different approaches to constructing optimal cluster structures for a set of proteins based on their location patterns. One way this will be accomplished is by comparing different distance functions and the effect they have on the structure of a cluster. For this research, nine different distance functions were implemented, and the ratio of the distances between clusters to the distances within clusters was obtained using each distance function. The ones that yielded the highest ratio were implemented in a K-means clustering algorithm, and the generated clusters were compared. Another goal of this research is to analyze the clusters that have been generated in an attempt to determine how well they fit the data set that was used to create them. One method of doing this is to determine the homogeneity of a cluster structure, which is a measure of how well the objects in a class fit in the class. The smaller the distances between the objects, the larger the homogeneity.
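The between-cluster to within-cluster ratio can be illustrated with a short Python sketch. The abstract does not list the nine distance functions or say how distances were aggregated, so the two distance functions and the use of mean pairwise distances below are assumptions chosen for illustration.

```python
import math
from itertools import combinations

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def mean(values):
    values = list(values)
    return sum(values) / len(values)

def between_within_ratio(clusters, dist):
    """Mean distance between points in different clusters divided by
    the mean distance between points in the same cluster."""
    within = [dist(a, b) for c in clusters for a, b in combinations(c, 2)]
    between = [dist(a, b)
               for c1, c2 in combinations(clusters, 2)
               for a in c1 for b in c2]
    return mean(between) / mean(within)

# Two tight, well-separated clusters of made-up 2-D feature vectors.
clusters = [[(0, 0), (0, 1)], [(10, 0), (10, 1)]]
for name, dist in (("euclidean", euclidean), ("manhattan", manhattan)):
    print(name, between_within_ratio(clusters, dist))
```

A higher ratio indicates clusters that are compact relative to their separation, which is the sense in which the abstract ranks distance functions.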

Danielle Izaak, University of the Virgin Islands
(Mentor: Dr. Robert Murphy)

The Comparison of Several Algorithms and Integrating the Most Effective Algorithm in Extracting Protein and Cell Names
Data in the form of figures and accompanying captions in literature present special challenges for biological literature mining. Captions and figures are an important but little-studied part of scientific publications. Researchers in Dr. Murphy's lab have developed a system called SLIF (Subcellular Location Image Finder), which applies both image analysis and text interpretation to the figure and caption pairs extracted from on-line journals. The information about the localization type comes from the image analysis, and the information about the protein name and cell type comes from caption interpretation. The purpose of creating such a system is to extract information about protein subcellular localization and generate assertions such as "Figure N depicts a localization of type L for protein P in cell type C." The present goal was to annotate 500 captions with protein names and cell names, compare the performance of several learning algorithms, identify an effective algorithm, and integrate it into SLIF so that it can be used to extract protein and cell names. Captions from online journals were extracted and labeled manually in a text labeler, and a Hidden Markov Model (HMM) with a dictionary was tested for precision and recall of protein names. This HMM integrated with a dictionary was developed by the researchers in Dr. Murphy's lab to perform a 'soft' match between the names in the text and the items in the dictionary. This algorithm will be compared to other learning algorithms such as the Maximum Entropy (MaxEnt) Model and Conditional Random Fields (CRFs). The algorithm with the best accuracy will be integrated into SLIF. These algorithms, HMM, MaxEnt and CRFs, are all mathematical models of stochastic processes that go through random sequences of states to generate random sequences of outcomes (or observations) according to certain probabilities.
In the protein/cell name extraction problem, the words in the text are our observations and there are basically two states: a protein/cell name state and a non-name state. The learning algorithms are used to determine the sequence of states, given the sequence of words. The most effective algorithm so far is the CRF model. The ongoing work deals with combining other utilities with the model to enhance its performance in extracting assertions from captions and figures. This fully automated online information extractor will provide a truly useful tool for harnessing the vast amounts of information about protein subcellular locations available in online journal articles.
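The two-state formulation can be illustrated with a small Viterbi decoder in Python. This is a sketch of HMM-style decoding only, not the lab's dictionary HMM and not the CRF that performed best: every probability, the three-word `DICTIONARY`, and the crude `emission` score (which is not a normalized distribution over words) are made up for illustration.

```python
import math

STATES = ("NAME", "OTHER")

# All probabilities below are invented for illustration.
START = {"NAME": 0.2, "OTHER": 0.8}
TRANS = {
    "NAME": {"NAME": 0.3, "OTHER": 0.7},
    "OTHER": {"NAME": 0.2, "OTHER": 0.8},
}
DICTIONARY = {"actin", "tubulin", "hela"}  # hypothetical name dictionary

def emission(state, word):
    """Crude emission score: dictionary words are likely in the NAME state."""
    in_dict = word.lower() in DICTIONARY
    if state == "NAME":
        return 0.9 if in_dict else 0.1
    return 0.05 if in_dict else 0.95

def viterbi(words):
    """Return the most probable state sequence for the given words."""
    # best[s] = (log-probability, path) of the best path ending in state s
    best = {s: (math.log(START[s]) + math.log(emission(s, words[0])), [s])
            for s in STATES}
    for word in words[1:]:
        new_best = {}
        for s in STATES:
            score, path = max(
                (best[prev][0] + math.log(TRANS[prev][s]), best[prev][1])
                for prev in STATES)
            new_best[s] = (score + math.log(emission(s, word)), path + [s])
        best = new_best
    return max(best.values())[1]

caption = "Localization of actin in HeLa cells".split()
for word, state in zip(caption, viterbi(caption)):
    print(word, state)
```

With these toy parameters the decoder tags "actin" and "HeLa" as names and the remaining words as non-names. A trained model would estimate the probabilities from the annotated captions instead of fixing them by hand.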

Yenixsa Rivera-Sierra, Universidad Metropolitana
(Mentor: Dr. Robert Murphy)

Portable Subcellular Location Feature Extraction
The area of proteomics focuses on the study of all the proteins expressed by a given cell type or tissue. Subcellular location is one aspect of each protein that is critical for describing and understanding the proteome of a cell, and the most common method for determining subcellular location is interpretation of fluorescence microscope images. This interpretation is usually done visually by the investigator and may therefore be influenced by investigator bias. As a result, automated systems for interpreting protein localization patterns were developed, as well as numerical features that describe the protein patterns. Several feature sets, termed Subcellular Location Features, were developed in our previous work. Current software for calculating these features relies on specialized libraries available only for certain computer platforms. The goal of my project was to improve the speed and portability of the feature calculations by creating standalone routines to replace the current code. A standalone routine that performs the thinning transformation on binary images was developed by implementing a hit-and-miss operation. Our standalone routine was able to perform the same task as the built-in function, producing identical outputs, yet was inferior in terms of computational time. The code was therefore modified to evaluate only the area of interest, thus reducing computational time. We implemented new features based on the outlines of the objects in cell images obtained by erosion and dilation. They are being evaluated both in combination with the original features and independently. Step-Wise Discriminant Analysis was used to find the most informative features of the set. The number of features that produced the best accuracy was used for classification. The performance of various feature sets was evaluated using a Support Vector Machine classifier and 10-fold cross validation.
The outcome of this project will improve the usefulness of our system for other investigators and may result in improved interpretation of fluorescence microscope images.
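A single thinning step built on a hit-and-miss (hit-or-miss) operation can be sketched in plain Python as follows. This is an illustration rather than the project's routine: the function names and the `TOP_EDGE` structuring element are assumptions, pixels outside the image are treated as background, and a full thinning would repeat the step over a set of rotated elements until the image stops changing.

```python
def hit_or_miss(image, element):
    """Mark pixels whose 3x3 neighbourhood matches the structuring element.

    Element entries: 1 = must be foreground, 0 = must be background,
    None = don't care. Pixels outside the image count as background.
    """
    rows, cols = len(image), len(image[0])

    def pixel(r, c):
        return image[r][c] if 0 <= r < rows and 0 <= c < cols else 0

    out = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            if all(want is None or pixel(r + dr, c + dc) == want
                   for dr, row in zip((-1, 0, 1), element)
                   for dc, want in zip((-1, 0, 1), row)):
                out[r][c] = 1
    return out

def thin_once(image, element):
    """One thinning step: remove the pixels matched by the hit-or-miss transform."""
    matched = hit_or_miss(image, element)
    return [[p & (1 - m) for p, m in zip(img_row, m_row)]
            for img_row, m_row in zip(image, matched)]

# Illustrative element: background above the pixel, foreground below it.
TOP_EDGE = [
    [0, 0, 0],
    [None, 1, None],
    [1, 1, 1],
]

square = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
for row in thin_once(square, TOP_EDGE):
    print(row)
```

Restricting the two loops in `hit_or_miss` to the bounding box of the foreground pixels would be one way to evaluate only the area of interest, as the abstract describes.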

Yanine Rosario, Universidad Metropolitana
(Mentor: Dr. Jelena Kovacevic)

Classification Accuracy of Multi-Resolution Sub-Sampled Cellular Images
Cellular images are complex in structure. Presently, experts in the field of biology can examine these cellular images and identify the different classes of proteins and their locations. This manual procedure of analyzing the cellular images is inefficient, prone to error and slow due to the nature of the examinations. The problem is one not only of speed but also of classification accuracy: some classes of proteins are extremely difficult to discriminate, even for experts. Techniques have been developed to classify these images automatically. In this research, we enhance the present system for automatic classification of cellular images. We call this an Intelligent Imaging System, which has been developed to acquire information from the image by means of a computer program. We increase the efficiency by providing higher resolution in areas where more pertinent information is located. We hypothesized that the multi-rate algorithm would not only be much faster in acquiring pertinent information from the image but also give better accuracy in classifying the objects in the image. We classified the original images using a Support Vector Machine (SVM) to obtain the accuracy for ten classes of proteins. We then used a multi-rate sampling algorithm to create one new set of images and standard downsampling at the same compression ratio to create another. Both the multi-rate and the standard downsampled images were classified using the same method and training data as the original images. We then compared these results to the classification accuracy of the original images. Our results indicate that the multi-rate images are closer in accuracy to the original images than the standard downsampled images, especially at higher compression ratios.
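The contrast between uniform downsampling and sampling that keeps more resolution where the image carries more information can be sketched in Python. The abstract does not describe the multi-rate algorithm itself, so `adaptive_sample` below is a stand-in technique, a simple block-variance rule, and the block size, threshold, and test image are all hypothetical.

```python
def downsample(image, factor):
    """Standard downsampling: keep every `factor`-th pixel in each direction."""
    return [row[::factor] for row in image[::factor]]

def block_variance(block):
    values = [v for row in block for v in row]
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def adaptive_sample(image, block, threshold):
    """Keep full resolution in high-variance blocks, one mean value elsewhere.

    Returns the reconstructed image and the number of samples retained.
    Assumes the image dimensions are multiples of `block`.
    """
    rows, cols = len(image), len(image[0])
    out = [[0.0] * cols for _ in range(rows)]
    kept = 0
    for r0 in range(0, rows, block):
        for c0 in range(0, cols, block):
            tile = [row[c0:c0 + block] for row in image[r0:r0 + block]]
            if block_variance(tile) > threshold:
                kept += block * block          # detailed block: keep every pixel
                for dr, row in enumerate(tile):
                    out[r0 + dr][c0:c0 + block] = row
            else:
                kept += 1                      # flat block: keep a single value
                flat = [v for row in tile for v in row]
                mean = sum(flat) / len(flat)
                for dr in range(block):
                    out[r0 + dr][c0:c0 + block] = [mean] * block
    return out, kept

image = [
    [0, 9, 5, 5],
    [9, 0, 5, 5],
    [5, 5, 5, 5],
    [5, 5, 5, 5],
]
reconstructed, kept = adaptive_sample(image, block=2, threshold=1.0)
print("samples kept:", kept, "of", len(image) * len(image[0]))
print("uniform 2x downsample:", downsample(image, 2))
```

In this toy image the adaptive rule preserves the detailed corner exactly while the uniform 2x downsample discards it, which is the kind of difference the classification comparison in the abstract is designed to measure.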