Carnegie Mellon University

The Merck Computational Biology and Chemistry Program

Distinguished Seminar Abstract

Dr. Marti Hearst
Associate Professor, University of California, Berkeley
The Berkeley BioText Project
Friday, April 23, 2004
2:30 PM Wean Hall 4623

New methods and tools are needed to improve how bioscience researchers search for and synthesize information from the bioscience literature.

To this end, we are building a flexible, efficient, platform-independent database infrastructure geared specifically toward the advanced and particular search needs of bioscience researchers. We are using this infrastructure to support the development and deployment of statistical approaches to natural language processing that identify entities, and the relations between them, in the bioscience literature. The results of the text analysis will be accessed via an intuitive, appealing search user interface developed using appropriate human-centered design methods. The resulting system will support new ways of asking scientific questions and new tools for assembling the pieces of bioscience puzzles.

This project, called BioText, is still in its early stages, so in this talk I will describe results we have achieved on subproblems on the path toward the final goals. Specifically, I will describe our work on database support for annotation layers for natural language processing, our automatic abbreviation recognizer, and our work on automatically identifying the relations that hold between entities within sentences. I will also discuss the important role that generalization using lexical ontologies such as MeSH plays in our work. For more information, see biotext.berkeley.edu.