Carnegie Mellon University

Risk and Regulatory Services Innovation Center

Sponsored by PwC at Heinz College

Information Extraction from Commercial Lease Documents

Principal Investigator: Eduard Hovy, Research Professor, CMU Language Technologies Institute

PwC Sponsor: Mike Flynn, Principal, Advanced Risk and Compliance Analytics 

The difficulty of managing and updating commercial leases grows over time: as new leases are signed and old ones evolve, families of close variants accumulate in a vast collection of documents. Of central importance is the ability to extract key facts from each document, systematize them, and store them in a database that supports later retrieval and the production of aggregate statistics.

Because the language of leases is complex and can vary considerably from lease to lease, rule-based information extraction technologies are not always fully effective. This project seeks to develop natural language processing (NLP) technology to improve the output of PwC’s Information Extraction (IE) system. Success in automating the IE system would reduce the manual effort of reading and classifying lease documents and extracting key information from them, potentially improving both accuracy and efficiency.