Labeled Pairwise Comparisons of USPTO Inventor Records | Carnegie Mellon University, Engineering and Public Policy


Citation and further information -- when using this data, please cite:

Ventura, S., Nugent, R., and Fuchs, E. 2015. "Seeing the Non-Stars: (Some) sources of bias in past disambiguation approaches and a new public tool leveraging labeled records." Research Policy (selected for Special Issue on Big Data). In press.

Abstract:

To date, methods used to disambiguate inventors in the United States Patent and Trademark Office (USPTO) database have been rule- and threshold-based (requiring and leveraging expert knowledge) or semi-supervised algorithms trained on statistically generated artificial labels. Using a large, hand-disambiguated set of 98,762 labeled USPTO inventor records from the field of optoelectronics consisting of four sub-samples of inventors with varying characteristics (Akinsanmi et al., 2014) and a second large, hand-disambiguated set of 53,378 labeled inventor records corresponding to a subset of academics in the life sciences (Azoulay et al., 2012), we provide the first supervised learning approach for USPTO inventor disambiguation. Using these two sets of inventor records, we also provide extensive evaluations of both our algorithm and three examples of prior approaches to USPTO disambiguation arguably representative of the range of approaches used to date. We show that the three past disambiguation algorithms we evaluate demonstrate biases depending on the feature distribution of the target disambiguation population. Both the rule- and threshold-based methods and the semi-supervised approach perform poorly (10–22% false negative error rates) on a random sample of optoelectronics inventors – arguably the closest of our sub-samples to what might be expected of the majority of inventors in the USPTO (based on disambiguation-relevant metrics). The supervised learning approach, using random forests and trained on our labeled optoelectronics dataset, consistently maintains error rates below 3% across all of our available samples. We make public both our labeled optoelectronics inventor records and our code to build supervised learning models and disambiguate inventors (see http://www.cmu.edu/epp/disambiguation). Our code also allows users to implement supervised learning approaches with their own representative labeled training data.
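
As an illustration of the supervised step described above, the following R sketch trains a random forest to classify labeled record-pairs as matches or non-matches. This is a minimal sketch of the general approach, not the released code from the paper: it assumes the comparisons are already assembled in a data frame (here hypothetically named pair_df) of similarity scores plus a binary match label.

    # Minimal sketch: supervised pairwise classification with a random forest.
    # Assumes pair_df holds one row per record-pair: similarity scores plus a
    # binary `match` label (1 = same inventor, 0 = different inventors).
    library(randomForest)
    pair_df$match <- factor(pair_df$match)        # classification, not regression
    set.seed(1)
    train_idx <- sample(nrow(pair_df), floor(0.8 * nrow(pair_df)))
    rf <- randomForest(match ~ ., data = pair_df[train_idx, ], ntree = 500)
    # Estimated match probabilities for the held-out record-pairs:
    probs <- predict(rf, pair_df[-train_idx, ], type = "prob")[, "1"]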

Brief Dataset Description:

  • Our original dataset consists of 98,762 labeled USPTO records corresponding to inventors of optoelectronics patents. (For more detail, see Ventura, Nugent, and Fuchs 2015)
  • We make all valid pairwise comparisons of records in this dataset, resulting in a set of 105,407,940 record-pairs. We also include a sample dataset and the dataset used for training the classification models used in our paper.
  • We include "similarity scores" for each field in the original dataset with each pairwise comparison. Examples of these similarity scores include: Jaro-Winkler similarity for each record's first name field, SoundEx/Levenshtein similarity for each record's last name field, binary (exact match) similarity for each record's state or country fields', etc.
  • Each pair is labeled as either a pairwise match (if the pair refers to the same unique individual) or a pairwise non-match (if the pair refers to two unique individuals).
  • For more information, see "Fields Included in Dataset" below.
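
For illustration, a single record-pair might look like the row below (the values are invented, only a handful of the similarity-score fields are shown, and the label column name is hypothetical; the field names follow the prefix/suffix scheme described under "Fields Included in Dataset"):

    first_j  last_l  last_s  st_e  coinv_set  fileyear_a  match
    0.93     0.80    1       1     0.25       2           1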

Sample Pairwise Comparisons Dataset:  The sample pairwise comparisons dataset contains the same fields described above, but for only 1,000 record-pairs, sampled randomly from the pairwise comparisons used for model training. It is intended for users who are unable to download large datasets but want a quick look at the data.


Download Sample Dataset (111 KB)

Pairwise Comparisons Used for Model Training:  The pairwise comparisons used for model training in our paper (Ventura et al., 2015) are included here. This dataset contains comparisons of 150,000 record-pairs, sampled according to the scheme described in the above paper. These records are a subset of the full pairwise comparisons dataset, described below.


Download Training Dataset (16.6 MB)

Full Pairwise Comparisons Dataset:  The full pairwise comparisons dataset contains all possible comparisons of record-pairs from our labeled USPTO optoelectronics inventor records dataset. This dataset is very large, with 105,407,940 comparisons. For more information on the labeled USPTO optoelectronics inventor records dataset from which these comparisons were made, see Section 3.1 of our paper (Ventura et al., 2015).


Download Full Pairwise Comparison Dataset

Fields Included in Dataset:

Each field name contains a prefix (e.g. "first", "last", "city"), indicating which field in the original dataset is being compared, and a suffix (e.g. "j", "l", "e", "s"), indicating the type of comparison being made. The prefix and suffix are separated by an underscore ("_"); for example, "first_j" is the Jaro-Winkler similarity of the two records' first name fields. Fields being compared:
  • first: the first name field of the inventor record
  • last: the last name field of the inventor record
  • mid: the middle name field of the inventor record
  • suffix: the name suffix (e.g. "Jr.", "III", etc) of the inventor record
  • city: the city name field of the inventor record; cities are associated with the inventor of the patent, not the assignee
  • st: the state abbreviation (e.g. "PA", "CA", "NY") field of the inventor record; states are associated with the inventor of the patent, not the assignee
  • country: the country abbreviation (e.g. "USA", "GB", "JP", etc) field of the inventor record; countries are associated with the inventor of the patent, not the assignee
  • ass: the assignee corresponding to the inventor record's patent
  • class: the list of technology classes corresponding to the inventor record's patent
  • subclass: the list of technology subclasses corresponding to the inventor record's patent
  • coinv: the list of co-inventors on the inventor record's patent
  • fileyear: the year the inventor record's patent was filed
Types of comparisons (all comparison methods are implemented in R's RecordLinkage package unless otherwise noted):
  • j: Jaro-Winkler String Similarity
  • l: Levenshtein String Similarity
  • e: Exact matching (1 if exactly equal, 0 otherwise)
  • s: Exact matching performed on the SoundEx abbreviations of the pair
  • a: Absolute difference between numerical values (e.g. file year of the patent)
  • pj: Jaro-Winkler Similarity of the phonetic representations of the strings
  • pl: Levenshtein Similarity of the phonetic representations of the strings
  • 1, 2, or 3: Exact matching performed on the first 1, 2, or 3 characters of the strings
  • set: Jaccard coefficient of the two sets (e.g. sets of co-inventors)
Before making any comparisons, all text fields are converted to all capital letters, punctuation is removed, and leading/trailing whitespace is removed; the sketch below illustrates this preprocessing along with several of the comparison types.
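
For concreteness, the R sketch below computes several of these comparison types for a single invented record-pair, using the jarowinkler(), levenshteinSim(), and soundex() functions from the RecordLinkage package. It is an illustration under those assumptions, not the released disambiguation code.

    # Illustrative computation of several comparison types for one record-pair.
    # Field values are invented; this is not the released disambiguation code.
    library(RecordLinkage)   # jarowinkler(), levenshteinSim(), soundex()

    # Preprocessing described above: capitalize, strip punctuation, trim whitespace.
    clean <- function(x) trimws(gsub("[[:punct:]]", "", toupper(x)))

    a <- list(first = "Jon",  last = "Smith", st = "PA",
              fileyear = 2003, coinv = c("LEE", "CHEN"))
    b <- list(first = "John", last = "Smyth", st = "PA",
              fileyear = 2005, coinv = c("LEE", "PATEL", "KIM"))

    # Jaccard coefficient of two sets: |A intersect B| / |A union B|.
    jaccard <- function(s1, s2) length(intersect(s1, s2)) / length(union(s1, s2))

    first_j    <- jarowinkler(clean(a$first), clean(b$first))     # "j"
    last_l     <- levenshteinSim(clean(a$last), clean(b$last))    # "l"
    last_s     <- as.numeric(soundex(clean(a$last)) ==
                             soundex(clean(b$last)))              # "s"
    st_e       <- as.numeric(clean(a$st) == clean(b$st))          # "e"
    first_2    <- as.numeric(substr(clean(a$first), 1, 2) ==
                             substr(clean(b$first), 1, 2))        # "2"
    fileyear_a <- abs(a$fileyear - b$fileyear)                    # "a"
    coinv_set  <- jaccard(a$coinv, b$coinv)                       # "set"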

Funding for this project came from:

CAREER: Rethinking National Innovation Systems – Economic Downturns, Offshoring, and the Global Evolution of Technology. NSF Science of Science and Innovation Policy Program; Award Number 1056955; Principal Investigator: Erica Fuchs; May 2011 – May 2016; $624,517.

Supporting funding also came from:

Quantifying the Resilience of Innovation Ecosystems: The Impact of Manufacturing Offshore on Firm Technology Trajectories and the Institutional Locus of Innovation. NSF Science of Science and Innovation Policy Program; Award Number 0830354; Principal Investigator: Erica Fuchs; September 2008 – September 2010; $208,068.

Census Research Node: Data Integration, Online Data Collection, and Privacy Protection for Census 2020. NSF Social and Economic Sciences Grant; Award Number 1130706; Principal Investigators: Stephen E. Fienberg and William F. Eddy; Co-Principal Investigators: Rebecca Nugent and Alessandro Acquisti; September 2011 – September 2016; $3,000,000.

Statistics and Machine Learning for Scientific Inference. NSF Research Training Group in the Mathematical Sciences; Award Number 25951-1-1121631; Principal Investigators: Rob Kass and William F. Eddy; Co-Principal Investigator: Rebecca Nugent; May 2011 – June 2016; $2,250,979.

Related Publications:

  1. Fellegi, I.P., and A.B. Sunter. 1969. "A Theory for Record Linkage." Journal of the American Statistical Association 64(328).
  2. Ge, C., K. Huang, and I. Png. 2014. "Engineer/Scientist Careers: Patents, Online Profiles, and Misclassification Bias." SSRN Working Paper 2531477. Revised, November.
  3. Lai, R., A. D'Amour, A. Yu, Y. Sun, and L. Fleming. 2014. "Disambiguation and Co-authorship Networks of the U.S. Patent Inventor Database (1975–2010)." Research Policy 43(6): 941–955.
  4. Trajtenberg, M., G. Shiff, and R. Melamed. 2006. "The Names Game: Harnessing Inventors' Patent Data for Economic Research." National Bureau of Economic Research Working Paper No. 12479.
  5. Ventura, S., R. Nugent, and E. Fuchs. 2014. "Hierarchical Linkage Clustering with Distributions of Distances for Large-Scale Record Linkage." Privacy in Statistical Databases 2014 Conference Proceedings.