NLP Method for Creating Application Categories Across Patents and Papers:
An Example from CRISPR

 

Samantha Zyontz

Stanford Law School

 

In recommending new innovation policies, policymakers often wish to categorize innovation outputs consistently and in ways that align with existing stakeholders.  For example, if patents and academic papers can be categorized by the same industry applications, then policymakers can more easily determine which organizations and companies they will need to engage.  However, common classification systems for patents (e.g., IPC or CPC) and academic papers (e.g., Web of Science or MeSH keywords) are not defined consistently across the two types of outputs.  They also do not necessarily capture the relevant industry applications, especially for a specific technology.  This lack of consistency and specificity makes meaningful comparisons across outputs and application categories difficult.  This presentation discusses an NLP method for creating consistent industry application categories across patents and papers in a specific technology: the Nobel Prize-winning CRISPR DNA-editing system.

 

Because patents are often more closely associated with applications, the process begins by hand coding each of 2,072 CRISPR patent families into one of four broad industry application categories.  The initial categories are determined by CRISPR scientists and by research on the possible uses of CRISPR.  The titles and abstracts of these hand-coded families are converted into a matrix of TF-IDF features and, together with the assigned dominant categories, used to train a regularized linear model with stochastic gradient descent learning (implemented in Python’s scikit-learn).  The trained model is then applied to the titles and abstracts of the remaining CRISPR patent families and of the CRISPR papers.  For each record, the model produces a score for each of the four categories, indicating the estimated probability that the patent or paper belongs to that category given the words that characterize it.  The four probabilities for each record sum to 1, and the category with the highest probability is deemed the dominant category.  Because the process defines paper and patent classifications the same way, it is easier to compare CRISPR innovation across output types and countries by industry application.