Making the invisible visible through natural language processing and interactive visualization
The DocuScope Project began in 1998 at Carnegie Mellon University as an interdisciplinary collaboration between David Kaufer, Professor of English, and Suguru Ishizaki, then a professor in the School of Design and now Professor of English. DocuScope’s natural language processing capability draws on a proprietary dictionary of millions of English phrases that David collected and classified over 20 years. DocuScope consists of an analytic engine, a suite of interactive visualizations, and a dictionary authoring tool.
Our earliest dictionary was based on David and Brian Butler's theoretical work in rhetoric (Kaufer & Butler 1996) and their applied work in representational theories of language (Kaufer & Butler 2000). Our latest theoretical framework, along with an overview of the generic dictionary, is presented in The Power of Words: Unveiling the Speaker and Writer's Hidden Craft (Kaufer, Ishizaki, Butler, & Collins 2004). (See Collins, Kaufer, Vlachos, Butler, & Ishizaki, 2004; Kaufer & Hariman, 2008; Kaufer & Al-Malki, 2009a; and Kaufer & Ishizaki, 2006 for projects that used the generic dictionary.)
DocuScope was initially developed as an educational tool for David’s writing course, Narrative & Argument. We wished to create a studio-like writing course that allowed students to “see” and critique their drafts publicly. But we soon found that DocuScope was also a useful tool for corpus-based rhetorical analysis.
The original version of DocuScope created in 1998
For more than 20 years, with the support of numerous scholars and students, we have continued to improve DocuScope and its theoretical framework. In the past few years, the project has expanded significantly to encompass multiple applications.
Through our research we have created a range of tools for computer-aided text analysis and technology-enhanced writing instruction:
DocuScope Corpus Analysis
DocuScope Corpus Analysis is a text analysis environment with a suite of interactive visualization tools for corpus-based rhetorical analysis. The core elements of DocuScope Corpus Analysis are (1) a dictionary, created by David, consisting of tens of millions of uniquely classified linguistic patterns of English based on their effect on readers and (2) analysis and visualization software, designed and implemented by Suguru.
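To make the idea of a phrase dictionary concrete, here is a minimal sketch of dictionary-based tagging. It is an illustration only, not DocuScope's engine: the phrases, the category labels, the whitespace tokenizer, and the greedy longest-match rule are all assumptions chosen to keep the example small.

```python
from collections import Counter

# Toy dictionary: phrase (tuple of lowercase tokens) -> category label.
# Both the phrases and the labels are invented for this illustration.
TOY_DICTIONARY = {
    ("i", "think"): "FirstPersonReasoning",
    ("i",): "FirstPerson",
    ("in", "the", "past"): "Narrative",
    ("must",): "Insistence",
}
MAX_PHRASE_LEN = max(len(phrase) for phrase in TOY_DICTIONARY)


def tag_tokens(tokens):
    """Greedy longest-match tagging (an assumed matching rule).

    At each position, try the longest candidate phrase first and fall
    back to shorter ones; unmatched tokens get the category None.
    """
    tagged = []
    i = 0
    while i < len(tokens):
        for n in range(min(MAX_PHRASE_LEN, len(tokens) - i), 0, -1):
            phrase = tuple(tokens[i:i + n])
            if phrase in TOY_DICTIONARY:
                tagged.append((" ".join(phrase), TOY_DICTIONARY[phrase]))
                i += n
                break
        else:
            # No dictionary phrase starts here; keep the token untagged.
            tagged.append((tokens[i], None))
            i += 1
    return tagged


def category_counts(text):
    """Count how often each category occurs in a text."""
    tokens = text.lower().split()  # naive tokenizer; ignores punctuation
    return Counter(cat for _, cat in tag_tokens(tokens) if cat is not None)


if __name__ == "__main__":
    print(category_counts("I think we must act"))
```

Per-category counts like these are the kind of data a visualization layer could then plot across the texts in a corpus.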
DocuScope Classroom is an online text analysis/visualization environment that helps students see how writing strategies are used in their drafts and how those strategies are similar to and different from the strategies of their classmates. The visualizations enhance students’ awareness of their composing decisions and the relationship of those choices to their writing context and intended genre. DocuScope Classroom has been used in certain sections of Carnegie Mellon’s Writing & Communication program.
OnTopic is a revision environment made up of interactive visualizations designed to help students keep their writing coherent and on topic. OnTopic uses natural language processing algorithms to visualize the topical organization of the student’s draft by highlighting salient topics within each paragraph, as well as in the text as a whole. At the sentence level, OnTopic allows students to study their sentence proportionality and “flow” by tracking the number of noun phrases their readers must process before and after a sentence’s main verb.
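The sentence-level measure described above can be sketched as a simple count. The example below assumes the sentence has already been chunked and labeled by some upstream parser; the `(text, label)` input format and the label names `NP`, `MAIN_VERB`, and `OTHER` are hypothetical and do not reflect OnTopic's actual data model.

```python
def np_balance(chunks):
    """Count noun phrases before and after the main verb.

    `chunks` is a list of (text, label) pairs, where label is one of
    "NP", "MAIN_VERB", or "OTHER" (a hypothetical labeling scheme).
    Returns a (before, after) tuple of noun-phrase counts.
    """
    labels = [label for _, label in chunks]
    if "MAIN_VERB" not in labels:
        raise ValueError("no main verb marked in this sentence")
    verb_index = labels.index("MAIN_VERB")  # first main verb found
    before = labels[:verb_index].count("NP")
    after = labels[verb_index + 1:].count("NP")
    return before, after


if __name__ == "__main__":
    # "The committee of senior faculty members approved the proposal."
    sentence = [
        ("The committee", "NP"),
        ("of", "OTHER"),
        ("senior faculty members", "NP"),
        ("approved", "MAIN_VERB"),
        ("the proposal", "NP"),
    ]
    print(np_balance(sentence))
```

A sentence with many noun phrases stacked before the main verb asks readers to hold more in mind before they reach the action, which is the kind of imbalance a writer might want to notice.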
The latest incarnation of DocuScope supports student writers who want to inspect both their topical organization and the rhetorical experiences they create, whether localized to specific topics or ambient across the whole text. DocuScope 6.0 combines the basic functionality of OnTopic and DocuScope Classroom in a seamless interactive visualization environment.
DiaGrammar is an online learning environment for practicing sentence diagramming. The technology is based on form-function sentence diagrams popularized in Paul Hopper’s book, A Short Course in Grammar. Following the instructional approach developed by Hopper, DiaGrammar provides an online practice environment with automated feedback, designed to support hybrid (virtual/in-person) grammar courses.
The Pardee RAND Graduate School of the RAND Corporation provides DocuScope to its quantitative analysts as a tool for analyzing social media and other documents. RAND has published a series of technical reports that use DocuScope to document the declining objectivity of journalism. RAND has also incorporated a prior version of the DocuScope Dictionaries into its in-house product RAND-Lex, which it uses to conduct language analysis for clients. RAND was sufficiently impressed with DocuScope that it invested heavily in machine learning methods to produce DocuScope versions in Arabic, Russian, and Chinese.
Shakespeare scholars have used DocuScope to find that the Bard’s history plays are organized around a single narrator, while his comedies and tragedies are organized around lighthearted and darker-spirited character dialogue, respectively. After writing his famous comedies and tragedies, Shakespeare composed his late “tragicomedies” or “problem plays,” a unique hybrid genre. Shakespeareans have long debated whether these plays are actually comedies, tragedies, or truly a novel combination. Using DocuScope, a team of leading Shakespeare scholars confirmed that these late plays are indeed genre-bending mixtures, expertly blended by an artist at the height of his craft.
You can read more about this project in the Early Modern Literary Studies journal article "The Very Large Textual Object: A Prosthetic Reading of Shakespeare." See also the article in Forbes Magazine and the “Digital History” blog post.
Educational Testing Service
ETS is often criticized on the grounds that its timed writing tests lack ecological validity, that is, that they do not resemble authentic writing tasks. A group of ETS scholars used DocuScope to compare hundreds of GRE “arguments” that test-takers wrote under timed conditions (45 minutes) with scores of “arguments” that graduate students had three weeks to draft and redraft with feedback. The researchers found that the “core” patterns that constitute argument did not differ between the two populations of writers. ETS recently won a general patent for testing the validity of its timed writing tests using this method.
Alexander Hamilton and James Madison
Historians have long debated which of the 85 Federalist Papers were authored by Alexander Hamilton and which by James Madison. All prior studies assumed that any given paper was written by either Hamilton or Madison, but not both. Using DocuScope, a research team discovered that some of the disputed papers were clearly co-authored, bearing the stamp of both writers. A look into the historical record confirmed that Hamilton and Madison did physically convene to plan a new block of papers. All of the texts that DocuScope found to be co-authored were the first paper in each block, very likely planned and at least partially drafted during Hamilton and Madison’s time together.
Clinton vs. Trump, 2016
While Donald Trump polled poorly in the run-up to the 2016 presidential election, Hillary Clinton’s public image was similarly negative. Clinton was perceived to be aloof, guarded, scripted, and inauthentic. To investigate the roots of this negative image, a team used DocuScope to analyze Clinton’s two memoirs: a personal memoir written after her years as First Lady (2003) and a policy memoir describing her time as President Obama’s Secretary of State (2014). The DocuScope analysis revealed, as predicted, that the personal memoir relied much more on the language of disclosure than the policy memoir. But an analysis of where Clinton disclosed herself was revealing. Her most disclosive passages and chapters appeared when she talked about the trials of her parents and grandparents. She is less disclosive and more guarded when talking about herself, her marriage, and her husband’s infidelity. This rhetorical choice may have contributed to perceptions of Clinton as a personally remote candidate, especially when juxtaposed with Donald Trump’s style of brash, unedited speech.
Media Representation of Arab Women in Arab News
Western social science has documented a long tradition of Western media representing Arab women as passive and voiceless, yet there has been little research on how Arab women are represented in Arab media. In research funded by the Qatar National Research Fund, researchers using DocuScope found that at least some liberal Arab media outlets based in London do represent Arab women in more complex ways. The study was published in 2012 as a book titled Arab Women in Arab News: Old Stereotypes and New Media (Bloomsbury).
This project has been partially funded by:
- A.W. Mellon Foundation
- Macaulay Family Foundation
- Simon Initiative Seed Grant, Carnegie Mellon University
- Berkman Faculty Development Fund, Carnegie Mellon University
- Eberly Center, Teaching Excellence & Educational Innovation, Carnegie Mellon University
- Open Learning Initiative, Carnegie Mellon University
- Howard Seltman, Department of Statistics & Data Science
Kaufer, D. & Butler, B. (2000). Designing Interactive Worlds with Words: Principles of Writing as Representational Composition. Routledge.
Kaufer, D. & Butler, B. (1996). Rhetoric and the Arts of Design. Routledge.
Kaufer, D., Ishizaki, S., Butler, B., & Collins, J. (2004). The Power of Words: Unveiling the Speaker and Writer's Hidden Craft. Routledge.
Wetzel, D., Brown, D., Werner, N., Ishizaki, S., & Kaufer, D. (2021). Computer-Assisted Rhetorical Analysis: Instructional Design and Formative Assessment Using DocuScope. Journal of Writing Analytics, 5, 292-393.
Beigman Klebanov, B., Ramineni, C., Kaufer, D., Yeoh, P., & Ishizaki, S. (2019). Advancing the Validity Argument for Standardized Writing Tests using Quantitative Rhetorical Analysis. Language Testing, 36(1), 125-144.
Helberg, A., Poznahovska, A., Ishizaki, S., Kaufer, D., Werner, N., & Wetzel, D. (2018). Teaching textual awareness with DocuScope: Using corpus-driven tools and reflection to support students’ written decision-making. Assessing Writing, 38 (October), 40-45.
Kaufer, D., Ishizaki, S., & Cai, X. (2016). Analyzing the Language of Citation across Discipline and Experience Levels: An Automated Dictionary Approach. Journal of Writing Research, 7(3), 453-483.
Atkinson, N., Kaufer, D., & Ishizaki, S. (2008). Presence and global presence in comparative genre of self presentation. Rhetoric Society Quarterly, 38(4), 357-384.
Collins, J., Kaufer, D., Vlachos, P., Butler, B., & Ishizaki, S. (2004). Detecting collaborations in text comparing the authors' rhetorical language choices in the Federalist Papers. Computers and the Humanities, 38(1), 15-36.
Geisler, C., Kaufer, D. & Itext Working Group. (2001). Future directions for research on the relationship between information technology and writing. Journal of Business and Technical Communication, Part I, 270-308.
Kaufer, D. (2006). Genre variation and minority ethnic identity: exploring the personal profile in Indian American community publications. Discourse & Society, 17(6), 761-784.
Kaufer, D. & Al-Malki, A. M. (2009). A "first" for women in the kingdom: Arab/West representations of female trendsetters in Saudi Arabia. Journal of Arab and Muslim Media Research, 2(2), 113-133.
Kaufer, D. & Al-Malki, A. M. (2009). The War on Terror through Arab-American eyes: the Arab-American press as a rhetorical counterpublic. Rhetoric Review, 28(1), 47-65.
Kaufer, D. & Hariman, R. (2008). A corpus analysis evaluating Hariman's theory of political style. Text & Talk, 28(4), 475-500.
Kaufer, D. & Ishizaki, S. (2006). A corpus study of canned letters: mining the latent rhetorical proficiencies marketed to writers in a hurry and non-writers. IEEE Transactions on Professional Communication, 49(3), 254-266.
Kaufer, D., Ishizaki, S., Collins, J., & Vlachos, P. (2004). Teaching language awareness in rhetorical choice using Itext and visualization in classroom genre assignments. Journal of Business and Technical Communication, 18(3), 361-402.
Kaufer, D., Parry-Giles, S., & Klebanov, B. B. (forthcoming). Tracking "image bites" across the public/private divide: NBC News coverage of Hillary Clinton from scorned wife to senate candidate. Journal of Language and Politics.
Klebanov, B. B., Kaufer, D., & Franklin, H. (forthcoming). A figure in a field: semantic field-based analysis of antithesis. Journal of Cognitive Semiotics.
Parry-Giles, S. & Kaufer, D. (forthcoming). Lincoln reminiscences and nineteenth-century portraiture: the private virtues of presidential character. Rhetoric and Public Affairs.
Chapters in Edited Volumes
Hu, Y., Kaufer, D., & Ishizaki, S. (2010). Genre and Instinct. Computing with Instinct, Lecture Notes in Artificial Intelligence, LNAI 5897, ed. Cai, Y. Springer.
Ishizaki, S. & Kaufer, D. (2011). The DocuScope Text Analysis and Visualization Environment. Invited chapter for Applied Natural Language Processing and Content Analysis: Identification, Investigation, and Resolution, ed. McCarthy, P. & Boonthum, C.
Kaufer, D. (2004). Public vs. Private Rhetoric: An Analysis of the NY Times Writers on Writing Series. The Public in Rhetorical Theory, ed. Kent, T. & Couture, B. Utah State Press, 163-185.
Kaufer, D., Geisler, C., Ishizaki, S., & Vlachos, P. (2005). Computer-Support for Genre Analysis and Discovery. Ambient Intelligence for Scientific Discovery, ed. Cai, Y. Springer, 129-151.
Kaufer, D., Geisler, C., Vlachos, P., & Ishizaki, S. (2006). Mining Textual Knowledge for Writing Research and Education. Writing & Digital Media, ed. Waes, L. V., Leijten, M., & Neuwirth, C. Amsterdam: Elsevier, 115-129.
Kaufer, D., Ishizaki, S., & Al-Malki, A. M. (2007). A Framework for Training Writing Teachers in the Discourse Patterns Underlying Cross-institutional Writing Assignments. Sustaining Excellence in Communicating Across the Curriculum: Cross-institutional Experiences and Best Practices. Cambridge Scholars Press, UK.
Oakley, T. & Kaufer, D. (2007). Designing Clinical Experiences with Words: The Three Layers of Analysis in Clinical Reports; A Dilemma for Mental Spaces and Genre Theory. Mental Spaces in Discourse and Interaction, ed. Hougaard, A. & Oakley, T. John Benjamins Publishing Company.
Kaufer, D. & Ishizaki, S. (Accepted). Computer-Based Analysis: Tracking Writer’s Rhetorical Decision-Making on Political Texts. In J. Fahnestock & R. Harris (Eds.), Routledge Handbook on Language and Persuasion. New York, NY: Routledge.
Beigman Klebanov, B., Kaufer, D., Yeoh, P., Ishizaki, S., & Holtzman, S. (2016). Argumentative writing in assessment and instruction: A comparative perspective. In N. Stukker, W. Spooren, & G. Steen (Eds.), Genre in Discourse and Cognition: Concepts, Modes, and Methods. Mouton de Gruyter.
Ishizaki, S., & Kaufer, D. (2011). DocuScope: Computer-aided rhetorical analysis. In P. McCarthy & C. Boonthum (Eds.), Applied Natural Language Processing and Content Analysis: Advances in Identification, Investigation, and Resolution (pp. 276-297). Hershey, PA: IGI Global.
Hu, Y., Kaufer, D., & Ishizaki, S. (2011). Genre and Instinct. In Computing with Instinct (pp. 58-81). New York, NY: Springer.
Kaufer, D., Al-Malki, A. & Ishizaki, S. (2007). Training writing teachers in the implicit knowledge underlying writing assignments. In N. Kassabgy & A. Elshimi (Eds.), Sustaining Excellence in Communicating Across the Curriculum: Cross-Institutional Experiences Best Practices (pp. 111-127). Newcastle, UK: Cambridge Scholars Publishing.
Kaufer, D., Geisler, C., Vlachos, P., & Ishizaki, S. (2006). Mining textual knowledge for writing education and research. In Luuk Van Waes, M. Leijten, & C. Neuwirth (Eds.), Writing and Digital Media (pp. 115-130). Oxford, UK: Elsevier Science.
Kaufer, D., Geisler, C., Ishizaki, S., & Vlachos, P. (2005). Textual genre analysis and identification. In Y. Cai (Ed.), Ambient Intelligence for Scientific Discovery (pp. 129-151). New York, NY: Springer.