Software Engineering: LOCKSS-Silicon Valley Campus - Carnegie Mellon University

Lots of Copies Keeps Semantics Safe — A LOCKSS Project

"Data is stored and archived, but that should not be the end of it. The data should enable users to access it through any conceivable question. Determining a relationship between two pieces of information normally requires the processing and understanding the development of relationships between them. Electronically this same process is done without the need to understand the information being processed. Using the LOCKSS system and the amount of information stored, this team has enabled the creation of new relationships from this data. These organic relationships enable users to formulate questions and receive valid and relevant responses, e.g. 'Which set of authors has the most experience with this piece of technology?'... And this is a powerful ability to have when it comes to information."

The preservation of information through replication and storage dates back to the earliest known libraries, from around 1200 B.C. While libraries have traditionally played an important role in providing access to broad ranges of information, the rapid adoption of the internet has increased demand for electronic publications and has created problems related to the access and preservation of electronic documents. Stanford's 'Lots of Copies Keeps Stuff Safe' (LOCKSS) project takes steps to solve these problems and provides broad access to research journals that would otherwise be difficult, if not impossible, to access.

The Carnegie Mellon team is composed of four Masters of Software Engineering graduate students: Timophey Zaitsev, Minh Pham, Mike Vrooman, and Morgan Brown. Working closely together and with their advisors, Dr. Ed Katz of Carnegie Mellon and Phillip Gust of Stanford, they have expanded the capabilities of LOCKSS through the use of schema-less storage (MongoDB) and an inference-based semantic relationship engine (Apache Jena) to enhance both the access to and the storage of archival information.

Traditional storage and retrieval based on structured queries has powered the majority of systems in the past decade. While these systems have been successful within LOCKSS, their requirement for predefined, well-understood schemas limits the speed and capability of the system as a whole. Using current software engineering practices, and relying heavily on SEMAT to guide its progress, the team has extended the LOCKSS project to support a schema-less storage and retrieval system. Storing serialized electronic information in a schema-less manner allows the LOCKSS system to store that information for rapid retrieval, to generate semantic facts directly from the stored data, and finally to infer new, previously unknown semantic facts from disparate information that would otherwise sit idle.
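
The three steps described above (schema-less storage, fact generation, and inference) can be illustrated with a minimal sketch. This is not the team's implementation, and it uses neither MongoDB nor Apache Jena: plain Python dictionaries stand in for schema-less documents, and the record fields, the predicate names, and the single inference rule are all hypothetical, chosen only to show the idea.

```python
from collections import Counter

# Hypothetical schema-less records: the fields vary from one record to the next.
articles = [
    {"id": "a1", "authors": ["Author A", "Author B"], "topics": ["x-ray diffraction"]},
    {"id": "a2", "authors": ["Author B"], "topics": ["x-ray diffraction"], "year": 2012},
    {"id": "a3", "authors": ["Author C"]},  # this record has no "topics" field at all
]


def generate_facts(docs):
    """Turn each record into (subject, predicate, object) triples.

    Missing fields are simply skipped, so no fixed schema is required.
    """
    facts = set()
    for doc in docs:
        for author in doc.get("authors", []):
            facts.add((author, "wrote", doc["id"]))
        for topic in doc.get("topics", []):
            facts.add((doc["id"], "about", topic))
    return facts


def infer(facts):
    """Apply one illustrative rule and return the original plus inferred facts.

    Rule (an assumption for this sketch):
    (author wrote article) and (article about topic)
        => (author hasExperienceWith topic)
    """
    topics_by_article = {}
    for subject, predicate, obj in facts:
        if predicate == "about":
            topics_by_article.setdefault(subject, set()).add(obj)

    inferred = set(facts)
    for subject, predicate, obj in facts:
        if predicate == "wrote":
            for topic in topics_by_article.get(obj, ()):
                inferred.add((subject, "hasExperienceWith", topic))
    return inferred


def most_experienced(facts, topic):
    """Rank authors by how many of their articles are about the given topic."""
    on_topic = {s for s, p, o in facts if p == "about" and o == topic}
    counts = Counter(s for s, p, o in facts if p == "wrote" and o in on_topic)
    return counts.most_common()


all_facts = infer(generate_facts(articles))
ranking = most_experienced(all_facts, "x-ray diffraction")
```

Here `ranking` answers a question of the kind quoted earlier ("which authors have the most experience with this topic?"), even though no record states that relationship directly. A real deployment would delegate storage to a document database and inference to a rule engine rather than hand-written loops.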

Using crystallographic journals as a base, the team is excited to present its results: LOCKSS has generated and inferred new, previously unknown facts about this topic. The team is confident that its work will find its way into the main LOCKSS distribution in future releases, and hopes to see wider distribution and greater uptake of LOCKSS in the years to come.