Carnegie Mellon University students enhance LOCKSS to preserve forms and lay groundwork for Web 2.0 crawling
Every day, people visit websites and assume that the information they find there will still be available when they need to come back to it. For their Spring 2011 Practicum project, Frederick Kautz, Matt Lanken, and Vibhor Nanavati, a team of three Carnegie Mellon University Software Engineering graduate students, worked with the Stanford University Libraries team behind the LOCKSS digital preservation project to allow the LOCKSS software to collect new types of information.
The CMU Practicum project added the capability to access e-content through forms, allowing thousands of new pages to be archived from some sites. Additional form content can now be preserved on many publisher websites, such as HighWire Press and IOPscience. The new pages will be available to the libraries participating in the LOCKSS Alliance, and if the original publisher website goes away or is blocked, the original content will still be accessible locally.
The Practicum course is a capstone to the learning the students have done throughout their software engineering education. The team applied a mix of agile and traditional Software Engineering techniques chosen from their learning to successfully complete a real software project for their clients, experienced LOCKSS developers Philip Gust and Tom Lipkis.
“It was exciting to get to work with a real client and apply techniques such as negotiating project scope and writing user stories to meet their needs,” said Matt Lanken, Practicum team member. “The unpredictability really brought home the value in applying the methods we had learned to manage risk and change.”
“It is our hope that our contributions will be beneficial not only to the LOCKSS team, but to the entire world community. Preservation of our collective human knowledge is an important endeavor, and we are glad to be a part of that effort,” said Frederick Kautz, also a Practicum team member.
A Web 2.0 crawling prototype was also created to capture content from modern websites that load content dynamically. The prototype accesses the content through a normal browser, and because that browser runs remotely rather than on the same system as the digital archive, the archive's security is not compromised. The existing LOCKSS proxy architecture, which is used to transparently serve digital content when a publisher’s website has become unavailable, was used to capture the URLs crawled by a remote browser driven by Selenium, the industry-standard web application testing framework. The browser traversed the publisher's website, executing all the normal AJAX retrievals that a reader's browser would. While this prototype will take more time to integrate into the existing LOCKSS framework, the team also added the capability to crawl the publisher Thieme Connect, adding more than 100 journals to the titles that LOCKSS can collect.
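The proxy-capture idea can be illustrated with a minimal sketch. It is not the LOCKSS or Practicum code: `RecordingProxy`, `RecordingHandler`, and `crawl_through_proxy` are hypothetical names, the `publisher.example` URLs are made up, and a plain Python HTTP client stands in for the Selenium-driven browser. The sketch assumes only that the browser is configured to send its requests through a proxy, which then sees every URL the browser fetches, including AJAX retrievals.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class RecordingHandler(BaseHTTPRequestHandler):
    """Handles proxied GET requests by recording the requested URL."""

    def do_GET(self):
        # In a proxied request the request line carries the absolute URL,
        # so self.path looks like "http://publisher.example/issue/1".
        self.server.seen_urls.append(self.path)
        # Assumption: a real archive proxy would fetch and store the page;
        # this sketch returns a stub body so it runs without network access.
        body = b"stub"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # suppress per-request logging to keep the sketch quiet


class RecordingProxy(HTTPServer):
    """Hypothetical stand-in for the archive's proxy: it only records URLs."""

    def __init__(self, address):
        super().__init__(address, RecordingHandler)
        self.seen_urls = []  # every absolute URL requested through the proxy


def crawl_through_proxy(urls):
    """Fetch each URL via a local recording proxy; return the URLs it saw."""
    proxy = RecordingProxy(("127.0.0.1", 0))  # port 0 picks a free port
    threading.Thread(target=proxy.serve_forever, daemon=True).start()
    port = proxy.server_address[1]
    # The opener plays the role of the remote browser pointed at the proxy.
    opener = urllib.request.build_opener(
        urllib.request.ProxyHandler({"http": "http://127.0.0.1:%d" % port})
    )
    try:
        for url in urls:
            opener.open(url, timeout=5).read()
    finally:
        proxy.shutdown()
        proxy.server_close()
    return proxy.seen_urls
```

Calling `crawl_through_proxy` with a list of page and AJAX URLs returns those same URLs in the order the proxy received them. In a browser-driven setup, the list would also contain requests the page issued on its own, which is what makes the proxy a convenient capture point.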
The LOCKSS Program, founded in 1998, is a unit of the Stanford University Libraries. Libraries are building and preserving general collections in the Global LOCKSS Network. Approximately 450 publishers have committed their content for LOCKSS preservation. Libraries are also using Private LOCKSS Networks to preserve government documents (Digital Federal Depository Library Program), data sets, and special collections materials. See www.lockss.org.
About Carnegie Mellon Silicon Valley
Carnegie Mellon Silicon Valley is dedicated to educating its students to become leaders in global technology innovation and management and to performing innovative research that connects it to local, national, and global high-tech companies. Long known for its leadership in engineering and computer science research and education, Carnegie Mellon has established a natural branch in Silicon Valley, one that integrates the rich heritage of the Pittsburgh campus with the opportunities available in the innovative and entrepreneurial Silicon Valley. The campus offers graduate programs in software engineering, software management, information networking, innovation and mobility, each providing the appropriate mix of technical, business and organizational skills critical to its students' success. With research that focuses on a suite of new technologies, Carnegie Mellon Silicon Valley is committed to creating and implementing solutions for real problems.