Carnegie Mellon University

August 29, 2023

A Ladder to the Shoulders of Giants

By Jim Blakley, Living Edge Lab Associate Director

Note: My frequent readers may have noticed that my blogs often start with a personal historical technology reference that provides context to current Living Edge Lab work. I hope that context helps readers make sense of new innovations and relate them to something they already know. This blog is different in that it is an introduction to the history of a technology innovation designed to capture history. I will, however, still start with another personal historical reference.

When I was a student at the University of Michigan in the late seventies, I frequently studied in the stacks of the Harlan Hatcher Graduate Library. That library is famous for its mammoth collection of materials, including floors of bound scientific journals going back decades. Many of these journals were rarely touched – they were there primarily so scholars could find the procedures and data from past scientific research, reproduce the results, and improve upon them.

Much of today’s scientific research depends on digital laboratories – computing hardware and software that process the experimental data gathered by researchers. Yet hardware and software evolve rapidly, and reproducing the results of an experiment done even a couple of years ago is a tenuous proposition. Unless the original processing environment is maintained intact, even the original researchers may not be able to reproduce the exact results. And the difficulty of creating an identical processing environment may preclude independent researchers from verifying the results of their peers. This challenge fundamentally shifts the role of a technical librarian from preserving journal papers to preserving processing environments and datasets.

Our recent paper, Towards Reproducible Execution of Closed-Source Applications from Internet Archives, recounts the history of our ten-year effort, known as Project Olive, to address the problem of archiving processing environments in service of reproducibility. It also discusses our recent work in this area, which expands the problem of archiving to the problem of archive access – making it easy for a researcher to access and run the processing environment while still protecting the software and data from unauthorized use. It is in this area that Olive intersects our work in edge computing, specifically by using EdgeVDI to encapsulate and display the running software.

The screenshot at right shows VDI access to The Great American History Machine, an application from the Olive2014 Archive. Our key learning from this effort is that while technologies like virtual machines, containers, binary translators, and virtual desktop infrastructure make archival and access possible, archiving is fundamentally a process of curation – choosing which environments and datasets to archive, assembling and storing the archive, and making that archive accessible to the users who need it. Curation like this is a human process that requires large-scale institutional support – like the giant libraries of yesterday. The technologies themselves are not expensive; creating the archive and serving its patrons is what makes the process labor-intensive.

To be sure, we outline additional technical challenges in the paper that form the basis for our future work – how to future-proof the archive as available hardware evolves, how to handle datasets, how to support accelerated computing on devices like GPUs, and other issues. These challenges are ultimately solvable, but without the institutional will to create robust archives, the scientific method as we know it is at risk.

The Project Olive journey shows one group of technologists’ progress toward a solution to scientific computational reproducibility. Please read the paper in that sense. But also read it as a call to action to those in library sciences and scientific institutions to launch the work needed to address the larger questions of collection, curation, and archive access.

Towards Reproducible Execution of Closed-Source Applications from Internet Archives

Mahadev Satyanarayanan, Jan Harkes, and James Blakley, Proceedings of the 2023 ACM Conference on Reproducibility and Replicability. 2023.