Preserving Open Access Datasets and Software for Sustained Computational Reproducibility

Old Dominion University (ODU), in collaboration with the Internet Archive (IA) and the Virginia Polytechnic Institute & State University (Virginia Tech), was awarded a 3-year applied research project for preserving endangered Open Access Datasets and Software (OADS), i.e., publicly and freely available digital datasets and software packages used for reproducing research results reported in scholarly works. We focus on scholarly papers (journal articles and conference proceedings) and electronic theses and dissertations (ETDs) in multiple disciplines. We are grateful to the Institute of Museum and Library Services (IMLS) for this award!

The goal of this project is to identify, report on, and solve foundational problems related to the value, status, trends, and preservability of OADS for publicly available academic papers and ETDs, and thereby to enable and ensure progress toward sustainable computational reproducibility.

To this end, we will focus on building machine learning models and datasets that encompass three key aspects of OADS, namely, availability (whether URLs linking to OADS, i.e., OADS-URLs, appear in scholarly works), discoverability (whether OADS-URLs are alive on the web or in the archive), and accessibility (whether OADS are accessible through OADS-URLs).

One contribution of our project is OADS Repo, a database constructed automatically by applying a machine learning model to scholarly papers and ETDs. Training and evaluating such models first requires a dataset of labeled URLs.
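As a rough illustration of what the first step toward such a labeled dataset could look like, the sketch below pulls candidate URLs out of document text and keeps the sentence each one appears in, leaving the label empty for a human annotator or a model to fill in. This is not the project's actual pipeline: the names `UrlRecord`, `extract_candidates`, and `URL_PATTERN`, the regular expression, and the naive sentence splitter are all assumptions made for this example.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical pattern for http(s) URLs; text extracted from real PDFs
# (line breaks inside URLs, footnotes, etc.) would need more care.
URL_PATTERN = re.compile(r"https?://[^\s<>\"')\]]+")


@dataclass
class UrlRecord:
    """One candidate URL plus the sentence it appeared in (hypothetical schema)."""
    url: str
    context: str
    is_oads: Optional[bool] = None  # None means "not labeled yet"


def split_sentences(text: str) -> List[str]:
    # Naive splitter: break after ., !, or ? when followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def extract_candidates(text: str) -> List[UrlRecord]:
    """Return one unlabeled record per URL found in the text."""
    records = []
    for sentence in split_sentences(text):
        for match in URL_PATTERN.finditer(sentence):
            # Drop punctuation that belongs to the sentence, not the URL.
            url = match.group(0).rstrip(".,;:")
            records.append(UrlRecord(url=url, context=sentence))
    return records


sample = (
    "Our code is at https://github.com/example/repo. "
    "See also http://example.org/data.csv for the data."
)
for record in extract_candidates(sample):
    print(record.url, "|", record.context)
```

Keeping the surrounding sentence with each URL is a design choice made for this sketch: whether a link points to a dataset or software package is often only clear from the text around it, so an annotator or classifier would likely need that context.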

Our research addresses four major questions.

  • RQ1: How can we automatically and accurately identify OADS-URLs in academic documents at scale?

  • RQ2: What are the distributions of accessible OADS across disciplines and how fast do they disappear?

  • RQ3: How can we predict which OADS should be archived, and how should we rank archiving priorities?

  • RQ4: How do we preserve OADS and make them accessible using web archives and digital libraries?

This project will be carried out at ODU, IA, and Virginia Tech. Responsibility for management, research, and dissemination will be shared among PI Wu and Co-PIs Dr. Sawood Alam, Dr. Ed Fox, and Bill Ingram.