Reproducible Research and the problems of preserving computer code and software

We collect and preserve a lot of the documentary evidence of science happening at UCSF — everything from lab notebooks to lab websites detailing research processes. We even hold tons and tons of data in our collections, mostly in physical form, as patient surveys or health records, or even raw data as it was initially recorded by hand in the lab.

But what about the products of contemporary science, where key digital elements such as computer code or software might be crucial to an understanding of the research? This is already presenting problems for research reproducibility. Think, for example, of a set of results obtained using a script written in the Python programming language. If you want to verify these results, are you able to view the source code which produced them? Are you able to execute that code on your own computer? Can you tell what each piece of the code does? Does the code rely on access to an external data set to work correctly, and can you access and/or assess that data set to test the code?
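To make these questions a little more concrete, here is a minimal sketch of the kind of record a researcher could save alongside a set of results. It is only an illustration, not a description of any tool in use at UCSF: the function names (`sha256_of`, `describe_run`, `write_manifest`) and the manifest layout are assumptions made up for this example. The idea is simply to note which interpreter ran the script and to fingerprint both the script and its data set, so that a later reader can tell whether they are looking at the same code and the same data.

```python
import hashlib
import json
import platform
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def describe_run(script_path, data_path):
    """Collect basic facts a later reader would want before re-running an analysis.

    The field names are hypothetical; they are not part of any standard.
    """
    return {
        "python_version": platform.python_version(),
        "platform": platform.platform(),
        "script": Path(script_path).name,
        "script_sha256": sha256_of(script_path),
        "data": Path(data_path).name,
        "data_sha256": sha256_of(data_path),
    }


def write_manifest(script_path, data_path, manifest_path):
    """Save the record as JSON next to the results and return it."""
    record = describe_run(script_path, data_path)
    Path(manifest_path).write_text(json.dumps(record, indent=2, sort_keys=True))
    return record
```

A call such as `write_manifest("analysis.py", "survey.csv", "manifest.json")` (file names invented for the example) would leave a small JSON file beside the results. A record like this does not explain what each piece of the code does, so it complements written documentation rather than replacing it.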

As we work more closely with our Data Science Initiative team on these issues, it becomes clear that these are preservation questions as well. A critical understanding of the scientific past and present requires access to the primary source documentation of that research, including computer code and software. Being able to understand and interpret that computer code involves many of the same questions mentioned above: execution of the code, documentation of each process in the code, access to necessary data, and so on.

To begin to address this, we are working with the Data Science team to assess researcher coding practices. This is a first step in understanding how the library can encourage better documentation and preservation of code, in the service of reproducible research and the persistence of the scientific scholarly record. And if you’re a researcher who codes for your work, then we want feedback from you! Please consider attending one of our lunchtime listening sessions in the coming weeks: 4/20 from 12-1:30 pm at Mission Bay, and 4/27 from 12-1:30 pm at Parnassus. We will have an informal chat about research coding practices, discuss some of the issues we encounter as information professionals, and talk about what the library can do to aid in these areas.

Join us as we make some inroads on this challenging information problem.

New Sites in the UCSF Web-Archive

As discussed previously here, we’ve been working on expanding our web-archives presence in all areas across campus, and one of the developments we’re most excited about is getting the web-archiving process formalized in the centralized UCSF workflow for upgrading websites or retiring abandoned ones. Now that we have been successful in establishing this program, archiving a site is an official part of the website rollover or retirement process, which means we have a much better finger on the pulse of the UCSF web presence.

And as this process ramps up, we’ve been adding all sorts of fascinating UCSF websites to our collections, so we wanted to highlight a few recent acquisitions.

First is a complete copy of the website for the W. M. Keck Center for Noncoding RNAs, which is scheduled to be rolled over to a new platform soon. The Keck Center explores the 98.6% of the human genome that is “non-coding,” meaning it does not directly contain the code to create proteins. Since, in their words, most genetic research focuses on the protein-encoding genes (those genes whose purpose is clear), the area of non-coding RNA can be thought of as “genetic dark matter.” Even though the purpose of this genetic material is not clear, it still influences human health, and it is the mission of the Keck Center to figure out how.

Homepage of the archived version of the W.M. Keck Center website.

The lab uses mice in their process, modifying mouse stem cells and using mouse genes to examine the function of these non-coding RNAs. And conveniently, their lab website contains all the raw genetic data, as well as the experiment plans, images, and other associated data for these experiments. We’re excited about this capture because we were able to collect all this data at once and provide a snapshot of the lab’s work — complete with all the associated research materials. This is a huge help in tackling the problem of historic preservation of contemporary scientific work, and it even begins to address the very present problems of reproducibility in data-intensive and computing-intensive scientific research.

Another website we have recently captured illustrates the value of curating a selection of the UCSF institutional ecosystem all together. This is the site of the UCSF Institutional Animal Care and Use Committee. Say, for example, that in examining the archived site of the Keck Center, you also wondered what the legal treatment protocols and procedures were at the time for scientific research involving animal subjects, and whether or not the Keck Center was following those protocols. With a little clicking around on the Wayback Machine you would be able to quickly answer that question, and would have a clear picture of where the Keck Center’s research fit into the larger legal and ethical questions on campus and in the scientific community about proper treatment of and care for the animals used in research.

Homepage of the archived version of the UCSF Institutional Animal Care and Use Committee.

We look forward to continuing to build and enrich our web-archive collections, and remember that if you have a suggestion you can always request that we begin capturing your UCSF site!