Web-Archiving the UCSF Response to COVID-19

We’re excited to announce the publication of the UCSF COVID-19 Response Web-Archive. UCSF has historically been a “first responder” to a wide variety of public health emergencies. At the outset of the COVID-19 pandemic, UCSF archivists recognized that the evolving UCSF response to the situation would contain valuable information about this important, tragic, and devastating historical moment, and that documenting that response as it grew and changed would be a powerful historical record. And we were able to act quickly, because so much of the record is on the web.

Archives and Special Collections has been archiving websites for a long time. Our oldest captures date back to 2007, which feels like another epoch in web-time (you can see all of our web-archives here: https://archive-it.org/organizations/986). To archive the web, we use specialized tools to take “captures” or “snapshots” of a given web-page at a given time, usually coming back to take new captures at regular intervals. Because of this technique, web-archives are a valuable way to watch a website evolve and change, which makes them well suited to documenting something like a rapidly evolving response to a global pandemic.

Image of website of AIDS Research Institute's COVID-19 Task Force showing their March 25, 2020 update on the pandemic in California and San Francisco.
The March 25, 2020 update of the AIDS Research Institute’s COVID-19 task force. Note that at this time there were only 76 confirmed COVID cases and no deaths.

In documenting the UCSF response to COVID-19, however, we had to work much more quickly and in much greater volume than we are used to. As you likely remember, in the early days of the pandemic both the UCSF response and the nationwide response were changing daily based on rapidly shifting information. Archives usually captures web-pages every 3 months or every 6 months, but upon embarking on this collection we realized that we needed to begin capturing certain websites every day. Additionally, UCSF has at any given time as many as 1000 different official websites (anything with ucsf.edu as the base domain), so knowing which of these contained COVID information and should be captured was difficult. To remedy this problem, archivists set up Google Alerts to notify us any time something was published to a ucsf.edu domain that mentioned certain keywords identified as likely COVID-related.
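
For anyone curious what that kind of triage looks like in code, here is a minimal sketch in Python. It is not how Google Alerts works and not a tool we ran; the keyword list and function names are hypothetical, chosen only to illustrate the two checks described above (is the page on a ucsf.edu domain, and does it mention a likely COVID-related keyword).

```python
from urllib.parse import urlsplit

# Hypothetical keyword list; the actual alert terms are not given in this post.
KEYWORDS = ("covid", "coronavirus", "sars-cov-2")


def is_ucsf_site(url):
    """Return True if the URL's host is ucsf.edu or a subdomain of it."""
    host = (urlsplit(url).hostname or "").lower()
    return host == "ucsf.edu" or host.endswith(".ucsf.edu")


def mentions_keyword(text, keywords=KEYWORDS):
    """Return True if any keyword appears in the text, ignoring case."""
    lowered = text.lower()
    return any(keyword in lowered for keyword in keywords)


def should_capture(url, text):
    """Flag a page for capture when both checks pass."""
    return is_ucsf_site(url) and mentions_keyword(text)
```

Matching on the host name rather than the whole URL keeps look-alike domains (for example, a host that merely ends in "ucsf.edu" without the dot) from slipping through.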

And this was only the official UCSF websites. We also wanted to document outside coverage of UCSF activities: things that appeared on news websites, blogs, and occasionally social media (though the latter is persistently difficult to capture, so download your Twitter archives, people!). We were able to use Google Alerts in a similar way to help alert us to these sites, but even more importantly we benefited from the immense assistance of the amazing Anirvan Chatterjee, Director of Data Strategy at the Clinical & Translational Science Institute. Anirvan reached out to us early in the pandemic with a list of sites he had collected that contained documentation of UCSF’s role in the pandemic response, and his human-curated list was immensely helpful. The proliferation of digital information makes human curation and metadata creation increasingly difficult in archival repositories, and having someone like Anirvan who was able to devote the time to it (most digital archivists aren’t able to devote such time, if you can believe it!) really improved the collection.

This collection is also important because it can be accessed both by a human browsing and by a computer doing computational research. We plan to use these materials to expand our work in digital health humanities and in collections as data, as our newest colleague Kathryn Stine gets underway in her role coordinating these programs. Have a question about the COVID-19 web-archive collection? Want to use it in a computational project? Just love it? Get in touch!

It’s World Digital Preservation Day!

A banner image showing digital file icons of all types.

Did I read that right, you may be wondering? A whole day for digital preservation? What on earth could that be for? Can’t we just ctrl+s and we’re good? Can’t we put things in the cloud and they’ll be there forever?

Well, you did read that right: it’s a whole day for digital preservation and it could use a whole lot more. Digital preservation takes a ton of work, and far from being a passive strategy it is an active process that continues for as long as a collection needs to be preserved. And who are we kidding, if you read our blog you probably already had an idea that this was the case, and didn’t even have any of those fake questions I tried to pose at the beginning.

On the first Thursday in November we celebrate the people and the labor that go into preserving our digital cultural, biological, medical, and general scientific heritage so that it can be accessed by future generations. Archives and archivists have long relied, intentionally or not, upon “benign neglect” (one of my favorite archival terms) for the preservation of many of our most valuable collections. “Benign neglect” refers to the fact that under decent climatic circumstances, many physical media (printed photos, printed documents, books, etc.) can be forgotten by their owner in a closet for 20, 50, or even 100 years and still be relatively fine, ready for an archivist to come, clean things up, document the order, and accession them right on into our collections.

Unfortunately, things don’t work this way for our digital heritage. Think about something you created on a computer 20 years ago. If you even had your own computer at that point, does that computer still run? Is that file format you were using even open-able by a computer today? (Anyone here use Lotus Notes?) You were probably saving things on floppy disks back then. Do you know anyone with a computer that has a floppy drive? (We have one, let us know if you want to use it!)

In addition to all those questions above, there is the danger of “bit-rot”, or the corruption of individual pieces of the data that make up a file due to the physical aging of the storage media. Floppy disks and CDs are all various forms of plastic, and you may notice that anything plastic tends to get a little less lustrous after 20 years of sitting in your desk. There’s also the tricky interaction with corporate intellectual property that dictates much of how we live our digital lives. In many cases, that file format you used to save your work is actually owned by some company, and if they decide they don’t want to keep it around anymore, then you’re out of luck.

If it’s not clear already, digital preservation is a very active process that is essential if we are to be able to access our digital heritage going forward. When we bring in new “born-digital” collections in the Archives, we have to start by stabilizing the files: removing them from their carrier media, documenting brief technical metadata about them so that we can be sure they haven’t become corrupted while we move them around, and prepping them to be easily migrated to more readable file formats if necessary. But it doesn’t stop there. Once things have been stabilized, they still need to be maintained in a digital preservation repository for as long as we intend to preserve them. Digital preservation repositories do things like check for file corruption, back up files in multiple geographic locations, and migrate files that are at risk of becoming unreadable or out-of-date to new file formats that are open and widely readable. This is all work that will need to be done for the entire preservation lifetime of the files.
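
To make the “check for file corruption” step a little more concrete, here is a minimal sketch in Python of a fixity check: record a checksum for every file when it is stabilized, then recompute the checksums later and compare. This is an illustration under our own assumptions, not the software our repository runs; the function names and the choice of SHA-256 are ours.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Compute a file's SHA-256 checksum, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its checksum."""
    root = Path(root)
    return {
        path.relative_to(root).as_posix(): sha256_of(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def verify_manifest(root, manifest):
    """Return files that are missing or whose contents have changed."""
    root = Path(root)
    problems = {}
    for relative_path, expected in manifest.items():
        target = root / relative_path
        if not target.is_file():
            problems[relative_path] = "missing"
        elif sha256_of(target) != expected:
            problems[relative_path] = "changed"
    return problems
```

In practice you would save the manifest alongside the collection at accession time and run the verification on a schedule; an empty result means every recorded file is still present and bit-for-bit identical.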

As you can see, this is all a lot of work! That’s why we’re excited to take today to celebrate the people, the systems, and the labor that go into preserving our digital heritage and making it accessible to all of us long into the future.