UCSF Archives & Special Collections awarded $99,325 LSTA grant for textual data extraction from historical materials on AIDS/HIV

The Archives and Special Collections department of the University of California, San Francisco (UCSF) Library has been awarded a $99,325 “Pitch-An-Idea, Local” grant for the first year of a two-year project from the Institute of Museum and Library Services’ (IMLS) Library Services and Technology Act funding administered through the California State Library. The Archives will take the nearly 200,000 pages of textual AIDS/HIV historical materials that have been digitized as part of various digitization projects, including the National Historical Publications and Records Commission (NHPRC)-funded project “Evolution of San Francisco’s Response to a Public Health Crisis” and the National Endowment for the Humanities (NEH)-funded project “The San Francisco Bay Area’s Response to the AIDS Epidemic,” and will extract unstructured textual data from these materials using Optical Character Recognition (OCR) and related software. The project team will prepare the text as a research-ready, unstructured textual dataset to be used for digital humanities, computationally driven cultural heritage, and machine learning research inquiries into the history of the HIV/AIDS epidemic.

The 24-month project, entitled “No More Silence — Opening the Data of the HIV/AIDS Epidemic,” commenced on July 1, 2018. The digitized materials from which text will be extracted include handwritten correspondence, notebooks, typed reports, and agency records, which together represent a broad view of the lived experience of the epidemic, including documentation from People with AIDS, their friends and families, and the scientists and public health officials working to slow the epidemic. All historical materials represented in this dataset have been previously screened to address privacy concerns. The resulting unstructured textual dataset will be deposited in the UC Dash data-sharing repository for public access and use by any interested parties, and will also be deposited in other similar data repositories as appropriate. “During my tenure at UCSF,” says health sciences historian and professor in the Department of Anthropology, History, and Social Medicine at UCSF, Dr. Aimee Medeiros, “I have been inspired by the library’s enthusiasm and dedication to public access and the use of practices in the digital humanities to help maximize access to HIV/AIDS material.” This project will build on that legacy by bringing these valuable historical materials into the realm of digital humanities and scientific research and making them computationally actionable.

According to Dr. Paul Volberding, director of the AIDS Research Institute at UCSF, “Discovering the complexities of the virus and developing effective treatments will be studied of course, but the lives of those directly involved as patients as well as care providers is equally significant. The cultural aspects of the epidemic will most directly benefit from the work [of this project]. Combining the growing field of computational science with the already large and rapidly growing archive of materials from all aspects of the AIDS epidemic demand the creation of new tools and I look forward to the new insights we gain from their application. [UCSF Library has] been sharply focused on the AIDS archives and have amassed a rich collection that, in its digitized form, will be the database for [these] new efforts. Together, this database and new computational tools, will enable a sophisticated analysis that I am convinced will be used to shed more insight in our understanding of the impact of the epidemic and ways our response will have meaning in the inevitable future crises.”

Once the preparation of the textual dataset is completed, the project team, consisting of archivists and technical staff from both the Archives and the Library, will embark on several pilot research projects that apply machine learning, and especially natural language processing, research methods to the data. The pilot projects, which will be scoped in collaboration with various stakeholders, will explore what kinds of structured data can be pulled out of the unstructured text and define some simple critical inquiries that can be addressed using this data, these methods, and the results of these experimental endeavors. Additionally, the project team hopes to get a better sense of the functional requirements for systems supplying this type of data when tailored towards these kinds of medical humanities research questions. Through these efforts, the project team will be able to better define the extent to which, as stated by Dr. Medeiros, “making 200,000 pages of primary-source archival documentation converted to unstructured textual data will… further meaningful research and our understanding of this epidemic.”

Finally, the project team will promote the existence of this dataset, and will lead workshops to help instruct potentially interested students, researchers, scholars, and members of the general public in its use. Again in the words of Dr. Medeiros, “the plans to provide workshops to help curious scholars learn how to best interface with this data is exciting as it will allow for those who are experts in the field but not necessarily in the digital medical humanities to conduct important research.”

This project will support innovation, creativity, and collaboration in and across the humanities, social sciences, and STEM fields (Science, Technology, Engineering, and Math) by opening up a new body of historical materials for research and discovery. The project will foster new creative research methods in areas of the humanities that are just beginning to experiment with computationally driven research, and it will encourage collaboration through the use of the newly created data resource, engaging the expertise of both humanists and scientists in making discoveries in the data. Not only does this collaborative work allow for innovation “at the edges” of each of these fields, it also allows for computational access to a previously inaccessible research object: the data of the lived experience and cultural history of the AIDS crisis in the Bay Area and beyond.

The following institutions and groups are serving as informal partners on this project:

About UCSF Archives & Special Collections (UCSF Library)
The mission of the UCSF Archives & Special Collections is to identify, collect, organize, interpret, and maintain rare and unique material to support research and teaching of the health sciences and medical humanities and to preserve institutional memory. The UCSF AIDS History Project (AHP) began in 1987 as a joint effort of historians, archivists, AIDS activists, health care providers, scientists, and others to secure historically significant resources documenting the response to the AIDS crisis. Its holdings currently include 46 collections and continue to grow.

About the Library Services and Technology Act
Library Services and Technology Act (LSTA) grants are federal funds from the Institute of Museum and Library Services that are awarded by the State Library to eligible California libraries. This project was supported in whole or in part by the U.S. Institute of Museum and Library Services under the provisions of the Library Services and Technology Act, administered in California by the State Librarian. www.library.ca.gov/grants/library-services-technology-act/
