UCSF Archives & Special Collections includes numerous digitized collections documenting health sciences topics, ranging from institutional, community, and individual responses to illness and disease to industry impacts on public health. We make many of these collections available as data that can be computationally analyzed for health sciences and humanities research.
If you are curious about working with data from the UCSF Archives and Special Collections, the Digital Health Humanities (DHH) pilot program will showcase our “archives as data” throughout the month. In two upcoming sessions, we’ll provide an orientation to available data as well as methods for finding, accessing, and exploring these data resources:
DHH programming also continues to partner with the Data Science Initiative (DSI) to offer workshops on tools and methods well-suited to conducting research with “archives as data.” March workshops in the DSI Python for Data Analysis series will dig into text analysis using natural language processing and building machine learning models:
Through these workshops and selected companion follow-up sessions with troubleshooting and guided process walkthroughs, researchers can learn and practice data analysis techniques and get familiar with data from our collections. Check out the library’s events calendar to find and register for the latest offerings!
If you have data you’d like to work with that needs tidying and preparation, attend a DSI OpenRefine workshop. This workshop will cover techniques for cleaning structured data, no programming required! There will be two OpenRefine sessions this month:
OpenRefine for Archives as Data, Wednesday, March 8, 12 – 1:30 p.m. PT (This is a DHH companion session to the Cleaning Spreadsheet Data with OpenRefine DSI workshop and all are welcome.)
Slides, linked resources, and recordings from previously held DHH sessions are available on the CLE. There you will find materials from a Digital Health Humanities Overview session and recorded walkthroughs covering Unix, Python, and Jupyter notebook basics. Related resources will be updated on the CLE following DHH sessions.
For questions, please contact DHH Program Coordinator Kathryn Stine at email@example.com. The UCSF Digital Health Humanities Pilot is funded by the Academic Senate Chancellor’s Fund via the Committee on Library and Scholarly Communication.
We are excited to launch digital health humanities pilot programming starting January 2023! Digital health humanities (DHH) is an emerging discipline that uses digital methods and resources to explore research questions about the human experience of health and illness. The Digital Health Humanities Pilot (DHHP) will facilitate new insights into historical health data. Participants will learn how to evaluate and integrate digital methods and “archives as data” into their research through a range of offerings and trainings.
The programming from this pilot will bring a humanistic context to understanding institutional, personal and community responses to health issues, as well as social, cultural, political and economic impacts on individual and public health. The DHHP will offer researchers from all disciplines (including faculty, staff, and other learners) tailored workshops, classes, and skill-building sessions. Workshops will encourage the use of “archives as data” and utilize datasets from holdings within the UCSF Archives and Special Collections (including the AIDS History Project and Industry Documents Library, among others). Additionally, in spring 2023 we will be hosting the Digital Health Humanities Symposium. The symposium will provide space to consider theoretical issues central to this emerging field and highlight digital health humanities projects. More information on the symposium will be shared soon.
The UCSF Digital Health Humanities Pilot is funded by the Academic Senate Chancellor’s Fund via the Committee on Library and Scholarly Communication.
Register for an upcoming Digital Health Humanities overview session
This session will include an orientation led by Digital Health Humanities Program Coordinator Kathryn Stine and Digital Archivist Charlie Macquarie. We will discuss various approaches in DHH research, including getting familiar with data analysis and programming skills, and will share an overview of the UCSF Library’s archival collections data available for research.
For questions about digital health humanities at UCSF, please contact Digital Health Humanities Program Coordinator Kathryn Stine at firstname.lastname@example.org.
The Data Science Initiative (DSI) is offering workshops in the coming months to support researchers interested in implementing DHH approaches. Follow-up sessions will be available for researchers to reinforce and contextualize programming foundations in practical application. Check out the upcoming sessions:
We invite you to check out the library’s events and classes calendar for upcoming DHHP (and related DSI) programming. If you are unable to attend any of the sessions listed above, we advise referring to the DSI Collaborative Learning Environment (CLE) (accessible with MyAccess credentials) for recordings and resources.
This is a guest post from Lubov McKone, the Industry Documents Library’s 2022 Data Science Senior Fellow.
This summer, I served as the Industry Documents Library’s Senior Data Science Fellow. A bit about me – I’m currently pursuing my MLIS at Pratt Institute with a focus in research and data, and I’m hoping to work in library data services after I graduate. I was drawn to this opportunity because I wanted to learn how libraries are using data-related techniques and technologies in practice – and specifically, how they are contextualizing these for researchers.
The UCSF Industry Documents Library is a vast collection of resources encompassing documents, images, videos, and recordings. These materials can be studied individually, but increasingly, researchers are interested in examining trends across whole collections or subsets of them. In this way, the Industry Documents Library is also a trove of data that can be used to uncover trends and patterns in the history of industries impacting public health. In this project, the Industry Documents Library wanted to investigate what information is lost or changed when its collections are transformed into data.
There are many ways to generate data from digital collections. In this project we focused on a combination of collections metadata and computer-generated transcripts of video files. Like all information, data is not objective but constructed. Metadata is usually entered manually and is subject to human error. Video transcripts generated by computer programs are never 100% accurate. If accuracy varies based on factors such as the age of the video or the type of event being recorded, how might this impact conclusions drawn by researchers who are treating all video transcriptions as equally accurate? What guidance can the library provide to prevent researchers from drawing inaccurate conclusions from computer-generated text?
Kate Tasker, Industry Documents Library Managing Archivist
Rebecca Tang, Industry Documents Library Applications Programmer
Geoffrey Boushey, Data Science Initiative Application Developer and Instructor
Lubov McKone, Senior Data Science Fellow
Lianne De Leon, Junior Data Science Fellow
Rogelio Murillo, Junior Data Science Fellow
Based on the background and the goals of the Industry Documents Library, the project team identified the following research questions to guide the project:
Taking into account factors such as year and runtime, how does computer transcription accuracy differ between television commercials and court proceedings?
How might transcription accuracy impact the conclusions drawn from the data?
What guidance can we give to researchers to prevent uninformed conclusions?
This project is a case study that evaluates the accuracy of computer-generated transcripts for videos within the Industry Documents Library’s Tobacco Collection. These findings provide a foundation for UCSF’s Industry Documents Library to create guidelines for researchers using video transcripts for text analysis. This case study also acts as a roadmap and a collection of instructional materials for similar studies to be conducted on other collections. These materials have been gathered in a public GitHub repo, viewable here.
Sourcing the Right Data
At the beginning of the project, we worked with the Junior Fellows to determine the scope of the project. The tobacco video collection contains 5,249 videos that encompass interviews, commercials, court proceedings, press conferences, news broadcasts, and more. We wanted to narrow our scope to two categories that would illustrate potential disparities in transcript accuracy and meaning. After transcribing several videos by hand, the fellows proposed commercials and court proceedings as two categories that would suit our analysis. We felt 40 would be a reasonable sample size of videos to study, so each fellow chose 10 videos from each category, covering a range of years, quality levels, and runtimes. The fellows selected videos from a list generated with the Internet Archive Python API, containing video links and metadata such as year and runtime.
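The sampling step above can be sketched with the `internetarchive` Python package. This is a hedged illustration: the collection query and the metadata field names ("year", "runtime") are assumptions for demonstration, not the project's exact configuration.

```python
# Hedged sketch: pulling video identifiers and metadata from the Internet
# Archive with the `internetarchive` Python package. The query string and
# metadata field names below are illustrative assumptions.
def summarize(metadata):
    """Keep only the (hypothetical) fields used for sampling videos."""
    return {
        "identifier": metadata.get("identifier"),
        "year": metadata.get("year"),
        "runtime": metadata.get("runtime"),
    }

def list_videos(query="collection:tobaccoarchives AND mediatype:movies", limit=10):
    # Imported lazily so summarize() is usable without the package installed.
    from internetarchive import search_items, get_item

    rows = []
    for result in search_items(query):
        item = get_item(result["identifier"])
        rows.append(summarize(item.metadata))
        if len(rows) >= limit:
            break
    return rows
```

A list built this way can then be filtered by year and runtime to draw a balanced sample across categories.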
Computer & Human Transcripts
Once the 40 videos were selected, we extracted transcripts from each URL using the Google AutoML API for transcription. We saved a copy of each computer transcription to use for the analysis, and provided another copy to the Junior Fellows, who edited them to accurately reflect the audio in the videos. We saved these copies as well for comparison to the computer-generated transcription.
To compare the computer and human transcripts, we conducted research on common metrics for transcript comparison. We came up with two broad categories to compare – accuracy and meaning.
To compare accuracy, we used the following metrics:
Word Error Rate – a measure of how many insertions, deletions, and substitutions are needed to convert the computer-generated transcript into the reference transcript. We subtracted this number from 1 to get the Word Accuracy Rate (WAR).
BLEU score – a more advanced algorithm measuring n-gram matches between the transcripts, normalized for n-gram frequency.
Human-evaluated accuracy – a rating of Poor, Fair, Good, or Excellent assigned by the fellows as they edited the computer-generated transcripts.
Google AutoML confidence score – a score generated by Google AutoML during transcript generation indicating how accurate Google believes its transcription to be.
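The first of these metrics can be computed directly: Word Error Rate is the minimum number of word-level insertions, deletions, and substitutions needed to turn the computer transcript into the reference transcript, divided by the reference length. A minimal sketch:

```python
# Word Error Rate via classic dynamic-programming edit distance over words.
# Word Accuracy Rate is 1 - WER, as described above.
def word_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # delete all remaining reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # insert all remaining hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution or match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

def word_accuracy_rate(reference: str, hypothesis: str) -> float:
    return 1.0 - word_error_rate(reference, hypothesis)
```

For example, a three-word reference with one substituted word yields a WER of 1/3 and a WAR of 2/3.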
To compare meaning, we used the following metrics:
Sentiment – We generated sentiment scores and magnitude for both sets of transcripts. We wanted to see whether the computer transcripts were under- or over- estimating sentiment, and whether this differed across categories.
Topic modeling – We ran a k-means topic model with two clusters to see whether the computer transcripts matched the predetermined categories as closely as the human transcripts did.
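The k-means comparison can be sketched with scikit-learn: vectorize transcripts with TF-IDF, cluster into k=2 groups, and check how the clusters line up with the known commercial and court-proceeding labels. The toy texts below are invented for illustration, not drawn from the collection.

```python
# Hedged sketch of TF-IDF + k-means clustering of transcripts (scikit-learn).
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

def cluster_transcripts(texts, k=2, seed=0):
    # TF-IDF weights downplay words common to every transcript.
    tfidf = TfidfVectorizer(stop_words="english")
    X = tfidf.fit_transform(texts)
    km = KMeans(n_clusters=k, random_state=seed, n_init=10)
    return km.fit_predict(X)

# Two invented "commercial" texts followed by two "court proceeding" texts.
transcripts = [
    "buy fresh mild cigarettes taste great smooth flavor",
    "smooth flavor mild taste buy today",
    "the court finds the defendant witness testimony objection",
    "objection sustained the witness may answer counsel",
]
labels = cluster_transcripts(transcripts)
```

Running the same clustering on the computer and human transcripts, then comparing each set of labels against the predetermined categories, gives a rough measure of how much meaning the computer transcripts preserve.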
Findings & Recommendations
Relationships in the data
From an initial review of the significant correlations in the data, we gained some interesting insights. As shown in the correlation matrix, AutoML confidence score, fellow accuracy rating, and Word Accuracy Rate (WAR) are all significantly positively correlated. This means that the AutoML confidence score is a relatively good proxy for transcript accuracy. We recommend that researchers who are seeking to use computer-generated transcripts look to the AutoML confidence score to get a sense of the reliability of the computer-generated text they are working with.
We also found a significant positive correlation between year and fellow accuracy rating, Word Accuracy Rate, and AutoML confidence score – suggesting that the more recent the video, the better the quality. We suggest informing researchers that newer videos may generate more accurate computer transcriptions.
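The correlation checks described above are straightforward to reproduce with pandas. The numbers below are made up to show the mechanics, not the project's actual measurements.

```python
# Hedged sketch: a Pearson correlation matrix over per-video metrics.
# The values are fabricated for illustration only.
import pandas as pd

df = pd.DataFrame({
    "year": [1955, 1962, 1970, 1984, 1991],
    "automl_confidence": [0.61, 0.66, 0.72, 0.81, 0.88],
    "word_accuracy_rate": [0.58, 0.64, 0.70, 0.83, 0.90],
})
corr = df.corr(method="pearson")
```

Here `corr.loc["automl_confidence", "word_accuracy_rate"]` quantifies how well the confidence score proxies measured accuracy, and the `year` row captures the recency effect.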
Transcript accuracy over time
One of the Junior Fellows suggested that we look into whether there is a specific cutoff year where transcripts become more accurate. As shown in the visual below, there’s a general improvement in transcription quality after the 1960s, but not a dramatic one. Interestingly, this trend disappears when looking at each video type separately.
Transcript accuracy by video type
When comparing transcript accuracy between the two categories, we found that our expectations were challenged. We expected the accuracy of the advertising video transcripts to be higher, because advertisements generally have a higher production quality, and are less likely to have features like multiple people speaking over each other that could hinder transcription accuracy. However, we found that across most metrics, the court proceeding transcripts were more accurate. One potential reason for this is that commercials typically include some form of singing or more stylized speaking, which Google AutoML had trouble transcribing. We recommend informing researchers that video transcripts from media that contain singing or stylized speaking may be less accurate.
The one metric that the commercials were more accurate in was BLEU score, but this should be interpreted with caution. BLEU score is supposed to range from 0-1, but in our dataset its range was 0.0001 – 0.007. BLEU score is meant to be used on a corpus that is broken into sentences, because it works by aggregating n-gram accuracy on a sentence level, and then averaging the sentence-level accuracies across the corpus. However, the transcripts generated by Google AutoML did not contain any punctuation, so we were essentially calculating BLEU score on a corpus-length sentence for each transcript. This resulted in extremely small BLEU scores that may not be accurate or interpretable. For this reason, we don’t recommend the use of the BLEU score metric on transcripts generated by Google AutoML, or on other computer-generated transcripts that lack punctuation.
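The segmentation caveat can be made concrete with a toy from-scratch BLEU (a simplified version of the standard metric, written here for illustration): BLEU is a brevity-penalized geometric mean of clipped 1- to 4-gram precisions, so scoring an entire unpunctuated transcript as one long "sentence" lets scattered 4-gram mismatches drag the geometric mean, and the score, toward zero.

```python
# Toy BLEU: geometric mean of clipped 1-4-gram precisions with a brevity
# penalty. Simplified for illustration; real implementations also smooth
# zero precisions and average over sentence-level segments.
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(reference, hypothesis, max_n=4):
    if not hypothesis:
        return 0.0
    precisions = []
    for n in range(1, max_n + 1):
        hyp_counts = Counter(ngrams(hypothesis, n))
        ref_counts = Counter(ngrams(reference, n))
        # Clipped matches: a hypothesis n-gram counts at most as many
        # times as it appears in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        precisions.append(overlap / max(sum(hyp_counts.values()), 1))
    if min(precisions) == 0:
        return 0.0  # any empty n-gram overlap zeroes the geometric mean
    brevity = min(1.0, math.exp(1 - len(reference) / len(hypothesis)))
    return brevity * math.exp(sum(math.log(p) for p in precisions) / max_n)
```

Splitting transcripts into sentences before scoring, and averaging sentence-level BLEU across the corpus, is the standard remedy, which the unpunctuated AutoML output does not allow.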
We looked to sentiment scores to evaluate differences in meaning between the test and reference transcripts. As we expected, commercials, which are sponsored by companies profiting from the tobacco industry, tend to have positive sentiment, while court proceedings, which are generally brought against those companies, tend to have negative sentiment. As shown in the plot to the left, the computer transcripts slightly underestimated sentiment in both video types.
Opportunities for Further Research
Throughout this project, it was important to me to document my work and generate a research dataset that could be used by others interested in extending this work beyond my fellowship. There were many questions that we didn’t get a chance to investigate over the course of this summer, but my hope is that the work can be built upon – maybe even by a future fellow! This dataset lives in the project’s GitHub repository under data/final_dataset.csv.
One aspect of the data that we did not investigate as much as we had hoped was topic modeling. This will likely be an important next step in assessing whether transcript meaning varies between the test and reference transcripts.
Professional Learnings & Insights
My main area of interest in the field of library data services is critical data literacy – how we as librarians can use conversations around data to build relationships and educate researchers about how data-related tools and technologies are not objective, but subject to the same pitfalls and biases as other research methods. Through my work as the Industry Documents Library Senior Data Science Fellow, I had the opportunity to work with a thoughtful team who is thinking ahead about how to responsibly guide researchers in the use of data.
Before this fellowship, I wasn’t sure exactly how opportunities to educate researchers around data would come up in a real library setting. Because I previously worked for the government, I tended to imagine researchers sourcing data from government open data portals such as NYCOpenData, or other public data sources. This fellowship opened my eyes to how often researchers might be using library collections themselves as data, and to the unique challenges and opportunities that can arise when contextualizing this “internal” data for researchers. As the collecting institution, you might have more information about why data is structured the way it is – for instance, the Industry Documents Library created the taxonomy for the archive’s “Topic” field. However, you are also often relying on hosting systems that you don’t have full control over. In the case of this project, there were several quirks of the Internet Archive API that made data analysis more complicated – for example, the video names and identifiers don’t always match. I can see how researchers might be confused about what the library does and does not have control over.
Another great aspect of this fellowship was the opportunity to work with our high school Junior Fellows, who were both exceptional to work with. Not only did they contribute the foundational work of editing our computer-generated transcripts – tedious and detail-oriented work – they also had really fresh insights about what we should analyze and what we should consider about the data. It was a highlight to support them and learn from them.
I also appreciated the opportunity to work with this very unique and important collection. Seeing the breadth of what is contained in the Industry Documents Library opened my eyes to not only the wealth of government information that exists outside of government entities, but also to the range of private sector information that ought to be accessible to the public. It’s amazing that an archive like the Industry Documents Library is also so invested in thinking critically about the technical tools that it’s reliant upon, but I guess it’s not such a surprise! Thanks to the whole team and to UCSF for a great summer fellowship experience!
The Industry Documents Library (IDL) is excited to welcome three Data Science Fellows to our team this summer. The Data Science Fellows will be working with the IDL and with the UCSF Library Data Science Initiative (DSI) to assess the impact of transcription accuracy on text analysis of digital archives, using the IDL collections.
Through tagging, human transcription, and computer-generated transcription, the team will assess how accuracy may differ between media or document types, and how and whether this difference is more or less pronounced in certain categories of media (for example, video recordings of focus groups, community meetings, court proceedings, or TV commercials, all of which are present in the IDL’s video collections). After identifying transcript accuracy in different media types, we aim to provide guidelines to researchers and technical staff for proper analysis, measurement, and reporting of transcript accuracy when working with digital media.
Our Junior Data Science Fellows are Rogelio Murillo and Lianne De Leon. Rogelio and Lianne are both participating in the San Francisco Unified School District (SFUSD) Career Pathway Summer Fellowship Program. This six-week program provides opportunities for high school students to gain work experience in a variety of industries and to expand their learning and skills outside of the classroom. Lianne and Rogelio will be learning about programming and creating transcriptions for selected audiovisual materials. The IDL thanks SFUSD and its partners for running this program and providing sponsorship support for our fellows.
Lubov McKone is our Senior Data Science Fellow and will be using automated transcription tools to extract text from audiovisual files, run sentiment and topic analyses, and compare automated results to human transcription. Lubov will also provide guidance and mentoring to the Junior Fellows.
Our Fellows have introduced themselves below. Please join us in welcoming Rogelio, Lianne, and Lubov to the UCSF Library this summer!
Hi my name is Lianne R. de Leon and I go to Phillip and Sala Burton High School as a rising senior. I love playing volleyball in my free time and you may see me at numerous open gyms around the city. In the future I hope to major in computer science or computer engineering. I’m looking forward to meeting many wonderful people here at UCSF and learning more about the data science industry from the inside.
Hi, my name is Rogelio Murillo and I’m a rising junior at Ruth Asawa School of the Arts. I enjoy playing a variety of music and percussion. I’ve played Japanese Taiko, Afro Brazilian drumming, and Latin Jazz. I’m also learning guitar over the summer. I’m a responsible and respectful person.
My name is Lubov McKone and I’m currently pursuing my Masters in Library and Information Science from Pratt Institute in Brooklyn, NY. I also hold a Bachelor’s degree in Statistics, and prior to entering graduate school I worked as a data analyst in local government. My professional interests include supporting researchers in the accurate and responsible use of data, and I aspire to work as a data librarian in an academic library after graduation. Outside of work, I spend my time cooking, doing yoga, and writing music. I’m very excited to be joining the UCSF Industry Documents Library this summer, and I’m looking forward to learning more about how researchers use digital collections!
By Erin Hurley, User Services & Accessioning Archivist
Although, in 2020, advice like “wash your hands” and “cover your mouth when you cough” seems fairly obvious and commonsensical, there was a time when this was not the case. That time was March 1855, when the situation in British hospitals outside of Constantinople (now Istanbul, Turkey) during the Crimean War had become so dire that Florence Nightingale and 40 other women acting as trained volunteer nurses were finally allowed access to patients (they had previously been denied access because of their gender). Hospitals were overcrowded, and extremely unsanitary conditions encouraged the spread of infectious diseases like cholera, typhoid, typhus, and dysentery, which Nightingale recognized immediately. She implemented basic cleanliness measures, such as baths for patients, clean facilities, and fresh linens, and advocated for an approach that addressed the psychological and emotional, as well as the physical, needs of patients. Her improvements brought a dramatic decline in the mortality rate at these hospitals, which had previously been as high as 40%.
While Nightingale is well known as one of the world’s first nurses, she is less well known for her strikingly lovely data visualizations (including pie charts and a rose-shaped design called the “coxcomb”), which she used to highlight the number of deaths from diseases, in addition to deaths from wounds or injury, during the Crimean War. Nightingale, a mathematician and statistician, recognized the importance of eye-catching visuals in communicating the impact of her innovations.
For several years now, the NLM’s History of Medicine Division has been embracing the future as we continue our mission to collect, preserve, make freely available, and curate for diverse audiences the NLM’s treasured historical collections, which span ten centuries. I’ve described this mission as stewardship of the past, and I have argued that it is not mutually exclusive of embracing the future. This is because to be the best steward of history during times of change, it is important to anticipate, explore, and chart the paths toward many possible futures. So what do I mean by embracing the future?
Embracing the future means facing change. It means engaging and grappling with it, because studying history can contribute meaningfully to contextualizing and shaping change.
Embracing the future means supporting open and “citizen-centered” government. It means enabling access to all, not just a few. It means engaging new audiences, not only the traditional ones. It involves engagement across the disciplines, and across the spectrum of the public, to ensure that scholars, educators, and interested people of today and tomorrow can have access to the world’s historical medical heritage for research, teaching, and learning.
NLM’s treasured historical collections span ten centuries and originate from nearly every part of the world. Our digitization of these materials, for greater access by researchers of all disciplines, goes hand in hand with our preservation of them, in their original form, for future generations of researchers.
Embracing the future means embracing fair use and supporting robust digitization as a means of both access and preservation, and achieving these goals through mutually supportive public and private partnerships. Moreover, embracing the future means appreciating and understanding that digitized historical medical collections exist in a format appealing not only to those focused on deep reading and close study of individual works, but also to scholars and to entirely new audiences who are interested in mining these digital surrogates and their associated metadata for more data-focused research. The evolving digital world is producing an ever-increasing volume of digitized physical material and born-digital resources. The worlds of “big data” and data science are meeting a longstanding world of persistent physical objects that contain records of the human condition. As these worlds collide and coexist, opportunities abound to advance interdisciplinary collaboration and expand cooperation among institutions and organizations that preserve history and support current and future medical research, and research in all disciplines.
A Chorus of Voices. Through its blog Circulating Now, the NLM is giving voice to our patrons from a variety of disciplines and backgrounds, who each in their own way and together recognize the research and educational value of our world-renowned historical collections.
And finally, from a leadership perspective, embracing the future means meeting individuals where they stand, treating them as colleagues and as part of a team. It means supporting mentorship to advance careers, and continuous learning to advance interdisciplinary research and teaching focused on historical and contemporary issues of health and the human condition. These initiatives are not only keys to embracing the future of challenges and opportunities. They are keys to succeeding in that future.
To learn more about my thoughts about embracing the future as stewards of the past, you can read this article or, if you wish, watch my October 21, 2016, Archives Talk at UCSF.