Analytics with GenAI: Know Your Data!

Posted on June 27, 2025 by Lisa Nguyen

Person working on a laptop with code on the screen, holding a smartphone. A coffee cup and glasses are on the table.

Post by Geoffrey Boushey, Head of Data Engineering, UCSF Library’s Data Science and Open Scholarship Team. He teaches and consults on programming for data pipelines with an emphasis on Python, Unix and SQL.

Teaching Approaches to Data Science

Recently, Ariel Deardorff (director of the Data Science & Open Scholarship at the UCSF Library) forwarded me a paper titled “Generative AI for Data Science 101: Coding Without Learning to Code.” In this paper, the authors described how they used GitHub Copilot, a tool for generating code with AI, to supplement a fundamentals of data science course for MBA students, most of whom had no prior coding experience. Because the instructors wanted students to use AI to generate code for analysis, but not the full analysis itself, they opted for a tool that generates code without potentially “opening the Pandora’s box too wide” with ChatGPT, a tool that might blur the line between coding and analysis.They also deliberately de-emphasized the code itself, encouraging students to focus on analytical output rather than scrutinizing the R code line by line.

This approach has some interesting parallels, along with some key differences, with the way I teach programming at the UCSF library through the “Data and Document Analysis with Python, SQL, and AI” series. These workshops are attended largely by graduate students, postdocs, research staff, and faculty (people with an exceptionally strong background in research and data science) who are looking to augment their programming, machine learning, and AI skills. These researchers don’t need me to teach them science (it turns out UCSF scientists are already pretty good at science), but they do want to learn how to leverage programming and AI developments to analyze data. In these workshops, which include introductory sessions for people who have not programmed before, I encourage participants to generate their own AI-driven code. However, I have always strongly emphasized the importance of closely any code generated for analytics or data preparation, whether pulled from online examples or created through generative AI.

The goal is to engage researchers with the creative process of writing code while also guarding against biases, inaccuracies, and unintended side effects (these are issues that can arise even in code you write yourself). Although the focus on careful examination contrasts with the approach described in the paper, it made me wonder: what if I diverged in the other direction and bypassed code altogether? If the instructors were successful in teaching MBA students to generate R code without scrutinizing it, could we skip that step entirely and perform the analysis directly in ChatGPT?

Experimental Analysis with ChatGPT

As a personal experiment, I decided to recreate an analysis from a more advanced workshop in my series, where we build a machine learning model to evaluate the impact of various factors on the likelihood of a positive COVID test using a dataset from Carbon Health. I’ve taught several iterations of this workshop, starting well before Generative AI was widely available, and have more recently started integrating GenAI-generated code into the material. But this time, I thought I’d try skipping the code entirely and see how AI could handle the analysis on its own.

I got off to a slightly rocky start with data collection. The covid clinical data repository contains a years worth of testing data split into shorter CSV files representing more limited (weekly) time periods, and I was hoping I could convince ChatGPT to infer this structure from a general link to the github repository and glob all the CSV files sequentially into a pandas dataframe (a tabular data frame format). This process of munging and merging data, while common in data engineering, can be a little complicated, as github provides both a human readable and “raw” view of CSV files. Pandas needs the raw link, which requires multiple clicks through the github web interface to access. Unfortunately, I was unsuccessful in coaxing ChatGPT into reading this structure, and eventually decided to supply github with a direct link to the raw file for one of the CSV files [8]. This worked, and ChatGPT now had a pandas dataframe with about 2,000 covid test records. Ideally, I’d do this with the full ~100k row set, but for this experiment, 2,000 records was enough.

Key Findings & Limitations

Now that I had some data loaded, I asked ChatGPT to rank the features in the dataset based on their ability to predict positive COVID test results. The AI generated a reasonably solid analysis without any need for code. ChatGPT suggested using logistic regression and produced a ranked list of features. When I followed up and asked ChatGPT to use a random forest model and calculate feature importances, it did so immediately, even offering to create a bar chart to visualize the results—no coding required.

Bar chart generated by ChatGPT showing Random Forest Feature Importances for Predicting COVID Test Result.

Here is the bar chart generated by ChatGPT showing the feature importances, with the inclusion of oxygen saturation but the notable omission of loss of smell:

One feature ChatGPT identified as highly significant was oxygen saturation, which I had overlooked in my prior work with the same dataset. This was a moment of insight, but there was one crucial caveat: I couldn’t validate the result in the usual way. Typically, when I generate code during a workshop, we can review it as a group and debug it to ensure that the analysis is sound. But in this no-code approach, the precise stages of this process were hidden from me. I didn’t know exactly how the model had been trained, how the data had been cleaned or missing values imputed, how the feature importances had been calculated, or whether the results had been cross-validated. I also didn’t have access to the feature importance scores from some machine learning algorithms (such as random forest) that I had built and trained myself. The insight was valuable, but it was hard to fully trust or even understand it without transparency into the process.

This lack of transparency became even more apparent when I asked ChatGPT about a feature that wasn’t showing up in the results: loss_of_smell. When I mentioned, through the chat interface, that this seems a likely predictor for a positive test and asked why it hadn’t been included, ChatGPT told me that this feature would indeed be valuable and articulated why, but repeated that it wasn’t part of the dataset. This surprised me, as it was in the dataset under the column name “loss_of_smell”

(The full transcript of this interaction, including the AI’s responses and corrections, can be found in the footnote [1] below).

This exchange illustrated both the potential and the limitations of AI-powered tools. The tool was quick, efficient, and pointed me to a feature I hadn’t considered. But it still needed human oversight. Tools like ChatGPT can miss straightforward details or introduce small errors that a person familiar with the data might easily catch. It could also introduce errors that may be notably more obscure, and only detectable after very careful examination and consideration of the output by someone with a much deeper knowledge of the data.

The Importance of Understanding Your Data

The experience reinforced a key principle I emphasize in teaching: know your data. Before jumping into analysis, it’s important to understand how your data was collected, how it’s been processed, what (ahem) a data engineering team may have done to it as they prepared it for you, and what it represents. Without that context, you may not be able to know when AI or other tools have led you in the wrong direction or missed critical context.

While the experiment I conducted with AI analysis was fascinating and demonstrates the potential of low- or no-code approaches, it does underscore, for me, the continued importance of generating and carefully reading code during my programming workshops. Machine learning tasks related to document analysis, such as classification, regression, and feature analysis, involve a very precise set of instructions for gathering, cleaning, formatting, processing, analyzing, and visualizing data. While generative AI tools often provide quick, opaque results, the precision involved in these processes can be obscured from the user. This lack of transparency carries significant implications for repeatable research and proper validation. For now, it remains crucial to have access to the underlying code and understand its workings to ensure thorough validation and review.

Conclusions & Recommendations

Programming languages will likely remain a crucial part of a data scientist’s toolkit for the foreseeable future. But whether you are generating and verifying code or using AI directly on a dataset, keep in mind that the core of data science is the data itself, not the tools used to analyze it.

No matter which tool you choose, the most important step is to deeply understand your data – its origins, how it was collected, any transformations it has undergone, and what it represents. Take the time to engage with it, ask thoughtful questions, and stay vigilant about potential biases or gaps that could influence your analysis. (Is it asking too much to suggest you love your data? Probably. But either way, you might enjoy the UC-wide “Love Data Week” conference).

In academic libraries, much of the value of archives comes from the richness of the objects themselves—features that don’t necessarily come through in a digital format. This is why I encourage researchers to not just work with digital transcriptions, but to also consider the physicality of the data: the texture of the paper, the marks or annotations on the margins, and the context behind how that data came to be. These details often carry meaning that isn’t immediately obvious in a dataset or a plain text transcription. Even in the digital realm, knowing the context, understanding how the data was collected, and remaining aware of the possibility of hidden bias are essential parts of the research process. [3] [4] Similarly, when working with archives or historical records, consider the importance of engaging with the data beyond just the text transcript or list of AI-detected objects in images.

Get to know your data, before, during, and after you analyze it. If possible, handle documents physically, consider what you may have missed. Visit the libraries, museums, and archives [12] where objects are physically stored, talk to archivists and curators who work with them. Your data will tend to outlast the technology that you use to analyze it, and while the tools and techniques you use for analysis will evolve, your knowledge of your data will form the core of its long-term value to you.

GPT transcript: https://github.com/geoffswc/Know-Your-Data-Post/blob/main/GPT_Transcript.pdf
In the workshop series, we use python to merge the separate csv files into a single pandas dataframe using the python glob module. It wouldn’t be difficult to do this and resume working with ChatGPT, though it does demonstrate the difficulty of completing an analysis without any manual intervention through code (for now).
“Bias and Data Loss in Transcript Generation” UCTech, 2023: https://www.youtube.com/watch?v=sNNrx1i96wc
Leveraging AI for Document Analysis in Archival Research and Publishing, It’s About a Billion Lives Symposium 2025 (recording to be posted) https://tobacco.ucsf.edu/it%E2%80%99s-about-billion-lives-annual-symposium
https://www.library.ucsf.edu/archives/ucsf/

Evaluating Automated Transcription Accuracy: A Data Science Fellowship Report

Posted on September 22, 2023 by Kate Tasker

Guest post by Noel Salmeron, 2023 Senior Data Science Fellow for the Industry Documents Library and Data Science Initiative.

Hi everyone! I had the opportunity of interning for the Industry Documents Library in coordination with the Data Science Initiative as the Senior Data Science Fellow for the Summer of 2023. I am working towards my Bachelor’s degree in Data Science and minoring in Education, and I plan to graduate in May 2024. I feel grateful that I could earn this position with UCSF and work with the fascinating Industry Documents Library as I realize how valuable archives and data are, especially when doing my own research. The Data Science Initiative was extremely helpful in teaching me Machine Learning and Natural Language Processing topics pertinent to the project and valuable for my future in data science.

Project Background

Currently, the Industry Documents Library contains more than 18 million documents relating to public health, as well as thousands of audiovisual materials, such as “recordings of internal focus groups and corporate meetings, depositions of tobacco industry employees, Congressional hearings, and radio and TV cigarette advertisements.” With this project, we wanted to evaluate the transcription accuracy of digital archives and its impact on documentation and the creation of subject words and descriptions for such archives.

Project Team

Kate Tasker, Industry Documents Library Managing Archivist
Rebecca Tang, Industry Documents Library Applications Programmer (and Junior Fellows Advisor)
Geoffrey Boushey, Head of Data Engineering (and Senior Fellow Advisor)
Rachel Taketa, Industry Documents Library Processing and Reference Archivist
Melissa Ignacio, Industry Documents Library Program Coordinator
Noel Salmeron, Senior Data Science Fellow
Adam Silva, Junior Data Science Fellow
Bryce Quintos, Junior Data Science Fellow

Project Terminology

Here are a few important terms to note!

Metadata: a set of data that describes other data (i.e., author, date published, file size, etc.)
Classification: categorizing objects (or text) into organized groups
Text cleaning: reducing complex text to simple text for more efficient use in Natural Language Processing

And a few terms were used interchangeably throughout this project!

Description / Summary
- A condensed version of some text
Subject / Tag / Keyword / Topic
- A single word that helps to define the text or frequently appears within the text

Project Objectives

Overall, the project had a couple of main objectives. The team wanted to train a Machine Learning model to extract subjects from Industry Documents Library video transcripts and evaluate the accuracy of the machine-generated subjects. We planned to utilize the junior interns’ datasheet they created with subjects and descriptions for over 300 videos to train the model for each tag we chose to analyze.

(The video transcripts were generated beforehand by Google AutoML with the help of Geoffrey Boushey).

Transcript Cleaning

Once the video transcripts were created from Google AutoML, I was able to clean the text using techniques I learned from previous Data Science Initiative workshops. The “Machine Learning NLP Prep” workshop techniques were especially helpful for this portion of the project. I began by setting all 324 transcripts in our dataframe to a lowercase format. This helps simplify text analysis in the long run, especially when avoiding case sensitivity complications. My next step was to remove stop words, which are common and redundant words such as articles, conjunctions, and prepositions. This was possible with the Natural Language Toolkit library for Python, which contains a list of stop words I could add to since I especially noticed ‘p.m.’ and ‘a.m.’ appearing in depositions. I continued by removing everything that isn’t alphabetic using a regular expression (or regex), a sequence of characters corresponding to a pattern to be matched. Any single characters or two character pairs were also removed. Finally, it was essential to stem words to be able to group common words without worrying about suffixes.

ML Model Creation using ID and subject/tag

After text cleaning, we set video IDs as their indices in our running dataframe to efficiently and consistently identify them. Our running dataframe consisted of a row for each of the 324 videos with columns that denoted their ID, subject words, transcript, and a category value of ‘0’ for ‘no’ or ‘1’ for ‘yes’ that corresponded to whether or not the video’s subjects words included the specific tag we were after in each single-tag analysis.

To provide a more concrete example, we will use the “lawsuit” tag, which means each video was denoted with a ‘1’ in the category column if it contained the “lawsuit” tag from the junior interns’ datasheet.

Continuing, we created training and test sets from the dataframe with a 50/50 split. This was followed by a pipeline of several operations in a sequence that included Count Vectorization and Random Forest Classification. Count Vectorization is a method in Natural Language Processing to convert text into numerical values primed for Machine Learning. This way, we can note word frequency in each word for each transcript. Furthermore, Random Forest Classification is a collection of decision trees that make binary decisions based on input and continually “bootstraps” (re-samples) from the training data set to make predictions about whether or not a video contained the “lawsuit” tag.

Features for Each Tag

We then gathered feature words and their importance values as to how they supported the model in determining if a video belonged to the “lawsuit” tag. These feature words included “exhibit,” “plaintiffs,” “counsel,” and “documents,” which change every time we run the model. It appears the less common words also slipped through, such as the company name “Mallinckrodt,” which may not appear as important in other transcript datasets relating to lawsuits.

Chart displaying a list of feature words such as “email,” “exhibit,” “court,” and other words, each labeled with numeric value representing the word’s importance value.

Cross Validation and Match Probability

Moving forward, we used Cross Validation to verify that the model’s performance was not drastically different with different training and test subsets from the running dataframe. Following this process, we were able to create a dataframe that included a column “y_adj” to indicate “Not” for the video not falling under the “lawsuit” tag and an indication of “Match” otherwise. Moreover, we included two columns, “prob_no_match” and “prob_match,” that denote the model’s assessment of the probability that a video doesn’t fit under “lawsuit” or does, respectively.

Chart displaying a list of video IDs and an associated numeric value representing the video’s probability match.

We also ran some code that narrowed down the dataframe to videos where the model incorrectly predicted a video’s match.

Chart displaying a list of video IDs and associated information, representing videos which had been incorrectly matched.

This is where we began to run into issues with this dataset since it contained a relatively small amount of videos and, therefore, a low number of videos where the “lawsuit” tag applied. The “lawsuit” tag was filed under only 26 of the 324 videos, a mere 8 percent of the dataset. It was also quite difficult to discern an appropriate threshold for whether or not a video transcript should be marked as a match to a tag because the videos that the model marked incorrectly usually appear to have significantly different probabilities for matching.

This caused our models for tags with counts under 25 or so to result in a non-existent F-score, as well as precision and recall, but a high accuracy which I will explain shortly. Meanwhile, an F-score is critical in providing an overall measure or metric for the performance of a Machine Learning model using its precision and recall.

Chart displaying a list of tags including “tobacco,” “marketing,” “lawsuit,” and other words, and a numeric value representing how many times the tag appears.

Precision & Recall

Diving into Precision and Recall, Precision can be defined as the proportion of correct positive predictions in the number of predicted positive values, while Recall is the proportion of correct positive predictions in the number of actual positive values.

In this project, the positive values would be video matches for a tag, so in terms of the project, precision is the proportion of correct match predictions out of the predicted matches, and recall is the proportion of correct match predictions out of the actual, true matches. In addition, Accuracy refers to the comprehensive correctness of all positive and negative predictions.

This image may also help visualize the precision/recall relationship:

Graphic illustrating the concept of precision (how many retrieved items are relevant) vs the concept of recall (how many relevant items are retrieved).

Thresholds

Another step we took in this project’s analysis was creating precision-recall curves for specified thresholds by the Scikit-learn library that allows for Machine Learning in Python. This way, we could recognize that as the threshold for the probability of a match increases, the precision slowly increases from about 90 percent to 100 percent. In comparison, the recall decreases from 100 percent to 0 percent.

Two charts illustrating precision-recall curves.

This can be explained by referring back to the definitions of precision and recall! Suppose the threshold for the probability of a match increases and becomes stricter. In that case, precision (the proportion of correct matches out of predicted matches) will only increase as the requirement for a video to be labeled as a match becomes stricter. Regarding the recall (the proportion of correct match predictions out of the actual matches), it becomes clear that since there is more precision, there will be no videos incorrectly marked as matches or any videos correctly marked as matches.

Opportunities for Further Research

There were a few concerns and curiosities that I ended the beginning of this project with since it was simply a pilot, and there is much more to be explored. This includes more text cleaning in subjects/tags and transcripts to make the Natural Language Processing as streamlined as possible. Additionally, it would be crucial to explore this same analysis of subjects/tags for descriptions/summaries that we could not get to. Having a fully-developed human-made datasheet for a larger dataset to explore would also be incredibly useful.

Conclusion

I am pleased to have been a part of this team with UCSF’s Industry Documents Library and Data Science Initiative this summer, as it provided me with extensive real-world experience in data analysis, machine learning, and natural language processing. It truly puts into perspective how much valuable data is out there and all of the fascinating analysis you can conduct.

Prior to this summer, I had worked with various datasets in classes, but I felt inspired by the IDL’s endeavor to enhance its vast collection and make it easier for users to search through documents with supplementary metadata. I can especially appreciate this as I have spent countless hours sifting through documents for research papers in the past. Once the subject and description generations are in full effect, I can only imagine the potential of this data and what it could lead to, as I hope it supports other people’s work.

I also tremendously appreciate the time and effort the junior interns, Adam and Bryce, put into populating their datasheet after watching hundreds of videos. Their work was foundational to getting this project running.

I also want to express my appreciation for Geoffrey and Rebecca throughout this summer for working closely with me, making me feel welcome, and addressing any concerns or questions I had during my fellowship. I am incredibly grateful for this work experience with exceptional communication, collaboration, and kindness.

Thank you to UCSF and everyone on this team for an enjoyable and fascinating fellowship experience!

Addendum: When Should We Apply a Subject Tag to an Uncategorized Document?

By Geoff Boushey, UCSF Library Head of Data Engineering

Overview

Noel described the process for creating a machine learning (ML) model, analyzing the features that go into classifying a document, and applying the model to estimate the probability that a transcript generated from the Tobacco or Opioid collection should be included in a subject tag, such as “marketing,” “legal,” or “health.”

Because most tags in the collection show up in less than 10% of the records in our training and testing set, we shouldn’t expect most tags to apply to most records. As a result, we’re looking for a relatively rare event. If we were only concerned with the overall accuracy or our model, we could achieve 90% effectiveness by never applying a specific tag to a record.

The output from our machine learning model reflects this low probability. By default, our machine learning model would only include a tag if it estimates that the probability of a match exceeds 50%. Because we’re trying to predict a relatively rare event (again, a specific tag would only apply to at most 10% of the records in a collection), it’s unlikely that we’ll have many predictions that exceed this threshold. In fact, when we test our model, we can see that records that clearly (based on human observation) belong to a specific category may have no more than a 30-40% estimated probability of belonging to this category according to the ML model. While this is below the default 50% threshold, it does represent a much higher probability than random chance, (30-40% vs 10%).

We don’t want to erroneously include a tag too often, or it will become clutter. We don’t want to erroneously exclude it too often, or researchers will miss out on relevant record matches. We may want to lower the threshold for determining when to apply a tag to a particular record, but the right threshold isn’t always clear, and can vary depending on the frequency of a tag, the accuracy of our model, and the scenario-dependent benefit or harm of false positives versus false negatives.

The harm of false positives or negatives depends heavily on the research or use scenario. For example, a researcher who wants to retrieve all reasonably likely matches and is not concerned with the inclusion of a few documents that are not related to litigation might want to set the threshold very low, even below 10%. Alternatively, a researcher might simply wish to sample a small number of litigation-related documents with a very high level of accuracy. In this case, a high threshold would be more beneficial.

Precision and Recall curves can help find an optimal threshold that strikes the right balance between false positives and false negatives.

Technical Considerations and Limitations

Because our initial dataset is small (only 300 human reviewed records are available for supervised classification), and many of the tags only show up in 10% of the records, we limit our initial analysis to a small set of metadata tags. Because these tags are human-generated and do not conform to a limited and controlled vocabulary, there is inconsistency in the training data as well. Some tags are redundant, showing up in clusters (legal and litigation, for instance, have a 95%+ overlap). Other times, two categories that might be better approached as a single category cause a split that may greatly reduce the effectiveness of an ML based classifier. Human ambiguity is often amplified when used to train ML models, and we see that effect at work here.

Precision-Recall Curves

Because there is a class imbalance between positive and negative categorization (including versus excluding a tag) and false positives are unlikely to be a serious problem (though, as discussed above, there may be some scenarios, such as sampling, where we would want to avoid them), we’ll take a look at precision-recall curves for a few of the more commonly occurring tags.

For quick reference, *Precision* refers to how often a positive classification was correct. For example, if our model predicted that a “Legal” tag should apply correctly 9 times and incorrectly 1 time, the Precision would be 90%. *Recall* refers to how often a positive classification was accurately detected. For example, if 10 records should have been classified as Legal, and our model detected 8 of them, our recall rate would be 80%. Ideally, we would like to strike some kind of balance between these two metrics, something we can achieve by raising or lowering the probability threshold for including a record in a tag. For example, if our model assigned a 30% chance that a particular record should be classified as “Legal”, we might or might not set that assignment based on whether we are trying to improve precision or recall.

For a more technical/mathematical discussion of Precision and Recall, please consult the scikit learn documentation at:

https://scikit-learn.org/stable/auto_examples/model_selection/plot_precision_recall.html

Workbook

The jupyter notebook implementing a Precision-Recall visualization for the “Legal” tag is

available at:

https://github.com/geoffswc/IDL-DSOS-2023/blob/main/Precision-Recall-Tag.ipynb

This workbook uses the scikit-plot module from scikit-learn to generate a precison-recall curve for a tag used in the classification model. Keep in mind that there isn’t much benefit to analyzing tags that show up in less than 10% of the records, and some tags may result in an error, as positive observations may be so rare (fewer than 1-2% of the records) that there is insufficient data to train or apply an ML model (a random test/train split may have *no* observations for a rare tag such a small dataset).

The visualizations generated by this workbook are available in the next section.

Visualization and Interpretation

This section displays the PR curve for “Legal”, a tag that shows up in approximately 10% of the training records. Keep in mind that common tags like “Tobacco”, which show up in 90% of the records, are auto-assigned based on the source of the collection, and do not represent the common use case. As a result, “Legal” will provide a better overview for a common tag that does not apply to most records, and performs relatively well in our predictive model.

Precision-Recall

The precision recall curve for Legal indicates a wide threshold range that preserves usable precision and recall levels. Very high or low thresholds cause degradation of model performance, but precision and recall above 80% are available with flexibility to optimize for one or the other.

Precision/Recall-Threshold

This chart plots both precision and recall curves on the Y axis with the threshold level on the X axis. We see a rapid improvement of precision with a gradual, near-linear decrease in recall, indicating an effective threshold range well below 50%.

Graph plotting the precision and recall curves (on the Y axis) with the threshold level (on the X axis).

Production Application

Although our current data set is small, these results suggest that there is some value in using a supervised classification model to extend metadata to uncategorized documents based on ML generated transcript, though there are a number of challenges, and integrating these techniques into production would involve a number of decisions that are outside the scope of this pilot.

Challenges

A production rollout of an ML based model similar to this pilot would likely run into a number of issues with scale, such as:

Training Data: our supervised machine learning model requires a set of categorized transcripts for training. This is a very time and labor intensive undertaking. We may not be able to create a sufficiently large and broad training dataset to create a meaningful model that covers even the most common tags.

Varying Thresholds: The ideal threshold will vary based on the model performance for each individual tag and the research objectives. This variance, combined with the scale of processing required, may make customizable searches based on tag probability unrealistic in a production system.

Availability of Transcripts: The tobacco, opioid, and other industry documents collections contain a large number of files (current estimate is 18 million), many are video or audio files without transcriptions. Without transcriptions available, it won’t be possible to apply the results of an ML model to make predictions for uncategorized documents.

Recommendations

This pilot does provide a template for an interesting and promising approach, and researchers may be interested in building their own ML models to analyze the transcripts in the collections.

We could provide some of this utility without a full production integration through the following:

Pre-Built Transcription Datasets: The Industry Documents Library website currently provides pre-built transcription datasets for many image record collections. A similar initiative to provide transcriptions for video and audio would provide substantial benefit for researchers, independent of the ML based classification model.

Classification Probability Estimates: Instead of integrating classification probabilities or tags into search, we could provide the ML output for each record in a pre-built dataset. This would leave the decision for setting a threshold up to researchers, but it would avoid the need to re-generate results based on model performance and researcher scenario for each tag. This approach might allow researchers to benefit from partial information.

Generalized ML Models: Several AI tools, such as Google AutoML AI, do provide pre-trained models that can provide categorization. Because these models wouldn’t be trained specifically on our metadata, they may not capture the kind of classification most relevant to researchers, but they would eliminate the need for the very labor intensive generation of a training data set.

Student Fellows Explore Machine Learning with UCSF Industry Documents Library and Data Science Initiative

Posted on July 18, 2023 by Kate Tasker

The UCSF Industry Documents Library (IDL) and Data Science Initiative (DSI) teams are excited to be working with three Data Science Fellows this summer. The Data Science Fellows are part of a joint IDL-DSI project to explore machine learning technologies to create and enhance descriptive metadata for thousands of audio and video recordings in IDL’s archival collections. This year’s summer program includes two junior fellows and one senior fellow.

Our junior fellows are tasked with manually assigning or improving metadata fields such as title, description, subject, and runtime for a selection of videos in IDL’s collection on the Internet Archive. This is a detailed and time-consuming task, which would be costly to perform for the entire collection. In contrast, our senior fellow is using transcriptions of the videos, which we have generated with Google’s AutoML tool, to explore different technologies to automatically extract the descriptive information. We’ll then compare the human-generated data with the machine-generated data to assess accuracy. The hope is that IDL can develop a workflow for using machine learning to create or improve metadata for many other videos in our collections.

Our Junior Data Science Fellows are Bryce Quintos and Adam Silva. Bryce and Adam are both participating in the San Francisco Unified School District (SFUSD) Career Pathway Summer Fellowship Program. This six-week program provides opportunities for high school students to gain work experience in a variety of industries and to expand their learning and skills outside of the classroom. Bryce and Adam are learning about programming and creating transcription for selected audiovisual materials. The IDL thanks SFUSD and its partners for running this program and providing sponsorship support for our fellows.

Noel Salmeron is our Senior Data Science Fellow participating in Life Science Cares Bay Area’s Project Onramp. Noel is using automated transcription tools to extract text from audiovisual files, run sentiment and topic analyses, and compare automated results to human transcription. Noel also provides guidance and mentoring to the Junior Fellows.

Our Fellows have shared a bit about themselves below. Please join us in recognizing Bryce, Adam, and Noel for their contributions to the UCSF Library this summer!

IDL-DSI Junior Data Science Fellow Bryce Quintos

Hi everyone! My name is Bryce Quintos and I am an incoming freshman at Boston University. I
hope to major in biochemistry and work in the biotechnology and pharmaceutical field. As someone who is interested in medical research and science, I am incredibly honored for the opportunity to help organize the Industry Documents Library at UCSF this summer and learn more about computer programming. I can’t wait to meet all of you!

IDL-DSI Junior Data Science Fellow Adam Silva

Hi, my name is Adam Silva and I am a Junior Intern for the UCSF Library. Currently, I am 17 years old and I am going into my senior year at Abraham Lincoln High School in San Francisco. I am part of Lincoln High School’s Dragon Boat team and I am also a part of Boy Scout Troop 15 in San Francisco. My favorite activities include cooking, camping, hiking, and backpacking. My favorite thing that I did in Boy Scouts was backpacking through Rae Lakes for a week. I am excited to work as a Junior Intern this year because working online rather than in person is new to me. I look forward to working with other employees and gaining the experience of working in a group.

IDL-DSI Senior Data Science Fellow Noel Salmeron

My name is Noel Salmeron and I am a third-year data science major and education minor at UC Berkeley. I’m excited to work with everyone this summer and looking forward to contributing to the Industry Documents Library!

Archives as Data Research Guide Now Available!

Posted on March 27, 2023 by Kathryn Stine

To help researchers in finding and understanding how to work with data from archival health sciences collections, we have compiled and published the Archives as Data research guide. “Archives as Data” refers to archival collection materials in digital form that can be shared, accessed, analyzed, and referenced as data. Using digital tools, researchers can work with archives as data to explore and evaluate characteristics of collection materials and analyze trends and connections within and across them.

*AIDS History Project Collections document included in the No More Silence dataset with Python code used for analysis.*

UCSF Archives and Special Collections makes data available from a number of our digital collections. Researchers will find information in the guide about accessing and using such data as well as descriptions of both the form and content this data takes. As well, you’ll find a growing set of links to to learning resources about various data analysis methods used to work with archives as data.

This new Archives as Data research guide provides researchers with a centralized resource hub with brief descriptions of collection materials as well as links to the datasets that have been prepared from them, including:

The No More Silence dataset, an aggregation of data from selected collections included in the AIDS History Project which range from the records of community activism groups to the papers of health researchers and journalists.
Data from the Industry Documents Library, comprising collections of documents from the tobacco, food, drug, fossil fuel, chemical, and opioid industries, all of which impact public health.
Selected datasets from the COVID Tracking Project, a volunteer organization launched from The Atlantic and dedicated to collecting and publishing the data required to understand the COVID-19 outbreak in the United States, with data collected from March 2020-March 2021.
Data from digitized UCSF University Publications, from course catalogs to annual reports, newsletters, and more.

We look forward to updating the guide as more data from UCSF Archives and Special Collections becomes available, and anticipate expanding to include links to “archives as data” of interest for digital health humanities work made available by other institutions and organizations.

To learn more about how we are making archives as data available at UCSF, check out recordings and resources from our recent sessions on Finding and Exploring Archives as Data for Digital Health Humanities!

The Archives as Data Research Guide has been published as part of the UCSF DIgital Health Humanities pilot program. Please reach out to the Digital Health Humanities Program Coordinator Kathryn Stine, at kathryn.stine@ucsf.edu with any questions about DHH at UCSF. The UCSF Digital Health Humanities Pilot is funded by the Academic Senate Chancellor’s Fund via the Committee on Library and Scholarly Communication.

How to Digitize 68,000 Pages of Documents

Posted on March 1, 2023 by Edith Escobedo

Guest post by Heather Wagner, Digitization Coordinator at UC Merced Library

For the Pioneering Child Studies project the UC Merced Library’s Digital Curation and Scholarship unit was tasked with digitizing 68,000 pages of documents. So, how do we go about digitizing 68,000 pages of documents? With some help. That help comes from four undergraduate student assistants who play an important part in the digitization process.

The first part of the process is the actual digitization. Our undergraduate student assistants digitize materials on a variety of equipment. These include high speed document scanners and flatbed scanners for documents, book scanners for bound material, and cameras on stands for oversize or fragile materials.

*Student Nicolas Fleming digitizing bound materials using a book scanner*

Once the digitization is complete, the next step is quality checking. Students review each image in Adobe Bridge and zoom in to check for issues such as lines in scans or items out of focus. Some images may need minor editing such as straightening and cropping which is completed during the quality checking step in Photoshop. The quality checking step is time consuming but necessary, so we are sure we are receiving the best possible results from digitization.

*Student Dathan Hansell quality checking digitized documents.*

PDFs with optical character recognition (OCR) are created from the digitized image files so they are accessible to users. OCR makes the PDF document searchable. The PDF documents are then quality checked by the students, and the documents are then optimized. Optimizing the PDF files reduces their file size, which makes them better suited for web viewing. The files are then ready for uploading.

We appreciate the hard work of our undergraduate student assistants. We would not be able to complete digitization projects of this size without them.

Dr. Leona Mayer Bayer Digital Collection Now Available

Posted on January 31, 2023 by Edith Escobedo

UCSF Archives and Special Collections is delighted to announce the publication of the Leona Mayer Bayer Correspondence digital collection on Calisphere. The digitization project is part of the NHPRC grant, Pioneering Child Studies: Digitizing and Providing Access to Collection of Women Physicians who Spearheaded Behavioral and Developmental Pediatrics. We worked in partnership with UC Merced Library’s Digital Assets Unit towards our goal of digitizing and publishing 68,000 pages from the collections of Drs. Hulda Evelyn Thelander, Helen Fahl Gofman, Selma Fraiberg, Leona Mayer Bayer, and Ms. Carol Hardgrove. To date we have digitized over 59,000 pages. Most digitized material is still undergoing quality assurance (QA) procedures. Here are some items we have digitized from Dr. Leona Mayer Bayer collection.

Dr. Leona Mayer Bayer

Dr. Leona Mayer Bayer received her MD from Stanford University Medical School in 1928. She worked with the Institute of Human Development in Berkeley and focused on child development, human growth, and psychology of sick children. The collection consists of around 400 digitized pages and the collection features professional correspondence of Dr. Leona Mayer Bayer. Some items that may be of interest is her correspondence with Dr. Hilde Bruch and her acceptance remarks for the PSR Broadstreet Pump Award she received in March of 1987.

Hilde Bruch correspondence, 1961. Leona Mayer Bayer Correspondence box 1, folder 1

In the next months we will digitize and soon publish our next four collections on Calisphere. Stay tuned for our next update

Alex Duryee Named New COVID Tracking Project Archive Lead

Posted on December 1, 2022 by Polina Ilieva

The UCSF Archives & Special Collections is delighted to welcome our new colleague, Alex Duryee who took over from Kevin Miller as the COVID Tracking Project Archive Lead. The project team continues the work of preserving, providing online access, and building educational resources for the organizational records and datasets of the COVID Tracking Project at The Atlantic (CTP).

Alex brings a background in metadata, digital archives, and archival access to the COVID Tracking Project Archive team. He holds a BA from The College of New Jersey and a MLIS from Rutgers University, and also serves as the Manager for Archival Metadata at the New York Public Library. In this position, he manages the Library’s archival metadata platforms and develops metadata policy for the Library’s archival collections. He also collaborates with staff across the organization to improve systems integrations and develop new methods for accessing and using archival materials. Alex also serves on the National Finding Aid Network (NAFAN) Technical Advisory Working Group, SAA’s Technical Subcommittee for Encoded Archival Standards, and as the chair of the SNAC (Social Networks and Archival Context) Technology & Infrastructure Working Group. He contributes to open-source projects such as ArchivesSpace, as well as developing open-source metadata tools. In 2019, his team was awarded the C. F. W. Coker Award for Archival Description by the Society of American Archivists.

Alex’s background also includes experience as a freelance ArchivesSpace developer, a consultant with AVP, and a digital archives fellow with Rhizome.

Alex enjoys puzzles of all sorts (including metadata), board games, baking, and dancing.

“Data for All, For Good, Forever”: Working Towards Sustainable Digital Preservation at the iPRES 2022 Conference

Posted on September 26, 2022 by Kate Tasker

The 18th International Conference on Digital Preservation (iPRES) took place from September 12-16, 2022, in Glasgow, Scotland. First convened in 2004 in Beijing, iPRES has been held on four different continents and aims to embrace “a variety of topics in digital preservation – from strategy to implementation, and from international and regional initiatives to small organisations.” Key values are inclusive dialogue and cooperative goals, which were very much centered in Glasgow thanks to the goodwill of the attendees, the conference code of conduct, and the significant efforts of the remarkable Digital Preservation Coalition (DPC), the iPRES 2022 organizational host.

I attended the conference in my role as the UCSF Industry Documents Library’s managing archivist to gain a better understanding of how other institutions are managing and preserving their rapidly-growing digital collections. For me and for many of the delegates, iPRES 2022 was the first opportunity since the COVID pandemic began to join an in-person conference for professional conversation and exchange. It will come as no surprise to say that gathering together was incredibly valuable and enjoyable (in no small part thanks to the traditional Scottish ceilidh dance which took place at the conference dinner!) The Program Committee also did a fantastic job designing an inclusive online experience for virtual attendees, with livestreamed talks, online social events, and collaborative session notes.

Session themes focused on Community, Environment, Innovation, Resilience, and Exchange. Keynotes were delivered by Amina Shah, the National Librarian of Scotland; Tamar Evangelestia-Dougherty, the inaugural director of the Smithsonian Libraries and Archives; and Steven Gonzalez Monserrate, an ethnographer of data centers and PhD Candidate in the History, Anthropology, Science, Technology & Society (HASTS) program at the Massachusetts Institute of Technology.

Every session I attended was excellent, informative, and thought-provoking. To highlight just a few:

Amina Shah’s keynote “Video Killed the Radio Star: Preserving a Nation’s Memory” (featuring the official 1980 music video by the Buggles!) focused on keeping up with the pace of change at the National Library of Scotland by engaging with new formats, new audiences, and new uses for collections. She noted that “expressing value in a key part of resilience” and that the cultural heritage community needs to talk about “why we’re doing digital preservation, not just how.” This was underscored by her description of our world as a place where the truth is under attack, that capturing the truth and finding a way to present it is crucial, and that it is also crucial that this work be done by people who aren’t trying to make a profit from it.

“Green Goes with Anything: Decreasing Environmental Impact of Digital Libraries at Virginia Tech,” a long paper presented by Alex Kinnaman as part of the wholly excellent Environment 1 session, examined existing digital library practices at Virginia Tech University Libraries, and explored changes in documentation and practice that will foster a more environmentally sustainable collections platform. These changes include choosing the least-energy consumptive hash algorithms (MD4 and MD5) for file fixity checks; choosing cloud storage providers based on their environmental practices; including environmental impact of a digital collection as part of appraisal criteria; and several other practical and actionable recommendations.

The Innovation 2 session included two short papers (by Pierre-yves Burgi, and by Euan Cochrane) and a fascinatingly futuristic panel discussion posing the question “Will DNA Form the Fabric of our Digital Preservation Storage?” (Also special mention to the Resilience 1 session which presented proposed solutions for preserving records of nuclear decommissioning and nuclear waste storage for the very long term – 10,000 years!)

Tamar Evangelestia-Dougherty’s keynote Digital Ties That Bind: Effectively Engaging With Communities For Equitable Digital Preservation Ecosystems was an electric presentation that called unequivocally for centering equity and inclusion within our digital ecosystems, and for recognizing, respecting, and making space for the knowledge and contributions of community archivists. She called out common missteps in digital preservation outreach to communities, and challenged all those listening to “get more people in the room” to include non-white, non-Western perspectives.

“’…provide a lasting legacy for Glasgow and the nation’: Two years of transferring Scottish Cabinet records to National Records of Scotland,” a short paper by Garth Stewart in the Innovation 4 session, touched on a number of challenges very familiar to the UCSF Industry Documents Library team! These included the transfer of a huge volume of recent and potentially sensitive digital documents, in redacted and unredacted form; a need to provide online access as quickly as possible; serving the needs of two major access audiences – the press, and the public; normalizing files to PDF in order to present them online; and dealing with incomplete or missing files.

And so much more, summarized by the final keynote speaker Steven Gonzalez Monserrate after his fantastical storytelling closing talk on the ecological impact of massive terrestrial data centers and what might come after “The Cloud” (underwater data centers? Clay tablets? Living DNA storage?). And, I didn’t even mention the Digital Preservation Bake Off Challenge…

After the conference I also had the opportunity to visit the Archives of the Royal College of Physicians and Surgeons of Glasgow, where our tour group was welcomed by the expert library staff and shown several fascinating items from their collections, including an 18th century Book of Herbal Remedies (which has been digitized for online access).

After five collaborative and collegial days in Glasgow, I’m looking forward to bringing these ideas back to our work with digital archival collections here at UCSF. Many thanks to iPRES, the DPC, the Program Committee, the speakers and presenters, and all the delegates for building this wonderful community for digital preservation!

An 18th-century Book of Herbal Remedies on display at the Archives of the Royal College of Physicians and Surgeons of Glasgow

Contextualizing Data for Researchers: A Data Science Fellowship Report

Posted on August 31, 2022 by Kate Tasker

This is a guest post from Lubov McKone, the Industry Documents Library’s 2022 Data Science Senior Fellow.

This summer, I served as the Industry Documents Library’s Senior Data Science Fellow. A bit about me – I’m currently pursuing my MLIS at Pratt Institute with a focus in research and data, and I’m hoping to work in library data services after I graduate. I was drawn to this opportunity because I wanted to learn how libraries are using data-related techniques and technologies in practice – and specifically, how they are contextualizing these for researchers.

Project Background

The UCSF Industry Documents Library is a vast collection of resources encompassing documents, images, videos, and recordings. These materials can be studied individually, but increasingly, researchers are interested in examining trends across whole collections, or subsets of it. In this way, the Industry Documents Library is also a trove of data that can be used to uncover trends and patterns in the history of industries impacting public health. In this project, the Industry Documents Library wanted to investigate what information is lost or changed when its collections are transformed into data.

There are many ways to generate data from digital collections. In this project we focused on a combination of collections metadata and computer-generated transcripts of video files. Like all information, data is not objective but constructed. Metadata is usually entered manually and is subject to human error. Video transcripts generated by computer programs are never 100% accurate. If accuracy varies based on factors such as the age of the video or the type of event being recorded, how might this impact conclusions drawn by researchers who are treating all video transcriptions as equally accurate? What guidance can the library provide to prevent researchers from drawing inaccurate conclusions from computer-generated text?

Project Team

Kate Tasker, Industry Documents Library Managing Archivist
Rebecca Tang, Industry Documents Library Applications Programmer
Geoffrey Boushey, Data Science Initiative Application Developer and Instructor
Lubov McKone, Senior Data Science Fellow
Lianne De Leon, Junior Data Science Fellow
Rogelio Murillo, Junior Data Science Fellow

Project Summary

Research Questions

Based on the background and the goals of the Industry Documents Library, the project team identified the following research questions to guide the project:

Taking into account factors such as year and runtime, how does computer transcription accuracy differ between television commercials and court proceedings?
How might transcription accuracy impact the conclusions drawn from the data?
What guidance can we give to researchers to prevent uninformed conclusions?

Uses

This project is a case study that evaluates the accuracy of computer-generated transcripts for videos within the Industry Documents Library’s Tobacco Collection. These findings provide a foundation for UCSF’s Industry Documents Library to create guidelines for researchers using video transcripts for text analysis. This case study also acts as a roadmap and a collection of instructional materials for similar studies to be conducted on other collections. These materials have been gathered in a public github repo, viewable here.

Sourcing the Right Data

At the beginning of the project, we worked with the Junior Fellows to determine the scope of the project. The tobacco video collection contains 5,249 videos that encompass interviews, commercials, court proceedings, press conferences, news broadcasts, and more. We wanted to narrow our scope to two categories that would illustrate potential disparities in transcript accuracy and meaning. After transcribing several videos by hand, the fellows proposed commercials and court proceedings as two categories that would suit our analysis. We felt 40 would be a reasonable sample size of videos to study, so each fellow selected 10 videos from each category, selecting videos with a range of years, quality, and runtimes. The fellows were selecting videos from a list that was generated by the InternetArchive python API, containing video links and metadata such as year and runtime.

Computer & Human Transcripts

Once the 40 videos were selected, we extracted transcripts from each URL using the Google AutoML API for transcription. We saved a copy of each computer transcription to use for the analysis, and provided another copy to the Junior Fellows, who edited them to accurately reflect the audio in the videos. We saved these copies as well for comparison to the computer-generated transcription.

Comparing Transcripts

To compare the computer and human transcripts, we conducted research on common metrics for transcript comparison. We came up with two broad categories to compare – accuracy and meaning.

To compare accuracy, we used the following metrics:

Word Error Rate – a measure of how many insertions, deletions, and substitutions are needed to convert the computer-generated transcript into the reference transcript. We subtracted this number from 1 to get the Word Accuracy Rate (WAR).
BLEU score – a more advanced algorithm measuring n-gram matches between the transcripts, normalized for n-gram frequency.
Human-evaluated accuracy – a score from Poor, Fair, Good, and Excellent assigned by the fellows as they were editing the computer-generated transcripts.
Google AutoML confidence score – a score generated by Google AutoML during transcript generation indicating how accurate Google believes its transcription to be.

To compare meaning, we used the following metrics:

Sentiment – We generated sentiment scores and magnitude for both sets of transcripts. We wanted to see whether the computer transcripts were under- or over- estimating sentiment, and whether this differed across categories.
Topic modeling – We ran a k-means topic model for two categories to see how closely the computer transcripts matched the pre-determined categories vs. how closely they were matched by the human transcripts

Findings & Recommendations

Relationships in the data

From an initial review of the significant correlations in the data, we gained some interesting insights. As shown in the correlation matrix, AutoML confidence score, fellow accuracy rating, and Word Accuracy Rate (WAR) are all significantly positively correlated. This means that the AutoML confidence score is a relatively good proxy for transcript accuracy. We recommend that researchers who are seeking to use computer-generated transcripts look to the AutoML confidence score to get a sense of the reliability of the computer-generated text they are working with.

Correlation matrix showing that AutoML confidence score, fellow accuracy rating, and Word Accuracy Rate (WAR) are all significantly positively correlated

We also found a significant positive correlation between year and fellow accuracy rating, Word Accuracy Rate, and AutoML confidence score – suggesting that the more recent the video, the better the quality. We suggest informing researchers that newer videos may generate more accurate computer transcriptions.

Transcript accuracy over time

One of the Junior Fellows suggested that we look into whether there is a specific cutoff year where transcripts become more accurate. As shown in the visual below, there’s a general improvement in transcription quality after the 1960s, but not a dramatic one. Interestingly, this trend disappears when looking at each video type separately.

Line graph showing transcript accuracy over time, separated into two categories: commercials and court proceedings

Transcript accuracy by video type

When comparing transcript accuracy between the two categories, we found that our expectations were challenged. We expected the accuracy of the advertising video transcripts to be higher, because advertisements generally have a higher production quality, and are less likely to have features like multiple people speaking over each other that could hinder transcription accuracy. However, we found that across most metrics, the court proceeding transcripts were more accurate. One potential reason for this is that commercials typically include some form of singing or more stylized speaking, which Google AutoML had trouble transcribing. We recommend informing researchers that video transcripts from media that contain singing or stylized speaking may be less accurate.

The one metric that the commercials were more accurate in was BLEU score, but this should be interpreted with caution. BLEU score is supposed to range from 0-1, but in our dataset its range was 0.0001 – 0.007. BLEU score is meant to be used on a corpus that is broken into sentences, because it works by aggregating n-gram accuracy on a sentence level, and then averaging the sentence-level accuracies across the corpus. However, the transcripts generated by Google AutoML did not contain any punctuation, so we were essentially calculating BLEU score on a corpus-length sentence for each transcript. This resulted in extremely small BLEU scores that may not be accurate or interpretable. For this reason, we don’t recommend the use of the BLEU score metric on transcripts generated by Google AutoML, or on other computer-generated transcripts that lack punctuation.

Transcript sentiment

We looked to sentiment scores to evaluate differences in meaning between the test and reference transcripts. As we expected, commercials, which are sponsored by the companies profiting off of the tobacco industry, tend to have a positive sentiment, while court proceedings, which tend to be brought against these companies, tend to have a negative sentiment. As shown in the plot to the left, the sentiment of the computer transcripts was a slight underestimation in both video types, though this was not too dramatic of an underestimation.

Graph comparing average sentiment scores from computer and human transcriptions of commercials and court proceedings

Opportunities for Further Research

Throughout this project, it was important to me to document my work and generate a research dataset that could be used by others interested in extended this work beyond my fellowship. There were many questions that we didn’t get a chance to investigate over the course of this summer, but my hope is that the work can be built upon – maybe even by a future fellow! This dataset lives in the project’s github repository under data/final_dataset.csv.

One aspect of the data that we did not investigate as much as we had hoped was topic modeling. This will likely be an important next step in assessing whether transcript meaning varies between the test and reference transcripts.

Professional Learnings & Insights

My main area of interest in the field of library data services is critical data literacy – how we as librarians can use conversations around data to build relationships and educate researchers about how data-related tools and technologies are not objective, but subject to the same pitfalls and biases as other research methods. Through my work as the Industry Documents Library Senior Data Science Fellow, I had the opportunity to work with a thoughtful team who is thinking ahead about how to responsibly guide researchers in the use of data.

Before this fellowship, I wasn’t sure exactly how opportunities to educate researchers around data would come up in a real library setting. Because I previously worked for the government, I tended to imagine researchers sourcing data from government open data portals such as NYCOpenData, or other public data sources. This fellowship opened my eyes to how often researchers might be using library collections themselves as data, and to the unique challenges and opportunities that can arise when contextualizing this “internal” data for researchers. As the collecting institution, you might have more information about why data is structured the way it is – for instance, the Industry Documents Library created the taxonomy for the archive’s “Topic” field. However, you are also often relying on hosting systems that you don’t have full control over. In the case of this project, there were several quirks of the Internet Archive API that made data analysis more complicated – for example, the video names and identifiers don’t always match. I can see how researchers might be confused about what the library does and does not have control over.

Another great aspect of this fellowship was the opportunity to work with our high school Junior Fellows, who were both exceptional to work with. Not only did they contribute the foundational work of editing our computer-generated transcripts – tedious and detail-oriented work – they also had really fresh insights about what we should analyze and what we should consider about the data. It was a highlight to support them and learn from them.

I also appreciated the opportunity to work with this very unique and important collection. Seeing the breadth of what is contained in the Industry Documents Library opened my eyes to not only the wealth of government information that exists outside of government entities, but also to the range of private sector information that ought to be accessible to the public. It’s amazing that an archive like the Industry Documents Library is also so invested in thinking critically about the technical tools that it’s reliant upon, but I guess it’s not such a surprise! Thanks to the whole team and to UCSF for a great summer fellowship experience!

Pioneering Child Studies: Digitizing and Providing Access to Collection of Women Physicians Who Spearheaded Behavioral and Developmental Pediatrics Update

Posted on August 30, 2022 by Edith Escobedo

We are at the one-year point of the project Pioneering Child Studies: Digitizing and Providing Access to Collection of Women Physicians who Spearheaded Behavioral and Developmental Pediatrics. UCSF Archives & Special Collections and UC Merced have made significant headway towards our goal of digitizing and publishing 68,000 pages from the collections of Drs. Hulda Evelyn Thelander, Helen Fahl Gofman, Selma Fraiberg, Leona Mayer Bayer, and Ms. Carol Hardgrove.

To date we have digitized over 33,000 pages. The digitized material are still undergoing quality assurance (QA) procedures. Here are some items we have digitized so far.

Dr. Leona Mayer Bayer

This collection features professional correspondence of Dr. Leona Mayer Bayer. Her work focused on child development and human growth and psychology of sick children.

MSS 86-54, Leona Mayer Bayer Correspondence, Carton 1, Folder 9, PSR Award acceptance speech

Dr. Selma Horwitz Fraiberg

This collection includes several drafts of her research papers on important aspects of developmental-behavioral pediatrics.

MSS 83-9, Selma Fraiberg Papers, Carton 14, Folder 8, draft of “Transference and Resistance in the Analysis of a Child with a Severe Behavior Disorder”

In the next year we will continue digitizing and will soon publish our collections on Calisphere. Stay tuned for our next update.

Brought to Light

Stories from UCSF Archives & Special Collections

Category Archives: Digital Projects