The 18th International Conference on Digital Preservation (iPRES) took place from September 12-16, 2022, in Glasgow, Scotland. First convened in 2004 in Beijing, iPRES has been held on four different continents and aims to embrace “a variety of topics in digital preservation – from strategy to implementation, and from international and regional initiatives to small organisations.” Key values are inclusive dialogue and cooperative goals, which were very much centered in Glasgow thanks to the goodwill of the attendees, the conference code of conduct, and the significant efforts of the remarkable Digital Preservation Coalition (DPC), the iPRES 2022 organizational host.
I attended the conference in my role as the UCSF Industry Documents Library’s managing archivist to gain a better understanding of how other institutions are managing and preserving their rapidly-growing digital collections. For me and for many of the delegates, iPRES 2022 was the first opportunity since the COVID pandemic began to join an in-person conference for professional conversation and exchange. It will come as no surprise to say that gathering together was incredibly valuable and enjoyable (in no small part thanks to the traditional Scottish ceilidh dance which took place at the conference dinner!) The Program Committee also did a fantastic job designing an inclusive online experience for virtual attendees, with livestreamed talks, online social events, and collaborative session notes.
Session themes focused on Community, Environment, Innovation, Resilience, and Exchange. Keynotes were delivered by Amina Shah, the National Librarian of Scotland; Tamar Evangelestia-Dougherty, the inaugural director of the Smithsonian Libraries and Archives; and Steven Gonzalez Monserrate, an ethnographer of data centers and PhD Candidate in the History, Anthropology, Science, Technology & Society (HASTS) program at the Massachusetts Institute of Technology.
Every session I attended was excellent, informative, and thought-provoking. To highlight just a few:
Amina Shah’s keynote “Video Killed the Radio Star: Preserving a Nation’s Memory” (featuring the official 1980 music video by the Buggles!) focused on keeping up with the pace of change at the National Library of Scotland by engaging with new formats, new audiences, and new uses for collections. She noted that “expressing value is a key part of resilience” and that the cultural heritage community needs to talk about “why we’re doing digital preservation, not just how.” She underscored this by describing our world as a place where the truth is under attack, where capturing the truth and finding a way to present it is crucial, and where it is equally crucial that this work be done by people who aren’t trying to make a profit from it.
“Green Goes with Anything: Decreasing Environmental Impact of Digital Libraries at Virginia Tech,” a long paper presented by Alex Kinnaman as part of the wholly excellent Environment 1 session, examined existing digital library practices at Virginia Tech University Libraries, and explored changes in documentation and practice that will foster a more environmentally sustainable collections platform. These changes include choosing the least energy-consumptive hash algorithms (MD4 and MD5) for file fixity checks; choosing cloud storage providers based on their environmental practices; including environmental impact of a digital collection as part of appraisal criteria; and several other practical and actionable recommendations.
The Innovation 2 session included two short papers (by Pierre-Yves Burgi, and by Euan Cochrane) and a fascinatingly futuristic panel discussion posing the question “Will DNA Form the Fabric of our Digital Preservation Storage?” (Also special mention to the Resilience 1 session which presented proposed solutions for preserving records of nuclear decommissioning and nuclear waste storage for the very long term – 10,000 years!)
Tamar Evangelestia-Dougherty’s keynote “Digital Ties That Bind: Effectively Engaging With Communities For Equitable Digital Preservation Ecosystems” was an electric presentation that called unequivocally for centering equity and inclusion within our digital ecosystems, and for recognizing, respecting, and making space for the knowledge and contributions of community archivists. She called out common missteps in digital preservation outreach to communities, and challenged all those listening to “get more people in the room” to include non-white, non-Western perspectives.
“’…provide a lasting legacy for Glasgow and the nation’: Two years of transferring Scottish Cabinet records to National Records of Scotland,” a short paper by Garth Stewart in the Innovation 4 session, touched on a number of challenges very familiar to the UCSF Industry Documents Library team! These included the transfer of a huge volume of recent and potentially sensitive digital documents, in redacted and unredacted form; a need to provide online access as quickly as possible; serving the needs of two major access audiences – the press, and the public; normalizing files to PDF in order to present them online; and dealing with incomplete or missing files.
After five collaborative and collegial days in Glasgow, I’m looking forward to bringing these ideas back to our work with digital archival collections here at UCSF. Many thanks to iPRES, the DPC, the Program Committee, the speakers and presenters, and all the delegates for building this wonderful community for digital preservation!
For the last 3 years, UCSF Archives & Special Collections has been working to provide access to our archival collections documenting the lived experience of the HIV/AIDS epidemic “as data.” That last part, “as data,” can be admittedly opaque. Is it not all data? Are all the digitized collections we post to Calisphere and elsewhere not “data” already? What we mean when we say this is that we are providing access to the collections, and their accompanying metadata, in bulk as one download. We’ve been talking about this work as the “No More Silence project.”
We’re excited to announce that this work now has a public home on our website, and that you can access almost all of the various outputs of this project in one single location — even better than a convenience store! That location is here: https://www.library.ucsf.edu/archives/aids/data-projects/.
One of the exciting things about the No More Silence project was the way it started with the goal to provide access to the collections as data, but quickly grew to include community building and teaching around this newly-created data resource. The naivete of our “build it and they will come” approach is something that deserves an entire blog post of its own, but the point here is that there were classes, webinars, and computer code that all ended up coming out of this project. We’re excited to be able to make all of those things available to whoever wants to use them as well.
In addition, this project benefitted immensely from the partnerships we were able to undertake with people outside of the UCSF campus community. It’s especially important to thank Dr. Clair Kronk, postdoctoral fellow in medical informatics at the Yale University School of Medicine and creator of the Gender, Sex, and Sexual Orientation Ontology, and Krü Maekdo, founder of the Black Lesbian Archives.
Check it all out here: https://www.library.ucsf.edu/archives/aids/data-projects/
This project was supported in part by the U.S. Institute of Museum and Library Services under the provisions of the Library Services and Technology Act, administered in California by the State Librarian. The opinions expressed herein do not necessarily reflect the position or policy of the U.S. Institute of Museum and Library Services or the California State Library, and no official endorsement by the U.S. Institute of Museum and Library Services or the California State Library should be inferred.
This is a guest post from Lubov McKone, the Industry Documents Library’s 2022 Data Science Senior Fellow.
This summer, I served as the Industry Documents Library’s Senior Data Science Fellow. A bit about me – I’m currently pursuing my MLIS at Pratt Institute with a focus in research and data, and I’m hoping to work in library data services after I graduate. I was drawn to this opportunity because I wanted to learn how libraries are using data-related techniques and technologies in practice – and specifically, how they are contextualizing these for researchers.
The UCSF Industry Documents Library is a vast collection of resources encompassing documents, images, videos, and recordings. These materials can be studied individually, but increasingly, researchers are interested in examining trends across whole collections, or subsets of them. In this way, the Industry Documents Library is also a trove of data that can be used to uncover trends and patterns in the history of industries impacting public health. In this project, the Industry Documents Library wanted to investigate what information is lost or changed when its collections are transformed into data.
There are many ways to generate data from digital collections. In this project we focused on a combination of collections metadata and computer-generated transcripts of video files. Like all information, data is not objective but constructed. Metadata is usually entered manually and is subject to human error. Video transcripts generated by computer programs are never 100% accurate. If accuracy varies based on factors such as the age of the video or the type of event being recorded, how might this impact conclusions drawn by researchers who are treating all video transcriptions as equally accurate? What guidance can the library provide to prevent researchers from drawing inaccurate conclusions from computer-generated text?
Kate Tasker, Industry Documents Library Managing Archivist
Rebecca Tang, Industry Documents Library Applications Programmer
Geoffrey Boushey, Data Science Initiative Application Developer and Instructor
Lubov McKone, Senior Data Science Fellow
Lianne De Leon, Junior Data Science Fellow
Rogelio Murillo, Junior Data Science Fellow
Based on the background and the goals of the Industry Documents Library, the project team identified the following research questions to guide the project:
Taking into account factors such as year and runtime, how does computer transcription accuracy differ between television commercials and court proceedings?
How might transcription accuracy impact the conclusions drawn from the data?
What guidance can we give to researchers to prevent uninformed conclusions?
This project is a case study that evaluates the accuracy of computer-generated transcripts for videos within the Industry Documents Library’s Tobacco Collection. These findings provide a foundation for UCSF’s Industry Documents Library to create guidelines for researchers using video transcripts for text analysis. This case study also acts as a roadmap and a collection of instructional materials for similar studies to be conducted on other collections. These materials have been gathered in a public github repo, viewable here.
Sourcing the Right Data
At the beginning of the project, we worked with the Junior Fellows to determine the scope of the project. The tobacco video collection contains 5,249 videos that encompass interviews, commercials, court proceedings, press conferences, news broadcasts, and more. We wanted to narrow our scope to two categories that would illustrate potential disparities in transcript accuracy and meaning. After transcribing several videos by hand, the fellows proposed commercials and court proceedings as two categories that would suit our analysis. We felt 40 would be a reasonable sample size of videos to study, so each fellow selected 10 videos from each category, choosing videos with a range of years, quality, and runtimes. The fellows selected videos from a list generated with the Internet Archive Python API, containing video links and metadata such as year and runtime.
Computer & Human Transcripts
Once the 40 videos were selected, we extracted transcripts from each URL using the Google AutoML API for transcription. We saved a copy of each computer transcription to use for the analysis, and provided another copy to the Junior Fellows, who edited them to accurately reflect the audio in the videos. We saved these copies as well for comparison to the computer-generated transcription.
To compare the computer and human transcripts, we conducted research on common metrics for transcript comparison. We came up with two broad categories to compare – accuracy and meaning.
To compare accuracy, we used the following metrics:
Word Error Rate – a measure of how many insertions, deletions, and substitutions are needed to convert the computer-generated transcript into the reference transcript. We subtracted this number from 1 to get the Word Accuracy Rate (WAR).
BLEU score – a more advanced algorithm measuring n-gram matches between the transcripts, normalized for n-gram frequency.
Human-evaluated accuracy – a score from Poor, Fair, Good, and Excellent assigned by the fellows as they were editing the computer-generated transcripts.
Google AutoML confidence score – a score generated by Google AutoML during transcript generation indicating how accurate Google believes its transcription to be.
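The Word Error Rate and Word Accuracy Rate described above can be sketched in a few lines of Python. This is a minimal illustration rather than the code used in the project: the function names are hypothetical, the tokenization (lowercasing and splitting on whitespace) is an assumption, and the edit count is divided by the number of words in the reference transcript, which is the usual convention for WER.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    # Assumed tokenization: lowercase, split on whitespace.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # reference word missing from hypothesis
                curr[j - 1] + 1,     # extra word in hypothesis
                prev[j - 1] + cost,  # substitution, or a match when cost is 0
            )
        prev = curr
    return prev[-1] / len(ref)


def word_accuracy_rate(reference: str, hypothesis: str) -> float:
    """WAR as defined in this post: 1 minus the Word Error Rate."""
    return 1.0 - word_error_rate(reference, hypothesis)
```

Because a computer transcript can contain more errors than the reference has words, WER can exceed 1, in which case WAR computed this way is negative.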
To compare meaning, we used the following metrics:
Sentiment – We generated sentiment scores and magnitude for both sets of transcripts. We wanted to see whether the computer transcripts were under- or overestimating sentiment, and whether this differed across categories.
Topic modeling – We ran a k-means topic model for two categories to see how closely the computer transcripts matched the pre-determined categories vs. how closely they were matched by the human transcripts.
Findings & Recommendations
Relationships in the data
From an initial review of the significant correlations in the data, we gained some interesting insights. As shown in the correlation matrix, AutoML confidence score, fellow accuracy rating, and Word Accuracy Rate (WAR) are all significantly positively correlated. This means that the AutoML confidence score is a relatively good proxy for transcript accuracy. We recommend that researchers who are seeking to use computer-generated transcripts look to the AutoML confidence score to get a sense of the reliability of the computer-generated text they are working with.
We also found a significant positive correlation between year and fellow accuracy rating, Word Accuracy Rate, and AutoML confidence score – suggesting that the more recent the video, the better the quality. We suggest informing researchers that newer videos may generate more accurate computer transcriptions.
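As a rough illustration of the kind of relationship described in this section, the sketch below computes a Pearson correlation coefficient between two columns of scores. It is not the project's analysis code: the function name is hypothetical, the numbers are made-up toy values rather than results from the dataset, and it reports only the coefficient, not a significance test.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2 values each")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined when one sequence is constant")
    return covariance / math.sqrt(spread_x * spread_y)


# Toy values for illustration only (NOT taken from the project dataset).
toy_confidence = [0.62, 0.71, 0.80, 0.88, 0.93]
toy_word_accuracy = [0.55, 0.60, 0.78, 0.81, 0.90]
toy_r = pearson_r(toy_confidence, toy_word_accuracy)
```

A coefficient close to 1 means the two scores rise together, which is the pattern that would make a confidence score a reasonable proxy for accuracy.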
Transcript accuracy over time
One of the Junior Fellows suggested that we look into whether there is a specific cutoff year where transcripts become more accurate. As shown in the visual below, there’s a general improvement in transcription quality after the 1960s, but not a dramatic one. Interestingly, this trend disappears when looking at each video type separately.
Transcript accuracy by video type
When comparing transcript accuracy between the two categories, we found that our expectations were challenged. We expected the accuracy of the advertising video transcripts to be higher, because advertisements generally have a higher production quality, and are less likely to have features like multiple people speaking over each other that could hinder transcription accuracy. However, we found that across most metrics, the court proceeding transcripts were more accurate. One potential reason for this is that commercials typically include some form of singing or more stylized speaking, which Google AutoML had trouble transcribing. We recommend informing researchers that video transcripts from media that contain singing or stylized speaking may be less accurate.
The one metric that the commercials were more accurate in was BLEU score, but this should be interpreted with caution. BLEU score is supposed to range from 0-1, but in our dataset its range was 0.0001 – 0.007. BLEU score is meant to be used on a corpus that is broken into sentences, because it works by aggregating n-gram accuracy on a sentence level, and then averaging the sentence-level accuracies across the corpus. However, the transcripts generated by Google AutoML did not contain any punctuation, so we were essentially calculating BLEU score on a corpus-length sentence for each transcript. This resulted in extremely small BLEU scores that may not be accurate or interpretable. For this reason, we don’t recommend the use of the BLEU score metric on transcripts generated by Google AutoML, or on other computer-generated transcripts that lack punctuation.
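To make the mechanics concrete, here is a minimal, unsmoothed sketch of BLEU computed on a single segment, which is effectively what happens when an unpunctuated transcript is scored as one long sentence. It is an assumption-laden illustration, not the implementation used in the project: the function names are hypothetical, tokenization is a simple lowercase whitespace split, and n-grams up to length 4 are weighted equally.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def single_segment_bleu(reference: str, candidate: str, max_n: int = 4) -> float:
    """Unsmoothed BLEU for one reference/candidate pair treated as one segment."""
    ref = reference.lower().split()
    cand = candidate.lower().split()
    if not ref or not cand:
        return 0.0

    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngram_counts(cand, n)
        ref_counts = ngram_counts(ref, n)
        total = sum(cand_counts.values())
        # Clip each candidate n-gram count by its count in the reference.
        matched = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
        if total == 0 or matched == 0:
            return 0.0  # without smoothing, one empty n-gram level zeroes the score
        log_precisions.append(math.log(matched / total))

    # Brevity penalty: penalize candidates no longer than the reference.
    if len(cand) > len(ref):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(ref) / len(cand))
    return brevity_penalty * math.exp(sum(log_precisions) / max_n)
```

Because the score is a geometric mean over n-gram lengths, a weak match at any one length pulls the whole score down, so the result is sensitive to how the text is segmented before scoring.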
We looked to sentiment scores to evaluate differences in meaning between the test and reference transcripts. As we expected, commercials, which are sponsored by the companies profiting off of the tobacco industry, tend to have a positive sentiment, while court proceedings, which tend to be brought against these companies, tend to have a negative sentiment. As shown in the plot to the left, the computer transcripts slightly underestimated sentiment in both video types, though not dramatically.
Opportunities for Further Research
Throughout this project, it was important to me to document my work and generate a research dataset that could be used by others interested in extending this work beyond my fellowship. There were many questions that we didn’t get a chance to investigate over the course of this summer, but my hope is that the work can be built upon – maybe even by a future fellow! This dataset lives in the project’s GitHub repository under data/final_dataset.csv.
One aspect of the data that we did not investigate as much as we had hoped was topic modeling. This will likely be an important next step in assessing whether transcript meaning varies between the test and reference transcripts.
Professional Learnings & Insights
My main area of interest in the field of library data services is critical data literacy – how we as librarians can use conversations around data to build relationships and educate researchers about how data-related tools and technologies are not objective, but subject to the same pitfalls and biases as other research methods. Through my work as the Industry Documents Library Senior Data Science Fellow, I had the opportunity to work with a thoughtful team who is thinking ahead about how to responsibly guide researchers in the use of data.
Before this fellowship, I wasn’t sure exactly how opportunities to educate researchers around data would come up in a real library setting. Because I previously worked for the government, I tended to imagine researchers sourcing data from government open data portals such as NYCOpenData, or other public data sources. This fellowship opened my eyes to how often researchers might be using library collections themselves as data, and to the unique challenges and opportunities that can arise when contextualizing this “internal” data for researchers. As the collecting institution, you might have more information about why data is structured the way it is – for instance, the Industry Documents Library created the taxonomy for the archive’s “Topic” field. However, you are also often relying on hosting systems that you don’t have full control over. In the case of this project, there were several quirks of the Internet Archive API that made data analysis more complicated – for example, the video names and identifiers don’t always match. I can see how researchers might be confused about what the library does and does not have control over.
Another great aspect of this fellowship was the opportunity to work with our high school Junior Fellows, who were both exceptional to work with. Not only did they contribute the foundational work of editing our computer-generated transcripts – tedious and detail-oriented work – they also had really fresh insights about what we should analyze and what we should consider about the data. It was a highlight to support them and learn from them.
I also appreciated the opportunity to work with this very unique and important collection. Seeing the breadth of what is contained in the Industry Documents Library opened my eyes to not only the wealth of government information that exists outside of government entities, but also to the range of private sector information that ought to be accessible to the public. It’s amazing that an archive like the Industry Documents Library is also so invested in thinking critically about the technical tools that it’s reliant upon, but I guess it’s not such a surprise! Thanks to the whole team and to UCSF for a great summer fellowship experience!
We are at the one-year point of the project Pioneering Child Studies: Digitizing and Providing Access to Collection of Women Physicians who Spearheaded Behavioral and Developmental Pediatrics. UCSF Archives & Special Collections and UC Merced have made significant headway towards our goal of digitizing and publishing 68,000 pages from the collections of Drs. Hulda Evelyn Thelander, Helen Fahl Gofman, Selma Fraiberg, Leona Mayer Bayer, and Ms. Carol Hardgrove.
To date we have digitized over 33,000 pages. The digitized materials are still undergoing quality assurance (QA) procedures. Here are some items we have digitized so far.
Dr. Leona Mayer Bayer
This collection features professional correspondence of Dr. Leona Mayer Bayer. Her work focused on child development and human growth and psychology of sick children.
Dr. Selma Horwitz Fraiberg
This collection includes several drafts of her research papers on important aspects of developmental-behavioral pediatrics.
In the next year we will continue digitizing and will soon publish our collections on Calisphere. Stay tuned for our next update.
The COVID Tracking Project Archive has several unique challenges, namely how to preserve unique, born-digital materials in formats that will be easily accessible to researchers far in the future. Tools like Twitter, Instagram, and Slack are constantly changing their interfaces, making preservation difficult.
To make the job of the archive and other archivists easier, the COVID Tracking Project is releasing several tools we have developed to preserve these digital formats through our GitHub organization. These include:
Twitter Preserver – A tool to convert the downloaded Zip file a user gets from Twitter into stable HTML files. This includes Direct Messages as well as public Tweets and Favorites. View a preview of the output of this tool.
XLSX Bulk Converted – A Python script that will bulk-convert Excel files into folders of CSV files, one file per worksheet.
Instagram Preserver — A tool that logs into Instagram and downloads all the feed data and images from another account. Instagram is particularly difficult to access without logging in, so this tool uses an internal API to access the user’s feed.
Donated by her husband, Dennis Hirschfelder, the Arlene Hirschfelder Collection was accessioned into the UCSF Archives & Special Collections this year. Arlene Hirschfelder was an educator and scholar who authored numerous books and other resource materials on tobacco control specifically as it relates to teenagers and young adults. She passed away in 2021.
The collection contains tobacco control resource materials that Hirschfelder authored such as A Century of Tobacco & Smoking (1998) which chronicles US tobacco history from the 1870s to 1990s with a focus on the marketing strategies of the tobacco industry.
The collection also includes materials that she assembled such as an anti-smoking board game, posters, cigarettes, candy cigarettes, and other ephemera.
After a four-year break, last semester the archives team hosted a History of Health Sciences course, the Anatomy of an Archive. This course was developed and co-taught by the Department of Humanities and Social Sciences Associate Professor, Aimee Medeiros and Associate University Librarian for Collections and UCSF Archivist, Polina Ilieva. Charlie Macquarie, Digital Archivist, facilitated the discussion on Digital Projects. Polina, Peggy Tran-Le, Research and Technical Services Managing Archivist, and Edith Escobedo, Processing Archivist, served as mentors for students’ processing projects throughout the duration of the course.
The goal of this course was to provide an overview of archival science with an emphasis on the theory, methodology, technologies and best practices of archival research, arrangement and description. The archivists put together a list of collections requiring processing and also corresponding to students’ research interests and each student selected one that they worked on with their mentor to arrange and create a finding aid. During this 11-week hybrid course students developed competencies related to researching and describing archival collections, as well as interpreting the historical record. At the conclusion of this course students wrote a story about their experiences highlighting collections they processed. In the next few weeks, we will be sharing these stories with you.
This week’s story comes from Alexzandria Simon, PhD student, UCSF Department of Humanities & Social Sciences.
Post by Alexzandria Simon:
Having never stepped into any kind of archival space or discussion, I was excited to engage with, learn about, and understand what the archives are and mean. Now, after working with Polina Ilieva and Aimee Medeiros at UCSF, I realize all the intricacies, time, and special attention that go into the archival collection process. There are practices and standards that guide researchers and archivists, and emotions and ethics play a role in shaping collections and entire archives. The journey of processing a collection is time consuming, interdisciplinary, and sometimes messy. However, the craft of processing a collection allows individuals to discover new characters, information, and stories that take place during a different time and space.
When I saw my collection for the first time, all I could think to myself was how small the collection is. I was surprised after seeing others, some that consisted of 5 boxes worth of documents, that the one I was planning to work on could be confined to one file. I could not begin to comprehend how the file could tell such a large story. I began flipping through all the documents, photographs, and pamphlets and skimming through the letters and correspondence trying to put all the pieces together. The file had no organizational layout, and so my priority was to put everything in chronological order. I wanted to understand the starting point and the ending point. What I came to discover is that collections do not always have a solid beginning or conclusion. Stories sometimes begin right in the middle and then end abruptly, leaving many questions.
Figure 1: “Physician – Patient – Pastor” Pamphlet, San Francisco Medical Center, May 1961, Chaplaincy Services at UCSF, MSS 22-03.
The Chaplaincy Services at UCSF Collection began with correspondence between UCSF administrators interested in starting a chaplaincy program. They sought to understand how chaplains, priests, and rabbis could have a role in their hospital space and provide services to patients. What they came to learn and understand, from informational pamphlets, is that the connection between chaplains and patients is a powerful one. Chaplains offer judgement-free support and a space for patients’ needs around belief and repentance. When a patient is alone with no family members or loved ones, they can call upon their religion to give them a person of guidance and care.
Figure 2: Installation Service Program for Reverend Elmer Laursen, S.T.M., Lutheran Welfare Service of Northern California, September 18, 1960, Chaplaincy Services at UCSF, MSS 22-03.
These discussions would ultimately lead to the establishment of a Clinical Pastoral Education Program initiated and headed by Reverend Elmer Laursen, S.T.M. Reverend Laursen was a prominent figure in the Chaplaincy Services at UCSF and established clinical pastoral work as necessary for patient care. Reverend Laursen engaged in public outreach, fundraising, patient and student advocacy, and building relationships with other colleges and hospitals. His work inspired other pastors, reverends, and religious officials to begin implementing clinical pastoral education programs to develop student learning and patient care. He believed that pastoral care is imperative to patient care. Patients deal with challenging, and sometimes traumatizing and scary, medical procedures. The Chaplaincy Program could offer solitude and peace for patients who have no one else to call on. Chaplaincy Programs offer a humanistic approach to patient care in a field that is saturated with data, clinicians, and the medical unknown.
Figure 3: Group Photo of Chaplains, Reverends, Nuns, and Administrators at the 21st Anniversary Celebration of Chaplaincy Training Event, September 1982, Chaplaincy Services at UCSF, MSS 22-03.
After reading through the collection, I began dividing the documents into subject folders. These consisted of “Chaplaincy Service Materials,” “Pamphlets & Booklets,” “Funding,” “Chaplaincy Facility Space,” “Chaplain Elmer Laursen,” “Correspondence – August 1959 – September 1974,” “Photographs,” “21st Anniversary Celebration,” and “Rabbi Services.” Through these folders, the collection is now organized in a way that lets researchers and others trace the narrative. While I was processing the collection, I kept reminding myself to make the finding aid easy and accessible. I want anyone, scholar or not, to be able to open the finding aid or file and know what the collection includes. It is difficult to not let the records overwhelm you with tiny details. It is difficult to not get lost in every aspect of a collection. I found gaps in the correspondence, and every time I read something new, I seemed to come up with more questions. However, I believe that to be a part of the journey and work of archivists and scholars. We are always left wanting more. The documents in the collection are only a portion of the much larger story around Chaplaincy Services at UCSF, and an even more minuscule part of the larger history around religion and hospitals.
The Archives and Special Collections reading room is now open Wednesday–Friday from 9 am–noon and 1–4 pm by appointment only. For non-UCSF visitors, please see the following information:
Request materials and make appointments using our new request system; it’s easy to request materials and make reading room appointments. After an initial sign-up, you can track your requests and appointments.
Requirements for access to the reading room: UCSF Library facilities are currently open only to those with a UCSF ID. External researchers can make appointments to review materials in the Archives & Special Collections reading room. At the time of appointment, visitors will be met at the entrance to the library by the archives staff and accompanied to the reading room. Any individuals visiting the UCSF campus facilities are required to follow UCSF campus guest requirements.
We are excited to introduce Kathryn Stine who joined the Archives & Special Collections team as a Digital Health Humanities Program Coordinator. This position will support development and day-to-day operations of a new Digital Health Humanities Pilot. The goal of this initiative is to guide and support faculty in their engagement with digital tools and methods to facilitate interdisciplinary scholarship that will advance understanding of the profound effects of illness and disease on patients, health professionals, and the social worlds in which they live and work.
Kathryn Stine has an extensive background in developing and providing access to digital collections. Her experience includes nearly 10 years working for the University of California system at the California Digital Library (CDL) in various roles, most recently as the Senior Product Manager for Digitization & Digital Content. In that position, Kathryn managed the team that supports and coordinates the University of California Libraries’ engagement with HathiTrust and mass digitization activities.
Prior to joining CDL, Kathryn held several positions at the University of Illinois at Chicago, where she led the university archives program and managed special collections processing.
Kathryn is deeply experienced in developing and managing cross-institutional and cross-departmental library projects and building communities across diverse functions and perspectives. Her work at CDL included managing and contributing to both investigative and operations-focused systemwide project teams, coordinating web archiving initiatives, advising for the UC Berkeley digital lifecycle program, and leading a team of developers and analysts to launch, maintain, and enhance a metadata management system for and with HathiTrust. She is motivated by supporting cross-functional teams in bringing both collaboration and creativity to common purpose.
Working with (meta)data is a throughline in Kathryn’s career, and she is enthusiastic about encouraging new ways of deriving and analyzing collections data in support of innovative digital research. In developing and leading workshops and providing project consultation, Kathryn has found working with researchers to make the most of digital collections to be incredibly rewarding. She is very excited to be joining the UCSF Library for the opportunities to work with researchers, technologists, and archivists to match health humanities research inquiry to relevant collections, digital analysis methods, and technical tools.
Kathryn loves a good metadata challenge to puzzle through, and also enjoys improvisational cooking, garment sewing, and getting outdoors with her family, especially to camp and open-water swim.
The Industry Documents Library (IDL) is excited to welcome three Data Science Fellows to our team this summer. The Data Science Fellows will be working with the IDL and with the UCSF Library Data Science Initiative (DSI) to assess the impact of transcription accuracy on text analysis of digital archives, using the IDL collections.
Through tagging, human transcription, and computer-generated transcription, the team will assess how accuracy differs between media or document types, and whether this difference is more or less pronounced in certain categories of media (for example, video recordings of focus groups, community meetings, court proceedings, or TV commercials, all of which are present in the IDL’s video collections). After measuring transcript accuracy across different media types, we aim to provide guidelines to researchers and technical staff for proper analysis, measurement, and reporting of transcript accuracy when working with digital media.
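For readers curious what measuring transcript accuracy can look like in practice, here is a minimal sketch of one widely used metric, word error rate (WER), which counts the word-level substitutions, insertions, and deletions needed to turn an automated transcript into the human one. The project's actual metrics and tools are not specified here, so treat this as an illustration only: the function name and the sample sentences are hypothetical, and the simple lowercase-and-split normalization is an assumption.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    `reference` stands in for a human transcript and `hypothesis` for an
    automated one. Normalization (lowercasing, whitespace splitting) is a
    simplifying assumption, not a recommendation.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words (classic Levenshtein dynamic programming).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Hypothetical sample sentences, not drawn from the IDL collections.
    human = "smoking rates fell last year"
    automated = "smoking rate fell last year"
    print(word_error_rate(human, automated))
```

A lower score means the automated transcript is closer to the human one. Because the score is divided by the length of the human transcript, it can exceed 1.0 when the automated transcript contains many extra words.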
Our Junior Data Science Fellows are Rogelio Murillo and Lianne De Leon. Rogelio and Lianne are both participating in the San Francisco Unified School District (SFUSD) Career Pathway Summer Fellowship Program. This six-week program provides opportunities for high school students to gain work experience in a variety of industries and to expand their learning and skills outside of the classroom. Lianne and Rogelio will be learning about programming and creating transcriptions for selected audiovisual materials. The IDL thanks SFUSD and its partners for running this program and providing sponsorship support for our fellows.
Lubov McKone is our Senior Data Science Fellow and will be using automated transcription tools to extract text from audiovisual files, run sentiment and topic analyses, and compare automated results to human transcription. Lubov will also provide guidance and mentoring to the Junior Fellows.
Our Fellows have introduced themselves below. Please join us in welcoming Rogelio, Lianne, and Lubov to the UCSF Library this summer!
Hi, my name is Lianne R. de Leon and I go to Phillip and Sala Burton High School as a rising senior. I love playing volleyball in my free time and you may see me at numerous open gyms around the city. In the future I hope to major in computer science or computer engineering. I’m looking forward to meeting many wonderful people here at UCSF and learning more about the data science industry from the inside.
Hi, my name is Rogelio Murillo and I’m a rising junior at Ruth Asawa School of the Arts. I enjoy playing a variety of music and percussion. I’ve played Japanese Taiko, Afro Brazilian drumming, and Latin Jazz. I’m also learning guitar over the summer. I’m a responsible and respectful person.
My name is Lubov McKone and I’m currently pursuing my Master’s in Library and Information Science at Pratt Institute in Brooklyn, NY. I also hold a Bachelor’s degree in Statistics, and prior to entering graduate school I worked as a data analyst in local government. My professional interests include supporting researchers in the accurate and responsible use of data, and I aspire to work as a data librarian in an academic library after graduation. Outside of work, I spend my time cooking, doing yoga, and writing music. I’m very excited to be joining the UCSF Industry Documents Library this summer, and I’m looking forward to learning more about how researchers use digital collections!