Student Fellows Explore Machine Learning with UCSF Industry Documents Library and Data Science Initiative

The UCSF Industry Documents Library (IDL) and Data Science Initiative (DSI) teams are excited to be working with three Data Science Fellows this summer. The Data Science Fellows are part of a joint IDL-DSI project to explore machine learning technologies to create and enhance descriptive metadata for thousands of audio and video recordings in IDL’s archival collections.  This year’s summer program includes two junior fellows and one senior fellow.

Our junior fellows are tasked with manually assigning or improving metadata fields such as title, description, subject, and runtime for a selection of videos in IDL’s collection on the Internet Archive. This is a detailed and time-consuming task, which would be costly to perform for the entire collection. In contrast, our senior fellow is using transcriptions of the videos, which we have generated with Google’s AutoML tool, to explore different technologies to automatically extract the descriptive information. We’ll then compare the human-generated data with the machine-generated data to assess accuracy.  The hope is that IDL can develop a workflow for using machine learning to create or improve metadata for many other videos in our collections.

Our Junior Data Science Fellows are Bryce Quintos and Adam Silva. Bryce and Adam are both participating in the San Francisco Unified School District (SFUSD) Career Pathway Summer Fellowship Program. This six-week program provides opportunities for high school students to gain work experience in a variety of industries and to expand their learning and skills outside of the classroom. Bryce and Adam are learning about programming and creating transcription for selected audiovisual materials. The IDL thanks SFUSD and its partners for running this program and providing sponsorship support for our fellows.

Noel Salmeron is our Senior Data Science Fellow participating in Life Science Cares Bay Area’s Project Onramp. Noel is using automated transcription tools to extract text from audiovisual files, run sentiment and topic analyses, and compare automated results to human transcription. Noel also provides guidance and mentoring to the Junior Fellows.

Our Fellows have shared a bit about themselves below. Please join us in recognizing Bryce, Adam, and Noel for their contributions to the UCSF Library this summer!

IDL-DSI Junior Data Science Fellow Bryce Quintos

Hi everyone! My name is Bryce Quintos and I am an incoming freshman at Boston University. I hope to major in biochemistry and work in the biotechnology and pharmaceutical field. As someone who is interested in medical research and science, I am incredibly honored for the opportunity to help organize the Industry Documents Library at UCSF this summer and learn more about computer programming. I can’t wait to meet all of you!

IDL-DSI Junior Data Science Fellow Adam Silva

Hi, my name is Adam Silva and I am a Junior Intern for the UCSF Library. Currently, I am 17 years old and I am going into my senior year at Abraham Lincoln High School in San Francisco. I am part of Lincoln High School’s Dragon Boat team and I am also a part of Boy Scout Troop 15 in San Francisco. My favorite activities include cooking, camping, hiking, and backpacking. My favorite thing that I did in Boy Scouts was backpacking through Rae Lakes for a week. I am excited to work as a Junior Intern this year because working online rather than in person is new to me. I look forward to working with other employees and gaining the experience of working in a group.

IDL-DSI Senior Data Science Fellow Noel Salmeron

My name is Noel Salmeron and I am a third-year data science major and education minor at UC Berkeley. I’m excited to work with everyone this summer and am looking forward to contributing to the Industry Documents Library!

Contextualizing Data for Researchers: A Data Science Fellowship Report

This is a guest post from Lubov McKone, the Industry Documents Library’s 2022 Data Science Senior Fellow.

This summer, I served as the Industry Documents Library’s Senior Data Science Fellow. A bit about me – I’m currently pursuing my MLIS at Pratt Institute with a focus in research and data, and I’m hoping to work in library data services after I graduate. I was drawn to this opportunity because I wanted to learn how libraries are using data-related techniques and technologies in practice – and specifically, how they are contextualizing these for researchers.

Project Background

The UCSF Industry Documents Library is a vast collection of resources encompassing documents, images, videos, and recordings. These materials can be studied individually, but increasingly, researchers are interested in examining trends across whole collections, or subsets of them. In this way, the Industry Documents Library is also a trove of data that can be used to uncover trends and patterns in the history of industries impacting public health. In this project, the Industry Documents Library wanted to investigate what information is lost or changed when its collections are transformed into data.

There are many ways to generate data from digital collections. In this project we focused on a combination of collections metadata and computer-generated transcripts of video files. Like all information, data is not objective but constructed. Metadata is usually entered manually and is subject to human error. Video transcripts generated by computer programs are never 100% accurate. If accuracy varies based on factors such as the age of the video or the type of event being recorded, how might this impact conclusions drawn by researchers who are treating all video transcriptions as equally accurate? What guidance can the library provide to prevent researchers from drawing inaccurate conclusions from computer-generated text?

Project Team

  • Kate Tasker, Industry Documents Library Managing Archivist
  • Rebecca Tang, Industry Documents Library Applications Programmer
  • Geoffrey Boushey, Data Science Initiative Application Developer and Instructor
  • Lubov McKone, Senior Data Science Fellow
  • Lianne De Leon, Junior Data Science Fellow
  • Rogelio Murillo, Junior Data Science Fellow

Project Summary

Research Questions

Based on the background and the goals of the Industry Documents Library, the project team identified the following research questions to guide the project:

  • Taking into account factors such as year and runtime, how does computer transcription accuracy differ between television commercials and court proceedings?
  • How might transcription accuracy impact the conclusions drawn from the data? 
  • What guidance can we give to researchers to prevent uninformed conclusions?

Uses

This project is a case study that evaluates the accuracy of computer-generated transcripts for videos within the Industry Documents Library’s Tobacco Collection. These findings provide a foundation for UCSF’s Industry Documents Library to create guidelines for researchers using video transcripts for text analysis. This case study also acts as a roadmap and a collection of instructional materials for similar studies to be conducted on other collections. These materials have been gathered in a public GitHub repository.

Sourcing the Right Data

At the beginning of the project, we worked with the Junior Fellows to determine the scope of the project. The tobacco video collection contains 5,249 videos that encompass interviews, commercials, court proceedings, press conferences, news broadcasts, and more. We wanted to narrow our scope to two categories that would illustrate potential disparities in transcript accuracy and meaning. After transcribing several videos by hand, the fellows proposed commercials and court proceedings as two categories that would suit our analysis. We felt 40 would be a reasonable sample size of videos to study, so each fellow selected 10 videos from each category, choosing videos with a range of years, quality, and runtimes. The fellows selected videos from a list generated with the Internet Archive Python API, which contained video links and metadata such as year and runtime.

Computer & Human Transcripts

Once the 40 videos were selected, we extracted transcripts from each URL using the Google AutoML API for transcription. We saved a copy of each computer transcription to use for the analysis, and provided another copy to the Junior Fellows, who edited them to accurately reflect the audio in the videos. We saved these copies as well for comparison to the computer-generated transcription.

Comparing Transcripts

To compare the computer and human transcripts, we conducted research on common metrics for transcript comparison. We came up with two broad categories to compare – accuracy and meaning. 

To compare accuracy, we used the following metrics:

  • Word Error Rate (WER) – the number of word insertions, deletions, and substitutions needed to convert the computer-generated transcript into the reference transcript, divided by the number of words in the reference transcript. We subtracted this number from 1 to get the Word Accuracy Rate (WAR).
  • BLEU score – a more advanced algorithm measuring n-gram matches between the transcripts, normalized for n-gram frequency.
  • Human-evaluated accuracy – a rating of Poor, Fair, Good, or Excellent assigned by the fellows as they were editing the computer-generated transcripts.
  • Google AutoML confidence score – a score generated by Google AutoML during transcript generation indicating how accurate Google believes its transcription to be.
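To make the Word Error Rate and Word Accuracy Rate concrete, here is a minimal sketch in plain Python. It is not the project's actual code: the function names are hypothetical, and it assumes transcripts are compared as lowercased, whitespace-separated words with no other normalization.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Assumption: words are lowercased and split on whitespace; the
    project may have normalized its transcripts differently.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # word missing from the hypothesis
                curr[j - 1] + 1,     # extra word in the hypothesis
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


def word_accuracy_rate(reference: str, hypothesis: str) -> float:
    """WAR as described above: 1 minus the Word Error Rate."""
    return 1.0 - word_error_rate(reference, hypothesis)
```

Because the error count is divided by the length of the reference, a computer transcript with many extra words can have a WER above 1, which makes the WAR negative.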

To compare meaning, we used the following metrics:

  • Sentiment – We generated sentiment scores and magnitude for both sets of transcripts. We wanted to see whether the computer transcripts were under- or overestimating sentiment, and whether this differed across categories.
  • Topic modeling – We ran a k-means topic model for two categories to see how closely the computer transcripts matched the pre-determined categories vs. how closely they were matched by the human transcripts.

Findings & Recommendations

Relationships in the data

From an initial review of the significant correlations in the data, we gained some interesting insights. As shown in the correlation matrix, AutoML confidence score, fellow accuracy rating, and Word Accuracy Rate (WAR) are all significantly positively correlated. This means that the AutoML confidence score is a relatively good proxy for transcript accuracy. We recommend that researchers who are seeking to use computer-generated transcripts look to the AutoML confidence score to get a sense of the reliability of the computer-generated text they are working with.

Correlation matrix showing that AutoML confidence score, fellow accuracy rating, and Word Accuracy Rate (WAR) are all significantly positively correlated

We also found a significant positive correlation between year and fellow accuracy rating, Word Accuracy Rate, and AutoML confidence score – suggesting that the more recent the video, the better the quality. We suggest informing researchers that newer videos may generate more accurate computer transcriptions.
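For readers who want to run a similar check on their own data, here is a minimal sketch of a correlation coefficient in plain Python. The post does not say which correlation method or significance test was used, so Pearson's r is an assumption, and the function name is hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences.

    Assumption: Pearson's r is used here for illustration only; the
    original analysis may have used a different method.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length (at least 2 values)")

    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)

    # Sum of products of deviations, and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when a sequence is constant")

    return cov / math.sqrt(var_x * var_y)
```

A value near 1 indicates a strong positive linear relationship, for example between a video's year and its Word Accuracy Rate. This sketch does not test for statistical significance, which with a sample of 40 videos would need a separate step.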

Transcript accuracy over time

One of the Junior Fellows suggested that we look into whether there is a specific cutoff year where transcripts become more accurate. As shown in the visual below, there’s a general improvement in transcription quality after the 1960s, but not a dramatic one. Interestingly, this trend disappears when looking at each video type separately.

Line graph showing transcript accuracy over time for all video types
Line graph showing transcript accuracy over time, separated into two categories: commercials and court proceedings

Transcript accuracy by video type

Bar graphs showing transcript accuracy by video type (commercials and court proceedings) according to four ratings: AutoML Confidence Average; Bleu Score; Fellow Accuracy Rating; and Word Accuracy Rate (WAR)

When comparing transcript accuracy between the two categories, we found that our expectations were challenged. We expected the accuracy of the advertising video transcripts to be higher, because advertisements generally have a higher production quality, and are less likely to have features like multiple people speaking over each other that could hinder transcription accuracy. However, we found that across most metrics, the court proceeding transcripts were more accurate. One potential reason for this is that commercials typically include some form of singing or more stylized speaking, which Google AutoML had trouble transcribing. We recommend informing researchers that video transcripts from media that contain singing or stylized speaking may be less accurate.

The one metric on which the commercials scored higher was BLEU score, but this should be interpreted with caution. BLEU score is supposed to range from 0 to 1, but in our dataset its range was 0.0001 to 0.007. BLEU score is meant to be used on a corpus that is broken into sentences, because it works by aggregating n-gram accuracy on a sentence level, and then averaging the sentence-level accuracies across the corpus. However, the transcripts generated by Google AutoML did not contain any punctuation, so we were essentially calculating BLEU score on a corpus-length sentence for each transcript. This resulted in extremely small BLEU scores that may not be accurate or interpretable. For this reason, we don’t recommend the use of the BLEU score metric on transcripts generated by Google AutoML, or on other computer-generated transcripts that lack punctuation.

Transcript sentiment

We looked to sentiment scores to evaluate differences in meaning between the test and reference transcripts. As we expected, commercials, which are sponsored by the companies profiting off of the tobacco industry, tend to have a positive sentiment, while court proceedings, which tend to be brought against these companies, tend to have a negative sentiment. As shown in the plot below, the computer transcripts slightly underestimated sentiment for both video types, though the underestimation was not dramatic.

Graph comparing average sentiment scores from computer and human transcriptions of commercials and court proceedings

Opportunities for Further Research

Throughout this project, it was important to me to document my work and generate a research dataset that could be used by others interested in extending this work beyond my fellowship. There were many questions that we didn’t get a chance to investigate over the course of this summer, but my hope is that the work can be built upon – maybe even by a future fellow! This dataset lives in the project’s GitHub repository under data/final_dataset.csv.

One aspect of the data that we did not investigate as much as we had hoped was topic modeling. This will likely be an important next step in assessing whether transcript meaning varies between the test and reference transcripts.

Professional Learnings & Insights

My main area of interest in the field of library data services is critical data literacy – how we as librarians can use conversations around data to build relationships and educate researchers about how data-related tools and technologies are not objective, but subject to the same pitfalls and biases as other research methods. Through my work as the Industry Documents Library Senior Data Science Fellow, I had the opportunity to work with a thoughtful team who is thinking ahead about how to responsibly guide researchers in the use of data. 

Before this fellowship, I wasn’t sure exactly how opportunities to educate researchers around data would come up in a real library setting. Because I previously worked for the government, I tended to imagine researchers sourcing data from government open data portals such as NYCOpenData, or other public data sources. This fellowship opened my eyes to how often researchers might be using library collections themselves as data, and to the unique challenges and opportunities that can arise when contextualizing this “internal” data for researchers. As the collecting institution, you might have more information about why data is structured the way it is – for instance, the Industry Documents Library created the taxonomy for the archive’s “Topic” field. However, you are also often relying on hosting systems that you don’t have full control over. In the case of this project, there were several quirks of the Internet Archive API that made data analysis more complicated – for example, the video names and identifiers don’t always match. I can see how researchers might be confused about what the library does and does not have control over.

Another great aspect of this fellowship was the opportunity to work with our high school Junior Fellows, who were both exceptional to work with. Not only did they contribute the foundational work of editing our computer-generated transcripts – tedious and detail-oriented work – they also had really fresh insights about what we should analyze and what we should consider about the data. It was a highlight to support them and learn from them.

I also appreciated the opportunity to work with this very unique and important collection. Seeing the breadth of what is contained in the Industry Documents Library opened my eyes to not only the wealth of government information that exists outside of government entities, but also to the range of private sector information that ought to be accessible to the public. It’s amazing that an archive like the Industry Documents Library is also so invested in thinking critically about the technical tools that it’s reliant upon, but I guess it’s not such a surprise! Thanks to the whole team and to UCSF for a great summer fellowship experience!

Welcome to Industry Documents Library Data Science Fellows!

The Industry Documents Library (IDL) is excited to welcome three Data Science Fellows to our team this summer. The Data Science Fellows will be working with the IDL and with the UCSF Library Data Science Initiative (DSI) to assess the impact of transcription accuracy on text analysis of digital archives, using the IDL collections.

Through tagging, human transcription, and computer-generated transcription, the team will assess how accuracy may differ between media or document types, and how and whether this difference is more or less pronounced in certain categories of media (for example, video recordings of focus groups, community meetings, court proceedings, or TV commercials, all of which are present in the IDL’s video collections). After identifying transcript accuracy in different media types, we aim to provide guidelines to researchers and technical staff for proper analysis, measurement, and reporting of transcript accuracy when working with digital media.

Our Junior Data Science Fellows are Rogelio Murillo and Lianne De Leon. Rogelio and Lianne are both participating in the San Francisco Unified School District (SFUSD) Career Pathway Summer Fellowship Program. This six-week program provides opportunities for high school students to gain work experience in a variety of industries and to expand their learning and skills outside of the classroom. Lianne and Rogelio will be learning about programming and creating transcription for selected audiovisual materials. The IDL thanks SFUSD and its partners for running this program and providing sponsorship support for our fellows.

Lubov McKone is our Senior Data Science Fellow and will be using automated transcription tools to extract text from audiovisual files, run sentiment and topic analyses, and compare automated results to human transcription. Lubov will also provide guidance and mentoring to the Junior Fellows.

Our Fellows have introduced themselves below. Please join us in welcoming Rogelio, Lianne, and Lubov to the UCSF Library this summer!

Hi, my name is Lianne R. de Leon and I go to Phillip and Sala Burton High School as a rising senior. I love playing volleyball in my free time and you may see me at numerous open gyms around the city. In the future I hope to major in computer science or computer engineering. I’m looking forward to meeting many wonderful people here at UCSF and learning more about the data science industry from the inside.

Image of Lianne De Leon, one of IDL's Summer 2022 Junior Data Science Fellows.
IDL Junior Data Science Fellow Lianne de Leon

Hi, my name is Rogelio Murillo and I’m a rising junior at Ruth Asawa School of the Arts. I enjoy playing a variety of music and percussion. I’ve played Japanese Taiko, Afro Brazilian drumming, and Latin Jazz. I’m also learning guitar over the summer. I’m a responsible and respectful person.

Image of Rogelio Murillo, one of IDL's Summer 2022 Junior Data Science Fellows.
IDL Junior Data Science Fellow Rogelio Murillo

My name is Lubov McKone and I’m currently pursuing my Masters in Library and Information Science from Pratt Institute in Brooklyn, NY. I also hold a Bachelor’s degree in Statistics, and prior to entering graduate school I worked as a data analyst in local government. My professional interests include supporting researchers in the accurate and responsible use of data, and I aspire to work as a data librarian in an academic library after graduation. Outside of work, I spend my time cooking, doing yoga, and writing music. I’m very excited to be joining the UCSF Industry Documents Library this summer, and I’m looking forward to learning more about how researchers use digital collections!

Image of Lubov McKone, IDL's Summer 2022 Senior Data Science Fellow.
IDL Senior Data Science Fellow Lubov McKone

Welcome to Summer Interns May Yuan and Lianne de Leon!

Please join us in giving a warm welcome to our two newest summer interns, May Yuan and Lianne de Leon!

May and Lianne are both participating in the San Francisco Unified School District (SFUSD) Career Pathway Summer Fellowship Program. This six-week program provides opportunities for high school students to gain work experience in a variety of industries and to expand their learning and skills outside of the classroom. Lianne and May will be working (remotely) with the UCSF Industry Documents Library (IDL), and we are grateful to SFUSD and its partners for sponsoring these internships.

May and Lianne will be working on several collection description projects with IDL this summer, including correcting and enhancing document metadata, and creating descriptions for audio-visual materials. They have provided their introductions below.

My name is May Yuan and I’m a junior at Raoul Wallenberg Traditional High School. During my free time, I enjoy reading, learning and trying new things, and helping others academically. I’m super excited to work here at the UCSF IDL to help provide valuable information to the public as well as learn more about the various documents, lawsuits, etc. myself; I also hope to enhance my productivity and organization skills during my time working here as these skills are crucial to college and everyday life in general. The career paths I’m interested in are bioengineering (bioinformatics/biostatistics), law, and finance.

IDL Summer Intern May Yuan

Hi, my name is Lianne R. de Leon. I am a part of the Class of 2023 at Phillip and Sala Burton High School. In the past, I worked on the VEX EDR Robotics competition in 2018-2019. In my spare time I enjoy trying new foods and yoga. I aspire to become a computer hardware engineer and to travel across the entirety of Asia. I look forward to meeting and working with you all.

IDL Summer Intern Lianne de Leon

Welcome to IDL Summer Intern, Khushi Bhat

Please join us in giving a warm welcome to Khushi Bhat, who will be conducting a remote internship with the UCSF Industry Documents Library (IDL) this summer.

Khushi is currently a rising senior at Rutgers University where she is majoring in Biotechnology and minoring in Computer Science. This summer, she is working in the Industry Documents Library researching tools and methods to extract geographic locations from a collection of documents related to the tobacco industry’s influence on public policy.

Khushi will be conducting an independent course project to help the IDL team enhance descriptive metadata for our industry documents collections. We have long been aware of a research need to be able to filter documents by geographic location. Tobacco control researchers and other public health experts at UCSF and around the world use the documents in the Industry Documents Library to understand how corporations impact public health. This research is often used to inform policymakers who write laws and policies regulating the sale and use of products such as tobacco. Researchers and policymakers need information which relates to their local area such as their city, county, state, or country.

Geographic location is not currently included in IDL’s document-level metadata, and since IDL contains more than 15 million documents it is not feasible to manually catalog this information.

Khushi’s work will focus on researching Natural Language Processing (NLP) and Named Entity Recognition (NER) text analysis methods. She will investigate available tools which have the potential to automatically identify and label geographic information in text. Khushi’s research, recommendations, and pilot testing will help the IDL team outline workflows and strategies for enhancing our document metadata to include geographic information.

Khushi aspires to pursue a career in bioinformatics and intends to pursue higher education in this field upon graduation. In her spare time, Khushi enjoys dancing, baking, and hiking. Prior to joining Rutgers, she was an avid Taekwondo practitioner (and has a 2nd degree black belt to show for it!).

Image of IDL intern Khushi Bhat
IDL Summer Intern Khushi Bhat

New Archives Intern: Brittany Peretiako

Today’s post is an introduction from Brittany Peretiako, our newest intern here in the Archives, who will be helping us digitize materials and clean metadata in preparation for larger-scale digitization projects.


My name is Brittany Peretiako, and I am excited to join you all as an intern. As a brief introduction, I am originally from Santa Barbara, CA. I have three siblings, one brother and two sisters. My brother lives out in Arizona, and my sisters live in Emeryville. I moved to the Bay Area about three years ago to attend UC Berkeley, where I earned my bachelor’s degree studying US history with a focus on human rights issues.

Currently, I live in Concord, CA with my husband Ivan and our one-year-old son Emery. We have another addition to our family on the way, who will be arriving in November. As a family, we love to spend time outside exploring the bay. One of our favorite activities is hiking, and we are always looking for new trails to take.

I am enrolled in an online archives and records administration graduate program through San Jose State University. Although I am only in my first year, I have learned so much already and cannot wait to see what lies ahead. During my time as an intern here, I will be working on metadata clean-up and digitization. I may also have the opportunity to participate in web archiving. I was drawn to this position because it provides me with an opportunity to apply the skills I am learning in school to real-world tasks. Much of my schoolwork involves simply learning the importance of items such as metadata and digitization, but does not provide the ability to actually do hands-on work.

I look forward to getting to know all of you better over the next three months! 

New Archives Intern: Harold Hardin

Harold Hardin is joining us in Archives & Special Collections this spring to work on finishing the NEH grant-funded project The San Francisco Bay Area’s Response to the AIDS Epidemic. Harold will be helping QA digital objects among other tasks related to the digitization workflow.

Harold Hardin is a current student in Cuesta College’s Library/Information Technology program and San Francisco City College’s Paralegal Studies program. While pursuing a double major in Sociology/Critical Race Ethnic Studies at UC Santa Cruz, Harold developed an academic interest in the often hidden and occluded histories of marginalized communities, particularly histories of oppression and resistance. Through their own experiences of political activism at UC Santa Cruz and beyond (#Blacklivesmatter Oakland/Stockton, GaySHAME SF), Harold has insisted on moving iteratively between theory and praxis: centering an intersectional feminist analysis of power.

These analytical lenses and political participation increased Harold’s consciousness regarding the fundamental ways in which access to information (particularly personal/community histories) profoundly shapes participation in our democracy (or lack thereof). Harold is interested in the nuances of political participation and uncovering the innumerable sites of quotidian resistance! Therefore, Harold sees their internship within UCSF’s AIDS History Project as not only a unique privilege to work toward increasing community access to Queer history, but also, and importantly,  an extension of the deeply personal (political) work of (re)understanding their multiple positions within (and outside) of the Archives.

Internship Opportunities

UCSF Library Archives and Special Collections has 2 new internship opportunities.

Archives Intern for AIDS History

The San Francisco Bay Area’s Response to the AIDS Epidemic: Digitizing, Reuniting and Providing Universal Access to Historical AIDS Records.

The Archives Intern for AIDS History will be assigned various tasks to assist in completion of the project including performing Quality Control checks on digitized papers, digital objects and metadata. Candidate should be a student or recent graduate from a library or information science program, preferably with a concentration or interest in archives and special collections. Students of public history, and history of health sciences are also encouraged to apply. This is a part time temporary appointment.
Department: Archives and Special Collections
Rank and Salary: Library Intern – $15/hr
Term: 150 hours Fall 2018 – Spring 2019

Project Description

The Archives and Special Collections department of the University of California, San Francisco (UCSF) Library, in collaboration with the San Francisco Public Library (SFPL) and the Gay, Lesbian, Bisexual, Transgender (GLBT) Historical Society, has been awarded a $315,000 implementation grant from the National Endowment for the Humanities. The collaborating institutions will digitize about 127,000 pages from 49 archival collections related to the early days of the AIDS epidemic in the San Francisco Bay Area and make them widely accessible to the public online. In the process, collections whose components had been placed in different archives for various reasons will be digitally reunited, facilitating access for researchers outside the Bay Area.

The 127,000 pages from the three archives range from handwritten correspondence and notebooks to typed reports and agency records to printed magazines. Also included are photographic prints, negatives, transparencies, and posters. The materials will be digitized by the University of California, Merced Library’s Digital Assets Unit, which has established a reputation for digitizing information resources so that they can be made available to the world via the web. All items selected for digitization will be carefully examined to address any privacy concerns. The digital files generated by this project will be disseminated broadly through the California Digital Library, with the objects freely accessible to the public through both Calisphere, operated by the University of California, and the Digital Public Library of America, which will have an AIDS history primary sources set.

Skills and experience desired:

  • Strong candidates will be detail-oriented and possess excellent organizational skills
  • Proficiency with MS Excel and Google spreadsheets
  • Proficiency with document sharing and cloud computing services (Google Drive, Box)
  • Experience with digital asset management systems
  • Ability to work independently
  • Ability to lift boxes weighing up to 40 pounds

Hours and Location:

The timing of the internship is flexible, but it should begin during Fall 2018 and end by early Spring 2019, depending on applicant and institutional commitments. The schedule is up to two 8-hour days per week for 10-12 weeks. Work will be performed onsite at the library, though offsite work is possible.

Stipend:

A stipend of $15/hour is available for the internship. 

To Apply:

Applications for the UCSF Archives & Special Collections Internship, including a cover letter, resume, and names and contact information for two references, should be sent to:
David Krah, Project Archivist 
UCSF Archives and Special Collections
University of California, San Francisco
530 Parnassus Avenue
San Francisco, CA 94143-0840
Apply for this position

Digital Processing and Implementation Intern

The Digital Processing and Implementation Intern will assist the UCSF Digital Archivist with various aspects of the Digital Archives program as they are implemented and brought online for the first time. Potential projects include:

  • Testing digital forensics and processing hardware and software being implemented in the digital forensics lab.
  • Compiling an inventory of physical archival collections containing digital media, which involves pulling collections and identifying, counting, and cataloging the digital media present.
  • Disk-imaging digital media removed from collections and transferring data to library storage systems.
  • Creating metadata about digital media being processed in the digital forensics lab, and editing metadata for various digitization or cataloging projects.
  • Operating scanning equipment to digitize archival collections for patron and researcher use.
  • Processing digital collections under the supervision of the Digital Archivist, including creating finding aids and container lists and manipulating access copies of born-digital content to create access-ready versions of collections.
  • Researching computer tools and systems for management and preservation of digital objects, and compiling and reporting on capabilities, requirements, dependencies, etc. of these utilities.
  • Participating in staff meetings, assisting with writing blog posts, and helping with reference/duplication requests.
Department: Archives and Special Collections
Rank and Salary: Library Intern – $15/hr
Term: 150 – 200 hours Fall 2018 – Spring 2019

Location

UCSF Library and Center for Knowledge Management,
530 Parnassus Avenue, San Francisco, CA 94143-0840

Work Type

Archival Processing, Information Technology, Computer Science

Work To Be Done

On site, with occasional opportunities to work from home or another location

Desired Qualifications

  • Experience with ArchivesSpace, Nuxeo or other archival collections management software
  • Experience with or interest in digital preservation, digital file formats and media, computer science, or history of computing technologies
  • Experience with or interest in digital forensics in archival collections and various digital forensics tools, such as FTK Imager and BitCurator
  • Familiarity with scripting, computer programming in any language, or Unix
  • Excellent analytical and writing skills
  • High level of accuracy and attention to detail
  • Ability to work independently
  • Ability to lift boxes weighing up to 40 pounds

Stipend

A stipend of $15/hour is available for the internship. The internship is intended for those who are currently enrolled in an undergraduate/graduate program.

Hours

Up to two 8-hour days per week for 10-12 weeks. Specific on-site hours are negotiable, but must be completed between 8:00 a.m. and 5:00 p.m., Monday through Friday. Start and end dates are flexible.

Application Process

Please submit a letter of interest, a current resume and contact information for two professional references to:

Charles Macquarie
Digital Archivist
UCSF Archives and Special Collections
University of California, San Francisco
530 Parnassus Avenue
San Francisco, CA 94143-0840

Apply for this position

The UCSF Library is committed to a culture of inclusion and respect. We embrace diversity of thought, experience, and people as a source of strength which is critical to our success. We encourage candidates to apply who thrive in an environment which celebrates and serves our diverse communities.

Equal Employment Opportunity
The University of California, San Francisco is an Equal Opportunity/Affirmative Action Employer. All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, age, protected veteran or disabled status, or genetic information.

About UCSF
The University of California, San Francisco (UCSF) is a leading university dedicated to promoting health worldwide through advanced biomedical research, graduate-level education in the life sciences and health professions, and excellence in patient care. It is the only campus in the 10-campus UC system dedicated exclusively to the health sciences.

About UCSF Archives and Special Collections
UCSF Archives & Special Collections is a dynamic health sciences research center that contributes to innovative scholarship, actively engages users through educational activities, preserves past knowledge, enables collaborative research experiences to address contemporary challenges, and translates scientific research into patient care.

Intern Report: Learning to Process Collections

This is a guest post by Lauren Wolters, UCSF Archives Intern.

I have recently completed my internship working in the UCSF Archives and Special Collections. I really enjoyed the challenge of a new experience and learned a tremendous amount in my short time working there.

During my internship I was able to complete several processing and arrangement projects. I created two finding aids, one for the Langley Porter Psychiatric Institute records and another for the files of Bernice Hemphill, longtime head of the Irwin Memorial Blood Bank.

I appreciated the opportunity to learn from an archival mentor and gain practical experience by independently processing each collection and working with tangible materials. It was very satisfying to contribute to the preservation of these documents and their history and to help make them more accessible for future use. I look forward to pursuing future endeavors with an informed understanding of the archival process.