New Archives Intern: Brittany Peretiako

Today’s post is an introduction from Brittany Peretiako, our newest intern here in the Archives, who will be helping us digitize materials and clean metadata in preparation for larger-scale digitization projects.


My name is Brittany Peretiako, and I am excited to join you all as an intern. As a brief introduction, I am originally from Santa Barbara, CA. I have three siblings, one brother and two sisters. My brother lives out in Arizona, and my sisters live in Emeryville. I moved to the Bay Area about three years ago to attend UC Berkeley, where I earned my bachelor’s degree studying US history with a focus on human rights issues.

Currently, I live in Concord, CA with my husband Ivan and our one-year-old son Emery. We have another addition to our family on the way, who will be arriving in November. As a family, we love to spend time outside exploring the Bay. One of our favorite activities is hiking, and we are always looking for new trails to take.

I am enrolled in an online archives and records administration graduate program through San Jose State University. Although I am only in my first year, I have learned so much already and cannot wait to see what lies ahead. During my time as an intern here, I will be working on metadata clean-up and digitization. I may also have the opportunity to participate in web archiving. I was drawn to this position because it provides me with an opportunity to apply the skills I am learning in school to real-world tasks. Much of my schoolwork involves learning why things like metadata and digitization matter, but it does not offer much opportunity for hands-on work.

I look forward to getting to know all of you better over the next three months! 

Processing the Laurie Garrett papers

As part of our current National Archives NHPRC grant project “Evolution of San Francisco’s Response to a Public Health Crisis: Providing Access to New AIDS History Collections,” we’ve been processing the papers of Laurie Garrett. Garrett is a Peabody, Polk, and Pulitzer Prize-winning journalist. She won the Pulitzer Prize for Explanatory Journalism for her work chronicling the Ebola virus in Zaire, published in Newsday. She is also the bestselling author of The Coming Plague: Newly Emerging Diseases in a World Out of Balance. Garrett has worked for National Public Radio and Newsday and was a senior fellow at the Council on Foreign Relations. She has won many awards, including the Award of Excellence from the National Association of Black Journalists and the Bob Considine Award of the Overseas Press Club of America.

Headshot of Laurie Garrett
Laurie Garrett, photograph by Erica Berger, MSS 2013-03

The Laurie Garrett papers include drafts of her two books, The Coming Plague and Betrayal of Trust. The collection also includes material related to her service at the Council on Foreign Relations. Garrett’s papers feature correspondence and records of the various national and international conferences and meetings in which she took part. Some unique types of material present in the collection include audiovisual recordings, photographs, videotapes, film reels, notebooks, and interviews.

A conference bag with several flyers for HIV/AIDS conferences lying on a table.
Conference bags, notebook, and press card, Laurie Garrett papers, MSS 2013-03

Once the Laurie Garrett papers are processed, a finding aid will be prepared and put on the Online Archive of California, and a small selection of the collection will be digitized and made available online to researchers via Calisphere.

-Edith Martinez, processing archivist for AIDS History Project

UCSF Archives & Special Collections awarded grant to digitize historical California public health materials

UCSF Archives has been awarded a California Revealed grant to digitize historical reports, newspapers, yearbooks, and other publications that document the development of medicine and public health in California and the Bay Area and various activist and community roles in that history. The publications to be digitized include The Cap & Seal yearbook of the San Francisco General Hospital Nursing School, the Annual Reports of the San Francisco Nursery for Homeless Children, the Annual Reports of St. Mary’s Hospital, the Bay Area Health Liberation News Newspaper, the Annual Reports of the California Women’s Hospital, the Clarion journal of the SF Department of Public Health Tuberculosis division, and the Annual Reports of St. Luke’s Hospital.

These materials contain fascinating and valuable primary source documentation of the development of medicine and public health in California. Included are countless historical images of hospital spaces, technologies, and equipment; historical data on hospital patients, surgeries, and finances; historical patient voices through writings and illustrations; and evidence of the broad and diverse movement building which was a part of progressive public health development in the civil rights era.

The project will include 80 volumes in total of the items outlined above. The digitization, provided at no cost by California Revealed, is equivalent to an estimated $5,500 in actual digitization costs. The digitized materials will be published to Calisphere for public access and download.

The front page of the Bay Area Health Liberation News newspaper with an article about medical repression in prisons.

About California Revealed:
California Revealed is a State Library initiative to help California’s public libraries, in partnership with other local heritage groups, digitize, preserve, and provide online access to archival materials – books, newspapers, photographs, audiovisual recordings, and more – that tell the incredible stories of the Golden State.

Celebrating 20 Years of the UCSF Tobacco Center and Industry Documents Library

Celebrate with us >

Image of tobacco company executives testifying before congress with the following text over the top: Celebrating 20 years of the UCSF Tobacco Center and Industry Documents Library

Tuesday, November 27, 12 – 1 pm

Parnassus Library, 5th Floor, Lange Room: Join us to celebrate 20 years since the signing of the Master Settlement Agreement and the creation of the UCSF Center for Tobacco Control Research and Education and the UCSF Industry Documents Library!

In November 1998 the five largest cigarette manufacturers signed the Master Settlement Agreement (MSA) with 46 U.S. states and 6 U.S. jurisdictions. This was the largest civil litigation settlement in U.S. history.

The MSA imposed restrictions on the sale and marketing of cigarettes, especially to youth, and required hundreds of billions of dollars in payments to the states in perpetuity to partially compensate them for the Medicaid costs of smoking. It also created the American Legacy Foundation (now known as the Truth Initiative), which funded the creation of the UCSF Center for Tobacco Control Research and Education and the Industry Documents Library.

This event is open to the public. Cake and beverages will be provided while supplies last. RSVP using the link at the top of the post.

This event is co-organized by the UCSF Industry Documents Library and the Center for Tobacco Control Research and Education.

UCSF Archives & Special Collections awarded $99,325 LSTA grant for textual data extraction from historical materials on AIDS/HIV

The Archives and Special Collections department of the University of California, San Francisco (UCSF) Library has been awarded a $99,325 “Pitch-An-Idea, Local” grant for the first year of a two-year project from the Institute of Museum and Library Services’ (IMLS) Library Services and Technology Act funding administered through the California State Library. The Archives will take the nearly 200,000 pages of textual AIDS/HIV historical materials which have been digitized as part of various digitization projects — including the National Historical Publications and Records Commission (NHPRC)-funded project, “Evolution of San Francisco’s Response to a Public Health Crisis,” and the National Endowment for the Humanities (NEH)-funded project, “The San Francisco Bay Area’s Response to the AIDS Epidemic” — and will extract unstructured, textual data from these materials using Optical Character Recognition (OCR) and related software. The project team will prepare the text as a research-ready, unstructured textual dataset to be used for digital humanities, computationally driven cultural heritage, and machine learning research inquiries into the history of the HIV/AIDS epidemic.

The 24-month project, entitled “No More Silence — Opening the Data of the HIV/AIDS Epidemic,” commenced on July 1, 2018. The digitized materials from which text will be extracted include handwritten correspondence, notebooks, typed reports, and agency records which represent a broad view of the lived experience of the epidemic, including documentation from People with AIDS and their friends and families, as well as scientists and public health officials working to slow the epidemic. All historical materials represented in this dataset have been previously screened to address privacy concerns. The resulting unstructured, textual dataset will be deposited in the UC Dash data-sharing repository for public access and use by any interested parties, and will also be deposited in other similar data repositories as appropriate. “During my tenure at UCSF,” says health sciences historian and professor in the Department of Anthropology, History, and Social Medicine at UCSF, Dr. Aimee Medeiros, “I have been inspired by the library’s enthusiasm and dedication to public access and the use of practices in the digital humanities to help maximize access to HIV/AIDS material.” This project will build on that legacy by bringing these valuable historical materials into the realm of digital humanities and scientific research and making them computationally actionable.

According to Dr. Paul Volberding, director of the AIDS Research Institute at UCSF, “Discovering the complexities of the virus and developing effective treatments will be studied of course, but the lives of those directly involved as patients as well as care providers is equally significant. The cultural aspects of the epidemic will most directly benefit from the work [of this project]. Combining the growing field of computational science with the already large and rapidly growing archive of materials from all aspects of the AIDS epidemic demand the creation of new tools and I look forward to the new insights we gain from their application. [UCSF Library has] been sharply focused on the AIDS archives and have amassed a rich collection that, in its digitized form, will be the database for [these] new efforts. Together, this database and new computational tools, will enable a sophisticated analysis that I am convinced will be used to shed more insight in our understanding of the impact of the epidemic and ways our response will have meaning in the inevitable future crises.”

Once the preparation of the textual dataset is completed, the project team — consisting of archivists and technical staff from both the Archives and the Library — will embark on several pilot research projects using machine learning, and especially natural language processing research methods, on the data. The pilot projects, which will be scoped in collaboration with various stakeholders, will attempt to explore what kinds of structured data can be pulled out of the unstructured text and to define some simple critical inquiries that can be addressed using this data, these methods, and the results of these experimental endeavors. Additionally, the project team hopes to get a better sense of the functional requirements for systems supplying this type of data when tailored toward these kinds of medical humanities research questions. Through these efforts the project team will be able to better define the extent to which, as stated by Dr. Medeiros, “making 200,000 pages of primary-source archival documentation converted to unstructured textual data will… further meaningful research and our understanding of this epidemic.”
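To give a concrete, if simplified, sense of what pulling structured data out of unstructured text can look like, here is a minimal sketch using the open-source spaCy natural language processing library to tag named entities in a short snippet of OCR’d text. The snippet, the library choice, and the model name are illustrative assumptions on our part, not a description of the project’s actual pipeline.

# A minimal sketch (not the project's pipeline) of extracting structured
# data from unstructured OCR text with named-entity recognition.
# Assumes spaCy and its small English model are installed:
#   pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")

# A hypothetical snippet of OCR'd text from the dataset.
text = (
    "The San Francisco Department of Public Health convened a meeting "
    "on June 14, 1985 to discuss the response to the epidemic."
)

doc = nlp(text)
for ent in doc.ents:
    # Entity labels such as ORG, DATE, and GPE are candidate structured fields.
    print(ent.text, ent.label_)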

Finally, the project team will promote the existence of this dataset, and will lead workshops to help instruct potentially interested students, researchers, scholars, and members of the general public in its use. Again in the words of Dr. Medeiros, “the plans to provide workshops to help curious scholars learn how to best interface with this data is exciting as it will allow for those who are experts in the field but not necessarily in the digital medical humanities to conduct important research.”

This project will support innovation, creativity, and collaboration in and across the humanities, social sciences, and STEM fields (Science, Technology, Engineering, and Math) by opening up a new body of historical materials for research and discovery. The project will foster new creative research methods in the areas of the humanities, which are just beginning to experiment with computationally-driven research, and it will encourage collaboration through the use of the newly-created data resource, engaging the expertise of both humanists and scientists in making discoveries in the data. Not only does this collaborative work allow for innovation “at the edges” of each of these fields, it allows for computational access to a previously-inaccessible research object — the data of the lived experience and cultural history of the AIDS crisis in the Bay Area and beyond.

The following institutions and groups are serving as informal partners on this project:

About UCSF Archives & Special Collections (UCSF Library)
The mission of the UCSF Archives & Special Collections is to identify, collect, organize, interpret, and maintain rare and unique material to support research and teaching in the health sciences and medical humanities and to preserve institutional memory. The UCSF AIDS History Project (AHP) began in 1987 as a joint effort of historians, archivists, AIDS activists, health care providers, scientists, and others to secure historically significant resources documenting the response to the AIDS crisis. Its holdings currently include 46 collections and continue to grow.
www.library.ucsf.edu

UCSF Library logo

About the Library Services and Technology Act
Library Services and Technology Act (LSTA) grants are federal funds from the Institute of Museum and Library Services that are awarded by the State Library to eligible California libraries. This project was supported in whole or in part by the U.S. Institute of Museum and Library Services under the provisions of the Library Services and Technology Act, administered in California by the State Librarian. www.library.ca.gov/grants/library-services-technology-act/

California State Library logo
Institute of Museum and Library Services logo

 

Intern Report: Crafting a Digital Forensics Lab

This is a guest post by our Digital Archives intern for summer 2018. The intern worked on implementing, testing, and piloting equipment for the Digital Forensics Lab to capture content from decaying computer media present in our collections.

This summer I worked with UCSF’s Digital Archivist Charles Macquarie on building up the UCSF Archives & Special Collections Digital Forensics Lab. It was such an honor to come and join the UCSF team because of the great people, and the unique and important collections that the Archives preserves and provides access to. I am grateful for the experience and the shared wisdom of the staff, and to be able to contribute to this growing piece of the work of the Archives.

What exactly does it mean to build a Digital Forensics Lab? In the case of UCSF, many of the collections contain obsolete and legacy media – things like floppy disks and ZIP disks, and even personal digital assistants (PDAs, remember those pre-smartphone devices?) and SD cards. As more time passes, it isn’t so easy to read or use these formats effectively without access to the machines and software that were used to read and create them.

Without access, we risk losing important parts of collections when outdated digital document formats and research materials are no longer readable. By creating a lab environment where we can rescue these items, we can give them life again. This is why I love this kind of work.

Building a lab like this is no small challenge! I’ve worked on configuring new software to power old hardware, testing lab equipment with “dummy disks” and files, troubleshooting the problems that arise with a new implementation, and even building a new housing for a few select drives using the 3D-printing equipment in the UCSF Makers Lab. It has been a busy summer.

Throughout my time here I was able to contribute documentation, workflows, and hardware (my favorite part) to the lab implementation process, and to create software troubleshooting steps that should make it easier for archivists and researchers to use the lab equipment to retrieve difficult-to-read media and file formats from our archival collections.

I also appreciate being able to learn more about archival processing for digital collections in the context of digital forensics and born-digital archival materials. I gained practical field experience with both digital forensic work and digital curation that I can take back with me in the final year of my Master’s degree. Though I’m sad to be leaving the UCSF Library, I am grateful for the experience I’ve gained working on this challenging project, and I hope to return to visit next year.

Experiments with Digital Tools in the Archives — OCR

Working on digital “stuff” in the archives is always fascinating, because it blurs the borders between digital and physical. Most of the work that takes up my time sits at these borders. Physical stuff requires lots of human touches to transition to “digital,” and digital stuff similarly requires lots of tending by humans to ensure that it is preserved physically. After all, the 1s and 0s are stored physically somewhere, even if in the cloud or in DNA.

We’re currently working on several projects to convert physical materials to digital text. The huge quantity of rich and complicated textual material in archival collections is full of potential for use as data in both computational health research and digital medical humanities work, but to be usable for these kinds of projects it needs to be converted to digital text or data so that it can be interpreted by computers. To get to this point the documents must be scanned, and the scanned documents must either be transcribed, which can be immensely labor intensive, or converted directly by computers using software that can perform Optical Character Recognition, or OCR. One of our projects using OCR to extract text from a document provides a fascinating look into the world of computer vision.
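As a rough illustration of that last step, here is a minimal sketch of running OCR on a single scanned page with pytesseract, a Python wrapper around the Tesseract engine discussed below. The file name is a placeholder, and this sketch is not our actual production workflow.

# Minimal sketch: OCR a single scanned page into plain text.
# Assumes Tesseract is installed, along with the pytesseract and Pillow packages.
from PIL import Image
import pytesseract

page = Image.open("scanned_page.tif")  # placeholder path to a scanned page

text = pytesseract.image_to_string(page)
print(text)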

A pen and ink illustration of the lungs and a lymph gland from the Ralph Sweet Collection of Medical Illustrations

An example of the illustrations in the Ralph Sweet Collection

The Ralph Sweet Collection of Medical Illustration contains extraordinary examples of the work of one of the most renowned medical illustrators in the United States, so we’re working on digitizing the collection and putting it online. To do this we need detailed metadata — the kind of information you might expect to find in a catalog record: title, date, author — about each illustration. Currently this metadata for the Sweet Collection exists only in the form of a printed index written on a typewriter. We can scan the index, but we do not have the labor to transcribe each of the 2,500 or so entries. This is a job for OCR.

The image below shows what a page of the Ralph Sweet index looks like. This is the metadata that we want to be able to extract and turn into digital text so that it can be understood by a computer and used as data.

A page of a typewritten index of the Ralph Sweet Collection, showing metadata about each illustration in the collection.

A page of the index for the Ralph Sweet Collection.

One of the first problems we encountered in attempting to extract text from this document is a classic difficulty of computer vision. As English-speaking humans, we know by looking at this document that it contains three columns, and that the order in which to read the page is top to bottom by column, starting on the left and moving right. To a computer, however, it is simply a page full of text, and there is no way to know whether the text is broken into columns or whether each line should be read all the way across the page. This simple task presented a difficulty for most of the software that we tested, but we found one program that could identify these columns easily. That software is Tesseract, which was originally developed in the 1980s and continues to be a good open-source tool for performing OCR.
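For readers curious how the column problem surfaces in practice, Tesseract exposes “page segmentation modes” that control how it carves a page into blocks before reading. The sketch below, again using pytesseract with a placeholder file name, compares the default automatic layout analysis with a mode that treats the page as one uniform block of text; which mode works best depends on the document, so treat this as an experiment rather than a recipe.

# Sketch: comparing Tesseract page segmentation modes on a multi-column page.
# --psm 3 is fully automatic page segmentation (the default), which can detect
# columns; --psm 6 assumes one uniform block of text, reading each line
# straight across the page.
from PIL import Image
import pytesseract

page = Image.open("sweet_index_page.tif")  # placeholder file name

auto_layout = pytesseract.image_to_string(page, config="--psm 3")
single_block = pytesseract.image_to_string(page, config="--psm 6")

print(auto_layout[:500])
print(single_block[:500])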

If we plug the above page into Tesseract, we get mostly recognizable text, which in itself is pretty miraculous when you think about it. Looking at the text though, it quickly becomes clear that it is not an exact transcription of what’s on the page. There are misspellings (“Iivev”), and some chunks of text have been separated from the entry in which they should appear (“horizontal”).

An image of the text output of the software Tesseract showing errors in transcription.

An example of the text extracted from the Ralph Sweet Collection Index by Tesseract.

Digging into the way that Tesseract (and OCR software more generally) works can help us begin to understand why these errors are cropping up. Plus, it looks really cool.

OCR programs have to go through a set of image manipulation processes to help them decide which marks on the page are text — and hence should be interpreted — and which are other marks that can be ignored. This all happens behind the scenes, and usually it involves deciding what the background parts of the image are and blurring them out, increasing the image contrast, and making the image bi-tonal so that everything on the page is only black or white. Then the computer can trace the black pixels on the image and get a series of shapes that it can begin attempting to interpret as text. The image below shows the shapes that Tesseract has identified as letters and traced out for interpretation. Each new color indicates that the computer believes it has moved on to a new letter.
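The snippet below sketches the kind of preprocessing described above using the OpenCV library: grayscale conversion, a light blur to suppress background speckle, and Otsu thresholding to make the image bi-tonal. It illustrates the general technique rather than the exact steps Tesseract performs internally, and the file names are placeholders.

# Sketch of typical OCR preprocessing: grayscale, denoise, binarize.
# Assumes the opencv-python package is installed.
import cv2

image = cv2.imread("sweet_index_page.tif")  # placeholder file name

# Collapse the color channels into a single grayscale channel.
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

# Blur slightly to smooth background speckle before thresholding.
blurred = cv2.GaussianBlur(gray, (3, 3), 0)

# Otsu's method picks a global threshold, leaving only black or white pixels.
_, binary = cv2.threshold(blurred, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

cv2.imwrite("sweet_index_page_binary.png", binary)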

A page of colorful text on a black background illustrating the text that has been automatically traced from the Ralph Sweet Index by the computer program Tesseract.

The result of Tesseract tracing the letters it has interpreted. Each new color is something that’s been identified as a new letter.

Interestingly, when comparing the computer tracing of the letters to the original image you can see that Tesseract has already made the assumption that the black spaces from the three-hole punch in the top of the page are not letters, and thus it should not bother to trace them. Lucky for us, that’s correct.

Next the computer has to take all these letters and turn them into words. In actual practice it’s not quite this simple, but basically the computer iterates on each letter identification it believes it has made by testing whether or not the result is in its dictionary, and thus whether or not it is likely to be a word. If the combination of letters the computer thinks it sees is not a word, it will go back and make a new guess about the letters and test whether or not that is a word, and so on. Part of this whole process is to chunk the letters into words using their actual spacing on the page. Below you can see an image of how Tesseract has begun to identify words using the spaces between lines and letters.

A view of a page of the Ralph Sweet Index showing each word as a blue rectangle encompassing the space taken up by that block of text against a black background -- the "word" output of the OCR program Tesseract.

The “words” that the OCR software has identified on the page. Each blue rectangle represents a space that Tesseract has marked as containing a word.

In addition to checking the word against the dictionary though, most OCR programs also use the context of the surrounding words to attempt to make a better guess about the text. In most cases this is helpful — if the computer has read a sentence that says “the plant requires wader” it should be a relatively easy task to decide that the last word is actually “water.” In our case though, this approach breaks down. The text we want the computer to extract in this case is not sentences, but rather (meta)data. The meaning of the language has little influence on how each individual word should be read. One of the next steps for us will be trying to figure out how to better instruct Tesseract about the importance of context in making word-identification decisions (i.e., that it’s not important).
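Tesseract does expose configuration variables that turn off its dictionary-based correction, which is the direction we expect to experiment in. A hedged sketch of what that might look like follows; we have not yet settled on whether these particular settings improve our results.

# Sketch: asking Tesseract not to lean on its word dictionaries, since the
# index entries are data rather than prose. These config variables exist in
# Tesseract, but whether they help for this index is still an open question.
from PIL import Image
import pytesseract

page = Image.open("sweet_index_page.tif")  # placeholder file name

no_dictionary = "-c load_system_dawg=0 -c load_freq_dawg=0"
text = pytesseract.image_to_string(page, config=no_dictionary)
print(text[:500])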

Finally, as the OCR software interprets the text it also identifies blocks of words that it believes should be grouped together, like a paragraph. Below you can see the results of this process with Tesseract.

A view of the different elements of Tesseract's text identification, showing letters traced in primary colors and contained in yellow bounding boxes, words set against blue rectangles outlining the space they encompass, and blocks of text outlined in larger bounding boxes and numbered -- all of this set against a black background.

This view shows all of the elements of Tesseract’s word identification combined. Text has been traced in color, separate letters are contained in bounding boxes, words are contained in blue rectangles, and blocks are contained in larger bounding boxes and are numbered (though the numbers are a bit difficult to see).

A line has been drawn around each block of text, and each block has been given a number indicating the order in which the computer reads it. Here we can see the source of one of the biggest problems with the OCR-generated text from earlier: Tesseract is inaccurately excluding a lot of words from their proper blocks. In the image above, the word “Pen” is a good example. It is part of block 20, but it has been interpreted by the computer as its own block — block 21 — and has been set aside to appear in the text file after block 20. Attempting to solve this problem will be our next hurdle, and hopefully we can catch you up after we are successful.
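One way to examine these block assignments programmatically is with pytesseract’s image_to_data output, which reports the block, paragraph, line, and word number for every piece of recognized text. A small sketch, with a placeholder file name, of how we might spot words like “Pen” that have been split into their own blocks:

# Sketch: listing which block Tesseract assigned each recognized word to,
# so stray words can be spotted and, eventually, re-attached to their entries.
from PIL import Image
import pytesseract
from pytesseract import Output

page = Image.open("sweet_index_page.tif")  # placeholder file name

data = pytesseract.image_to_data(page, output_type=Output.DICT)

for block, word, conf in zip(data["block_num"], data["text"], data["conf"]):
    if word.strip():  # skip empty entries that only mark structural levels
        print(block, word, conf)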

Using OCR to extract text from digital images can be a frustrating endeavor if accuracy is a necessity, but it is also a fascinating illustration of the way computers see and make decisions. Anytime we ask computers to perform tasks that interface with humans, we will always be grappling with similar issues.

Reproducible Research and the problems of preserving computer code and software

We collect and preserve a lot of the documentary evidence of science happening at UCSF — everything from lab notebooks to lab websites detailing research processes. We even hold tons and tons of data in our collections, mostly in physical form, as patient surveys or health records, or even raw data as it was initially recorded by hand in the lab.

But what about the products of contemporary science, where key digital elements such as computer code or software might be crucial to an understanding of the research? This is already presenting problems for research reproducibility. Think, for example, of a set of results which were obtained using a computer script written in the Python computer programming language. If you want to verify these results, are you able to view the source code which produced them? Are you able to execute that code on your own computer? Can you tell what each piece of the code does? Does the code rely on access to an external data set to work correctly, and can you access and/or assess that data set to test the code?
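As one small, hedged illustration of what answering those questions can require, the sketch below shows a Python script that records its own computational environment and a checksum of its input data, so that anyone trying to verify the results knows what code, data, and environment were involved. The file names and the idea of a "provenance" file are our own illustrative choices, not an established standard.

# Sketch: a script that documents its own environment so someone trying to
# reproduce the results knows what they need. File names are placeholders.
import hashlib
import json
import platform
import sys

INPUT_FILE = "survey_data.csv"  # hypothetical input dataset

def sha256_of(path):
    # A checksum lets others confirm they are working with the same data.
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

provenance = {
    "python_version": sys.version,
    "platform": platform.platform(),
    "input_file": INPUT_FILE,
    "input_sha256": sha256_of(INPUT_FILE),
}

with open("provenance.json", "w") as f:
    json.dump(provenance, f, indent=2)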

As we work more closely with our Data Science Initiative team on these issues, it becomes clear that these are preservation questions as well. A critical understanding of the scientific past and present requires access to the primary source documentation of that research, including computer code and software. Being able to understand and interpret that computer code involves many of the same questions mentioned above — execution of the code, documentation of each process in the code, access to necessary data, etc.

To begin to address this, we are working with the Data Science team to assess researcher coding practices as a first step in understanding how the library can encourage better documentation and preservation of code in the service of reproducible research and the persistence of the scientific scholarly record. And if you’re a researcher who codes for your work, then we want feedback from you! Please consider attending one of our lunchtime listening sessions in the coming weeks — 4/20 from 12-1:30 pm at Mission Bay, and 4/27 from 12-1:30 pm at Parnassus. We will have an informal chat about research coding practices, discuss some of the issues we encounter as information professionals, and talk about what the library can do to aid in these areas.

Join us as we make some in-roads on this challenging information problem.

New Sites in the UCSF Web-Archive

As discussed previously here, we’ve been working on expanding our web-archives presence across campus, and one of the developments we’re most excited about is getting the web-archiving process formalized in the centralized UCSF workflow for upgrading websites or retiring abandoned ones. Now that we have been successful in establishing this program, archiving a site is an official part of the website rollover or retirement process, which means we have a much better finger on the pulse of the UCSF web presence.

And as this process ramps up, we’ve been adding all sorts of fascinating UCSF websites to our collections, so we wanted to highlight a few recent acquisitions.

First is a complete copy of the website for the W. M. Keck Center for Noncoding RNAs, which is scheduled to be rolled over to a new platform soon. The Keck Center explores the 98.6% of the human genome which is “non-coding,” or which is not the part of the genome directly containing the code to create proteins. Since, in their words, most genetic research focuses on the protein-encoding genes — those genes whose purpose is clear — the area of non-coding RNA can be thought of as “genetic dark matter.” Even though the purpose of this genetic material is not clear, it still influences human health, and it is the mission of the Keck Center to figure out how.

screen shot of the homepage of the W.M. Keck Center for Non-Coding RNAs

Homepage of the archived version of the W.M. Keck Center website.

The lab uses mice in its research, modifying mouse stem cells and using mouse genes to examine the function of these non-coding RNAs. Conveniently, the lab’s website contains all the raw genetic data, as well as the experiment plans, images, and other associated data for these experiments. We’re excited about this capture because we were able to collect all this data at once and provide a snapshot of the lab’s work — complete with all the associated research materials. This is a huge help in tackling the problem of historical preservation of contemporary scientific work, and it even begins to address the very present problems of reproducibility in data-intensive and computing-intensive scientific research.

Additionally, another website we have recently captured illustrates the value of curating a selection of the UCSF institutional ecosystem all together. This is the site of the UCSF Institutional Animal Care and Use Committee. Say, for example, that in examining the archived site of the Keck Center, you also wondered what the legal treatment protocols and procedures were at the time for scientific research involving animal subjects, and whether or not the Keck Center was following those protocols. With a little clicking around on the Wayback Machine you would be able to quickly answer that question, and you would have a clear picture of where the Keck Center’s research fit into the larger legal and ethical questions on campus and in the scientific community about proper treatment of and care for the animals used in research.

screen shot of the homepage of the UCSF Institutional Animal Care and Use Committee

Homepage of the archived version of the UCSF Institutional Animal Care and Use Committee.

We look forward to continuing to build and enrich our web-archive collections, and remember that if you have a suggestion you can always request that we begin capturing your UCSF site!