The Craft of Archival Processing: A Journey through Space and Time with Dr. Mary Olney

Introduction by Polina Ilieva

During the spring semester 2018 the archives team co-taught and facilitated a new History of Health Sciences course, the Anatomy of an Archive. The idea of this course was conceived by the Department of Anthropology, History and Social Medicine (DAHSM) Assistant Professor, Aimee Medeiros and UCSF Head of Archives & Special Collections, Polina Ilieva. Kelsi Evans, Project Archivist, co-facilitated the discussion sessions and Kelsi, Polina and David Uhlich, Access and Collections Archivist, served as mentors for students’ processing projects throughout the duration of the course.

The goal of this course was to provide an overview of archival science with an emphasis on the theory, methodology, technologies and best practices of archival research, arrangement and description. The archivists put together a list of collections that required processing and corresponded to students’ research interests, and each student selected one to arrange and describe in a finding aid with the help of a mentor. During this 10-week assignment students developed competence researching and describing an archival collection, as well as interpreting the historical record. At the conclusion of this course students wrote a story about their experience and the collections they researched for the archives blog. In the next three weeks we will be sharing these posts with you.

Our final story comes from Hsinyi Hsieh, PhD student, UCSF Department of Anthropology, History and Social Medicine.

Post by Hsinyi Hsieh

Building an archival collection is similar to traveling through space and time. Before embarking on this journey, archival practitioners need to possess a diverse set of creative and sensitive abilities—specifically, a knowledge of scientific principles, a familiarity with artful practices, and the ability to think critically. Most significantly, processing a collection requires getting your hands dirty, interacting with various types of historical materials, and building a rapport with future researchers. I am grateful to have worked with Kelsi Evans and Polina Ilieva, archivists at UCSF, who not only taught me the craft of archival work through the Mary Olney collection but also provided me with a golden opportunity to travel with Dr. Olney. [1]

Figure 1: Mary Olney’s contribution on “Sugar Free Summer,” San Francisco Sunday Examiner & Chronicle June 5, 1983. Olney papers, MSS 98-64.

My archival journey began by imbibing tacit knowledge about processing archival collections. When we encountered some mold-affected materials in the Mary Olney collection, the UCSF archivists taught me how to assess a mold bloom. It was truly a fascinating experience to watch as Kelsi and Polina observed the color and smell of the document and determined whether the mold actively presented a hazard to the unaffected materials. This document was sent for professional treatment at the UC Berkeley Library’s Conservation Treatment Division. This is an example of the tacit knowledge possessed by archivists, which only develops through continuous professional practice and education. The mold situation in the archive is akin to unforeseen circumstances arising during a trip. Thanks to the archivists’ expertise, we successfully prevented the other materials from being affected by the mold and kept our archival journey going.

Figure 2: Family camp, 1976. Olney papers, MSS 98-64.

The adventure had the perfect mixture of historical lessons and archival practice. I had the opportunity to learn about Dr. Olney’s experiences as a female pediatrician, social advocate, and director of the Diabetic Youth Foundation (DYF) and its summer camps for diabetic children. As I learned more about the collection, I was able to arrange its photos, pamphlets, and correspondence for future researchers interested not only in Dr. Olney but also in pediatric diabetic patients. Through this immersive experience, I felt as though I had become a part of her camping staff, only in the future. In fact, during the archival arrangement, we also reconstructed the progress of Dr. Olney’s efforts in running the summer camps for decades, notably her hard work in fundraising, staff training, and building relationships with other relevant organizations. Mary Olney was a pioneering pediatrician who not only operated under the broad vision of improving the lives of diabetic children but also employed a practical outlook, doing everything she could to maintain the summer camp for decades.

Figure 3: The cover of Bear Facts, First issue, Second session, Aug 4, 1985. Olney papers, MSS 98-64.

During archival processing, revealing the mystery of certain folders is much like exploring exotic locations while traveling. For example, I was preoccupied with examining several folders in Dr. Olney’s collection that were labeled “loose papers.” Upon examining the documents inside these folders, I found that most of the materials, specifically Bear Facts and Whitaker Whiz, were DYF newsletters, which aimed to improve health communication among young diabetic patients. The DYF newsletter had been published since the early 1940s and targeted young patients; it introduced camping programs, provided health information about diabetes, and featured beautiful artwork and written compositions by these patients.

By relabeling these “loose papers” folders, the archivists were able to provide researchers with more accurate finding aids and with inspiration as well. Imagine that you are visiting a new country and are consulting a number of travel guides; the ones that are written more clearly might contain better suggestions on places to explore, and you might miss these recommendations if you followed the less clear guidebooks. Further, information that is more accurate can enable researchers to ask questions that might never occur to them otherwise. Take the DYF newsletters, for example. How do the articles in Bear Facts and Whitaker Whiz communicate medical knowledge about diet to young patients and their families? Thus, clarifying vague folder names might improve the experience of users and researchers when exploring such archives, thereby enabling them to contemplate new historical questions.

Figure 4: Diet suggestion on Whitaker Whiz, August 22, 1951. Olney papers, MSS 98-64.

The task of processing the archival collection took me on a journey to Northern California with Dr. Olney and the DYF during the twentieth century. It took me back to when and where the materials originated and showed me how they would go on to influence researchers in the future. During her lifetime, Dr. Olney continued with her efforts to translate her expertise and knowledge into useful information for young diabetic patients. It takes the invisible labor of archivists to make these accomplishments visible and highlight all aspects of her persona: a female pediatrician, a camp organizer, a Northern California resident, a daughter, and a woman. This has been possible only through processing this archival collection. Thus, the work of archival practitioners plays a crucial role in enabling future researchers to embark on a journey with Dr. Mary Olney. Let me tell you, it is a fun and interesting ride!

[1] On the life history of Mary Olney, please see Sharon R. Kaufman, 1994. The Healer’s Tale: Transforming Medicine and Culture. Madison, Wisconsin: University of Wisconsin Press. Kelsi Evans, 2015. “Celebrating Food Day: Recipes from the Archives.” Source: https://blogs.library.ucsf.edu/broughttolight/2015/10/23/celebrating-food-day-recipes-from-the-archives/.

New Archives Intern: Marissa Nadeau

Today’s post is an introduction from Marissa Nadeau, our newest intern here in the Archives. She will be working on the upcoming exhibit, Open Wide: 500 Years of Dentistry in Art.

Marissa Nadeau is from the town of Brookfield, Connecticut, and has lived along the East Coast her entire life. Transferring from a university down in South Carolina to one in Connecticut, Marissa ended up receiving her Bachelor of Arts from New York University, majoring in Art History with a double minor in Italian and Creative Writing (2016). During her time in NYC, Marissa interned in galleries and non-profits throughout the Chelsea neighborhood, most notably C24 Gallery and The Kitchen; she helped expand their social media platforms and fell in love with curatorial work by getting the chance to work closely with the team’s curator and contemporary artists. Marissa had the opportunity to study in Florence, Italy, for a semester (2015), which allowed her to adopt a global perspective of museums and the art market.

Marissa uprooted her East Coast ties and moved to San Francisco to follow her passion for curatorial work, and is currently a Master’s candidate in the University of San Francisco’s Museum Studies program. She co-curated Modern Myth: South Asian Modern and Contemporary Works on Paper at the school’s Thacher Gallery in 2017, and has been interning with the Bay Area’s FOR-SITE Foundation since January 2018.

Marissa is excited for her newest role as a Curatorial Intern at the UCSF Archives & Special Collections and she looks forward to gaining a better understanding of archival best practices, while putting her theoretical knowledge to the test. She will be assisting with research, design, and installation of the upcoming exhibit, Open Wide: 500 Years of Dentistry in Art, opening this summer in the Library.

A Terrible Thing to Waste: The Black Caucus and Mental Health Awareness

Introduction by Polina Ilieva

This week’s story comes from Antoine S. Johnson, PhD student, UCSF Department of Anthropology, History and Social Medicine.

Post by Antoine S. Johnson

Historically, racism in America has taken its toll on its victims and UCSF has been no exception, from the black hospital sanitary worker who was restricted to using only the basement bathroom to the qualified medical student denied residency. One month after the assassination of Dr. Martin Luther King, Jr., black employees fought back and formed the Black Caucus. The organization not only fought for equal treatment but also advocated for the reaffirmation of black humanity and an increased awareness about the health impact of racism on its sufferers.

The Black Power Movement of the late 1960s and early 1970s consisted of witty slogans that reasserted black humanity. African Americans shouted slogans such as “Black is Beautiful” as a way to convince themselves that black was not as negative and distasteful as society portrayed it. This had important psychological effects. The Black Caucus ensured black staff, faculty, and students joined the movement with similar quotes and passages in its Black Bulletin. In fact, in a March 1971 edition of the Bulletin, the Caucus adopted its own version of the catchphrase, “Black Beauty You Are the Best.” Promoting blackness in a positive way was a unique way for the Caucus to align itself with larger black issues. To show their solidarity, the Caucus advertised speaking engagements and updates on leaders of the Black Power Movement, including Huey P. Newton, Angela Davis, and Kathleen Cleaver.

The Black Caucus aimed to demonstrate that racism is not only an injustice but can also be hazardous to one’s health. Racial discrimination was stressful, it argued. Also, in a 1972 letter to the chancellor, the Caucus posited, “One theory of psychology revolves around the fact that crises and confrontation are two of the most volatile means of bringing about change.”

Thus, they refused to allow emotional issues to fall by the wayside, even if the university saw such problems as trivial. The stress could also be a trigger for underlying issues, like G6PD deficiency. In G6PD deficiency, stress is one of the main triggers, resulting in abdominal and back pain, as well as fever and fatigue. The genetic disorder destroys red blood cells prematurely, cutting off oxygen traveling to the lungs, shortening one’s breath, and increasing one’s heart rate. In the United States, the condition is most common among men of African descent. Aware of this correlation, in 1971, the Caucus screened nearly 600 people on campus for sickle-cell anemia research and G6PD. This campaign to make visible the damage that stress brought on by racism could do to black people was extended to the community via the Blackman’s Free Clinic on McAllister Street. Racism knows no boundaries, and the Black Caucus wanted to bring awareness to what should be considered a health concern beyond the walls of UCSF.

The health of African Americans, in particular mental health, also influenced the Caucus’ demands to diversify UCSF’s clinical faculty. In a statistical document from 1972, the Black Caucus concluded that one in 670 white American citizens became doctors, compared to one in every 5,000 African Americans. The psychiatry department was one of the main divisions that the Caucus, as well as Edward Weinshel, then-director of the Department of Psychiatry, saw as imperative to the school’s future. In fact, in a letter dated July 14, 1971, Weinshel pleaded for the university’s psychiatry department to recruit more black applicants. Dr. Charles T. Carman, the Acting Dean, responded in less than two weeks, notifying Weinshel that he had sent his letter to the chairman of the psychiatry department, the Assistant Dean for Postdoctoral Affairs, and Joanne Lewis, then the chairperson of the Black Caucus.

A vested interest in black students would result in more licensed black psychiatrists, a field that both the Caucus and Weinshel saw as being in dire need of black physicians to assess the mental and physical characteristics of black patients. More importantly, Weinshel foresaw black psychiatrists assisting members of the Westside Community Mental Health Consortium, home of the “greatest number of black residents in San Francisco as well as significant numbers of other minority groups.”

Indeed, “A mind is a terrible thing to waste”; it is also a terrible thing not to protect. UCSF’s Black Caucus was keenly aware of the potential harm endemic racism could do to black faculty, staff, and students and surrounding community members. By promoting racial pride and bringing attention to the harmful effects of racism, the Black Caucus spearheaded a movement that highlighted the mental implications of racism, offered solutions, and found allies in their struggle who saw avenues through which the Caucus could get involved within and outside of the university.

The Anatomy of an Archive: The Renée Hoffinger Papers

Introduction by Polina Ilieva

This week’s story comes from Aaron J. Jackson, PhD student, UCSF Department of Anthropology, History and Social Medicine. 

Post by Aaron J. Jackson

In the Spring term of 2018, my fellow History of Health Sciences (HHS) students and I in the UCSF Department of Anthropology, History & Social Medicine (DAHSM) had the opportunity to take a class on archival science with the staff of the UCSF Archives and Special Collections. Led by Archivist Polina Ilieva, Ph.D., and DAHSM Assistant Professor Aimee Medeiros, Ph.D., this class provided us with an overview of archival science with an emphasis on theory, methodology, and best practices of archival research, arrangement, and description. Most of us had used archives in the past—I even had experience with the UCSF Archives and Special Collections through a blog on the experiences of Base Hospital No. 30 in the First World War—but few of us really understood how archives work, how collections are cultivated and maintained, or the considerations that go into archival collection, assessment, processing, preservation, and presentation. This class provided us with a rare insight into a sector of knowledge production that is all-too-often taken for granted by historians.

UCSF Archives and Special Collections Reading Room and Parnassus Storage Facility.

Many historians and other scholars (myself included, before this class) believe that archives are mere repositories of historically important data, objective interlocutors who merely preserve the past. Material is collected, inventoried, and stored for future researchers to come along and “discover” the contents and subsequently draw out the stories therein. Yet this is a myth, and one that Drs. Ilieva and Medeiros intended to dispel for their students. To achieve this, students were allowed to choose from a list of as-yet unprocessed collections. We would be assigned an archivist mentor and process the collections while also meeting each week for a seminar discussion on the historical development and modern concerns of archival science. With my own interests rooted in the history of veterans’ care, I chose the Renée Hoffinger papers because the accession record indicated (with my emphasis) “Renee Hoffinger, MHSE, RD worked in the field of substance abuse for over 20 years at the North Florida/South Georgia Veterans Health System in Gainesville, FL.” While I did not find much of use for my own research, what I discovered while processing the Renée Hoffinger papers will undoubtedly prove to be far more beneficial in the long run.

The Provenance of the Renée Hoffinger Papers

Renée Hoffinger, MHSE, RD, image from “Dietetic Career Spotlight: Renée Hoffinger, MHSE, RD,” by Sarah Koszyk, MA, RDN, https://www.nutritionjobs.com/blog/blog/dietetic-career-spotlight-renee-hoffinger-mhse-rd/, accessed June 3, 2018.

Renée Hoffinger has been a dietitian since 1982 and has been interested in nutrition and HIV/AIDS since pursuing a health sciences education in the 1990s. While processing her collection, I had the pleasure of corresponding with Renée about her collection and why she donated her papers to UCSF’s AIDS History Project. She noted that her experience of researching HIV/AIDS and providing care for patients in Gainesville was vastly different, in terms of support and information availability, from that of health professionals in larger cities like New York, San Francisco, and Miami. During her volunteer work at the North Central Florida AIDS Network, Renée said she was “given a desk and access to patients at the HIV clinic at the local health [department], and spent a lot of time at the medical library tracking down any information I could get my hands on…. Not feeling like I knew very much, I soon unwittingly became the local ‘expert’ on nutrition and HIV.” Renée spent the rest of her career working with other dietitians interested in HIV/AIDS, and even after her retirement in 2013, she has continued writing about and leading hands-on nutrition education workshops. She had heard about the UCSF AIDS History Project and reached out to Archivist Polina Ilieva to find out how she could contribute, and so she decided to donate her papers to the archive.

This story reveals more than just the background of how Renée Hoffinger’s papers ended up at UCSF to be processed by a first-year Ph.D. student in the HHS program. It provides an anecdotal example of how collections end up in archives. Polina Ilieva’s background as an archivist does not make her an expert in HIV/AIDS nutrition, but it does give her training and insight into what future researchers may look for when investigating the history of AIDS and how contemporary medicine attempted to address it. Renée Hoffinger’s papers are stored at UCSF because they provide a small window into how parts of the country outside the urban epicenters of the disease and aspects of medicine not usually associated with the disease dealt with the epidemic’s effects. Thus, Ilieva chose to take on the archival responsibility for the Hoffinger papers: to assess their potential value, to inventory and process their contents, to build finding aids that would serve future researchers, and to maintain the artifacts in the collection for the use of future generations. But she could just as easily have chosen to leave the responsibility to others for any number of reasons, including limited archival space and funding, or a sense that the collection would be a better fit elsewhere. In other cases, archivists actively solicit new collections, seeking permission to preserve the data. The decision to donate and accept the papers was therefore only the first step in the archival preservation of data, and it raises the question: what is missing from archival collections, and why?

Archival Concerns and Overhead

A Selection of HIV-AIDS Nutrition Documents from the Hoffinger (Renée) Papers at UCSF.

The story of how the Hoffinger papers came to reside in UCSF’s archives was only the beginning of a journey in what, at times, could seem like a foreign country. The archives have a unique vocabulary and vernacular. Archivists may speak of the accession or deaccession of artifacts or collections. Their language includes terms like “provenance” and “fonds” as well as concepts like “original order” and “finding aids.” Many of these terms may seem somewhat familiar, but their meaning within the archival space can often differ from the assumptions of those outside it, and those meanings can change over time. This is only one of the difficulties that archivists have to navigate in their mission to collect, preserve, and process archival collections. They put a great deal of work into cultivating collections, processing their contents in accordance with laws, regulations, and industry standards, and making the product of that work available to their target audience, which is often the public but may be restricted in some cases. For example, archivists at healthcare institutions like UCSF must pay special attention to the privacy restrictions of the Health Insurance Portability and Accountability Act (HIPAA). They also need to concern themselves with copyright protections and dozens of other concerns, including securing funding and finding the manpower to process and reprocess miles of archival material. For reference, a 12 x 10 x 15 inch banker’s box contains only 1.25 linear feet of material by archival measurement standards, all of which requires storage space that not only protects the archived data but also keeps it accessible to the public. Digitization of archival material puts more stress on archivists’ time and resources, not less, as someone has to digitize the materials and provide for electronic storage and access points, often in addition to caring for the original documents. And all of this can be further complicated by unwilling donors.

Some communities, particularly those that have been traditionally marginalized, are difficult to archive, requiring archivists to build long-neglected relationships and partnerships to preserve those aspects of history. In other cases, such as the UCSF Industry Documents Library, many of the contents are collected through court order from institutions that are less than thrilled to be forced to hand over internal documents. Such collections often require extraordinary processing efforts precisely because the donors are uncooperative, leaving the archivists to do their best to understand and arrange the documents in a useful manner.
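
The banker’s box figure above can be checked with a little arithmetic. The following is a minimal Python sketch, assuming that linear footage is measured along the 15-inch side of the box (that assumption is mine, chosen because it reproduces the 1.25 figure). The function and constant names are made up for illustration and do not come from any archival standard or tool.

```python
import math

INCHES_PER_FOOT = 12
FEET_PER_MILE = 5280


def linear_feet(depth_inches: float, boxes: int = 1) -> float:
    """Shelf length, in feet, taken up by `boxes` boxes that are each
    `depth_inches` long in the direction being measured."""
    return depth_inches * boxes / INCHES_PER_FOOT


# Assumption: the 15-inch side of the 12 x 10 x 15 inch box is the one counted.
BANKERS_BOX_DEPTH_IN = 15

one_box = linear_feet(BANKERS_BOX_DEPTH_IN)

# How many such boxes make up a single mile of archival material, rounded up.
boxes_per_mile = math.ceil(FEET_PER_MILE / one_box)

print(f"One box: {one_box} linear feet")
print(f"Boxes in one mile of material: {boxes_per_mile}")
```

Under that assumption a single mile of material works out to a little over four thousand boxes, which gives a sense of the scale behind the phrase “miles of archival material.”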

The Contents of the Renée Hoffinger Papers

The Hoffinger Collection Contains AIDS Line Documents and Industry Publications.

In the case of the Hoffinger papers, the process was relatively straightforward. Renée Hoffinger was alive and well at the time she deeded her papers to UCSF. The collection includes no patient records, so HIPAA was not a concern. Some of the documents are protected under copyright and therefore not likely to be digitized and posted online, but researchers are always welcome to view the documents in person. Regardless of the relative simplicity of this collection, I realized that what goes into the archives is very much the result of a creative and complicated process of selection, compliance, and access on the part of both the author of the papers and the archivists who collect and process them. In other words, archivists play an important role in precisely what is preserved, and this is something that researchers should keep in mind.

Patient Handouts & North Central Florida AIDS Network Newsletters.

The Hoffinger papers span the years 1980 to 2006 and address nutritional concerns relating to AIDS/HIV wasting syndrome, lipodystrophy, prescription medications, substance abuse, alternative medicine, steroids, protocols, and phosphatidylethanolamine drug combinations known as AL-721 and COQ. Hoffinger also included various publications, among them many AIDS Nutrition Services Association conference materials and presentations, industry and lay press publications, course syllabi, and patient handouts and publications. Her papers reflect more than twenty years of professional work in the interests of her patients. How future researchers use these materials is impossible to predict, but it is important that when they access this collection, they understand the roles played by everyone involved in it. Renée Hoffinger’s selection of materials to donate, UCSF’s willingness to preserve the papers, and the work of a relatively inexperienced history Ph.D. student who helped process the collection and build the finding aid (the collection of metadata that helps researchers find useful materials within the archives) all played an important role in creating, processing, and preserving this information. If you are interested in this collection or others, you can visit the Renée Hoffinger papers at the UCSF Archives and Special Collections. I would also highly encourage anyone interested in the wealth of information available in this collection to provide feedback to the archivists about this collection or any others that you may explore. Would a certain keyword or phrase be useful to others if included on the finding aid? Did you encounter confidential information that was not flagged as such? Did the archives raise questions about potential gaps in the record? These things and others are useful bits of information that the archivists would appreciate.

The Anatomy of an Archive course in the Spring term of 2018 provided students with an invaluable insight into the behind-the-scenes processes of archival work. It helped us identify some professional blind spots and think critically about archival data. It also helped us earn a profound appreciation for all the work that our archivists do for their fellow scholars and for their role in helping to create, not just preserve, the historical record. And if there is one invaluable piece of advice I can pass along, it is this: when starting your research, always ask an archivist for help. They know their archives better than anyone else and asking their advice will likely save hours of frustration and/or bear unforeseen fruit. And when you ask them for help, make sure to ask about the provenance of the collections you research. It will not only show that you appreciate their work but also provide you with invaluable information for how you approach your research.

Acknowledgements

This blog post was possible not only because it was a requirement on the syllabus, but because this course provided the author with a novel opportunity to peek behind the curtain. My sincerest thanks go to Dr. Aimee Medeiros and archivists Dr. Polina Ilieva, Kelsi Evans, and David Uhlich for making this experience possible and to Renée Hoffinger for being so indulgent with a graduate student’s questions. I would also like to extend appreciation to UCSF digital archivist Charlie Macquarie and Dr. Mario Ramirez of Indiana University for taking the time to join our seminar session discussions and to the members of the Archivists and Librarians in the History of Health Sciences association for so warmly welcoming a historian like me among their ranks. I will endeavor to do for my students what all of you have done for me. Thank you.

New Archives Intern: Lauren Wolters

Lauren Wolters

Lauren Wolters is a rising junior undergraduate student at Skidmore College. She is double majoring in History and Psychology and is interested in learning the basics of archival theory and practice. Being a history major, Lauren is fascinated by old artifacts and is excited to have the unique opportunity to work with collections that are not always available to the public eye. Currently, she is assisting by taking inventory of a collection of photographs and organizing a digital list of metadata. Eventually, she will move on to assist with a project relating to the Langley Porter Psychiatric Institute Records. This project is perfectly tailored to her interests, as it combines her two majors.

Lauren was born and raised in San Francisco, CA. She plays volleyball at Skidmore College and enjoys photography as a hobby. Lauren is enjoying working in the library with the archivists and looks forward to learning even more about the archives.

Experiments with Digital Tools in the Archives — OCR

Working on digital “stuff” in the archives is always fascinating, because it blurs the borders between digital and physical. Most of the work that takes up my time is at these borders. Physical stuff requires lots of human touches to transition to “digital,” and digital stuff similarly requires lots of tending by humans to ensure that it is preserved physically. After all, the 1s and 0s are stored physically somewhere, even if on the cloud or in DNA.

We’re currently working on several projects to convert physical materials to digital text. The huge quantities of rich and complicated textual material in archival collections are full of potential for use as data in both computational health research and digital medical humanities work, but to be usable for these kinds of projects the material needs to be converted to digital text or data so that it can be interpreted by computers. To get to this point the documents must be scanned, and the scanned documents must either be transcribed, which can be immensely labor intensive, or converted directly by computers using software that can perform Optical Character Recognition, or OCR. One of our projects using OCR to extract text from a document provides a fascinating look into the world of computer vision.

A pen and ink illustration of the lungs and a lymph gland from the Ralph Sweet Collection of Medical Illustrations

An example of the illustrations in the Ralph Sweet Collection

The Ralph Sweet Collection of Medical Illustration contains extraordinary examples of the work of one of the most renowned medical illustrators in the United States, so we’re working on digitizing the collection and putting it online. To do this we need detailed metadata (the kind of information you might expect to find in a catalog record: title, date, author) about each illustration. Currently this metadata for the Sweet Collection exists only in the form of a printed index that was written on a typewriter. We can scan the index, but we do not have the labor to transcribe each of the 2500 or so entries. This is a job for OCR.

The image below shows what a page of the Ralph Sweet index looks like. This is the metadata that we want to be able to extract and turn into digital text so that it can be understood by a computer and used as data.

A page of a typewritten index of the Ralph Sweet Collection, showing metadata about each illustration in the collection.

A page of the index for the Ralph Sweet Collection.

One of the first problems we encountered in attempting to extract text from this document is a classic difficulty of computer vision. As English-speaking humans, we know by looking at this document that it contains three columns, and that the order in which to read the page is top to bottom by column, starting on the left and moving right. To a computer, however, it is simply a page full of text, and there is no way to know whether the text is broken into columns or whether each line should be read all the way across the page. This simple task presented a difficulty for most of the software that we tested, but we found one program that could identify these columns easily. The software is called Tesseract; it was originally developed in the 1980s but continues to be a good open-source tool for performing OCR.
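To make the column problem concrete, here is a minimal Python sketch. It is not how Tesseract works internally; the word positions are invented, and the `gap` value is an assumption chosen to suit this toy page. It simply contrasts reading straight across the page with grouping words into columns first.

```python
from typing import List, Tuple

# Each word is (text, left_x, top_y). The pixel positions are made up.
Word = Tuple[str, int, int]

def read_across(words: List[Word]) -> List[str]:
    """Naive order: top to bottom, then left to right, ignoring columns."""
    return [w[0] for w in sorted(words, key=lambda w: (w[2], w[1]))]

def read_by_column(words: List[Word], gap: int = 100) -> List[str]:
    """Start a new column wherever the left edge jumps by more than `gap`
    pixels, then read each column top to bottom."""
    columns: List[List[Word]] = []
    for word in sorted(words, key=lambda w: w[1]):
        if columns and word[1] - columns[-1][-1][1] <= gap:
            columns[-1].append(word)
        else:
            columns.append([word])
    ordered: List[str] = []
    for column in columns:
        ordered.extend(w[0] for w in sorted(column, key=lambda w: (w[2], w[1])))
    return ordered

# A toy three-column page with two entries per column.
page = [
    ("Liver", 50, 10), ("Lungs", 400, 10), ("Spine", 750, 10),
    ("Kidney", 50, 40), ("Heart", 400, 40), ("Skull", 750, 40),
]

print(read_across(page))     # mixes entries from all three columns
print(read_by_column(page))  # keeps each column's entries together
```

On a real scan the columns are rarely this tidy, which is part of why the task trips up so much software.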

If we plug the above page into Tesseract, we get mostly recognizable text, which in itself is pretty miraculous when you think about it. Looking at the text though, it quickly becomes clear that it is not an exact transcription of what’s on the page. There are misspellings (“Iivev”), and some chunks of text have been separated from the entry in which they should appear (“horizontal”).

An image of the text output of the software Tesseract showing errors in transcription.

An example of the text extracted from the Ralph Sweet Collection Index by Tesseract.

Digging into the way that Tesseract (and OCR software more generally) works can help us begin to understand why these errors are cropping up. Plus, it looks really cool.

OCR programs have to go through a set of image manipulation processes to help them decide which marks on the page are text, and hence should be interpreted, and which are other marks that can be ignored. This all happens behind the scenes, and usually involves deciding what the background parts of the image are and blurring them out, increasing the image contrast, and making the image bi-tonal so that everything on the page is only black or white. Then, the computer can trace the black pixels on the image and get a series of shapes which it can then attempt to interpret as text. The image below shows the shapes that Tesseract has identified as letters and traced out for interpretation. Each new color indicates that the computer believes it has moved on to a new letter.

A page of colorful text on a black background illustrating the text that has been automatically traced from the Ralph Sweet Index by the computer program Tesseract.

The result of Tesseract tracing the letters it has interpreted. Each new color is something that’s been identified as a new letter.

Interestingly, when comparing the computer tracing of the letters to the original image you can see that Tesseract has already made the assumption that the black spaces from the three-hole punch in the top of the page are not letters, and thus it should not bother to trace them. Lucky for us, that’s correct.
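The two steps described above, making the image bi-tonal and then tracing the dark pixels into shapes, can be sketched in a few lines of Python. This is a simplified illustration under our own assumptions (a fixed threshold of 128 and a tiny invented "scan"), not Tesseract's actual implementation.

```python
from typing import List, Tuple

def binarize(gray: List[List[int]], threshold: int = 128) -> List[List[int]]:
    """Make the image bi-tonal: 1 for dark (ink) pixels, 0 for light background."""
    return [[1 if value < threshold else 0 for value in row] for row in gray]

def trace_shapes(binary: List[List[int]]) -> List[List[Tuple[int, int]]]:
    """Collect each group of touching ink pixels (up, down, left, right) as one shape."""
    height = len(binary)
    width = len(binary[0]) if binary else 0
    seen = [[False] * width for _ in range(height)]
    shapes: List[List[Tuple[int, int]]] = []
    for y in range(height):
        for x in range(width):
            if binary[y][x] == 0 or seen[y][x]:
                continue
            # Flood outward from this pixel to gather the whole shape.
            stack, shape = [(y, x)], []
            seen[y][x] = True
            while stack:
                cy, cx = stack.pop()
                shape.append((cy, cx))
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    inside = 0 <= ny < height and 0 <= nx < width
                    if inside and binary[ny][nx] == 1 and not seen[ny][nx]:
                        seen[ny][nx] = True
                        stack.append((ny, nx))
            shapes.append(shape)
    return shapes

# A tiny made-up grayscale "scan": 0 is black, 255 is white.
scan = [
    [250,  20, 245, 240,  30,  25],
    [248,  15, 250, 243,  35, 240],
    [251,  18, 247, 246,  28,  22],
]
binary = binarize(scan)
shapes = trace_shapes(binary)
print(len(shapes))  # number of separate shapes found
```

Each shape is just a list of pixel coordinates; deciding which letter a shape represents, or whether it is a letter at all, is a separate and much harder step.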

Next the computer has to take all these letters and turn them into words. In actual practice it’s not quite this simple, but basically the computer iterates on each letter identification that it believes it has made by testing whether or not the resulting word is in its dictionary, and thus whether or not it is likely to be a word. If the combination of letters that the computer thinks it sees is not a word, then it will go back and make a new guess about the letters and test whether or not that’s a word, and so on. Part of this whole process is to chunk the letters into words using their actual spacing on the page. Below you can see an image of how Tesseract has begun to identify words using the spaces between lines and letters.

A view of a page of the Ralph Sweet Index showing each word as a blue rectangle encompassing the space taken up by that block of text against a black background -- the "word" output of the OCR program Tesseract.

The “words” that the OCR software has identified on the page. Each blue rectangle represents a space that Tesseract has marked as containing a word.
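The guess-and-check loop against a dictionary can be sketched as follows. This is a toy illustration only: the `LOOKALIKES` table and the three-word dictionary are hypothetical, and real OCR engines weigh many more alternatives with confidence scores.

```python
from itertools import product
from typing import Dict, List, Optional, Set

# Hypothetical table of characters an OCR pass might confuse with one another.
LOOKALIKES: Dict[str, List[str]] = {
    "I": ["I", "l", "1"],
    "l": ["l", "I", "1"],
    "v": ["v", "r", "y"],
}

def best_guess(raw: str, dictionary: Set[str]) -> Optional[str]:
    """Try look-alike spellings of `raw` and return the first one in `dictionary`."""
    options = [LOOKALIKES.get(ch, [ch]) for ch in raw]
    for letters in product(*options):
        candidate = "".join(letters)
        if candidate.lower() in dictionary:
            return candidate
    return None  # no spelling matched, so the raw guess would stand

# A made-up, three-word dictionary.
words = {"liver", "lung", "water"}

print(best_guess("Iivev", words))  # the misreading mentioned earlier
print(best_guess("xyz", words))    # nothing in the dictionary matches
```

With this toy table, the misread "Iivev" resolves to "liver" because swapping the look-alike letters eventually produces a dictionary word.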

In addition to checking the word against the dictionary though, most OCR programs also use the context of the surrounding words to attempt to make a better guess about the text. In most cases this is helpful — if the computer has read a sentence that says “the plant requires wader” it should be a relatively easy task to decide that the last word is actually “water.” In our case though, this approach breaks down. The text we want the computer to extract in this case is not sentences, but rather (meta)data. The meaning of the language has little influence on how each individual word should be read. One of the next steps for us will be trying to figure out how to better instruct Tesseract about the importance of context in making word-identification decisions (i.e., that it’s not important).
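One documented knob we may try is switching off Tesseract's built-in word lists through its `load_system_dawg` and `load_freq_dawg` configuration variables, which the Tesseract documentation suggests when most of the text is not made up of dictionary words. The sketch below only assembles the command line and does not run it. The file names are hypothetical, and we have not yet verified whether this setting improves results on the Sweet index.

```python
from typing import List

def build_tesseract_command(image_path: str, output_base: str,
                            use_dictionary: bool = True) -> List[str]:
    """Assemble a Tesseract command line, optionally disabling its word lists."""
    args = ["tesseract", image_path, output_base]
    if not use_dictionary:
        # Documented Tesseract config variables, set with the -c flag.
        args += ["-c", "load_system_dawg=false", "-c", "load_freq_dawg=false"]
    return args

# Hypothetical file names, for illustration only.
command = build_tesseract_command("sweet_index_page.png", "sweet_index_page",
                                  use_dictionary=False)
print(" ".join(command))
```

If Tesseract is installed, the resulting list could be handed to `subprocess.run(command, check=True)` to produce a text file next to the image.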

Finally, as the OCR software interprets the text it also identifies blocks of words that it believes should be grouped together, like a paragraph. Below you can see the results of this process with Tesseract.

A view of the different elements of Tesseract's text identification showing letters traced in primary colors and contained in yellow bounding boxes, words set against blue rectangles outlining the space they encompass, and blocks of text outlined in larger bounding boxes and numbered -- all of this set against a black background.

This view shows all of the elements of Tesseract’s word identification combined. Text has been traced in color, separate letters are contained in bounding boxes, words are contained in blue rectangles, and blocks are contained in larger bounding boxes and are numbered (though the numbers are a bit difficult to see).

A line has been drawn around each block of text, and it has been given a number indicating the order in which the computer is reading it. Here we can see the source of one of the biggest problems of the OCR-generated text from earlier. Tesseract is inaccurately excluding a lot of words from their proper blocks. In the image above, the word “Pen” is a good example. It is a part of block 20, but it has been interpreted by the computer as its own block (block 21) and has been set aside to appear in the text file after block 20. Attempting to solve this problem will be our next hurdle, and hopefully we can catch you up after we are successful.
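A small Python sketch shows why a stray block displaces a word in the output. The entry text, block numbers, and positions below are invented for illustration (they are not the actual contents of the index), and this demonstrates the problem, not a fix we have tested.

```python
from typing import List, Tuple

# (block_number, line_number, left_x, text). All values are invented.
Word = Tuple[int, int, int, str]

words: List[Word] = [
    (20, 1, 10, "Liver,"),
    (20, 1, 90, "section."),
    (21, 1, 200, "Pen"),   # split off into its own block by mistake
    (20, 2, 10, "and"),
    (20, 2, 60, "ink."),
]

def text_in_block_order(items: List[Word]) -> str:
    """Emit all of one block before moving on to the next, as the text file does."""
    return " ".join(w[3] for w in sorted(items, key=lambda w: (w[0], w[1], w[2])))

def text_in_line_order(items: List[Word]) -> str:
    """Ignore block numbers and read line by line, left to right."""
    return " ".join(w[3] for w in sorted(items, key=lambda w: (w[1], w[2])))

print(text_in_block_order(words))  # "Pen" lands after the rest of block 20
print(text_in_line_order(words))   # "Pen" stays where it sits on the page
```

Reading by line position works here only because the toy entry sits in a single column; on the full three-column page the block boundaries still matter, which is what makes the real problem harder.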

Using OCR to extract text from digital images can be a frustrating endeavor if accuracy is a necessity, but it is also a fascinating illustration of the way computers see and make decisions. Any time we ask computers to perform tasks that interface with humans, we will be grappling with similar issues.