Analytics with GenAI: Know Your Data!


Post by Geoffrey Boushey, Head of Data Engineering, UCSF Library’s Data Science and Open Scholarship Team. He teaches and consults on programming for data pipelines with an emphasis on Python, Unix and SQL.

Teaching Approaches to Data Science

Recently, Ariel Deardorff (director of the Data Science & Open Scholarship team at the UCSF Library) forwarded me a paper titled “Generative AI for Data Science 101: Coding Without Learning to Code.” In this paper, the authors described how they used GitHub Copilot, a tool for generating code with AI, to supplement a fundamentals of data science course for MBA students, most of whom had no prior coding experience. Because the instructors wanted students to use AI to generate code for analysis, but not to produce the full analysis itself, they opted for a tool that generates code rather than risk “opening the Pandora’s box too wide” with ChatGPT, which might blur the line between coding and analysis. They also deliberately de-emphasized the code itself, encouraging students to focus on analytical output rather than scrutinizing the R code line by line.

This approach has some interesting parallels, along with some key differences, with the way I teach programming at the UCSF Library through the “Data and Document Analysis with Python, SQL, and AI” series. These workshops are attended largely by graduate students, postdocs, research staff, and faculty (people with an exceptionally strong background in research and data science) who are looking to augment their programming, machine learning, and AI skills. These researchers don’t need me to teach them science (it turns out UCSF scientists are already pretty good at science), but they do want to learn how to leverage programming and AI developments to analyze data. In these workshops, which include introductory sessions for people who have not programmed before, I encourage participants to generate their own AI-driven code. However, I have always strongly emphasized the importance of closely reviewing any code generated for analytics or data preparation, whether pulled from online examples or created through generative AI.

The goal is to engage researchers with the creative process of writing code while also guarding against biases, inaccuracies, and unintended side effects (these are issues that can arise even in code you write yourself). Although the focus on careful examination contrasts with the approach described in the paper, it made me wonder: what if I diverged in the other direction and bypassed code altogether? If the instructors were successful in teaching MBA students to generate R code without scrutinizing it, could we skip that step entirely and perform the analysis directly in ChatGPT?

Experimental Analysis with ChatGPT

As a personal experiment, I decided to recreate an analysis from a more advanced workshop in my series, where we build a machine learning model to evaluate the impact of various factors on the likelihood of a positive COVID test using a dataset from Carbon Health. I’ve taught several iterations of this workshop, starting well before Generative AI was widely available, and have more recently started integrating GenAI-generated code into the material. But this time, I thought I’d try skipping the code entirely and see how AI could handle the analysis on its own.

I got off to a slightly rocky start with data collection. The COVID clinical data repository contains a year’s worth of testing data split into shorter CSV files covering more limited (weekly) time periods, and I was hoping I could convince ChatGPT to infer this structure from a general link to the GitHub repository and glob all the CSV files sequentially into a pandas dataframe (a tabular data structure). This process of munging and merging data, while common in data engineering, can be a little complicated, as GitHub provides both a human-readable and a “raw” view of CSV files. Pandas needs the raw link, which requires multiple clicks through the GitHub web interface to access. Unfortunately, I was unsuccessful in coaxing ChatGPT into reading this structure, and eventually decided to supply ChatGPT with a direct link to the raw file for one of the CSV files [2]. This worked, and ChatGPT now had a pandas dataframe with about 2,000 COVID test records. Ideally, I’d do this with the full ~100k row set, but for this experiment, 2,000 records was enough.
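For readers who prefer to reproduce this step in code rather than through the chat interface, here is a minimal sketch of loading one of the raw CSV files into pandas. The repository URL and file names below are placeholders, not the actual paths, and the merging step mirrors the glob-based approach described in footnote [2].

    import pandas as pd

    # GitHub's regular web view of a CSV is an HTML page; pandas needs the
    # "raw" link served from raw.githubusercontent.com instead.
    # The organization, repository, and file names below are illustrative placeholders.
    BASE = "https://raw.githubusercontent.com/<org>/<repo>/master/data/"

    # Load a single weekly file, as in the experiment described above.
    df = pd.read_csv(BASE + "covid_tests_week_01.csv")
    print(df.shape)

    # Combining several weekly files is then a matter of concatenation;
    # the file list is hard-coded here for clarity.
    weekly_files = [f"covid_tests_week_{i:02d}.csv" for i in range(1, 5)]
    df_all = pd.concat(
        (pd.read_csv(BASE + name) for name in weekly_files),
        ignore_index=True,
    )
    print(df_all.shape)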

Key Findings & Limitations

Now that I had some data loaded, I asked ChatGPT to rank the features in the dataset based on their ability to predict positive COVID test results. The AI generated a reasonably solid analysis without any need for code. ChatGPT suggested using logistic regression and produced a ranked list of features. When I followed up and asked ChatGPT to use a random forest model and calculate feature importances, it did so immediately, even offering to create a bar chart to visualize the results—no coding required.
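For comparison with a code-based workflow, here is a minimal sketch of what this kind of feature ranking might look like when written by hand with scikit-learn. It assumes a cleaned dataframe df with a binary target column named covid_test_result; the column name and the preprocessing choices are illustrative assumptions, not the dataset’s actual schema or ChatGPT’s actual procedure.

    import numpy as np
    import pandas as pd
    from sklearn.linear_model import LogisticRegression
    from sklearn.preprocessing import StandardScaler

    # df is the dataframe loaded earlier; the target column name is illustrative.
    X = df.drop(columns=["covid_test_result"]).select_dtypes("number")
    y = df["covid_test_result"]

    # Fill missing values and standardize so coefficient magnitudes are comparable.
    X_scaled = StandardScaler().fit_transform(X.fillna(X.median()))

    model = LogisticRegression(max_iter=1000).fit(X_scaled, y)

    # Rank features by the absolute size of their standardized coefficients.
    ranking = pd.Series(np.abs(model.coef_[0]), index=X.columns).sort_values(ascending=False)
    print(ranking)

Every step here (imputation, scaling, the ranking criterion) is visible and open to review, which is exactly the part the chat-based workflow keeps out of sight.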

Here is the bar chart generated by ChatGPT showing the feature importances, with the inclusion of oxygen saturation but the notable omission of loss of smell:

Bar chart generated by ChatGPT showing Random Forest Feature Importances for Predicting COVID Test Result.

One feature ChatGPT identified as highly significant was oxygen saturation, which I had overlooked in my prior work with the same dataset. This was a moment of insight, but there was one crucial caveat: I couldn’t validate the result in the usual way. Typically, when I generate code during a workshop, we can review it as a group and debug it to ensure that the analysis is sound. But in this no-code approach, the precise stages of the process were hidden from me. I didn’t know exactly how the model had been trained, how the data had been cleaned or missing values imputed, how the feature importances had been calculated, or whether the results had been cross-validated. Nor did I have access to the underlying feature importance scores the way I would with a model (such as a random forest) that I had built and trained myself. The insight was valuable, but it was hard to fully trust or even understand it without transparency into the process.
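For contrast, here is roughly what those hidden stages look like when written explicitly. This is a hedged sketch under the same illustrative assumptions as the previous example (a dataframe df and a covid_test_result target column), not a reconstruction of what ChatGPT actually did.

    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.impute import SimpleImputer
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline

    X = df.drop(columns=["covid_test_result"]).select_dtypes("number")
    y = df["covid_test_result"]

    # Each modeling decision is explicit: how missing values are imputed,
    # which model is trained, and with what settings.
    pipeline = make_pipeline(
        SimpleImputer(strategy="median"),
        RandomForestClassifier(n_estimators=200, random_state=0),
    )

    # Cross-validation makes the evaluation protocol visible and repeatable.
    scores = cross_val_score(pipeline, X, y, cv=5)
    print("Mean CV accuracy:", round(float(scores.mean()), 3))

    # Refit on the full dataset to inspect feature importances directly.
    pipeline.fit(X, y)
    importances = pd.Series(
        pipeline.named_steps["randomforestclassifier"].feature_importances_,
        index=X.columns,
    ).sort_values(ascending=False)
    print(importances)

With the code in hand, a question like whether loss_of_smell was actually included in the feature set can be answered by inspecting X.columns rather than by asking the model to explain itself.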

Screenshot from ChatGPT.

This lack of transparency became even more apparent when I asked ChatGPT about a feature that wasn’t showing up in the results: loss_of_smell. When I mentioned, through the chat interface, that this seemed a likely predictor for a positive test and asked why it hadn’t been included, ChatGPT told me that this feature would indeed be valuable and articulated why, but repeated that it wasn’t part of the dataset. This surprised me, as it was in the dataset under the column name “loss_of_smell.”

(The full transcript of this interaction, including the AI’s responses and corrections, can be found in the footnote [1] below).

This exchange illustrated both the potential and the limitations of AI-powered tools. The tool was quick, efficient, and pointed me to a feature I hadn’t considered. But it still needed human oversight. Tools like ChatGPT can miss straightforward details or introduce small errors that a person familiar with the data might easily catch. They can also introduce far more obscure errors, detectable only after very careful examination of the output by someone with a much deeper knowledge of the data.

The Importance of Understanding Your Data

The experience reinforced a key principle I emphasize in teaching: know your data. Before jumping into analysis, it’s important to understand how your data was collected, how it’s been processed, what (ahem) a data engineering team may have done to it as they prepared it for you, and what it represents. Without that context, you may not be able to tell when AI or other tools have led you in the wrong direction or overlooked something critical.

While the experiment I conducted with AI analysis was fascinating and demonstrates the potential of low- or no-code approaches, it underscores, for me, the continued importance of generating and carefully reading code during my programming workshops. Machine learning tasks related to document analysis, such as classification, regression, and feature analysis, involve a very precise set of instructions for gathering, cleaning, formatting, processing, analyzing, and visualizing data. Generative AI tools often provide quick results, but the precision these steps require is obscured from the user. This lack of transparency carries significant implications for repeatable research and proper validation. For now, it remains crucial to have access to the underlying code and understand its workings to ensure thorough validation and review.

Conclusions & Recommendations

Programming languages will likely remain a crucial part of a data scientist’s toolkit for the foreseeable future. But whether you are generating and verifying code or using AI directly on a dataset, keep in mind that the core of data science is the data itself, not the tools used to analyze it.

No matter which tool you choose, the most important step is to deeply understand your data – its origins, how it was collected, any transformations it has undergone, and what it represents. Take the time to engage with it, ask thoughtful questions, and stay vigilant about potential biases or gaps that could influence your analysis. (Is it asking too much to suggest you love your data? Probably. But either way, you might enjoy the UC-wide “Love Data Week” conference).

In academic libraries, much of the value of archives comes from the richness of the objects themselves—features that don’t necessarily come through in a digital format. This is why I encourage researchers to not just work with digital transcriptions, but to also consider the physicality of the data: the texture of the paper, the marks or annotations on the margins, and the context behind how that data came to be. These details often carry meaning that isn’t immediately obvious in a dataset or a plain text transcription. Even in the digital realm, knowing the context, understanding how the data was collected, and remaining aware of the possibility of hidden bias are essential parts of the research process. [3] [4] Similarly, when working with archives or historical records, consider the importance of engaging with the data beyond just the text transcript or list of AI-detected objects in images.

Get to know your data before, during, and after you analyze it. If possible, handle documents physically and consider what you may have missed. Visit the libraries, museums, and archives [5] where objects are physically stored, and talk to the archivists and curators who work with them. Your data will tend to outlast the technology you use to analyze it, and while the tools and techniques you use for analysis will evolve, your knowledge of your data will form the core of its long-term value to you.

  1. GPT transcript: https://github.com/geoffswc/Know-Your-Data-Post/blob/main/GPT_Transcript.pdf
  2. In the workshop series, we use Python to merge the separate CSV files into a single pandas dataframe using the glob module. It wouldn’t be difficult to do this and then resume working with ChatGPT, though it does demonstrate the difficulty of completing an analysis without any manual intervention through code (for now).
  3. “Bias and Data Loss in Transcript Generation,” UCTech, 2023: https://www.youtube.com/watch?v=sNNrx1i96wc
  4. “Leveraging AI for Document Analysis in Archival Research and Publishing,” It’s About a Billion Lives Symposium, 2025 (recording to be posted): https://tobacco.ucsf.edu/it%E2%80%99s-about-billion-lives-annual-symposium
  5. https://www.library.ucsf.edu/archives/ucsf/

“Data for All, For Good, Forever”: Working Towards Sustainable Digital Preservation at the iPRES 2022 Conference

iPRES 2022 banner

The 18th International Conference on Digital Preservation (iPRES) took place from September 12-16, 2022, in Glasgow, Scotland. First convened in 2004 in Beijing, iPRES has been held on four different continents and aims to embrace “a variety of topics in digital preservation – from strategy to implementation, and from international and regional initiatives to small organisations.” Key values are inclusive dialogue and cooperative goals, which were very much centered in Glasgow thanks to the goodwill of the attendees, the conference code of conduct, and the significant efforts of the remarkable Digital Preservation Coalition (DPC), the iPRES 2022 organizational host.

I attended the conference in my role as the UCSF Industry Documents Library’s managing archivist to gain a better understanding of how other institutions are managing and preserving their rapidly-growing digital collections. For me and for many of the delegates, iPRES 2022 was the first opportunity since the COVID pandemic began to join an in-person conference for professional conversation and exchange. It will come as no surprise to say that gathering together was incredibly valuable and enjoyable (in no small part thanks to the traditional Scottish ceilidh dance which took place at the conference dinner!). The Program Committee also did a fantastic job designing an inclusive online experience for virtual attendees, with livestreamed talks, online social events, and collaborative session notes.

Session themes focused on Community, Environment, Innovation, Resilience, and Exchange. Keynotes were delivered by Amina Shah, the National Librarian of Scotland; Tamar Evangelestia-Dougherty, the inaugural director of the Smithsonian Libraries and Archives; and Steven Gonzalez Monserrate, an ethnographer of data centers and PhD Candidate in the History, Anthropology, Science, Technology & Society (HASTS) program at the Massachusetts Institute of Technology.

Every session I attended was excellent, informative, and thought-provoking. To highlight just a few:

Amina Shah’s keynote “Video Killed the Radio Star: Preserving a Nation’s Memory” (featuring the official 1980 music video by the Buggles!) focused on keeping up with the pace of change at the National Library of Scotland by engaging with new formats, new audiences, and new uses for collections. She noted that “expressing value is a key part of resilience” and that the cultural heritage community needs to talk about “why we’re doing digital preservation, not just how.” This was underscored by her description of our world as a place where the truth is under attack, where capturing the truth and finding a way to present it is crucial, and where it is equally crucial that this work be done by people who aren’t trying to profit from it.

“Green Goes with Anything: Decreasing Environmental Impact of Digital Libraries at Virginia Tech,” a long paper presented by Alex Kinnaman as part of the wholly excellent Environment 1 session, examined existing digital library practices at Virginia Tech University Libraries and explored changes in documentation and practice that will foster a more environmentally sustainable collections platform. These changes include choosing the least energy-consumptive hash algorithms (MD4 and MD5) for file fixity checks; choosing cloud storage providers based on their environmental practices; including the environmental impact of a digital collection as part of appraisal criteria; and several other practical and actionable recommendations.

The Innovation 2 session included two short papers (by Pierre-Yves Burgi and by Euan Cochrane) and a fascinatingly futuristic panel discussion posing the question “Will DNA Form the Fabric of our Digital Preservation Storage?” (Special mention also to the Resilience 1 session, which presented proposed solutions for preserving records of nuclear decommissioning and nuclear waste storage for the very long term – 10,000 years!)

Tamar Evangelestia-Dougherty’s keynote “Digital Ties That Bind: Effectively Engaging With Communities For Equitable Digital Preservation Ecosystems” was an electric presentation that called unequivocally for centering equity and inclusion within our digital ecosystems, and for recognizing, respecting, and making space for the knowledge and contributions of community archivists. She called out common missteps in digital preservation outreach to communities, and challenged all those listening to “get more people in the room” to include non-white, non-Western perspectives.

“’…provide a lasting legacy for Glasgow and the nation’: Two years of transferring Scottish Cabinet records to National Records of Scotland,” a short paper by Garth Stewart in the Innovation 4 session, touched on a number of challenges very familiar to the UCSF Industry Documents Library team! These included the transfer of a huge volume of recent and potentially sensitive digital documents, in redacted and unredacted form; a need to provide online access as quickly as possible; serving the needs of two major access audiences – the press, and the public; normalizing files to PDF in order to present them online; and dealing with incomplete or missing files.

And there was so much more, summarized by the final keynote speaker Steven Gonzalez Monserrate after his fantastical storytelling closing talk on the ecological impact of massive terrestrial data centers and what might come after “The Cloud” (underwater data centers? Clay tablets? Living DNA storage?). And I didn’t even mention the Digital Preservation Bake Off Challenge!

After the conference I also had the opportunity to visit the Archives of the Royal College of Physicians and Surgeons of Glasgow, where our tour group was welcomed by the expert library staff and shown several fascinating items from their collections, including an 18th century Book of Herbal Remedies (which has been digitized for online access).

After five collaborative and collegial days in Glasgow, I’m looking forward to bringing these ideas back to our work with digital archival collections here at UCSF. Many thanks to iPRES, the DPC, the Program Committee, the speakers and presenters, and all the delegates for building this wonderful community for digital preservation!

An 18th-century Book of Herbal Remedies on display at the Archives of the Royal College of Physicians and Surgeons of Glasgow

Pioneering Child Studies: Digitizing and Providing Access to Collection of Women Physicians Who Spearheaded Behavioral and Developmental Pediatrics Update

We are at the one-year point of the project Pioneering Child Studies: Digitizing and Providing Access to Collection of Women Physicians who Spearheaded Behavioral and Developmental Pediatrics. UCSF Archives & Special Collections and UC Merced have made significant headway towards our goal of digitizing and publishing 68,000 pages from the collections of Drs. Hulda Evelyn Thelander, Helen Fahl Gofman, Selma Fraiberg, Leona Mayer Bayer, and Ms. Carol Hardgrove.

To date we have digitized over 33,000 pages. The digitized materials are still undergoing quality assurance (QA) procedures. Here are some items we have digitized so far.

Dr. Leona Mayer Bayer

This collection features the professional correspondence of Dr. Leona Mayer Bayer. Her work focused on child development, human growth, and the psychology of sick children.

Dr. Selma Horwitz Fraiberg

This collection includes several drafts of her research papers on important aspects of developmental-behavioral pediatrics.

In the next year we will continue digitizing and will soon publish our collections on Calisphere.  Stay tuned for our next update.

Creating a Collection: Chaplaincy Services at the University of California, San Francisco

Introduction by Polina Ilieva:

After a four-year break, last semester the archives team hosted a History of Health Sciences course, the Anatomy of an Archive. This course was developed and co-taught by Aimee Medeiros, Associate Professor in the Department of Humanities and Social Sciences, and Polina Ilieva, Associate University Librarian for Collections and UCSF Archivist. Charlie Macquarie, Digital Archivist, facilitated the discussion on Digital Projects. Polina, Peggy Tran-Le, Research and Technical Services Managing Archivist, and Edith Escobedo, Processing Archivist, served as mentors for students’ processing projects throughout the duration of the course.

The goal of this course was to provide an overview of archival science with an emphasis on the theory, methodology, technologies and best practices of archival research, arrangement and description. The archivists put together a list of collections requiring processing and also corresponding to students’ research interests and each student selected one that they worked on with their mentor to arrange and create a finding aid. During this 11-week hybrid course students developed competencies related to researching and describing archival collections, as well as interpreting the historical record. At the conclusion of this course students wrote a story about their experiences highlighting collections they processed. In the next few weeks, we will be sharing these stories with you.

This week’s story comes from Alexzandria Simon, PhD student, UCSF Department of History & Social Sciences.

Post by Alexzandria Simon:

Having never stepped into any kind of archival space or discussion, I was excited to engage with, learn about, and understand what archives are and mean. Now, after working with Polina Ilieva and Aimee Medeiros at UCSF, I realize all the intricacies, time, and special attention that go into the archival collection process. There are practices and standards that guide researchers and archivists, and emotions and ethics play a role in shaping collections and entire archives. The journey of processing a collection is time-consuming, interdisciplinary, and sometimes messy. However, the craft of processing a collection allows individuals to discover new characters, information, and stories that took place in a different time and space.

When I saw my collection for the first time, all I could think to myself was how small it is. I was surprised, after seeing others that consisted of five boxes’ worth of documents, that the one I was planning to work on could be confined to one file. I could not begin to comprehend how the file could tell such a large story. I began flipping through all the documents, photographs, and pamphlets and skimming through the letters and correspondence, trying to put all the pieces together. The file had no organizational layout, so my priority was to put everything in chronological order. I wanted to understand the starting point and the ending point. What I came to discover is that collections do not always have a solid beginning and a clear conclusion. Stories sometimes begin right in the middle and then end abruptly, leaving many questions.

Figure 1: “Physician – Patient – Pastor” Pamphlet, San Francisco Medical Center, May 1961, Chaplaincy Services at UCSF, MSS 22-03.


The Chaplaincy Services at UCSF Collection began with correspondence between UCSF administrators interested in starting a chaplaincy program. They sought to understand how chaplains, priests, and rabbis could have a role in their hospital space and provide services to patients. What they came to learn and understand from informational pamphlets is that the connection between chaplains and patients is a powerful one. Chaplains offer judgement-free support and a space for patients’ needs for belief and repentance. When a patient is alone, with no family members or loved ones, they can call upon their religion to provide a person of guidance and care.

Figure 2: Installation Service Program for Reverend Elmer Laursen, S.T.M., Lutheran Welfare Service of Northern California, September 18, 1960, Chaplaincy Services at UCSF, MSS 22-03.


These discussions would ultimately lead to the establishment of a Clinical Pastoral Education Program initiated and headed by Reverend Elmer Laursen, S.T.M. Reverend Laursen was a prominent figure in the Chaplaincy Services at UCSF and established clinical pastoral work as necessary for patient care. Reverend Laursen engaged in public outreach, fundraising, patient and student advocacy, and building relationships with other colleges and hospitals. His work inspired other pastors, reverends, and religious officials to begin implementing clinical pastoral education programs to develop student learning and patient care. He believed that pastoral care is imperative to patient care. Patients deal with challenging, and sometimes traumatizing and scary, medical procedures. The Chaplaincy Program could offer solitude and peace for patients who have no one else to call on. Chaplaincy Programs offer a humanistic approach to patient care in a field that is saturated with data, clinicians, and the medical unknown.

Figure 3: Group Photo of Chaplains, Reverends, Nuns, and Administrators at the 21st Anniversary Celebration of Chaplaincy Training Event, September 1982, Chaplaincy Services at UCSF, MSS 22-03.


After reading through the collection, I began dividing the documents into subject folders. These consisted of “Chaplaincy Service Materials,” “Pamphlets & Booklets,” “Funding,” “Chaplaincy Facility Space,” “Chaplain Elmer Laursen,” “Correspondence – August 1959 – September 1974,” “Photographs,” “21st Anniversary Celebration,” and “Rabbi Services.” Through these folders, the collection is now organized in a way that researchers and others can trace the narrative. While I was processing the collection, I kept reminding myself to make the finding aid easy to use and accessible. I want anyone, scholar or not, to be able to open the finding aid or file and know what the collection includes. It is difficult not to let the records overwhelm you with tiny details. It is difficult not to get lost in every aspect of a collection. I found gaps in the correspondence, and every time I read something new, I seemed to come up with more questions. However, I believe that to be part of the journey and work of archivists and scholars. We are always left wanting more. The documents in the collection are only a portion of the much larger story of Chaplaincy Services at UCSF, and an even smaller part of the broader history of religion and hospitals.