Analytics with GenAI: Know Your Data!


Post by Geoffrey Boushey, Head of Data Engineering, UCSF Library’s Data Science and Open Scholarship Team. He teaches and consults on programming for data pipelines with an emphasis on Python, Unix and SQL.

Teaching Approaches to Data Science

Recently, Ariel Deardorff (director of Data Science & Open Scholarship at the UCSF Library) forwarded me a paper titled “Generative AI for Data Science 101: Coding Without Learning to Code.” In this paper, the authors described how they used GitHub Copilot, a tool for generating code with AI, to supplement a fundamentals of data science course for MBA students, most of whom had no prior coding experience. Because the instructors wanted students to use AI to generate code for analysis, but not to produce the full analysis itself, they opted for a code-generating tool rather than risk “opening the Pandora’s box too wide” with ChatGPT, a tool that might blur the line between coding and analysis. They also deliberately de-emphasized the code itself, encouraging students to focus on analytical output rather than scrutinizing the R code line by line.

This approach has some interesting parallels, along with some key differences, with the way I teach programming at the UCSF Library through the “Data and Document Analysis with Python, SQL, and AI” series. These workshops are attended largely by graduate students, postdocs, research staff, and faculty (people with an exceptionally strong background in research and data science) who are looking to augment their programming, machine learning, and AI skills. These researchers don’t need me to teach them science (it turns out UCSF scientists are already pretty good at science), but they do want to learn how to leverage programming and AI developments to analyze data. In these workshops, which include introductory sessions for people who have not programmed before, I encourage participants to generate their own AI-driven code. However, I have always strongly emphasized the importance of closely reviewing any code generated for analytics or data preparation, whether pulled from online examples or created through generative AI.

The goal is to engage researchers with the creative process of writing code while also guarding against biases, inaccuracies, and unintended side effects (these are issues that can arise even in code you write yourself). Although the focus on careful examination contrasts with the approach described in the paper, it made me wonder: what if I diverged in the other direction and bypassed code altogether? If the instructors were successful in teaching MBA students to generate R code without scrutinizing it, could we skip that step entirely and perform the analysis directly in ChatGPT?

Experimental Analysis with ChatGPT

As a personal experiment, I decided to recreate an analysis from a more advanced workshop in my series, where we build a machine learning model to evaluate the impact of various factors on the likelihood of a positive COVID test using a dataset from Carbon Health. I’ve taught several iterations of this workshop, starting well before generative AI was widely available, and more recently began integrating AI-generated code into the material. But this time, I thought I’d try skipping the code entirely and see how AI could handle the analysis on its own.

I got off to a slightly rocky start with data collection. The COVID clinical data repository contains a year’s worth of testing data split into shorter CSV files representing more limited (weekly) time periods, and I was hoping I could convince ChatGPT to infer this structure from a general link to the GitHub repository and glob all the CSV files sequentially into a pandas dataframe (a tabular data structure). This process of munging and merging data, while common in data engineering, can be a little complicated, as GitHub provides both a human-readable and “raw” view of CSV files. Pandas needs the raw link, which requires multiple clicks through the GitHub web interface to access. Unfortunately, I was unsuccessful in coaxing ChatGPT into reading this structure, and eventually decided to supply ChatGPT with a direct link to the raw file for one of the CSV files [2]. This worked, and ChatGPT now had a pandas dataframe with about 2,000 COVID test records. Ideally, I’d do this with the full ~100k-row dataset, but for this experiment, 2,000 records was enough.
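For readers who want to see what this data-loading step looks like when written out, here is a minimal sketch in pandas. The repository URL and file names below are illustrative placeholders rather than the exact paths, and the glob-based merge is the approach we use in the workshop series (see footnote 2) rather than what ChatGPT did:

```python
# Minimal sketch of loading the testing data with pandas.
# The URL and file names are placeholders, not the actual repository paths.
from glob import glob

import pandas as pd

# Option 1: read a single weekly CSV directly from a "raw" GitHub link.
# GitHub's raw.githubusercontent.com URLs serve plain CSV text that pandas
# can parse; the regular human-readable web view serves HTML and will not work.
RAW_URL = (
    "https://raw.githubusercontent.com/example-org/covidclinicaldata/"
    "main/data/week-01.csv"  # hypothetical path
)
df = pd.read_csv(RAW_URL)

# Option 2 (the workshop approach, per footnote 2): with the repository cloned
# locally, glob all the weekly CSVs and concatenate them into one dataframe.
# csv_paths = sorted(glob("covidclinicaldata/data/*.csv"))
# df = pd.concat((pd.read_csv(p) for p in csv_paths), ignore_index=True)

print(df.shape)         # rows and columns loaded
print(df.columns[:10])  # first few column names
```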

Key Findings & Limitations

Now that I had some data loaded, I asked ChatGPT to rank the features in the dataset based on their ability to predict positive COVID test results. The AI generated a reasonably solid analysis without any need for code. ChatGPT suggested using logistic regression and produced a ranked list of features. When I followed up and asked ChatGPT to use a random forest model and calculate feature importances, it did so immediately, even offering to create a bar chart to visualize the results—no coding required.
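For comparison, here is roughly what that analysis looks like when the code is written out explicitly with scikit-learn. This is a sketch, not what ChatGPT actually ran behind the scenes: the feature and label column names are placeholders standing in for whatever the dataset provides, and it assumes the features have already been converted to numeric or boolean values.

```python
# Sketch: train a random forest classifier and rank features by importance.
# Column names are placeholders; real data would need encoding/cleaning first.
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

feature_cols = ["oxygen_saturation", "temperature", "cough", "loss_of_smell"]  # illustrative
X = df[feature_cols]
y = df["covid_test_result"]  # placeholder name for the label column

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

model = RandomForestClassifier(n_estimators=200, random_state=42)
model.fit(X_train, y_train)

# Impurity-based feature importances, sorted and plotted as a bar chart.
importances = pd.Series(model.feature_importances_, index=feature_cols).sort_values()
importances.plot(kind="barh", title="Random Forest Feature Importances")
plt.tight_layout()
plt.show()
```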

Here is the bar chart generated by ChatGPT showing the feature importances, with the inclusion of oxygen saturation but the notable omission of loss of smell:

Bar chart generated by ChatGPT showing Random Forest Feature Importances for Predicting COVID Test Result.

One feature ChatGPT identified as highly significant was oxygen saturation, which I had overlooked in my prior work with the same dataset. This was a moment of insight, but there was one crucial caveat: I couldn’t validate the result in the usual way. Typically, when I generate code during a workshop, we can review it as a group and debug it to ensure that the analysis is sound. But in this no-code approach, the precise stages of this process were hidden from me. I didn’t know exactly how the model had been trained, how the data had been cleaned or missing values imputed, how the feature importances had been calculated, or whether the results had been cross-validated. I also didn’t have the kind of access to feature importance scores that I would have had with a machine learning model (such as a random forest) I had built and trained myself. The insight was valuable, but it was hard to fully trust or even understand it without transparency into the process.
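To make concrete what that missing transparency looks like, here is a minimal sketch (continuing from the placeholder columns in the previous example) of the kind of explicit, reviewable steps I would normally want to see in a workshop, with the imputation and cross-validation choices spelled out rather than hidden:

```python
# Sketch: an explicit, inspectable workflow where the imputation strategy and
# validation scheme are visible choices in code rather than hidden steps.
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),  # one documented choice
    ("model", RandomForestClassifier(n_estimators=200, random_state=42)),
])

# Five-fold cross-validation makes the evaluation step explicit and repeatable.
scores = cross_val_score(pipeline, X, y, cv=5, scoring="accuracy")
print(f"Cross-validated accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```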

Screenshot from ChatGPT.

This lack of transparency became even more apparent when I asked ChatGPT about a feature that wasn’t showing up in the results: loss_of_smell. When I mentioned, through the chat interface, that this seemed a likely predictor for a positive test and asked why it hadn’t been included, ChatGPT told me that this feature would indeed be valuable and articulated why, but repeated that it wasn’t part of the dataset. This surprised me, as it was in the dataset under the column name “loss_of_smell.”

(The full transcript of this interaction, including the AI’s responses and corrections, can be found in the footnote [1] below).

This exchange illustrated both the potential and the limitations of AI-powered tools. The tool was quick, efficient, and pointed me to a feature I hadn’t considered. But it still needed human oversight. Tools like ChatGPT can miss straightforward details or introduce small errors that a person familiar with the data might easily catch. They can also introduce errors that are notably more obscure, detectable only after careful examination of the output by someone with much deeper knowledge of the data.

The Importance of Understanding Your Data

The experience reinforced a key principle I emphasize in teaching: know your data. Before jumping into analysis, it’s important to understand how your data was collected, how it’s been processed, what (ahem) a data engineering team may have done to it as they prepared it for you, and what it represents. Without that context, you may not be able to tell when AI or other tools have led you in the wrong direction or overlooked something critical.

While the experiment I conducted with AI analysis was fascinating and demonstrates the potential of low- or no-code approaches, it does underscore, for me, the continued importance of generating and carefully reading code during my programming workshops. Machine learning tasks related to document analysis, such as classification, regression, and feature analysis, involve a very precise set of instructions for gathering, cleaning, formatting, processing, analyzing, and visualizing data. Generative AI tools often provide quick results, but the precision involved in these processes can be obscured from the user. This lack of transparency carries significant implications for repeatable research and proper validation. For now, it remains crucial to have access to the underlying code and understand its workings to ensure thorough validation and review.

Conclusions & Recommendations

Programming languages will likely remain a crucial part of a data scientist’s toolkit for the foreseeable future. But whether you are generating and verifying code or using AI directly on a dataset, keep in mind that the core of data science is the data itself, not the tools used to analyze it.

No matter which tool you choose, the most important step is to deeply understand your data – its origins, how it was collected, any transformations it has undergone, and what it represents. Take the time to engage with it, ask thoughtful questions, and stay vigilant about potential biases or gaps that could influence your analysis. (Is it asking too much to suggest you love your data? Probably. But either way, you might enjoy the UC-wide “Love Data Week” conference).

In academic libraries, much of the value of archives comes from the richness of the objects themselves—features that don’t necessarily come through in a digital format. This is why I encourage researchers to not just work with digital transcriptions, but to also consider the physicality of the data: the texture of the paper, the marks or annotations on the margins, and the context behind how that data came to be. These details often carry meaning that isn’t immediately obvious in a dataset or a plain text transcription. Even in the digital realm, knowing the context, understanding how the data was collected, and remaining aware of the possibility of hidden bias are essential parts of the research process. [3] [4] Similarly, when working with archives or historical records, consider the importance of engaging with the data beyond just the text transcript or list of AI-detected objects in images.

Get to know your data before, during, and after you analyze it. If possible, handle documents physically and consider what you may have missed. Visit the libraries, museums, and archives [5] where objects are physically stored, and talk to the archivists and curators who work with them. Your data will tend to outlast the technology you use to analyze it, and while the tools and techniques you use for analysis will evolve, your knowledge of your data will form the core of its long-term value to you.

  1. GPT transcript: https://github.com/geoffswc/Know-Your-Data-Post/blob/main/GPT_Transcript.pdf
  2. In the workshop series, we use Python to merge the separate CSV files into a single pandas dataframe using the Python glob module. It wouldn’t be difficult to do this and resume working with ChatGPT, though it does demonstrate the difficulty of completing an analysis without any manual intervention through code (for now).
  3. “Bias and Data Loss in Transcript Generation” UCTech, 2023: https://www.youtube.com/watch?v=sNNrx1i96wc
  4. Leveraging AI for Document Analysis in Archival Research and Publishing, It’s About a Billion Lives Symposium 2025 (recording to be posted) https://tobacco.ucsf.edu/it%E2%80%99s-about-billion-lives-annual-symposium
  5. https://www.library.ucsf.edu/archives/ucsf/

The UCSF Digital Health Humanities Interdisciplinary Symposium: Summary and Recordings Release

attendees at a presentation

This summer, the UCSF Library Archives and Special Collections hosted the first UCSF Digital Health Humanities Interdisciplinary Symposium. The symposium brought together researchers working at the intersections of health sciences, data science, and digital humanities. The program kicked off with an introduction to Digital Health Humanities (DHH) at UCSF, followed by a lightning talk session. These sessions showcased research projects and works in progress related to this emerging domain. The afternoon sessions were topically oriented panels in which speakers shared their projects and resources for analyzing medical literature and addressed the challenges and opportunities of working with historical patient records. The post-session discussions emphasized how researchers across disciplines can converge to compare ways of working with digital methods and historical materials. Multidisciplinary collaborations can provide significant insights into health and healthcare experiences and influences. Researchers from the UCSF community gave lightning talks covering research processes, exploration and experimentation, early findings, challenges, and new research questions under consideration.

Common threads included:  

  • using industry documents and new analysis methods to identify patterns of industry influence on community organizations and scientific discourse;
  • the role of geography in understanding landscapes of disease and activism;
  • surfacing and confronting omissions in the historical record, particularly that of marginalized communities; and
  • the value and importance of integrating personal experiences in health care provision and historical interpretation of the health sciences.  

Each lightning talk included provocative descriptions of how digital methods have been or could be employed to further understanding of health humanities materials. Presenters also discussed how digital methods can support the inclusion of significant, yet overlooked or underrepresented experiences or perspectives. A forthcoming post will summarize presentations from the lightning talk showcase session. 

Working with historical patient records

A session on the challenges and opportunities of working with historical patient records as data included panelist presentations representing archival, technological, and historical perspectives. UCSF Associate University Librarian for Collections and UCSF Archivist Polina Ilieva shared how archivists can address the access complications presented by historical patient records to realize their potential as a research subject. Methods include digitizing and presenting data within innovative discovery and responsible-access platforms proposed by UCSF Archives and Special Collections. Aimee Medeiros, vice chair and associate professor in the UCSF Department of Humanities and Social Sciences, discussed how research benefits from liberating data from historical patient records for quantitative and qualitative inquiry, especially inquiry that expands and deepens understanding of health sciences knowledge networks, including historical structures of oppression, clinical care, and patient and care providers’ social contexts. The panel presentations concluded with Kim Pham, currently a research technology officer at the Max Planck Institute. Pham shared both process insights and ethical access considerations from a “collections as data” project she was involved in at the University of Denver that made patient data from the historical records of the Jewish Consumptives’ Relief Society available.

Examining racism in medical literature

Symposium programming closed with a panel presentation and discussion about analyzing medical literature, particularly medical journal archives, to track social topics over time, including racism in medicine. Claudia von Vacano, the founding executive director and senior research associate of D-Lab and digital humanities at the University of California, Berkeley (UC Berkeley), and Pratik Sachdeva, senior data scientist at the UC Berkeley D-Lab, shared an initiative to identify racism narratives in medical literature. Dr. von Vacano explained the need for this project, noting the pervasive reality of structural racism in healthcare and its significant negative impacts, particularly on Black and Latino individuals. Sachdeva shared their approach to studying narratives of racism in prominent published medical literature. They identify and analyze racism- and power-related terms by adapting a corpus labeling and analysis methodology they established in an earlier D-Lab initiative that measured hate speech.

Melissa Grafe, board member for the Medical Heritage Library, John R. Bumstead Librarian for Medical History, and head of the Medical Historical Library at Yale University, presented the range of digitized resources made publicly available by the Medical Heritage Library that can be analyzed to inform research around racism narratives in medicine. These include State Medical Society Journals and Historical American Medical Journals, as well as the curated collection sets Roots of Racism and Anti-Black Racism in Medicine. Finally, Moustafa Abdalla, a surgical resident and an independent principal investigator at Massachusetts General Hospital in Boston, presented on his textual analysis work. Abdalla has conducted computational research at Harvard Medical School and the University of Oxford and shared his findings from text analysis of more than 200 years of Journal of the American Medical Association and New England Journal of Medicine articles. He has also built and shared an N-gram viewer for this data to help others conduct exploratory research across the corpus.

Access symposium recordings

Those of us involved in the DHH pilot program were humbled by the breadth and depth of meaningful work presented during the symposium. We are encouraged by the insights, conversations, and opportunities for future collaboration that were seeded throughout the day. As UCSF DHH programming continues and researcher networks grow, we intend to host future events that build upon the momentum from this symposium! All symposium recordings are now available on the UCSF CLE.

About the Digital Health Humanities program

The UCSF Digital Health Humanities pilot program is funded by the Academic Senate Chancellor’s Fund, via the Committee on Library and Scholarly Communication, to facilitate interdisciplinary scholarship that advances understanding of the profound effects of illness and disease on patients, health professionals, and the social worlds in which they live and work. The UCSF DHH program was launched in 2022 and provides programming and resources to guide and support researchers in their engagement with digital tools and methods. The program also provides resources for working with archives as data.

Archives as Data Research Guide Now Available!

To help researchers find and understand how to work with data from archival health sciences collections, we have compiled and published the Archives as Data research guide. “Archives as Data” refers to archival collection materials in digital form that can be shared, accessed, analyzed, and referenced as data. Using digital tools, researchers can work with archives as data to explore and evaluate characteristics of collection materials and analyze trends and connections within and across them.

AIDS History Project Collections document included in the No More Silence dataset with Python code used for analysis.

UCSF Archives and Special Collections makes data available from a number of our digital collections. Researchers will find information in the guide about accessing and using such data, as well as descriptions of both the form and content this data takes. You’ll also find a growing set of links to learning resources about various data analysis methods used to work with archives as data.

This new Archives as Data research guide provides researchers with a centralized resource hub with brief descriptions of collection materials as well as links to the datasets that have been prepared from them, including:

  • The No More Silence dataset, an aggregation of data from selected collections included in the AIDS History Project which range from the records of community activism groups to the papers of health researchers and journalists.
  • Data from the Industry Documents Library, comprising collections of documents from the tobacco, food, drug, fossil fuel, chemical, and opioid industries, all of which impact public health.
  • Selected datasets from the COVID Tracking Project, a volunteer organization launched from The Atlantic and dedicated to collecting and publishing the data required to understand the COVID-19 outbreak in the United States, with data collected from March 2020 to March 2021.
  • Data from digitized UCSF University Publications, from course catalogs to annual reports, newsletters, and more.

We look forward to updating the guide as more data from UCSF Archives and Special Collections becomes available, and anticipate expanding to include links to “archives as data” of interest for digital health humanities work made available by other institutions and organizations.

To learn more about how we are making archives as data available at UCSF, check out recordings and resources from our recent sessions on Finding and Exploring Archives as Data for Digital Health Humanities!

The Archives as Data research guide has been published as part of the UCSF Digital Health Humanities pilot program. Please reach out to the Digital Health Humanities Program Coordinator, Kathryn Stine, at kathryn.stine@ucsf.edu with any questions about DHH at UCSF. The UCSF Digital Health Humanities Pilot is funded by the Academic Senate Chancellor’s Fund via the Committee on Library and Scholarly Communication.

Digital Health Humanities: Showcasing “Archives as Data” for Analysis

UCSF Archives & Special Collections includes numerous digitized collections documenting health sciences topics ranging from institutional, community, and individual response to illness and disease to industry impacts on public health. We make many of these collections available as data that can be computationally analyzed for health sciences and humanities research.

Voyant Cirrus term frequency visualization generated from AIDS health crisis workshops file data, 1986 from the UCSF AIDS Health Project Records, UCSF Archives & Special Collections (data available in the No More Silence dataset).

If you are curious about working with data from the UCSF Archives and Special Collections, the Digital Health Humanities (DHH) pilot program will showcase our “archives as data” throughout the month. In two upcoming sessions, we’ll provide an orientation to available data as well as methods for finding, accessing, and exploring these data resources:

Voyant Bubbleline term occurrence visualization generated from Letter from the FDA to Purdue re: new drug application for OxyContin Controlled-Release Tablets data, 1995 from the Kentucky Opioid Litigation Documents collection, UCSF Industry Documents Library (data available from the item page link or as part of the collection dataset).

Python for Data Analysis series workshops

DHH programming also continues to partner with the Data Science Institute (DSI) to offer workshops on tools and methods well-suited to conducting research with “archives as data.” March workshops in the DSI Python for Data Analysis series will dig into text analysis using natural language processing and building machine learning models:

Through these workshops and selected companion follow-up sessions with troubleshooting and guided process walkthroughs, researchers can learn and practice data analysis techniques and get familiar with data from our collections. Check out the library’s events calendar to find and register for the latest offerings!

OpenRefine workshops

If you have data you’d like to work with but it needs tidying and preparation, attend a DSI OpenRefine workshop. This workshop will cover techniques for cleaning structured data, no programming required! There will be two OpenRefine sessions this month:

Slides, linked resources, and recordings from previously held DHH sessions are available on the CLE. There you will find materials from a Digital Health Humanities Overview session and recorded walkthroughs covering Unix, Python, and Jupyter notebook basics. Related resources will be updated on the CLE following DHH sessions.

Questions?

Please contact DHH Program Coordinator, Kathryn Stine, at kathryn.stine@ucsf.edu. The UCSF Digital Health Humanities Pilot is funded by the Academic Senate Chancellor’s Fund via the Committee on Library and Scholarly Communication.

Launching the Digital Health Humanities Pilot

We are excited to launch digital health humanities pilot programming starting January 2023! Digital health humanities (DHH) is an emerging discipline that utilizes digital methods and resources to explore research questions investigating the human experience around health and illness. The Digital Health Humanities Pilot (DHHP) will facilitate new insights into historical health data. Participants will learn how to evaluate and integrate digital methods and “archives as data” into their research through a range of offerings and trainings.

Participants at the first workshop for the No More Silence project, a precursor to digital health humanities pilot programming

The programming from this pilot will bring a humanistic context to understanding institutional, personal and community responses to health issues, as well as social, cultural, political and economic impacts on individual and public health. The DHHP will offer researchers from all disciplines (including faculty, staff, and other learners) tailored workshops, classes, and skill-building sessions. Workshops will encourage the use of “archives as data” and utilize datasets from holdings within the UCSF Archives and Special Collections (including the AIDS History Project and Industry Documents Library, among others). Additionally, in spring 2023 we will be hosting the Digital Health Humanities Symposium. The symposium will provide space to consider theoretical issues central to this emerging field and highlight digital health humanities projects. More information on the symposium will be shared soon.

The UCSF Digital Health Humanities Pilot is funded by the Academic Senate Chancellor’s Fund via the Committee on Library and Scholarly Communication.

Register for an upcoming Digital Health Humanities overview session

Are you interested in learning how DHH can inform your research? We invite you to participate in our virtual session, Digital Health Humanities: An Overview of Methods, Tools, Archives, and Applications, Thursday, January 19, from 1 to 3 p.m. PT.

This session will include an orientation led by Digital Health Humanities Program Coordinator Kathryn Stine and Digital Archivist Charlie Macquarie. We will discuss various approaches in DHH research, including getting familiar with data analysis and programming skills, and will share an overview of the UCSF Library’s archival collections data available for research.

For questions about digital health humanities at UCSF, please contact Digital Health Humanities Program Coordinator, Kathryn Stine at kathryn.stine@ucsf.edu.

Register Now

Collaborating with the Data Science Initiative

The Data Science Initiative (DSI) is offering workshops in the coming months to support researchers interested in implementing DHH approaches. Follow-up sessions will be available for researchers to reinforce and contextualize programming foundations in practical application. Check out the upcoming sessions:

We invite you to check out the library’s events and classes calendar for upcoming DHHP (and related DSI) programming. If you are unable to attend any of the sessions listed above, we advise referring to the DSI Collaborative Learning Environment (CLE) (accessible with MyAccess credentials) for recordings and resources.



Kathryn Stine Joins UCSF Archives & Special Collections

We are excited to introduce Kathryn Stine who joined the Archives & Special Collections team as a Digital Health Humanities Program Coordinator. This position will support development and day-to-day operations of a new Digital Health Humanities Pilot. The goal of this initiative is to guide and support faculty in their engagement with digital tools and methods to facilitate interdisciplinary scholarship that will advance understanding of the profound effects of illness and disease on patients, health professionals, and the social worlds in which they live and work.

Kathryn Stine, UCSF Library digital health humanities program coordinator
Kathryn Stine

Kathryn Stine has an extensive background in developing and providing access to digital collections. Her experience includes nearly 10 years working for the University of California system at the California Digital Library (CDL) in various roles, most recently as the Senior Product Manager for Digitization & Digital Content. In that position, Kathryn managed the team that supports and coordinates the University of California Libraries’ engagement with HathiTrust and mass digitization activities.

Prior to joining CDL, Kathryn held several positions at the University of Illinois at Chicago, where she led the university archives program and managed special collections processing.

Kathryn is deeply experienced in developing and managing cross-institutional and cross-departmental library projects and in building communities across diverse functions and perspectives. Her work at CDL included managing and contributing to both investigative and operations-focused systemwide project teams, coordinating web archiving initiatives, advising the UC Berkeley digital lifecycle program, and leading a team of developers and analysts to launch, maintain, and enhance a metadata management system for and with HathiTrust. She is motivated by supporting cross-functional teams in bringing both collaboration and creativity to a common purpose.

Working with (meta)data is a throughline in Kathryn’s career, and she is enthusiastic about encouraging new ways of deriving and analyzing collections data in support of innovative digital research. In developing and delivering workshops and providing project consultation, Kathryn has found working with researchers to make the most of digital collections incredibly rewarding. She is very excited to join the UCSF Library and to work with researchers, technologists, and archivists to match health humanities research inquiry to relevant collections, digital analysis methods, and technical tools.

Kathryn loves a good metadata challenge to puzzle through, and also enjoys improvisational cooking, garment sewing, and getting outdoors with her family, especially to camp and open-water swim.