Topic Modeling | Ryan C. Yeung

Workshop Overview:

Researchers often collect open-text or narrative data. While these data are rich and meaningful, researchers often have limited access to methods for analyzing them. Traditional approaches such as manual coding require a great deal of resources (e.g., time, money, training, personnel), can be difficult to replicate or reproduce, and do not scale well to larger sample sizes. Topic modeling is an alternative class of methods that uses machine learning and natural language processing to predict and quantify what a given text or document is about. These computational methods enable large-scale content analyses in a fraction of the time typically needed, allow novel analyses of topics as continuous, non-mutually exclusive variables, and (ideally) keep humans in the driver’s seat (Yeung et al., 2022; Yeung & Fernandes, 2023).
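
To make the idea of topics as continuous, non-mutually exclusive variables more concrete, here is a minimal sketch in Python. It is not STM or BERTopic: it assumes a simpler relative, latent Dirichlet allocation (LDA), fit by collapsed Gibbs sampling using only the standard library. The function name `fit_lda`, the hyperparameter values, and the toy documents are all made up for illustration and do not come from the workshop materials.

```python
import random
from collections import Counter


def fit_lda(docs, n_topics=2, n_iters=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA (illustrative, not STM/BERTopic).

    docs: list of tokenized documents (lists of strings).
    Returns one list of topic proportions per document.
    """
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})

    # Count tables that the sampler keeps up to date.
    doc_topic = [[0] * n_topics for _ in docs]          # tokens per topic, per doc
    topic_word = [Counter() for _ in range(n_topics)]   # word counts per topic
    topic_total = [0] * n_topics                        # tokens per topic overall

    # Start by assigning every token to a random topic.
    assignments = []
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            z = rng.randrange(n_topics)
            z_doc.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assignments.append(z_doc)

    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1

                # Weight each topic by how much this document uses it and
                # how much the topic uses this word (both smoothed).
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(n_topics)
                ]
                z = rng.choices(range(n_topics), weights=weights)[0]

                # Record the newly sampled assignment.
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1

    # Smoothed share of each topic within each document.
    return [
        [(doc_topic[d][k] + alpha) / (len(doc) + n_topics * alpha)
         for k in range(n_topics)]
        for d, doc in enumerate(docs)
    ]


# Hypothetical, hand-written toy documents (not real participant data).
docs = [
    "argument friend fight friend argument".split(),
    "family dinner mother family holiday".split(),
    "argument mother family fight dinner".split(),
]

theta = fit_lda(docs)
for row in theta:
    print([round(p, 2) for p in row])
```

Each row of `theta` is one document's topic proportions. The values are continuous, sum to 1 within a document, and are never exactly 0, so a single text can load on several topics at once rather than being placed in one category. On a corpus this small the topics themselves are not meaningful; the point is only the shape of the output.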

Workshop Objectives:

In this session, we’ll cover (1) the theoretical bases of topic modeling, (2) two open-source, local-environment topic modeling methods: structural topic modeling (STM) and BERTopic, and (3) situations where topic modeling might be particularly appropriate (or inappropriate).

For code and any associated materials: https://github.com/ryancyeung/topic_model_workshop

References

2023

Specific topics, specific symptoms: Linking the content of recurrent involuntary memories to mental health using computational text analysis

Ryan C Yeung and Myra A Fernandes

npj Mental Health Research, 2023


Researchers debate whether recurrent involuntary autobiographical memories (IAMs; memories of one’s personal past retrieved unintentionally and repetitively) are pathological or ordinary. While some argue that these memories contribute to clinical disorders, recurrent IAMs are also common in everyday life. Here, we examined how the content of recurrent IAMs might distinguish between those that are maladaptive (related to worse mental health) versus benign (unrelated to mental health). Over two years, 6187 undergraduates completed online surveys about recurrent IAMs; those who experienced recurrent IAMs within the past year were asked to describe their memories, resulting in 3624 text descriptions. Using a previously validated computational approach (structural topic modeling), we identified coherent topics (e.g., “Conversations”, “Experiences with family members”) in recurrent IAMs. Specific topics (e.g., “Negative past relationships”, “Abuse and trauma”) were uniquely related to symptoms of mental health disorders (e.g., depression, PTSD), above and beyond the self-reported valence of these memories. Importantly, we also found that content in recurrent IAMs was distinct across symptom types (e.g., “Communication and miscommunication” was related to social anxiety, but not symptoms of other disorders), suggesting that while negative recurrent IAMs are transdiagnostic, their content remains unique across different types of mental health concerns. Our work shows that topics in recurrent IAMs, and their links to mental health, are identifiable, distinguishable, and quantifiable.

2022

Understanding autobiographical memory content using computational text analysis

Ryan C Yeung, Marek Stastna, and Myra A Fernandes

Memory, 2022


Although research on autobiographical memory (AM) continues to grow, there remain few methods to analyze AM content. Past approaches are typically manual, and prohibitively time- and labour-intensive. These methodological limitations are concerning because content may provide insights into the nature and functions of AM. In particular, analyzing content in recurrent involuntary autobiographical memories (IAMs; those that spring to mind unintentionally and repetitively) could resolve controversies about whether these memories typically involve mundane or distressing events. Here, we present computational methods that can analyze content in thousands of participants’ AMs, without needing to hand-code each memory. A sample of 6,187 undergraduates completed surveys about recurrent IAMs, resulting in 3,624 text descriptions. Using frequency analyses, we identified common (e.g., “time”, “friend”) and distinctive words in recurrent IAMs (e.g., “argument” as distinctive to negative recurrent IAMs). Using structural topic modelling, we identified coherent topics (e.g., “Negative past relationships”, “Conversations”, “Experiences with family members”) within recurrent IAMs and found that topic use significantly differed depending on the valence of these memories. Computational methods allowed us to analyze large quantities of AM content with enhanced granularity and reproducibility. We present the means to enable future research on AM content at an unprecedented scope and scale.