As a proponent of the open science movement, I consider open-source development deeply important. My training as a cognitive neuroscientist informs my approach to using and developing AI methods: fundamental principles of ethics, reproducibility, and replicability matter more than ever. To this end, I release all of my code and data openly wherever it is ethically sound to do so.
Furthermore, an important cornerstone of my research is developing and validating novel methods that enable and improve science at large. I have therefore developed and validated novel AI/ML/NLP methods and released them fully open-source. For example, I created an automated method for detecting invalid open-text responses based solely on the text of the response itself (Yeung & Fernandes, 2022). My current work also compares state-of-the-art transformer-based language models (e.g., DeBERTaV3) against simpler ML algorithms, finding that while the transformer models offer benefits on some tasks (e.g., sentiment analysis, accuracy scoring), their costs sometimes outweigh the gains. Going forward, my goal is to continue developing and validating methods that span not only cutting-edge AI techniques but also classic, interpretable, and explainable ML techniques.
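As an illustration of that kind of comparison, the sketch below contrasts a simple, interpretable baseline (TF-IDF features with logistic regression) against an off-the-shelf pretrained transformer classifier. The texts, labels, and model choices are hypothetical stand-ins for illustration, not my published materials.

```python
# Hypothetical sketch: simple ML baseline vs. a transformer classifier.
# Texts and labels are invented for illustration only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from transformers import pipeline

texts = [
    "I loved this product, it exceeded my expectations",
    "Terrible experience, would not recommend",
    "Works exactly as described",
    "Broke after two days",
]
labels = [1, 0, 1, 0]  # 1 = positive, 0 = negative

# Simple, fast, interpretable baseline: TF-IDF features + logistic regression.
baseline = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
baseline.fit(texts, labels)
print(baseline.predict(["I would happily buy this again"]))

# Transformer alternative: a pretrained sentiment model (far heavier to run,
# which is part of the cost/benefit trade-off discussed above).
classifier = pipeline("sentiment-analysis")
print(classifier("I would happily buy this again"))
```

The practical question is whether a transformer's gains on a given task justify its added compute, latency, and reduced interpretability relative to the simple baseline.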
References

Yeung, R. C., & Fernandes, M. A. (2022). Machine learning to detect invalid text responses: Validation and comparison to existing detection methods. Behavior Research Methods.
A crucial step in analysing text data is the detection and removal of invalid texts (e.g., texts with meaningless or irrelevant content). To date, research topics that rely heavily on analysis of text data, such as autobiographical memory, have lacked methods of detecting invalid texts that are both effective and practical. Although researchers have suggested many data quality indicators that might identify invalid responses (e.g., response time, character/word count), few of these methods have been empirically validated with text responses. In the current study, we propose and implement a supervised machine learning approach that can mimic the accuracy of human coding, but without the need to hand-code entire text datasets. Our approach (a) trains, validates, and tests on a subset of texts manually labelled as valid or invalid, (b) calculates performance metrics to help select the best model, and (c) predicts whether unlabelled texts are valid or invalid based on the text alone. Model validation and evaluation using autobiographical memory texts indicated that machine learning accurately detected invalid texts with performance near human coding, significantly outperforming existing data quality indicators. Our openly available code and instructions enable new methods of improving data quality for researchers using text as data.
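To make the workflow concrete, the sketch below follows the general shape of steps (a) through (c): fit on a hand-labelled subset, report held-out performance metrics, then predict validity for unlabelled texts from the text alone. It assumes scikit-learn with a TF-IDF plus logistic-regression classifier; the example texts, labels, and classifier are placeholders, not the code released with the paper.

```python
# Minimal sketch of the general workflow; not the published code.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# (a) A manually labelled subset: 1 = valid response, 0 = invalid response.
labelled_texts = [
    "I remember the day my sister graduated; we drove to the ceremony together.",
    "asdf asdf asdf",
    "Last summer I visited my grandparents' farm and helped with the harvest.",
    "I don't know, nothing comes to mind, skip this one.",
]
labels = [1, 0, 1, 0]

train_texts, test_texts, train_y, test_y = train_test_split(
    labelled_texts, labels, test_size=0.5, random_state=0, stratify=labels
)

# (b) Fit a text classifier and compute performance metrics on held-out data.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(train_texts, train_y)
print(classification_report(test_y, model.predict(test_texts)))

# (c) Predict validity for unlabelled texts based on the text alone.
unlabelled_texts = ["We celebrated my tenth birthday at the lake.", "qwerty qwerty"]
print(model.predict(unlabelled_texts))
```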