EMNLP 2025

Exploring Large Language Models for Detecting Mental Disorders

Gleb Kuzmin, Petr Strepetov, Maksim Stankevich, Natalya V. Chudova, Artem Shelmanov, Ivan V. Smirnov

Abstract

This paper compares the effectiveness of traditional machine learning methods, encoder-based models, and large language models (LLMs) on the task of detecting depression and anxiety. Five Russian-language datasets were considered, each differing in format and in the method used to define the target pathology class. We tested AutoML models based on linguistic features, several variations of encoder-based Transformers such as BERT, and state-of-the-art LLMs as pathology classification models. The results demonstrated that LLMs outperform traditional methods, particularly on noisy and small datasets where training examples vary significantly in text length and genre. However, psycholinguistic features and encoder-based models can achieve performance comparable to language models when trained on texts from individuals with clinically confirmed depression, highlighting their potential effectiveness in targeted clinical applications.

This method demonstrates state-of-the-art results on Twitter and Weibo depression datasets by employing zero-shot and few-shot learning. Another study (Hadzic et al., 2024) compares the effectiveness of fine-tuned BERT with GPT-3.5 and GPT-4 in the depression detection task. The authors use Patient Health Questionnaire-8 scores for classifying transcribed audio data from the Distress Analysis Interview Corpus, KID, and a simulated dataset. With scores separated into depressive and non-depressive groups, zero-shot GPT-4 outperforms GPT-3.5 and BERT across all datasets, highlighting the potential of LLMs in depression detection. Additionally, Wang et al. (2024) investigate depression symptom detection and severity classification using LLMs on the eRisk 2021 and eRisk 2023 datasets. Using Beck's Depression Inventory to form queries related to depression symptoms and the Universal Sentence Encoder for text embeddings, the study creates two datasets containing the top-1 and top-5 ranked texts for each query.
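The retrieval step described for Wang et al. (2024) can be illustrated with a minimal sketch. It assumes that texts are ranked by cosine similarity between a query embedding and each text embedding; the ranking criterion is not stated here, so this choice is an assumption. The function names `cosine` and `top_k_texts` and the toy three-dimensional vectors are hypothetical and stand in for real sentence-encoder outputs.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat zero vectors as dissimilar to everything
    return dot / (norm_u * norm_v)


def top_k_texts(query_vec, text_vecs, k):
    """Return indices of the k texts most similar to the query, best first."""
    scored = [(cosine(query_vec, vec), idx) for idx, vec in enumerate(text_vecs)]
    # Sort by descending similarity; break ties by original position.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [idx for _, idx in scored[:k]]


# Toy embeddings (hypothetical); real ones would come from a sentence encoder.
query = [1.0, 0.0, 0.0]
texts = [
    [0.0, 1.0, 0.0],  # orthogonal to the query
    [0.9, 0.1, 0.0],  # close to the query
    [0.5, 0.5, 0.0],  # partially aligned with the query
]

top1 = top_k_texts(query, texts, k=1)
top2 = top_k_texts(query, texts, k=2)
```

Building a top-1 and a top-5 dataset would then amount to calling such a function once per symptom query with `k=1` and `k=5`, respectively.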
In that study, LLMs fine-tuned with QLoRA are used for classification into four levels of depression severity. The DORIS system (Lan et al., 2024) addresses the challenges of detecting depression in social media posts from the Sina Weibo Depression Dataset. The authors use GPT3.5-Turbo-1103 to annotate high-risk texts according to the DSM-5 depression scale; an LLM is also used to summarize critical information from users' historical mood records (mood courses). The final XGBoost-based model is trained on features derived from the annotations and on gte-small-zh vector representations of post histories and mood courses, and it shows an improvement over the baseline.

We are the first to examine and compare three generations of the discussed models for depression and anxiety detection tasks in Russian, namely, traditional ML models, encoder-based models, and LLMs. Unlike other works, we used various models from each group and carefully compared the results between the groups on five datasets, aiming for a general recommendation on the best models to use in practice.

Data

This paper considers five Russian-language datasets: two for depression and three for anxiety. Classes in all datasets are represented in a binary format: a healthy class (no signs of mental disorders) and a pathology class (depression or anxiety). A general description of the datasets used in our study is shown in Table 1.