
Recognizing hotspots in Brief Eclectic Psychotherapy for PTSD by text and audio mining


Article: 1726672 | Received 08 Aug 2019, Accepted 15 Jan 2020, Published online: 17 Mar 2020

ABSTRACT

Background: Identifying and addressing hotspots is a key element of imaginal exposure in Brief Eclectic Psychotherapy for PTSD (BEPP). Research shows that treatment effectiveness is associated with focusing on these hotspots and that hotspot frequency and characteristics may serve as indicators for treatment success.

Objective: This study aims to develop a model to automatically recognize hotspots based on text and speech features, which might be an efficient way to track patient progress and predict treatment efficacy.

Method: A multimodal supervised classification model was developed based on analog tape recordings and transcripts of imaginal exposure sessions of 10 successful and 10 non-successful treatment completers. Data mining and machine learning techniques were used to extract and select text (e.g. words and word combinations) and speech (e.g. speech rate, pauses between words) features that distinguish between ‘hotspot’ (N = 37) and ‘non-hotspot’ (N = 45) phases during exposure sessions.

Results: The developed model resulted in a high training performance (mean F1-score of 0.76) but a low testing performance (mean F1-score = 0.52). This shows that the selected text and speech features could clearly distinguish between hotspots and non-hotspots in the current data set, but will probably not recognize hotspots from new input data very well.

Conclusions: In order to improve the recognition of new hotspots, the described methodology should be applied to a larger, higher quality (digitally recorded) data set. As such this study should be seen mainly as a proof of concept, demonstrating the possible application and contribution of automatic text and audio analysis to therapy process research in PTSD and mental health research in general.


1. Introduction

Posttraumatic stress disorder (PTSD) is a mental health disorder that can develop after experiencing or witnessing a traumatic event (American Psychiatric Association, Citation2013). The lifetime prevalence rate of PTSD in the general population is 7.4% (De Vries & Olff, Citation2009; Kessler et al., Citation2017). Several effective treatments for PTSD exist (Bisson et al., Citation2019), examples of which are trauma-focused cognitive behavioural therapy (CBT; Ehlers & Clark, Citation2000) and eye movement desensitization and reprocessing (EMDR; Shapiro, Citation2001). One ingredient that all effective psychotherapies have in common is exposure to the trauma (Olff et al., Citationin press; Schnyder et al., Citation2015). Despite its efficacy, there is still a considerable proportion of patients that does not (sufficiently) respond to this form of trauma-focused therapy. For example, in their meta-analysis of psychotherapy for PTSD, Bradley, Greene, Russ, Dutra, and Westen (Citation2005) report mean improvement rates of 37.6% and 47.4% among CBT intent-to-treat patients and treatment completers, respectively.

Grey, Young, and Holmes (Citation2002) argue that the effectiveness of PTSD treatment can significantly improve by focusing on hotspots. This is in line with the results of Nijdam, Baas, Olff, and Gersons (Citation2013), who showed that hotspots were more frequently addressed in successful than in non-successful treatments. Hotspots, the moments of traumatic experiences with the highest emotional impact, have been an important topic of research in the past decades. For example, Ehlers, Hackmann, and Michael (Citation2004) and Ehlers, Clark, Hackmann, McManus, and Fennell (Citation2005) found that imaginal exposure during trauma-focused CBT should focus on addressing and changing the meaning of hotspots as this could lead to greater PTSD symptom reduction. The importance of hotspots in psychotherapy was also highlighted in earlier studies that argued that hotspots need to be addressed to ensure habituation (Richards & Lovell, Citation1999) or to identify deeper meanings (Ehlers & Clark, Citation2000).

A form of trauma-focused CBT that focuses on the identification and addressing of hotspots is Brief Eclectic Psychotherapy for PTSD (BEPP; Gersons, Carlier, Lamberts, & Van Der Kolk, Citation2000). Through imaginal exposure, the patient is led slowly through the traumatic situation until the worst moment (the hotspot) is reached (Grey & Holmes, Citation2008). Hotspots are addressed by encouraging the patient to describe and remember the exact details of the most frightening or emotional moment, for example, by asking about sounds, smells, weather, or surroundings. By helping the patient to remember the details, cues to new aspects and details of the event can come to mind, enabling the patient to relive the situation as vividly as possible (Gersons, Meewisse, Nijdam, & Olff, Citation2011). When the hotspot is sufficiently covered, the trauma narrative can be continued until (over the course of several exposure sessions) all the hotspots have been addressed. The imaginal exposure phase is completed when all hotspots are addressed and the emotions associated with the traumatic event are levelled down sufficiently (Nijdam et al., Citation2013).

With regard to the content of hotspots, previous studies focused on the presence of emotions (Grey, Holmes, & Brewin, Citation2001; Grey et al., Citation2002) and cognitions (Grey & Holmes, Citation2008; Holmes, Grey, & Young, Citation2005), which showed that especially anxiety, helplessness, horror, anger, sadness, shame, and guilt frequently occurred in hotspots. In addition, Grey et al. (Citation2002) found that hotspots are characterized by subtle textual changes, which may guide the therapist in the identification of emotional hotspots. An example of a study assessing textual differences within trauma narratives is that of Jelinek et al. (Citation2010), who studied the organization and content of the ‘worst moments’ of traumatic memories by analysing the degree of disorganization, emotions, and speaking style. They found that these moments differed from the rest of the narrative with regard to organization.

To obtain a deeper knowledge and understanding of trauma treatment and specifically hotspots, more in-depth, large-scale analysis of treatment and hotspot content is required. Until now, treatment content has mainly been studied by manually coding the occurrence of a predefined set of characteristics within therapy session recordings or transcripts retrospectively. Due to the time-consuming nature of such analyses, most of these studies focus on one specific construct, such as text cohesion (Foa, Molnar, & Cashman, Citation1995), complexity (Amir, Stafford, Freshman, & Foa, Citation1998), or dissociation (Zoellner, Alvarez-Conrad, & Foa, Citation2002). It is suggested that future studies should focus on assessing the relationship between multiple constructs underlying traumatic narratives instead of studying every construct separately (Amir et al., Citation1998).

An effective way to study multiple constructs and variables at once is to analyse treatment sessions using automatic text and audio analyses. Text analysis is frequently used in PTSD research, as word use and linguistic features proved to be indicative of people’s mental, social, and sometimes physical state, and their defensive operations (Nelson & Horowitz, Citation2010; Pennebaker, Mehl, & Niederhoffer, Citation2003). Word counts have been used to study trauma narrative content in relation to PTSD symptom severity (e.g. Jelinek et al., Citation2010; Pennebaker, Citation1993) and insight in the linguistic elements present within trauma narratives could lead to improved PTSD treatment (Alvarez-Conrad, Zoellner, & Foa, Citation2001). For example, specific linguistic features such as cognitive processing words (Alvarez-Conrad et al., Citation2001; D’Andrea, Chiu, Casas, & Deldin, Citation2012; Pennebaker, Francis, & Booth, Citation2001), emotion words (Alvarez-Conrad et al., Citation2001; Pennebaker et al., Citation2001), words related to insight (Pennebaker et al., Citation2001), reflection (D’Andrea et al., Citation2012), causation (Boals & Klein, Citation2005), and affection and death (Alvarez-Conrad et al., Citation2001) have been used to predict improvements in post-treatment PTSD symptoms, perceived physical health, and personal functioning. Because mood and emotions are found to influence speaking behaviour and speech sound characteristics, audio signal analysis is regularly applied in psychiatric studies as well, for example, to predict recovery time in depression (Kuny & Stassen, Citation1993) or to recognize psychosis development in high-risk youths (Bedi et al., Citation2015).

Text and speech features can be used to identify and study specific concepts on a large scale, in a transparent and uniform fashion, over a long period of time. For the automatic recognition or prediction of pre-defined concepts, supervised classification is generally used. Supervised classification is a data mining application in which objects (e.g. texts or audio signals) are assigned to a set of predefined class labels using a classification model based on labelled training samples (Bird, Klein, & Loper, Citation2009). Supervised classification based on text features has been used, for example, to screen forum posts for PTSD (He, Veldkamp, & de Vries, Citation2012) or to predict treatment adherence for schizophrenia patients (Howes, Purver, McCabe, Healey, & Lavelle, Citation2012), whereas speech features have been used to classify distress in PTSD patients (Van Den Broek, Van Der Sluis, & Dijkstra, Citation2009). Though most studies use either text or audio analysis, Schuller, Villar, Rigoll, and Lang (Citation2005) and Forbes-Riley and Litman (Citation2004) found that models based on multimodal feature sets outperformed models based on either acoustic or linguistic features alone (e.g. in emotion classification), as multimodal sets provide a broader and more complete picture of one’s (emotional) state (Bhaskar, Sruthi, & Nedungadi, Citation2015).

This study aims to develop a multimodal supervised classification model to automatically recognize hotspots based on text and speech features extracted from tape recordings and transcripts of imaginal exposure sessions of successful and non-successful treatment completers. Automatic hotspot recognition can provide clinicians with insight in the occurrence and characteristics of hotspots during their treatments, which may assist them in offering a more effective intervention. We hypothesized that a combination of text and speech features extracted from patient speech could be used to develop a supervised classification model to automatically distinguish between hotspot and non-hotspot phases during imaginal exposure sessions. Based on the formal hotspot characteristics and previous research on hotspots and CBT sessions, we identified nine constructs (affect, emotions, cognitions, dissociation, avoidance, cohesion, organization, fragmentation, and complexity, further described in the Methods section) that we expected to differ between hotspots and non-hotspots. Each construct was operationalized through a number of text and speech characteristics that were captured using a large range of features extracted from CBT session transcripts and recordings.

2. Methods

2.1. Sample and data set

We used data of patients undergoing Brief Eclectic Psychotherapy for PTSD (BEPP; Gersons et al., Citation2000). To develop the hotspot classification model, an existing expert-annotated data set consisting of imaginal exposure session recordings was used in which hotspots and their characteristics were coded. This data set consisted of analog cassette tape recordings of 45 PTSD patients and was collected for a previous study by Nijdam et al. (Citation2013), who investigated differences in hotspots between successful and unsuccessful BEPP trauma-focused psychotherapies. They analysed session recordings in which imaginal exposure was present for 20 of the 45 patients (the 10 most and the 10 least successful treatment completers). The sample consisted of twelve female and eight male adults with a mean age of 39.60 (SD = 10.98) and different ethnic backgrounds (mainly Dutch, N = 15, but also Indonesian, Surinamese, Aruban, and Bosnian). The types of trauma the patients experienced included assault (N = 13), disaster (N = 2), sexual assault (N = 1), accident (N = 1), war-related (N = 1), and other (N = 2).

Nijdam et al. (Citation2013) coded the frequency of hotspots, their characteristics (interrater reliability K = 0.86), emotions (interrater reliability K = 0.81), and cognitions (interrater reliability K = 0.85) for 102 recordings based on the Hotspot Identification Manual, an adaptation of the Hotspots Manual of Holmes and Grey (Citation2002), developed by Nijdam and colleagues to enable retrospective coding based on audio recordings. Of the 102 coded sessions, recordings of insufficient quality for transcription (mainly due to heavy background noise, N = 29) or that did not contain any hotspots (N = 29) were excluded in the present study. From the remaining 44 recordings one session was selected per patient. This was the session in which the most hotspots occurred. In case there were multiple sessions with the same number of hotspots, the session occurring earliest in treatment was used. In total the 20 selected sessions contained 37 hotspots: seven recordings with three hotspots, three recordings with two hotspots, and ten recordings with one hotspot (see Figure 1).

Figure 1. Data selection chart for available session recordings

2.1.1. Data preparation

The data consisted of tape recordings (mono channel) of complete imaginal exposure sessions, which were converted to WAV format (16-bit, 16 kHz, mono) using the digital audio editor Audacity® version 2.0.5 (Audacity Team, Citation2013). The recordings were over 10 years old at the time of digitization, which negatively influenced the sound quality. Each recording contained a complete imaginal exposure session consisting of four elements (Gersons et al., Citation2011):

  • Discussion: discussion of the previous session, the course of the PTSD symptoms, and the structure and content of the current session.

  • Relaxation exercises: repeatedly tensing and relaxing muscle groups to enable the patient to focus on the traumatic event and go back to the situation.

  • Exposure: for the first exposure session, the patient is brought back to the day of the traumatic event and is asked to give a detailed account of the situation prior to the event and the event itself. In subsequent sessions, exposure starts where it left off in the previous session.

  • Discussion: discussion of the exposure experienced so far and explanation of the content and structure of the following session.

Since we were only interested in the imaginal exposure phase, the initial discussion of the previous session, relaxation, and concluding discussion were removed, leaving only the exposure phase, usually about 15–20 min per recording, for analysis.

Because the audio was of poor quality and transcriptions needed to be as detailed as possible, automatic speech recognition (ASR) was not applicable. The recordings were therefore transcribed and annotated by the first author, who was blind to therapeutic outcome. The transcriptions are verbatim, meaning that every recorded word, including unfinished words (stammering), non-fluencies (e.g. uh, hmm), and forms of backchanneling (e.g. uhhu, ok), was transcribed. Background noise was removed only if necessary for transcription, using the noise reduction function implemented in Audacity® version 2.0.5. However, for some sessions small parts of the speech could still not be transcribed due to the amount of noise, heavy emotions, or weakness of the speech signal. These parts are coded as ‘inaudible’, including start and end time. The exact start and end time of each hotspot were coded by the first two authors.

The transcriptions were then converted to the (C)XML file format for annotating transcriptions to enable parsing (easily separating patient from therapist speech and hotspot from non-hotspot phases for the text analysis) and to link the transcribed text to the digital audio recordings. Linking of text and audio data was done using forced alignment within the WebMaus Pipeline version 2.25 (Kisler, Reichel, & Schiel, Citation2017), including the Chunker function by Poerner and Schiel (Citation2016). The resulting TextGrid files were then complemented with interval tiers; connected sequences of labelled intervals annotating hotspots, the speaker turns, and silences, using Praat version 6.0.4.3 (Boersma & Weenink, Citation2019). This way the transcriptions and recordings were converted to input formats suitable for the multimodal classification pipeline.

2.1.2. Identifying hotspots

The Hotspots Manual of Holmes and Grey (Citation2002), and succeeding research on hotspots by Holmes et al. (Citation2005) and Grey and Holmes (Citation2008), was used to identify hotspots: 1) the moment is defined by the patient as the ‘worst moment’; 2) the moment was identified as a hotspot in a previous session; 3) an audible change in affect; 4) the patient changes from present to past tense; 5) the patient changes from first to third person; 6) the patient is ‘whizzing through’; 7) the patient cannot remember details of the moment; 8) the patient is dissociating; or 9) the moment is mentioned by the patient to correspond to an intrusion.

2.2. Operational constructs for automatic recognition

We distinguished nine constructs underlying hotspots that could be used in their automatic recognition. Five of these are based on the formal hotspot characteristics; affect, emotions, cognitions, dissociation, and avoidance. The remaining four (cohesion, organization, fragmentation, and complexity) were selected based on previous research on CBT sessions. Although until now, except for organization, these additional constructs were mainly studied with regard to complete trauma narratives and not to specific parts such as hotspots, we expected them to be useful for automatic hotspot recognition as they do play a part in the emotional processing of traumatic events (Amir et al., Citation1998).

Each construct is operationalized through variables that can be measured based on combinations of either text, speech, or text and speech features. Since the aim is to recognize hotspots automatically, we only used those variables that could be measured based on automatically extracted (i.e. without the need for manual coding) text and speech features. The features used to capture each construct are described in sections 2.3.2 Text feature extraction and 2.3.3 Speech feature extraction. More feature details, including examples and equations, can be found in Appendix A and Appendix B. The operationalization of each construct and the related features are shown schematically in Figure 2 and elaborated upon in Appendix C.

2.2.1. Affect

According to Grey et al. (Citation2002) a visible change in affect (e.g. bursting into tears, turning red, shaking, or sweating) is the most obvious way to identify a hotspot. When working with audio files, audible cues can be used instead of visible cues, as in Nijdam et al. (Citation2013), who showed that change in affect remains a strong identifier even without the visible aspect. Juslin and Scherer (Citation2005) define affect as ‘a general, umbrella term that subsumes a variety of phenomena such as emotion, stress, mood, interpersonal stance, and affective personality traits’ [p. 69].

2.2.2. Emotions

Emotion is one of the affective phenomena listed by Juslin and Scherer (Citation2005), and as such the constructs affect and emotions are closely related. Holmes et al. (Citation2005) distinguished 11 emotion categories based on emotion words that occurred during hotspots: fear, helplessness, anger, sadness, surprise, disgust, dissociation, happiness, shame, guilt, and horror. Of these, especially anxiety, helplessness, and horror are deemed important, as these were specified explicitly under PTSD criterion A2 of the DSM-IV-TR (American Psychiatric Association, Citation2000), although this criterion was removed from the most recent version, the DSM-5 (American Psychiatric Association, Citation2013). We also expected higher occurrences of the emotions anger, sadness, shame, disgust, and guilt, as these were found to be often related to hotspots (Grey et al., Citation2001, Citation2002; Holmes et al., Citation2005).

2.2.3. Cognitions

In addition to emotion categories, Holmes et al. (Citation2005) distinguished seven cognitive themes that can characterize hotspots: uncertain threat, general threat of injury and death, control and reasoning, consequences, abandonment, esteem, and cognitive avoidance. Cognitive themes of psychological threat (sense of self) were found to appear more in hotspots than those of physical threat (physical integrity) (Grey & Holmes, Citation2008; Holmes et al., Citation2005).

2.2.4. Dissociation

Hotspots are also identified by changes in speaking style. During imaginal exposure, patients are asked to describe the past event as if it were happening now, in the first person present tense. Patients may dissociate during hotspots by changing from present to past tense or from first to third person (Grey et al., Citation2002). This altered or unreal perception of the traumatic event may indicate that peritraumatic dissociation occurred during or directly after the traumatic experience.

2.2.5. Avoidance

Other hotspot characteristics related to speaking style described by Grey et al. (Citation2002) are ‘whizzing through’ (rushing through the main event giving minimal details, while extensively describing the build-up and aftermath) and the patient declaring he or she is unable to remember details of the moment. These characteristics reflect (non-conscious) avoidance.

2.2.6. Cohesion

Narrative cohesion focuses on the occurrence of explicit cues within the text that enable the reader (or listener) to make connections within or between sentences or clauses (Crossley, Kyle, & McNamara, Citation2016). Previous studies found cohesion to be related to the level of intrusive symptoms in children (O’Kearny, Speyer, & Kenardy, Citation2007) and trauma-related avoidance (O’Kearney, Hunt, & Wallace, Citation2011), which both are hotspot characteristics.

2.2.7. Organization

Trauma survivors with PTSD are found to produce more disorganized trauma narratives than trauma survivors without PTSD (Halligan, Michael, Clark, & Ehlers, Citation2003; Jones, Harvey, & Brewin, Citation2007). The (dis)organization of the ‘worst moments’ (hotspots) in traumatic memories was previously studied based on text features by Jelinek et al. (Citation2010).

2.2.8. Fragmentation

Foa et al. (Citation1995) suggest that trauma memories are more fragmented (i.e. lacking flow) for trauma survivors with PTSD, because information could not be adequately processed and encoded under stressful conditions. They found a significant correlation between fragmentation and PTSD symptoms over treatment.

2.2.9. Complexity

Amir et al. (Citation1998) found that narrative complexity correlated negatively with PTSD severity three months after the trauma. They found that patients who wrote more simplistic narratives showed more severe PTSD than patients who wrote more complex narratives. However, later studies concluded that the observed effects could also be due to differences in writing skill and cognitive ability (see Gray & Lombardo, Citation2001). Complexity may relate to the hotspot characteristic ‘whizzing through’, due to which hotspot moments are described in a more simplistic fashion and in less detail. Also, hotspot moments may be narrated in a more fragmented way due to changes in affect.

2.3. Classification pipeline

The development of a new classification model involves two phases; a training phase and a prediction phase. In the training phase, information is extracted from each object following a range of preprocessing and feature extraction steps, resulting in labelled feature sets. A machine learning algorithm uses those labelled feature sets to learn and select the most discriminative text and speech features for the ‘hotspot’ versus the ‘non-hotspot’ phases.

Figure 2. Operationalization scheme for constructs underlying hotspots (red), related variables (blue), and extracted features (green). For each node is indicated whether it is expected to increase (+), decrease (-), change in both directions (~), or either direction (?)

In the prediction phase, the classifier uses those features to identify hotspots from new imaginal exposure session recordings and transcripts (for more on the development of classification models, see Wiegersma, Van Noije, Sools, & Veldkamp, Citationn.d.). This sequence of steps, in which the output of each step is the input for the next, is called a pipeline (see Figure 3).

Figure 3. Multimodal supervised classification pipeline

The preparation, preprocessing and feature extraction steps were done separately for text and speech features because they require different techniques. Feature selection and machine learning were applied to the combined, multimodal feature sets. Text preprocessing and feature extraction was done in Python 3.7.2 (Python Software Foundation, Citation2019) using the Natural Language Toolkit (NLTK 3.4; NLTK Project, Citation2019) and Python’s Textstat package (version 0.5.4; Bansal & Aggarwal, Citation2018), and in LIWC using the Dutch LIWC dictionary and the NRC emotion lexicon. Audio preprocessing and feature extraction were done using Audacity® version 2.0.5, WebMaus version 2.25, and Praat version 6.0.4.3. Conversion of the text transcripts from plain text files to parsable and linkable file formats was done using custom XML and CXML converters developed by one of the authors (available upon request). For feature selection and machine learning, the Scikit-learn library (Pedregosa et al., Citation2011) version 0.20.2 was used.

2.3.1. Preprocessing

The text and audio analysis focused on patient speech only. The textual input for the classification pipeline consisted of plain text files containing the transcribed, anonymized patient speech cut into ‘hotspot’ and ‘non-hotspot’ segments (parts in the exposure phase preceding or following a hotspot). In total the transcripts were split into 37 hotspot segments and 45 non-hotspot segments. To analyse the text on word level, separate words were extracted from the transcripts using the word tokenizer for Dutch implemented in NLTK (see Perkins, Citation2014, for more on tokenization). All words were normalized by removing punctuation, accents, and capital letters. For the N-gram extraction, each word except for stop words was stemmed (reduced to its base form, see Jurafsky & Martin, Citation2009, for more on stemming) using a standard Dutch Snowball stemmer included in NLTK (Porter, Citation2001). For the tagger-based feature extraction and the overall text characteristics the unstemmed input text was used.
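As an illustration, the following minimal sketch (in Python, using NLTK, which was also used in this study) shows how such tokenization, normalization, and selective stemming could be implemented. The function names and the example sentence are illustrative only, not the study's code, and the NLTK 'punkt' and 'stopwords' data need to be downloaded beforehand.

```python
# Minimal sketch of the text preprocessing steps described above,
# assuming Dutch transcripts stored as plain strings.
import string
import unicodedata

from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer
from nltk.tokenize import word_tokenize

stemmer = SnowballStemmer("dutch")
stop_words = set(stopwords.words("dutch"))

def normalize(token):
    # Remove accents, punctuation, and capital letters.
    token = unicodedata.normalize("NFKD", token).encode("ascii", "ignore").decode("ascii")
    token = token.translate(str.maketrans("", "", string.punctuation))
    return token.lower()

def preprocess(segment_text):
    tokens = [normalize(t) for t in word_tokenize(segment_text, language="dutch")]
    tokens = [t for t in tokens if t]  # drop tokens that were pure punctuation
    # Stem every word except stop words, as described for the N-gram extraction.
    return [t if t in stop_words else stemmer.stem(t) for t in tokens]

print(preprocess("Ik liep daar en toen gebeurde het..."))
```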

For the audio analysis, the prepared TextGrid files (see 2.1.1 Data Preparation) were directly processed in Praat, selecting the audio signals for patient-speech only and distinguishing between hotspot and non-hotspot phases within the annotated interval tiers. In line with Jurafsky and Martin (Citation2009), we used utterances instead of sentences because we work with a corpus of transcribed speech that, unlike original text corpora, does not contain punctuation. Utterances, which can be words, phrases or clauses, were identified based on Tanaka, Sakti, Neubig, Toda, and Nakamura (Citation2014), in which utterances are separated based on a pause in speech longer than 1 s.
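A minimal sketch of this pause-based segmentation is given below. It assumes word-level timestamps (e.g. from the forced alignment) are available as (word, start, end) tuples; the 1-second threshold follows Tanaka et al. (Citation2014) as described above, and the example values are hypothetical.

```python
# Group aligned words into utterances whenever the silent gap between two
# consecutive words exceeds the pause threshold (1 second).
def split_into_utterances(aligned_words, pause_threshold=1.0):
    utterances, current = [], []
    prev_end = None
    for word, start, end in aligned_words:
        if prev_end is not None and (start - prev_end) > pause_threshold:
            utterances.append(current)
            current = []
        current.append(word)
        prev_end = end
    if current:
        utterances.append(current)
    return utterances

words = [("ik", 0.0, 0.3), ("liep", 0.4, 0.8), ("daar", 2.1, 2.5)]
print(split_into_utterances(words))  # [['ik', 'liep'], ['daar']]
```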

2.3.2. Text feature extraction

Text features capture what is being said, focusing on the textual content. Text content can be examined on word or phrase level by extracting unigrams, N-grams, or N-multigrams (single words, phrases, or variable-length word combinations). With small samples, frequencies of individual words or phrases may be too low to recognize specific patterns. In that case, it is useful to analyse words belonging to particular grammatical or lexical categories by assigning labels (tags) to each word using parts-of-speech (POS), lexicon-based, or custom taggers.

In general, grammatical POS tags such as personal pronouns and verb tense are thought to give information about one’s (temporal) focus and psychological distance towards a situation or event, which may provide cues on thought processes, priorities, and intentions (Tausczik & Pennebaker, Citation2010). Tags regarding verb tense are also considered useful in assessing memory (dis)organization and time perspective (Jelinek et al., Citation2010). Previous studies in which POS tags were used showed that tags such as first-person singular pronouns correlated positively with psychological distress (Rude, Gortner, & Pennebaker, Citation2004; Wolf, Sedway, Bulik, & Kordy, Citation2007). In addition, trauma survivors that were sensitive to developing posttraumatic stress symptoms were found to use more first-person plural than first-person singular pronouns (Chung & Pennebaker, Citation2007; Stone & Pennebaker, Citation2002).

A widely used lexicon-based tagger is LIWC, which assigns words to categories related to linguistic elements, emotions, and cognitive processes, and counts their relative frequencies. Since hotspots are the most emotionally distressing moments of trauma (Nijdam et al., Citation2013), special attention was paid to the emotions present in the transcripts. Although LIWC extracts several emotion categories (anxiety, anger, and sadness), more extensive insight in the emotions was gained using a General Purpose Emotion Lexicon (GPEL), which is considered to significantly improve emotion classification (Aman & Szpakowicz, Citation2007).

Finally, text characteristics and statistics were extracted to analyse textual differences on the general level. Previous studies showed that these characteristics can be used to detect emotions (Lee & Narayanan, Citation2005) or as indicators for physical symptoms and discomfort (Alvarez-Conrad et al., Citation2001).

The text features were extracted over the complete hotspot or non-hotspot phase, extracting all text features for each separate hotspot and non-hotspot segment. To prevent bias towards longer text documents, the extracted N-grams were weighted by normalized term frequency (tf, occurrence counts normalized by document length, see more in Forman, Citation2003) or term frequency-inverse document frequency (tf-idf, see more in Jurafsky & Martin, Citation2009), which are the most commonly used feature weights. The occurrence frequencies returned by the taggers were normalized by document length. A detailed description of all used text features, their relation to the operational constructs, and the extraction process can be found in Appendix A.
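The sketch below illustrates how such N-gram extraction and tf versus tf-idf weighting could be set up with scikit-learn (the library used in this study for feature selection and machine learning). The example segments and parameter values are illustrative only and do not correspond to the settings selected by the grid search.

```python
# Hedged sketch of N-gram extraction with tf and tf-idf weighting over
# hotspot/non-hotspot segments represented as preprocessed strings.
from sklearn.feature_extraction.text import TfidfVectorizer

segments = [
    "ik lig op de grond",
    "nee nee nee het doet zo veel pijn",
    "ik loop weer naar huis",
]

# tf: occurrence counts normalized by segment length (no idf component).
tf_vec = TfidfVectorizer(ngram_range=(1, 3), min_df=1, use_idf=False, norm="l1")
# tf-idf: the same counts additionally down-weighted by document frequency.
tfidf_vec = TfidfVectorizer(ngram_range=(1, 3), min_df=1)

X_tf = tf_vec.fit_transform(segments)
X_tfidf = tfidf_vec.fit_transform(segments)
print(X_tfidf.shape)                       # (segments, extracted N-grams)
print(sorted(tfidf_vec.vocabulary_)[:10])  # a few of the extracted N-grams
```

Here min_df corresponds to the minimal document frequency discussed in section 2.3.6, and ngram_range to the representation schemes compared in the grid search.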

2.3.3. Speech feature extraction

In addition to what is being said, which is captured by the text features, it is of interest how things are said, since one’s manner of speaking can convey signs of emotions or stress (Lefter, Rothkrantz, Van Leeuwen, & Wiggers, Citation2011; Scherer, Citation2003). Some emotions, especially emotions that are high in arousal, such as anger and fear, can be better identified from spoken than from written data (e.g. Truong & Raaijmakers, Citation2008).

The study of speech sounds is called phonetics. Phonetic studies can focus on how sounds are produced (articulatory phonetics), how sounds are perceived (auditory phonetics), or how sounds are transmitted (acoustic phonetics) (Ashby, Citation2013). The latter concentrates specifically on the acoustic characteristics (or physical properties) of speech, such as frequency, amplitude, and duration, which can be objectively measured by analysing acoustic waveforms. A waveform is a graphical representation of a sound wave, in which the variation in air pressure (y-axis) involved with the production of sound is plotted over time (x-axis) (Jurafsky & Martin, Citation2009). It is generally assumed that one’s affective state is reflected by objectively measurable voice cues. As such, acoustic phonetics are considered the most promising phonetic features in examining affect and emotion (Juslin & Scherer, Citation2005).

Lefter et al. (Citation2011) divide acoustic features into prosodic, spectral, and voice quality features. Studies in which the identification of emotions or affective state plays a role mostly depend on prosodic features. Prosody refers to a collection of acoustic features that concern intonation-related (pitch), loudness-related (intensity), and tempo-related (e.g. durational aspects, speaking rate) features (Jurafsky & Martin, Citation2009). This can closely contribute to meaning and may reveal information normally not captured by textual features, such as emotional state or attitude (Wilson & Wharton, Citation2006).

Prosodic features generally cover speech units larger than one segment, such as syllables, words, or speaker turns, and are therefore also termed suprasegmentals (Jurafsky & Martin, Citation2009). The suprasegmentals pitch, loudness, and duration (tempo and pause) are among the most used features in the phonetic study of prosody (see, e.g. the prosodic frameworks of Ladd & Cutler, Citation1983; Roach, Citation2000; Schoentgen, Citation2006). Several recent clinical studies used suprasegmental features for the diagnosis of a range of psychological disorders (Scherer et al., Citation2013), or specific disorders such as PTSD (Vergyri et al., Citation2015) and dementia (Fraser et al., Citation2014; Jarrold et al., Citation2014). Other purposes for which suprasegmentals have been used include identifying indicators for PTSD therapy progress (Van Den Broek et al., Citation2009) and assessing depression severity during therapeutic intervention (Lamers, De Jong, Truong, Steunenberg, & Westerhof, Citation2014).

In addition to prosodic features, spectral features such as Mel-frequency cepstral coefficients (MFCCs) are commonly used in emotion detection as these are affected by emotional arousal (Lefter et al., Citation2011). Voice quality features such as high-frequency energy (HF) are found to be strongly related to emotions as well. Apart from neutral, voice qualities can be, for example, breathy, creaky, harsh, tense, or whispery. Finally, overall speaker turn statistics (e.g. turn length, the number of utterances per turn) were extracted as these can gauge language strength (poverty of speech) and structural organization (Orimaye et al., Citation2014).

The audio data were analysed based on the prosodic features pitch, loudness, and duration, which are the most commonly used voice cues (Juslin & Scherer, Citation2005), acoustic parameters related to spectral and voice quality features, and turn statistics. The prosodic, voice quality, and general features were extracted at speaker turn and utterance level, the spectral features at the frame level. In three segments not all speech features could be extracted at the patient level because these segments contained no or only one voiced segment, so no SDs could be calculated for the affected speech features. For these missing values, the overall class averages (hotspot or non-hotspot) were imputed. More information on the used speech features, their relation to the operational constructs, and the extraction process is given in Appendix B.
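As an illustration, the sketch below extracts a few of these prosodic features (pitch, intensity, duration) from a single audio segment using the parselmouth Python bindings for Praat. The study itself used Praat directly, so this is a stand-in, and the file name is hypothetical.

```python
# Hedged sketch of prosodic feature extraction for one patient-speech segment.
import numpy as np
import parselmouth

snd = parselmouth.Sound("patient_hotspot_segment.wav")  # hypothetical file

pitch = snd.to_pitch()
f0 = pitch.selected_array["frequency"]
f0 = f0[f0 > 0]                          # keep voiced frames only (0 = unvoiced)

intensity = snd.to_intensity()

features = {
    "mean_pitch_hz": float(np.mean(f0)),
    "sd_pitch_hz": float(np.std(f0)),
    "mean_intensity_db": float(np.mean(intensity.values)),
    "duration_s": snd.get_total_duration(),
}
print(features)
```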

2.3.4. Feature union

Table 1 shows all extracted text and speech features. These features consist of a mixture of scales and quantities (e.g. normalized term and category frequencies, overall text statistics, mean amplitude values, and duration measures). Feature rescaling was done to make sure all input features have the same scale. This is preferred for many machine learning applications, to prevent features measured in greater numeric ranges from dominating features measured in smaller ranges. As such, each text and speech feature was rescaled to the [−1, +1] range, as proposed by Hsu, Chang, and Lin (Citation2003), so that each feature’s maximal absolute value is equal to one (see Figure 4). This same scaling method is later applied to rescale the features in the test set.
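This rescaling corresponds to scikit-learn's MaxAbsScaler, as sketched below with toy values; fitting the scaler on the training data and reusing it for the test set mirrors the procedure described above.

```python
# Minimal sketch of the [-1, +1] rescaling: each feature is divided by its
# maximal absolute value observed in the training data.
import numpy as np
from sklearn.preprocessing import MaxAbsScaler

X_train = np.array([[120.0, 0.02],
                    [250.0, 0.50],
                    [80.0, -0.10]])       # toy (segments x features) matrix

scaler = MaxAbsScaler().fit(X_train)      # learn per-feature max |value|
X_train_scaled = scaler.transform(X_train)

X_test = np.array([[300.0, 0.25]])
X_test_scaled = scaler.transform(X_test)  # same scaling reused for the test set
print(X_train_scaled)
print(X_test_scaled)
```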

Figure 4. Rescaling process applied to extracted text and speech features before feature selection

Table 1. Feature overview

2.3.5. Feature selection

The most informative features are selected using Pearson’s chi-squared (χ2) test, an effective feature selection metric (Yang & Pedersen, Citation1997) often used in text classification tasks. A more thorough explanation of χ2 feature selection can be found in Oakes et al. (Citation2001) or Manning, Raghavan, and Schütze (Citation2008). The χ2-test compares the observed and expected feature occurrences in the hotspot versus non-hotspot phases. All features are then ranked based on their χ2-scores, and the k features with the highest χ2-scores are selected for the final classification model (see Wiegersma et al., Citationn.d., for a complete description of the process).
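A minimal sketch of this selection step with scikit-learn is shown below. Note that scikit-learn's chi2 implementation expects non-negative feature values, so the toy matrix is kept non-negative; the data are random stand-ins rather than the study's features.

```python
# Hedged sketch of chi-squared feature selection keeping the k best features.
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.default_rng(0)
X = rng.uniform(size=(82, 300))   # toy non-negative matrix: 82 segments x 300 features
y = rng.integers(0, 2, size=82)   # 1 = hotspot, 0 = non-hotspot

selector = SelectKBest(score_func=chi2, k=50)   # keep the 50 highest-scoring features
X_selected = selector.fit_transform(X, y)
print(selector.get_support(indices=True))       # indices of the selected features
print(X_selected.shape)                          # (82, 50)
```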

2.3.6. Excluded features

Some state that stop words should not be included in the classification model, because these words do not add to the meaning of text (Jurafsky & Martin, Citation2009; Perkins, Citation2014). Other studies found that stop words such as particles and pronouns may indicate health improvements (Campbell & Pennebaker, Citation2003). Since we expected particles and pronouns to be related to the construct fragmentation, we think stop words should not simply be excluded without further investigation, even if some (such as particles and pronouns) are also captured by the POS tagger.

To avoid needlessly large feature sets, other words that were considered for exclusion were words that only occur in very few documents (Joachims, Citation1998). This was assessed through minimal document frequency; the minimal number of different training documents a word occurs in.

2.3.7. Machine learning algorithm

The extracted text and speech feature sets were used to train a Support Vector Machine (SVM; Vapnik, Citation1995). SVMs are found to be among the best performing, most robust classification algorithms that can deal well with high-dimensional or imbalanced data sets (Joachims, Citation1998). We used the ‘C-Support Vector Classifier’ (SVC) with a linear kernel, implemented in Scikit-learn’s LIBSVM library (Chang & Lin, Citation2011). Two settings needed to be specified: the kernel, which we set to linear as is commonly done in text classification tasks, and the regularization parameter C, for which we compared different values in the parameter grid search.

Our classification task was a two-class problem; we wanted to distinguish hotspot phases from non-hotspot phases based on patient speech, defining hotspot phases as the positive class. To compensate for possible class imbalance we balanced class weights to be inversely proportional to the class sizes within the total data set, as in King and Zeng (Citation2001).
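The configuration described above corresponds to a scikit-learn SVC with a linear kernel and balanced class weights, as in the following minimal sketch; the feature values and labels are toy data, not the study's features.

```python
# Minimal sketch of the classifier configuration: linear kernel,
# class weights inversely proportional to the class sizes.
import numpy as np
from sklearn.svm import SVC

X_train = np.array([[0.2, -0.1], [0.8, 0.3], [0.4, 0.9], [-0.5, -0.7]])
y_train = np.array([1, 1, 0, 0])   # 1 = hotspot (positive class), 0 = non-hotspot

clf = SVC(kernel="linear", C=1.0, class_weight="balanced")
clf.fit(X_train, y_train)

# A decision-function value > 0 corresponds to a predicted 'hotspot' label.
print(clf.decision_function(np.array([[0.6, 0.1]])))
```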

2.3.8. Classification performance

In the training phase, the most informative features were extracted and selected for the final classification model. In the prediction phase, the occurrences of those selected features were used to predict a class label for each new input file. The model’s classification performance was measured by comparing the true (known) labels of each input feature set with the predicted label for that feature set. Labels were predicted by applying the decision function resulting from the training phase to the segments present in the test set. The segments were given a positive label (‘hotspot’) if the decision function resulted in a value > 0, and a negative label (‘non-hotspot’) otherwise (see Alpaydin, Citation2004, for an extensive description of the decision function and optimization problems involved when using SVMs).

The instances in the true and predicted classes can be included respectively in the rows and columns of a confusion matrix, as displayed in Table 2. The cells on the diagonal contain the number of correctly predicted labels (true positives and true negatives); the errors (false positives and false negatives) are in the off-diagonal cells (Bird et al., Citation2009). We used the correct and false predictions to calculate the classification performance metrics accuracy, precision, recall, and F1-score (see Table 3 for definitions) for the positive class. Of these, accuracy and F1-score are the most commonly used in the evaluation of supervised classification models, although the F1-score is the most suitable to deal with possible class imbalance. We will report all performance scores for each class and the weighted average over both classes, in which the performance scores of both classes are macro-averaged (Yang, Citation1997) and weighted by class size.
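For illustration, the following sketch computes these metrics with scikit-learn from hypothetical true and predicted labels; the numbers are made up and are not the study's results.

```python
# Minimal sketch of the evaluation metrics (1 = hotspot, 0 = non-hotspot).
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_recall_fscore_support)

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0]   # hypothetical true labels
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1]   # hypothetical predicted labels

print(confusion_matrix(y_true, y_pred))   # rows = true labels, columns = predictions
print(accuracy_score(y_true, y_pred))
print(precision_recall_fscore_support(y_true, y_pred, average="weighted"))
```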

Table 2. Confusion matrix to assess model performance

Table 3. Performance metrics and functions

2.3.9. Analytical strategy

We adopted a nested k-fold cross-validation (CV) strategy, iterating over alternating subsets of data (folds) to train, validate, and test the model in order to prevent model evaluation bias. In the inner loop, a 10-fold CV grid search was conducted on the training set, calculating training performance on the validation set to find the optimal combination of (hyper) parameter settings. In the outer loop, the selected model was trained on the complete development set (consisting of the training plus the validation set), calculating the testing performance on the held-out test set to evaluate model generalizability. We will report on the selected parameters and features for the model with the highest testing performance, which was selected as the final hotspot classification model.

Because each session was split into several hotspot and non-hotspot segments, the data set contained multiple labelled segments per patient. To prevent the machine learning algorithm from learning patient-specific features instead of class-specific features, we used Scikit-learn’s group-K-fold sampling strategy in both cross-validation loops. This strategy splits the folds in such a way that data from the same patient never occur in both the training and the test set.
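A hedged sketch of this nested, patient-grouped cross-validation is given below; the feature matrix, labels, and patient groups are random stand-ins, and only the C parameter is varied to keep the example short.

```python
# Hedged sketch of nested cross-validation with patient-grouped folds.
import numpy as np
from sklearn.model_selection import GridSearchCV, GroupKFold
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(82, 20))          # 82 segments x 20 features (toy stand-in)
y = rng.integers(0, 2, size=82)        # 1 = hotspot, 0 = non-hotspot
groups = np.arange(82) % 20            # patient ID per segment (20 patients)

outer_cv = GroupKFold(n_splits=10)
inner_cv = GroupKFold(n_splits=10)
outer_scores = []

for dev_idx, test_idx in outer_cv.split(X, y, groups):
    X_dev, y_dev, g_dev = X[dev_idx], y[dev_idx], groups[dev_idx]
    # Inner loop: grid search with patient-grouped folds on the development set.
    search = GridSearchCV(
        SVC(kernel="linear", class_weight="balanced"),
        param_grid={"C": [1, 2, 3, 100, 1000]},
        scoring="f1",
        cv=list(inner_cv.split(X_dev, y_dev, g_dev)),
    )
    search.fit(X_dev, y_dev)
    # Outer loop: evaluate the refitted best model on the held-out test patients.
    outer_scores.append(search.score(X[test_idx], y[test_idx]))

print(np.mean(outer_scores))           # mean testing F1 over the outer folds
```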

2.3.10. Parameter grid search

To find the best performing combination of parameter settings and features, an exhaustive grid search guided by the F1-score was conducted in which all possible parameter combinations (within the set ranges) were fitted on the data set. The following parameters and parameter values were compared (a sketch of the corresponding search grid is given after the list):

  • Stop word removal: because there is no clear consensus on stop word removal, we included this as a parameter in the grid search. Stop words are either included or excluded using the Dutch stop word list from the NLTK library. This list includes 101 words; an overview can be found in Appendix A.

  • Minimal document frequency: we compared the effect of only including N-grams that occurred in at least one, two, or three separate training segments.

  • Representation schemes: we compared four N-gram representation schemes: unigrams (1,1), bigrams (2,2), trigrams (3,3), and 3-multigrams (1,3).

  • Term weights: we compared weighting textual content features by tf versus tf-idf.

  • Select k best features: we compared different cut-off points (k) for the number of features to be included in the model based on the χ2 feature selection metric. We compared values in the range 10–500 (in steps of 20 features) and all available features.

  • Regularization parameter C: the values 1, 2, 3, 100, and 1000 were compared.
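The sketch below shows how this grid could be encoded for a scikit-learn Pipeline and GridSearchCV. The step names ('vect', 'select', 'clf') and the exact value lists are illustrative assumptions rather than the authors' code.

```python
# Hedged sketch of the search grid described above, assuming a pipeline with
# steps named 'vect' (TfidfVectorizer), 'select' (SelectKBest), 'clf' (SVC).
from nltk.corpus import stopwords  # requires the NLTK 'stopwords' data

dutch_stop_words = stopwords.words("dutch")  # the 101-word NLTK Dutch stop word list

param_grid = {
    "vect__stop_words": [None, dutch_stop_words],           # include vs. exclude stop words
    "vect__min_df": [1, 2, 3],                               # minimal document frequency
    "vect__ngram_range": [(1, 1), (2, 2), (3, 3), (1, 3)],   # uni-, bi-, tri-, 3-multigrams
    "vect__use_idf": [False, True],                          # tf vs. tf-idf weighting
    "select__k": list(range(10, 501, 20)) + ["all"],         # roughly 10-500 in steps of 20
    "clf__C": [1, 2, 3, 100, 1000],                          # SVM regularization parameter
}
print(len(param_grid["select__k"]))
```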

To compare the performance of text features with that of speech features and text and speech features combined, the complete model development pipeline, including nested k-fold cross-validation and exhaustive grid search, was run three times. This resulted in three trained and tested models; one text only model, one speech only model, and one multimodal model. The model with the highest training performance was selected as the final model.

3. Results

3.1. Sample characteristics

In total, the selected recordings contained around 6.5 h of imaginal exposure speech, of which over 2 h of ‘hotspot’ speech (mean hotspot length 3.5 min) and over 4 h of ‘non-hotspot’ speech (mean non-hotspot length 5.5 min). Of the hotspot speech 70% is uttered by the patient; for the non-hotspot speech this is 78%. On average, the non-hotspot segments contain almost twice as many speaker turns and utterances as the hotspot segments, although the high SDs show there are large differences between segments. The numbers of word types and tokens show that patients use more unique words in non-hotspots than in hotspots. Patient speech has a higher pitch in hotspots than in non-hotspots. A summary of the main hotspot and non-hotspot characteristics is given in Table 4.

Table 4. Summary of characteristics hotspots, non-hotspots, and total sample

3.1.1. Validation splits

The total data set consisted of data of 20 patients. In the outer loop of the nested k-fold cross-validation process, the data set was iteratively split into 10 development and test sets. The development sets consisted of the hotspot and non-hotspot segments of 18 patients (90% of the total sample), and the test sets of the remaining 10% (two patients). An exhaustive grid search was conducted on the development set in the inner loop, during which the development set was again iteratively split into 10 training and validation sets, consisting of 90% (16 or 17 patients) and 10% (one or two patients) of the development data, respectively.

3.2. Model comparison

We developed three different models; the first model was based only on text features, the second model used only speech features, and the third (multimodal) model consisted of text and speech features combined. This section reports the mean training performance of each model. The mean testing performance for all three models is discussed at the end of this section under 3.3.3 Generalizability.

3.2.1. Text features only

The model based exclusively on text features was trained using N-grams, N-multigrams, the 96 lexicon-based, POS, or custom tags, and the general text characteristics included in Appendix A. The exhaustive grid search resulted in a mean training F1-score of 0.75 (SD = 0.03) for the hotspot class. This is a good classification performance, and the low SD shows that the grid search results are stable with little variation over the different folds. The model with the highest testing performance resulted in a reasonable precision (0.60), perfect recall (1.00), high F1-score (0.75), and a high classification accuracy (0.75). This model consisted of N-multigrams ranging from one to three words weighted by the tf-idf scheme. Among the most informative features for the hotspot class are words and word combinations such as ‘neck’, ‘terrible’, and ‘no no no’. The best text model was based on only 10 N-multigrams; general text features, lexicon-based features and POS tags were not among the most informative features selected by the grid search.

3.2.2. Speech features only

For the speech feature only model, 111 extracted speech features (see Appendix B for an overview) were compared in the exhaustive CV grid search. The mean training F1-score resulting from the exhaustive grid search was 0.62 (SD = 0.03) for the hotspot class. This is a reasonable performance score, although lower than that of the text only model. As for the text only model, the low SD points to stable grid search results over the folds in the inner loop. The model with the highest testing results was based on 10 speech features selected by the grid search and had a good precision (0.75), recall (0.75), and F1-score (0.75), and an overall classification accuracy of 0.75. The five most informative hotspot (marked by *) and non-hotspot features for this model are displayed in the index graph in Figure 5. This graph shows the change in each feature for the consecutive hotspot and non-hotspot segments compared to the base value of that feature at the start of the exposure session.

Figure 5. Five most informative speech features for hotspots (*) and non-hotspots

3.2.3. Text and speech features combined

When using both text and speech features, the mean training F1-score was 0.76 (SD = 0.04) for the hotspot class. As for the speech only model, the combined model with the highest testing performance had good precision (0.75), recall (0.75), and F1-score (0.75). The overall training accuracy of the multimodal model was slightly better than that of the best text only and speech only models, namely 0.78.

3.3. Final model

The multimodal model was selected as the final model because this resulted in the highest training F1-score for the hotspot class and overall accuracy. This model consisted of 310 text and speech features, where the text features were tf-idf weighted trigrams that occurred in at least two different segments in the training set.

3.3.1. Most informative features

Of the 50 most informative features, three are speech features, seven are LIWC features, two are features extracted through the NRC emotion lexicon, one is a POS tag, one is a custom tag, one is a text statistic, and the remaining 35 are trigrams. To illustrate the occurrences of different feature types in both classes, Table 5 shows a selection of 25 highly informative features that were included in the model.

Table 5. Selection of most informative features of the multimodal classifier

3.3.2. Confusion matrix

The confusion matrix in Table 6 shows the number of correctly versus erroneously predicted labels for the hotspots and non-hotspots present in the test set. This shows that on average the model labelled three of the four hotspots correctly, and four of the five non-hotspots. It incorrectly labelled one hotspot as a non-hotspot and vice versa.

Table 6. Confusion matrix to assess model performance

3.3.3. Generalizability

The model generalization is the average testing performance over all test sets in the outer loop of the nested k-fold CV. This shows how well a model trained and validated on the labelled input data predicts the correct output for new, future data (Alpaydin, Citation2004). The testing performance for the final (multimodal) model was lower than the training performance (see Table 7), which means the developed model will not generalize well to new data. This was also the case for the best performing text only and speech only models. However, the models that make use of speech features (the speech only model and the combined text and speech model) seem to be slightly more robust than the model based only on text features. Since the text only model was based on only 10 N-multigrams, it could be that the selected features for the text model were too specific.

Table 7. Mean testing performance

4. Discussion

The aim of this paper was to examine whether it was possible to automatically recognize hotspots in patients undergoing a trauma-focused treatment for PTSD. We hypothesized that a combination of text and speech features extracted from recorded and transcribed patient speech could be used to develop a supervised classification model to automatically distinguish between the hotspot and non-hotspot phases during imaginal exposure sessions. Based on the formal hotspot characteristics and previous research on hotspots and CBT sessions, we identified nine constructs that we expected to differ between hotspots and non-hotspots. We expected that hotspots would contain more affect, avoidance, dissociation, fragmentation, emotions, and cognitions, and less organization, cohesion, and complexity. These nine constructs were operationalized through a number of text and speech characteristics that were captured using a large range of features extracted from CBT session transcripts and recordings, as shown in Figure 2.

The results showed that text and speech features related to these constructs could indeed be used to train a stable model to distinguish between hotspots and non-hotspots within the current data set. The models consisting of text features alone or text and speech features combined resulted in the highest training performance. The training performance of models based on speech features alone was lower. However, clear fluctuations in speech features over the hotspot and non-hotspot segments were found. The high training performance shows that we were able to develop a model based on text and speech features that could classify the hotspot and non-hotspot segments included in the current data set very well.

The feature overview in Table 5 shows that many of the selected features are related to the construct Emotions (e.g. emotion categories disgust, anger, sadness, and anxiety, as well as audible emotional expressions such as sniffing and sighing). This was in line with our expectations, as hotspots are considered the most emotional moments in trauma (Nijdam et al., Citation2013) and emotions are found to occur more frequently in hotspot than in non-hotspot phases (Holmes et al., Citation2005). Moreover, the strong, clearly distinguishable dictionary-based features and audible cues that were used to capture emotions may have benefited their recognition.

The overview further shows that the LIWC category Sadness was slightly more discriminative than Sadness captured using the NRC lexicon. This could be because the Dutch LIWC dictionary is validated (Zijlstra, Van Meerveld, Van Middendorp, Pennebaker, & Geenen, Citation2004), whereas the NRC categories were simply converted to Dutch using Google Translate. However, both Sadness categories were discriminative enough for inclusion in the final model. The added value of the NRC dictionary lies mainly in the fact that it distinguishes more emotion categories than LIWC, such as the category Disgust, which is also included in the model. Despite this extended range of emotions, two emotions defined by Holmes and colleagues as characteristic of hotspots, namely guilt and horror (the latter of which was also an explicit PTSD criterion in the DSM-IV-TR; American Psychiatric Association, Citation2000), were not covered by the lexicons used. Expanding the emotion lexicon with dictionaries for guilt and horror might improve classification performance.

Psychological theories explaining the working mechanisms underlying PTSD treatments (see Nijdam & Wittmann, Citation2015) state that trauma memories are represented differently from ordinary memories (e.g. lacking spatial or temporal context, or inadequately integrated with broader memories). As exposure aims to re-encode and restructure the trauma memory in such a way that it no longer evokes a feeling of current threat, successful treatment should result in more integrated, more cohesive, and less fragmented trauma narratives, indicating adequate processing of the trauma (Brewin, Dalgleish, & Joseph, Citation1996; Ehlers & Clark, Citation2000; Foa & Rothbaum, Citation1998). However, only a few features related to organization, cohesion, or fragmentation were included in the model, for example, the use of interrogative pronouns (related to Cohesion), the absolute word count and the frequent presence of the speech filler ‘uh’ in the selected N-grams (indicators of Fragmentation), the number of voiced and silent units, and the total duration of speech (to capture Avoidance). One explanation is that, for some features, changes in opposite directions may be indicative of different hotspot-related constructs (e.g. an increased speech rate is related to Avoidance, whereas a decreased speech rate may indicate Emotions), which may reduce these features' discriminative power. Another explanation could be that some of the hotspot characteristics on which we based the set of constructs and features did not occur (frequently) in our data set. For example, the change from first- to third-person perspective, a clear identifier of hotspots, did not take place in any of the sessions.

The low testing performance shows that the selected model does not generalize well to new data sets. Since we tried to fit a complex model with a large number of parameters to a small data set, the low testing performance most likely indicates overfitting (also called overtraining). This means that the selected model has not only learned the underlying structure but also the noise present in the training data (Alpaydin, Citation2004). Another reason for overfitting could be that the noisy audio data impeded accurate extraction of speech features.

Several studies have shown that emotions and mood influence speaking behaviour and speech sound characteristics (Kuny & Stassen, Citation1993; Scherer, Johnstone, & Klasmeyer, Citation2003). As acoustic features can be used to detect conditions in which changes in speech are common (Fraser et al., Citation2014), one could also expect these features to detect moments in which changes in speech occur, such as hotspots. Therapy session recordings and transcripts hold a wealth of information. Text and audio analysis can help to extract and process this information in a structured, efficient, and reproducible way. Moreover, the collection and analysis of text and audio data can be considered non-, or at least less, obtrusive than, for example, questionnaire-based research or biosignal analysis (which requires sensors to be attached to a patient; Van Den Broek et al., Citation2009). Given that much therapy data may already be recorded and processed as part of the standard treatment procedure, for therapist training and ongoing research, or as part of e-health interventions (e.g. Bourla, Mouchabac, El Hage, & Ferreri, Citation2018; Olff, Citation2015; Rizzo & Shilling, Citation2017; Wild et al., Citation2016), it is worth exploring how these available data can be put to further use.

It should be noted that most studies on emotion classification and vocal affect expression are based on clean, artificial data in which emotions are portrayed by actors (Juslin & Scherer, Citation2005) in simple and short utterances (Cowie et al., Citation2001). The data used in the current study contain raw, authentic emotions embedded in a broader context, from people with different backgrounds who experienced different types of trauma, which is much closer to the real world. As such, our data set can be considered highly ecologically valid and valuable not only for psychiatric research and practice but also for studies on speech sounds and emotion recognition (Van Den Broek et al., Citation2009). However, this strong point is also a major limitation. Although re-using existing data sets seems efficient and sustainable, it also introduces challenges. The biggest challenge is the background noise due to simple recording equipment and the transitory nature of analog recordings, which has reduced the recording quality over the years. As a result, automated speech recognition could not be used and session content had to be transcribed manually, which remained impossible for small parts of the recordings even after noise reduction.

Another limitation is a methodological one. Because we had such a small data set, we chose not to waste any information by holding out part of the data for model testing and validation. Instead, we used nested cross-validated grid search, a standard tool included in Scikit-learn. This tool does not provide the option to remove keywords with an occurrence frequency lower than five in the training set (which is suggested in some studies, e.g. Manning & Schütze, Citation1999, to ensure the reliability of the χ2 calculation).

Despite these limitations, we developed a hotspot classification model with high training performance, meaning that the model could clearly distinguish between the hotspots and non-hotspots present in our data set. However, the low testing performance indicates that the model will have difficulty recognizing hotspots in new input data. This is probably due to the application of a complex training strategy using many different features on a relatively small, low-quality, but ecologically valid data set. Another reason could be that the patient characteristics and trauma types present in our data set may have influenced speech characteristics and word use, and as such the features included in the model. This should be studied in more detail on a larger data set. The techniques used lend themselves well to application to larger data sets, and current audio recording equipment makes it easier to collect and process high-quality audio data that can be transcribed automatically using automatic speech recognition. This way, much larger sets of therapy session transcripts and recordings can be generated. Because this study only used text and speech features that could be extracted automatically, it is straightforward to train and test a new hotspot recognition model on new data using the same constructs, which we expect to improve model generalizability.

Although model performance needs to be improved, this type of research has the potential to advance theories about effective treatment elements in the context of trauma treatment. The automatic recognition of hotspots may aid in the comparison of hotspot characteristics across patient groups, trauma types, or dropouts to investigate potential mediators of treatment success, as suggested by Nijdam et al. (Citation2013). In addition, clinicians can gain more insight into the occurrence and characteristics of hotspots and the way hotspots are addressed, which might assist them in offering a more effective intervention to patients who would otherwise not respond sufficiently to treatment (Nijdam & Wittmann, Citation2015).

Because of the low generalizability, the current study should mainly be seen as a proof of concept, showing the technical and practical feasibility and possibilities of text and audio mining for research on trauma treatment processes and mental health research in general. Future research should focus on applying this method to larger, higher quality data sets before more general conclusions can be drawn. Still, we want to emphasize the added value and potential of the methods and data used for future research. In clinical practice, this work may eventually benefit patients because these types of models can provide the therapist with (direct) automated feedback, allowing more precise and unobtrusive monitoring of treatment progress.

Acknowledgments

We want to thank Laurens Satink and Michel Boedeltje from Telecats Enschede for their efforts on aligning our text and audio data.

Disclosure statement

No potential conflict of interest was reported by the authors.

References

  • Alpaydin, E. (2004). Introduction to machine learning. Cambridge, MA: MIT Press.
  • Alvarez-Conrad, J., Zoellner, L. A., & Foa, E. B. (2001). Linguistic predictors of trauma pathology and physical health. Applied Cognitive Psychology, 15(7), S159–S170.
  • Aman, S., & Szpakowicz, S. (2007). Identifying expressions of emotion in text. Text, Speech and Dialogue, 196–26. doi:10.1007/978-3-540-74628-7_27
  • American Psychiatric Association. (2000). Diagnostic and statistical manual of mental disorders, Text Revision (4th ed.). Washington, DC: Author.
  • American Psychiatric Association. (2013). Diagnostic and statistical manual of mental disorders (5th ed.). Washington, DC: Author.
  • Amir, N., Stafford, J., Freshman, M. S., & Foa, E. B. (1998). Relationship between trauma narratives and trauma pathology. Journal of Traumatic Stress, 11(2), 385–392.
  • Ashby, P. (2013). Understanding phonetics. London: Routledge.
  • Audacity Team. (2013). Audacity(R) software is copyright (c) 1999-2016 audacity team. The name Audacity(R) is a registered trademark of Dominic Mazzoni. Retrieved from http://audacityteam.org/
  • Bansal, S., & Aggarwal, C. (2018). Textstat [Python package]. Retrieved from https://pypi.org/project/textstat/
  • Bedi, G., Carrillo, F., Cecchi, G. A., Slezak, D. F., Sigman, M., Mota, N. B., … Corcoran, C. M. (2015). Automated analysis of free speech predicts psychosis onset in high-risk youths. Schizophrenia, 1(1). doi:10.1038/npjschz.2015.30
  • Bekkerman, R., & Allan, J. (2003). Using Bigrams in Text Categorization (Technical Report IR-408). Amherst, MA: Center for Intelligent Information Retrieval, University of Massachusetts Amherst.
  • Bhaskar, J., Sruthi, K., & Nedungadi, P. (2015). Hybrid approach for emotion classification of audio conversation based on text and speech mining. Procedia Computer Science, 46, 635–643.
  • Bird, S., Klein, E., & Loper, E. (2009). Natural language processing with python. Sebastopol, CA: O’reilly Media, Inc.
  • Bisson, J. I., Berliner, L., Cloitre, M., Forbes, D., Jensen, T. K., Lewis, C., … Shapiro, F. (2019). The international society for traumatic stress studies new guidelines for the prevention and treatment of posttraumatic stress disorder: Methodology and development process. Journal of Traumatic Stress, 32, 475–483.
  • Boals, A., & Klein, K. (2005). Word use in emotional narratives about failed romantic relationships and subsequent mental health. Journal of Language and Social Psychology, 24(3), 252–268.
  • Boersma, P., & Weenink, D. (2019). Praat: Doing phonetics by computer [Computer program]. Retrieved from http://www.praat.org/
  • Bormuth, J. R. (1969). Development of standards of readability: Toward a rational criterion of passage performance. Final report, U.S. Office of Education, Project No. 9-0237. Chicago, IL: University of Chicago.
  • Bourla, A., Mouchabac, S., El Hage, W., & Ferreri, F. (2018). e-PTSD: An overview on how new technologies can improve prediction and assessment of posttraumatic stress disorder (PTSD). European Journal of Psychotraumatology, 9(sup1), 1424448.
  • Bradley, R., Greene, J., Russ, E., Dutra, L., & Westen, D. (2005). A multidimensional meta-analysis of psychotherapy for PTSD. American Journal of Psychiatry, 162(2), 214–227. doi:10.1176/appi.ajp.162.2.214
  • Brewin, C. R., Dalgleish, T., & Joseph, S. (1996). A dual representation theory of posttraumatic stress disorder. Psychological Review, 103(4), 670.
  • Campbell, R. S., & Pennebaker, J. W. (2003). The secret life of pronouns: Flexibility in writing style and physical health. Psychological Science, 14(1), 60–65.
  • Chang, C. C., & Lin, C. J. (2011). A library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST), 2, 3.
  • Chung, C., & Pennebaker, J. W. (2007). The psychological functions of function words. In K. Fiedler (Ed.), Social communication (pp. 343–359). New York: Psychology Press.
  • Cowie, R., Douglas-Cowie, E., Tsapatsoulis, N., Votsis, G., Kollias, S., Fellenz, W., & Taylor, J. G. (2001, January). Emotion recognition in human-computer interaction. IEEE Signal Processing Magazine, 18(1), 32–80.
  • Croisile, B., Ska, B., Brabant, M.-J., Duchene, A., Lepage, Y., Aimard, G., & Trillet, M. (1996). Comparative study of oral and written picture description in patients with Alzheimer’s disease. Brain and Language, 53(53), 1–19.
  • Crossley, S. A., Kyle, K., & McNamara, D. S. (2016). The tool for the automatic analysis of text cohesion (TAACO): Automatic assessment of local, global, and text cohesion. Behavior Research Methods, 48(4), 1227–1237.
  • D’Andrea, W., Chiu, P. H., Casas, B. R., & Deldin, P. (2012). Linguistic predictors of post-traumatic stress disorder symptoms following 11 september 2001. Applied Cognitive Psychology, 26(2), 316–323.
  • De Lira, J. O., Ortiz, K. Z., Campanha, A. C., Bertolucci, P. H. F., & Minett, T. S. C. (2011). Microlinguistic aspects of the oral narrative in patients with Alzheimer’s disease. International Psychogeriatrics, 23(3), 404–412.
  • De Vries, G. J., & Olff, M. (2009). The lifetime prevalence of traumatic events and posttraumatic stress disorder in the Netherlands. Journal of Traumatic Stress, 22(4), 259–267.
  • Douma, W. H. (1960). De leesbaarheid van landbouwbladen: Een onderzoek naar en een toepassing van leesbaarheidsformules. Bulletin. Wageningen, Netherlands: Landbouw Hogeschool.
  • Ehlers, A., & Clark, D. M. (2000). A cognitive model of chronic post-traumatic stress disorder. Behaviour Research and Therapy, 38(4), 319–345.
  • Ehlers, A., Clark, D. M., Hackmann, A., McManus, F., & Fennell, M. (2005). Cognitive therapy for post-traumatic stress disorder: Development and evaluation. Behaviour Research and Therapy, 43(4), 413–431.
  • Ehlers, A., Hackmann, A., & Michael, T. (2004). Intrusive re-experiencing in post-traumatic stress disorder: Phenomenology, theory, and therapy. Memory, 12(4), 403–415.
  • Flesch, R. (1948). A new readability yardstick. Journal of Applied Psychology, 32(3), 221.
  • Foa, E. B., Molnar, C., & Cashman, L. (1995). Change in rape narratives during exposure therapy for PTSD. Journal of Traumatic Stress, 8(4), 675–690.
  • Foa, E. B., & Rothbaum, B. O. (1998). Treating the trauma of rape: Cognitive-behavioral therapy for PTSD. New York: Guilford Press.
  • Forbes-Riley, K., & Litman, D. J. (2004). Predicting emotion in spoken dialogue from multiple knowledge sources. In Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics: HLT-NAACL 2004, Boston, MA.
  • Forman, G. (2003). An extensive empirical study of feature selection metrics for text classification. Journal of Machine Learning Research, 3, 1289–1305.
  • Fraser, K. C., Hirst, G., Graham, N. L., Meltzer, J. A., Black, S. E., & Rochon, E. (2014). Comparison of different feature sets for identification of variants in progressive aphasia. Workshop on Computational Linguistics and Clinical Psychology: from Linguistic Signal to Clinical Reality, 17–26. doi:10.3115/v1/W14-3203
  • Gersons, B. P. R., Carlier, I. V. E., Lamberts, R. D., & Van Der Kolk, B. A. (2000). Randomized clinical trial of brief eclectic psychotherapy for police officers with posttraumatic stress disorder. Journal of Traumatic Stress, 13(2), 333–347.
  • Gersons, B. P. R., Meewisse, M., Nijdam, M. J., & Olff, M. (2011). Protocol - Brief eclectic psychotherapy for posttraumatic stress disorder (BEP). Amsterdam, Netherlands: Academic Medical Centre, University of Amsterdam.
  • Gil, T., Calev, A., Greenberg, D., Kugelmass, S., & Lerer, B. (1990). Cognitive functioning in post-traumatic stress disorder. Journal of Traumatic Stress, 3(1), 29–45.
  • Gray, M. J., & Lombardo, T. W. (2001). Complexity of trauma narratives as an index of fragmented memory in PTSD: A critical analysis. Applied Cognitive Psychology, 15(7), 171–186.
  • Grey, N., & Holmes, E. A. (2008). “Hotspots” in trauma memories in the treatment of post-traumatic stress disorder: A replication. Memory, 16(7), 788–796.
  • Grey, N., Holmes, E. A., & Brewin, C. R. (2001). Peritraumatic emotional “hotspots” in memory. Behavioural and Cognitive Psychotherapy, 29(3), 367–372.
  • Grey, N., Young, K., & Holmes, E. (2002). Cognitive restructuring within reliving: A treatment for peritraumatic emotional “hotspots” in Posttraumatic Stress Disorder. Behavioural and Cognitive Psychotherapy, 30, 37–56.
  • Halligan, S. L., Michael, T., Clark, D. M., & Ehlers, A. (2003). Posttraumatic stress disorder following assault: The role of cognitive processing, trauma memory, and appraisals. Journal of Consulting and Clinical Psychology, 71(3), 419–431.
  • He, Q., Veldkamp, B. P., & de Vries, T. (2012). Screening for posttraumatic stress disorder using verbal features in self narratives: A text mining approach. Psychiatry Research, 198(3), 441–447.
  • Hellawell, S. J., & Brewin, C. R. (2004). A comparison of flashbacks and ordinary autobiographical memories of trauma: Content and language. Behaviour Research and Therapy, 42(1), 1–12.
  • Holmes, E. A., & Grey, N. (2002). Hotspots manual: Third revision. London, England: Camden and Islington Mental Health and Social Care Trust, The Traumatic Stress Clinic.
  • Holmes, E. A., Grey, N., & Young, K. A. D. (2005). Intrusive images and “hotspots” of trauma memories in posttraumatic stress disorder: An exploratory investigation of emotions and cognitive themes. Journal of Behavior Therapy and Experimental Psychiatry, 36(1), 3–17.
  • Howes, C., Purver, M., McCabe, R., Healey, P. G. T., & Lavelle, M. (2012). Predicting adherence to treatment for schizophrenia from dialogue transcripts. Proceedings of the 13th Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL) (pp. 79–83). Stroudsburg, PA: Association for Computational Linguistics.
  • Hsu, C. W., Chang, C. C., & Lin, C. J. (2003). A Practical Guide to Support Vector Classification. (Technical Report). Taipei, Taiwan: Department of Computer Science, National Taiwan University.
  • Iliou, T., & Anagnostopoulos, C. N. (2010). SVM - MLP - PNN classifiers on speech emotion recognition field - A comparative study. 5th International Conference on Digital Telecommunications, ICDT 2010 (pp. 1–6). doi:10.1109/ICDT.2010.8
  • Jacewicz, E., Fox, R. A., O’Neill, C., & Salmons, J. (2009). Articulation rate across dialect, age, and gender. Language Variation and Change, 21(2), 233–256.
  • Jarrold, W., Peintner, B., Wilkins, D., Vergryi, D., Richey, C., Gorno-Tempini, M. L., & Ogar, J. (2014). Aided diagnosis of dementia type through computer-based analysis of spontaneous speech. Proceedings of the Workshop on Computational Linguistics and Clinical Psychology: From Linguistic Signal to Clinical Reality (pp. 27–37). doi:10.3115/v1/W14-3204
  • Jelinek, L., Stockbauer, C., Randjbar, S., Kellner, M., Ehring, T., & Moritz, S. (2010). Characteristics and organization of the worst moment of trauma memories in posttraumatic stress disorder. Behaviour Research and Therapy, 48(7), 680–685.
  • Joachims, T. (1998). Text categorization with support vector machines: Learning with many relevant features. In Proceedings of the 10th European Conference on Machine Learning ECML ’98 (pp. 137–142). Chemnitz, Germany: Springer Berlin Heidelberg. doi:10.1007/BFb0026683
  • Jones, C., Harvey, A. G., & Brewin, C. R. (2007). The organisation and content of trauma memories in survivors of road traffic accidents. Behaviour Research and Therapy, 45(1), 151–162.
  • Jurafsky, D., & Martin, J. H. (2009). Speech and language processing: An introduction to natural language processing, computational linguistics, and speech recognition. Upper Saddle River, NJ: Pearson Prentice Hall.
  • Juslin, P. N., & Scherer, K. R. (2005). Vocal expression of affect. In J. A. Harrigan, R. Rosenthal, & K. R. Scherer (Eds.), The new handbook of methods in nonverbal behavior research (pp. 65–135). Oxford, UK: Oxford University Press.
  • Kessler, R. C., Aguilar-Gaxiola, S., Alonso, J., Benjet, C., Bromet, E. J., Cardoso, G., … Koenen, K. C. (2017). Trauma and PTSD in the WHO world mental health surveys. European Journal of Psychotraumatology, 8(sup5), 1353383.
  • Kincaid, J. P., Fishburne, R. P., Jr, Rogers, R. L., & Chissom, B. S. (1975). Derivation of new readability formulas (automated readability index, fog count and flesch reading ease formula) for navy enlisted personnel (p. 56). Orlando, FL: Institute for Simulation and Training, University of Central Florida.
  • King, G., & Zeng, L. (2001). Logistic regression in rare events data. Political Analysis, 9(2), 137–163.
  • Kisler, T., Reichel, U., & Schiel, F. (2017). Multilingual processing of speech via web services. Computer Speech and Language, 45, 326–347. Elsevier Ltd.
  • Kuny, S. T., & Stassen, H. H. (1993). Speaking behavior and voice sound characteristics in depressive patients during recovery. Journal of Psychiatric Research, 27(3), 289–307.
  • Ladd, D. R., & Cutler, A. (1983). Introduction. Models and measurements in the study of prosody. In A. Cutler & D. R. Ladd (Eds.), Prosody: Models and measurements. Springer Series in Language and Communication (pp. 1–10). Berlin, Heidelberg: Springer. doi:10.1007/978-3-642-69103-4_1
  • Lamers, S. M. A., De Jong, F., Truong, K., Steunenberg, B., & Westerhof, G. (2014). Applying prosodic speech features in mental health care: An exploratory study in a life-review intervention for depression. Workshop on Computational Linguistics and Clinical Psychology, 61–68. doi:10.3115/v1/W14-3208
  • Lapp, D. (2006). The physics of music and musical instruments. Medford, MA: Wright Center for Science Education, Tufts University.
  • Lee, C. M., & Narayanan, S. (2005). Toward detecting emotions in spoken dialogs. IEEE Transactions on Speech and Audio Processing, 13(2), 293–303.
  • Lefter, I., Rothkrantz, L. J. M., Van Leeuwen, D. A., & Wiggers, P. (2011). Automatic stress detection in emergency (telephone) calls. International Journal of Intelligent Defence Support Systems, 4(2), 148.
  • Logan, B. (2000). Mel frequency cepstral coefficients for music modeling. Proceedings of the International Symposium on Music Information Retrieval (ISMIR) 2000 (pp. 1–11), Cambridge, MA.
  • Manning, C. D., Raghavan, P., & Schütze, H. (2008). Introduction to Information Retrieval. Cambridge, MA: Cambridge University Press.
  • Manning, C. D., & Schütze, H. (1999). Foundations of statistical natural language processing. Cambridge, MA: MIT Press.
  • Mohammad, S. M., & Turney, P. D. (2010, June). Emotions evoked by common words and phrases: Using mechanical Turk to create an emotion lexicon. Proceedings of the NAACL HLT 2010 Workshop on Computational Approaches to Analysis and Generation of Emotion in Text (pp. 26–34), Los Angeles, CA.
  • Mohammad, S. M., & Turney, P. D. (2013). Crowdsourcing a word-emotion association lexicon. Computational Intelligence, 29(3), 436–465.
  • Nelson, K. L., & Horowitz, L. M. (2010). Narrative structure in recounted sad memories. Discourse Processes, 31(3), 307–324.
  • Nijdam, M. J., & Wittmann, L. (2015). Psychological and social theories of PTSD. In U. Schnyder & M. Cloitre (Eds.), Evidence based treatments for trauma-related psychological disorders: A practical guide for clinicians (pp. 41–61). Switzerland: Springer International Publishing. doi:10.1007/978-3-319-07109-1_3
  • Nijdam, M. J., Baas, M. A. M., Olff, M., & Gersons, B. P. R. (2013). Hotspots in trauma memories and their relationship to successful trauma-focused psychotherapy: A pilot study. Journal of Traumatic Stress, 26, 38–44.
  • NLTK Project. (2019). Copyright © 2019 NLTK Project. Retrieved from http://www.nltk.org/
  • O’Kearney, R., Hunt, A., & Wallace, N. (2011). Integration and organization of trauma memories and posttraumatic symptoms. Journal of Traumatic Stress, 24(6), 716–725.
  • O’Kearny, R., Speyer, J., & Kenardy, J. (2007). Children’s narrative memory for accidents and their post-traumatic distress. Applied Cognitive Psychology, 21(7), 821–838.
  • Oakes, M., Gaaizauskas, R., Fowkes, H., Jonsson, A., Wan, V., & Beaulieu, M. (2001). A method based on the chi-square test for document classification. Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval - SIGIR ’01 (pp. 440–441). doi:10.1145/383952.384080
  • Olff, M. (2015). Mobile mental health: A challenging research agenda. European Journal of Psychotraumatology, 6(1), 27882.
  • Olff, M., Monson, C., Riggs, D., Lee, C., Ehlers, A., & Forbes, D. (in press). Psychological treatments for adults with PTSD: Core and common elements of effective treatment. In D. Forbes, C. Monson, L. Berliner, & J. I. Bisson (Eds.), Effective treatments for PTSD (3rd ed.).
  • Orimaye, S. O., Wong, J. S. M., & Golden, K. J. (2014). Learning predictive linguistic features for Alzheimer’s disease and related dementias using verbal utterances. In Proceedings of the workshop on computational linguistics and clinical psychology: From linguistic signal to clinical reality (pp. 78–87). doi:10.3115/v1/W14-3210
  • Park, H., Rogalski, Y., Rodriguez, A. D., Zlatar, Z., Benjamin, M., Harnish, S., … Reilly, J. (2011). Perceptual cues used by listeners to discriminate fluent from nonfluent narrative discourse. Aphasiology, 25(9), 998–1015.
  • Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., … Duchesnay, E. (2011). Scikit-learn: Machine learning in python. Journal of Machine Learning Research, 12, 2825–2830.
  • Pennebaker, J. W. (1993). Putting stress into words: Health, linguistic, and therapeutic implications. Behaviour Research and Therapy, 31(6), 539–548.
  • Pennebaker, J. W., Francis, M. E., & Booth, R. J. (2001). Linguistic inquiry and word count: LIWC 2001 [Software]. Mahway: Lawrence Erlbaum Associates.
  • Pennebaker, J. W., Mehl, M. R., & Niederhoffer, K. G. (2003). Psychological aspects of natural language use: Our words, our selves. Annual Review of Psychology, 54, 547–577.
  • Perkins, J. (2014). Python 3 text processing with NLTK 3 cookbook. Birmingham, UK: Packt Publishing Ltd.
  • Pillemer, D. B., Desrochers, A. B., & Ebanks, C. M. (1998). Autobiographical memory: Theoretical and applied perspectives. In C. P. Thompson, D. B. Herrmann, J. D. Read, G. Payne, & M. P. Toglia (Eds.), Remembering the past in the present: Verb tense shifts in autobiographical memory narratives (pp. 145–162). New York: Lawrence Erlbaum Associates.
  • Poerner, N., & Schiel, F. (2016). An automatic chunk segmentation tool for long transcribed speech recordings. Proceedings of the Phonetics & Phonology Conference (pp. 6–8), Munich, Germany.
  • Porter, M. (2001). Snowball stemmer [stemming algorithm]. Retrieved from http://snowballstem.org/
  • Python Software Foundation. (2019). Copyright © 2001-2019 python software foundation; All rights reserved. Retrieved from https://www.python.org/
  • Richards, D., & Lovell, K. (1999). Post-traumatic stress disorders: Concepts and therapy. In W. Yule (Ed.), Behavioural and cognitive behavioural interventions in the treatment of PTSD (pp. 239–266). Chichester: Wiley.
  • Rizzo, A. S., & Shilling, R. (2017). Clinical virtual reality tools to advance the prevention, assessment, and treatment of PTSD. European Journal of Psychotraumatology, 8(sup5), 1414560.
  • Roach, P. (2000). Techniques for the phonetic description of emotional speech. Invited Review Paper, ITRW on Speech and Emotion, 53–59. doi:10.1051/0004-6361/201220577
  • Römisch, S., Leban, E., Habermas, T., & Döll-Hentschker, S. (2014). Evaluation, immersion, and fragmentation in emotion narratives from traumatized and nontraumatized women. Psychological Trauma: Theory, Research, Practice, and Policy, 6, 465–472.
  • Rude, S. S., Gortner, E. M., & Pennebaker, J. W. (2004). Language use of depressed and depression-vulnerable college students. Cognition and Emotion, 18(8), 1121–1133.
  • Scherer, K. R. (2003). Vocal communication of emotion: A review of research paradigms. Speech Communication, 40(1), 227–256.
  • Scherer, K. R., Johnstone, T., & Klasmeyer, G. (2003). Vocal expression of emotion. In R. J. Davidson, K. R. Scherer, & H. H. Goldsmith (Eds.), Series in affective science. Handbook of affective sciences (pp. 433–456). New York, NY: Oxford University Press.
  • Scherer, S., Stratou, G., Lucas, G., Mahmoud, M., Boberg, J., Gratch, J., … Morency, L. P. (2013). Automatic audiovisual behavior descriptors for psychological disorder analysis. Image and Vision Computing, 32(10), 648–658.
  • Schmid, H. (1999). Improvements in part-of-speech tagging with an application to German. In S. Armstrong, K. Church, P. Isabelle, S. Manzi, E. Tzoukermann, & D. Yarowsky (Eds.), Natural language processing using very large corpora (pp. 13–25). Dordrecht: Springer.
  • Schnyder, U., Ehlers, A., Elbert, T., Foa, E. B., Gersons, B. P. R., Resick, P. A., … Cloitre, M. (2015). Psychotherapies for PTSD: What do they have in common? European Journal of Psychotraumatology, 6(1), 28186.
  • Schoentgen, J. (2006). Vocal cues of disordered voices: An overview. Acta Acustica United with Acustica, 92(5), 667–680.
  • Schuller, B., Villar, R. J., Rigoll, G., & Lang, M. (2005). Meta-classifiers in acoustic and linguistic feature fusion-based affect recognition. Proceedings (ICASSP ’05). IEEE International Conference on Acoustics, Speech, and Signal Processing (pp. 325–328), Philadelphia, PA.
  • Shapiro, F. (2001). Eye movement desensitization and reprocessing: Basic principles, protocols and procedures. New York: Guilford Press.
  • Shaw, R. J., Harvey, J. E., Nelson, K. L., Gunary, R., Kruk, H., & Steiner, H. (2001). Linguistic analysis to assess medically related posttraumatic stress symptoms. Psychosomatics, 42(1), 35–40.
  • Shen, D., Sun, J. T., Yang, Q., & Chen, Z. (2006). Text classification improved through multigram models. Proceedings of the 15th ACM International Conference on Information and Knowledge Management (pp. 672–681). doi:10.1145/1183614.1183710
  • Shen, P., Changjun, Z., & Chen, X. (2011). Automatic speech emotion recognition using support vector machine. International Conference on Electronic & Mechanical Engineering and Information Technology, IEEE (pp. 621–625), Harbin, China.
  • Shriberg, E. (2001). To ‘errrr’is human: Ecology and acoustics of speech disfluencies. Journal of the International Phonetic Association, 31(1), 153–169.
  • Stone, L. D., & Pennebaker, J. W. (2002). Trauma in real time: Talking and avoiding online conversations about the death of princess diana. Basic and Applied Social Psychology, 78712, 1–36.
  • Tan, C. M., Wang, Y. F., & Lee, C. D. (2002). The use of bigrams to enhance text categorization. Information Processing and Management, 38(4), 529–546.
  • Tanaka, H., Sakti, S., Neubig, G., Toda, T., & Nakamura, S. (2014). Linguistic and acoustic features for automatic identification of autism spectrum disorders in children’s narrative. Proceedings of the workshop on computational linguistics and clinical psychology: From linguistic signal to clinical reality (pp. 88–96), Baltimore, MD.
  • Tausczik, Y. R., & Pennebaker, J. W. (2010). The psychological meaning of words: LIWC and computerized text analysis methods. Journal of Language and Social Psychology, 29(1), 24–54.
  • Truong, K. P., & Raaijmakers, S. (2008). Automatic recognition of spontaneous emotions in speech using acoustic and lexical features. In A. Popescu-Belis & R. Stiefelhagen (Eds.), Machine learning for multimodal interaction. MLMI 2008. Lecture notes in computer science (Vol. 5237, pp. 161–172). Berlin Heidelberg: Springer-Verlag.
  • Uddo, M., Vasterling, J. J., Brailey, K., & Sutker, P. B. (1993). Memory and attention in combat-related post-traumatic stress disorder (PTSD). Journal of Psychopathology and Behavioral Assessment, 15(1), 43–52.
  • Van Den Broek, E. L., Van Der Sluis, F., & Dijkstra, T. (2009). Therapy progress indicator (TPI): Combining speech parameters and the subjective unit of distress. Proceedings of the 3rd International Conference on Affective Computing and Intelligent Interaction and Workshops, ACII 2009 (pp. 2–7). doi:10.1109/ACII.2009.5349554
  • Van Miltenburg, E. (2015). Dutch tagger. Retrieved from https://github.com/evanmiltenburg/Dutch-tagger
  • van Wijk, C., & Kempen, G. (1980). Funktiewoorden - een Inventarisatie voor het Nederlands - An inventory of Dutch function words. ITL - International Journal of Applied Linguistics, 47(May), 53–68.
  • Vapnik, V. (1995). The nature of statistical learning theory. New York, NY: Springer.
  • Vergyri, D., Knoth, B., Shriberg, E., Mitra, V., Mclaren, M., Ferrer, L., … Marmar, C. (2015). Speech-based assessment of PTSD in a military population using diverse feature classes. Proceedings of Interspeech 2015 (pp. 3729–3733), Dresden, Germany.
  • Ververidis, D., & Kotropoulos, C. (2006). Emotional speech recognition: Resources, features, and methods. Speech Communication, 48(9), 1162–1181.
  • Wiegersma, S., Van Noije, A. J., Sools, A. M., & Veldkamp, B. P. (n.d.). Supervised text classification: A tool and tutorial for model selection and evaluation.
  • Wild, J., Warnock-Parkes, E., Grey, N., Stott, R., Wiedemann, M., Canvin, L., … Ehlers, A. (2016). Internet-delivered cognitive therapy for PTSD: A development pilot series. European Journal of Psychotraumatology, 7(1), 31019.
  • Wilson, D., & Wharton, T. (2006). Relevance and prosody. Journal of Pragmatics, 38(10), 1559–1579.
  • Wolf, M., Sedway, J., Bulik, C. M., & Kordy, H. (2007). Linguistic analyses of natural written language: Unobtrusive assessment of cognitive style in eating disorders. International Journal of Eating Disorders, 40(8), 711–717.
  • Yang, Y. (1997). An evaluation of statistical approaches to text categorization. Information Retrieval, 1(1), 69–90.
  • Yang, Y., & Pedersen, J. O. (1997). A comparative study on feature selection in text categorization. Fourteenth International Conference on Machine Learning (ICML) (pp. 412–420), Nashville, TN.
  • Zijlstra, H., Van Meerveld, T., Van Middendorp, H., Pennebaker, J. W., & Geenen, R. (2004). De Nederlandse versie van de LIWC. Gedrag & Gezondheid, 32(4), 271–281.
  • Zoellner, L. A., Alvarez-Conrad, J., & Foa, E. B. (2002). Peritraumatic dissociative experiences, trauma narratives, and trauma pathology. Journal of Traumatic Stress, 15(1), 49–57.

Appendix A.

Extracted text features

A.1 N-grams

N-grams were extracted to analyse differences in the content of hotspots versus non-hotspots. N-grams are, for example, bigrams (sequences of two words) or trigrams (sequences of three words), whereas N-multigrams consist of variable-length sequences of maximum N words (Shen, Sun, Yang, & Chen, Citation2006). Contrary to unigrams, N-grams and N-multigrams can take into account the relationship between consecutive words and word context, which can be valuable when analysing words with multiple meanings or when the relationship between consecutive words changes the meaning of a phrase, for example, in case of negation (Bekkerman & Allan, Citation2003; Shen et al., Citation2006; Tan, Wang, & Lee, Citation2002). Unigrams, bigrams, trigrams, and N-multigrams of maximum three words were extracted and weighted using the CountVectorizer implemented in Scikit-learn.
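As a minimal sketch of this extraction step, the snippet below builds unigram-to-trigram counts with Scikit-learn's CountVectorizer; the two Dutch example segments are invented placeholders.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Placeholder transcript segments (one string per hotspot or non-hotspot phase)
segments = [
    "ik weet het niet meer",
    "ik weet niet wat er toen gebeurde",
]

# All word sequences of length 1 up to 3 (unigrams, bigrams, trigrams)
vectorizer = CountVectorizer(ngram_range=(1, 3))
counts = vectorizer.fit_transform(segments)

print(vectorizer.get_feature_names_out()[:10])   # extracted N-grams
print(counts.toarray())                          # occurrence counts per segment
```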

Table A1. N-grams

A.2 Parts-of-speech (POS) tags

Many different POS tagsets exist, but almost every tagset contains the 12 universal grammatical tags, which are verbs, common and proper nouns, pronouns, adjectives, adverbs, pre- and postpositions, conjunctions, determiners, cardinal numbers, participles, ‘other’, and punctuation (Perkins, Citation2014). A POS tagger is generally trained on a corpus of POS-tagged words, that is, tokens of the format (word, tag). We used a pre-trained Perceptron tagger for Dutch by Van Miltenburg (Citation2015), available for Python 3, which is based on the NLCOW14 corpus tagged with the stochastic TreeTagger by Schmid (Citation1999). We extracted 24 POS tags that were expected to relate to the prespecified operational constructs.
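The sketch below illustrates the (word, tag) output format and how POS tag frequencies can be counted per segment, using NLTK's default English perceptron tagger for illustration; the Dutch model referred to above would be loaded instead following the instructions in the Dutch-tagger repository, and the exact NLTK resource names depend on the installed version.

```python
from collections import Counter
import nltk

# Download the default tagger and tokenizer models (resource names may differ
# per NLTK version); for Dutch, the pre-trained model from
# https://github.com/evanmiltenburg/Dutch-tagger would be loaded instead.
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

tokens = nltk.word_tokenize("I do not remember what happened then")
tagged = nltk.pos_tag(tokens)            # list of (word, tag) tokens
print(tagged)

# Per segment, the feature value is the occurrence frequency of each POS tag
print(Counter(tag for _, tag in tagged))
```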

Table A2. Parts-of-speech (POS) tags

A.3 LIWC categories

Of the 66 categories included in LIWC, 23 were expected to relate to the prespecified operational constructs. Another 11 were included as features because of their expected relation to hotspot content (e.g. categories related to perceptual processes, assent, and negation). The occurrence frequencies for these categories were determined using the validated Dutch dictionary developed by Zijlstra et al. (Citation2004).
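The LIWC tool itself is proprietary; the sketch below only illustrates the general dictionary-lookup principle with invented word lists (the real Dutch LIWC dictionary also matches word stems and is far more extensive).

```python
from collections import Counter

# Invented mini-dictionaries; NOT the actual LIWC categories or word lists
categories = {
    "sadness": {"verdrietig", "huilen", "verloren"},
    "anxiety": {"bang", "angstig", "paniek"},
}

def category_frequencies(tokens, categories):
    """Relative occurrence frequency of each dictionary category in a segment."""
    counts = Counter()
    for token in tokens:
        for name, words in categories.items():
            if token.lower() in words:
                counts[name] += 1
    total = len(tokens) or 1   # normalize by segment length
    return {name: counts[name] / total for name in categories}

print(category_frequencies("ik was zo bang en verdrietig".split(), categories))
```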

Table A3. LIWC categories

A.4 NRC emotion lexicon

To capture emotions we used the open-source NRC word-emotion association lexicon (also known as EmoLex) developed by Mohammad and Turney (Citation2010, Citation2013). This is a hand-coded lexicon originally annotated for English and translated into over 20 languages using Google Translate (July 2015), based on the assumption that affective norms are stable across languages despite possible cultural differences. The Dutch NRC emotion lexicon contains associations for 7,850 words. Despite the errors the lexicon may contain due to incorrect or overly literal translations, we expected features extracted with the NRC emotion lexicon to complement the LIWC emotion features, because it covers emotion categories not included in LIWC (e.g. disgust, trust, anticipation, and surprise).

Table A4. NRC emotion lexicon

A.5 Custom tags

We specified several (parts of) words and word patterns that we expected to relate to the prespecified constructs. Counts for all words or phrases that matched these specific patterns were returned by the custom tagger.
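As a sketch of how such a custom tagger can be implemented, the snippet below counts matches for two invented patterns (a speech-filler tag and a word-fragment tag); the actual pattern set is the one listed in Table A5.

```python
import re

# Illustrative custom patterns only; the study's own tags are listed in Table A5
custom_patterns = {
    "speech_filler": re.compile(r"\b(uh|uhm|hmm)\b", re.IGNORECASE),
    "word_fragment": re.compile(r"\b\w+-(?!\w)|(?<!\w)-\w+\b"),
}

def custom_tag_counts(text, patterns):
    """Return the number of matches per custom tag for one transcript segment."""
    return {name: len(pattern.findall(text)) for name, pattern in patterns.items()}

print(custom_tag_counts("uh ik wil- ik wilde weg", custom_patterns))
```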

Table A5. Custom tags

A.6 Text characteristics and statistics

General text characteristics are, for example, the total number of words used (text length), the number of unique unigrams or N-grams (lexical diversity), number of complex words (words of six or more characters, Tausczik & Pennebaker, Citation2010), number of repeated words and bigrams, revisions, speaker turns and utterances, and statistical measures such as reading ease and grade level indices to examine language strength.

Except for the number of complex words, which was extracted using the LIWC tool, and Honoré's R and Brunét's index, which were calculated separately as in Fraser et al. (Citation2014), all general and statistical text features were extracted using Python's Textstat package. As a readability index, we only used the Dutch Flesch–Douma measure G (Douma, Citation1960), an adaptation of the English Flesch reading ease index (FRE; Flesch, Citation1948), because the Bormuth Grade Level (Zoellner et al., Citation2002) uses a standard list of familiar English words for which no Dutch translation is available.
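A minimal sketch of several of these measures is given below; the Honoré and Brunét formulas follow their common definitions (cf. Fraser et al., Citation2014), the example tokens are placeholders, and the readability indices themselves are left to the Textstat package as described above.

```python
import math
from collections import Counter

def text_statistics(tokens):
    """General text statistics for one segment; tokens is a list of words."""
    n = len(tokens)                                   # text length (word count)
    freqs = Counter(t.lower() for t in tokens)
    v = len(freqs)                                    # vocabulary size (types)
    v1 = sum(1 for c in freqs.values() if c == 1)     # words occurring once
    return {
        "word_count": n,
        "type_token_ratio": v / n,
        "complex_word_ratio": sum(len(t) >= 6 for t in tokens) / n,
        # Honore's R and Brunet's index as commonly defined (cf. Fraser et al., 2014)
        "honore_r": 100 * math.log(n) / (1 - v1 / v) if v1 < v else float("inf"),
        "brunet_w": n ** (v ** -0.165),
    }

print(text_statistics("ik weet het niet ik weet het echt niet meer".split()))
```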

Table A6. Text characteristics and statistics

A.7 Stop words

Table A7. Stop word list

Appendix B.

Extracted speech features

B.1 Pitch

The pitch is measured by the fundamental frequency (F0), which is the lowest frequency of the waveform. Sounds with a higher fundamental frequency are generally perceived as having a higher pitch (Jurafsky & Martin, Citation2009). For each patient utterance, we extracted the mean pitch (m_pitch) and the standard deviation of pitch (s_pitch), measured while the patient is speaking. For both m_pitch and s_pitch, the mean, variance, min, max, and range were then calculated over all patient utterances in a segment (hotspot or non-hotspot phase in the session), yielding one value per statistic per segment.

B.2 Loudness

The intensity correlates with a sound wave's amplitude: the maximum vertical displacement from rest (silence) to the top (crest) or bottom (trough) of the wave, which is expressed in decibels (dB) (Lapp, Citation2006). In general, sounds with higher amplitudes are perceived as louder (Jurafsky & Martin, Citation2009). Analogous to the extraction of the pitch features, we extracted m_intensity and s_intensity and calculated their mean, variance, min, max, and range over all patient utterances in the hotspot and non-hotspot segments, resulting in one value per statistic per segment.
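The study extracted these features with Praat; purely as a sketch of the same measurements from Python, the snippet below uses the Parselmouth interface to Praat. This tooling choice and the file name are assumptions, not the authors' original pipeline.

```python
import parselmouth
from parselmouth.praat import call

snd = parselmouth.Sound("patient_utterance.wav")   # placeholder file name

# Pitch (fundamental frequency) statistics for one patient utterance
pitch = snd.to_pitch()
m_pitch = call(pitch, "Get mean", 0, 0, "Hertz")
s_pitch = call(pitch, "Get standard deviation", 0, 0, "Hertz")

# Intensity (loudness) statistics for the same utterance
intensity = snd.to_intensity()
m_intensity = call(intensity, "Get mean", 0, 0, "energy")
s_intensity = call(intensity, "Get standard deviation", 0, 0)

# Per hotspot or non-hotspot segment, these utterance-level values would then be
# aggregated (mean, variance, min, max, range) over all patient utterances.
print(m_pitch, s_pitch, m_intensity, s_intensity)
```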

Table B2. Loudness

B.3 Duration

Duration covers tempo and pause. Tempo refers to the speaking rate, which is measured either as overall duration (e.g. sound length in (milli)seconds or total duration of speaking time, as in Fraser et al., Citation2014; Lamers et al., Citation2014) or as units per duration (e.g. words or syllables per second or minute; Juslin & Scherer, Citation2005).

Table B3. Duration

We measured speech tempo for the entire audio fragment including pauses (speech rate) and for the spoken parts only, excluding pauses and hesitations (articulation rate) (Jacewicz, Fox, O’Neill, & Salmons, Citation2009). Similarly, pauses can be silent or voiced (Roach, Citation2000). Voiced pauses were covered by the lexical feature ‘speech-fillers’, silent pauses were captured automatically using Praat’s silence detection function, with the minimal silence duration set at 500 ms, as in Lamers et al. (Citation2014). We extracted mean, SD, min, max, and rate for the duration of silences (pauses) and speaking time in Praat. Based on these values, we calculated phonation rate, speech productivity, and variables related to speech tempo.
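Again as a Parselmouth-based sketch (an assumption about tooling), the snippet below applies Praat's silence detection and derives speech and articulation rates. The 0.5 s minimum silent interval follows the text above; the remaining thresholds and the word count are placeholders.

```python
import parselmouth
from parselmouth.praat import call

snd = parselmouth.Sound("segment.wav")   # placeholder file name

# Praat's "To TextGrid (silences)": 100 Hz pitch floor, automatic time step,
# -25 dB silence threshold, 0.5 s minimum silent interval (as described above),
# 0.1 s minimum sounding interval
textgrid = call(snd, "To TextGrid (silences)",
                100, 0.0, -25.0, 0.5, 0.1, "silent", "sounding")

# Sum the durations of all sounding intervals on tier 1
sounding_time = 0.0
for i in range(1, int(call(textgrid, "Get number of intervals", 1)) + 1):
    if call(textgrid, "Get label of interval", 1, i) == "sounding":
        start = call(textgrid, "Get starting point", 1, i)
        end = call(textgrid, "Get end point", 1, i)
        sounding_time += end - start

total_time = call(snd, "Get total duration")
n_words = 120   # placeholder: word count from the aligned transcript

print(n_words / total_time)      # speech rate: words per second, pauses included
print(n_words / sounding_time)   # articulation rate: spoken parts only
```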

B.4 Spectral features

Spectral features are frequency-based features that represent the different frequencies (called ‘spectrum’) that together make up the acoustic waveform (Jurafsky & Martin, Citation2009). These features were extracted at the frame level, over frames with a window length of 0.015 s and time steps of 0.005 s. We extracted 12 Mel-frequency cepstral coefficients (MFCCs) and calculated mean and variance over all frames. The MFCCs jointly form a mel-frequency cepstrum, which represents a sound’s short-term power spectrum (Iliou & Anagnostopoulos, Citation2010), see Logan (Citation2000) for more on MFCC features.
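Below is a sketch with one possible toolchain (librosa is an assumption; the original extraction may have used Praat), using the 0.015 s window and 0.005 s step described above.

```python
import librosa

# Load audio (placeholder file); sr is the sampling rate in Hz
y, sr = librosa.load("segment.wav", sr=None)

# 12 MFCCs over frames of 0.015 s with 0.005 s time steps, as described above
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=12,
                            n_fft=int(0.015 * sr),
                            hop_length=int(0.005 * sr))

# One mean and one variance per coefficient, aggregated over all frames
mfcc_mean = mfcc.mean(axis=1)
mfcc_var = mfcc.var(axis=1)
print(mfcc_mean.shape, mfcc_var.shape)   # (12,), (12,)
```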

Table B4. Spectral features

B.5 Voice quality features

Perceived voice quality is measured by high-frequency energy (HF), which is the relative proportion of energy in an acoustic signal above versus below a specific frequency, and formant frequencies (Juslin & Scherer, Citation2005). We used a common cut-off frequency of 500 Hz for the high-frequency energy, extracting mean and variance for HF 500. For the formant frequencies, we extracted the mean and precision of the first formant (F1), as commonly used.

Table B5. Voice quality features

B.6 Turn statistics

Turn statistics are general, overall speech features for each hotspot and non-hotspot phase, such as the number of speaker turns, turn length, and the number of utterances.

Table B6. Turn statistics

Appendix C.

Operationalization of hotspot constructs

C.1 Affect

To capture the construct affect, we adopted voice cues commonly used in studies of vocal affect, which are pitch, loudness, voice quality, and duration (Juslin & Scherer, Citation2005). In addition, we used lexicon-based features (LIWC) to assess the occurrence of affect words and custom tags for the occurrence of audible emotional expressions (e.g. sniffing, sighing).

C.2 Emotions

We assessed emotions through the use of emotion words, captured through lexicon-based features related to emotion (LIWC and NRC emotion lexicon), and audible emotional expressions. Although the LIWC and NRC categories do not completely cover the emotions found to relate most to hotspots (e.g. guilt and horror are not included in either of the lexicons; see Appendix A), we still expect the available emotion categories to provide additional information on the emotions present in hotspot moments. Emotions can additionally be represented by other textual features, such as an increased use of the present tense (Hellawell & Brewin, Citation2004; Pillemer, Desrochers, & Ebanks, Citation1998) and particles (Pennebaker et al., Citation2003), which we measured through lexicon-based features and POS tags related to verb tense and particles (e.g. pronouns, articles, prepositions, conjunctions), respectively.

Apart from text features, speech features can also be expected to differ among emotions. For example, fundamental frequency and voice intensity (related to pitch and loudness, respectively) are found to be higher for the emotions anger, fear, and stress, and lower for sadness (Juslin & Scherer, Citation2005). We adopted prosodic features related to pitch, loudness, and duration, and spectral and voice quality features, as these are used in several studies related to emotion, such as the phonetic description of emotional speech (Roach, Citation2000), emotion detection (Cowie et al., Citation2001; Shen, Changjun, & Chen, Citation2011; Ververidis & Kotropoulos, Citation2006), and the measurement of emotional distress (Van Den Broek et al., Citation2009).

C.3 Cognitions

We operationalized cognitive themes through lexicon-based features (LIWC) related to cognitive processes. As for emotions, not all cognitive themes as defined by Holmes et al. (Citation2005) are covered by the cognitive categories included in LIWC. Still we expect to gain extra information from the lexicon-based features that are available. For example, the categories ‘causation’ and ‘insight’ might relate to the cognitive theme consequences, and the categories ‘tentative’ and ‘inhibition’ to the theme uncertain threat (see overview of extracted cognitions in Appendix A). Moreover, as for emotional state, the POS tag ‘particle’ can be indicative of one’s cognitive style (Pennebaker et al., Citation2003).

C.4 Dissociation

We adopted lexicon-based features (LIWC) and POS tags to capture the change in personal pronouns and verb tense associated with dissociation. Following Zoellner et al. (Citation2002), who studied indications for peritraumatic dissociation in trauma narratives, general text characteristics related to narrative structure (characters per word, words per sentence, the total number of sentences, and several reading indices) were also used.

C.5 Avoidance

We operationalized avoidance through audio statistics related to duration (tempo and pauses), text statistics related to the extensity of descriptions (verbosity) and lexical diversity (also termed vocabulary richness) such as Type:Token Ratio (TTR), Honoré’s R, Brunét’s index, as in Fraser et al. (Citation2014), and custom tags that indicate confusion.

C.6 Cohesion

We operationalized cohesion through custom tags concerning the use of connectives and conjunctions (as in O’Kearney et al., Citation2011; O’Kearny et al., Citation2007), and the temporal juncture ‘then’ to measure the temporal sequence of spoken clauses (based on Shaw et al., Citation2001). According to Shaw and colleagues, use of this temporal juncture by PTSD patients indicates that the patient is closer to re-experiencing a narrated memory with high emotional involvement. Following Crossley et al. (Citation2016), we also used the pronoun:noun ratio (PNR, calculated based on POS tags for nouns and pronouns), the occurrence of demonstratives (captured using POS tags), and definite articles (captured using a customized tag set) to gauge the amount of information given in the text (referred to as ‘givenness’). Finally, the general text statistic Type:Token Ratio (TTR), an indicator of word repetition across a text, was adopted to assess overall text cohesion.

C.7 Organization

Jelinek et al. (Citation2010) studied (dis)organization by counting the number of words indicative of cognitive processes, words related to affection and anxiety, and words in the present tense (captured through lexicon-based text features and POS tags). They also used unfinished thoughts (based on Foa et al., Citation1995) and the ‘total disorganization score’ introduced by Halligan et al. (Citation2003), which is calculated based on the occurrence of repetitions, disorganized thoughts and organized thoughts.

Repetitions are captured by counting the number of direct word repetitions (Croisile et al., Citation1996; De Lira, Ortiz, Campanha, Bertolucci, & Minett, Citation2011) and the number of unique bigrams, which is indicative of repeated bigram patterns (Orimaye et al., Citation2014). Disorganized thoughts, which consist of utterances implying confusion such as ‘I don't remember’ or ‘I don't know’ (Foa et al., Citation1995), are captured through custom tags, and the structural organization of sentences is measured by the Mean Length of Utterance (MLU, as in Orimaye et al., Citation2014).
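A minimal sketch of how these three indicators (direct word repetitions, unique bigram proportion, and MLU in words) can be computed from tokenized utterances is given below; the example utterances are placeholders, and the study's exact operationalization may differ in detail.

```python
def organization_features(utterances):
    """Simple organization indicators for a list of tokenized utterances."""
    tokens = [t.lower() for utt in utterances for t in utt]

    # Direct word repetitions: identical consecutive words ("ik ik wilde ...")
    repetitions = sum(1 for a, b in zip(tokens, tokens[1:]) if a == b)

    # Proportion of unique bigrams (a low value indicates repeated bigram patterns)
    bigrams = list(zip(tokens, tokens[1:]))
    unique_bigram_ratio = len(set(bigrams)) / len(bigrams) if bigrams else 0.0

    # Mean length of utterance (MLU) in words
    mlu = sum(len(utt) for utt in utterances) / len(utterances)

    return {"repetitions": repetitions,
            "unique_bigram_ratio": unique_bigram_ratio,
            "mlu": mlu}

example = [["ik", "ik", "weet", "het", "niet"], ["ik", "weet", "het", "echt", "niet"]]
print(organization_features(example))
```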

C.8 Fragmentation

Previous studies assessed fragmentation by coding repetitions, unfinished thoughts, and speech fillers (Foa et al., Citation1995; Römisch, Leban, Habermas, & Döll-Hentschker, Citation2014). Of these, we included repetitions (captured as for the construct Organization) and speech fillers (or filled pauses, Fraser et al., Citation2014, captured using custom tags for e.g. ‘uh’ or ‘hmm’), since these could be automatically extracted from the data.

Another commonly used indicator of fragmentation is (dis)fluency, because it is a direct and homogeneous measure (Römisch et al., Citation2014). Speech fluency has been found to be inversely related to PTSD symptoms (e.g. Gil, Calev, Greenberg, Kugelmass, & Lerer, Citation1990; Uddo, Vasterling, Brailey, & Sutker, Citation1993). Examples of speech disfluencies are repetitions, repairs, filled pauses, and false starts (Shriberg, Citation2001). To measure speech fluency we used the speech features speech rate and speech productivity and the text feature audible struggle, which were found by Park et al. (Citation2011) to be the most discriminative features for fluency. To capture audible struggle we used custom tags for revisions (based on Croisile et al., Citation1996; De Lira et al., Citation2011; Orimaye et al., Citation2014). Revisions are moments in which the patient retraces and corrects a preceding error, which we extracted from the speech transcripts by counting transcribed fragments. Fragments in this context are words that are broken off in the middle; in speech transcripts, they are generally represented using ‘-’, e.g. word- or -word (Jurafsky & Martin, Citation2009).

Finally, we used the total number of words produced (as in Fraser et al., Citation2014), because fragmented speech may be characterized by the use of short, less meaningful, or fragmented phrases and single words, and the total number of function words. Function words are the words that give meaning to a text (Orimaye et al., Citation2014). Their occurrence was counted using a standard Dutch function words list (first published by van Wijk & Kempen, Citation1980).

C.9 Complexity

We operationalized complexity through text characteristics related to reading indices, narrative structure, and syntactic processing complexity. Although the use of readability indices to capture text comprehensibility is not undisputed, many different reading indices exist and are used in scientific studies. Amir et al. (Citation1998) for example, used the Flesch Reading Ease Index (FRE; Flesch, Citation1948) and the Flesch–Kincaid Grade Level (FKGL; Kincaid, Fishburne, Rogers, & Chissom, Citation1975) to capture narrative articulation (i.e., comprehensibility, complexity), whereas Zoellner et al. (Citation2002) used the Bormuth Readability Index (Bormuth, Citation1969). To gauge narrative structure and syntactic processing complexity, we used general text characteristics such as mean word and sentence length, number of syllables and complex words, and the number and ratio of coordinated and subordinated conjunctions (captured through POS tags, see De Lira et al., Citation2011; Fraser et al., Citation2014). Finally, the total number of utterances and the mean number of words per utterance were also adopted as measures for language strength and verbosity (as in Orimaye et al., Citation2014).