Original Articles

The latest development of the DELAD project for sharing corpora of speech disorders

Pages 102-110 | Received 22 Oct 2020, Accepted 31 Mar 2021, Published online: 23 Apr 2021

ABSTRACT

Corpora of speech of individuals with communication disorders (CSD) are invaluable resources for education and research, but they are costly and hard to build and difficult to share for various reasons. DELAD, which means ‘shared’ in Swedish, is a project initiated by Professors Nicole Müller and Martin Ball in 2015 that aims to address this issue by establishing a platform for researchers to share datasets of speech disorders with interested audiences. To date, five workshops have been held, where selected participants, including researchers in clinical phonetics and linguistics and in speech and language therapy, infrastructure specialists, and ethics and legal specialists, discussed relevant issues in setting up such an archive. Positive and steady progress has been made since 2015, including refurbishing the DELAD website (http://delad.net/) with information and application forms for researchers to join and share their datasets, and linking with the CLARIN K-Centre for Atypical Communication Expertise (https://ace.ruhosting.nl/), where CSD can be hosted and accessed through the CLARIN B-Centres, The Language Archive (https://tla.mpi.nl/tools/tla-tools/) and TalkBank (https://talkbank.org/). The latest workshop, which was funded by CLARIN (Common Language Resources and Technology Infrastructure), was held as an online event in January 2021 on topics including Data Protection Impact Assessments, reviewing changes in ethics perspectives in academia on sharing CSD, and voice conversion as a means to pseudonymise speech. This paper reports the latest progress of DELAD and discusses the directions for further advance of the initiative, with information on how researchers can contribute to the repository.

Background

Corpora of speech of individuals with communication disorders (CSD) are invaluable resources for education and research. However, they are costly and hard to build and can be difficult to share given various issues, such as the preservation of privacy and confidentiality of the participants, and the possible extra work and cost required for formatting the datasets for comparable sharing and hosting in a repository (see e.g., Barbui et al., Citation2016; Fraser et al., Citation2019). Overcoming these challenges is important, as sharing data enables better science in the future. Re-analysis of raw data fosters improvement in the reproducibility and robustness of research (Barbui et al., Citation2016; “Data sharing and the future of science [Editorial]”, Citation2018; Fraser et al., Citation2019). The availability of datasets allows other research teams to answer a different research question which maximizes the value of the data collected (Barbui et al., Citation2016; Fraser et al., Citation2019), and in turn increases the impact of research of the original investigators (Byrd et al., Citation2020). The availability of data also facilitates systematic review and meta-analyses (Barbui et al., Citation2016). Datasets that are comparable can be pooled together to form a bigger set of data permitting more sophisticated analyses (“Data sharing”, Citation2018). The pooling of similar datasets also allows cross-linguistic research between countries, or investigation of rare conditions as it is often difficult to collect data from a sufficient number of participants by a single research center. These benefits have been demonstrated in disciplines such as genomics, neuroscience, astronomy and astrophysics where the culture of data sharing is more established (“Data sharing”, Citation2018). 
Hence, it would benefit the discipline if more researchers in the area of clinical linguistics and phonetics, and speech and language therapy (or speech-language pathology) considered sharing speech data.

So far, the largest repository of CSD available is probably the TalkBank (website: https://talkbank.org/) which was established and is being organized by Professor Brian MacWhinney at Carnegie Mellon University in the USA. The TalkBank encompasses corpora in the following areas: Conversation Banks that include datasets of conversations or interactions between adults (e.g., CABank) and those between children and adults (e.g., CHILDES); Multilingual Banks that include datasets for studying second language acquisition (e.g., Second Language Tutors); and Clinical Banks that include datasets of individuals with different types of communication disorders (e.g., AphasiaBank, ASDBank). However, there is no speech corpus for speech disorders (defined below) which motivated the initiation of the project reported in this paper.

DisorderedSPeechBank is a venture commenced by Professors Nicole Müller and Martin Ball in 2015 (Ball et al., Citation2016; van den Heuvel et al., Citation2018). The project was later renamed DELAD, which stands for Database Enterprise for Language And speech Disorders, and the acronym is also a word in Swedish which means ‘shared’. The project aims to build a digital archive of sound files and video files representing samples of pathological speech in a variety of languages, where researchers can share their datasets of speech disorders through this platform with interested audiences (Ball et al., Citation2016). Here, the notion of speech disorders is interpreted widely, including, but not limited to, protracted phonological development, articulatory and phonological disorders in children, childhood apraxia of speech (or developmental verbal dyspraxia), motor speech disorders and speech disorders from other neurogenic conditions, and speech disorders resultant from physiological impairments, or various genetic conditions. The sound and video files may be accompanied by high-fidelity transcripts, acoustic analysis files, or imaging files (e.g., ultrasound imaging), as appropriate. As wide a variety as possible of languages will be included, and currently the DELAD project involves researchers working with Catalan, Croatian, Dutch, English, Finnish, French, German, Irish, Norwegian, Polish, Spanish, Swedish, and Welsh (see also Ball et al., Citation2016). The work of this project has been carried out mainly through workshops and regular meetings of the steering group. The next two sections report the progress of the project since the start.

Project history

To date, five workshops have been held for this project. The first workshop was designed to set up the project and was held in mid-October 2015, followed by a second in early June 2016, both at Linköping University, Sweden. These workshops were supported by the Riksbankens Jubileumsfond (Ball et al., Citation2016; van den Heuvel et al., Citation2018). The third workshop was held at University College Cork, Ireland, in mid-November 2017 and the fourth in Utrecht, The Netherlands, in late January 2019. The fifth workshop was planned for June 2020 at the University of Helsinki, Finland, but it was postponed to late January 2021 and held as an online event due to the Covid-19 pandemic. The three most recent workshops were funded by CLARIN (Common Language Resources and Technology Infrastructure).

For each of the workshops, specific themes relevant to the latest progress of the project were set, and researchers or specialists with pertinent expertise, such as speech disorder researchers, infrastructure specialists, and experts in the area of intellectual property rights (IPR), ethics and the General Data Protection Regulation (GDPR), were invited to present and discuss the relevant topics. The participants of the workshops were identified by the workshop organizers, as well as based on suggestions made by the participants. The outcomes of the first three workshops, which were presented at several conferences including the biennial conference of the International Clinical Phonetics and Linguistics Association (ICPLA) held in 2016 and 2018 (Ball et al., Citation2016; van den Heuvel et al., Citation2018), are detailed below.

Workshop 1 – Linköping, October 2015

A total of 20 participants affiliated with universities or institutes in 10 countries (Canada, Croatia, Finland, Germany, Ireland, Norway, Spain, Sweden, The Netherlands, and the UK) attended the first workshop held on 14–16 October 2015 in Linköping University, Sweden. The research expertise and background of the participants include clinical phonetics and linguistics, speech and language therapy, curation of speech and language databases, and automatic speech recognition and language learning.

All participants at the workshop agreed that establishing an archive of pathological speech is worthwhile and that the final product should be made available to researchers, as well as speech and language therapy educators and students (Ball et al., Citation2016). It was felt that, for researchers, one attraction of such an archive is the possibility of refining analysis methods, or formulating and testing hypotheses about speech disorders, using existing, suitable datasets. This saves the time and resources needed to collect new data. As for educators and students, an archive of high-quality speech data allows them to learn and practice analysis of speech disorders and the application of diagnostic tools on a variety of cases (Ball et al., Citation2016). Such resources in the form of online tools are available for specific topics. For example, PUMA (Practical education Using Multimedia Application; website: https://pumalogopedi.se/; reported in Lohmander et al., Citation2020), which was originally developed for learning to perform auditory-perceptual and instrumental analysis of speech problems associated with cleft palate in Swedish, now includes auditory-perceptual evaluation of other areas such as motor speech disorders and voice disorders. CLISPI (CLeft palate International SPeech Issues; website: https://clispi.com/) is a website for auditory-perceptual judgements of cleft palate speech in a number of European languages, and WebFon (website: http://elearning.marjon.ac.uk/ptsp/) and the Ulster set (both reported in the paper by Titterington & Bates, Citation2018) are resources for learning phonetic transcription of speech disorders in English speakers. It is anticipated that the present project will ultimately contribute to the improvement of evidence-based therapy for developmental and acquired speech disorders, as well as to research opportunities.

At the workshop, the participants presented and discussed their own research projects with regard to speech data collection and analysis of speech disorders. They were then split into two groups to carry out smaller group discussion on three issues: ethics guidelines, collection of speech data, and a pilot archive (see subsections below). The participants reconvened afterward to share the results of the small group discussions and brainstormed funding opportunities for further development of this initiative.

Ethics guidelines

Various aspects related to ethics were discussed. It was agreed that the consent form should include statements that confirm that the participants are informed of specific types of information, including the purpose of the research project; data anonymization and handling; voluntary participation in the research project; the timing of withdrawal of consent, or excluding their data from analysis, if that is possible. Many research ethics boards currently require this kind of information in consent forms. In addition, it was suggested that the consent form could include different levels of consent; for example, consent to the use of one’s data for education purposes only, presentation at conferences or meetings, dissemination of findings via publications, and archiving and sharing with other researchers through a database. Furthermore, it was agreed that if researchers planned to archive the data of their projects with the DisorderedSPeechBank, their research participants should be informed of this at the stage of obtaining initial consent. Hence, a suggestion was made to create a standard document that explains the aim of DisorderedSPeechBank, and how and where the data collected will be saved. This document could be appended to an ethics approval application (and a research grant application if needed), and then handed out with the information sheet and consent form at the stage of participant recruitment. As the requirements and policies regarding ethics applications differ between countries, it was recommended that researchers should follow their national regulations regarding research ethics, particularly for issues, such as whether data of deceased research participants could be shared; how long the data could be kept after the lifetime of the research project; and the procedures for data anonymization.

Collection of speech data

A wide variety of possible forms of speech data was listed. They include digital audio and video recordings; phonetic transcriptions or orthographic transcriptions of speech samples but without accompanying audio or video recordings, as well as data collected using specific instrumental techniques, such as electromagnetic articulography, electropalatography, nasometry, ultrasound, and functional magnetic resonance imaging. In addition, the data may include case data, results of standardized speech and language assessments, and results of data analyses.

Pilot archive

For data archiving, issues discussed included data contribution and data access. It was suggested that researchers who contribute datasets would have to declare that they have followed their national regulations for research ethics, and obtained approval from the relevant ethics committee and consent from the research participants for archiving and sharing the data. Also, the types of permission of data usage (e.g., for education purposes only, and so on; see the sub-section on ethics guidelines above) would have to be specified for each dataset archived. For data access, the groups discussed the need for login information in order to access the data, and the types of user information that may be required (e.g., name, affiliation, and purpose of accessing the data). The groups also agreed on the need for a gatekeeper to the data and for maintenance of the database; for example, checking the quality or format of data submitted, and uploading and removing datasets. The types of metadata required were discussed, including, for example, speaker characteristics, location of recording, and the sources of research funding. The groups also discussed the possibility of linking the DisorderedSpeechBank to an existing database (e.g., the TalkBank).
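The contribution and access conditions discussed above can be illustrated with a minimal sketch. It is not the DELAD or CLARIN implementation: the class names, the field names, and the `may_access` rule are all assumptions made for illustration, and the usage types simply mirror the consent levels mentioned in the ethics discussion.

```python
from dataclasses import dataclass
from enum import Enum


class Use(Enum):
    """Hypothetical usage types, mirroring the consent levels discussed."""
    EDUCATION = "education"
    CONFERENCE = "conference presentation"
    PUBLICATION = "publication"
    SHARING = "archiving and sharing with other researchers"


@dataclass(frozen=True)
class Dataset:
    title: str
    ethics_declared: bool      # contributor declared ethics approval and consent
    permitted_uses: frozenset  # set of Use values allowed for this dataset


@dataclass(frozen=True)
class AccessRequest:
    name: str
    affiliation: str
    purpose: Use


def may_access(dataset: Dataset, request: AccessRequest) -> bool:
    """Illustrative gatekeeper check, not an actual archive policy."""
    # A dataset without the contributor's ethics declaration is never released.
    if not dataset.ethics_declared:
        return False
    # The user must supply a name and an affiliation.
    if not (request.name.strip() and request.affiliation.strip()):
        return False
    # The stated purpose must be one of the permitted usage types.
    return request.purpose in dataset.permitted_uses


corpus = Dataset("Example corpus", True, frozenset({Use.EDUCATION, Use.SHARING}))
request = AccessRequest("A. Researcher", "Example University", Use.EDUCATION)
print(may_access(corpus, request))
```

In practice a human gatekeeper would review each request; the sketch only shows how the permission types recorded for a dataset could be checked against the information a user provides.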

Workshop 2 – Linköping, June 2016

The second workshop was held on 1–3 June 2016, again at Linköping University, Sweden. It was attended by 22 participants, five of whom were new to this initiative and joined the event on behalf of colleagues who had taken part in Workshop 1. The research expertise and background of the participants were similar to those of the first group. Workshop 2 also started with the new participants presenting their research projects and sharing their experience of data collection and storage. The group continued the discussion of ethics guidelines and a pilot archive, carried forward from the previous workshop. One idea to start off DisorderedSpeechBank was to archive some existing datasets that can be shared publicly. However, due to the limits related to ethical issues, very little already existing data can be grandfathered in (Ball et al., Citation2016).

As the possibility of linking to an existing database was raised in the previous workshop, the group invited Professors Barbara May Bernhardt and Joseph Stemberger to the workshop to share their experience of establishing PhonBank, which is housed within TalkBank. In addition, the participants had an online meeting through Skype with Professor Brian MacWhinney to discuss the feasibility of joining TalkBank; and with Dr. Sebastian Drude in the CLARIN ERIC office about linking with CLARIN. The discussions were positive, and the participants agreed that further consideration and exploration were needed before a decision of whether or not to link with a current database could be made. Furthermore, between this workshop and the third workshop in Cork, the present initiative was renamed DELAD and a preliminary website was set up as well.

Workshop 3 – Cork, November 2017

The third workshop was held on 15–17 November 2017 at University College Cork, Ireland. Some new elements were introduced to this workshop. As the participants of the previous two workshops in Linköping indicated that their existing datasets could not be shared because of ethical or IPR regulations, researchers who were about to start data collection for their recently funded projects were invited to this workshop. It was hoped that they could benefit from the information from DELAD regarding data sharing. Hence, a few key collaborators (or beneficiaries) of a project called TAPAS (Training Network on Automatic Processing of PAthological Speech; website: https://www.tapas-etn-eu.org/), funded by the Horizon 2020 Marie Skłodowska-Curie Actions – The Innovative Training Networks of the European Commission, were invited to take part in the workshop. In addition, participants from CLARIN ERIC (website: https://www.clarin.eu/) and FIN-CLARIN (website: https://www.kielipankki.fi/organisaatio/fin-clarin/), as well as legal experts from ELRA (European Language Resources Association, website: http://www.elra.info/en/) were invited to the workshop. Hence, the majority of the participants of this workshop (18 out of 22) were first-time participants of the DELAD workshop, either for the reason stated above or because they were participating on behalf of colleagues who had participated in previous workshops. All participants were based in Europe (Cyprus, Finland, Ireland, Malta, Poland, The Netherlands, UK). The main focus of Workshop 3 was setting up a data archive, along with the legal and ethical issues anticipated for the implementation of GDPR in 2018.

Data archive

As a result of this workshop, the participants agreed that the datasets shared via DELAD would be archived with CLARIN in the form of a CLARIN CSD portal. A number of specific recommendations were made. First, the datasets to be included in DELAD should be composed of speech data collected from individuals with speech disorders. Depending on the research design of the original projects, speech data might have also been collected from typical speakers (e.g., those who were age- and gender-matched with the speakers with speech disorders) to form a control group. In such cases, it would be very useful to the users of DELAD if the data of both experimental and control groups can be shared. The group planned to look for the best practices for defining the sub-categories of speech pathologies. Because of data archiving with CLARIN, their guidelines for data format and versioning will be followed. Regarding data anonymization, direct identifiers (e.g., name, and any other information that can identify the person directly) should be anonymized and broad categories should be used for reporting indirect identifiers (typically: socio-economic status, gender, age intervals, disorder category). For the types of metadata to be included, the group planned to look at best practices, such as AphasiaBank and DementiaBank within TalkBank; and the Component MetaData Infrastructure of CLARIN.
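The anonymization recommendation above (remove direct identifiers, report indirect identifiers in broad categories) can be sketched as follows. This is a minimal illustration under stated assumptions rather than a DELAD or CLARIN procedure: the field names, the list of direct identifiers, and the five-year age interval are all hypothetical choices.

```python
# Hypothetical field names; the workshop did not prescribe a schema.
DIRECT_IDENTIFIERS = {"name", "address", "date_of_birth"}


def age_interval(age_years: int, width: int = 5) -> str:
    """Map an exact age to a broad interval, e.g. 8 -> '5-9' for width 5."""
    lower = (age_years // width) * width
    return f"{lower}-{lower + width - 1}"


def coarsen_record(record: dict, speaker_code: str) -> dict:
    """Return a copy without direct identifiers and with a coarse age band."""
    # Drop every field that identifies the person directly.
    shared = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
    # Replace the exact age with an interval (an indirect identifier).
    if "age_years" in shared:
        shared["age_interval"] = age_interval(shared.pop("age_years"))
    # Refer to the speaker by an arbitrary code instead of a name.
    shared["speaker_id"] = speaker_code
    return shared


raw = {
    "name": "Jane Doe",
    "date_of_birth": "2012-03-04",
    "age_years": 8,
    "gender": "F",
    "disorder_category": "childhood apraxia of speech",
}
print(coarsen_record(raw, "S001"))
```

Removing identifiers from metadata does not by itself anonymize the recordings, since a voice can also identify a speaker; the sketch covers only the metadata side of the recommendation.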

For data access, there is a clear need for a layered approach with different access levels. The group also discussed possible locations for storing the archive and the pros and cons of local storage as opposed to central repository. The workshop noted that if DELAD was linked to an existing database, then we could take advantage of the present technology and infrastructures of the bigger data archive organization.

Legal and ethical issues

For IPR and ethics, the group planned to search for good examples of information sheets and consent forms used by existing databases (e.g., the TalkBank) and other projects. Moreover, safeguarding datasets from deletion, as demanded by some ethics boards, should be a priority. At the end of the workshop, small working groups were formed to follow up on the actions agreed.

Progress since our last report at ICPLA 2018

Since our last update at the ICPLA conference in 2018, the fourth workshop was held in Utrecht, The Netherlands, on 28–30 January 2019. Twenty-four participants, of whom 10 were new, based in 12 countries (Belgium, Canada, Cyprus, Estonia, Finland, Greece, Ireland, Norway, Poland, The Netherlands, the UK, and the USA) attended the workshop, with a few joining online. Similar to the flow of previous workshops, this workshop started with an overview of action points and an update, followed by presentations of research projects and data collection by the current participants and new participants. The goal of this workshop was to review the status of the actions set out in the previous workshop, to exchange deeper insights on ethical and legal aspects (including IPR) of CSD collection in the context of GDPR, and to come up with a plan for the primary special needs of the CLARIN infrastructure to host CSD. Hence, speech disorder researchers, infrastructure specialists, and IPR, ethics and GDPR specialists were invited to participate in this workshop.

A number of outcomes emerged from this Utrecht workshop. First, it was reaffirmed that CLARIN is the Data Trust that can provide the necessary data fence around CSD. Hence, DELAD should apply to become a Task Force within CLARIN focusing on practical issues in sharing CSD. When this process is completed, the DELAD website will include relevant CLARIN guidelines for collecting, sharing and storing CSD. The TalkBank was also seen as a good CLARIN site to host CSD, especially if a European storage cloud and stricter access policy can be realized.

After this workshop the DELAD website was redesigned (website: http://delad.net/) and now includes information and online application forms for researchers to join and share their datasets. DELAD has also linked with the CLARIN K-Centre for Atypical Communication Expertise (ACE; website: https://ace.ruhosting.nl/), where CSD can be hosted and accessed through the CLARIN B-Centres: The Language Archive (TLA; website: https://tla.mpi.nl/tools/tla-tools/) and TalkBank (website: https://talkbank.org/).

As a result of the DELAD workshops, it was decided to create a use case to demonstrate how CSD can be found through one organization and made accessible through another, using the infrastructure related to DELAD. For that purpose, one of the data collections reported during the Utrecht DELAD workshop (Polish Cued Speech Corpus of 20 Hearing Impaired Children, cf. Trochymiuk, Citation2008) was curated and made accessible through TalkBank in the USA and stored at the TLA in The Netherlands. The public record for the dataset includes the original audio recordings (isolated words), prompt texts for the recorded utterances, and the basic speaker metadata (gender and age). Access to more sensitive information, such as medical and family details, remains restricted due to data protection requirements. The data curation process was supported by the CLARIN K-Centre ACE team. The direct link to the dataset profile at TalkBank is https://phonbank.talkbank.org/access/Clinical/PCSC.html; see also the CLARIN ACE show cases record (https://ace.ruhosting.nl/show-cases/).

Current development

In October 2020, DELAD organized a webinar within the framework of the SSHOC (Social Sciences & Humanities Open Cloud) project (website: https://sshopencloud.eu/). The title of the webinar was Sharing Datasets of Pathological Speech. In this one-hour webinar, a group of representatives from DELAD and SSHOC gave five presentations addressing the following topics: (1) the objectives of DELAD, (2) fundamentals of GDPR and ethics for collecting corpora of speech disorders, (3) sharing CSD via layered access and (4) remote access, and (5) curating and sharing pathological speech corpora using infrastructures provided by different organizations. The webinar can be viewed at the website of SSHOC (https://sshopencloud.eu/sshoc-webinar-sharing-datasets-pathological-speech) and the slide deck is available from Zenodo (https://zenodo.org/record/4081602#.X4tDGNAzba8).

The latest DELAD workshop was held online on 27–28 January 2021 on topics including Data Protection Impact Assessments, reviewing changes in ethics perspectives in academia on sharing CSD, and voice conversion as a means to pseudonymise speech. Furthermore, the relation between consent and public and legitimate interest in the sense of the GDPR, as it applies to informed consent forms, was discussed. A blog about the workshop published on the website of CLARIN, and video recordings of the presentations and slides housed by Zenodo, can be accessed via the DELAD website (weblink: https://delad.ruhosting.nl/wordpress/delad-workshop-27-28-january-2021/).

Conclusions

Positive and steady progress in developing DELAD has been made since 2015. The main need identified in the workshops has been for guidelines for sharing CSD. These guidelines involve information on how to ask permission for sharing sensitive data, what the best practices in data collection are (e.g., recording equipment and relevant setup, the types of speech tasks and materials to be included), and what the safest platforms for sharing data are. Another need expressed by the participants of the DELAD workshops is the enhancement of networking among the present and potential new members of the group. We hope that these needs can be addressed through the development and regular updating of the DELAD website.

Declaration of interest statement

The authors report no conflicts of interest.

Acknowledgments

The authors would like to thank Riksbankens Jubileumsfond, Sweden, and CLARIN ERIC (Common Language Resources and Technology Infrastructure) for their funding support for organizing the DELAD Workshops.

Additional information

Funding

This work was supported by the Riksbankens Jubileumsfond [Bank of Sweden Tercentenary Foundation]; CLARIN ERIC.

References