FAIR data principles and their application to speech and oral archives

Pages 339-354 | Received 19 Dec 2017, Accepted 01 May 2018, Published online: 28 May 2018
 

Abstract

This paper discusses current principles and issues in open data science and archiving, and their applicability to speech and oral archives. Firstly, a definition of speech and oral archives is provided: they represent a rather tricky object of study not only in language-related studies but also in the social sciences and humanities. Secondly, we introduce the FAIR principles for data science, which are now progressively being adopted in the humanities and in the domain of Language Resources, in connection with research infrastructures such as CLARIN and DARIAH. Finally, we discuss some issues that may slow down the adoption of FAIR principles for speech and oral archives, as well as current initiatives that may help in solving them.

Disclosure statement

No potential conflict of interest was reported by the authors.

Notes

1 According to a commonly agreed-upon definition, ‘Research infrastructures (RIs) are facilities, resources and services used by the science community to conduct research and foster innovation’ (<https://ec.europa.eu/research/infrastructures/index_en.cfm?pg=about>). Among other services, RIs provide easy access to ‘resources such as collections, archives or scientific data’. The European Union fosters the creation of RIs in various disciplines, including the humanities; they provide international access to data and services by federating existing national data centers.

2 The FAIR principles were first proposed by the FORCE11 (Future Of Research Communications and E-scholarship, 2014) group in 2015 (<https://www.force11.org/group/fairgroup/fairprinciples>), and were later adopted by others.

4 The hendiadys in the title of the present paper is therefore deliberate.

5 Forced alignment is an NLP technology used to automatically align an audio recording with its textual transcription. This is particularly useful when transcriptions for legacy data are available that were not produced using transcription software. For more information see <http://oralhistory.eu/workshops/transcription-chain>.

6 For a list of currently existing speech processing tools see <http://liceu.uab.es/~joaquim/phonetics/fon_anal_acus/herram_anal_acus.html> (last accessed on 9 December 2017).

7 ‘Open Science represents a new approach to the scientific process based on cooperative work and new ways of diffusing knowledge by using digital technologies and new collaborative tools’ (European Commission, 2016).

8 Interestingly enough, in the humanities the term resource seems to be used in exactly the opposite sense to its use in geology (<http://energyeducation.ca/encyclopedia/Reserve_vs_resource>). ‘Oil resources’ include all oil in existence, both discovered and undiscovered, exploitable and unexploitable. In our field, this would more likely be conveyed by an expression like ‘language data’.

9 See Calamai and Biliotti (2017a) for an example of reuse of an archive collected for historical documentation through the lens of a morphological analysis.

10 See, among others, the proposal in Mariani and Francopoulo (2015) concerning language resources.

12 Such as, among others, the Dutch DANS <dans.knaw.nl>.

19 Beyond CLARIN, other important players in the identification, distribution and validation of language resources are associations such as ELRA (the European Language Resources Association <http://www.elra.info/>) and its American equivalent LDC (the Linguistic Data Consortium <https://www.ldc.upenn.edu/>).

20 The Collaborative European Digital Archive Infrastructure <http://www.cendari.eu/>.

21 The European Holocaust Research Infrastructure <https://www.ehri-project.eu/>.

22 PARTHENOS (‘Pooling Activities, Resources and Tools for Heritage E-research Networking, Optimization and Synergies’) is an EU-funded infrastructural project (H2020-EU INFRADEV-4–2014/2015) <http://www.parthenos-project.eu/>.

26 CLARIN endorses various types of persistent identifiers. See this document for an overview <https://www.clarin.eu/content/comparison-pid-systems>.

31 Technical details are retrievable at <https://www.clarin.eu/node/3788>.

32 For an overview of CLARIN licenses see <https://www.clarin.eu/content/license-categories>.

33 Listen to a recording from the Mandarine corpus here <http://hdl.handle.net/11312/a-00000938-3>.

36 Concepts also have a persistent identifier (PID); this one is identified by <http://hdl.handle.net/11459/CCR_C-2955_3b1694a0-f9e4-3802-f2bd-1bd371ac5945>.

37 We shall not go into the details, since formats and standards of data encoding in LRs have been extensively discussed elsewhere. A recent reference publication is the Handbook of Linguistic Annotation (Ide & Pustejovsky, 2017). See also the following page for further references on Speech and Spoken Language Resources: <http://liceu.uab.es/~joaquim/language_resources/spoken_res/biblio_corpus_orals.html> (accessed 9 December 2017).

40 In a series of CLARIN-endorsed workshops (Oxford 2016; Arezzo 2017), Oral History has been put in contact with Speech Technology: oral historians and social scientists were able to familiarise themselves with the possibilities currently offered by NLP, and thus realise the importance of adhering to agreed-upon best practices. An ongoing objective of the CLARIN Oral History group is now to put in place a Transcription Chain, that is, ‘a set of concatenated “steps” one has to perform in order to go from a “recorded interview” to a findable, accessible and viewable digital AV-document with relevant metadata on the Internet’ (for more information see <http://oralhistory.eu/workshops>).

44 See the TalkBank website for a list of subprojects <http://talkbank.org/>.

45 Infrastructures strongly promote exchanges among scientists. In CLARIN, these exchanges are gathered under the umbrella of the so-called ‘Knowledge Sharing Infrastructure’; in DARIAH, a similar role is played by the ‘Virtual Competency Centres’.
