
Assessing Annotated Corpora as Research Output*

Pages 1-21 | Accepted 24 Jun 2015, Published online: 07 Dec 2015
 

Abstract

The increasing importance of language documentation as a paradigm in linguistic research means that many linguists now spend substantial amounts of time preparing digital corpora of language data for long-term access. Benefits of this development include: (i) making analyses accountable to the primary material on which they are based; (ii) providing future researchers with a body of linguistic material to analyse in ways not foreseen by the original collector of the data; and, equally importantly, (iii) acknowledging the responsibility of the linguist to create records that can be accessed by the speakers of the language and by their descendants. Preparing such data collections requires substantial scholarly effort, and in order to make this approach sustainable, those who undertake it need to receive appropriate academic recognition of their effort in relevant institutional contexts. Such recognition is especially important for early-career scholars so that they can devote efforts to the compilation of annotated corpora and to making them accessible without damaging their careers in the long term by impacting negatively on their publication record. Preliminary discussions between the Australian Linguistic Society (ALS) and the Australian Research Council (ARC) made it clear that the ARC accepts that curated corpora can legitimately be seen as research output, but that it is the responsibility of the ALS (and the scholarly community more generally) to establish conventions to accord scholarly credibility to such research products. This paper reports on the activities of the authors in exploring this issue on behalf of the ALS and it discusses issues in two areas: (a) what sort of process is appropriate in according acknowledgment and validation to curated corpora as research output; and (b) what are the appropriate criteria against which such validation should be judged?
While the discussion focuses on the Australian linguistic context, it is also more broadly applicable, as we show in this article.

Notes

* We gratefully acknowledge the input received from colleagues at the Australian Linguistic Society conferences at which aspects of this proposal were discussed. Two anonymous reviewers provided very useful comments.

1 In this paper we use the more generalized term ‘repository’ to include archives, such as the specialist linguistics archives discussed below.

2 See for example PLOS ONE: http://www.plosone.org/static/policies.

5 For example, in the Routledge Handbook of Corpus Linguistics (O'Keeffe & McCarthy 2010) the word ‘indigenous’ is not an index item, nor are ‘endangered’, ‘documentation’ or ‘DoBeS’. The only reference that deals with ‘less-studied languages’ that we have found in a search of the corpus literature is Ostler (2008), and, due to the paucity of traditionally conceived corpora in these languages, he also includes archival collections of primary records in his discussion.

7 D-Lib Magazine 21 (1/2) DOI: 10.1045/january2015-contents.

11 There are a number of such archives, represented by the umbrella organization the Digital Endangered Languages and Musics Archives Network (DELAMAN.org).

15 http://www.nsf.gov/news/news_summ.jsp?cntn_id=110719&org=NSF&from=news, initially from 2004, and as a permanent programme from 2007.

16 E.g. at the workshop Potentials of Language Documentation: Methods, Analyses, Utilization held in Leipzig in 2011 (Seifart et al. 2012), by Margetts et al. (2012), as well as on blogs such as: http://www.paradisec.org.au/blog/2012/11/counting-collections.

19 Lawrence et al. (2011: 18ff) call this ‘publication by proxy’.

20 We are grateful to an anonymous reviewer for pointing out that articles of this type are normal practice in the field of corpus linguistics.

21 Their proposal is specific to linguistics; but Lawrence et al. (2011: 21–23) discuss such possibilities under the label ‘overlay data publication’.

22 Of course published articles face similar challenges. For example, whilst scholars presumably submit articles on a particular grammatical feature when they believe that the analysis is complete, further research may require the findings of earlier articles to be revisited. A key difference here is that in the case of corpora we would not expect researchers to refrain from publishing them until they felt all aspects of the analysis of a given language were complete. In this sense corpora can be expected to be more open to change than some more traditional academic publications.

25 It is worth noting that in Callaghan's empirical test, these conditions turned out not to be trivial. She selected seven data sources, all of which passed test 1 because the tool used for selection was based on Digital Object Identifiers (DOIs), which are unique identifiers and remain fixed for the lifetime of the object that they refer to. However, two of these seven failed at least one of the other tests, and were rejected at this editorial stage. Another data source would have progressed to review only on a generous interpretation of test 2, as it had a README file in raw .xml without stylesheet instructions, and yet another had confusing information about access. In other words, these editorial tests applied rigorously would have ruled out more than half of Callaghan's (admittedly small) sample.

26 See for example the descriptions of CHILDES corpora at http://childes.psy.cmu.edu/manuals/.

27 We envisage that in the future online grammars and well-structured lexical databases themselves will be counted as research outputs in the same way as we discuss here for text collections. While they are beyond the focus of the current discussion their accessibility along with the text data would be counted as raising the quality assessment of a text collection.

28 Speakers may be anonymized and in this case information about naming procedures should be provided.

29 Much legacy data fall into this category and are doubtless extremely significant from a heritage perspective, and as the basis for future scholarly work. Thus legacy data merit archiving and curation, but are not considered as a collection eligible for review.

30 For example see the Long Term Ecological Research Network Data Access Policy http://www.lternet.edu/policies/data-access, or the NSF report, Today's Data, Tomorrow's Discoveries, http://www.nsf.gov/pubs/2015/nsf15052/nsf15052.pdf.
