Abstract
The frequency of linguistic units and patterns (linguistic metrics) is specific to each language and varies depending on text complexity. In this study we investigate a number of measures that can contribute to determining text complexity/difficulty and that can be obtained automatically. The variables considered come primarily from counts of phonological units and patterns (e.g. word size, number of prosodic words and clitics, rare syllable types, exceptional stress patterns and rare segment classes), some of which are also informative of texts’ morphological and syntactic complexity and semantic density. We compared the linguistic metrics in a large sample of Portuguese medicine package leaflets (PL), a sample of PL from the medicines that belong to the three most consumed therapeutic groups, and two samples of more common texts (composed of journalistic and oral texts). For each indicator, oral texts, in particular texts produced by speakers with lower levels of education, showed the lowest values in a difficulty continuum, and PL frequently showed the highest values in that continuum, indicating that PL may be less readable than the other types of texts analysed.
Acknowledgements
We are very grateful to Sónia Frota and Marisa Cruz, from the Phonetics Lab of the University of Lisbon, for giving us access to the group of texts used in this study as comparators, which were previously collected within the project FreP – Frequency Patterns of Phonological Objects in Portuguese (http://labfon.letras.ulisboa.pt/FreP).
Notes
1. To the best of our knowledge, the differences in terms of linguistic complexity between oral and written texts have not been systematically investigated in European Portuguese (but see Duarte 2000: chap. 8 for some remarks bearing on the issue). Our investigation will hopefully also contribute to this end.
2. The National Prescribing Guide describes the majority of medicines in the Portuguese market.
3. Texts made available by the FreP project team, coordinated by Sónia Frota (Phonetics Lab of the University of Lisbon). http://labfon.letras.ulisboa.pt/FreP/
4. Texts from the newspaper Público 1991–1994, originally made available by the Natura project, coordinated by José João A. G. Dias de Almeida (University of Minho).
5. Português Falado Project (1995–1997), coordinated by João Malaca Casteleiro (CLUL – Centro de Linguística da Universidade de Lisboa). http://www.clul.ul.pt/sectores/linguistica_de_corpus/projecto_portuguesfalado.php
6. CORP-ORAL project (2005–2008), coordinated by Maria Helena Mateus (ILTEC – Instituto de Linguística Teórica e Computacional). Data collected and originally made available by Tiago Freitas. http://www.iltec.pt/projectos/concluidos/corp-oral.html
7. CORDIAL-SIN project, coordinated by Ana Maria Martins (CLUL – Centro de Linguística da Universidade de Lisboa – and CLUNL – Centro de Linguística da Universidade Nova de Lisboa). http://www.clul.ul.pt/sectores/variacao/cordialsin/projecto_cordialsin_corpus.php
8. Marktest in O perfil dos leitores de imprensa. [The profile of readers of the press]. http://www.marktest.com/wap/a/n/id~1bbe.aspx.
9. Marktest – Bareme imprensa: Informação de mercado de imprensa. [Press market information]. 2.ª VAGA 2014.
10. Under the recent orthographic reform, which has now become mandatory, these consonants are no longer written. The semi-manual procedure mentioned above will therefore no longer be required for newly written texts.
11. MedDRA – Medical Dictionary for Regulatory Activities. http://www.meddra.org/how-to-use/basics/hierarchy