
An evaluation of metrics used by the Performance-based Research Fund process in New Zealand

Pages 270-287 | Received 26 Oct 2017, Accepted 21 May 2018, Published online: 31 May 2018
 

ABSTRACT

The New Zealand Performance-based Research Fund applies a set of metrics to assess researchers and to rank disciplines and universities. The process involves giving a total weighted score (based on three components) to individuals and then assigning them to one of four Quality Categories (QCs), which are used to derive Average Quality Scores (AQSs). This paper evaluates the properties of these metrics and argues that the QC thresholds influence the final distribution of scores. The paper also demonstrates that the derivation of AQSs depends on the weights assigned to each QC and on the distribution of portfolios. The method used to determine raw scores also has an independent effect on the distribution of scores. The paper compares how the research rankings of New Zealand universities would vary if alternative summary measures, based on the total weighted scores rather than on QCs, were used to evaluate performance.

Acknowledgments

We are very grateful to Amber Flynn and Sharon Beattie for providing the TEC data used here and helpful discussions regarding the PBRF process, and John MacCormick for mentioning useful references. We have benefited from helpful comments on an earlier draft by Norman Gemmell, Gary Hawke, Associate Editor Martin Berka and three referees.

Disclosure statement

No potential conflict of interest was reported by the authors.

Notes

1. Three measures are used to allocate Government funding to support research at universities and other Tertiary Education Organisations (TEOs). The first component, Quality Evaluation, comprises 60 per cent of the Fund and is allocated on the basis of an assessment of the research quality of eligible staff. The second component, Research Degree Completions, comprises 25 per cent and is based on the number of research degree completions. The third component, External Research Income, comprises 15 per cent and is based on the amount of external research revenue generated. This paper is concerned only with the Quality Evaluation component.
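
As a purely illustrative sketch of how the three shares combine (the proportional form and the share variables below are assumptions for exposition, not taken from the paper), the funding received by a given TEO could be written as:

```latex
% Illustrative only: q, d and e denote the TEO's shares of, respectively, the
% Quality Evaluation, Research Degree Completions and External Research Income
% pools, and F is the total Fund.
F_{\text{TEO}} \;=\; F \left( 0.60\, q \;+\; 0.25\, d \;+\; 0.15\, e \right)
```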

2. For background and detailed discussion of the PBRF, see New Zealand Tertiary Education Commission (2002, 2013).

3. Wilsdon et al. (2015) argue that ‘Peer review is not perfect, but it is the least worst form of academic governance we have’. They also suggest that bibliometric data can usefully complement other forms of evaluation of research quality. Examples of the use of bibliometric methods to assess the impact of the PBRF on research performance include Gibson, Anderson, and Tressler (2008), Smart (2009), Hodder and Hodder (2010), Anderson, Smart, and Tressler (2013) and Anderson and Tressler (2014).

4. A broader discussion of the process, and an analysis of the evolution of research quality in NZ universities between 2003 and 2012, are provided in Buckle and Creedy (2018a, 2018b).

5. The importance of research rankings for the reputation of universities is stressed in OECD (2010).

6. In 2012, the TEC developed four alternative AQSs which varied according to the denominator used. For example, AQS(E) uses equivalent full-time students; AQS(P) uses the subset of equivalent full-time postgraduate degree students; AQS(S) uses all academic teaching and research staff. The measures used in this paper are AQS(N), which uses the full-time equivalent staff for whom evidence portfolios (EPs) were submitted, and a measure using all non-administrative staff.
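
As a hedged sketch of the common structure of these variants (the weight values themselves are not reproduced here), each AQS can be thought of as the same weighted count of Quality Categories divided by a different denominator:

```latex
% Sketch: w_q is the weight attached to Quality Category q (A, B, C, R),
% n_q is the number of staff assigned to category q, and D is the denominator
% that distinguishes the variants, e.g. D = N (FTE staff with evidence
% portfolios submitted) for AQS(N), or D = S (all academic teaching and
% research staff) for AQS(S).
\text{AQS}(D) \;=\; \frac{\sum_{q} w_q \, n_q}{D}
```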

7. The Tertiary Education Commission undertakes a moderation process at the commencement of each PBRF assessment round to encourage consistent assessment over time and across discipline panels. This process includes training expert panel members on assessment methods and referring to a sample of evaluations from previous PBRF rounds.

8. These categories will differ in the 2018 round.

9. The recognition that new researchers may take time to establish their research, publications and academic reputations led to the introduction in 2006 of the new categories C(NE) and R(NE), which apply to new and emerging researchers and are assigned the same weights as C and R, respectively. For present purposes, these can be ignored and all Cs and Rs are grouped together. The implications of this change are discussed by Buckle and Creedy (2018a).

10. The TEC is not responsible for designing the metrics used; its only task is to administer the PBRF process.

11. It is possible that the metrics used to obtain the initial raw scores are considered to be useful but partial measures, to be supplemented by a range of (unspecified) qualitative considerations. However, one would then expect marginal adjustments, rather than the large and less transparent ones.

12. In practice the scores for the different components may be expected to be positively correlated.

13. The question arises of whether a wider range for each component score, s_i, say from 0 to 10, would make a difference. Such a choice results in 1573 permutations, for integer values of x ranging from 0 to 1000. The resulting distribution of QCs turns out to be similar to that obtained using the current metric.
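
A minimal Python sketch of how such an enumeration of attainable total weighted scores can be carried out; the component weights below are placeholders chosen only so that the integer totals span 0 to 1000, not necessarily the weights used in the PBRF.

```python
from itertools import product

def attainable_scores(weights, s_max):
    """Return the sorted distinct totals x = sum_i w_i * s_i, where each
    component score s_i takes integer values from 0 to s_max."""
    return sorted({sum(w * s for w, s in zip(weights, combo))
                   for combo in product(range(s_max + 1), repeat=len(weights))})

# Placeholder weights summing to 100, so that x runs from 0 to 100 * s_max;
# they are illustrative only, not necessarily the PBRF component weights.
scores = attainable_scores(weights=(70, 15, 15), s_max=10)
print(len(scores), scores[0], scores[-1])
```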

14. As discussed above, and again in Section 5.1, the process is actually iterative.

15. The data were made available to the authors, following a confidentiality agreement. The dataset is not publicly available.

16. The final column in the table corresponds to the final column of Table 2 of Buckle and Creedy (2018a).

17. Ministry of Education (2012, p. 11, n. 2) describes this as follows.

Researchers did not have to participate in the 2006 round. The Working Group had recommended that a [Quality Evaluation] QE three years after the first would help to ensure a managed transition, to develop good practices in performance evaluation, and to acknowledge the need to learn from experience after the first QE. In reviewing the 2003 QE, it was determined that a full round was not necessary. Instead, the partial round assessed staff who had not been assessed in 2003, staff who wished to be reported under a different subject area, and staff who wished to be reassessed in the hopes of achieving a higher quality rating.

18. Indeed, as shown by Buckle and Creedy (2018a), the major improvement in AQSs from 2003 to 2012 was associated with the exit of a large number of R-type staff. Furthermore, among remaining Rs submitted by universities in 2012, there is a complete absence of zero scores.

19. Shockley (1957) first suggested that measured performance depends multiplicatively on a range of qualities and, as shown by Aitchison and Brown (1957), this gives rise to the lognormal distribution. For a flavour of the literature on productivity and research metric distributions, see, for example, Allison and Stewart (1974), Cortés, Perote, and Andrés (2016), Heckman and Sattinger (1991), Moreira, Zeng, and Amaral (2015), Perc (2010), Roy (1950), Ruocco, Daraio, Folli, and Leonetti (2017) and Swan, Powers, and Bos (1999). However, some authors report evidence of an upper tail fatter than that of the lognormal distribution.
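
In outline, the standard argument linking multiplicative qualities to the lognormal distribution runs as follows (a textbook sketch, not specific to the PBRF data):

```latex
% If measured performance P is the product of k positive, roughly independent
% quality factors X_1, ..., X_k, then its logarithm is a sum of the logged
% factors, which is approximately normal for large k by the central limit
% theorem, so P itself is approximately lognormally distributed.
P = \prod_{i=1}^{k} X_i
\;\Longrightarrow\;
\ln P = \sum_{i=1}^{k} \ln X_i \;\approx\; N(\mu, \sigma^2)
\;\Longrightarrow\;
P \sim \text{Lognormal}(\mu, \sigma^2)
```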

20. The differences between frequencies, summed over the whole range of indicative scores, do not add to zero. The sum is positive because there are some missing values in the distribution of indicative scores; these arise where agreement could not be reached by the appropriate subject panel members at the initial stage.

21. For 2003, there is an increase of one portfolio in the range below 600, but this is associated with a portfolio without a final raw score; the panel could not agree at that stage, so no score was recorded.

22. In the 2012 round, there was no incentive for universities to submit portfolios for those expected to be judged as Rs. For further details of the nature of the process, see Buckle and Creedy (2018a).

23. The correlation coefficient between the 2003 and 2012 differences in frequencies between QC and final score is high, at 0.93. However, the correlation coefficient between the corresponding 2003 and 2012 differences in frequency between final and indicative scores is much lower, at 0.37.

24. The precise values are taken from Buckle and Creedy (2018a), where each individual is given an employment weight of unity. However, the rankings are exactly the same as reported in New Zealand Tertiary Education Commission (2013).

25. These values are taken from Buckle and Creedy (2018a).
