Original Articles

Collaboration to Data Curation: Harnessing Institutional Expertise

Pages 4-16 | Published online: 20 Oct 2010

Abstract

It can be argued that institutional repositories have not had the impact initially expected on academic scholarly communications (Lynch 2003; Salo 2008), the exception being a few well-developed and successful instances. So why should data repositories expect to fare any better? First, data repositories can learn from publication repositories’ experiences and their efforts to engage researchers to accept and use these new institutional services. Second, they provide a technical infrastructure for storing and sharing data with the potential for providing access to complementary research support facilities. Finally, due to the interdisciplinary expertise required to develop and maintain such systems, stronger ties will be forged between libraries, information and computing services, and researchers. This will assist innovation and help to make data repositories sustainable and embedded within academic institutional policy.

This paper, while aware of the diverse nature of institutional and departmental practices, aims to highlight a number of initiatives in the Universities of Edinburgh and Oxford, showing how research data repository infrastructures can be effectively realized through collaboration and sharing of expertise. We argue that by employing agile community, strategic and policy judgment, a robust data repository infrastructure will be part of an integrated solution to effectively manage institutional research data assets.

Introduction

This paper, while aware of the diverse nature of institutional and departmental research cultures and practices, aims to highlight a number of initiatives that will show how research data management and repository infrastructures can be effectively realized through collaborative efforts and the sharing of expertise. As such, they demand a prominent place on the academic research landscape by providing systematic and trusted curatorial and archival services, engaging interfaces that encourage re-use of content, in addition to addressing funder and institutional requirements regarding research data management mandates.

We present activities at the University of Oxford and the University of Edinburgh that build on lessons learned from publication repositories and make use of a different set of strategies to deal with research data. These strategies require multidisciplinary skills in areas such as information management, computing, economics, institutional governance, and social dynamics supplied by such actors as departmental heads, librarians and computing staff, principal investigators, records managers, archivists, and research office staff. The alignment of specialists from the aforementioned backgrounds is an important step on the route to a cohesive infrastructure to support researchers in the creation and use of data while ensuring the appropriate harvesting of the products of the research base.

In the following section, the context is set by briefly describing what the generic data repository can learn from the publication repository paradigm. The next sections highlight collaborative data management activities in the Universities of Oxford and Edinburgh. Finally, the personal reflections of the authors are articulated in the discussion.

Research Data as a Product of the Research Base

The recent Council for Science and Technology report “A Vision for UK Research” (2010) placed emphasis on two linked processes, namely: “focusing on excellence across the research base and second, harvesting the products of the research base” in order to remain competitive with emerging science-based economies (e.g., India, China) and to maintain the UK's leading position in the global research marketplace. In order to sustain this position, the report emphasizes the need for the development of new collaborative models. These could manifest themselves as collaboration surrounding facilities where cost may be a factor, collaboration where the sheer scale of effort needed can deliver both breadth and economies of scale not possible for any single participant, and collaboration at the local level which pools both resources and expertise.

Integral to the whole research base are research outputs such as publications and digital data, which serve as both evidence of intellectual endeavor and the means to verify it. University strategies to harvest these products have developed around the concept of digital repositories developed by the academic libraries. The first realizations of such information systems were publication repositories, built to manage and disseminate research articles and aimed at providing open access to a significant proportion of newly published academic papers. The development of research data repositories has been seen as the next coherent step in the growth of repositories (Heery & Powell 2006). Nonetheless, it can be argued that institutional repositories have not had the impact initially expected on academic scholarly communications (Lynch 2003; Salo 2008), the exception being a few well-developed and successful instances. So why should data repositories expect to fare any better?

First, the data repository activity can learn from publication repositories’ experiences (Macdonald & Martinez-Uribe 2009) and their efforts to engage researchers to accept and use these new institutional services. Second, data repositories provide not only a technical infrastructure for storing, sharing, and managing data but also access to complementary research support facilities. These include data management training and auditing tools, innovative utilities such as dataset citation, as well as linking tools and accessories to visualize and analyze heterogeneous content. Finally, due to the interdisciplinary expertise required to develop and maintain such organizational and technical systems, stronger ties will be forged between libraries, information and computing services, and researchers, helping to make data repositories sustainable and embedded within academic institutional policy.

Strategies to Harvest the Products of the Research Base

Data Repository Activity at Oxford

A research data management scoping study (Martinez-Uribe 2008) directed by the Oxford Digital Repositories Steering Group throughout 2008 revealed that University staff from a range of disciplines and central departments face a variety of challenges relating to the creation and management of data. This comes at a time when research councils are increasingly developing policies that require certain levels of data sharing and curation (Jones 2009). Such policies are an important and welcome step towards a new scholarly communication landscape but, in some cases, they can be dislocated from the research labs or other environments where research takes place. This is echoed by the RIN Disciplinary Case Studies in the Life Science Project (2009), which found that data and information sharing activities are mainly driven by the needs and benefits perceived as most important by practitioners rather than by “top-down” policies and strategies.

Central services at Oxford, including computing services, library, and research services, together with academic departments are looking at ways of streamlining these issues for their researchers. There is an urgent need for establishing coherent institutional frameworks that support the creation, curation, and reuse of data while addressing research council policies and understanding the value and cost of data management and curation activities.

Since the initial study, a complete research data program has commenced at Oxford. This program includes a range of data repository activities that are working across subject domains as well as institutional service providers and themes. The final aim is to develop a robust collaborative data repository infrastructure to support researchers with their data.

One such activity is the Embedding Institutional Data Curation Services in Research (EIDCSR) project (Note 1). Through EIDCSR, central departments are working with three collaborating research groups in Life Sciences and Medicine to scope and address their data management requirements. These groups collaborate as part of a nationally-funded project and conduct research into ventricular tissue architecture. In their research, they combine traditional histological and novel imaging techniques, such as Magnetic Resonance Imaging (MRI) and Diffusion Tensor MRI, with image processing and computational models for bio-mathematical simulation. By bringing together this sophisticated range of techniques and an extraordinary array of areas of expertise, the collaborative groups are generating hundreds of gigabytes of data. The funding agency requires them to store these data and make them available for ten years after the completion of the project.

The EIDCSR project is also investigating institutional policy and guidance for the management of research data and records. The Research Services Office is leading this work in collaboration with the University of Melbourne, following their Data Management Policy (Note 2) as an exemplar. The approach taken attempts to transform funders’ policies into something that clarifies the responsibilities of both department and researcher while pointing them to existing services and other useful information and resources. Through the experience in Melbourne and the work in Oxford it has been realized that, in order to develop successful policy and guidance for research records, it is crucial to understand the role such policies can play at both the university and the academic department level. Moreover, policy and guidance ought to be useful, in practical terms, to researchers. Researchers must therefore be involved in their development, for it is vital that the implementation of institutional policy penetrates to the local or departmental level.

Another area being explored through EIDCSR is the economics of research data management. Key questions in data repository development include how much it costs to manage data and who will pay for it (Blue Ribbon Task Force 2010). By participating in the KRDS2 Project (Note 3) with a range of data centers, departments, and institutions, detailed information was gathered about the costs of creating, managing, and curating the data created by the research groups participating in EIDCSR. The results shown in Figure 1, first published in Beagrie et al. (2010), highlight the high costs of creating the specific datasets by the research groups in question. The second biggest cost, start-up curation, covers the curatorial activities undertaken as part of the EIDCSR project; this includes metadata management and technical developments. These curatorial costs are expected to be lower if provided by an established institutional service. As an example of this, the backup and long-term filestore service provided by Computing Services ensures that copies of the data are kept safe for five years. This established service has a minimal cost in comparison to creation and start-up curation.

FIGURE 1 Data management and curation costs from the Oxford survey.


Figure 2, also first published in Beagrie et al. (2010), presents the same costs distributed over time across a data lifecycle of eight years. The costs are concentrated in the first years, when the data are created, and reduce significantly as the data progress through their lifecycle.

FIGURE 2 Data management activities placed in time.


Other data management activities that have recently started at Oxford include the Sudamih and the Admiral projects. Sudamih is an EIDCSR sister project and it shares the institutional and procedural frameworks developed through EIDCSR to scope and address the requirements of scholars in the Humanities Division. The project will pilot the provision of a database as a service to enable the creation and publication of datasets including images, text, or geo-referenced data. It also has a strong emphasis on training and will gather requirements about data management training needs in order to develop and pilot training modules based around existing training resources such as DCC 101 (Note 4). The Admiral project, led by the Image Bioinformatics Research Group (IBRG) in Zoology, is working with life science researchers to provide them with tools to locally store and annotate their data and then package them for archiving and submission into a central repository for preservation provided by the Library. Previous data curation activities from IBRG have provided extremely helpful concepts such as Sheer Curation (Note 5) and have proved the usefulness of data-enriched articles (Shotton et al. 2009).

Finally, Oxford is taking part in the UK Research Data Service (UKRDS) that is working towards a national infrastructure for research data management. The initial feasibility study concluded that the best approach should embed capacity, skills, and training (Beagrie et al. 2009). Currently, UKRDS is planning the Pathfinder implementation phase in which Oxford will play a key role with Leeds, Bristol, and Leicester.

Research Data Management Stakeholders and Initiatives at Edinburgh

The DISC-UK DataShare project (Note 6) (Mar 2007–Mar 2009) sought to develop models for multi-disciplinary institutional data repositories in the UK higher education sector. It was led by EDINA and Edinburgh University Data Library in partnership with the University of Oxford and the University of Southampton. It concluded that (Rice 2009):

  • Data management motivation is a better bottom-up driver for researchers than data sharing but is not sufficient to create culture change.

  • Data librarians, data managers and data scientists can help bridge communication between repository managers and researchers.

  • Institutional repositories can improve impact of sharing data over the internet.

From a local perspective, the project was instrumental in developing the Edinburgh DataShare research data repository, which is hosted by the Data Library and contributes to JISC RepositoryNet, an interoperable network of repositories that provides UK tertiary education with access to trusted and expert information about repositories. Edinburgh DataShare shares the DSpace software platform with the Library's Research Publication Service, which was launched in January 2010 by Edinburgh University Library to support the implementation of the University's Open Access Publications Policy (Note 7). That service comprises the Publications Repository, a closed repository for use only in the University of Edinburgh, and the Edinburgh Research Archive, a public open access repository. As such, Edinburgh DataShare retains the potential to interoperate with the publication repositories and to link supplementary research data produced by local researchers to the corresponding publications; this linking is currently being scoped as part of an internally funded project, Linking Articles Into Research Data (LAIRD).

As a direct result of funding council requirements regarding the management and sharing of research data after a research project has been completed, Edinburgh DataShare has been approached by a number of local researchers who wish to have a permanent location for their completed and documented dataset(s) with an open access metadata record. Such an engagement opportunity facilitates the development of the data repository by scoping functional requirements regarding value-added visualization and analytic tools such as multi-media viewers, licenses, domain-specific metadata schemes and file formats, federated access, links to remote storage, and semantification of content.
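To make the idea of an open access metadata record with links to remote storage more concrete, the following is a minimal illustrative sketch in Python. It is not the Edinburgh DataShare data model: the class name `DatasetRecord`, all of its field names, and the example values and URL are hypothetical and chosen only to mirror the requirements listed above (licenses, file formats, links to publications and to remote storage).

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class DatasetRecord:
    """Hypothetical metadata record for a deposited dataset (illustration only)."""

    title: str
    creators: List[str]
    license: str  # name of the licence under which the data are shared
    file_formats: List[str] = field(default_factory=list)
    related_publication: Optional[str] = None  # identifier of a linked article, if any
    remote_storage_url: Optional[str] = None  # set when the files live outside the repository

    def is_stored_remotely(self) -> bool:
        # When a remote location is given, the repository holds only the metadata.
        return self.remote_storage_url is not None


# Example values are invented for illustration.
record = DatasetRecord(
    title="Example survey dataset",
    creators=["A. Researcher"],
    license="example-open-licence",
    file_formats=["csv"],
    remote_storage_url="https://storage.example.org/datasets/123",
)
print(record.is_stored_remotely())
```

Keeping the storage location as an optional field is one possible design choice: it lets the same record describe both datasets held in the repository itself and very large datasets held elsewhere.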

The Data Library also led one of the pilot demonstrator projects of the HATII/DCC-led Data Audit Framework (DAF) (Note 8), which conducted audits of departmental data collections, thus engaging with the local research community in data management practices. The primary recommendation by the Edinburgh DAF Steering Group was for the adoption of an institution-wide research data management policy. Other key findings (Ekmekcioglu & Rice 2009) indicated that staff require practical and systematic guidance on research data management, whether from research unit or school procedures, college or university-wide infrastructure and policy, or identifiable forms of support such as expert support staff, web pages, and discipline-specific guidelines, as well as short, focused training opportunities. This resonates with the Oxford Scoping Study and the RIN Disciplinary Case Studies in the Life Science Project conducted by researchers at the University of Edinburgh, which demonstrated a lack of coherency in research data guidance and training. This, in turn, indicated that local and ad hoc mechanisms were in place, reflecting both scientific laboratory culture and working practices.

Using DAF as the engagement vehicle, the Data Library is currently scoping generic data management training based on research data management guidance materials (Note 9) developed by the Data Library and the Research Computing Service for early stage researchers in conjunction with the Postgraduate Transferable Skills Unit and the Researcher Development Programme. In November 2009, a training course was piloted for PhD students in the School of Geosciences as part of the Postgraduate Research Students Training Programme. Initial feedback suggested that such courses should seek to strike a balance between discipline-specific and generic content.

In spring 2010, a review commenced at the University of Edinburgh to address the issue of managing the rapidly expanding volume and complexity of data produced by Edinburgh researchers. Concern is both for the shorter term—ensuring competitive advantage through secure and easy-to-use access—and for the longer term—ensuring long-term access and usability for the research community into the future.

The Review is overseen by the IT Committee and the Library Committee and has twin tracks to look at Research Data Storage and Data Management, Curation, and Preservation. The Review is looking at current practice in the University of Edinburgh, assessing what is known about current practice in peer universities and internationally, and developing options which will include costs, feasibility, and a risk analysis of actions or inactions in this field.

Crucial to its success is the development of partnerships with researchers, heads of department, and principal investigators, who have to be convinced that any research data management solution will allay fears about issues such as privacy, loss of ownership, misuse, personal investment, and IPR uncertainties, while offering tangible benefits such as reliable access to researchers’ own data, a suitable environment in which to adhere to funders’ mandates, metadata that can increase the exposure of an individual's research within the research community, and the devolution of preservation from the individual to the institution. Of equal importance are those with supporting roles, such as librarians, computing services, school administrators, records managers, archivists, and research office staff, who have a stake in using, preserving, and re-using digital data output as part of the research process.

There are a number of other services and initiatives that have a role to play on the Edinburgh repository stage, including:

  • The Digital Curation Centre, which enters its third phase, and aims to build strong foundations for good data curation practice across the HE sector by providing support to data custodians with a specially devised DCC training program aimed at encouraging the transfer of knowledge and best practice, first among data custodians, then between data producers and users.

  • The Edinburgh Compute Data Facility (ECDF) Storage Area Network (SAN) which provides large scale storage; thus offering a potential solution for hosting very large datasets through interoperation with a data repository service such as Edinburgh DataShare which could store the corresponding metadata record(s).

  • The Enhancing Repository Infrastructure in Scotland (ERIS) Project, led by the University of Edinburgh, whose aim is to work in collaboration with Scottish researchers and their institutions’ repository managers to motivate researchers to deposit their work in repositories and facilitate the integration of repositories in research and institutional processes. ERIS also intends to engage with research pools (Note 10) to create “virtual repositories” that represent aggregations of research outputs as collected from their participating members’ institutional repositories, ensuring that the practical requirements of these repositories as stated by the research pools are met. The University of Edinburgh plays a key role in a number of the Scottish Research Pools.

The Data Library and ERIS are currently appraising the potential use of Edinburgh DataShare as a mechanism for storing and providing shared access to research data generated by the distributed researchers who comprise the cross-institutional research pools. Along with colleagues from EDINA, the DCC, and the Library, they also form the organizing committee for the Repository Fringe Conference (held in Edinburgh since 2008), which acts as a forum for repository developers, managers, researchers, administrators, and open access practitioners to discuss and share developments in the repository world.

Discussion

Initially, data repositories can be thought of as the technical infrastructure to deal with the creation, storage, management, and curation of research data. Nonetheless, current research and practice show that for a data repository to be successful, it is necessary to develop not only the technical infrastructure but also a whole range of other institutional services. The development and implementation of these data repository services need to be an orchestrated exercise involving a wide group of institutional actors including academics, computing services, libraries, and research services.

Although this paper focuses its attention on the non-technical aspects of data repositories, it is important to highlight that there are key developments in this area that promise to revolutionize the way data are collected and analyzed. With the almost exponential growth of research data output and the absence of off-the-shelf data management solutions, there are those in the e-science community who are proposing “taking the computing to the data,” that is, collaboration between the domain specialist and the computing specialist at the design phase of a data system to come up with a common language to describe the terms used in the scientific and computing domains. Jim Gray (Szalay & Blakeley 2009) called this “Data Intensive Scalable Computing” in his informal rules that codify how to approach data engineering challenges related to large-scale scientific datasets. An approach utilizing “Gray's Laws” is more simplistic in terms of scalability and connecting to the scientists and may well offer potential lessons that can be applied to research data management in the academic sphere.

In sum, the activities discussed in this paper show that insightful strategic and policy judgment can help to develop robust institutional data repository frameworks, enabling a move towards seamlessness among the professionals working or interacting with research data throughout the lifecycle. This consolidation of expertise within research institutions via intra- and inter-institutional and cross-facility services could be seen as part of a research data management solution in this time of economic constraints that are impacting academic institutions.

Notes

1. Embedding Institutional Data Curation Services in Research (EIDCSR) http://www.eidcsr.oucs.ox.ac.uk

2. University of Melbourne Data Management Policy http://www.unimelb.edu.au/records/research.html

3. Keeping Research Data Safe 2 http://www.beagrie.com/jisc.php

7. University's Open Access Publications Policy http://tiny.cc/5845v

8. Data Audit Framework (DAF)—http://www.data-audit.eu/ provides organisations with the means to identify, locate, describe and assess how they are managing their research data assets via online tools and methodologies.

10. Research Pooling is defined as the formation of strategic collaborations between universities in disciplinary or multi-disciplinary areas involving the international quality departments or individual researchers across Scotland http://sligachan.lib.ed.ac.uk/wordpress-mu/themes/research-pools/

References