Data Labours: How the Sequence Databases GenBank and EMBL-Bank Make Data: Science as Culture: Vol 25, No 4

Abstract

What actually happens inside genetic databases, how do they work upon data and who does this work? While they have become central tools for doing science, not much is known about the work that goes on inside these vital infrastructures. Ethnographic explorations of two of the world’s largest nucleotide sequence databases, GenBank and the European Molecular Biology Laboratory’s EMBL-Bank, reveal manifold goings-on. Like most infrastructural work, it is modest and invisible routines that build and maintain the vast interconnected suite of bioinformational resources. Data curators construct organisms out of sulphuric sludge, dataflow engineers as self-styled “genetic information plumbers” keep the data deluge flowing, and a data submissions support assistant manages to make room for care amidst this deluge. Taken together, these data labours render tangible the modest and processual aspects of data infrastructure while also revealing the databases to be situated and lively spaces of convergence. Inventively analysing data labours paves surprising ways for encountering and making sense of databases, data and the work they do. Here, practices of natural history, like specimen-making and curation, are continued by other means while the assembly of sludge sheds light on the absences and deletions which mystify infrastructural maintenance work.

Acknowledgements

I wish to thank all the respondents who kindly agreed to be interviewed as part of this research, as well as Kean Birch, Les Levidow and two anonymous reviewers for their insightful comments. I also wish to thank Tanja Bogusz, Althea Greenan, Mike Michael and the participants of the Egenis/Symbiology Seminar at Exeter University for their generous feedback on earlier versions of this paper.

Disclosure Statement

No potential conflict of interest was reported by the author.

Notes

1 During the recent government shutdown, the National Institutes of Health (NIH) had to put 73% of its employees on forced leave, a move which affected most of the online resources, including GenBank. The pages of NCBI web resources carried the following message:

Due to the lapse in government funding, the information on this web site may not be up to date, transactions submitted via the web site may not be processed, and the agency may not be able to respond to inquiries until appropriations are enacted.

2 While separate entities institutionally, their data are exchanged on a daily basis and pooled into one global search domain by means of the International Nucleotide Sequence Database Collaboration. This was established between the three databases in the 1980s to develop collaborative tools (such as data standards, models and conventions) and a unified accessioning system (for a detailed description, see Nakamura et al., 2013).

3 EMBL-Bank is maintained by the European Bioinformatics Institute (EBI), which is part of the EMBL, and located on the Wellcome Genome Campus in Hinxton, UK. GenBank is maintained by the National Center for Biotechnology Information (NCBI), a division of the National Library of Medicine, located within the vast campus of the US NIH in Bethesda, Maryland, USA.

4 In information and library science, accession refers to the addition of an item to a collection, in this case the addition of a record to the database.

5 Sequencing machines and procedures have become much more efficient and, importantly, cheaper. So-called “next generation sequencing” technologies allow for a human genome to be sequenced within about 24 hours. Since 2003 sequencing capacities have increased at a rate of between three- and fivefold per year. How all of the sequence data thus generated will be mined is often of no immediate concern.

6 A recent blogpost satirically summarises the inflation of “next-generation sequencing” as follows: “In a recent poll, 98% of researchers answered ‘next-generation sequencing’ to every single question—even their name, age and job title. The new science of ‘sequence first, think later’ has been coined ‘nextgenomics’” (jovialscientist, 2014).

7 The term “biocurator” emerged in the early 2000s. The first International Biocurator Meeting was held in 2005 in Asilomar, California. The International Society for Biocuration was established in 2009.

8 Nevertheless, over the last 10 years it has come into its own as a discipline, with a vibrant online community dedicated to biocurational concerns (Bateman, 2010).

9 This A4 printout features a series of concentric circles containing the letters A, C, U and G, which represent the four nucleotides (in RNA, the T of DNA is replaced by U), and helps to compose codons, the sequences of three nucleotides that form the genetic code in a DNA/RNA molecule.

10 Acid mine drainage, the flow of sulphuric acid into ground and surface water from mines, has proven a fertile environment for metagenomic analysis (see, for example, Tringe and Rubin, 2005).

11 The locust genome has a size of 6.5 Gb and contains over 17,000 annotated and predicted gene models, over 2,500 repeat gene families, a proliferation of transposable elements and, the data boon, hundreds of potential target sites for pesticides.

12 Take, for example, the European Biodiversity Observation Network (2012 to 2017) project, which seeks to “integrate and distribute” environmental datasets: “Doing so will require (1) the establishment and adoption of new data standards and integration techniques, (2) harmonised data collection, and (3) the development of new approaches and strategies for future biodiversity monitoring and assessment.” None of the work packages mentions plans for data curation.
