Mitochondrial DNA Part A
DNA Mapping, Sequencing, and Analysis
Volume 27, 2016 - Issue 6
Mitocommunication

The chloroplast genome hidden in plain sight, open access publishing and anti-fragile distributed data sources

Pages 4518-4519 | Received 13 Sep 2015, Accepted 26 Sep 2015, Published online: 21 Oct 2015

Abstract

We sequenced several cannabis genomes in June of 2011, and the first and longest contigs to emerge were the chloroplast and mitochondrial genomes. Having been a contributor to the Human Genome Project and an eyewitness to the real benefits of immediate data release, I have first-hand experience of the potential malinvestment of millions of dollars of taxpayer money, narrowly averted thanks to the adopted global rapid data release policy. The policy was vital in reducing duplication of effort and economic waste. As a result, we felt obligated to publish the Cannabis genome data in a similar spirit and placed them immediately on a cloud-based Amazon server in August of 2011. While these rapid data release practices were heralded by many in the media, we still find that some authors fail to find or reference this work. We hope to persuade the readership that this omission has more pervasive repercussions than bruised egos and is a regression for our community.

In June of 2011, we sequenced several cannabis genomes, and the first and longest contigs to emerge were the chloroplast and mitochondrial genomes (Stafford, 2011). Having been a contributor to the Human Genome Project and an eyewitness to the real benefits of immediate data release, I have first-hand experience of the potential malinvestment of millions of dollars of taxpayer money, narrowly averted thanks to the adopted global rapid data release policy. The policy was vital in reducing duplication of effort and economic waste. As a result, we felt obligated to publish the Cannabis genome data in a similar spirit and placed them immediately on a cloud-based Amazon server in August of 2011 (McKernan, 2015). While these rapid data release practices were heralded by many in the media (Medicinal Genomics Sequences the Cannabis Genome to Assemble the Largest Known Gene Collection of this Therapeutic Plant, 2015), we still find that some authors fail to find or reference this work. We hope to persuade the readership that this omission has more pervasive repercussions than bruised egos and is a regression for our community.

The Human Genome Project was well known for the Bermuda Principles data sharing and publication guidelines (Policies on Release of Human Genomic Sequence Data, 2015). This convention reached a consensus on 24-hour data release for any sequence data larger than a kilobase. It encouraged the etiquette of referencing pre-published data to prevent data hoarding during the long peer review cycles required to assemble complex genomes. This open and beneficial philosophy of rapid data sharing is easily undermined by a few individuals who choose to use their knowledge of these data to scoop, without reference, those who were more open and generous in the publication process. This has in fact occurred in the September issue of Mitochondrial DNA with the Vergara et al. publication of Cannabis chloroplast genomes, despite not only an Amazon cloud instance of the data but also a 3.5-year-old Google-indexed website containing the data as a top-ranking search result for the text “Cannabis Chloroplast” (Vergara et al., 2015). Missing these links is perhaps understandable for new entrants to a field, but even a direct email of 99% identical sequence to the authors, 3 months prior to their submission, did not induce a comparison or even an acknowledgment that the data already existed. While this can be the result of a simple misunderstanding, I believe it is important to underscore the value of searching beyond a single centralized database like NCBI for information relevant to a given manuscript.

Our 2011 data publication was not the first cannabis DNA sequence in existence; however, it was the first voluminous next-generation sequencing data set. Its terabyte size required shipment of hard drives to facilitate NCBI submission due to software and local network constraints. In light of these constraints, we reconsidered the proper repository for these data based on several global trends related to Moore’s law, Nielsen’s law, Kryder’s law, and the Austrian School of Economics (Austrian Business Cycle Theory, 2015; Kryder, 2015; Moore's Law, 2015; Nielsen's Law, 2015).

Austrian school of economics foreshadows fragility in centralization

In 2011, the NCBI SRA database was broadcasting a potential funding cut that would severely cripple its capacity to be a centralized data store for DNA sequence information (NCBI Data Submissions During Federal Government Shutdown, 2013; Wiecek, 2015). Two years later, in 2013, this cycle repeated itself with an actual NCBI shutdown (NCBI Data Submissions During Federal Government Shutdown, 2013). As a result of the 2011 announcement, we investigated more anti-fragile databases that relied on cloud and torrent architectures to defray the risk of data centralization and the economic uncertainties inherent in what Hayek would describe as a fiat-funded database model (Hayek et al., 1935). The realization of a sneakernet bottleneck (postage-stamped hard drives) at NCBI was frightening. The looming output of the thousands of next-generation sequencers around the world, outpacing Moore’s law in data production capacity, Nielsen’s law in network speed evolution, and Kryder’s law in drive space evolution, was an obviously unscalable proposition for a centralized data archive going forward. The centralized DNA database model presented the largest technical hurdle in the next-generation sequencing space and was in retrospect “fragile by design”. In our opinion, distributed and decentralized database architectures that mimicked the decentralized nature of data production were the only lasting architecture worth considering.
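The scaling argument above can be made concrete with a little arithmetic. The following is a minimal sketch, not a measurement: the Moore, Kryder, and Nielsen rates are the commonly quoted rule-of-thumb values, and the sequencing doubling time of 7 months is a purely hypothetical figure chosen for illustration. The function names are ours.

```python
import math


def annual_factor(doubling_months: float) -> float:
    """Yearly growth factor of a quantity that doubles every `doubling_months` months."""
    return 2.0 ** (12.0 / doubling_months)


def years_to_gap(demand_factor: float, capacity_factor: float, gap: float) -> float:
    """Years until demand exceeds capacity by `gap` times, starting from parity."""
    if demand_factor <= capacity_factor:
        raise ValueError("demand must grow faster than capacity")
    return math.log(gap) / math.log(demand_factor / capacity_factor)


# Commonly quoted rates, used here as assumptions rather than measurements.
RATES = {
    "Moore (doubling every 24 months)": annual_factor(24),
    "Kryder (doubling every 13 months)": annual_factor(13),
    "Nielsen (+50% bandwidth per year)": 1.5,
}

# Hypothetical sequencing output growth: doubling every 7 months.
SEQUENCING = annual_factor(7)

if __name__ == "__main__":
    for name, factor in RATES.items():
        years = years_to_gap(SEQUENCING, factor, 10.0)
        print(f"{name}: x{factor:.2f} per year, 10x shortfall after {years:.1f} years")
```

Whatever the exact rates, the point of the sketch is that any sustained gap between exponential growth rates compounds, so a single archive fed over a single network falls behind by an order of magnitude within a few years.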

To this day, debt ceilings continue to escalate, and the financial backing of many life science databases is both centralized and dependent on a fiat monetary system that Hayek would predict to be untrustworthy and exposed to fragile economic conditions lacking a price signal. Hayek’s Nobel Prize-winning work on business cycle theory would warn that a central repository for the DNA sequence of every experiment on earth might be the least evolutionarily conserved lesson in evolution humankind will ever face. We should embrace early and open access to data, and we should embrace decentralized torrent or blockchain-like means of storing these data (Lee & Dinu, 2014). These are the architectures embraced by the high-bandwidth video communities and the Bitcoin communities, neither of which relies on cumbersome submission processes and both of which enable adolescent submission and access to the data. Only then will data be read-write accessible to all and have anti-fragile characteristics that survive the test of time.
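One property that makes torrent-like distribution workable without a trusted central host is content addressing: a file is identified by hashes of its pieces, so a copy fetched from any mirror can be verified independently. The sketch below illustrates that idea only; it is not the BitTorrent wire format or any particular blockchain, and the function names, manifest layout, and default piece size are hypothetical.

```python
import hashlib


def build_manifest(data: bytes, piece_size: int = 1 << 20) -> dict:
    """Split `data` into fixed-size pieces and record a SHA-256 digest for each."""
    pieces = [
        hashlib.sha256(data[i:i + piece_size]).hexdigest()
        for i in range(0, len(data), piece_size)
    ]
    # The identifier depends only on the content, not on where it is hosted.
    content_id = hashlib.sha256("".join(pieces).encode("ascii")).hexdigest()
    return {
        "piece_size": piece_size,
        "length": len(data),
        "pieces": pieces,
        "id": content_id,
    }


def verify_piece(manifest: dict, index: int, piece: bytes) -> bool:
    """Check a piece obtained from any mirror against the manifest."""
    return hashlib.sha256(piece).hexdigest() == manifest["pieces"][index]


if __name__ == "__main__":
    sequence = b"ACGT" * 1000  # stand-in for a sequence file
    manifest = build_manifest(sequence, piece_size=1024)
    print(manifest["id"], len(manifest["pieces"]))
    print(verify_piece(manifest, 0, sequence[:1024]))
```

Because the identifier is derived from the data themselves, two independent hosts holding the same reads publish the same identifier, and a corrupted or altered piece is detected without consulting a central authority.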

NCBI is in fact aware of these risks and is employing decentralization on a few of its datasets, with the 1000 Genomes Project data and BLAST servers being placed on AWS (Genomes Project and AWS, 2015). This distribution will require that we search in more than one place for data, but that is a problem well addressed with modern tools, and it pays many dividends in reduced duplication of laboratory efforts. “LMGTFY” or “Let Me Google That For You” is a modern-day snide acronym reminding us of how easily indexed, distributed data can in fact be accessed. Let us leverage this so that DNA sequence uploading and access are as easy as YouTube, the excuse of “not in NCBI” becomes a retired vernacular of yesteryear, and immediate open access to data can accelerate the benefit to all humanity.

References