Completeness of citizen science biodiversity data from a volunteered geographic information perspective

Pages 3-13 | Received 05 Jan 2017, Accepted 26 Jul 2017, Published online: 24 Feb 2017

Abstract

Observations of living organisms by citizen scientists that are reported to online portals are a valuable source of information. They are also a special kind of volunteered geographic information (VGI). VGI data have issues of completeness, which arise from biases caused by the opportunistic nature of the data collection process. We examined the completeness of bird species represented in citizen science observation data from eBird and iNaturalist in US National Parks (NPs). We used approaches for completeness estimation which were developed for data from OpenStreetMap, a crowdsourced map of the world. First, we used an extrinsic approach, comparing species lists from citizen science data with National Park Service lists. Second, we examined two intrinsic approaches using total observation numbers in NPs and the development of the number of new species being added to the data-set over time. Results from the extrinsic approach provided appropriate completeness estimations to evaluate the intrinsic approaches. We found that total observation numbers are a good estimator of species completeness of citizen science data from US NPs. There is also a close relationship between species completeness and the ratio of new species added to observation data vs. observation numbers in a given year.

1. Introduction

The advent of web-based citizen science portals collecting observations of living organisms from the general public triggered the production of large amounts of a special kind of volunteered geographic information (VGI) (Goodchild 2007). Reports about sightings of plants, animals, or other organisms are generated in large numbers and over a broad geographic range. They are a potentially valuable source of information for present and future biological and ecological research (Dickinson and Bonney 2012). These data possess a geographic dimension and are therefore part of the sector of citizen science that overlaps with the field of VGI, termed “Geographic Citizen Science” by Haklay (2013).

Data about the occurrence of species that are collected by volunteers have issues of completeness. This is especially the case if the collection process is opportunistic. We consider a data collection process to be opportunistic if the decision about when, where, and what to observe, as well as the amount of effort invested, is largely or completely left to the volunteer. Completeness issues in opportunistic citizen science biodiversity data arise from a number of biases related to the process of data collection. van Strien, van Swaay, and Termaat (2013) summarized these under the terms geographical bias (uneven distribution of surveyed sites), observation bias (variances in search effort), reporting bias (preference of certain species by observers), and detection bias (varying ability of observers to detect certain species). Other studies discussed more factors, like uneven observation effort in time due to season or overall decline in volunteer effort (Bird et al. 2014), and observer quality in terms of skill, experience, and training received (Dickinson, Zuckerberg, and Bonter 2010). These biases lead to an incomplete representation of species distribution in space and time. Also, such data often record only the presence, but not the absence of species. Snäll et al. (2011) compared opportunistic presence-only observation data with data from a standardized monitoring scheme in Sweden and pointed out problems caused by reporter behavior and reporting bias. Boakes et al. (2010) demonstrated in their study that biases are also present in data from other sources, like museum collections, literature, ringing data etc. Bird et al. (2014) suggested a close collaboration of statisticians and conservation scientists to develop new statistical solutions accounting for bias when using opportunistic citizen science data.

Completeness has also been a prominent research topic for VGI data like OpenStreetMap (OSM). There are important commonalities between these data and opportunistic citizen science biodiversity data, especially in the process of data acquisition. OSM volunteers are free to choose the time, the location, the object to be mapped, and the amount of effort that they would like to invest. This common characteristic gives the contributors and their behavior and choices a strong influence on the process of data generation, and, consequently, on the characteristics of the data produced.

There are also fundamental differences between OSM and opportunistic citizen science biodiversity data in the nature and characteristics of the objects they treat. OSM, on one hand, collects cartographic representations of “physical features on the ground” (OSM Wiki, http://wiki.openstreetmap.org/wiki/Map_Features, accessed on 12 December 2016). This term is used in a rather broad sense, but always refers to objects (roads, buildings, signs, borderlines, etc.) that are at least to some degree permanent, temporally as well as spatially. A contribution to the OSM therefore contributes directly to the completeness of the OSM data-set. Biodiversity citizen science, on the other hand, deals with sightings of certain species at a certain place and time. For the most part, these species are mobile and/or ephemeral by nature (with exceptions, like trees, corals, or other perennial and sessile organisms). This has important consequences for the way the completeness of data representing these different kinds of information can be conceptualized. We find concepts of completeness of citizen science observation data comparable to those used for OSM data, if we look at information about species or places that we can derive from observation data-sets, and that is more permanent in nature than the individual observation. An individual contribution to a biodiversity citizen science project (e.g. a sighting of a bird at a certain place and time) is in itself of limited significance. It contributes to the completeness of the data-set as a whole by adding to the suitability of this data-set to derive information about a species or about the place the observation comes from. For our study, we selected the species inventory of areas as a use case. This subject is closely related to the problem of assessing the feature completeness of OSM in certain areas. 
Ballatore and Zipf (2015) used the term class completeness, that is, completeness of feature types, to conceptualize the completeness of VGI. Species completeness can be described as a class completeness problem, if species are seen as classes.

Our goal is to transfer concepts and approaches for assessing the completeness of OSM data to citizen science data from the biodiversity domain, which has, to our knowledge, not been attempted before. The question we ask is: how can we use approaches developed for feature completeness assessment of OSM data to estimate the species completeness of opportunistic citizen science data? Several different approaches to assess completeness of OSM have been suggested. Most studies rely on external reference data-sets to assess completeness extrinsically. However, there are also approaches for intrinsic assessment of completeness that do not use external reference data.

By far the greatest part of studies on OSM completeness used an extrinsic concept of spatial completeness. It can be assessed by comparing OSM to another commercial or official data-set. Several studies focused on the spatial completeness of the street network. Some studies also took Points of Interest (POIs) into account. A comprehensive review of this work can be found in Neis and Zielstra (2014). The most common method to assess the completeness of the street network was to calculate and compare the length of line features found in grid cells (e.g. Haklay 2010; Neis, Zielstra, and Zipf 2012; Zielstra and Zipf 2010). Ludwig, Voss, and Krause-Traudes (2011), Koukoletsos, Haklay, and Ellul (2012), and Fan et al. (2016) used feature-matching methods, which also allow for assessing relative completeness of attributes. Further studies investigated the completeness of land use or buildings (e.g. Dorn, Törnros, and Zipf 2015; Fan et al. 2014; Törnros et al. 2015). Most studies found pronounced differences in completeness between urban and rural areas, with the latter being less well-covered by OSM data. They also revealed that the data were developing quickly in many areas (e.g. Neis, Zielstra, and Zipf 2012). Hochmair and Zielstra (2013) analyzed the relative POI completeness.

Some authors developed intrinsic approaches for completeness estimation of OSM data that do not rely on external data sources. Ballatore and Zipf (2015) used the number of classes (feature types) instantiated in an area to identify areas with low class completeness. Another example for an intrinsic approach to VGI quality is OSMatrix (http://koenigstuhl.geog.uni-heidelberg.de/osmatrix/), an online tool designed to support visual assessment of OSM data quality (Roick, Hagenauer, and Zipf 2011). It visualizes several parameters describing OSM data in a grid. This allows users to assess data completeness relatively between the grid cells. OSMatrix uses many different parameters, including numbers of all features per grid cell, numbers of attributes per grid cell, but also aggregated numbers of certain feature types, like area in a grid cell covered by certain feature types. The system also uses contributor-related parameters to characterize the level of user activity in a grid cell, e.g. visualizing numbers of users and the number of objects modified per user. Barron, Neis, and Zipf (2014) introduced further intrinsic approaches towards assessing OSM completeness. They use the development of the OSM data-set over time to estimate completeness. Their approach is based on the observation, taken from other work (Neis, Zielstra, and Zipf 2012), that certain types of roads are usually completed sooner than others. A certain road type can therefore be considered almost completely mapped in an area when it no longer gains significantly in length, while other types of roads are still heavily worked on. By visual interpretation of diagrams of the accumulation of road length over time, the authors demonstrate the usefulness of their approach with some examples, but also discuss difficulties, for instance cases with low or temporarily varying overall contribution rates. They also stress that their approach allows only for obtaining an approximate impression of the state of completeness concerning the OSM street network in an area.

We followed the two principal approaches of extrinsic and intrinsic completeness assessment to answer our research question. We used data samples from eBird and iNaturalist, with a regional focus in the two projects’ country of origin, the USA.

Ballatore and Zipf (2015) pointed out that ground truth data that can serve as a reference for extrinsic completeness assessment of VGI data are often not available. This is also the case for species inventory data, where authoritative data are seldom available, especially for small focus regions that have not been scientifically investigated. However, for problems concerned with a comparatively small area, it is important to have information that fits the area. If intrinsic approaches can be found, this will make a valuable contribution to the problem of completeness assessment (Ballatore and Zipf 2015), also for citizen science biodiversity data and arbitrary small or medium-sized areas.

2. Methods

2.1. Extrinsic completeness assessment

Extrinsic OSM completeness assessment often uses spatial sub-units, in which OSM and reference data can be compared to one another. These spatial sub-units are usually regular grids (e.g. Haklay 2010) or other geometric sections of space (e.g. concentric zones around city centers) (Zielstra and Zipf 2010). This is possible because the external data source is considered to be spatially complete, and provides the necessary information at any place within its spatial extent in the same way. An equivalent external data-set for species inventories would have to be able to deliver a complete species list for an arbitrary geographical sub-unit. While this might actually be possible for some regions by compiling all available species distribution data, it is difficult in general due to missing distribution data with suitable geographic resolution. We therefore chose a different approach. First, we selected areas for which complete, authoritative species lists are available. Then we derived species lists from the citizen science sources used in this study for the same geographical areas. Third, we compared these two, to arrive at an extrinsic completeness assessment of the citizen science data concerning their ability to represent the biodiversity of these areas. As an index for measuring this parameter, we used the rate of species in the citizen science source matching species in the external data.
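The index described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `match_rate` and the toy species lists are hypothetical, and the denominator is taken to be the reference (external) list, which is how the match rates are reported later in the text.

```python
def match_rate(reference_species, observed_species):
    """Share of reference-list species that also occur in the observed list."""
    reference = set(reference_species)
    if not reference:
        raise ValueError("reference list is empty")
    matched = reference & set(observed_species)
    return len(matched) / len(reference)


# Hypothetical toy lists, not data from the study.
reference_list = {"Corvus corax", "Aquila chrysaetos",
                  "Sialia currucoides", "Cinclus mexicanus"}
citizen_list = {"Corvus corax", "Aquila chrysaetos",
                "Sialia currucoides", "Passer domesticus"}

rate = match_rate(reference_list, citizen_list)
print(f"{rate:.1%}")  # 3 of the 4 reference species are matched
```

Species that appear only in the citizen science list (here the fourth entry of `citizen_list`) do not affect the rate; they would have to be examined separately.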

2.2. Intrinsic completeness assessment

In our study, we built on the ideas developed for assessing OSM completeness intrinsically, in two different ways. First, we examined whether the total number of observations from an area is a suitable indicator for completeness of citizen science biodiversity data in terms of species occurring there, an approach related to using total OSM feature numbers in grid cells. Completeness estimates obtained from our extrinsic approach allowed us to assess the relationship between total numbers of observations and extrinsically estimated completeness for our study areas, thereby testing the suitability of this completeness indicator (total number of observations from an area).

In a second approach to intrinsic completeness assessment, we adopted the fundamental ideas behind the approach of Barron, Neis, and Zipf (2014) of using the temporal development of the data-set. As the number of species that occur in a certain area is finite, low numbers of new species added recently to the list may be indicative of the overall number of species in the list nearing its maximum. However, the reporting rate for the same group of species must still be relatively high in the area in question. Otherwise, low numbers of new species might be due to a lack of appropriate observations. We therefore used the ratio of new species added vs. the number of observations in a year to assess the suitability of this approach to estimate species completeness. In biology and ecology, a related approach uses so-called species accumulation curves (SACs) to assess and compare species completeness between areas, or to extrapolate true species richness in an area (e.g. Colwell, Mao, and Chang 2004). A recent study used SACs of individual eBird observers to estimate their skill (Kelling et al. 2015). SACs plot the number of observed species in an area over some measure of effort (e.g. time spent observing, or number of observations). The temporal sequence of the samples used is of no importance for that method. In our approach based on work by Barron, Neis, and Zipf (2014), we observed the temporal sequence inherent in the ongoing observation process, as they do. However, we went beyond visual interpretation of the accumulation process by examining the relationship between the ratio of new species added vs. the number of observations in a year, and the completeness estimates gained from our extrinsic approach. As before, these estimates provide the reference values of completeness for our study areas, and allowed us to evaluate the usefulness of this potential intrinsic indicator.
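The ratio of new species added vs. the number of observations in a year can be illustrated as follows. This is a minimal sketch, assuming that observations for one area are available as (year, species) pairs; the function name `new_species_ratio` and the toy data are hypothetical and not taken from the study.

```python
from collections import defaultdict


def new_species_ratio(observations):
    """Per year: (number of new species, number of observations, ratio).

    observations: iterable of (year, species) pairs for a single area.
    A species counts as new in the first year in which it is observed.
    """
    by_year = defaultdict(list)
    for year, species in observations:
        by_year[year].append(species)

    seen = set()      # species observed in any earlier year
    result = {}
    for year in sorted(by_year):
        species_this_year = set(by_year[year])
        new_species = species_this_year - seen
        seen |= species_this_year
        n_obs = len(by_year[year])
        result[year] = (len(new_species), n_obs, len(new_species) / n_obs)
    return result


# Hypothetical toy observations for one park.
toy_observations = [
    (2012, "A"), (2012, "B"), (2012, "A"),
    (2013, "A"), (2013, "C"), (2013, "B"), (2013, "C"),
    (2014, "A"), (2014, "B"), (2014, "C"), (2014, "B"), (2014, "D"), (2014, "A"),
]
for year, (n_new, n_obs, ratio) in new_species_ratio(toy_observations).items():
    print(year, n_new, n_obs, round(ratio, 3))
```

In the toy data the ratio falls from year to year even though the number of observations rises, which is the pattern the indicator is meant to capture.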

3. Study area and data used

3.1. IRMA species lists

For our extrinsic approach for data completeness assessment, we need an authoritative source of species lists for clearly defined geographic areas. We selected the US National Park Service’s (NPS) IRMA (Integrated Resource Management Applications) portal, which provides complete and up-to-date species lists for all 59 US NPs. These species lists meet two important requirements. On one hand, they refer to well-defined geographic areas, and can therefore be compared to citizen science data from exactly the same areas. On the other hand, they are based on scientific work, mostly by members of the NPS (NPSpecies User Guide), and therefore provide an authoritative source of information. We used only bird species, because one of the citizen science data-sets used in this study, eBird, provides only bird observations. Furthermore, we filtered the original IRMA lists for species that were listed as “Present” or “Probably Present” in the list’s category “Occurrence”. To eliminate possible causes for taxonomic mismatches, sub-species names were reduced to the binomial names and hybrid species names were ignored. This procedure resulted in a mean value of 216 species in a NP. Values range from 41 species in Hawaii’s Haleakala NP to 410 species in Big Bend NP.
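The filtering steps described above might look like the following sketch. It assumes that each record is a (scientific name, occurrence) pair and that hybrids are written with an "x" or "×" token; the record layout, the hybrid convention, and the names `to_binomial` and `filter_reference` are assumptions for illustration, not the actual IRMA export format.

```python
KEPT_OCCURRENCE = {"Present", "Probably Present"}
HYBRID_MARKERS = {"x", "×"}  # assumed hybrid notation


def to_binomial(scientific_name):
    """Return 'Genus species', or None for hybrids and incomplete names."""
    parts = scientific_name.split()
    if len(parts) < 2 or any(p.lower() in HYBRID_MARKERS for p in parts):
        return None
    return " ".join(parts[:2])  # drops a sub-species epithet, if present


def filter_reference(records):
    """Keep species with an accepted occurrence status, reduced to binomials."""
    kept = set()
    for name, occurrence in records:
        if occurrence not in KEPT_OCCURRENCE:
            continue
        binomial = to_binomial(name)
        if binomial is not None:
            kept.add(binomial)
    return kept


# Hypothetical toy records.
toy_records = [
    ("Junco hyemalis caniceps", "Present"),
    ("Junco hyemalis", "Present"),
    ("Anas platyrhynchos x rubripes", "Present"),
    ("Strix occidentalis", "Unconfirmed"),
    ("Falco peregrinus", "Probably Present"),
]
print(sorted(filter_reference(toy_records)))
```

Using a set means that a species and its sub-species collapse into a single entry, so they are counted once.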

3.2. eBird observation data

In this study, we used the eBird Basic Data-set version May 2015 (eBird 2015), containing all validated observations world-wide. Sullivan et al. (2014) describe the data validation process. We used binomial species names for observations with sub-species, excluding hybrids. We included observations made within the years 2002–2014, starting with the year the project went officially online. Eligible observations amount to over 1.3 million within the US NPs. NPs have an average of 22,813 eligible observations per park from the analyzed period. There is a broad range, from 171,381 observations at Everglades NP to just 25 at the National Park of American Samoa, and a median of 9209 showing that over 50% of the parks have fewer than 10,000 observations. Kobuk Valley NP, a remote NP in Alaska, did not yield any eligible eBird observations.

3.3. iNaturalist observation data

iNaturalist data can be freely downloaded by all registered users from the project’s home page (http://www.inaturalist.org). As before, we used only bird species. Sub-species names were reduced to binomial species names, and hybrid species names excluded to avoid taxonomic mismatches. Observations not determined on the species level, or lacking coordinates, were also excluded. Furthermore, only observations that reached the quality grade “Research” were used. The strategies adopted by the project to ensure data quality are described by Freitag, Meyer, and Whiteman (2016). Finally, the data we used were restricted to observations made within the years 2008–2014, beginning with the year the project started. This resulted in 121,354 eligible observations for North America. 2043 of these observations lie inside US NPs. The iNaturalist data used here therefore have much smaller overall numbers of observations than eBird. Twelve parks do not have eligible observations in the data we used. The remaining parks have a small average number of just 44 bird observations per park. Numbers of eligible observations per park range from 396 in Everglades NP to just one eligible bird observation in five parks in the analyzed 2008–2014 period, with a median of 20.

3.4. Taxonomy

Different sources of species occurrence data use different taxonomic systems. Therefore, the same species may be listed under different scientific names, leading to false mismatches. We also encountered the problem that duplicate listings of the same species under different scientific names occurred within the IRMA lists. To minimize this problem, we used a web service provided by the Catalogue of Life (COL) project (http://www.catalogueoflife.org/) designed to that purpose. The service takes a binomial species name and returns a result that either identifies the submitted name as the accepted name by COL standards (in which case we kept the submitted name), or identifies it as a synonym (in which case we replaced the submitted name by the accepted name delivered in the response). In a few cases, the submitted species name was not found in the databases used by the COL, but most of these still matched among the sources used.
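The decision logic described above can be sketched without reference to the actual web service. In the example, `lookup` is a hypothetical stand-in for the COL request and `STAND_IN_TABLE` is an illustrative table, not real service output; the real service's interface and response format are not reproduced here.

```python
def unify_names(names, lookup):
    """Replace synonyms by accepted names; keep accepted or unknown names.

    lookup(name) must return (status, accepted_name), where status is
    'accepted', 'synonym' or 'not found' (hypothetical convention).
    """
    unified = set()
    for name in names:
        status, accepted_name = lookup(name)
        if status == "synonym":
            unified.add(accepted_name)
        else:
            # Accepted names and names not found are kept as submitted.
            unified.add(name)
    return unified


# Illustrative stand-in for the service responses.
STAND_IN_TABLE = {
    "Spinus tristis": ("accepted", "Spinus tristis"),
    "Carduelis tristis": ("synonym", "Spinus tristis"),
}


def lookup(name):
    return STAND_IN_TABLE.get(name, ("not found", None))


species = {"Carduelis tristis", "Spinus tristis", "Corvus corax"}
print(sorted(unify_names(species, lookup)))
```

Because the result is a set, duplicate listings of one species under two scientific names within the same list collapse into one entry, which addresses the duplicate problem mentioned above.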

3.5. Park boundaries and spatial filtering of observations

Some factors introduce uncertainty into the process of extracting observations within park boundaries. Among these are reporting of traveling counts or area counts in eBird (with coordinates representing the center point of the distance traveled, or of the area surveyed), or, in iNaturalist, obscuring of coordinates by users or due to an endangered status of a species. However, we assessed these effects as not crucial for the resulting species lists, and decided to use all locations “as is” without corrective measures. Distances traveled, areas surveyed, or displacement distances for obscured coordinates are not large with respect to the scale of the analysis. For instance, eBird recommends traveling a maximum of five miles for one traveling count. Maximum offsets for obscured coordinates in iNaturalist data are 10–22 km, and only 2.2% of the iNaturalist observations used here were obscured. Also, the avifauna in areas directly adjoining the parks is not expected to be critically different from that of the park itself. By the nature of the analysis, reports “misplaced” within a park have no effect on the results.

4. Results and discussion

Results of the completeness analysis of eBird and iNaturalist data reflect the large differences between the two data use cases concerning numbers of eligible observations. However, they are also consistent in important ways.

4.1. Extrinsic approach

Species in eBird observation data from NPs on average matched 74.1% of the species listed in the park-specific IRMA species lists for the US NPs. Values range from a maximum of 93.8% at Grand Teton NP to a minimum of just 40.0% at The National Park of American Samoa. The latter park is a special case in several respects. It produced the lowest number of eBird observations (25) in 2002−2014. Visitor numbers available from the IRMA portal also show that it is the park with the lowest number of visitors (70,910) in the 2002–2014 period. Although the IRMA bird species list used in this analysis for The National Park of American Samoa has just 45 bird species, the very low number of eBird observations resulted in the low completeness we found for this park. The results for the iNaturalist data-set, which contains much lower observation numbers for all NPs, show an average of just 8.1% of the parks’ IRMA list of bird species being present in that project’s observation data up to 2014. Match rates range from less than one percent in several parks to a maximum of 32.8% (at Grand Teton, the most visited NP). These results (see Table 1) already indicate the close relationship between observation numbers and completeness that we will analyze in more detail in the next section.

Table 1. Results of extrinsic completeness estimation for data from eBird and iNaturalist.

Of the species listed in the IRMA species lists and not matched by citizen science observations in a park, the majority are species simply not yet observed by a citizen scientist. Only a small portion is due to taxonomic mismatches that could not be resolved by our taxonomic unification approach. Therefore, match rates are in some cases lower than they would be with perfectly matching taxonomies. The actual error in the match rate for a park depends on how many of these species were observed in the park. For example, a detailed analysis for Everglades NP eBird data (the park with the highest number of observations) shows that two of the species not matching their IRMA counterparts due to differing taxonomy were observed there, causing an error of 0.5 percentage points in the match rate. Other parks are not affected at all, like Pinnacles NP.

There is also, in most parks, a number of species observed that do not match a species in the park’s IRMA list. In some cases, the number of species observed by citizen scientists is even higher than the number of species in the IRMA lists used for our analysis. As we omitted species from IRMA lists that were not listed there as “Present” or “Probably Present”, we examined whether these additional species might be species listed as “Unconfirmed” or “Not In Park” in the original IRMA lists. In this case, citizen science data could be regarded as a potential source of information confirming the presence of these species. However, we found that, on average, most of the species observed by citizen scientists in a park, but not matched in the filtered IRMA list, are species not listed at all in the park’s original IRMA list. For 2014 eBird data, this applies to 73.1% of these additional species, on average. Only the remaining species are listed as “Unconfirmed” or “Not In Park” (with mean values of 13.7 and 11.5%, respectively). A small number of species do not have appropriate information about their presence. 2014 iNaturalist data have similar average values, with 82.7% of the species in question not listed at all by IRMA, 9.6% classified as “Unconfirmed”, and 3.9% as “Not In Park”. However, in the eBird data there are also parks where the rate of species classified as “Unconfirmed” or “Not In Park”, but observed by citizen scientists, is considerably higher. In any case, such observations may give reason to review the occurrence status of the species and parks in question.
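The breakdown of additional species by their status in the original reference list can be sketched as follows. The dictionary layout of `reference_status`, the label "Not listed", and the function name are assumptions made for this illustration; the toy data are hypothetical.

```python
from collections import Counter

ACCEPTED = {"Present", "Probably Present"}


def classify_additional_species(observed, reference_status):
    """Shares of observed-but-unmatched species per reference status.

    reference_status: {species: occurrence} for the full, unfiltered
    reference list (hypothetical layout).
    """
    counts = Counter()
    for species in observed:
        status = reference_status.get(species)
        if status in ACCEPTED:
            continue  # matched species are not "additional"
        counts[status if status is not None else "Not listed"] += 1
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()} if total else {}


# Hypothetical toy data for one park.
toy_observed = {"A", "B", "C", "D", "E"}
toy_status = {"A": "Present", "B": "Unconfirmed", "C": "Not In Park"}
shares = classify_additional_species(toy_observed, toy_status)
print(shares)
```

In the toy example, half of the four additional species fall under "Not listed", and one quarter each under "Unconfirmed" and "Not In Park".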

In our introduction, we laid out that completeness of bird species lists derived from opportunistic citizen science observation data suffers from a number of biases. Most of these biases are inherent in data gathering processes of all opportunistic citizen science projects. Differences in project designs sometimes add more factors influencing the completeness of their data. An important difference between the data gathering processes of eBird and iNaturalist is that an iNaturalist observation, to reach quality grade “Research” (only these observations were used in our study), must provide a photograph or sound recording, while eBird does not require such evidence for an observation to be accepted. For some species, it is difficult or even impossible to produce suitable evidence of this kind, especially photographs. This excludes these species from species lists derived from iNaturalist observations, and is a factor artificially reducing observed species completeness of iNaturalist data. It is also clear that potential completeness of iNaturalist data regarding bird species is necessarily lower than that of eBird data. Using also observations of lower quality grades might reduce this problem, but involves higher uncertainty concerning species identification.

4.2. Intrinsic approach

4.2.1. Total observation numbers

We used Spearman’s Rho (corrected for ties, if applicable) for examining the correlation between the rate of species in citizen science data-sets matching species in the corresponding IRMA species list, and the total number of observations in a NP up to 2014 for both eBird and iNaturalist. We found close associations between these parameters for both eBird and iNaturalist, which are illustrated in Figures 1 and 2.
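For readers unfamiliar with the statistic, the following sketch shows one common way to compute Spearman's rho with ties: tied values receive the average of their ranks, and the Pearson correlation is then taken over the ranks. This is not necessarily the exact procedure or software used in the study, it does not compute p-values, and the function names and toy values are hypothetical.

```python
from statistics import mean


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of equal values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        rank = (i + j) / 2 + 1  # mean of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Pearson correlation of the average ranks of x and y."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical toy values: observation totals and match rates per park.
toy_totals = [25, 300, 3000, 10000, 170000]
toy_rates = [0.40, 0.55, 0.69, 0.80, 0.93]
print(spearman_rho(toy_totals, toy_rates))
```

Because the statistic depends only on ranks, a steep rise that then levels out still yields a high rho as long as the relationship stays monotonic.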

Figure 1. Association of species match rate with IRMA lists and total number of observations in NPs, eBird observation data in 2002−2014 (Spearman’s rho 0.77, p < 2.2 × 10^−16), n = 58.


Figure 2. Association of species match rate with IRMA lists and total number of observations in NPs, iNaturalist observation data in 2008−2014 (Spearman’s rho 0.97, p < 2.2 × 10^−16), n = 47.


The match rates rise steeply for relatively low observation numbers, but level out quickly for higher observation numbers, as they are present in the eBird data-set. In all of the 27 NPs with more than 10,000 eBird observations until 2014, more than 70% of the species listed in the IRMA lists are represented in the observation data, while parks with fewer than 3000 eBird observations until 2014 (14 parks) all have match rates of less than 70%. In the iNaturalist data, with their much lower observation numbers until 2014, extrinsic completeness values are also much lower and do not even reach the lowest 2014 eBird value. It is interesting that early stage eBird data from 2002, with observation numbers closer to those of iNaturalist in 2014, also show a picture much more similar to iNaturalist concerning completeness distribution (see Figure 3). These eBird data have an average of 300.0 observations per park and an average match rate of 25.8%. All parks, in 2002, had fewer than 3000 observations, and a completeness, estimated extrinsically, of less than 70%, which is consistent with the findings for 2014 eBird data.

Figure 3. Association of species match rate with IRMA lists and total number of observations in NPs, eBird observation data in 2002 (Spearman’s rho 0.93, p < 2.2 × 10^−16), n = 58.


These results support the assumption that the total number of observations in a park can be considered to be indicative of the completeness of species represented in these observations. If the total number of observations is low, either because observation rates are relatively low (as in iNaturalist), or because the project is in an early stage, completeness of the species represented in the data is also relatively low. High total observation numbers are in turn associated with high completeness. However, these results also indicate that completeness goes up very quickly with rising total numbers of observations. For high total observation numbers, the parameter does indicate relatively high completeness, but does not allow for further differentiation.

An important question is whether the size of an area has an important influence on total observation numbers. For the eBird data used in this study, there is no association of park size and total observation numbers (2014 data: Spearman’s rho 0.06, p = 0.32; 2002 data: Spearman’s rho 0.13, p ≈ 0.16). The situation is different for iNaturalist data: here, we found a correlation between park size and total observation numbers in 2014 data (Spearman’s rho 0.37, p ≈ 0.005). It is well known that the distribution of citizen science observations of organisms varies in space (van Strien, van Swaay, and Termaat 2013). We therefore argue that the influence of the size of an area depends on the individual situation of the area in question. Small sections from regions with a high density of observations will yield high total numbers of observations, indicating high completeness of species represented in these observations. Enlarging the area will lead to even higher total observation numbers, but these do not necessarily indicate even higher completeness, as we have seen. In low-density regions, a small section will contain only a low number of observations, indicating low completeness. Enlarging the area will raise the number of observations, with a pronounced effect on completeness, on a low level. Further research is needed to shed more light on the relationship between these parameters. Concerning completeness estimation, other factors are also important. Citizen science observations tend to be associated with roads and other infrastructure (Bird et al. 2014) making certain areas more accessible than others. Inaccessible areas, and their habitats and associated species, are necessarily underrepresented in the resulting observation data. This problem can be expected to be more pronounced for species that are less mobile than birds.

4.2.2. New species added with recent observations

Figure 4 shows the development of yearly mean cumulative numbers of species observed in US NPs for eBird data. The growth of this number declines over time. This means that the mean number of new species added to the observation data-set is also declining over time, which can be seen in Figure 5. The mean number of new species added in a year decreased from over 60 new species per park in the first project year to a little less than six in 2014.
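As a rough illustration of how the yearly number of new species and the cumulative species count can be derived for a single park, consider the following sketch. The record layout (pairs of year and species name) and the example observations are assumptions made for illustration only and do not reflect the actual structure of the eBird or iNaturalist data.

```python
from collections import defaultdict


def new_species_per_year(records):
    """For one park, map each year to (new species, cumulative species).

    `records` is an iterable of (year, species) pairs, one per observation.
    A species counts as new in the first year in which it is reported.
    """
    species_by_year = defaultdict(set)
    for year, species in records:
        species_by_year[year].add(species)

    seen = set()
    result = {}
    for year in sorted(species_by_year):
        new = species_by_year[year] - seen  # species not reported in earlier years
        seen |= new
        result[year] = (len(new), len(seen))
    return result


# Hypothetical observations from one park.
records = [
    (2012, "Bald Eagle"), (2012, "American Robin"), (2012, "American Robin"),
    (2013, "American Robin"), (2013, "Common Raven"),
    (2014, "Bald Eagle"),
]
trend = new_species_per_year(records)
```

Averaging these per-park values over all parks for each year would give curves of the kind described here: a flattening cumulative count and a declining number of new species.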

Figure 4. Development of the yearly mean total number of species in eBird observation data from the US NPs over time (red squares), and yearly mean numbers of eBird observations per NP (blue triangles).


Figure 5. Development of the yearly mean numbers of new species per NP in eBird observation data.


A low number of new species in a park in recent observations may be indicative of a relatively high completeness of the species already observed. However, this only holds if recent observation numbers are high. Otherwise, low recent numbers of new species added may simply be due to the fact that there are only a few recent observations. On average, eBird observations from NPs comply with these assumptions: mean numbers of new species added were recently relatively low, and mean report numbers per park were recently relatively high (Figure 4). This matches a relatively high average completeness of species already observed by eBirders in NPs of 74.1% up to 2014, if we take our extrinsic completeness estimation as a reference.

We used the ratio of new species vs. the number of observations from a year to assess the association with species completeness for all parks. If a park’s observation numbers are on a higher level than in other parks, a given number of new species observed in a year should indicate a higher completeness than in a park with a lower level of observation numbers and the same number of new species observed. For 2014 eBird data, there is indeed a close association between this ratio and our extrinsic completeness estimates (see Figure 6): a low ratio is associated with a high completeness. This shows that the number of new species observed in a park in a year does hold some indicative power for the completeness of the species already observed in a park’s eBird data in that year, if we take the level of observation numbers into account. We also analyzed eBird data in earlier stages and found similar associations.
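The ratio itself is simple to compute. The following sketch assumes that, for each park, the number of new species and the number of observations in the year of interest are already known; the park names and counts are hypothetical. Parks without observations in that year are excluded, since the ratio is undefined for them.

```python
def new_species_ratio(new_species, observations):
    """Ratio of new species to observations in a year, or None if undefined."""
    if observations == 0:
        return None  # no observations in that year: the ratio cannot be calculated
    return new_species / observations


# Hypothetical (new species, observations) per park for one year.
parks = {
    "Park A": (3, 1500),  # many observations, few new species: low ratio
    "Park B": (3, 60),    # same number of new species, far fewer observations
    "Park C": (0, 0),     # no observations in that year
    "Park D": (4, 4),     # every observation contributed a new species: ratio 1
}
ratios = {name: new_species_ratio(n, obs) for name, (n, obs) in parks.items()}
usable = {name: r for name, r in ratios.items() if r is not None}
```

Parks A and B report the same number of new species, but the lower ratio of Park A would point to a higher completeness under the reasoning given above.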

Figure 6. Association of species match rate with IRMA lists and ratio of the number of new species vs. number of eBird observations per park in 2014 (Spearman’s rho −0.61, p ≈ 3.588 × 10⁻⁷). n = 56. Three parks did not have observations in 2014, therefore the ratio could not be calculated for these parks.


Mean values for all parks from iNaturalist data used in this study do not show any of the trends found for eBird data, except a growing number of mean observations per NP (see Figure 7). Mean numbers of new species have mostly been going up instead of down since 2010 (Figure 8), with a pronounced leap after 2011, the year the project was institutionalized as iNaturalist, LLC (http://www.inaturalist.org/pages/about, accessed 08 November 2015).

Figure 7. Development of the yearly mean total number of species in iNaturalist data from the US NPs over time (red squares), and yearly mean numbers of iNaturalist observations per park (blue triangles).


Figure 8. Development of the yearly mean numbers of new species per NP in iNaturalist data.


However, despite their different characteristics as compared to eBird data, iNaturalist data exhibit a similar association between the ratio of new species vs. the number of observations and extrinsically estimated completeness (see Figure 9). This confirms our finding that the number of new species observed in a year is indicative of the completeness of species already observed, if the level of observation numbers is taken into account. This also holds for the much lower level of observation numbers in iNaturalist as compared to eBird.

Figure 9. Association of species match rate with IRMA lists and ratio of the number of new species vs. number of iNaturalist observations per park in 2014 (Spearman’s rho −0.64, p ≈ 1.809 × 10⁻⁵). n = 35. Twenty-four parks did not have observations in 2014, therefore the ratio could not be calculated for these parks. In nine parks, all 2014 observations contributed a new species each (ratio = 1).


5. Conclusions and future work

Our research question is: how can we use approaches developed for feature completeness assessment of OSM data to estimate the species completeness of opportunistic citizen science data? We obtained several answers to this question.

First, if a suitable extrinsic source of information about species present in a certain area is available, we can estimate the completeness of species represented in a citizen science observation data-set by comparing lists of species derived from these two sources. This approach is closely related to completeness estimation of OSM features by comparing them to an extrinsic data-set. Problems arising from different taxonomies can be reduced using available tools. Second, we found a close association between the total number of observations in an area and species completeness estimated extrinsically. We therefore argue that the total number of observations in an area can serve as an indicator for species completeness of citizen science observation data from that area. This approach is related to the use of absolute numbers of OSM features in different areas for indicating spatially variable feature completeness in OSM, used in the OSMatrix framework (Roick, Hagenauer, and Zipf 2011). On the one hand, total observation numbers can indicate differences in the completeness of species observed between different areas with higher or lower total observation numbers from the same data-set. On the other hand, they indicate completeness differences between different data-sets with different total observation numbers in the same area. Third, the relation between the number of new species added in a year, and the number of observations producing this effect, is also closely associated with species completeness. Both intrinsic approaches examined in this work are relatively simple. However, we think that they would be suitable for use in frameworks that visualize the relative quality of biodiversity citizen science data across different areas, as OSMatrix does for OSM data, or that show differences in quality between data of different projects for the same area. Such frameworks could be used to direct volunteer effort to areas with low completeness.
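The extrinsic list comparison can be sketched as follows. This is a minimal illustration that assumes completeness is taken as the share of reference-list species that also appear in the observation data, and that species names from both sources have already been harmonized to a common taxonomy; the function name and species lists are hypothetical.

```python
def species_match_rate(observed, reference):
    """Share of reference-list species that also occur in the observation data."""
    reference = set(reference)
    if not reference:
        raise ValueError("reference list is empty")
    return len(reference & set(observed)) / len(reference)


# Hypothetical reference list for a park and species found in observation data.
reference_list = ["Bald Eagle", "American Robin", "Common Raven", "Barn Owl"]
observed_species = ["American Robin", "Bald Eagle", "Common Raven", "House Sparrow"]
rate = species_match_rate(observed_species, reference_list)
```

Species that occur only in the observation data, such as the fourth observed species here, do not raise the rate in this sketch.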

Future work should examine how far the results of this study can be reproduced with citizen science observation data-sets obtained from other projects with different properties, or with the same data-sets used in this study, but for other regions with different observation rates. Another important question is how our approaches would perform if applied to other species groups with different characteristics concerning mobility, detectability, and others. However, for any extrinsic completeness estimation as conducted in this study, there is always the difficulty of finding a suitable extrinsic source of reference data, especially for small focus regions. Moving away from species inventories of areas, more completeness dimensions come to the fore. These include the spatial completeness of the area of distribution of a species represented in a citizen science observation data-set, or the temporal completeness of the presence of a migratory species at a place.

Notes on contributors

Clemens Jacobs has been a member of the GIScience research group at Heidelberg University, Department of Geography, since 2012. Before that, he worked in environmental planning at Spang.Fischer.Natzschka GmbH (Walldorf) and in geographic web application development and QA at Leiner & Wolff GmbH (Heidelberg). Currently, he is a PhD candidate (supervisor: Prof. Dr. Alexander Zipf). His research interest is in quality assessment of biodiversity observation data from citizen science projects. He has a background in Geography from Heidelberg University.

Alexander Zipf (PhD) has been a professor and the chair of GIScience (Geoinformatics) at Heidelberg University (Department of Geography) since late 2009. He is a member of the Centre for Scientific Computing (IWR), the Heidelberg Center for Cultural Heritage and PI at the Heidelberg Graduate School MathComp. He is also a founding member of the Heidelberg Center for the Environment (HCE) and is currently establishing the “Heidelberg Institute for Geoinformation Technology” (HeiGIT), core funded by the Klaus Tschira Stiftung. From 2012 to 2014, he was the managing director of the Department of Geography, Heidelberg University. In 2011−2012, he acted as the vice dean of the Faculty for Chemistry and Geosciences, Heidelberg University. Since 2012, he has been the speaker of the graduate school “Crowd Analyser − Spatio-temporal Analysis of User-generated Content.” He is also a member of the editorial board of several journals and has organized a number of conferences and workshops. In 2012−2015, he was the regional editor of the ISI journal “Transactions in GIS” (Wiley). Before coming to Heidelberg, he led the chair of Cartography at Bonn University and earlier was a professor for Applied Computer Science and Geoinformatics at the University of Applied Sciences in Mainz, Germany. He has a background in Mathematics and Geography from Heidelberg University and finished his PhD at the European Media Laboratory EML in Heidelberg, where he was the first PhD student. There he also conducted further research as a postdoc for 3 years.

Acknowledgments

We would like to thank all participants in the projects eBird and iNaturalist whose efforts in contributing and validating observations made our research possible.

References

  • Ballatore, A., and A. Zipf. 2015. “A Conceptual Quality Framework for Volunteered Geographic Information.” In Lecture Notes in Computer Science, Volume 9368: Spatial Information Theory. 2nd International Conference, COSIT 2015, October 12−16, 2015, Proceedings, edited by S. I. Fabrikant, M. Raubal, M. Bertolotto, C. Davies, S. Freundschuh, and S. Bell, 89–107. Santa Fe, NM, Springer.
  • Barron, C., P. Neis, and A. Zipf. 2014. “A Comprehensive Framework for Intrinsic OpenStreetMap Quality Analysis.” Transactions in GIS 18 (6): 877–895. doi:10.1111/tgis.12073.
  • Bird, T. J., A. E. Bates, J. S. Lefcheck, N. A. Hill, R. J. Thomson, G. J. Edgar, R. D. Stuart-Smith, et al. 2014. “Statistical Solutions for Error and Bias in Global Citizen Science Datasets.” Biological Conservation 173: 144–154. doi:10.1016/j.biocon.2013.07.037.
  • Boakes, E. H., P. J. K. McGowan, R. A. Fuller, C. Ding, N. E. Clark, K. O’Connor, and G. M. Mace. 2010. “Distorted Views of Biodiversity: Spatial and Temporal Bias in Species Occurrence Data.” PLoS Biology 8 (6): e1000385. doi:10.1371/journal.pbio.1000385.
  • Colwell, R. K., C. X. Mao, and J. Chang. 2004. “Interpolating, Extrapolating, and Comparing Incidence-based Species Accumulation Curves.” Ecology 85 (10): 2717–2727. doi:10.1890/03-0557.
  • Dickinson, J. L., and R. Bonney, eds. 2012. Citizen Science: Public Participation in Environmental Research. Ithaca, NY: Comstock.
  • Dickinson, J. L., B. Zuckerberg, and D. N. Bonter. 2010. “Citizen Science as an Ecological Research Tool: Challenges and Benefits.” Annual Review of Ecology, Evolution, and Systematics 41 (1): 149–172. doi:10.1146/annurev-ecolsys-102209-144636.
  • Dorn, H., T. Törnros, and A. Zipf. 2015. “Quality Evaluation of VGI Using Authoritative Data − A Comparison with Land Use Data in Southern Germany.” ISPRS International Journal of Geo-Information 4 (3): 1657–1671. doi:10.3390/ijgi4031657.
  • eBird Basic Dataset. 2015. Version: EBD_RelMay-2015. Ithaca, NY: Cornell Lab of Ornithology. May.
  • Fan, H., B. Yang, A. Zipf, and A. Rousell. 2016. “A Polygon-based Approach for Matching OpenStreetMap Road Networks with Regional Transit Authority Data.” International Journal of Geographical Information Science 30 (4): 748–764. doi:10.1080/13658816.2015.1100732.
  • Fan, H., A. Zipf, Q. Fu, and P. Neis. 2014. “Quality Assessment for Building Footprints Data on OpenStreetMap.” International Journal of Geographical Information Science 28 (4): 700–719. doi:10.1080/13658816.2013.867495.
  • Freitag, A., R. Meyer, and L. Whiteman. 2016. “Strategies Employed by Citizen Science Programs to Increase the Credibility of Their Data.” Citizen Science: Theory and Practice 1 (1): 1–11. doi:10.5334/cstp.6.
  • Goodchild, M. F. 2007. “Citizens as Sensors: The World of Volunteered Geography.” GeoJournal 69 (4): 211–221. doi:10.1007/s10708-007-9111-y.
  • Haklay, M. 2010. “How Good is OpenStreetMap Information? A Comparative Study of OpenStreetMap and Ordnance Survey Datasets for London and the Rest of England.” Environment and Planning B: Planning and Design 37 (4): 682–703. doi:10.1068/b35097.
  • Haklay, M. 2013. “Citizen Science and Volunteered Geographic Information: Overview and Typology of Participation.” In Crowdsourcing Geographic Knowledge, edited by D. Sui, S. Elwood, and M. Goodchild, 105–122. New York: Springer. doi:10.1007/978-94-007-4587-2.
  • Hochmair, H., and D. Zielstra. 2013. “Development and Completeness of Points of Interest in Free and Proprietary Data Sets: A Florida Case Study.” Paper presented at the GI_Forum 2013: Creating the GISociety, Salzburg, July 3–5.
  • Kelling, S., A. Johnston, W. M. Hochachka, M. Iliff, D. Fink, J. Gerbracht, C. Lagoze, et al. 2015. “Can Observation Skills of Citizen Scientists Be Estimated Using Species Accumulation Curves?” PLoS ONE 10 (10): e0139600. doi:10.1371/journal.pone.0139600.
  • Koukoletsos, T., M. Haklay, and C. Ellul. 2012. “Assessing Data Completeness of VGI through an Automated Matching Procedure for Linear Data.” Transactions in GIS 16 (4): 477–498. doi:10.1111/j.1467-9671.2012.01304.x.
  • Ludwig, I., A. Voss, and M. Krause-Traudes. 2011. “A Comparison of the Street Networks of Navteq and OSM in Germany.” In Advancing Geoinformation Science for a Changing World, edited by S. Geertman, W. Reinhardt, and F. Toppen, 65–84. Berlin: Springer. doi:10.1007/978-3-642-19789-5.
  • Neis, P., and D. Zielstra. 2014. “Recent Developments and Future Trends in Volunteered Geographic Information Research: The Case of OpenStreetMap.” Future Internet 6 (1): 76–106. doi:10.3390/fi6010076.
  • Neis, P., D. Zielstra, and A. Zipf. 2012. “The Street Network Evolution of Crowdsourced Maps: OpenStreetMap in Germany 2007−2011.” Future Internet 4 (1): 1–21. doi:10.3390/fi4010001.
  • Roick, O., J. Hagenauer, and A. Zipf. 2011. “OSMatrix – Grid-based Analysis and Visualization of OpenStreetMap.” Paper presented at the State of the Map Conference, Denver, CO, September 9−11.
  • Snäll, T., O. Kindvall, J. Nilsson, and T. Pärt. 2011. “Evaluating Citizen-based Presence Data for Bird Monitoring.” Biological Conservation 144 (2): 804–810. doi:10.1016/j.biocon.2010.11.010.
  • van Strien, A. J., C. A. M. van Swaay, and T. Termaat. 2013. “Opportunistic Citizen Science Data of Animal Species Produce Reliable Estimates of Distribution Trends If Analysed with Occupancy Models.” Journal of Applied Ecology 50 (6): 1450–1458. doi:10.1111/1365-2664.12158.
  • Sullivan, B. L., J. L. Aycrigg, L. H. Barry, R. E. Bonney, N. Bruns, C. B. Cooper, T. Damoulas, et al. 2014. “The eBird Enterprise: An Integrated Approach to Development and Application of Citizen Science.” Biological Conservation 169: 31–40. doi:10.1016/j.biocon.2013.11.003.
  • Törnros, T., H. Dorn, S. Hahmann, and A. Zipf. 2015. “Uncertainties of Completeness Measures in OpenStreetMap – A Case Study for Buildings in a Medium-sized German City.” ISPRS Annals of Photogrammetry, Remote Sensing and Spatial Information Sciences II-3/W5: 353–357. doi:10.5194/isprsannals-II-3-W5-353-2015.
  • Zielstra, D., and A. Zipf. 2010. “A Comparative Study of Proprietary Geodata and Volunteered Geographic Information for Germany.” Paper presented at the 13th AGILE Conference, Guimaraes, Portugal, May 11−14.