
VGI and crowdsourced data credibility analysis using spam email detection techniques

Pages 520-532 | Received 12 Jan 2017, Accepted 08 Jun 2017, Published online: 21 Jun 2017

ABSTRACT

Volunteered geographic information (VGI) can be considered a subset of crowdsourced data (CSD) and its popularity has recently increased in a number of application areas. Disaster management is one of its key application areas in which the benefits of VGI and CSD are potentially very high. However, quality issues such as credibility, reliability and relevance are limiting many of the advantages of utilising CSD. Credibility issues arise as CSD come from a variety of heterogeneous sources including both professionals and untrained citizens. VGI and CSD are also highly unstructured and the quality and metadata are often undocumented. In the 2011 Australian floods, the general public and disaster management administrators used the Ushahidi Crowd-mapping platform to extensively communicate flood-related information including hazards, evacuations, emergency services, road closures and property damage. This study assessed the credibility of the Australian Broadcasting Corporation’s Ushahidi CrowdMap dataset using a Naïve Bayesian network approach based on models commonly used in spam email detection systems. The results of the study reveal that the spam email detection approach is potentially useful for CSD credibility detection with an accuracy of over 90% using a forced classification methodology.

1. Introduction

Volunteered geographic information (VGI) (Goodchild 2007), with its geographic context, is considered a subset of crowdsourced data (CSD) (Howe 2006; Goodchild and Glennon 2010; Heipke 2010; Koswatte, McDougall, and Liu 2016). In recent times, there has been increased interest in the use of CSD for both research and commercial applications. VGI production and use have also become simpler than ever before thanks to technological developments in mobile communication, positioning technologies, smartphone applications and other infrastructure that supports easy-to-use mobile applications. However, data quality issues such as credibility, relevance, reliability, data structures, incomplete location information, missing metadata and validity continue to limit its usage and potential benefits (Flanagin and Metzger 2008; De Longueville, Ostlander, and Keskitalo 2010; Koswatte, McDougall, and Liu 2016). Therefore, researchers are now seeking new approaches for improving and managing the quality of VGI and CSD in order to increase the utilisation of these data.

VGI quality can be described in terms of quality measures and quality indicators (Antoniou and Skopeliti 2015). Quality measures of spatial data have largely focused on quantitative measures such as completeness, logical consistency, positional accuracy, temporal accuracy and thematic accuracy, whilst quality indicators are often more difficult to measure and refer to areas such as purpose, usage, trustworthiness, content quality, credibility and relevance (Senaratne et al. 2017). However, in CSD it may not always be appropriate to trust the information provided by the volunteers, as their experience and expertise vary dramatically and assessing the credibility of the provider may be impractical. In particular, the volunteers in a disaster situation are often extremely heterogeneous and their input only occurs during a short period. Hence, it is difficult to profile these contributors, unlike many users of Twitter who may have a long history of activity. Therefore, a key challenge is to assess the credibility of the provided data in order to utilise it for future decision-making.

A popular approach to assessing credibility in spam email detection is to numerically estimate the 'degree of belief' (Robinson 2003) by analysing the email content using natural language processing and machine learning techniques. Natural language processing is a commonly used term describing the use of computing techniques to analyse and understand natural language and speech. These approaches have been successfully applied to the detection of spam in Twitter messages (Wang 2010). The objective of this research is to investigate and test the use of spam email detection processes for credibility detection of crowdsourced disaster data.

The data for this research were collected through the Ushahidi CrowdMap platform (https://www.ushahidi.com), which has been successfully used in a range of disasters including the 2011 Australian floods, the Christchurch earthquake and the 2011 tsunami in Japan. The Ushahidi platform was initially developed to easily capture crowd input via cell phones or emails (Bahree 2008; Longueville et al. 2010) and was utilised to report election violence in Kenya. Over time, its popularity has increased and the platform has been successfully deployed in a number of disasters around the world.

This paper discusses the use of a Naïve Bayesian network (BN)-based model to detect the credibility of CSD using an approach similar to spam email detection. The paper is structured as follows: Section 2 discusses the background of CSD credibility detection and the use of Naïve BNs for spam email detection. Section 3 explains the methods used in the study. Section 4 details the results of the study and discusses their implications. Finally, Section 5 provides some concluding remarks and suggestions for future research.

2. CSD credibility

Hovland, Janis, and Kelley (1953) defined credibility as 'the believability of a source or message', which comprises two primary dimensions, trustworthiness and expertise. However, as identified by Flanagin and Metzger (2008), the dimensions of trust and expertise can also be considered as subjectively perceived, as the study of credibility is highly interdisciplinary and the definition of credibility varies according to the field of study. Whilst the scientific community views credibility as an objective property of information quality, communication and social psychology researchers treat credibility more as a perceptual variable (Fogg and Tseng 1999; Flanagin and Metzger 2008). According to Fogg and Tseng (1999), credibility is defined as 'a perceived quality made up of multiple dimensions such as trustworthiness and expertise', or simply as believability.

Credibility analysis approaches and methods vary depending on the context. Studies conducted by Bishr and Kuhn (2007), Noy, Griffith, and Musen (2008), Janowicz et al. (2010), Sadeghi-Niaraki et al. (2010), and Shvaiko and Euzenat (2013) have identified the importance and usefulness of spatial semantics and ontologies in assessing the quality of CSD. Most approaches tackle CSD quality by qualifying contributors and contributions (Brando and Bucher 2010). Various authors have investigated the classification of users based on their purpose (Coleman, Georgiadou, and Labonte 2009), their geographic location (Goodchild 2009) and trust as a reputational model (Bishr and Kuhn 2007). Quality based on contributions has mostly been validated using rating systems (Brando and Bucher 2010; Elwood 2008) or a reference dataset (Haklay 2010; Goodchild and Li 2012). Longueville et al. (2010) proposed an approach consisting of a workflow that used prior information about the phenomenon. The key to their approach was to extract valid information from CSD using cross validation, cluster processing and ranking. A similar but extended approach for the automated assessment of the quality of CSD was proposed by Ostermann and Spinsanti (2011).

Given the variability of contributors of CSD during a disaster event, and the complexities in qualifying the expertise or experience of contributors, it was decided that a content analysis approach would provide the greatest likelihood of success for this research.

2.1. Statistical approaches for CSD credibility detection in disaster management

Disaster-related CSD is quite different in terms of its lifetime and contributors. Data are often collected over a very short period of time from many different contributors during the event. Recent research conducted by Hung, Kalantari, and Rajabifard (2016) identified the possibility of using statistical methods to assess the credibility of VGI. They used the 2011 Australian flood VGI dataset as the training data and the 2013 Brisbane floods data as the testing dataset. Their binary logistic regression modelling approach achieved an overall accuracy of 90.5% for the training model and 80.4% for the testing dataset. They highlighted the potential of statistical approaches for efficiently analysing CSD credibility and supporting rapid decision-making in the disaster management sector, even without real-time or near real-time information.

Kim (2013) developed a framework to assess the credibility of a VGI dataset from the 2010 Haiti earthquake based on a BN model. The outcomes of this earthquake damage assessment study were compared with the results from official sources. The author reported that 'the experiments have not only demonstrated microscopic effects on the individual data, but also showed the macroscopic variations of the overall damage patterns by the credibility model'. Both of these models were identified as being more suitable for post-disaster management purposes.

In filter-based classification processes, it is important to simplify the message content using transformations including tokenising, stemming and lemmatising (Figure 1), which may improve the classification accuracy and performance (Guzella and Caminhas 2009). This research followed a similar approach by incorporating natural language processing techniques and enhancing a 'bag of words' model with tokenising (extracting words), stemming (removing derivational affixes), lemmatising (removing inflectional endings and returning the base or dictionary form of the word) and removing stop-words (common words in English).

Figure 1. Main steps involved in filter-based email classification (Guzella and Caminhas 2009).


Credibility can be calculated and rated into different levels, which may be useful for disaster management staff. However, in critical events such as disasters, a binary form of credibility representation is simpler and less confusing for the general public (Ostermann and Spinsanti 2011). This research adopted a similar binary approach by classifying credibility using a 'credible/unreliable' labelling. The term 'unreliable' is used to describe those messages or reports that were not classified as 'credible'.

2.2. Why use spam email detection as an approach for CSD credibility detection?

Spam email is, in its shortest definition, 'unsolicited bulk email' (Blanzieri and Bryl 2008). Spam emails cost industries billions of dollars annually through the misuse of computing resources and the additional time required by users to sort emails. Spam emails can often carry computer viruses and also violate users' privacy (Blanzieri and Bryl 2008). Compared to spam emails, CSD exhibits both similarities and differences. Firstly, CSD also contains a mixture of content that varies in credibility, and CSD events often generate large volumes of data. Emails, including spam emails, often have a specified structure (sender, body text and header), whereas CSD often lacks structure. Finally, the aim of filtering the data to identify legitimate or credible content is similar in both cases.

Spam email detection (Pantel and Lin 1998; Cranor and LaMacchia 1998; Metsis, Androutsopoulos, and Paliouras 2006; Robinson 2003; Lopes et al. 2011), junk-email detection (Sahami et al. 1998) or anti-spam filtering (Androutsopoulos et al. 2000; Schneider 2003) research has a long history that grew from the commercialisation of the internet in the mid-1990s (Cranor and LaMacchia 1998). Researchers have explored various approaches, with content-based filters or Bayesian filters being the most popular anti-spam systems (Lopes et al. 2011). Wang (2010) tested a Bayesian classifier for spam detection in Twitter and confirmed that Bayesian classifiers performed highly in terms of weighted recall and precision, outperforming decision tree, neural network, support vector machine and k-nearest neighbour classifiers.

Castillo, Mendoza, and Poblete (2011) analysed the newsworthiness of tweets using a supervised classifier, whilst Kang, O'Donovan, and Höllerer (2012) analysed 'credible individual tweets or users' based on three models (social model, content model and hybrid model) using Bayesian and other classifiers. These studies support the use of a modified Bayesian approach for assessing the credibility of crowdsourced data.

2.3. A naïve BN-based model for CSD credibility detection

BNs were initially identified as powerful tools for knowledge representation and inference. With the advent of Naïve BNs, which are simple BNs that assume all attributes are independent, the classification power of BNs was expanded (Cheng and Greiner 1999). The CSD credibility detection engine proposed in this research was developed using a Naïve Bayesian-based spam detection model.

A credibility detection function can be defined as:

$$f(m, \theta) = \begin{cases} t_{\mathrm{credible}}, & \text{if } P(\mathrm{credible} \mid m) > T \\ t_{\mathrm{unreliable}}, & \text{otherwise} \end{cases}$$

where $m$ is a message to be classified, $\theta$ is a vector of parameters, and $t_{\mathrm{credible}}$ and $t_{\mathrm{unreliable}}$ are the tags assigned to messages based on the threshold $T$.

The vector of parameters $\theta$ is the result of training the classifier on a pre-collected dataset:

$$\theta = \Theta\left(\{(m_1, y_1), \ldots, (m_n, y_n)\}\right)$$

where $m_1, \ldots, m_n$ are previously collected messages, $y_1, \ldots, y_n$ are the corresponding labels, and $\Theta$ is the training function.

As Guzella and Caminhas (2009) defined, if a given message is represented by the feature vector $\vec{x}$ and belongs to class $c \in \{c_{\mathrm{spam}}, c_{\mathrm{legitimate}}\}$, the probability that a message represented by $\vec{x}$ is classified as $c$ can be written as:

$$P(c \mid \vec{x}) = \frac{P(c)\,P(\vec{x} \mid c)}{P(c_{\mathrm{spam}})\,P(\vec{x} \mid c_{\mathrm{spam}}) + P(c_{\mathrm{legitimate}})\,P(\vec{x} \mid c_{\mathrm{legitimate}})} \qquad (1)$$

where $P(c)$ is the overall probability that any given message is classified as $c$, $P(\vec{x} \mid c)$ is the a priori probability of a random message represented by $\vec{x}$ given class $c$, $P(\vec{x} \mid c_{\mathrm{spam}})$ and $P(\vec{x} \mid c_{\mathrm{legitimate}})$ are the probabilities that a message represented by $\vec{x}$ is classified as spam or legitimate, respectively, and $P(c_{\mathrm{spam}})$ and $P(c_{\mathrm{legitimate}})$ are the overall probabilities that any given message is classified as spam or legitimate, respectively.

The naïve classifier assumes that all features in $\vec{x} = (x_1, x_2, \ldots, x_N)$ are conditionally independent of every other feature, so the probability can be defined, considering the $N$ terms representing a message, as:

$$P(\vec{x} \mid c) = \prod_{i=1}^{N} P(x_i \mid c)$$

So, Equation (1) becomes

$$P(c \mid \vec{x}) = \frac{P(c)\prod_{i=1}^{N} P(x_i \mid c)}{P(c_{\mathrm{spam}})\prod_{i=1}^{N} P(x_i \mid c_{\mathrm{spam}}) + P(c_{\mathrm{legitimate}})\prod_{i=1}^{N} P(x_i \mid c_{\mathrm{legitimate}})}$$

with $P(x_i \mid c)$ given by

$$P(x_i \mid c) = f(t_i, c)$$

where the function $f$ depends on the representation of the message. The probability is determined based on the occurrence of term $t_i$ in the training dataset $D_{tr}$.
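
To make this formulation concrete, the following minimal sketch shows how the posterior probability of a message being credible could be computed from per-term conditional probabilities and class priors. It is written in Python purely for illustration (the authors' system was implemented in Java), and all function and variable names here are hypothetical.

from functools import reduce

def posterior_credible(terms, p_term_given_credible, p_term_given_unreliable,
                       p_credible=0.5, p_unreliable=0.5):
    """Naive Bayes posterior P(credible | message) for a tokenised message.

    terms: list of pre-processed terms in the message
    p_term_given_credible / p_term_given_unreliable: dicts mapping a term
        to its estimated conditional probability for each class
    p_credible / p_unreliable: class priors
    """
    # Product of per-term conditional probabilities (naive independence assumption).
    # Unseen terms default to a small probability to avoid zeroing the product.
    eps = 1e-6
    likelihood_credible = reduce(
        lambda acc, t: acc * p_term_given_credible.get(t, eps), terms, 1.0)
    likelihood_unreliable = reduce(
        lambda acc, t: acc * p_term_given_unreliable.get(t, eps), terms, 1.0)

    numerator = p_credible * likelihood_credible
    denominator = numerator + p_unreliable * likelihood_unreliable
    return numerator / denominator if denominator > 0 else 0.0

A message would then be tagged $t_{\mathrm{credible}}$ when this posterior exceeds the threshold $T$, and $t_{\mathrm{unreliable}}$ otherwise.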

3. Methods

During the 2011 Australian floods, the Australian Broadcasting Corporation (ABC) (www.abc.net.au) developed a customised version of the Ushahidi CrowdMap to report/map disaster communications (Koswatte, McDougall, and Liu 2016). These data consisted primarily of text-based content submitted by volunteers during the flood event. The data included input from a heterogeneous range of volunteers who submitted reports during a relatively short period of time (approximately 7 days) via various channels including a mobile app, a website, SMS messages, emails, phone calls and Twitter.

3.1. CSD credibility detecting algorithm based on spam email detection approach

An algorithm for CSD credibility detection based on the Naïve BN was developed for the analysis. The Java programming language was used for coding the system within the NetBeans Integrated Development Environment (https://netbeans.org/). The pseudo code of the algorithm consists of two phases, training and classification, and is listed below.

Phase 1: Start training

Select Classifier and Training Dataset

   for each Message mi in Training Dataset Dtr do

      for each Word in the Corpus do

      Calculate the Credible and Unreliable word Probabilities and store in Hash Table

      end for

   end for

End training

Phase 2: Start classification

Select Classifier, Testing Dataset and Hash Table

   for each Message mi in the Testing Dataset Dts do

      for each Word in the Corpus do

      Calculate the Word Probability for being Credible and Unreliable

      Update Hash Table

      end for

      Calculate combined Probability for the Message

      if combined Probability > Threshold

      Label Message as Credible

      else

      Label Message as Unreliable

      end if

   end for

End classification

The probability threshold was determined after the initial testing and was set at the 0.9 probability level.
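
A compact sketch of the two phases above, again in Python rather than the authors' Java implementation, is given below. The word-probability 'hash table' is represented as a dictionary, the 0.9 threshold mirrors the value reported above, and the smoothing constant and helper names are assumptions made for illustration.

from collections import Counter

THRESHOLD = 0.9  # probability level reported in the paper
EPS = 1e-6       # assumed smoothing value for unseen words

def train(training_messages):
    """Phase 1: build the word-probability 'hash table' from labelled messages.

    training_messages: list of (tokens, label) pairs, label in {'credible', 'unreliable'}.
    """
    counts = {'credible': Counter(), 'unreliable': Counter()}
    totals = {'credible': 0, 'unreliable': 0}
    for tokens, label in training_messages:
        counts[label].update(tokens)
        totals[label] += len(tokens)
    hash_table = {}
    for word in set(counts['credible']) | set(counts['unreliable']):
        hash_table[word] = {
            'credible': counts['credible'][word] / max(totals['credible'], 1),
            'unreliable': counts['unreliable'][word] / max(totals['unreliable'], 1),
        }
    return hash_table

def classify(tokens, hash_table, p_credible=0.5, p_unreliable=0.5):
    """Phase 2: combine word probabilities for the message and apply the threshold."""
    like_c, like_u = 1.0, 1.0
    for word in tokens:
        probs = hash_table.get(word, {})
        like_c *= probs.get('credible', EPS)
        like_u *= probs.get('unreliable', EPS)
    combined = (p_credible * like_c) / (p_credible * like_c + p_unreliable * like_u)
    return 'Credible' if combined > THRESHOLD else 'Unreliable'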

Figure 2 illustrates the key steps in the CSD credibility detection approach based on the Naïve BN and the classical 'bag of words' model popular in email spam detection.

Figure 2. CSD credibility detection workflow.


The ABC's 2011 Australian Flood Crisis Map dataset (Ushahidi CrowdMap) was used as the input CSD. The dataset was initially pre-processed using the steps explained below. After the data pre-processing, the system was trained using a training sample dataset.

Within the ABC's Ushahidi CrowdMap, there were approximately 700 reports during the period of 9–15 January 2011, which often included information about the location where the report had originated. After duplicates were removed using the 'Remove Duplicates' tool in MS Excel, 663 unique Ushahidi CrowdMap reports remained.

For training and testing purposes, approximately 20% of the total reports (143 reports) were randomly selected from this Ushahidi CrowdMap dataset. Eighty percent of these reports (110 reports) were then selected as training data and the remaining 20% as testing data (33 reports). The remainder of the full dataset (520 reports) was then used for the credibility detection analysis.
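
As a rough illustration, the deduplication and sampling described above could be scripted as follows (pandas is used here only as an example; the file name, column layout and random seeds are hypothetical, and the authors performed deduplication in MS Excel):

import pandas as pd

# Hypothetical export of the Ushahidi CrowdMap reports.
reports = pd.read_csv('ushahidi_crowdmap_2011.csv')
reports = reports.drop_duplicates()                 # 663 unique reports in the paper

# ~20% of reports reserved for training/testing, the remainder for credibility analysis.
train_test = reports.sample(frac=0.2, random_state=1)
analysis = reports.drop(train_test.index)

# 80/20 split of the sampled reports into training and testing sets.
training = train_test.sample(frac=0.8, random_state=1)
testing = train_test.drop(training.index)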

The whole dataset was initially pre-processed to prepare it for the training, testing and credibility detection. The training dataset was classified through a manual decision process which identified messages that were either credible or unreliable based on the credibility of the terms within the message. The classification was undertaken by a reviewer who had local and expert knowledge of the disaster area. The system was then trained and tested using the testing dataset under two different environments, namely unforced and forced conditions, to test the accuracy and performance improvements.

In the unforced training, the test data followed the normal pre-processing steps and were then used directly to refine the training of the system, with no human intervention undertaken. The results of this unforced training provided a report on the level of possible false positives (FPs) in the classification. A high level of FPs is indicative of a possible bias in the classification process and is often referred to as Bayesian poisoning (Graham-Cumming 2006). The purpose of the forced training was then to review the FPs and other classified data to improve the quality of the classification process and hence re-train the system. The forced training required human intervention to improve the training of the system, and therefore some terms which had artificially increased the credibility of the messages were identified and removed. This enabled the training of the system to be further refined so as to more effectively distinguish credible from unreliable messages. The forced training process consisted of the following stages:

  • The location terms were removed/disabled from both the credible and unreliable messages.

  • Highly credible terms such as evacuation centre, road close, police, hospital, etc. were removed from messages that were identified as unreliable to give more weight to similar terms in the credible messages and to avoid Bayesian poisoning.

  • Remaining messages which could cause a high FP rate were removed, thereby avoiding Bayesian poisoning.

When location terms appeared frequently in messages, these terms tended to increase the probability of the message being credible when in reality this was not the case. This impacted both the credible and unreliable messages. The impact was reduced by removing all location terms from both the credible and unreliable training sample messages. The Queensland Place Names Gazetteer was used as the basis for removing location terms as it provides a list of registered geographic locations and places. All incoming message terms were cross-checked against the gazetteer list and discarded if found. Due to the large range and complexity of local or vernacular place names, these were not present in the gazetteer and were therefore not removed.
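
A minimal sketch of this gazetteer-based filtering step, assuming the Queensland Place Names Gazetteer is available as a plain-text list of place names (the file name and helper names are hypothetical):

def load_gazetteer(path='qld_place_names.txt'):
    """Load registered place names into a set for fast membership checks."""
    with open(path, encoding='utf-8') as f:
        return {line.strip().lower() for line in f if line.strip()}

def remove_location_terms(tokens, gazetteer):
    """Discard any token that matches a registered place name.

    Vernacular or local place names not present in the gazetteer are left
    untouched, as noted in the paper.
    """
    return [t for t in tokens if t.lower() not in gazetteer]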

The full message structure from the Ushahidi reports included information on message number, incident title, incident date, location, description, category, latitude and longitude. For example:

101, Road closure due to flooding, 9/01/2011 20:00, Esk-kilcoy Rd, Fast running water over the road at the bottom of the decent below lookout, Roads Affected, -27.060215, 152.553593.

Some of the message descriptions in the Ushahidi CrowdMap data were very brief. The content of these messages was further reduced when some of the pre-processing activities were undertaken, including the removal of numbers, units, times, dates, hashtags, Twitter user accounts and URLs. If such a message was less than 30 characters long, the data columns 'Incident Title' and 'Description' were combined (see Table 1) to make the description more comprehensive and meaningful.

Table 1. Example of the combination results of the Incident title and Description of the Ushahidi CrowdMap message fields.

In some cases, this combination did not provide a meaningful result and did not satisfy the above condition. Therefore, the 'Location' column was also combined in these situations (see Table 2) to improve the message meaning. However, a small number of messages had to be discarded as none of the above operations produced a satisfactory result.

Table 2. Example of the combination result of the Incident title, Description and Location of the Ushahidi CrowdMap message fields.
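
The field-combination rule described above could be expressed as in the following sketch; the 30-character threshold and the field names come from the report structure described in this section, while the helper name and dictionary keys are assumptions:

MIN_LENGTH = 30  # character threshold reported in the paper

def build_message_text(report):
    """Combine Ushahidi report fields into a single description for classification.

    report: dict with 'incident_title', 'description' and 'location' keys (assumed names).
    Returns None if no combination yields a sufficiently long, meaningful text.
    """
    description = report.get('description', '').strip()
    if len(description) >= MIN_LENGTH:
        return description

    combined = f"{report.get('incident_title', '').strip()} {description}".strip()
    if len(combined) >= MIN_LENGTH:
        return combined

    with_location = f"{combined} {report.get('location', '').strip()}".strip()
    if len(with_location) >= MIN_LENGTH:
        return with_location

    return None  # message discarded, as in the paper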

The following example shows how the original Ushahidi CrowdMap message was processed after tokenisation, stemming, lemmatisation and stop-word removal before being used for training, testing and credibility detection.

Original Ushahidi CrowdMap message:

Access to Stanthorpe town is severely restricted and all residents along Quart Pot Creek have been ordered to evacuate.

Tokenised, stemmed and lemmatised message:

access to Stanthorpe town be severely restrict and all resident along Quart Pot Creek have be order to evacuate.

Stop word removed message:

access stanthorpe town severely restrict resident along quart pot creek order evacuate.
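
This transformation could be approximated with standard NLTK components, as in the sketch below. The exact tokeniser, stemmer and lemmatiser used by the authors are not specified in the paper, so the particular tools chosen here (Punkt tokeniser, WordNet lemmatiser, Porter stemmer and the NLTK English stop-word list) are assumptions, and the output will differ slightly from the example above.

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer, PorterStemmer

# One-off downloads of the required NLTK resources.
for resource in ('punkt', 'stopwords', 'wordnet'):
    nltk.download(resource, quiet=True)

stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer()

def preprocess(message):
    """Tokenise, lemmatise/stem and remove stop-words from a CrowdMap message."""
    tokens = nltk.word_tokenize(message.lower())
    normalised = [stemmer.stem(lemmatizer.lemmatize(t)) for t in tokens if t.isalpha()]
    return [t for t in normalised if t not in stop_words]

tokens = preprocess("Access to Stanthorpe town is severely restricted and all "
                    "residents along Quart Pot Creek have been ordered to evacuate.")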

4. Results and discussion

4.1. Results of initial training and testing using different sized training data

The system was initially trained using two different sized training datasets to assess any variations in the outcomes based on the size of the training dataset. The first training dataset consisted of 35 messages, of which 25 were credible and 10 were identified as unreliable. The second, larger training sample consisted of 77 messages, with 53 credible messages and 24 messages identified as unreliable.

A dataset of 33 messages was then tested using both the smaller and larger training datasets to train the system under both forced and unforced conditions. The test dataset was also manually pre-classified to identify credible and unreliable messages in order to confirm the accuracy and performance during the testing. Table 3 shows examples of correctly classified and misclassified messages. Tables 4–7 show the classification results for the four test environments. Test 1 utilised the smaller training dataset (35 messages) with the 33 test messages under unforced training conditions. Test 2 utilised the smaller training dataset (35 messages) with the 33 test messages under forced training conditions. Test 3 utilised the larger training dataset (77 messages) with the 33 test messages under unforced training conditions. Finally, Test 4 utilised the larger training dataset (77 messages) with the 33 test messages under forced training conditions.

Table 3. Examples of correctly and incorrectly classified messages.

Table 4. Test 1 – unforced training results using the small training sample (35 messages) and 33 test messages.

Table 5. Test 2 – forced training results using small training sample (35 messages) and 33 test messages.

Table 6. Test 3 – unforced training results using the larger training sample (77 messages) and 33 test messages.

Table 7. Test 4 – forced training results using the larger training sample (77 messages) and 33 test messages.

The terms true positive (TP), true negative (TN), FP and false negative (FN) were used to compare the results of the classification. A TP result correctly predicts a ‘Credible’ outcome when it is ‘Credible’, a TN result correctly predicts an ‘Unreliable’ outcome when it is ‘Unreliable’, a FP result falsely predicts a ‘Credible’ outcome when it should be ‘Unreliable’ and finally, a FN result falsely predicts an ‘Unreliable’ outcome when it should be ‘Credible’.

The Table 4 results indicate that the system correctly classified 24 out of 25 credible messages during unforced training, but only 1 out of the 8 messages identified as unreliable was correctly classified. This outcome resulted in a high number of FPs for the unforced training, which indicated that further training was required.

When the system utilised the same training dataset but ran under forced training conditions, the results, as expected, varied (Table 5). Of the 25 credible messages, 23 were correctly classified and only 2 were incorrectly classified. These results only varied slightly from the unforced training outcomes with regard to detecting credible messages correctly. However, there was a significant improvement in the correct detection of unreliable messages, with all messages being correctly classified during this test. Overall, the results were considered acceptable, with a high classification accuracy for both the credible and unreliable message classification, and hence validated the forced training conditions.

Next, the size of the training sample was increased from 35 messages to 77 messages and the unforced and forced training was repeated on the same test dataset. The results of the unforced training are shown in Table 6 and identify that, for the credible message classification, 21 out of 25 messages were correctly classified, a small decrease in accuracy compared to the previous result (Table 4). However, the classification accuracy of unreliable messages improved from one correctly classified message to five correctly classified messages out of the eight to be classified.

Finally, Table 7 shows the results of the classification using the larger training dataset under forced training conditions. The results of the testing are identical to those of the forced training using the smaller training dataset, with 23 out of 25 credible messages correctly classified and all 8 unreliable messages also correctly classified. This indicated that the forced training conditions were consistent and were not impacted by the changed training sample size.

A number of measures such as accuracy, precision, sensitivity and the F1-score provided an indication of each classification's effectiveness. The accuracy, which is the ratio of correctly predicted observations, was calculated by the formula (TP + TN)/(TP + TN + FP + FN). The precision or positive predictive value (PPV) is the ratio of correct positive observations and was calculated by TP/(TP + FP). The F1-score (F1) measures classification performance using the weighted recall and precision, where recall is the percentage of relevant instances that are retrieved, and was calculated by 2 * TP/(2 * TP + FP + FN). The sensitivity or true positive rate was calculated by TP/(TP + FN).
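
These measures can be computed directly from the confusion-matrix counts, as in the short sketch below (a straightforward transcription of the formulas above; the worked example uses the Test 2 style outcome of 23 TP, 8 TN, 0 FP and 2 FN):

def classification_quality(tp, tn, fp, fn):
    """Accuracy, precision (PPV), sensitivity (recall) and F1-score from TP/TN/FP/FN counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
    return {'accuracy': accuracy, 'precision': precision,
            'sensitivity': sensitivity, 'F1': f1}

print(classification_quality(tp=23, tn=8, fp=0, fn=2))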

The classification quality for the four tests is summarised in Table 8. The accuracy and precision were higher for the forced training outcomes for both training sample sizes, which indicates the importance of the forced training. It can also be seen that the classification accuracy and precision increased slightly for the unforced training outcomes when the larger training sample size was utilised. However, the precision and accuracy outcomes for the forced training were similar, indicating that there may be a lesser dependency on the size of the training dataset when forced training is utilised. The F1-score did not change with the sample size, but the measures indicate that the forced training again performed better than the unforced training scenarios. Finally, the classification sensitivity remained constant for the forced training for both training sample sizes but dropped slightly with the larger training sample size for the unforced training test outcomes.

Table 8. Quality of the CSD classification.

4.2. Results of the full Ushahidi CrowdMap data CSD analysis

After the training and testing of the system were completed to an acceptable classification quality, the full Ushahidi CrowdMap sample of the remaining 433 messages was analysed for credibility. As Figure 3(a) indicates, 54% (234 out of 433) of the messages were identified as credible using the unforced training classification. However, when the system was run under forced conditions, 77% (334 out of 433) of the messages were identified as credible (Figure 3(b)). Greater confidence can be placed in this result than in the previous one, as the accuracy and precision of the credibility detection were higher.

Figure 3. Assessed credibility of 2011 Australian flood’s Ushahidi CrowdMap data.


5. Conclusion

CSD message credibility detection is a challenging task due to the high degree of variability of the data, the lack of a consistent data structure, the variability of the data providers and the limited metadata available. This study identified that Bayesian spam email detection approaches can be applied successfully to the challenge of classifying the credibility of CSD. However, the training approach and the size of the training dataset can influence the quality and performance of the training outcomes.

Due to the variability of the data, it is recommended that forced training be undertaken to achieve the highest accuracy and performance. In particular, the forced training provided a higher level of confidence by reducing the number of FP outcomes, that is, messages incorrectly classified as credible. The size of the training dataset was found to be less critical when a forced training approach was utilised, with the classification outcomes being similar for both the smaller and larger training datasets. However, if the system training is unforced, a larger training dataset is recommended.

Although this study focused on the issue of credibility, it should be recognised that the relevance of the dataset is another critical dimension in the quality assessment of crowdsourced datasets. It is often not enough to have a credible source of information; it is also important that the information is relevant to the purpose of the operational activity. For example, in the case of a flood disaster, the relevant information should support the flood operations or emergency services. It is therefore important that future studies analyse both the credibility and the relevance of crowdsourced datasets.

Disclosure statement

No potential conflict of interest was reported by the authors.

Additional information

Funding

The authors wish to acknowledge the Australian Government for providing support for the research work through the Research Training Program (RTP) and Monique Potts, ABC – Australia, for providing the 2011 Australian Flood's Ushahidi CrowdMap data.

References

  • Androutsopoulos, Ion, John Koutsias, Konstantinos V. Chandrinos, and Constantine D. Spyropoulos. 2000. “An Experimental Comparison of Naive Bayesian and Keyword-based Anti-spam Filtering with Personal E-mail Messages.” Paper presented at the proceedings of the 23rd annual international ACM SIGIR conference on research and development in information retrieval, Athens, Greece.
  • Antoniou, V., and A. Skopeliti. 2015. “Measures and Indicators of VGI Quality: An Overview.” ISPRS Annals of Photogrammetry, Remote Sensing and Spatial Information Sciences II-3/W5: 345–351. doi: 10.5194/isprsannals-II-3-W5-345-2015
  • Bahree, Megha. 2008. “Citizen Voices.” Forbes Magazine, Vol. 182, No. 12, 83.
  • Bishr, Mohamed, and Werner Kuhn. 2007. “Geospatial Information Bottom-up: A Matter of Trust and Semantics.” In The European Information Society: Leading the Way with Geo-information, edited by S. Fabrikant and M. Wachowicz, 365–387. Berlin: Springer.
  • Blanzieri, Enrico, and Anton Bryl. 2008. “A Survey of Learning-based Techniques of Email Spam Filtering.” Artificial Intelligence Review 29 (1): 63–92. doi: 10.1007/s10462-009-9109-6
  • Brando, Carmen, and Bénédicte Bucher. 2010. “Quality in User Generated Spatial Content: A Matter of Specifications.” Paper presented at the proceedings of the 13th AGILE international conference on geographic information science, Guimarães.
  • Castillo, Carlos, Marcelo Mendoza, and Barbara Poblete. 2011. “Information Credibility on Twitter.” Paper presented at the proceedings of the 20th international conference on world wide web, Hyderabad, India.
  • Cheng, Jie, and Russell Greiner. 1999. “Comparing Bayesian Network Classifiers.” Proceedings of the fifteenth conference on uncertainty in artificial intelligence, 101–108. Stockholm: Morgan Kaufmann.
  • Coleman, David J., Yola Georgiadou, and Jeff Labonte. 2009. “Volunteered Geographic Information: The Nature and Motivation of Producers.” International Journal of Spatial Data Infrastructures Research 4 (1): 332–358.
  • Cranor, Lorrie Faith, and Brian A. LaMacchia. 1998. “Spam!” Communications of the ACM 41 (8): 74–83. doi: 10.1145/280324.280336
  • De Longueville, Bertrand, Nicole Ostlander, and Carina Keskitalo. 2010. “Addressing Vagueness in Volunteered Geographic Information (VGI) – A Case Study.” International Journal of Spatial Data Infrastructures Research 5: 1725-0463.
  • Elwood, Sarah. 2008. “Volunteered Geographic Information: Future Research Directions Motivated by Critical, Participatory, and Feminist GIS.” GeoJournal 72 (3–4): 173–183. doi: 10.1007/s10708-008-9186-0
  • Flanagin, Andrew J., and Miriam J. Metzger. 2008. “The Credibility of Volunteered Geographic Information.” GeoJournal 72 (3–4): 137–148. doi: 10.1007/s10708-008-9188-y
  • Fogg, B. J., and Hsiang Tseng. 1999. “The Elements of Computer Credibility.” Paper presented at the proceedings of the SIGCHI conference on human factors in computing systems, Pittsburgh, PA.
  • Goodchild, Michael F. 2007. “Citizens as Sensors: The World of Volunteered Geography.” GeoJournal 69 (4): 211–221. doi: 10.1007/s10708-007-9111-y
  • Goodchild, Michael F. 2009. “NeoGeography and the Nature of Geographic Expertise.” Journal of Location Based Services 3 (2): 82–96. doi: 10.1080/17489720902950374
  • Goodchild, Michael F., and J. Alan Glennon. 2010. “Crowdsourcing Geographic Information for Disaster Response: A Research Frontier.” International Journal of Digital Earth 3 (3): 231–241. doi: 10.1080/17538941003759255
  • Goodchild, Michael F., and Linna Li. 2012. “Assuring the Quality of Volunteered Geographic Information.” Spatial Statistics 1: 110–120. doi: 10.1016/j.spasta.2012.03.002
  • Graham-Cumming, John. 2006. “Does Bayesian Poisoning Exist.” Virus Bulletin, 69.
  • Guzella, Thiago S., and Walmir M. Caminhas. 2009. “A Review of Machine Learning Approaches to Spam Filtering.” Expert Systems with Applications 36 (7): 10206–10222. doi: 10.1016/j.eswa.2009.02.037
  • Haklay, Mordechai. 2010. “How Good Is Volunteered Geographical Information? A Comparative Study of OpenStreetMap and Ordnance Survey Datasets.” Environment and Planning B: Planning and Design 37 (4): 682–703. doi: 10.1068/b35097
  • Heipke, Christian. 2010. “Crowdsourcing Geospatial Data.” ISPRS Journal of Photogrammetry and Remote Sensing 65 (6): 550–557. doi: 10.1016/j.isprsjprs.2010.06.005
  • Hovland, Carl I., Irving L. Janis, and Harold H. Kelley. 1953. Communication and Persuasion; Psychological Studies of Opinion Change. New Haven: Yale University Press.
  • Howe, Jeff. 2006. “The Rise of Crowdsourcing.” Wired Magazine, 1–4.
  • Hung, Kuo-Chih, Mohsen Kalantari, and Abbas Rajabifard. 2016. “Methods for Assessing the Credibility of Volunteered Geographic Information in Flood Response: A Case Study in Brisbane, Australia.” Applied Geography 68: 37–47. doi: 10.1016/j.apgeog.2016.01.005
  • Janowicz, Krzysztof, Sven Schade, Arne Broring, Carsten Kebler, Patrick Maue, and Christoph Stasch. 2010. “Semantic Enablement for Spatial Data Infrastructures.” Transactions in GIS 14 (2): 111–129. doi: 10.1111/j.1467-9671.2010.01186.x
  • Kang, Byungkyu, John O’Donovan, and Tobias Höllerer. 2012. “Modeling Topic Specific Credibility on Twitter.” Paper presented at the proceedings of the 2012 ACM international conference on intelligent user interfaces, Lisbon, Portugal.
  • Kim, Heejun. 2013. “Credibility Assessment of Volunteered Geographic Information for Emergency Management: A Bayesian Network Modeling Approach.” University of Illinois.
  • Koswatte, Saman, Kevin McDougall, and Xiaoye Liu. 2016. “Semantic Location Extraction from Crowdsourced Data.” ISPRS-international archives of the photogrammetry, remote sensing and spatial information sciences, volume XLI-B2-543, 543–547, Prague, Czech Republic.
  • Longueville, Bertrand De, Gianluca Luraschi, Paul Smits, Stephen Peedell, and Tom De Groeve. 2010. “Citizens as Sensors for Natural Hazards: A VGI Integration Workflow.” Geomatica 64 (1): 41–59.
  • Lopes, Clotilde, Paulo Cortez, Pedro Sousa, Miguel Rocha, and Miguel Rio. 2011. “Symbiotic Filtering for Spam Email Detection.” Expert Systems with Applications 38 (8): 9365–9372. doi: 10.1016/j.eswa.2011.01.174
  • Metsis, Vangelis, Ion Androutsopoulos, and Georgios Paliouras. 2006. “Spam Filtering with Naive Bayes – Which Naive Bayes?” Paper presented at the CEAS, Mountain View, CA.
  • Noy, Natalya F., Nicholas Griffith, and Mark A. Musen. 2008. “Collecting Community-based Mappings in an Ontology Repository.” In The Semantic Web-ISWC 2008, edited by A. Sheth, S. Steffen, M. Dean, M. Paolucci, D. Maynard, T. Finin, and K. Thirunarayan, 371–386. Berlin: Springer.
  • Ostermann, Frank O., and Laura Spinsanti. 2011. “A Conceptual Workflow for Automatically Assessing the Quality of Volunteered Geographic Information for Crisis Management.” Paper presented at the proceedings of AGILE, University of Utrecht, Utrecht.
  • Pantel, Patrick, and Dekang Lin. 1998. “Spamcop: A Spam Classification & Organization Program.” Paper presented at the proceedings of AAAI-98 workshop on learning for text categorization, Madison, WI.
  • Robinson, Gary. 2003. “A Statistical Approach to the Spam Problem.” Linux Journal 2003 (107): 3.
  • Sadeghi-Niaraki, Abolghasem, Abbas Rajabifard, Kyehyun Kim, and Jungtaek Seo. 2010. “Ontology Based SDI to Facilitate Spatially Enabled Society.” Paper presented at the proceedings of GSDI 12 world conference, Singapore.
  • Sahami, Mehran, Susan Dumais, David Heckerman, and Eric Horvitz. 1998. “A Bayesian Approach to Filtering Junk E-mail.” Paper presented at the learning for text categorization: papers from the 1998 workshop, Madison, WI.
  • Schneider, Karl-Michael. 2003. “A Comparison of Event Models for Naive Bayes Anti-spam E-mail Filtering.” Paper presented at the proceedings of the tenth conference on European chapter of the association for computational linguistics, volume 1, 307–314, Budapest, Hungary.
  • Senaratne, Hansi, Amin Mobasheri, Ahmed Loai Ali, Cristina Capineri, and Mordechai Haklay. 2017. “A Review of Volunteered Geographic Information Quality Assessment Methods.” International Journal of Geographical Information Science, 139–167. doi: 10.1080/13658816.2016.1189556
  • Shvaiko, Pavel, and Jérôme Euzenat. 2013. “Ontology Matching: State of the Art and Future Challenges.” IEEE Transactions on Knowledge and Data Engineering 25 (1): 158–176. doi: 10.1109/TKDE.2011.253
  • Wang, Alex Hai. 2010. “Don’t Follow Me: Spam Detection in Twitter.” Proceedings of the 2010 international conference on security and cryptography (SECRYPT), July 26–28, Athens, Greece.
