Research Paper

A natural language processing framework to analyse the opinions on HPV vaccination reflected in twitter over 10 years (2008 - 2017)

Pages 1496-1504 | Received 23 Mar 2019, Accepted 28 May 2019, Published online: 16 Jul 2019

ABSTRACT

In this research, we developed a natural language processing (NLP) framework to investigate the opinions on HPV vaccination reflected on Twitter over a 10-year period (2008–2017). The NLP framework includes sentiment analysis, entity analysis, and artificial intelligence (AI)-based phrase association mining. The sentiment analysis shows how sentiment fluctuated over those 10 years: there were proportionally more negative tweets from 2008 to 2011 and from 2015 to 2016. The entity extraction and analysis help to identify the organization, geographical location and event entities associated with the negative and positive tweets. The results show that organization entities such as FDA, CDC and Merck occur in both negative and positive tweets in almost every year, whereas the geographical location entities mentioned in both negative and positive tweets change from year to year, because they reflect specific events that happened in those locations. The objective of the AI-based phrase association mining is to identify the main topics reflected in both negative and positive tweets, together with the detailed tweet content. Through the phrase association mining, we found that the main negative topics on Twitter include “injuries”, “deaths”, “scandal”, “safety concerns”, and “adverse/side effects”, whereas the main positive topics include “cervical cancers”, “cervical screens”, “prevents”, and “vaccination campaigns”. We believe the results of this research can help public health researchers better understand the nature of social media influence on HPV vaccination attitudes and develop strategies to counter the proliferation of misinformation.

Introduction

Human papillomavirus (HPV) is the most common sexually transmitted infection (STI) in the United States and in the world. Persistent HPV infection can progress to several types of cancer, including cervical, anal, and oropharyngeal cancers. First licensed in the U.S. in 2006, HPV vaccine currently is approved for males and females ages 9 through 45 years for the prevention of cervical, anal, vaginal, and vulvar cancers, as well as genital warts.Citation1 Ninety-one countries had implemented national HPV vaccination programs as of June 2019, with most of these countries located in the Americas and western Europe. Relatively few countries in Asia have adopted national HPV vaccination programs.Citation2 Despite the recommendation and availability of safe and effective vaccines for over 10 years, HPV vaccination rates in the U.S. (and many other countries) are far lower than the goal set by Healthy People 2020 of 80% series completion for both adolescent males and females.Citation3 Identified reasons for non-vaccination include poor quality recommendations on the part of health care providers and unwarranted parental concerns about HPV vaccine safety and effectiveness. The extent to which parental concerns are influenced by social media stories has received quite a bit of attention.

Twitter, a social media platform created in 2006, has hundreds of millions of users around the world. It is a place for users to express their opinions and is frequently used for microblogging. The application’s nature makes it an ideal medium for the study of online opinions and behaviours. Several studies have analysed tweets to understand public concerns and influences on HPV vaccine uptake. Shapiro et al.Citation4 conducted a comparison of tweets that express concerns about HPV vaccines from three countries: Australia, Canada, and the UK. The study period was about two and a half years, from January 2014 to April 2016. A model was created to distinguish between tweets with and without (neutral) concerns. Tweets expressing concerns about psychological barriers comprised the largest proportion (46%) of tweets expressing concerns. Dunn et al.Citation5 investigated HPV vaccine coverage in the United States by using tweets posted between October 2013 and October 2015. Their results indicated that the topics related to media controversies were most closely related to coverage (both positively and negatively), which explained variance in HPV vaccine coverage in females in 2015. Le et al.Citation6 used three sampling strategies to analyse tweets from 2014. The three sampling strategies were based on keywords related to cervical cancer prevention. One hundred tweets from each of the three sampling strategies were used to examine the narratives and themes, which differed across samples. Dunn et al.Citation7 analysed tweets related to HPV vaccine for a six-month period (from October 2013 to April 2014). Machine learning models were built to classify tweets as anti-vaccine or otherwise, and their research identified that negative tweets accounted for 25.1% of total HPV-related tweets. Du et al.Citation8 extracted tweets posted between July 15, 2015 and August 17, 2015, and manually annotated 6000 tweets for sentiment analysis. A Support Vector Machine (SVM), a machine learning algorithm, was used to classify the tweets into 7 sentiment categories. The classification worked well on non-HPV-related content and positive tweets, but not so well on the negative tweets.

In the present research, our main objective was to extend prior research efforts by developing a natural language processing framework to analyse HPV vaccine-related information extracted from one social media platform (Twitter) over a 10-year period (2008–2017). Natural language processing (NLP) is a subfield of computer science, computational linguistics, and artificial intelligence that helps computers process and analyse large amounts of natural language data. NLP tasks include content categorization, topic discovery, entity recognition, sentiment analysis and so on. In this research, the NLP framework comprises three main tasks organized as a pipeline to analyse the natural language data. Using NLP and artificial intelligence techniques, we sought to identify and extract terms and phrases related to reasons for non-vaccination or opposition to HPV vaccination. These terms and phrases can help public health researchers identify the reasons for HPV vaccine scares and refine policies to prevent the scares and improve HPV vaccine uptake. Our research makes a unique contribution to the literature by evaluating opinions about HPV vaccination on Twitter over a 10-year timespan and by developing an NLP framework to identify the associations between the phrases and entities reflected in both negative and positive tweets. Our intent in this paper is not to examine associations of Twitter trends with HPV vaccine policies or uptake, but to evaluate strengths and weaknesses of different Twitter analytic approaches. We hope to identify approaches that may prove valuable in helping researchers and public health professionals to understand events and associated phrases that might trigger the spread of negative opinion on social media. The results over a 10-year timespan provide a good overview of, and insight into, attitudes about HPV vaccine reflected in social media.

Methods

Study data

English-language tweets (287,100 tweets) containing keywords related to HPV vaccines and posted between January 1, 2008 and December 31, 2017 were collected. These tweets were extracted by searching for any combination of the keywords HPV, vaccination, vaccine, cervical, cancer, and Gardasil via a Twitter application programming interface (API), in accordance with the terms of service for Twitter developers.

NLP analysis framework

In this research, we designed and developed an NLP framework that first classifies the HPV vaccine-related tweets into three groups (positive, negative and neutral) through sentiment analysis. Then, to further understand the influential content mentioned in these groups of tweets, phrase association mining and entity analysis were applied to the positive and negative groups to understand the main opinion phrases or events that were associated with the positive and negative tweets. Figure 1 shows an overview of the framework. This framework can identify the negative tweets and the topics, events and entities mentioned within them. Public health researchers may be able to make use of such a framework to identify negative opinions and the associated public scares, public events and entities in social media, so that preventive action can be taken to prevent drops in HPV vaccine uptake.

Figure 1. An NLP framework for HPV tweets analysis.


Sentiment analysis

Sentiment analysis inspects a given text and identifies the prevailing emotional opinion within it to determine the writer’s attitude as positive, negative, or neutral. In this research we used Google Cloud sentiment analysis, which is part of the Cloud Natural Language API developed by Google.Citation9 Users need to obtain Google Cloud credentials and credits in order to use this service. Given a text, Google Cloud sentiment analysis produces a sentiment score and a magnitude value. Sentiment scores range from −1 to 1, where −1 is extremely negative, 0 is neutral, and 1 is extremely positive. The magnitude value associated with each score indicates the strength of the sentiment: the higher the magnitude, the stronger the sentiment. Given a text that includes multiple sentences, the algorithm calculates the sentiment score and magnitude value for each sentence, and then provides an overall sentiment score and magnitude value for the input text. Figure 2 provides an example of the sentiment analysis. The sentiment score is 0.8, and the magnitude value is 0.8. That means this text has a very positive sentiment and an equally strong magnitude value.

Figure 2. Example of sentiment analysis.


The initial phrase cloud of positive tweets included phrases such as “scandal” and “adverse event”. Hence, the tweets containing those phrases were extracted for manual analysis. We confirmed that these were negative tweets, demonstrating that the Google Cloud Platform (GCP) sentiment analysis algorithm had some limitations. For example, GCP sentiment analysis gave a sentiment score of 0.9 and a magnitude value of 0.9 to the sentence, “Former Doctor Predicts that Gardasil will Become the Greatest Medical Scandal of All Time.” However, the sentiment score and magnitude value for “scandal” alone are −0.6 and 1.6 respectively. It is likely that the inclusion of the word “greatest” led to the incorrect scoring of the sentence as a whole. Hence, we manually went through all the tweets that contained “scandal”, “adverse”, “safety” and “injury” to validate the sentiment analysis results and assign each tweet to the correct sentiment category. In total, we manually categorized 3672 tweets.
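The manual validation step described above can be sketched as a simple keyword filter. This is a minimal illustration, not the procedure actually used in the study: the function names are hypothetical, the matching is assumed to be a case-insensitive substring match, and the second sample score is invented for the example.

```python
import re

# Terms the text reports using to select tweets for manual validation.
REVIEW_TERMS = ("scandal", "adverse", "safety", "injury")

# Assumption: case-insensitive substring matching on any of the terms.
_REVIEW_PATTERN = re.compile("|".join(REVIEW_TERMS), re.IGNORECASE)


def needs_manual_review(tweet_text):
    """Return True when the tweet contains at least one review term."""
    return _REVIEW_PATTERN.search(tweet_text) is not None


def split_for_review(scored_tweets):
    """Partition (text, score) pairs into tweets to re-check by hand and the rest."""
    to_review, accepted = [], []
    for text, score in scored_tweets:
        if needs_manual_review(text):
            to_review.append((text, score))
        else:
            accepted.append((text, score))
    return to_review, accepted


if __name__ == "__main__":
    sample = [
        # Sentence and score quoted in the text above.
        ("Former Doctor Predicts that Gardasil will Become the Greatest "
         "Medical Scandal of All Time.", 0.9),
        # Hypothetical tweet and score, for illustration only.
        ("HPV vaccine prevents cervical cancer", 0.5),
    ]
    review, ok = split_for_review(sample)
    print(len(review), "tweet(s) flagged for manual review;", len(ok), "accepted")
```

A filter of this kind only selects candidates; the final sentiment category for each flagged tweet still has to be assigned by a human reader.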

Entity extraction and analysis

The entity extraction and analysis approach involves more than simply scanning for nouns: it inspects the given text for known entities such as public figures and landmarks, as well as common nouns such as restaurant, stadium, and so on. It also returns information about those entities. In the present study, spaCy entity analysis was used.Citation10 spaCy’s entity analysis identifies the entities in a text and classifies them into many categories, including person, location, organization, event, and work of art. Figure 3 shows an example of entity analysis using spaCy. We used entity analysis to identify the critical entities mentioned in the positive and negative tweets of each year. We hypothesized that some entities, such as events or organizations, might be associated with more negative tweets, and that negative tweets might be linked to certain geographic locations.

Figure 3. Example of entity extraction and analysis.
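Counting entities per group of tweets can be sketched as follows. This is an illustrative sketch rather than the study's code: `count_entities` and `TARGET_LABELS` are hypothetical names, and the function only assumes that `nlp` behaves like a loaded spaCy pipeline, that is, calling it on a string returns an object whose `ents` items expose `text` and `label_`.

```python
from collections import Counter

# spaCy label names for the entity types discussed in the text
# (ORG = organization, GPE and LOC = locations, EVENT = named events).
TARGET_LABELS = frozenset({"ORG", "GPE", "LOC", "EVENT"})


def count_entities(tweets, nlp, labels=TARGET_LABELS):
    """Count (entity text, label) pairs over an iterable of tweet strings.

    `nlp` is any callable that returns an object with an `ents` attribute
    whose items have `text` and `label_` attributes, as a spaCy pipeline does.
    """
    counts = Counter()
    for text in tweets:
        for ent in nlp(text).ents:
            if ent.label_ in labels:
                counts[(ent.text, ent.label_)] += 1
    return counts
```

With spaCy installed, `nlp` could be obtained with `spacy.load("en_core_web_sm")`; the model name is an assumption, since the text does not state which spaCy model was used. Calling `count_entities` separately on the positive and negative tweets of each year, and then `.most_common(10)` on each result, would give per-year frequency tables of the kind reported in the Results.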


Phrase association mining

The word embedding model proposed by Mikolov et al.Citation11 has been widely used for biomedical text analysis.Citation12 One advantage of word embeddings is that they capture the semantic associations between words based on the content of the text. We believe that opinions, and the entities related to them, are often expressed as phrases or terms that contain one or more words. Hence, in this research, we made use of phrase embeddingsCitation11 to discover the associations between phrases within the tweets. First, the most common phrases in the text corpora were identified, and the corpora were tagged with phrases instead of words. Then, the word embedding model was trained on the corpora with the newly found phrases. In this study, we employed the Skip-Gram model. The neural network architecture of the Skip-Gram model is a standard three-layer neural network. The inputs to the neural network are phrases that are represented as vectors ($v_c$). The length of each input vector is the number of phrases ($N$) in the corpora. The output vectors ($u_{c-m}, \ldots, u_{c+m}$) are those to be evaluated against the defined context phrases of an input phrase. The context phrases are the ones that occur within a specific sliding window ($2m$) of the input phrase in a sentence. The training objective is to minimize the cost $J$, which is presented as Equation 1:

$$J = -\sum_{j=0,\, j \neq m}^{2m} u_{c-m+j}^{\top} v_c + 2m \log \sum_{k=1}^{N} \exp\left(u_k^{\top} v_c\right) \tag{1}$$
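To make the cost concrete, the following sketch evaluates $J$ for a single input phrase using plain Python lists. It assumes the standard Skip-Gram softmax cost (the negative log-likelihood of the $2m$ context phrases); the function and argument names are hypothetical, and a real implementation would use an optimized library rather than this direct computation.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def skipgram_cost(v_c, context_ids, output_vectors):
    """Evaluate J = -sum_j u_j^T v_c + 2m * log(sum_k exp(u_k^T v_c)).

    v_c            : vector of the input (center) phrase
    context_ids    : indices of the 2m context phrases seen in the window
    output_vectors : list of N output vectors u_k, one per phrase
    """
    scores = [dot(u_k, v_c) for u_k in output_vectors]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    top = max(scores)
    log_norm = top + math.log(sum(math.exp(s - top) for s in scores))
    return -sum(scores[i] for i in context_ids) + len(context_ids) * log_norm


if __name__ == "__main__":
    # Toy values, for illustration only: N = 3 phrases, 2 context phrases.
    u = [[0.2, 0.1], [0.0, 0.4], [-0.3, 0.5]]
    print(skipgram_cost([0.5, -0.1], [0, 2], u))
```

Because $J$ is a negative log-likelihood it is never negative, and it decreases as the output vectors of the observed context phrases align more closely with $v_c$ relative to all other phrases.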

At the end of the training process, each phrase is represented by a vector, and the association score between two phrases $x$ and $y$ can be calculated from their vectors $v_x$ and $v_y$. In this study, the cosine similarity (cos), presented as Equation 2, was used to calculate the association scores. We used phrase association mining to identify the phrases most closely related to the keywords of HPV vaccines in the positive and negative tweets respectively.

$$\cos(v_x, v_y) = \frac{v_x \cdot v_y}{\lVert v_x \rVert \, \lVert v_y \rVert} \tag{2}$$
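The association score in Equation 2 and the ranking of phrases against a keyword can be sketched as below. The phrase vectors are toy values invented for illustration, and `most_associated` is a hypothetical helper name; trained embeddings would have many more dimensions.

```python
import math


def cosine(v_x, v_y):
    """Cosine of the angle between two non-zero vectors (Equation 2)."""
    dot = sum(a * b for a, b in zip(v_x, v_y))
    norm_x = math.sqrt(sum(a * a for a in v_x))
    norm_y = math.sqrt(sum(b * b for b in v_y))
    return dot / (norm_x * norm_y)


def most_associated(query, vectors, top_n=3):
    """Rank all other phrases by their association score with `query`."""
    q = vectors[query]
    scored = [(phrase, cosine(q, v)) for phrase, v in vectors.items() if phrase != query]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:top_n]


if __name__ == "__main__":
    # Toy 2-dimensional vectors, for illustration only.
    toy_vectors = {
        "hpv_vaccine": [1.0, 0.2],
        "side_effects": [0.9, 0.3],
        "cervical_cancer": [0.1, 1.0],
    }
    for phrase, score in most_associated("hpv_vaccine", toy_vectors):
        print(phrase, round(score, 3))
```

Scores close to 1 indicate phrases that occur in similar contexts to the query phrase, and scores near 0 indicate little association.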

Results

The number of tweets from each year is listed in Table 1. The number of tweets increased sharply from 2008 to 2009 and peaked in 2013, and it has remained over 30,000 per year since then.

Table 1. Number of tweets of each year.

Sentiment analysis

Sentiment analysis was performed on the tweets of every year. The time series boxplot in Figure 4 shows the progress of public opinion on HPV vaccine over time: the boxes show proportionally more negative tweets from 2008 to 2011, then equivalent amounts of negative and positive tweets from 2012 to 2014. Finally, negative tweets predominated again in 2015 and 2016. A word cloud is a straightforward and appealing visualization method for text. It has been used by many studies in text mining and opinion mining.Citation13 In our study, the data set contained a large amount of text about people’s attitudes toward the HPV vaccine. A phrase cloud can provide a general overview of the phrases that were used to express opinions over the past 10 years. Figure 5(a) and (b) provide phrase clouds of positive and negative tweets respectively. The phrase cloud of the positive tweets shows “cervical cancer”, while this term does not appear in the phrase cloud of the negative tweets. Similarly, the word “clean” is prominent in positive tweets, but not in negative tweets. Conversely, the word “scandal” is prominent in negative tweets, but not in positive tweets. Interestingly, several phrases appear in both phrase clouds (e.g., “HPV Vaccine” and “Gardasil Vaccine”), which points to a limitation of this approach: it emphasizes phrases purely on the basis of their frequency in the study data.

Figure 4. Boxplot of sentiment scores changes over 10 years.
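A per-year summary of the kind that underlies such a boxplot can be sketched as follows. This is an illustrative sketch only: the record format `(year, score)` and the function name are assumptions, and tweets are counted as negative or positive according to the sign of the score, which follows the threshold of 0 mentioned in the Discussion.

```python
from collections import defaultdict
from statistics import median


def yearly_summary(records):
    """Summarize sentiment scores per year.

    `records` is an iterable of (year, score) pairs, with scores in [-1, 1].
    Returns {year: {"negative": n, "positive": n, "median": value}}.
    """
    by_year = defaultdict(list)
    for year, score in records:
        by_year[year].append(score)

    summary = {}
    for year in sorted(by_year):
        scores = by_year[year]
        summary[year] = {
            "negative": sum(1 for s in scores if s < 0),
            "positive": sum(1 for s in scores if s > 0),
            "median": median(scores),
        }
    return summary


if __name__ == "__main__":
    # Hypothetical scores, for illustration only.
    demo = [(2008, -0.5), (2008, -0.1), (2008, 0.3), (2012, 0.2), (2012, -0.2)]
    for year, stats in yearly_summary(demo).items():
        print(year, stats)
```

Comparing the negative and positive counts year by year gives the same qualitative picture as reading the position of each box relative to zero.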


Figure 5. (a) Phrase cloud of positive tweets. (b) Phrase cloud of negative tweets.


Entity extraction and analysis

Based on the results of the sentiment analysis, entity extraction and analysis were applied to the positive and negative tweets of each year. We found that “CDC” is the organization entity that occurred in both positive and negative tweets in all of the 10 years except 2010. Other than in 2008, 2009 and 2016, “CDC” occurred more often in negative tweets than in positive tweets. After further investigation, we found that most of the negative tweets mentioned “CDC” because they either tagged “CDC” or wanted “CDC” to stop supporting the vaccine. Similar patterns were apparent with “FDA” and “Merck”. Figure 6 shows the frequency of “CDC”, “FDA” and “Merck” in the positive and negative tweets over the 10 years. Table 2 shows examples of tweets associated with organizations such as CDC, FDA, and Merck for some of the years. We found that the high frequencies of some entities were due to repeated postings. For example, the tweet, “Gardasil (MERCK) Vaccine against HPV human papilloma virus Sale and personal delivery throughout the country”, was repeated over one thousand times in 2013, which caused the spike of “MERCK” in the positive tweets of that year.

Table 2. Sample positive and negative tweets containing “CDC”, “FDA”, or “MERCK”.

Figure 6. Frequency of “CDC”, “FDA” and “Merck” in the positive and negative tweets.
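The effect of repeated postings on entity frequencies can be checked by comparing raw and de-duplicated counts. The sketch below is one possible way to do this and is not taken from the study: the helper names are hypothetical, and two tweets are assumed to be duplicates when they are identical after lower-casing and collapsing whitespace.

```python
from collections import Counter


def normalize(text):
    """Lower-case the text and collapse runs of whitespace."""
    return " ".join(text.lower().split())


def mention_counts(tweets, term):
    """Return (raw, unique) counts of tweets that mention `term`."""
    term = term.lower()
    matching = [normalize(t) for t in tweets if term in t.lower()]
    return len(matching), len(set(matching))


def most_repeated(tweets, top_n=5):
    """Return the most frequently repeated tweets with their counts."""
    return Counter(normalize(t) for t in tweets).most_common(top_n)


if __name__ == "__main__":
    # Shortened, hypothetical examples for illustration only.
    demo = ["Gardasil (MERCK) Vaccine sale"] * 3 + ["Merck  study published"]
    print(mention_counts(demo, "merck"))
    print(most_repeated(demo, top_n=1))
```

A large gap between the raw and unique counts for an entity in a given year suggests that a spike is driven by a small number of repeated tweets rather than by broad discussion.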


Starting from 2010, many positive and negative tweets contained location entities. In 2013, “Southern US” was one of the most frequent location entities in the negative tweets. We found that this was because of the tweet, “HPV vaccination rates alarmingly low among young women in Southern US”, which actually did not reflect a negative sentiment about the HPV vaccine. In 2015, “Central and South America” was a frequent location entity. We found that it was linked to the tweet, “HPV Vaccine Injuries and Deaths Now Being Reported from Central and South America”. Tables 3 and 4 list the most frequent location entities in the positive and negative tweets, along with the corresponding tweets.

Table 3. Top frequent location entities in positive tweets.

Table 4. Top frequent location entities in negative tweets.

The event entity extraction and analysis on the negative and positive tweets showed that there was no event that was frequently associated with HPV vaccine-related tweets for a specific year. One exception was in 2016, when “World Cancer Day 2016” was mentioned in 7 tweets. These 7 tweets were about “HPV news: On World Cancer Day 2016: Study explains why HPV vaccination rates are low in Hawaii”, which were categorized as neutral to positive tweets.

Phrase association mining

Phrase association mining was used to identify the phrases most relevant to the keywords of the HPV vaccine and to the key reasons for non-vaccination extracted from the NIS-Teen survey.Citation14 The phrase association mining algorithm was applied to the negative and positive tweets of each year to investigate the changes in the phrase associations. In this paper, we examined the identified negative phrases/words, their association scores with the keywords of the HPV vaccine, and the content that reflected the associations. Based on the phrase cloud of the negative tweets, the negative phrases were those containing the following words: ‘death’, ‘concern’, ‘kill’, ‘injured’, ‘safety’, ‘adverse’, ‘scandal’ and ‘fraud’. Table 5 shows selected phrase associations for each year and the changes over the past 10 years. It shows that in the early years before 2010, the HPV vaccine was highly associated with ‘safety concerns’ and ‘death’. Starting from 2011, more negative tweets were about ‘injuries’, ‘fraud’ and ‘scandal’. ‘Adverse reaction/effects’ and ‘side effects’ were concerns from 2011 until 2017. Table 6 shows selected phrase associations identified from the positive tweets. It consistently shows that the positive tweets were often associated with ‘cervical cancer’ and/or ‘prevention’.

Table 5. Phrase association mining for negative tweets.

Table 6. Phrase association mining for positive tweets.

Discussion

In this descriptive study of HPV vaccination on Twitter, the overall results showed that many concerns about HPV vaccine could be discovered by applying a multi-faceted approach to the analysis of Twitter posts from the past 10 years. The developed NLP framework can first analyze the sentiment of the tweets, then extract key phrases and analyze the associations of the phrases to further understand the main topics of the negative tweets. The identified topics may help public health researchers identify the reasons for HPV non-vaccination. The developed NLP framework can also find location, organization and event entities within the tweets. This can enable public health researchers in different countries or states to compare and investigate how to improve HPV vaccination at different locations and through different organizations and events.

The number of tweets peaked in 2013. We hypothesized that the main reason was that awareness of HPV vaccine increased. Although we also found a voluntary recall of one lot of Gardasil HPV Vaccine issued by CDC in 2013,Citation15 we thought it was not the main cause of the peak, since the word “recall” was not one of the frequent phrases associated with negative tweets. Through phrase association mining, we identified some negative topics that were different from those most cited in the literatureCitation16-Citation19 as drivers of non-vaccination, such as “marketing fraud”, “scandal”, and “injury”. These topics could cause HPV vaccine scares. Public health researchers could take appropriate action by providing clarifications on these topics to prevent scares, so that HPV vaccine uptake does not diminish. Through the entity analysis, we discovered that “CDC” was also associated with negative tweets. Some of the positive tweets were linked to the CDC recommendation on HPV vaccine, whereas the negative tweets were initiated by Twitter users who expressed that CDC should take some action to stop the HPV recommendation. We also identified that Europe is the location entity that was mentioned almost every year. Some of those tweets are positive and some are negative, reflecting variations in HPV vaccine hesitancy in European countries and the fact that HPV vaccination programs around the world have been introduced at different times.Citation20

There are several limitations of the work reported here, which also serve as a cautionary tale for analysis of social media posts. One serious issue was that the GCP sentiment analysis sometimes misidentified the nature of the sentiment expressed. This may have occurred because GCP sentiment analysis is not customized to evaluate sentiments that show consumers’ resistance to, or opinions about, a medical product. Often, an opinion about a medical product is not simply expressed through use of terms such as “good”, “bad”, and “great”. Although we manually corrected the sentiment analysis results of some tweets, the size of the corpus made it infeasible to manually validate all of them. In the future, we plan to develop a semi-supervised learning algorithm for sentiment analysis on this study corpus. The semi-supervised learning algorithm will make use of Long Short Term Memory (LSTM) neural networksCitation21 to classify the tweets into the positive and negative sentiment classes, and will make use of concept embedding techniquesCitation22 to consider the semantics of the words and phrases in the tweets. We also plan to annotate some categories, such as “fraud”, “adverse”, “side effect”, “death” and so on, for negative tweets based on the key phrases identified from the tweets, and then train the learning model to automatically identify the tweets in those categories. In this work, we simply classified the tweets into positive and negative based on a threshold of 0. More sophisticated thresholds could be used to identify extremely negative (<−0.75) and extremely positive (>0.75) tweets and the content within them. More tweet features, such as the number of retweets, favorites, and replies, might be used to track the spread of negative and positive tweets on social media.
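The threshold-based labelling described above can be sketched as a small function. The thresholds of 0 and ±0.75 come from the text; treating a score of exactly 0 as neutral and the function name `sentiment_label` are assumptions made for this illustration.

```python
def sentiment_label(score, extreme=0.75):
    """Map a sentiment score in [-1, 1] to a coarse sentiment label.

    Scores below 0 are negative and scores above 0 are positive; scores
    beyond +/- `extreme` are marked as extreme. A score of exactly 0 is
    treated as neutral (an assumption, not stated in the text).
    """
    if not -1.0 <= score <= 1.0:
        raise ValueError("sentiment score must lie in [-1, 1]")
    if score < -extreme:
        return "extreme negative"
    if score < 0:
        return "negative"
    if score == 0:
        return "neutral"
    if score <= extreme:
        return "positive"
    return "extreme positive"


if __name__ == "__main__":
    for value in (-0.9, -0.3, 0.0, 0.5, 0.8):
        print(value, sentiment_label(value))
```

Because the extreme bands use strict inequalities, a score of exactly −0.75 or 0.75 falls in the ordinary negative or positive class.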

In this largely descriptive study, we did not generally attempt to link specific historical events related to HPV vaccination to patterns of opinions reflected on Twitter over time. As we work to develop more accurate algorithms for sentiment analysis, it will be important to evaluate ways to use this information to quickly assess the impact of historical events, such as false reports of serious adverse events, on social media trends. In this way, it may be possible to intervene to limit the damage that can result from the rapid dissemination of false information on social media.

Conclusions

In this descriptive research, we explored an NLP framework which included sentiment analysis, entity analysis and phrase association mining to analyse tweets about HPV vaccination. This framework can help analyse and identify the main topics and entities related to the positive and negative text buried in a large text corpus. Our results show the different negative and positive opinions on HPV vaccination reflected on Twitter over the past 10 years. Tweets about HPV vaccine significantly increased after 2010 and peaked in 2013. In earlier years, soon after the FDA licensed HPV vaccine, the main concerns were ‘death’ and ‘safety’; later, they were more about ‘injuries’, ‘adverse effects’ and ‘fraud’. The results provide an overall insight into the concerns expressed through social media, which can help health authorities form and evaluate ways of countering the misinformation and fear mongering that have been relatively common on Twitter and other social media platforms. At the same time, it is encouraging that a substantial proportion of tweets had neutral or positive sentiments associated with them. Future work in this domain includes adding emoji analysis to the sentiment analysis, given that users express their feelings and opinions not only through words but also through emojis, and customizing the sentiment analysis algorithm to work better with tweets and to better capture the context of opinions in the medical domain. In addition, it will be important to examine the extent to which negative or positive tweets influence actual uptake of HPV vaccine.

Disclosure of potential conflicts of interest

Within the last year, author Gregory Zimet received an honorarium from Sanofi Pasteur for work on the Adolescent Immunization Initiative and received travel support from Merck to attend a conference on HPV vaccination.

References
