Ontology-based semantic data interestingness using BERT models

Abstract

The COVID-19 pandemic has generated massive data in the healthcare sector in recent years, encouraging researchers and scientists to uncover the underlying facts. Mining interesting patterns in the large COVID-19 corpora is very important and useful for the decision makers. This paper presents a novel approach for uncovering interesting insights in large datasets using ontologies and BERT models. The research proposes a framework for extracting semantically rich facts from data by incorporating domain knowledge into the data mining process through the use of ontologies. An improved Apriori algorithm is employed for mining semantic association rules, while the interestingness of the rules is evaluated using BERT models for semantic richness. The results of the proposed framework are compared with state-of-the-art methods and evaluated using a combination of domain expert evaluation and statistical significance testing. The study offers a promising solution for finding meaningful relationships and facts in large datasets, particularly in the healthcare sector.

1. Introduction

Advances in technology and the growth in data generation have made it imperative for organisations to mine and analyse vast amounts of data to uncover valuable insights and hidden relationships. This is especially crucial now that the COVID-19 pandemic has generated massive amounts of data in the healthcare sector (Abhilash & Mahesh, 2021). The sheer volume of data has led researchers and scientists to seek effective methods for uncovering the underlying facts and relationships, particularly in healthcare, where such information is needed to fight the pandemic. This has prompted the need for innovative approaches and frameworks for mining large datasets so that the resulting insights can support informed decisions (Bringmann et al., 2011).

Knowledge Bases (KBs) created with a domain ontology play a crucial role in discovering interesting patterns, which can be mined as logical rules using Association Rule Mining (ARM) with ontologies and Inductive Logic Programming (ILP). An example of such a rule is: “If two people test positive for the virus and their cases are classified as community transmission, then it is likely that they reside in the same geographical area”. Large KBs such as YAGO (Suchanek et al., 2007) and DBpedia (Bizer et al., 2009) have been created thanks to recent improvements in information extraction. These knowledge bases typically contain facts such as “Beijing is the capital of China”, “Bill Gates was born in Seattle”, and “All doctors are individuals”.

ARM techniques generate rules in the following way: given a set of items $I = \{I_{1}, I_{2}, \dots, I_{n}\}$, a group of items $S = \{S_{1}, S_{2}, \dots, S_{n}\}$ is created, where $S$ is a subset of $I$. The association rule $A \to B$ is defined over the group $S$. The interestingness of the rule is measured using metrics such as support, confidence, and lift. By setting different thresholds for these metrics, the most meaningful rules can be identified from the dataset (Agrawal & Srikant, 1994).
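The metrics named above can be illustrated with a minimal Python sketch. The transactions, item names, and threshold values are hypothetical and are not taken from the paper's data; the functions follow the standard textbook definitions of support, confidence, and lift rather than any implementation by the authors.

```python
from typing import FrozenSet, List

# Hypothetical transactions (assumption): each one is a set of items.
transactions: List[FrozenSet[str]] = [
    frozenset({"fever", "cough", "positive"}),
    frozenset({"fever", "positive"}),
    frozenset({"cough", "negative"}),
    frozenset({"fever", "cough", "positive"}),
    frozenset({"fatigue", "negative"}),
]


def support(itemset: FrozenSet[str], data: List[FrozenSet[str]]) -> float:
    """Fraction of transactions that contain every item of `itemset`."""
    hits = sum(1 for t in data if itemset <= t)
    return hits / len(data)


def confidence(a: FrozenSet[str], b: FrozenSet[str], data: List[FrozenSet[str]]) -> float:
    """support(A and B) / support(A); 0.0 when A never occurs."""
    denom = support(a, data)
    return support(a | b, data) / denom if denom else 0.0


def lift(a: FrozenSet[str], b: FrozenSet[str], data: List[FrozenSet[str]]) -> float:
    """confidence(A -> B) / support(B); 0.0 when B never occurs."""
    denom = support(b, data)
    return confidence(a, b, data) / denom if denom else 0.0


def is_interesting(a, b, data, min_support=0.4, min_confidence=0.8, min_lift=1.0) -> bool:
    """Keep a rule only if it clears all three (illustrative) thresholds."""
    return (
        support(a | b, data) >= min_support
        and confidence(a, b, data) >= min_confidence
        and lift(a, b, data) > min_lift
    )


antecedent = frozenset({"fever"})
consequent = frozenset({"positive"})
print(support(antecedent | consequent, transactions))
print(confidence(antecedent, consequent, transactions))
print(lift(antecedent, consequent, transactions))
print(is_interesting(antecedent, consequent, transactions))
```

A lift above 1 indicates that the antecedent and consequent co-occur more often than they would if they were independent, which is why the sketch uses a strict comparison for that threshold.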

To address the problem of uncovering hidden relationships and valuable insights in large datasets, a novel framework is proposed that leverages Resource Description Framework (RDF) data and a semantic approach to extract meaningful facts from COVID-19 data (Abhilash & Mahesh, 2022b). The framework integrates ontologies to incorporate domain knowledge into data mining and uses an improved Apriori algorithm to mine semantic association rules. The interestingness of the rules is then evaluated using BERT models for semantic richness. By combining RDF data, ontologies, the improved Apriori algorithm, and BERT models, the framework aims to surface valuable insights and hidden relationships in large datasets, particularly in the healthcare sector during the COVID-19 pandemic, and so provides a new avenue for data analysis (Gahar et al., 2018).
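The improved Apriori algorithm itself is not specified in this section. As a point of reference only, the classical level-wise Apriori procedure that such variants build on can be sketched as follows; the item strings and the `min_support` value are illustrative assumptions, and the function is not the authors' implementation.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Set


def apriori(data: List[Set[str]], min_support: float) -> Dict[FrozenSet[str], float]:
    """Return {itemset: support} for every frequent itemset (classical Apriori)."""
    n = len(data)
    # Level 1 candidates: every distinct item as a singleton itemset.
    current = [frozenset([i]) for i in sorted({item for t in data for item in t})]
    frequent: Dict[FrozenSet[str], float] = {}
    k = 1
    while current:
        # Count step: keep the candidates whose support clears the threshold.
        level = {}
        for cand in current:
            sup = sum(1 for t in data if cand <= t) / n
            if sup >= min_support:
                level[cand] = sup
        frequent.update(level)
        # Join step: merge frequent k-itemsets into (k+1)-item candidates.
        candidates = set()
        for a, b in combinations(list(level), 2):
            union = a | b
            if len(union) == k + 1:
                # Prune step: every k-subset of a candidate must be frequent.
                if all(frozenset(s) in level for s in combinations(union, k)):
                    candidates.add(union)
        current = list(candidates)
        k += 1
    return frequent


# Hypothetical transactions (assumption), not the COVID-19 corpora.
transactions = [
    {"fever", "cough", "positive"},
    {"fever", "cough"},
    {"fever", "positive"},
    {"cough", "positive"},
    {"fever", "cough", "positive"},
]
frequent = apriori(transactions, min_support=0.5)
for itemset, sup in sorted(frequent.items(), key=lambda p: (len(p[0]), sorted(p[0]))):
    print(sorted(itemset), round(sup, 2))
```

The prune step relies on the downward-closure property: an itemset can only be frequent if all of its subsets are frequent, which keeps the candidate set small at each level.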

To evaluate the effectiveness of the proposed framework, a combination of domain expert evaluation and statistical significance testing using a Chi-Square test has been employed. The results obtained from the evaluation will provide an insight into the accuracy and reliability of the framework in uncovering interesting facts from large datasets.
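As a rough illustration of how a Chi-Square test can be applied to a single rule X → Y, the sketch below computes Pearson's statistic on a 2×2 contingency table of antecedent against consequent. Treating each rule as a 2×2 table is an assumption made here for illustration, and the counts are hypothetical; the section does not state how the authors set up the test.

```python
def chi_square_2x2(n_xy: int, n_x: int, n_y: int, n: int) -> float:
    """Pearson chi-square statistic for the 2x2 table of X against Y.

    n_xy: records containing both X and Y
    n_x:  records containing X
    n_y:  records containing Y
    n:    total number of records
    Assumes every row and column total is non-zero.
    """
    observed = [
        [n_xy, n_x - n_xy],                  # X present: Y present / Y absent
        [n_y - n_xy, n - n_x - n_y + n_xy],  # X absent:  Y present / Y absent
    ]
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed[i][j] - expected) ** 2 / expected
    return stat


# Hypothetical counts (assumption): 100 records, X in 40, Y in 50, both in 30.
stat = chi_square_2x2(n_xy=30, n_x=40, n_y=50, n=100)
CRITICAL_VALUE = 3.841  # chi-square critical value for 1 degree of freedom, alpha = 0.05
print(stat, stat > CRITICAL_VALUE)
```

A statistic above the critical value suggests that the antecedent and consequent are not independent, so the rule is unlikely to be a chance co-occurrence at that significance level.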

The Transformer architecture, an encoder-decoder design, was introduced in “Attention is all you need” by Vaswani et al. (2017). The Google AI team subsequently developed Bidirectional Encoder Representations from Transformers (BERT), a transformer-based pre-trained model (Devlin et al., 2018). In this work, semantic scores computed with BERT models are used as measures of importance, and their distributions are calculated with respect to a distance measure.

More precisely, our contributions are as follows:

An effective data preprocessing technique that introduces semantics at the level of data curation.
An effective Semantic Interestingness Framework using BERT (SIF-B) that incorporates ontology-based methods with ARM techniques to extract meaningful and semantically rich rules from large datasets, particularly in the healthcare sector during the COVID-19 pandemic.
Adoption of healthcare BERT models to introduce a semantic interestingness measure, which strengthens the framework and makes it novel.

The comparison of the proposed framework with the state-of-the-art results, along with the evaluation through a combination of domain expert assessment and statistical significance testing, demonstrates its effectiveness in uncovering hidden relationships and valuable insights in data.

The structure of this paper is as follows: Section 2 provides a summary of existing methods and techniques. Section 3 outlines the data, preliminary steps, and data pre-processing techniques. Section 4 presents the Semantic Interestingness Framework using BERT (SIF-B). Section 5 presents the results, compares the semantically rich rules with state-of-the-art findings, and describes the implementation of BERT models for semantic interestingness. Section 6 includes a significance test and evaluation by domain experts. Finally, Section 7 concludes the paper and suggests future research directions.

2. Related work

Incorporating ontology knowledge into the discovery of interesting patterns in RDF (Resource Description Framework) data enhances the results compared to traditional instance-level data mining approaches by providing a deeper understanding of the relationships and concepts in the data (Abhilash & Mahesh, 2022a). The ontology represents the schema-level information and provides a semantic context for the data, which helps in uncovering hidden relationships between entities and making the discovered patterns more meaningful and relevant (Ciotti & Tomasi, 2016). The use of ontology information allows for the discovery of interesting patterns that take into account not just the instance-level data, but also the relationships between the concepts and classes in the data. This results in more accurate and relevant insights compared to instance-level data mining approaches, where only the individual instances are analysed without considering the underlying schema information (Su et al., 2023; Wikipedia contributors, 2021).

In recent years, the field of data mining has seen significant advances in the discovery of interesting patterns in large and complex datasets. One of the most promising developments has been the incorporation of ontology knowledge into these techniques, which allows for a deeper understanding of the semantic relationships between data elements. This has the potential to greatly enhance the results of data mining compared to traditional instance-level approaches, which are limited to only considering the relationships between individual data instances.

One of the key challenges in incorporating ontology knowledge into data mining is the accurate measurement of the semantic significance of the rules generated. This requires not only understanding the relationships between data elements, but also assessing the significance of these relationships in the context of the data.

To address this challenge, researchers have turned to the use of BERT models. These models, based on the transformer architecture, are capable of capturing fine-grained semantic relationships between words and phrases in natural language (Abas et al., 2020). By incorporating these models into the rule generation process, researchers can accurately measure the semantic significance of the rules generated from both ontology and instance-level information.

The use of BERT models in this context has been shown to produce promising results. In evaluations, the rules generated using BERT models have been found to be highly correlated with the assessments of domain experts, suggesting that these models are effective at measuring the semantic significance of the rules generated.

Prior research on COVID-19 has concentrated on forecasting case numbers (Arora et al., 2020; Qin et al., 2020; Tomar & Gupta, 2020) and categorising COVID-19 patients from real-world X-ray data sets using sophisticated deep neural network techniques (Apostolopoulos & Mpesiana, 2020; Ozturk et al., 2020). These techniques, however, focus on examining COVID-19 symptom patterns.

Bellandi et al. (2007) demonstrated how ontologies could improve the rules obtained by ARM systems. Marinica and Guillet (2010) presented the post-processing of ARM results using an ontology for consistency testing, and Mangla and Akhare (2015) proposed filtering the identified rules.

Çelik Ertuğrul and Ulusoy (2022) developed a rule-based expert system for the self-pre-diagnosis of COVID-19. It analyses symptom data to predict the risk of COVID-19 and provides suggestions for precautionary and supportive actions. The system was tested and found to be effective and efficient in diagnosing and monitoring positive cases during the COVID-19 pandemic.

Shan et al. (2021) used transformer-based models for fake news detection. It was observed that a model specific to the healthcare domain yields better results than generalised ones.

Alzubi et al. (2021) designed COBERT, a question answering system, to address COVID-19 difficulties and to help researchers and clinical professionals quickly obtain legitimate scientific information. For COVID-19-related question answering, cosine similarity measures are applied to the word embeddings to categorise the top-K documents (Choi et al., 2018; Guo et al., 2020; Shen et al., 2020).
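The cosine-similarity ranking mentioned above can be sketched in a few lines of Python. The three-dimensional vectors below are toy stand-ins for real embeddings and the document names are hypothetical; in practice the vectors would come from a BERT-style encoder, which is omitted here.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(u: List[float], v: List[float]) -> float:
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def top_k(query: List[float], docs: Dict[str, List[float]], k: int) -> List[Tuple[str, float]]:
    """Rank documents by similarity to the query and keep the best k."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in docs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy embeddings (assumption): real ones would have hundreds of dimensions.
query_vec = [1.0, 0.0, 1.0]
doc_vecs = {
    "doc_a": [1.0, 0.0, 1.0],
    "doc_b": [0.0, 1.0, 0.0],
    "doc_c": [1.0, 1.0, 1.0],
}
ranking = top_k(query_vec, doc_vecs, k=2)
print([name for name, _ in ranking])
```

Because cosine similarity depends only on direction and not on vector length, documents of different sizes can be compared on the same scale.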

2.1. Outcome of literature review

The literature review revealed that an ontology-based approach is widely used for generating explicit and implicit facts. However, to the best of our knowledge, the use of ontology-based approaches for rule generation has received relatively little attention. Although previous studies have proposed mining semantic association rules using ARM, it is uncertain whether the remaining rules are semantically meaningful after threshold-based pruning.

2.1.1. Research gaps

With the detailed literature review, the following gaps are identified:

Limited use of ontology-based methods in combination with ARM techniques to extract meaningful and semantically rich rules from large datasets in the healthcare sector.
Limited use of healthcare BERT models to evaluate the interestingness of rules in data mining.
Limited use of a framework that combines ontology-based methods, ARM techniques, and healthcare BERT models to extract meaningful and semantically rich rules from large datasets.

3. Preliminaries, data, and pre-processing technique

This section discusses the dataset with preliminaries and an effective data processing technique designed for this study. Besides the general data analysis and knowledge engineering methods, the ontology-based approach stands aside and has unique importance.

3.1. Preliminaries

Ontology and ARM methods work closely together towards data interestingness (Abhilash & Mahesh, 2021). In the data mining literature, association rule mining is widely used for rule generation based on frequent patterns.

Definition 3.1

Association Rule: A technique used to mine frequent patterns in data. The discovered patterns define the relationships among the items involved.

We call $X \to Y$ an association rule. To have a strong association rule, we need to compute the support and confidence as indicated in Equations (1) and (2). Rules are defined considering our domain information. AVS refers to the attribute value set.

$\mathrm{Support}(X \to Y) = \frac{X \,\&\, Y}{\text{Total number of attribute sets}}$ (1)

$\mathrm{Confidence}(X \to Y) = \frac{\text{Both } X \,\&\, Y}{\text{All value sets containing } X}$ (2)
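As a small worked example with hypothetical counts (not taken from the paper's data), suppose there are 10 attribute value sets in total, 5 of which contain $X$ and 4 of which contain both $X$ and $Y$. Equations (1) and (2) then give:

$\mathrm{Support}(X \to Y) = \frac{4}{10} = 0.4$

$\mathrm{Confidence}(X \to Y) = \frac{4}{5} = 0.8$

The rule covers 40% of all attribute value sets, and $Y$ appears in 80% of the sets that contain $X$.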

Definition 3.2

Ontology: An ontology $O$ is defined as $O = (\mathrm{TBox} + \mathrm{ABox}, G)$. The TBox defines the schema of the ontology. The ABox refers to RDF triples at the instance level. $G$ is a labelled graph structure produced by connecting the relations with concepts.

Figure 1 represents the ontology and a COVID-19 knowledge graph snippet.

Figure 1. Representation of ontology and knowledge graph. (a) Example of ABox and TBox of an ontology. (b) Knowledge graph view from COVID-19 data.

The figure shows a graphical representation of the RDF triples that describe a doctor and patient in a medical system. The doctor and patient are represented as rectangular nodes, with their respective properties and values indicated as labels. Arrows represent relationships or links between the nodes. The doctor node is linked to the patient node via a "treats" property, indicating that the doctor is treating the patient.
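The TBox, ABox, and labelled graph of Definition 3.2 can be made concrete with a minimal Python sketch that stores triples as plain tuples. The class names, instance identifiers, and the `types_of` helper are hypothetical and chosen to mirror the doctor and patient example; a real pipeline would more likely use an RDF library and the actual COVID-19 ontology.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Triple = Tuple[str, str, str]

# TBox: schema-level triples (hypothetical classes and property).
tbox: List[Triple] = [
    ("Doctor", "rdfs:subClassOf", "Person"),
    ("Patient", "rdfs:subClassOf", "Person"),
    ("treats", "rdfs:domain", "Doctor"),
    ("treats", "rdfs:range", "Patient"),
]

# ABox: instance-level triples (hypothetical individuals).
abox: List[Triple] = [
    ("doctor1", "rdf:type", "Doctor"),
    ("patient1", "rdf:type", "Patient"),
    ("doctor1", "treats", "patient1"),
]


def build_graph(triples: List[Triple]) -> Dict[str, List[Tuple[str, str]]]:
    """Labelled directed graph G: subject -> list of (predicate, object) edges."""
    graph: Dict[str, List[Tuple[str, str]]] = defaultdict(list)
    for subject, predicate, obj in triples:
        graph[subject].append((predicate, obj))
    return dict(graph)


def types_of(entity: str, graph: Dict[str, List[Tuple[str, str]]]) -> Set[str]:
    """Asserted classes of an entity plus their superclasses from the TBox."""
    found: Set[str] = set()
    stack = [o for p, o in graph.get(entity, []) if p == "rdf:type"]
    while stack:
        cls = stack.pop()
        if cls not in found:
            found.add(cls)
            stack.extend(o for p, o in graph.get(cls, []) if p == "rdfs:subClassOf")
    return found


graph = build_graph(tbox + abox)
print(sorted(types_of("doctor1", graph)))
print([o for p, o in graph["doctor1"] if p == "treats"])
```

Combining schema and instance triples in one graph is what lets a query about an individual, such as `doctor1`, also return class-level facts that were never asserted for that individual directly.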

Definition 3.3

Knowledge Graph: A collection of descriptions of concepts, things, relationships, and events that are all linked together.

Definition 3.4

Data Interestingness (DI): Our notion of Data Interestingness is derived by integrating the domain ontology (O) with data in RDF (D) and with user interest rules (U), as shown in Equation (3).

$DI = \{O, D, U\}$ (3)

Here, user interest (U) refers to the unexpectedness and actionability measures. The interestingness framework in Section 4 illustrates Equation (3).

Definition 3.5

Patient Outcomes (PO): A measure of the impact that a pattern or relationship has on the health and well-being of patients. Patient outcome data such as survival rates, hospital readmission rates, and quality of life scores can be used to evaluate the interestingness of a pattern or relationship.

Definition 3.6

Clinical Relevance (CR): The extent to which a pattern or relationship is relevant to the treatment and care of patients. Clinical knowledge and expert opinion can be used to evaluate the clinical relevance of a pattern or relationship and to guide the analysis.

Definitions 3.5 and 3.6 are used in domain expert assessment in Section 6.

3.2. Data – COVID-19 corpora and ontology

The proposed framework is suitable for any type of data accompanied by a relevant ontology; the data sets were therefore selected based on their availability and relevance to the research objectives. The first data set was selected from a publicly available repository, while the second came from a funded project. The ability to use other data sets depends on their suitability and compatibility with the proposed framework.¹ The data statistics are given in Tables 2 and 3. We have used two COVID-19 ontologies from BioPortal² that match our COVID-19 corpora.

Table 1. BERT model for healthcare domain.

Table 2. COVID-19 corpora description.

Table 3. Dataset and ontology descriptions.

3.3. Data pre-processing technique

3.3.1. Data curation

3.3.2. Converting to RDF

3.3.3. Linking data sources

3.3.4. Publishing as knowledge graph

4. Interestingness framework using BERT

5. Result and discussions

5.1. Rules from ontology

Table 4. Items from the constraint file.

5.2. Semantic interestingness in COVID-19 corpora

Table 5. Few semantic association rules from COKPME and KATrace dataset.

Table 6. Interesting entities of the rules with different confidence values.

5.3. Comparison to the state-of-the-art

Table 7. Comparative analysis of rules I.

Table 8. Comparative analysis of rules II.

5.4. Semantic interestingness using transformers

5.4.1. Tokenization and embeddings

5.4.2. Interesting centroids

Table 9. Rules from cluster centroid-CovidBERT.

Table 10. Rules from cluster centroid – BioClinicalBERT.

5.4.3. Rules with semantic score

Table 11. Rules from CovidBERT with semantic scores.

Table 12. Rules from BioClinicalBERT with semantic scores.

5.4.4. Rules using distance measure

Table 13. Semantic rules with absolute distance measure using CovidBERT for Cluster 0, 1 and 2.

Table 14. Semantic rules with absolute distance measure using CovidBERT for cluster 3 and 4.

Table 15. Semantic rules with absolute distance measure using BioClinicalBERT for cluster 0, 1 and 2.

Table 16. Semantic rules with absolute distance measure using BioClinicalBERT for cluster 3 and 4.

5.4.5. Industrial significance

6. Rule evaluation

6.1. Statistical significance

Table 17. Chi-square test for rule significance on sample rules.

6.1.1. Domain expert evaluation

Table 18. Evaluation scale range.

Table 19. Domain expert evaluation summary.

6.2. Implication and limitations of our framework

7. Conclusion

Acknowledgments

Disclosure statement

Notes

References
