233
Views
38
CrossRef citations to date
0
Altmetric
Original Research

Development of a prediction model for pancreatic cancer in patients with type 2 diabetes using logistic regression and artificial neural network models

, , , , &
Pages 6317-6324 | Published online: 26 Nov 2018

Abstract

Objectives

Patients with type 2 diabetes (T2DM) are suggested to have a higher risk of developing pancreatic cancer. We used two models to predict pancreatic cancer risk among patients with T2DM.

Methods

The original data used for this investigation were retrieved from the National Health Insurance Research Database of Taiwan. The prediction models included the available possible risk factors for pancreatic cancer. The data were split into training and test sets: 97.5% of the data were used as the training set and 2.5% of the data were used as the test set. Logistic regression (LR) and artificial neural network (ANN) models were implemented using Python (Version 3.7.0). The F1, precision, and recall were compared between the LR and the ANN models. The areas under the receiver operating characteristic (ROC) curves of the prediction models were also compared.

Results

The metrics used in this study indicated that the LR model more accurately predicted pancreatic cancer than the ANN model. For the LR model, the area under the ROC curve in the prediction of pancreatic cancer was 0.727, indicating a good fit.

Conclusion

Using this LR model, our results suggested that we could appropriately predict pancreatic cancer risk in patients with T2DM in Taiwan.

Study highlights

What is current knowledge?

Type 2 diabetes has a higher risk of pancreatic cancer.

What is new here?

We used logistic and ANN models to predict pancreatic cancer.

Introduction

Pancreatic cancer is one of the most lethal malignancies because its early diagnosis is difficult, and most patients have already progressed to unresectable and incurable statuses at diagnosis.Citation1,Citation2 According to the GLOBOCAN 2012 estimates, pancreatic cancer ranked as the 11th most common cancer and the seventh leading cause of cancer death in both genders globally in 2012.Citation3 In Taiwan, although it is not one of the top 10 cancers, the age-adjusted incidence rate steadily increased from 4.63/100,000 persons in 2005 to 6.23/100,000 persons in 2015.Citation4 Moreover, it was the sixth and eighth leading cause of mortality from cancer among women and men in 2016, respectively.Citation5 Early detection and treatment are vital, considering the relatively poor survival rate compared with its incidence. Identification of the risk factors of pancreatic cancer and regular surveillance of high-risk groups may increase the opportunity of early diagnosis, which could lead to improvements in treatment outcome.

Risk factors of pancreatic cancer, namely, smoking, obesity, chronic pancreatitis, unhealthy diet, and heavy alcohol consumption, have been well documented,Citation6Citation8 but some studies have suggested that patients with type 2 diabetes (T2DM) are also more likely to develop pancreatic cancer.Citation6Citation12 The exact mechanisms that link this possible association have still not been fully determined. LiCitation11 stated that insulin resistance and associated hyperglycemia, hyperinsulinemia, and inflammation may play a role in the underlying mechanisms, thereby contributing to the development of diabetes-associated pancreatic cancer.

In this study, we used data from the National Health Insurance Research Database (NHIRD) of Taiwan and attempted to create a suitable model to help physicians evaluate and predict the risk of development of pancreatic cancer in patients with T2DM. Logistic regression (LR) and artificial neural network (ANN) models have been used to predict medical outcomes.Citation13,Citation14 This study aims to compare the effectiveness of LR and ANN models in predicting the development of pancreatic cancer.

Methods

Data source

The study cohort was selected from the Longitudinal Cohort of Diabetes Patients (LHDB) of the National Health Insurance (NHI) program. The database is anonymized. The LHDB comprises data of 1,700,000 randomly selected newly diagnosed T2DM (ICD-9 code 250.x0 and 250.x2) patients with longitudinally linked data available from 1997 to 2013. Patients who had at least two diagnoses of T2DM within a year were eligible for inclusion in the LHDB. Diseases in the claims data were coded using the ICD, ninth revision, clinical modification (ICD-9-CM). The study was approved by the Research Ethics Committee of China Medical University and Hospital in Taiwan (CMUH104-REC2-115-CR3).

Participants

Patients with newly diagnosed T2DM were identified from the period of 2000 to 2012 from the data set of the LHDB. The first diagnosis date was defined as the index date of T2DM. T2DM patients with a history of pancreatic cancer (ICD-9 code 157) before the index date, aged <20 years, or with incomplete demographic information were excluded.

Comorbidities and medications

The baseline comorbidities considered in this study were acute pancreatitis, chronic pancreatitis, alcohol-related illness, gallstone, cholecystectomy, cirrhosis, COPD, Helicobacter pylori infection, hepatitis B, hepatitis C, hypertension, hyperlipidemia, nephropathy, and obesity. The Charlson comorbidity index (CCI) was also determined for each participant from claims data of outpatient visits or hospitalizations before the index date. The CCI is a scoring system that weighs factors on crucial concomitant diseases; it has been validated for use with the ICD-9-CM-coded administrative database.Citation15,Citation16 We categorized CCI into the following four levels: 0, 1, 2, and 3 or more. To measure the severity of T2DM, we used the adapted Diabetes Complication Severity Index (aDCSI).Citation17 The aDCSI had seven categories, namely, retinopathy, nephropathy, neuropathy, cerebrovascular, cardiovascular, peripheral vascular disease, and metabolic. The progression of diabetes was defined as a yearly increase in aDCSI score from the date of T2DM diagnosis to the end of follow-up. Three progression groups were defined as having a yearly increase in scores less than 0–0.1, 0.1–0.3, and >0.3 per year. Medications that may be associated with pancreatic cancer were also evaluated, including statin and antidiabetic drugs. Antidiabetic drugs included insulin, sulfonylureas, metformin, and thiazolidinediones and other antidiabetic drugs.

Constructing training and data sets

The data comprised 1,358,634 data points, each of which represented one patient. The data were cleaned and one-hot encoded using RStudio. After data cleaning, 22 input features and two output features were obtained. The features included patient’s age, underlying diseases, aDCSI score, and medications. The positive output class represented diagnosis of pancreatic cancer, whereas the negative output class represented no diagnosis. The data were split into training and test sets: 97.5% of the data were used as the training set and 2.5% of the data were used as the test set. presents the allocation between the two data sets.

Table 1 Distribution of train and test sets

Algorithm and training

The average k-fold cross-validation accuracy, with a k-value of 10, was used as the metric to determine the optimal hyperparameters for the prediction models. The LR model used an L2 regularization penalty with primal formulation. The LIBLINEAR algorithm was used for the optimization problem.Citation18 The one-versus-rest scheme was used as the loss function. The LR model was trained for 100 iterations before convergence. The ANN model was a multilayer perceptron deep neural network. The model consisted of an input layer of 22 dimensions, two hidden layers of 22 dimensions, and an output layer of two dimensions. The model was trained using the stochastic gradient descent, with a mini batch size of 1. The model was optimized using Adam with the default parameters outlined by Kingma et al, with a learning rate of 0.01, the β1 value of 0.9, the β2 value of 0.999, and no decay rate.Citation19 The input and hidden layers used a scaled exponential linear unit activation function,Citation20 and the output layer used the Softmax activation function. Dropout of 20% was applied at the input layer and 50% at the output layer.Citation21 The categorical cross entropy function was used as the loss function. The neuron weights were initialized using normalized He initialization.Citation22 The ANN model was trained for 3,600 epochs.

Nondiagnosis of pancreatic cancer was prevalent in the output data. The ratio between patients with and without pancreatic cancer was 1:438.40. For the LR and ANN models, each data point in the positive class was weighted as 438.40 times greater than each data point in the negative class to ensure that the output of the prediction was not unbalanced.

The software was implemented using Python (version 3.7.0). The LR model was created and trained with the scikit-learn library (version 0.19.1)Citation23 and trained on an Intel Core i5 CPU. The ANN model was created and trained with the Tensorflow framework (version 1.8.0)Citation24 on an NVIDIA® Tesla K80 graphics processing unit through Google Cloud.

Statistical analyses

The baseline characteristics, comorbidities, and medications of the pancreatic cancer group and nonpancreatic cancer group were compared. The Chi-squared test and Student’s t-test were used to test the differences of categorical and continuous variables, respectively. All risk factors outlined in were included in the model.

Table 2 Baseline characteristics of T2DM patients with and without pancreatic cancer

We used the weighted average recall (sensitivity), precision (positive predictive value), and F1 (harmonic mean of recall and precision) values to evaluate the predictor performance instead of accuracy due to an unbalanced data distribution.Citation25 The F1, precision, and recall values were calculated for the test set and for all data using the scikit-learn library. In addition, the receiver operating characteristic (ROC) curve was used as a metric to measure predictor performance. The ROC was calculated between the outcome and the predicted probability of outcome by the prediction model.

The F1, precision, and recall, and area under the ROC curve were compared between the LR and ANN models. The area under the ROC curve of both prediction models was also compared with the ideal value of 1.Citation26

Results

Demographic features of patients

Overall, 1,358,634 participants were selected for this retrospective cohort study, including 3,092 pancreatic cancer patients and 1,355,542 nonpancreatic cancer patients (). The age distribution was different in both groups, with the mean age higher in the pancreatic cancer group than in the nonpancreatic cancer group (63.8 [SD=11.4] vs 57.3 [SD=14.2] years). Compared with the nonpancreatic cancer group, the patients with pancreatic cancer had more prevalent comorbidities, including acute pancreatitis, chronic pancreatitis, gallstone, cholecystectomy, and cirrhosis. The proportion of those with a CCI score of 3 or above was 11.2% in the pancreatic cancer group compared with 9.98% in the nonpancreatic cancer group. The major T2DM-related complications (namely retinopathy, nephropathy, neuropathy, cerebrovascular, and peripheral vascular disease) were more prevalent in the nonpancreatic cancer group than in the pancreatic cancer group. The mean aDCSI score was 2.23 (SD =1.99) in the pancreatic cancer group and 2.62 (SD=2.18) in the nonpancreatic cancer group. The mean follow-up periods were 3.84 (SD=3.44) years in the pancreatic cancer group and 6.87 (SD=3.87) years in the nonpancreatic cancer group.

Evaluation of predictor performance

The F1, precision, and recall values of the LR and ANN models across all data are outlined in . The F1 and recall values of the LR model were greater than all of those of the ANN model, whereas the precision values of the ANN model were greater than those of the LR model across all the data. The weighted k-fold cross-validation accuracies (k=10) of the LR and ANN models were 0.996 and 0.907, respectively.

Table 3 Accuracy analysis of LR and ANN models across all data set

and present the ROC curves of the LR and ANN models, respectively. The area under the ROC curve across all data for the LR and ANN models were 0.727 (95% CI: 0.718–0.735, standard error [SE]: 0.004) and 0.605 (95% CI: 0.595–0.615, SE: 0.05), respectively. The areas under ROC curves of both prediction models were significantly better than the null hypothesis area of 0.5. The area under the ROC curve of the LR model was significantly greater than the area under the ROC curve of the ANN model.

Figure 1 The ROC curve of the LR model.

Note: The AUC across all data for the LR model is 0.727.

Abbreviations: AUC, area under the ROC curve; LR, logistic regression; ROC, receiver operating characteristic.

Figure 1 The ROC curve of the LR model.Note: The AUC across all data for the LR model is 0.727.Abbreviations: AUC, area under the ROC curve; LR, logistic regression; ROC, receiver operating characteristic.

Figure 2 The ROC curve of the ANN model.

Note: The AUC curve across all data for the ANN model is 0.605.

Abbreviations: ANN, artificial neural network; AUC, area under the ROC curve; ROC, receiver operating characteristic.

Figure 2 The ROC curve of the ANN model.Note: The AUC curve across all data for the ANN model is 0.605.Abbreviations: ANN, artificial neural network; AUC, area under the ROC curve; ROC, receiver operating characteristic.

Discussion

In this study, we created two models to predict the risk of developing pancreatic cancer among patients with T2DM in Taiwan. The metrics used in this study indicated that the LR model achieved superior results to the ANN model in the prediction of pancreatic cancer.

Studies have suggested that patients with T2DM possess an elevated risk of developing pancreatic cancer.Citation6Citation12 In Taiwan, researchers used a traditional Cox proportional hazard model and the NHIRD to evaluate the pancreatic cancer risk among patients with T2DM and antidiabetic therapies and revealed a positive association;Citation27Citation30 however, TsengCitation31 indicated that this relationship was likely due to detection bias and confounders. Based on changes in glucose level, changes in weight, and age at the onset of diabetes, Sharma et alCitation32 developed a model to determine the risk of pancreatic cancer among patients with new-onset diabetes. The current study attempted to use predictive models to evaluate their possible linkage. ANN is a mathematical model imitating the structure and function of a biological neural network and is used to evaluate functions or approximate operations. It is the most commonly used “model” of artificial intelligence and can be used for prediction, forecasting, diagnosis, and decision making.Citation33,Citation34 By using the NHIRD, researchers have revealed that ANN is a suitable model to predict some diseases.Citation34,Citation35 However, our results indicated that the area under the ROC curve across all data for the ANN model was only 0.605; by contrast, the LR model achieved a superior performance in predicting pancreatic cancer in patients with T2DM. Furthermore, the F1 and recall values also indicated that the LR model was superior. LR may be used to predict the risk of developing a given disease as well. The outcome can be binomial, ordinal, or multinomial. Steyerberg et alCitation13 suggested that LR analysis can be used to develop a statistical model for a binary outcome. As most of our variables were categorical, LR was a suitable choice for modeling. However, TuCitation14 suggested that although neural networks generally have an accurate predictor performance, the performance of prediction models depends upon the characteristics of the data set. One characteristic of the current data set was that its outcome distribution was asymmetric. As a result, the prediction model may have overfitted the data, which could be solved by adding regularization to the model. In the present study, although the LR model outperformed the ANN model in the area under the ROC curve, F1 value, and recall value, only the precision value of the ANN model was higher than that of the LR model. This suggested that the ANN model may have overfitted the data despite dropout regularization and class weighting. Ayer et alCitation36 noted that ANNs are particularly useful when implicit interactions and complex relationships exist in the data, whereas LR models are the superior choice when statistical inferences must be drawn from the outputted data.

To our knowledge, this is the first national population-based study that used LR and ANN models to predict the risk of pancreatic cancer in patients with T2DM. This study was based on a nationwide representative sample, which increased its generalizability. In addition, the diagnoses of T2DM and pancreatic cancer were highly reliable because all the NHI claims were scrutinized by medical reimbursement specialists and peer reviewed to prevent errors and overutilization of medical resources. However, several limitations must still be addressed before the results are interpreted. First, we used outpatient and inpatient records of the ICD-9 code diagnosis of pancreatic cancer and defined patients with at least five consensus diagnoses to ensure the validity of diagnosis. We do not have the information in regard to the registry for catastrophic illness patients file and the registry for drug prescriptions file to maximize the accuracy of diagnosis of the pancreatic cancer. However, the reimbursement policy of NHIRD is universal and operated by a single buyer, the government in Taiwan. All the NHI claims were scrutinized by medical reimbursement specialists and peer reviewed to prevent errors and overutilization of medical resources, and medical providers face administrative sanction and high financial penalties if diagnostic claims do not agree with the standard diagnostic criteria used for medical reimbursement. Therefore, the diagnosed validity of “pancreatic cancer” based on ICD-9 codes in this study is highly reliable. In addition, some related studies with the same diagnostic method and criteria by ICD-9 coding were already been published.Citation27,Citation29,Citation30 Second, unlike the traditional Cox proportional hazard model, our predictive models could not provide valued levels (95% CIs and P-values) to assess the overall statistical significance of the predictions made by the prediction models. Instead, we used recall, precision, F1, and area under the ROC as metrics to evaluate the performance of the prediction models. Third, certain health-related behaviors such as smoking and drinking alcohol have been suggested to increase the risk of pancreatic cancerCitation6Citation8; however, the NHIRD did not contain any information regarding this. Therefore, we cannot confirm if adding these factors might have improved the values of the metrics or the ROC curves for the prediction models. Instead, we used alcohol-related illness to decrease the effect of alcohol in possible associations. Fourth, some undetermined factors, such as family history of pancreatic cancer, diet, and physical exercise, which may be related to pancreatic cancer,Citation7,Citation8 were also unavailable in the NHIRD for data extraction. We could not control these factors in the analyses either. Finally, the prediction models in this study did not take into account time series information, which means that we did not track the progression of particular subjects over time. We only considered individual subject data from 2000 to 2012.

Conclusion

This study compared models for the prediction of pancreatic cancer risk in patients with T2DM. Our analysis indicated that the LR model rather than ANN model provided a more appropriate method for predicting pancreatic cancer in patients with T2DM in Taiwan. Our findings may increase the prognosis of pancreatic cancer through surveillance, early diagnosis, and treatment in people with certain risk factors. Further investigations from other countries are required to determine if our findings are applicable elsewhere.

Abbreviations

ANN=

artificial neural network

ICD-9-CM=

ICD, ninth revision, clinical modification

LR=

logistic regression

NHIRD=

National Health Insurance Research Database

T2DM=

type 2 diabetes

Acknowledgments

This work was supported by grants from the Ministry of Health and Welfare, Taiwan (MOHW107-TDU-B-212-123004), China Medical University Hospital (DMR-107-192), Academia Sinica Stroke Biosignature Project (BM10701010021), MOST Clinical Trial Consortium for Stroke (MOST 107-2321-B-039 -004-), Tseng-Lien Lin Foundation, Taichung, Taiwan, and Katsuzo and Kiyo Aoshima Memorial Funds, Japan. The funders had no role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript. No additional external funding was received for this study.

Author contributions

All authors have contributed significantly, and all authors are in agreement with the content of the manuscript. MHH, LMS, and CHK contributed to the conception/design. CHK contributed to the provision of study materials. All authors contributed to data analysis, drafting or revising the article, gave final approval of the version to be published, and agree to be accountable for all aspects of the work.

Disclosure

The authors report no conflicts of interest in this work.

References

  • StathopoulosGPAndroulakisNSouglakosJStathopoulosJGeorgouliasVPresent treatment and future expectations in advanced pancreatic cancerAnticancer Res2008282B1303130818505070
  • YousafUChristensenMLEngholmGStormHHSuicides among Danish cancer patients 1971–1999Br J Cancer2005926995100015756279
  • IlicMIlicIEpidemiology of pancreatic cancerWorld J Gastroenterol201622449694970527956793
  • Cancer Statistics [homepage on the Internet]Cancer Incidence TrendsTaiwan Cancer Registry Available from: http://tcr.cph.ntu.edu.tw/main.php?Page=A5B2AccessedJuly 8, 2018
  • Cancer Statistics Annual Report [homepage on the Internet]Taiwan Cancer Registry Available from: http://tcr.cph.ntu.edu.tw/main.php?Page=N2AccessedJuly 8, 2018
  • PandolSGukovskayaAEdderkaouiMEpidemiology, risk factors, and the promotion of pancreatic cancer: role of the stellate cellJ Gastroenterol Hepatol201227Suppl 212713422320930
  • BeckerAEHernandezYGFruchtHLucasALPancreatic ductal adenocarcinoma: risk factors, screening, and early detectionWorld J Gastroenterol20142032111821119825170203
  • HassanMMBondyMLWolffRARisk factors for pancreatic cancer: case-control studyAm J Gastroenterol2007102122696270717764494
  • Carreras-TorresRJohanssonMGaborieauVThe role of obesity, type 2 diabetes, and metabolic factors in pancreatic cancer: a Mendelian randomization studyJ Natl Cancer Inst20171099
  • MakhoulIYacoubASiegelEType 2 diabetes mellitus is associated with increased risk of pancreatic cancer: a veteran administration registry studySAGE Open Med20164205031211668225
  • LiDDiabetes and pancreatic cancerMol Carcinog2012511647422162232
  • SongSWangBZhangXLong-term diabetes mellitus is associated with an increased risk of pancreatic cancer: a meta-analysisPLoS One2015107e013432126222906
  • SteyerbergEWEijkemansMJHarrellFEHabbemaJDPrognostic modelling with logistic regression analysis: a comparison of selection and estimation methods in small data setsStat Med20001981059107910790680
  • TuJVAdvantages and disadvantages of using artificial neural networks versus logistic regression for predicting medical outcomesJ Clin Epidemiol19964911122512318892489
  • CharlsonMEPompeiPAlesKLMackenzieCRA new method of classifying prognostic comorbidity in longitudinal studies: development and validationJ Chronic Dis19874053733833558716
  • DeyoRACherkinDCCiolMAAdapting a clinical comorbidity index for use with ICD-9-CM administrative databasesJ Clin Epidemiol19924566136191607900
  • YoungBALinEvon KorffMDiabetes complications severity index and risk of mortality, hospitalization, and healthcare utilizationAm J Manag Care2008141152318197741
  • FanREChangKWHsiehCJLIBLINEAR: a library for large linear classificationJ Mach Learn Res2008918711874
  • KingmaDPBaLJAdam: A method for stochastic optimization3rd International Conference for Learning RepresentationsMay 7–9, 2015San Diego, CA
  • KlambauerGUnterthinerTMaytASelf-normalizing neural networksAdvances in Neural Information Processing Systems2017971980
  • SrivastavaNHintonGKrizhevskyADropout: a simple way to prevent neural networks from overfittingJ Mach Learn Res20141519291958
  • HeKZhangXRenSDelving deep into rectifiers: surpassing human-level performance on imagenet classificationProceedings of the IEEE International Conference on Computer Vision201510261034
  • PedregosaFVaroquauxGGramfortAScikit-learn: machine learning in pythonJ Mach Learn Res20111228252830
  • AbadiMBarhamPChenJTensorFlow: a system for large-scale machine learning12th USENIX Symposium on Operating Systems Design and Implementation (OSDI’16)November 2–4, 2016Savannah, GA, USA265283
  • HeHGarciaEALearning from imbalanced dataIEEE Trans Knowl Data Eng20092112631284
  • HanleyJAMcneilBJThe meaning and use of the area under a receiver operating characteristic (ROC) curveRadiology1982143129367063747
  • LiaoKFLaiSWLiCIChenWCCiLDiabetes mellitus correlates with increased risk of pancreatic cancer: a population-based cohort study in TaiwanJ Gastroenterol Hepatol201227470971321929650
  • ErKCHsuCYLeeYKHuangMYSuYCEffect of glycemic control on the risk of pancreatic cancer: a nationwide cohort studyMedicine20169524e392127311001
  • ChenMJTsanYTLiouJMStatins and the risk of pancreatic cancer in Type 2 diabetic patients: a population-based cohort studyInt J Cancer2016138359460326296262
  • KaoCHSunLMChenPCA population-based cohort study in Taiwan: use of insulin sensitizers can decrease cancer risk in diabetic patients?Ann Oncol201324252353023110810
  • TsengCHNew-onset diabetes with a history of dyslipidemia predicts pancreatic cancerPancreas2013421424822750971
  • SharmaAKandlakuntaHNagpalSJSModel to determine risk of pancreatic cancer in patients with new-onset diabetesGastroenterology2018155373073929775599
  • AhmedFEArtificial neural networks for diagnosis and survival prediction in colon cancerMol Cancer200542916083507
  • ChengCAChiuHWAn artificial neural network model for the evaluation of carotid artery stenting prognosis using a national-wide databaseConf Proc IEEE Eng Med Biol Soc201720172566256929060423
  • RauHHHsuCYLinYADevelopment of a web-based liver cancer prediction model for type II diabetes patients by using an artificial neural networkComput Methods Programs Biomed2016125586526701199
  • AyerTChhatwalJAlagozOInformatics in radiology: comparison of logistic regression and artificial neural network models in breast cancer risk estimationRadiographics2010301132219901087