ORIGINAL RESEARCH

Machine Learning for Prediction of Non-Small Cell Lung Cancer Based on Inflammatory and Nutritional Indicators in Adults: A Cross-Sectional Study

Pages 527-535 | Received 11 Jan 2024, Accepted 23 May 2024, Published online: 30 May 2024

Abstract

Purpose

The aim of this study was to evaluate the potential value of blood inflammatory markers in the diagnosis of non-small cell lung cancer (NSCLC) and to propose a machine-learning-based method for predicting NSCLC in asymptomatic adults.

Patients and Methods

A cross-sectional study was conducted using medical records of 139 patients with non-small cell lung cancer and physical examination data of 198 healthy controls, collected from May 2022 to May 2023. The NSCLC cohort comprised 128 cases of adenocarcinoma, 3 cases of squamous cell carcinoma, and 8 cases of other NSCLC subtypes. The correlation between non-small cell lung cancer and inflammatory and nutritional markers, such as monocytes, neutrophils, LMR, NLR, PLR, and PHR, was examined. Features were selected using Python’s feature selection library and analyzed by five algorithms. The predictive ability of the model for non-small cell lung cancer diagnosis was assessed by precision, accuracy, recall, F1 score, and area under the curve (AUC).

Results

The results showed that the top 14 important factors were PDW, age, TP, RBC, HGB, LYM, LYM%, RDW, PLR, LMR, PHR, MONO, MONO%, gender. Additionally, the naive Bayes (NB) algorithm demonstrated the highest overall performance in predicting adult NSCLC among the five machine learning algorithms, achieving an accuracy of 0.87, a macro average F1 score of 0.85, a weighted average F1 score of 0.87, and an AUC of 0.84.

Conclusion

In feature ranking, platelet distribution width was the most important feature, and the NB algorithm performed best in predicting adult NSCLC diagnosis.

Introduction

According to the International Agency for Research on Cancer (IARC) estimates of new cancer cases and deaths worldwide in 2020, lung cancer continues to be a significant burden of global morbidity and mortality. It has become the most common malignancy in men and the second most common malignancy in women, and it remains the leading cause of cancer deaths (18.0% of all cancer deaths).Citation1 Among lung cancers, non-small cell lung cancer (NSCLC) is the most common histologic type, accounting for nearly 85% of all diagnosed lung cancers. The clinical prognosis of lung cancer is directly related to the stage of the cancer at the time of diagnosis. The 5-year survival rate for patients with stage I lung cancer is 68.4%, whereas the 5-year survival rate for patients with stage IV lung cancer is only 5.8%.Citation2,Citation3 Unfortunately, most patients with NSCLC are not diagnosed until advanced stage III, which is associated with a poorer prognosis.Citation4 Recently, there has been increasing evidence that inflammation may contribute to the onset, tumor staging, and progression of malignant tumors.Citation5,Citation6 In addition, studies have analyzed the impact of hematological indicators such as monocytes, lymphocytes, neutrophils, albumin, and hemoglobin on cancer detection and prognosis. These indicators can be detected in peripheral blood and have the potential to assist in the diagnosis of lung cancer. 
Ratios derived from these measurements have also been found to support cancer diagnosis and prognosis prediction better than single indicators.Citation7 For example, the monocyte-to-albumin ratio (MAR), neutrophil percentage-to-hemoglobin ratio (NPHR), neutrophil-to-lymphocyte ratio (NLR), and lymphocyte-to-monocyte ratio (LMR) have been reported by several published studies to play an important role in cancer prediction and diagnosis.Citation8 However, the clinical significance of the platelet-to-hemoglobin ratio (PHR) and the monocyte-to-hemoglobin ratio (MHR) in NSCLC has not been fully investigated.

With the development of artificial intelligence (AI), machine learning (ML) has made technological breakthroughs in analyzing clinical data in the biomedical field and has been widely used to predict the diagnosis and prognosis of diseases.Citation9 Therefore, we collected inflammation-related data from preoperative peripheral blood indices, including leukocytes, lymphocytes, neutrophils, monocytes, and platelet count, and calculated their ratios, including the lymphocyte-to-monocyte ratio (LMR), neutrophil-to-lymphocyte ratio (NLR), neutrophil-to-monocyte ratio (NMR), platelet-to-lymphocyte ratio (PLR), monocyte-to-albumin ratio (MAR), monocyte-to-hemoglobin ratio (MHR), and platelet-to-hemoglobin ratio (PHR), to develop a machine learning method for predicting the risk of adult non-small cell lung cancer. On this basis, five machine learning algorithms, namely logistic regression (LR), decision tree (DT), AdaBoost (AB), random forest (RF), and naive Bayes (NB), were used to build prediction models and to explore the clinical significance of inflammatory markers for NSCLC diagnosis. The diagnostic ability of the model was assessed using the area under the ROC curve, sensitivity, and specificity. The aim was to explore the added value of single inflammatory markers and ratio combinations of these measurements in the diagnosis of NSCLC. This will provide a new theoretical basis for future screening and diagnostic studies of NSCLC.

Material and Methods

Study Design and Participants

This retrospective analysis included medical record data from 139 patients diagnosed with NSCLC by surgery and 198 healthy subjects who underwent physical examination at Deyang People’s Hospital (Sichuan Province, China) between May 2022 and May 2023. Patients with NSCLC were diagnosed by cytology or histopathology. Healthy subjects were those who had no suspicious lesions on chest CT examination. The study was approved by the Ethics Committee of Deyang People’s Hospital. As this was a retrospective study, the requirement of informed consent was waived. The data were anonymized.

Inclusion and Exclusion Criteria

Inclusion criteria: (1) the case group consisted of patients with a surgical and pathological diagnosis of NSCLC, and the control group consisted of healthy subjects who showed no evidence of suspicious lesions on chest CT; (2) aged 18 years or above; (3) no previous history of cancer; (4) no antitumor therapy prior to examination.

Exclusion criteria: (1) acute or chronic inflammatory diseases; (2) cardiovascular and cerebrovascular diseases; (3) chronic obstructive pulmonary disease; (4) diabetes mellitus; (5) hematological diseases; (6) history of other malignant tumors; and (7) incomplete records of clinical and/or auxiliary examinations.

Collecting Clinical Data

Clinical data included participants’ identification numbers, age, sex, smoking history, blood cell counts, histopathology, and diagnostic information. The derived hematological indices were then calculated from these values. NLR is the ratio of the absolute neutrophil count to the absolute lymphocyte count; PLR is the ratio of the platelet count to the lymphocyte count; LMR is the ratio of lymphocytes to monocytes; NMR is the ratio of neutrophils to monocytes; MAR is the ratio of monocytes to albumin; TAR is the ratio of total bilirubin to albumin; MHR is the ratio of monocytes to hemoglobin; and PHR is the ratio of platelets to hemoglobin.
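The ratio definitions above translate directly into a few column operations. The following is a minimal sketch in pandas, not the authors' code: the function name `add_ratio_features` and the column names (`NEU`, `LYM`, `MONO`, `PLT`, `ALB`, `TBIL`, `HGB`) are hypothetical, and the single example row is made up for illustration.

```python
import pandas as pd


def add_ratio_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with the derived ratios appended.

    Column names are assumptions made for this sketch; a real data set
    would need its own column mapping and consistent units.
    """
    out = df.copy()
    out["NLR"] = out["NEU"] / out["LYM"]    # neutrophil-to-lymphocyte ratio
    out["PLR"] = out["PLT"] / out["LYM"]    # platelet-to-lymphocyte ratio
    out["LMR"] = out["LYM"] / out["MONO"]   # lymphocyte-to-monocyte ratio
    out["NMR"] = out["NEU"] / out["MONO"]   # neutrophil-to-monocyte ratio
    out["MAR"] = out["MONO"] / out["ALB"]   # monocyte-to-albumin ratio
    out["TAR"] = out["TBIL"] / out["ALB"]   # total bilirubin-to-albumin ratio
    out["MHR"] = out["MONO"] / out["HGB"]   # monocyte-to-hemoglobin ratio
    out["PHR"] = out["PLT"] / out["HGB"]    # platelet-to-hemoglobin ratio
    return out


# One made-up participant (illustrative values only, not study data).
example = pd.DataFrame(
    {
        "NEU": [4.0],
        "LYM": [2.0],
        "MONO": [0.5],
        "PLT": [250.0],
        "ALB": [40.0],
        "TBIL": [12.0],
        "HGB": [125.0],
    }
)
ratios = add_ratio_features(example)
print(ratios[["NLR", "PLR", "LMR", "NMR", "PHR"]])
```

Working on a copy keeps the raw measurements untouched, which makes it easier to audit the derived columns against the source records.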

Statistical Analysis

Statistical analyses were performed using SPSS version 23.0 (SPSS Inc., Chicago, USA), and diagnostic predictive models were constructed using Python 3.7.6. Variables that conformed to a normal distribution were expressed as mean ± standard deviation and compared with Student's t-test. Variables that did not fit the normal distribution were expressed as median and interquartile range and analyzed with the independent samples Kruskal–Wallis U-test. Categorical variables were expressed as frequencies and proportions and analyzed with the chi-square test. Risk factors for NSCLC were determined by binary logistic regression. P < 0.05 was considered statistically significant.
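The group comparisons described above were run in SPSS; the sketch below shows one way the same decision logic could look in Python with scipy. It is an illustration under stated assumptions: all values are synthetic, the Shapiro-Wilk normality check is an assumption (the text does not say how normality was assessed), and `is_normal` and `compare_continuous` are hypothetical helper names.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic stand-ins for two groups (139 cases, 198 controls). NOT study data.
age_cases = rng.normal(60, 10, 139)
age_controls = rng.normal(50, 10, 198)
nlr_cases = rng.lognormal(0.9, 0.4, 139)
nlr_controls = rng.lognormal(0.6, 0.4, 198)
# Made-up 2x2 table: rows = group (cases, controls), columns = sex.
sex_table = np.array([[60, 79], [120, 78]])


def is_normal(sample, alpha=0.05):
    """Shapiro-Wilk check; the choice of normality test is an assumption."""
    return stats.shapiro(sample).pvalue > alpha


def compare_continuous(a, b):
    """Use a t-test when both groups look normal, otherwise a rank test."""
    if is_normal(a) and is_normal(b):
        return "t-test", stats.ttest_ind(a, b).pvalue
    # scipy's Kruskal-Wallis test; with two groups it is closely related
    # to the Mann-Whitney U test.
    return "Kruskal-Wallis", stats.kruskal(a, b).pvalue


test_age, p_age = compare_continuous(age_cases, age_controls)
test_nlr, p_nlr = compare_continuous(nlr_cases, nlr_controls)
chi2, p_sex, dof, expected = stats.chi2_contingency(sex_table)

print(test_age, p_age)
print(test_nlr, p_nlr)
print("chi-square", p_sex)
```

Variables with P < 0.05 in such univariate comparisons would then be candidates for the binary logistic regression step.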

Machine Learning-Based Diagnostic Model

In this study, five ML algorithms were used for model selection, namely logistic regression (LR), decision tree (DT), AdaBoost (AB), random forest (RF), and naive Bayes (NB). All machine learning steps were carried out in a Python environment (version 3.9.0) using the sklearn, numpy, pandas, matplotlib, seaborn, and scipy packages. The data set was first imported into a pandas data frame and then split into a training set (80%) and a test set (20%) using the train_test_split function, so that the models were built on one part of the data and evaluated on the other. The independent variables were standardized to zero mean and unit variance using the StandardScaler class from the sklearn library. To limit overfitting, we employed 5-fold cross-validation during training. Feature selection was used to rank and prioritize the most significant predictors within the data set, and the impact of these features was evaluated by calculating permutation importance. Additionally, we manually fine-tuned the parameters of each model. After training, the performance of each model was evaluated and compared. Accuracy, precision, recall, and F1 score were the evaluation metrics.Citation10 These metrics are derived from four counts: a true positive (TP) is a positive sample predicted as positive, a false positive (FP) is a negative sample falsely predicted as positive, a true negative (TN) is a negative sample predicted as negative, and a false negative (FN) is a positive sample falsely classified as negative by the model. Code related to this research can be downloaded from GitHub (https://github.com/ganbingliangyi/ganbingliangyi).
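The workflow described above (80/20 split, standardization, 5-fold cross-validation, five classifiers, permutation importance) can be sketched with scikit-learn as follows. This is not the code from the authors' repository. It runs on synthetic data from `make_classification`, uses default hyperparameters apart from `max_iter` and `random_state`, and assumes `GaussianNB` as the naive Bayes variant, which the text does not specify.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the 337-participant, 14-feature data set (NOT study data).
X, y = make_classification(
    n_samples=337, n_features=14, n_informative=6,
    weights=[0.59, 0.41], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Default hyperparameters are an assumption; the paper tuned them manually.
models = {
    "LR": LogisticRegression(max_iter=1000),
    "DT": DecisionTreeClassifier(random_state=0),
    "AB": AdaBoostClassifier(random_state=0),
    "RF": RandomForestClassifier(random_state=0),
    "NB": GaussianNB(),  # assumed variant of naive Bayes
}

results = {}
for name, clf in models.items():
    # Scaling lives inside the pipeline, so each CV fold is scaled using
    # statistics from its own training part only.
    pipe = make_pipeline(StandardScaler(), clf)
    cv_acc = cross_val_score(pipe, X_train, y_train, cv=5, scoring="accuracy").mean()
    pipe.fit(X_train, y_train)
    y_pred = pipe.predict(X_test)
    results[name] = {
        "cv_accuracy": cv_acc,
        "test_accuracy": accuracy_score(y_test, y_pred),
        "macro_f1": f1_score(y_test, y_pred, average="macro"),
    }

# Permutation importance of each feature for one fitted model, on held-out data.
nb_pipe = make_pipeline(StandardScaler(), GaussianNB()).fit(X_train, y_train)
perm = permutation_importance(nb_pipe, X_test, y_test, n_repeats=10, random_state=0)
ranking = perm.importances_mean.argsort()[::-1]  # feature indices, most important first

for name, scores in results.items():
    print(name, {k: round(v, 3) for k, v in scores.items()})
```

Placing the scaler inside the pipeline is a design choice of this sketch: it avoids leaking test-fold statistics into training, which matters with a sample of this size.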

Results

Participant Characteristics

The baseline data of the two groups are shown in Table 1. A total of 337 participants were included in this study, comprising 139 patients with NSCLC and 198 subjects with no detectable disease (Table 1). The patients with NSCLC consisted of 128 patients with adenocarcinoma, 3 patients with squamous cell carcinoma, and 8 patients with other NSCLC. Univariate analysis revealed that the NSCLC group included a higher proportion of females than males, with a statistically significant difference in gender distribution between the two groups. Compared with healthy volunteers, monocytes and monocyte percentage were significantly higher in NSCLC patients (P < 0.05), while hemoglobin, albumin, total bilirubin, total protein, RBC, and PDW were decreased (P < 0.05). NLR, PLR, MAR, MHR, and PHR levels were significantly higher in NSCLC patients compared to controls (P < 0.05). However, LMR, TAR, and NMR were higher in healthy volunteers than in NSCLC patients (shown in Table 1).

Table 1 Participant Characteristics

Analysis of Multifactor Logistic Regression

The variables that showed statistical significance in Table 1 were included in the multivariate logistic regression analysis. The following factors were identified as independently associated with non-small cell lung cancer (NSCLC): gender, age, red blood cell count (RBC), platelet distribution width (PDW), total protein (TP), and neutrophil-to-lymphocyte ratio (NLR) (shown in Table 2).

Table 2 Multivariate Regression Analysis of Independent Factors for NSCLC

Ranking of Influencing NSCLC Feature

Non-small cell lung cancer was set as the target variable. Importance scores for all risk factors were calculated based on patient characteristics using the “feature selector” method. The importance ranking of all variables was done using the Gini scale. Higher importance scores indicated a greater impact on NSCLC occurrence. We selected the first 14 indicators as the characterization factors to build the risk prediction model. The order of importance of the 14 variables is shown in Figure 1. The cumulative importance score of these 14 features was subsequently computed and is presented in Table 3. The cumulative importance of PDW, age, TP, RBC, HGB, LYM, LYM%, RDW, PLR, LMR, and PHR approached 0.9 (Figure 2).
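To make the idea of cumulative importance concrete, the sketch below ranks features by Gini-based (impurity) importance and accumulates the scores until a 0.9 threshold is reached. It is a stand-in, not the authors' “feature selector” procedure: it uses a scikit-learn random forest on synthetic data, and the feature names are placeholders.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with placeholder feature names (NOT study data).
X, y = make_classification(
    n_samples=337, n_features=20, n_informative=8, random_state=0
)
names = [f"feature_{i}" for i in range(X.shape[1])]

# Impurity-based importances from a random forest; they are normalized to sum to 1.
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
importance = pd.Series(forest.feature_importances_, index=names)
importance = importance.sort_values(ascending=False)

# Running total of importance, from the most to the least important feature.
cumulative = importance.cumsum()

# Smallest number of top-ranked features whose cumulative importance reaches 0.9.
n_for_90 = int((cumulative < 0.9).sum()) + 1

# Fixed-size selection, mirroring the choice of the first 14 indicators.
top14 = importance.index[:14].tolist()

print(cumulative.head(n_for_90))
print(top14)
```

A cumulative threshold and a fixed top-k cut are two different selection rules; the text reports both the 14 selected indicators and the point at which cumulative importance approaches 0.9.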

Table 3 The Cumulative Importance Score of Features

Figure 1 The rank of the importance of indicators in NSCLC.

Figure 2 The cumulative importance of factors.

Prediction Model of ML

To further validate the potential of these indicators as diagnostic indicators for NSCLC, we selected the top 14 indicators (PDW, age, TP, RBC, HGB, LYM, LYM%, RDW, PLR, LMR, PHR, MONO, MONO%, gender) as the characterization factors and built risk prediction models with the RF, LR, DT, AB, and NB algorithms. The ROC curves of the five algorithms (RF (AUC = 0.77), LR (AUC = 0.74), DT (AUC = 0.67), AB (AUC = 0.87), and NB (AUC = 0.84)) are shown in Figure 3. The AUC values of all five models were greater than 0.6, indicating that these models have moderate potential in the diagnosis of non-small cell lung cancer patients. In addition, Table 4 shows the accuracy, precision, recall, and F1 score of the ML models; the RF, LR, DT, AB, and NB models predicted non-small cell lung cancer with an accuracy of 79%, 76%, 69%, 78%, and 87%, respectively. Although AB had the highest AUC, the naive Bayes (NB) algorithm demonstrated the highest overall performance in predicting adult NSCLC among the five machine learning algorithms, achieving an accuracy of 0.87, a macro average F1 score of 0.85, a weighted average F1 score of 0.87, and an AUC of 0.84.
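Because the results report both macro and weighted averages alongside AUC, a small worked example may help show how these quantities differ. The labels and scores below are made up (10 samples, 6 negatives and 4 positives) and are unrelated to the study data; the 0.5 decision threshold is an assumption of this sketch.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    roc_auc_score,
)

# Made-up labels and predicted probabilities (NOT study data).
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_score = np.array([0.10, 0.20, 0.30, 0.35, 0.40, 0.70, 0.80, 0.90, 0.60, 0.45])
y_pred = (y_score >= 0.5).astype(int)  # assumed threshold of 0.5

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
accuracy = accuracy_score(y_true, y_pred)

# Macro F1: unweighted mean of the per-class F1 scores.
macro_f1 = f1_score(y_true, y_pred, average="macro")
# Weighted F1: per-class F1 scores weighted by class frequency.
weighted_f1 = f1_score(y_true, y_pred, average="weighted")

# AUC is computed from the scores, not the thresholded predictions: it is the
# fraction of (positive, negative) pairs in which the positive scores higher.
auc = roc_auc_score(y_true, y_score)

print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
print(f"accuracy={accuracy:.3f} macro_f1={macro_f1:.3f} "
      f"weighted_f1={weighted_f1:.3f} auc={auc:.3f}")
```

In this toy case the per-class F1 scores are 5/6 for the negative class and 3/4 for the positive class, so the macro average (about 0.792) sits below the weighted average (0.800) because the larger class is also the better-classified one. AUC (22 of 24 pairs ranked correctly) is independent of the threshold, which is why a model can lead on AUC without leading on accuracy.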

Table 4 Comparison of the Models in Terms of Precision, Recall, F1 Value and AUC

Figure 3 The ROC-AUC graph.

Discussion

In this study, records of 139 NSCLC patients and 198 healthy individuals were retrospectively analyzed. The analysis showed that elevated levels of monocytes, monocyte percentage, NLR, PLR, MAR, MHR, and PHR and decreased levels of hemoglobin, albumin, total bilirubin, total protein, RBC, PDW, LMR, TAR, and NMR were risk factors for NSCLC. Further multivariate analysis identified gender, age, RBC, PDW, TP, and NLR as factors independently associated with non-small cell lung cancer (NSCLC), which is consistent with previous studies.Citation5 Studies have suggested that inflammation may play an important role in the development and progression of cancer.Citation11–14 For example, monocytes play an important role in tumor development and metastasis by linking innate and adaptive immune responses and by promoting immunosuppression, remodeling of the extracellular matrix, angiogenesis, and tumor cell conductance.Citation15 One study observed that elevated NLR and PLR and decreased LMR were associated with increased lung cancer risk.Citation5 In addition, to our knowledge, this study is the first to use hematological parameters such as MHR and PHR to predict NSCLC. Albumin and hemoglobin can reflect nutritional status. Hypoalbuminemia is usually the result of insufficient nutrient absorption and excessive tumor consumption, which affects metabolism and immune function. When hemoglobin concentration decreases, tumor cells in a state of hypoxia promote tumor growth.Citation16 Our data showed that the ratio of inflammatory indicators to nutritional indicators (including PHR, MHR, and MAR) was higher in NSCLC patients than in healthy subjects, which may be a new independent factor for predicting NSCLC. Further analysis revealed that MAR, MHR, and PHR may have potential value in distinguishing NSCLC patients from the healthy population. 
Previous studies have shown that MAR is an independent risk factor for NSCLC, which is consistent with our results.Citation8 Moreover, a combination of inflammatory and nutritional indicators may be better than a single inflammatory indicator in diagnosing NSCLC patients. Our study also ranked the characteristics affecting NSCLC, and PDW, age, total protein, PHR, and RBC were the top five ranked characteristics.

Recently, machine learning has been more widely applied in predicting diseases in the medical field,Citation17 particularly in medical imaging and decision support systems.Citation18,Citation19 Machine learning can effectively learn the features of huge amounts of data, which provides new research ideas and methods for precise prediction.Citation20 Many ML algorithms have been used for clinical applications, including support vector machines (SVMs), random forests (RFs), neural networks (NNs), decision trees (DTs), and other algorithms.Citation21 ML techniques have allowed the development of forecasting models for predicting cancer diagnosis and clinical outcomes.Citation22,Citation23 Research indicates that the adoption of ML for other cancers substantially improves prediction accuracy.Citation24 In addition, a previous study developed ML models for lung cancer classification and prediction based on a decision tree, RF, logistic regression, SVM, naive Bayes, and a neural network, and the results showed that the neural network, random forest, and naive Bayes models performed well on the baseline data. The classification accuracies of these models were 0.767, 0.718, and 0.688, respectively, and the AUCs were 0.793, 0.779, and 0.771.Citation25

In this study, five machine learning models, namely LR, DT, AB, RF, and NB, were built to assess the diagnostic value of these indicators in patients with NSCLC. Owing to the imbalanced data set, we used a comprehensive set of scoring indicators, including accuracy, precision, recall, F1 score, and AUC, to evaluate model performance. In our research, the naive Bayes algorithm was determined to have the best prediction ability, with 0.87 accuracy, 0.86 macro-avg precision, 0.87 weighted-avg precision, 0.84 macro-avg recall, 0.87 weighted-avg recall, 0.85 macro-avg F1 score, 0.87 weighted-avg F1 score, and 0.84 AUC. In addition, the clinical importance of machine learning lies in the detection of risk factors that are closely associated with NSCLC. According to the permutation importance of feature variables, PDW, age, TP, RDW, MONO percentage, RBC, LYM percentage, PLR, MONO, LYM, HGB, gender, PHR, and LMR were critical to NSCLC. According to this study, this model can help to improve the rate of diagnosis of NSCLC. With successful machine learning models, cancer outcomes can be improved, and hospitalization and healthcare expenses can be reduced through early diagnosis. Therefore, a machine learning model with the ability to predict lung cancer is required.

However, there are some limitations in our research. Firstly, this is a cross-sectional study based on medical records from a single center, which may differ from other centers in treatment strategies, ethnicity, and other factors. For example, the majority of the non-small cell lung cancers included in the sample were adenocarcinoma, accounting for 92.09%, which may not adequately represent the comprehensive characteristics of all non-small cell lung cancers. Secondly, potential factors that were not included may also have a certain impact on the results, and the models may be biased because some input features could affect their accuracy. Thirdly, although the risk of bias was considered and calibrated in this study by two internal validation methods (split-sample validation and n-fold cross-validation), a larger sample size and prospective research are needed to further verify the practical use of the model in future studies.

Conclusion

In recent years, artificial intelligence has been widely used in the medical field. Timely and accurate diagnosis is crucial in order to select the most appropriate treatment and reduce misdiagnosis of NSCLC patients. Machine learning models can be used to predict diagnosis and survival, and they are now used in many other fields to predict outcomes. In this study, NSCLC prediction was evaluated using five machine learning models. The NB model showed the best overall performance in predicting NSCLC in adults, with the highest accuracy and F1 scores among the five models. Our results lay the foundation for an early warning system that can provide clinicians with relevant information for clinical decision-making.

Institutional Review Board Statement

The study was conducted according to the guidelines of the Declaration of Helsinki and approved by the Ethics Committee of Deyang People’s Hospital (protocol code: 2022-04-097-K01; date of approval: December 28, 2022).

Informed Consent Statement

As the paper was a retrospective study, the requirement of informed consent was waived.

Author Contributions

All authors made a significant contribution to the work reported, whether that is in the conception, study design, execution, acquisition of data, analysis and interpretation, or in all these areas; took part in drafting, revising or critically reviewing the article; gave final approval of the version to be published; have agreed on the journal to which the article has been submitted; and agree to be accountable for all aspects of the work.

Disclosure

The authors declare no conflicts of interest in this work.

Additional information

Funding

This research was funded by the Technology Research and Development Project of the Deyang Science and Technology Bureau, grant number [2022SCZ137].

References

  • Sung H, Ferlay J, Siegel RL, et al. Global cancer statistics 2020: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries. CA Cancer J Clin. 2021;71(3):209–249. doi:10.3322/caac.21660
  • Goldstraw P, Chansky K, Crowley J, et al. The IASLC lung cancer staging project: proposals for revision of the TNM stage groupings in the forthcoming (eighth) edition of the TNM classification for lung cancer. J Thorac Oncol. 2016;11(1):39–51. doi:10.1016/j.jtho.2015.09.009
  • Mithoowani H, Febbraro M. Non-small-cell lung cancer in 2022: a review for general practitioners in oncology. Curr Oncol. 2022;29(3):1828–1839. doi:10.3390/curroncol29030150
  • Liu G, Pei F, Yang F, et al. Role of autophagy and apoptosis in non-small-cell lung cancer. Int J Mol Sci. 2017;18(2).
  • Nøst TH, Alcala K, Urbarova I, et al. Systemic inflammation markers and cancer incidence in the UK Biobank. Eur J Epidemiol. 2021;36(8):841–848.
  • Qian S, Golubnitschaja O, Zhan X. Chronic inflammation: key player and biomarker-set to predict and prevent cancer development and progression based on individualized patient profiles. EPMA J. 2019;10(4):365–381. doi:10.1007/s13167-019-00194-x
  • Xu F, Xu P, Cui W, et al. Neutrophil-to-lymphocyte and platelet-to-lymphocyte ratios may aid in identifying patients with non-small cell lung cancer and predicting tumor-node-metastasis stages. Oncol Lett. 2018;16(1):483–490. doi:10.3892/ol.2018.8644
  • Zhao ST, Chen XX, Yang XM, He SC, Qian FH. Application of monocyte-to-albumin ratio and neutrophil percentage-to-hemoglobin ratio on distinguishing non-small cell lung cancer patients from healthy subjects. Int J Gen Med. 2023;16:2175–2185. doi:10.2147/IJGM.S409869
  • Obermeyer Z, Emanuel EJ. Predicting the future - big data, machine learning, and clinical medicine. N Engl J Med. 2016;375(13):1216–1219. doi:10.1056/NEJMp1606181
  • Ngabo D, Dong W, Ibeke E, Iwendi C, Masabo E. Tackling pandemics in smart cities using machine learning architecture. Math Biosci Eng. 2021;18(6):8444–8461. doi:10.3934/mbe.2021418
  • Mantovani A, Allavena P, Sica A, Balkwill F. Cancer-related inflammation. Nature. 2008;454(7203):436–444. doi:10.1038/nature07205
  • Diakos CI, Charles KA, Mcmillan DC, Clarke SJ. Cancer-related inflammation and treatment effectiveness. Lancet Oncol. 2014;15(11):e493–e503. doi:10.1016/S1470-2045(14)70263-3
  • Ding HP, Ling YQ, Chen W, et al. Effects of nutritional indices and inflammatory parameters on patients received immunotherapy for non-small cell lung cancer. Curr Probl Cancer. 2023;48:101035. doi:10.1016/j.currproblcancer.2023.101035
  • Berkman SJ, Roscoe EM, Bourret JC. Comparing self-directed methods for training staff to create graphs using GraphPad Prism. J Appl Behav Anal. 2019;52(1):188–204. doi:10.1002/jaba.522
  • Olingy CE, Dinh HQ, Hedrick CC. Monocyte heterogeneity and functions in cancer. J Leukoc Biol. 2019;106(2):309–322. doi:10.1002/JLB.4RI0818-311R
  • Huang Y, Wei S, Jiang N, et al. The prognostic impact of decreased pretreatment haemoglobin level on the survival of patients with lung cancer: a systematic review and meta-analysis. BMC Cancer. 2018;18(1):1235. doi:10.1186/s12885-018-5136-5
  • Rajpurkar P, Chen E, Banerjee O, Topol EJ. AI in health and medicine. Nat Med. 2022;28(1):31–38. doi:10.1038/s41591-021-01614-0
  • Currie G, Hawk KE, Rohren E, Vial A, Klein R. Machine learning and deep learning in medical imaging: intelligent imaging. J Med Imaging Radiat Sci. 2019;50(4):477–487. doi:10.1016/j.jmir.2019.09.005
  • Noorbakhsh-Sabet N, Zand R, Zhang Y, Abedi V. Artificial intelligence transforms the future of health care. Am J Med. 2019;132(7):795–801. doi:10.1016/j.amjmed.2019.01.017
  • Luo W, Phung D, Tran T, et al. Guidelines for developing and reporting machine learning predictive models in biomedical research: a multidisciplinary view. J Med Internet Res. 2016;18(12):e323. doi:10.2196/jmir.5870
  • Nwanosike EM, Conway BR, Merchant HA, Hasan SS. Potential applications and performance of machine learning techniques and algorithms in clinical practice: a systematic review. Int J Med Inform. 2022;159:104679. doi:10.1016/j.ijmedinf.2021.104679
  • Nageswaran S, Arunkumar G, Bisht AK, et al. Lung cancer classification and prediction using machine learning and image processing. Biomed Res Int. 2022;2022:1755460. doi:10.1155/2022/1755460
  • Altuhaifa FA, Win KT, Su G. Predicting lung cancer survival based on clinical data using machine learning: a review. Comput Biol Med. 2023;165:107338. doi:10.1016/j.compbiomed.2023.107338
  • Xie Y, Meng WY, Li RZ, et al. Early lung cancer diagnostic biomarker discovery by machine learning methods. Transl Oncol. 2021;14(1):100907. doi:10.1016/j.tranon.2020.100907
  • Shi Y, Wang H, Yao X, et al. Machine learning prediction models for different stages of non-small cell lung cancer based on tongue and tumor marker: a pilot study. BMC Med Inform Decis Mak. 2023;23(1):197. doi:10.1186/s12911-023-02266-5