Abstract
In this study, we developed new computational DNA adduct prediction models by using significantly more diverse training data-set of 217 DNA adducts and 1024 non-DNA adducts, and applying five machine learning methods which include support vector machine (SVM), k-nearest neighbour, artificial neural networks, logistic regression and continuous kernel discrimination. The molecular descriptors used for DNA adduct prediction were selected from a pool of 548 descriptors by using a multi-step hybrid feature selection method combining Fischer-score and Monte Carlo simulated annealing method. Some of the selected descriptors are consistent with the structural and physicochemical properties reported to be important for DNA adduct formation. The y-scrambling method was used to test whether there is a chance correlation in the developed SVM model. In the meantime, fivefold cross-validation of these machine learning methods results in the prediction accuracies of 64.1–82.5% for DNA adducts and 95.1–97.6% for non-DNA adducts, and the prediction accuracies for external test set are 78.2–100% for DNA adducts and 92.6–98.4% for non-DNA adducts. Our study suggested that the tested machine learning methods are potentially useful for DNA adducts identification.
Acknowledgements
This study was supported by a grant from the Two-Way Support Programs of Sichuan Agricultural University (Project No. 00770117) and the National Natural Science Foundation of China (Project No. 20973118).