Original Research

Prediction of selective estrogen receptor beta agonist using open data and machine learning approach

Pages 2323-2331 | Published online: 18 Jul 2016

Figures & data

Figure 1 The data analysis and machine learning schema.

Notes: Step 1: collect ER-β agonist data from a public database. Step 2: analyze chemical diversity. Step 3: construct machine learning models. Step 4: validate the constructed models.
Abbreviations: ER, estrogen receptor; SVM, support vector machine; ROC, receiver operating characteristic; PCA, principal component analysis.

Figure 2 Principal component analysis (PCA) of the dataset.

Notes: The PCA was based on four types of fingerprints. Each dot represents a unique compound of the dataset. Black dots represent active compounds, whereas gray dots represent inactive compounds.
Abbreviations: Ext, extended; AP2D, 2D atom pairs; FP, fingerprints.

Figure 3 The heat map of distance matrix for the compounds in the collected dataset.

Note: Green represents a large distance and structural dissimilarity.
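A pairwise distance matrix of this kind can be sketched as follows. This is a minimal illustration only: it assumes binary fingerprints stored as sets of on-bit indices and uses the Tanimoto (Jaccard) distance, which the caption does not name as the metric actually used. The function names and the toy fingerprints are hypothetical.

```python
from itertools import combinations


def tanimoto_distance(a, b):
    """Return 1 - |a & b| / |a | b| for two sets of on-bit indices."""
    union = len(a | b)
    if union == 0:
        return 0.0  # two empty fingerprints are treated as identical
    return 1.0 - len(a & b) / union


def distance_matrix(fingerprints):
    """Build a symmetric matrix of pairwise distances with a zero diagonal."""
    n = len(fingerprints)
    matrix = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        d = tanimoto_distance(fingerprints[i], fingerprints[j])
        matrix[i][j] = matrix[j][i] = d
    return matrix


# Hypothetical toy fingerprints (on-bit positions), not taken from the dataset.
toy_fps = [{1, 2, 3}, {2, 3, 4}, {7, 8}]
toy_matrix = distance_matrix(toy_fps)
```

Larger entries correspond to more dissimilar pairs, which is what the heat map colors encode.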

Table 1 Model performances of 5-fold cross validation

Figure 4 The ROC curves of the 5-fold cross validation models based on four types of fingerprints (FP) and four machine learning approaches.

Note: The error bars on each curve are based on five runs of the 5-fold cross validation process.
Abbreviations: ROC, receiver operating characteristic; NB, Naïve Bayesian; KNN, k-nearest neighbor; RF, random forest; SVM, support vector machine; Ext, extended; AP2D, 2D atom pairs; TP, true positives; FPos, false positives.
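The idea of repeating 5-fold cross validation five times and summarizing the spread can be sketched as below. This is a hedged illustration, not the authors' procedure: the `centroid_score` scorer is a toy stand-in for NB, KNN, RF, or SVM, the one-dimensional `toy_data` is invented, and summarizing by the AUC of pooled out-of-fold scores is an assumption about how a per-run value might be obtained.

```python
import random
import statistics


def auc(labels, scores):
    """Rank-based ROC AUC: chance that a positive outscores a negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def kfold_indices(n, k, rng):
    """Shuffle 0..n-1 and deal the indices into k nearly equal folds."""
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]


def centroid_score(train, x):
    """Toy scorer: distance to the inactive mean minus distance to the active mean."""
    act = statistics.mean(v for v, y in train if y == 1)
    ina = statistics.mean(v for v, y in train if y == 0)
    return abs(x - ina) - abs(x - act)


def repeated_cv_auc(data, k=5, repeats=5, seed=0):
    """Run k-fold CV `repeats` times; return the mean and standard deviation of AUC."""
    rng = random.Random(seed)
    aucs = []
    for _ in range(repeats):
        labels, scores = [], []
        for fold in kfold_indices(len(data), k, rng):
            held_out = set(fold)
            train = [d for i, d in enumerate(data) if i not in held_out]
            for i in fold:
                x, y = data[i]
                labels.append(y)
                scores.append(centroid_score(train, x))
        aucs.append(auc(labels, scores))
    return statistics.mean(aucs), statistics.stdev(aucs)


# Invented (feature, label) pairs with overlapping classes; 1 = active, 0 = inactive.
toy_data = [(0.40 + 0.05 * i, 1) for i in range(10)] + [(0.10 + 0.05 * i, 0) for i in range(10)]
mean_auc, sd_auc = repeated_cv_auc(toy_data)
```

The standard deviation across runs is one quantity that an error bar could represent; the caption does not specify which statistic was plotted.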

Figure 5 Performance ranking of machine learning methods with various fingerprints (FP).

Note: For example, KNN ranked first with MACCSFP, second with AP2D, and third with ExtFP or PubChemFP.
Abbreviations: NB, Naïve Bayesian; KNN, k-nearest neighbor; RF, random forest; SVM, support vector machine; Ext, extended; AP2D, 2D atom pairs.
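A ranking like this can be derived from a table of per-model scores, as in the following sketch. The numbers in `placeholder_auc` are made up purely for illustration and are not results from the paper; the helper names `rank_within` and `transpose` are likewise hypothetical.

```python
# Placeholder scores for illustration only; NOT values reported in the paper.
placeholder_auc = {
    "MACCSFP": {"NB": 0.70, "KNN": 0.80, "RF": 0.78, "SVM": 0.75},
    "AP2D": {"NB": 0.65, "KNN": 0.76, "RF": 0.79, "SVM": 0.72},
}


def rank_within(table):
    """For each outer key, rank the inner keys by descending score (1 = best)."""
    ranks = {}
    for outer, inner in table.items():
        ordered = sorted(inner, key=inner.get, reverse=True)
        ranks[outer] = {name: pos for pos, name in enumerate(ordered, start=1)}
    return ranks


def transpose(table):
    """Swap outer and inner keys: fingerprint -> method becomes method -> fingerprint."""
    flipped = {}
    for outer, inner in table.items():
        for name, value in inner.items():
            flipped.setdefault(name, {})[outer] = value
    return flipped


# Rank of each method for a given fingerprint.
method_ranks = rank_within(placeholder_auc)
# Rank of each fingerprint for a given method.
fingerprint_ranks = rank_within(transpose(placeholder_auc))
```

Ranking the original table compares methods within each fingerprint, while ranking the transposed table compares fingerprints within each method.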

Figure 6 Performance ranking of fingerprints (FP) in various machine learning methods.

Note: For example, MACCSFP ranked third in NB and second in KNN, RF, and SVM.
Abbreviations: NB, Naïve Bayesian; KNN, k-nearest neighbor; RF, random forest; SVM, support vector machine; Ext, extended; AP2D, 2D atom pairs.

Table 2 Model performances of test set

Table 3 Model performances of external test set

Table S1 Ten-fold cross validation model performance

Table S2 Five-fold cross validation model performance using experimental inactive agonists