Abstract
Previous modelling of the median lethal dose (oral rat LD50) has indicated that local class-based models yield better correlations than global models. We evaluated the hypothesis that dividing the dataset by pesticidal mechanism would improve prediction accuracy. A linear discriminant analysis (LDA)-based approach was used to assign each chemical an indicator: the pesticide target species, the mode of action, or the combination of target species and mode of action. The LDA models predicted these indicators with about 87% accuracy. Toxicity was then predicted using the QSAR model fit to the chemicals sharing the assigned indicator. Toxicity was also predicted using a global hierarchical clustering (HC) approach, which divides the dataset into clusters based on molecular similarity. At comparable prediction coverage (~94%), the global HC method yielded slightly higher prediction accuracy (r2 = 0.50) than the LDA method (r2 ~ 0.47). A single model fit to the entire training set yielded the poorest results (r2 = 0.38), indicating that there is an advantage to clustering the dataset when predicting acute toxicity. Finally, this study shows that whilst dividing the training set into subsets (i.e. clusters) improves prediction accuracy, it may not matter which method (expert-based or purely machine learning) is used to divide the dataset.
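The comparison between a single global model and class-specific local models can be illustrated with a minimal sketch. It is not the QSAR workflow used in this study: the data are synthetic and invented for illustration, there is a single made-up descriptor, the class labels are taken as given rather than predicted by LDA, and r2 is assumed here to mean the coefficient of determination computed on the fitting data. The helper names `fit_line` and `r_squared` are hypothetical.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a + b*x; returns (a, b)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b


def r_squared(ys, preds):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    my = mean(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, preds))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Synthetic records: (descriptor value, class label, toxicity value).
# The two classes follow opposite trends, so one line fits both poorly.
data = [
    (1.0, "A", 2.1), (2.0, "A", 3.9), (3.0, "A", 6.2), (4.0, "A", 7.8),
    (1.0, "B", 5.0), (2.0, "B", 4.1), (3.0, "B", 2.9), (4.0, "B", 2.0),
]
xs = [x for x, _, _ in data]
ys = [y for _, _, y in data]

# Global model: one line fit to every record.
a, b = fit_line(xs, ys)
r2_global = r_squared(ys, [a + b * x for x in xs])

# Local models: one line per class, each fit only to that class's records.
models = {}
for label in sorted({c for _, c, _ in data}):
    subset = [(x, y) for x, c, y in data if c == label]
    models[label] = fit_line([x for x, _ in subset], [y for _, y in subset])

# Each record is predicted by the model belonging to its own class.
local_preds = [models[c][0] + models[c][1] * x for x, c, _ in data]
r2_local = r_squared(ys, local_preds)

print(f"global r2 = {r2_global:.3f}, local r2 = {r2_local:.3f}")
```

In this toy case the per-class fits explain far more variance than the single line because the two classes were constructed with opposite slopes. Real data would show a smaller gap, and misassigned classes would reduce it further.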
Acknowledgements
We thank Chris Russom for sharing the pesticide classification database. The views expressed in this article are those of the authors and do not necessarily represent the views or policies of the US Environmental Protection Agency.