Abstract
Aim: The explosion of data-based technology has accelerated pattern mining. However, the quality and bias of data clearly affect all machine learning and modeling. Results & methodology: A technique is presented that uses the distribution of first significant digits of medicinal chemistry features (logP, logS and pKa, both experimental and predicted) to assess how closely they follow Benford's law, as seen in many natural phenomena. Conclusion: Quality of data depends on dataset size, diversity and magnitude. Profiling based on drugs alone may rely on sets that are too small or too narrow; using larger sets of experimentally determined or predicted values recovers the distribution seen in other natural phenomena. This technique may be used to improve profiling, machine learning, large-dataset assessment and other data-based methods for better (automated) data generation and compound design.
Plain Language Summary
Machine learning and other data-driven technologies depend critically on the quality of data.
Benford's law offers an easy, fast, statistical way to indicate whether data follow the pattern seen in natural phenomena.
Drug design is affected by the first significant digit (FSD) distributions of experimental and predicted logP, pKa and solubility values.
The method is suited to large datasets.
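As an illustration of the kind of check summarized above, the following is a minimal sketch and not the authors' implementation. Benford's law predicts that the first significant digit d (1 to 9) occurs with probability log10(1 + 1/d). The function names, the choice to take absolute values and skip zeros, and the use of a chi-square statistic as the measure of agreement are all assumptions made for this example. The input is a synthetic sequence (powers of 2), not data from the paper.

```python
import math
from collections import Counter


def first_significant_digit(x):
    """Return the first non-zero digit of |x|, or None for zero or non-finite input."""
    x = abs(float(x))
    if x == 0.0 or not math.isfinite(x):
        return None
    # Scientific notation places the leading digit first, e.g. '3.700...e-02'.
    return int(f"{x:.15e}"[0])


def benford_expected():
    """Benford's law: P(d) = log10(1 + 1/d) for d = 1..9."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}


def fsd_distribution(values):
    """Observed first-significant-digit frequencies and the number of usable values."""
    digits = [first_significant_digit(v) for v in values]
    digits = [d for d in digits if d is not None]
    if not digits:
        raise ValueError("no non-zero finite values to analyse")
    counts = Counter(digits)
    n = len(digits)
    return {d: counts.get(d, 0) / n for d in range(1, 10)}, n


def chi_square_vs_benford(values):
    """Pearson chi-square statistic of observed digit counts against Benford's law."""
    observed, n = fsd_distribution(values)
    expected = benford_expected()
    return sum(
        n * (observed[d] - expected[d]) ** 2 / expected[d] for d in range(1, 10)
    )


if __name__ == "__main__":
    # Synthetic stand-in data (assumption): powers of 2 are known to follow Benford's law.
    sample = [float(2 ** k) for k in range(1, 301)]
    observed, n = fsd_distribution(sample)
    for d, p in benford_expected().items():
        print(f"digit {d}: observed {observed[d]:.3f}, Benford {p:.3f}")
    print("chi-square:", chi_square_vs_benford(sample))
```

In practice the sample list would be replaced by a column of experimental or predicted logP, logS or pKa values. With eight degrees of freedom, a chi-square value above roughly 15.5 would indicate a departure from Benford's law at the 5% level.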
Supplementary data
To view the supplementary data that accompany this paper, please visit the journal website at: www.tandfonline.com/doi/full/10.2217/epi-2016-0184
Financial & competing interests disclosure
A.T.G.-S. thanks Haridus- ja Teadusministeerium for grant no. IUT34-14. The authors have no other relevant affiliations or financial involvement with any organization or entity with a financial interest in or financial conflict with the subject matter or materials discussed in the manuscript apart from those disclosed.
No writing assistance was utilized in the production of this manuscript.