Representing molecular and materials data for unsupervised machine learning

Pages 905-920 | Received 16 Oct 2017, Accepted 04 Mar 2018, Published online: 02 Apr 2018

