Search in:

Science and Technology of Advanced Materials Volume 21, 2020 - Issue 1

Submit an article Journal homepage

Open access

5,703

Views

CrossRef citations to date

Altmetric

Listen

New topics/Others

NMR-TS: de novo molecule identification from NMR spectra

Jinzhe Zhanga Department of Computational Biology and Medical Sciences, Graduate School of Frontier Sciences, The University of Tokyo, Kashiwa, Japan;b RIKEN Center for Advanced Intelligence Project, Tokyo, JapanView further author information

Kei Terayamab RIKEN Center for Advanced Intelligence Project, Tokyo, Japan;c Graduate School of Medicine, Kyoto University, Kyoto, Japan;d RIKEN Medical Sciences Innovation Hub Program (MIH), Yokohama, Japan;e Graduate School of Medical Life Science, Yokohama City University, Yokohama, JapanView further author information

Masato Sumitab RIKEN Center for Advanced Intelligence Project, Tokyo, Japan;f International Center for Materials Nanoarchitectonics (WPI-MANA), National Institute for Materials Science, Tsukuba, JapanView further author information

Kazuki Yoshizoeb RIKEN Center for Advanced Intelligence Project, Tokyo, JapanView further author information

Kengo Itoe Graduate School of Medical Life Science, Yokohama City University, Yokohama, Japan;g RIKEN Center for Sustainable Resource Science, Yokohama, JapanView further author information

Jun Kikuchie Graduate School of Medical Life Science, Yokohama City University, Yokohama, Japan;g RIKEN Center for Sustainable Resource Science, Yokohama, Japan;h Graduate School of Bioagricultural Sciences, Nagoya University, Nagoya, JapanView further author information

Koji Tsudaa Department of Computational Biology and Medical Sciences, Graduate School of Frontier Sciences, The University of Tokyo, Kashiwa, Japan;b RIKEN Center for Advanced Intelligence Project, Tokyo, Japan;i Research and Services Division of Materials Data and Integrated System, National Institute for Materials Science, Tsukuba, JapanCorrespondence[email protected]
View further author information

show all

Pages 552-561 | Received 06 May 2020, Accepted 05 Jul 2020, Published online: 30 Jul 2020

Cite this article
https://doi.org/10.1080/14686996.2020.1793382
CrossMark

In this article

ABSTRACT
1. Introduction
2. Methods
3. Results and discussion
4. Conclusion and outlook
Supplemental material
Acknowledgements
Disclosure statement
Additional information
References

Full Article
Figures & data
References
Supplemental
Citations
Metrics
Licensing
Reprints & Permissions
View PDF PDF View EPUB EPUB

Formulae display: $MathJax Logo$ ?Mathematical formulae have been encoded as MathML and are displayed in this HTML version using MathJax in order to improve their display. Uncheck the box to turn MathJax off. This feature requires Javascript. Click on a formula to zoom.

ABSTRACT

Nuclear magnetic resonance (NMR) spectroscopy is an effective tool for identifying molecules in a sample. Although many previously observed NMR spectra are accumulated in public databases, they cover only a tiny fraction of the chemical space, and molecule identification is typically accomplished manually based on expert knowledge. Herein, we propose NMR-TS, a machine-learning-based python library, to automatically identify a molecule from its NMR spectrum. NMR-TS discovers candidate molecules whose NMR spectra match the target spectrum by using deep learning and density functional theory (DFT)-computed spectra. As a proof-of-concept, we identify prototypical metabolites from their computed spectra. After an average 5451 DFT runs for each spectrum, six of the nine molecules are identified correctly, and proximal molecules are obtained in the other cases. This encouraging result implies that de novo molecule generation can contribute to the fully automated identification of chemical structures. NMR-TS is available at https://github.com/tsudalab/NMR-TS.

KEYWORDS:

NMR
deep learning
molecule generation
density functional theory

CLASSIFICATION:

404 Materials informatics / Genomics

1. Introduction

Nuclear magnetic resonance (NMR) spectroscopy is indispensable for chemists for various chemical structure identification tasks, such as confirming the synthesis of a molecule [Citation1] and revealing the existence of impurities [Citation2–Citation4]. An NMR spectrum consists of peaks that correspond to molecular fragments, and the peak positions (chemical shifts) depend on the environment in the molecule. One important task for chemists is peak assignment, in which they use their knowledge to map peaks to functional groups depending on their chemical shifts. The approximate chemical structure of the target molecule can then be confirmed from the NMR spectrum.

Recently, automated robotic laboratory systems have received considerable attention for high-throughput material design. Automated robotic laboratories perform chemical reactivity tests under different reaction conditions guided by machine learning algorithms [Citation5–Citation7]. A key step in this automated experiment is the qualification and quantification of the reaction products from each cycle to evaluate the reactivity. NMR spectroscopy can be used as a method for either qualitative or quantitative measurements. In automated robotic laboratory system studies, NMR spectroscopy has already been applied for quantitative measurements [Citation7]. Automating the identification of any molecule would pave the way for using NMR spectroscopy for the qualitative measurement of reaction products in robotic laboratory systems.

To date, predicting the molecules in a sample from its NMR spectrum has mainly been performed based on databases [Citation8–Citation10]. For example, to identify the metabolites in a sample, methods for accurately calculating chemical shifts [Citation11,Citation12] and for predicting molecules using specific databases [Citation13–Citation16] have been proposed. However, these methods are not effective for unknown molecules whose spectra are not registered in the database. In addition, even for prediction methods based on computational chemistry [Citation17] and machine learning [Citation18–Citation20], NMR spectra cannot be calculated without the structures of the compounds; thus, these methods are also not useful for unknown molecules [Citation21,Citation22]. In this work, we try to identify unknown molecules from its NMR spectrum with a de novo molecule generator.

Recent progress in machine learning has enabled the development of de novo molecule generators [Citation23–Citation27, Citation28–Citation30], which are expected to design molecules with desired properties [Citation24]. For instance, we developed a molecule generator, ChemTS [Citation27], which combines Monte Carlo tree search (MCTS) with a recurrent neural network (RNN), and successfully showed that ChemTS coupled with quantum chemical calculations can produce realistic molecules that have desired properties [Citation31]. So far, most de novo molecule generators have only been tested or applied on quantifiable chemical properties such as gaps between the highest occupied molecular orbital (HOMO) and the lowest unoccupied molecular orbital (LUMO). As ¹ H NMR spectra are highly characteristic of individual compounds, we consider ¹ H NMR spectra as one of its molecular properties.

We developed a python library named NMR-TS to identify an unknown molecule from its spectrum by designing molecules that have ¹ H NMR spectra that are as similar as possible to the target spectrum. In this work, as a proof-of-concept, we evaluated the ¹ H NMR spectra of nine known molecules that were not included in the NMR-TS training set. NMR-TS succeeded in correctly identifying six of the nine molecules from their ¹ H NMR spectra, whereas proximal molecules were obtained in the other three cases.

2. Methods

NMR-TS is a tool that automatically identifies the molecular structure from a given NMR spectrum based on ChemTS. The NMR-TS method is schematized in . NMR-TS requires (1) a target ¹ H NMR spectrum, (2) the numbers of hydrogen and carbon atoms, which indicate the size of the target molecule, and (3) a training data set (a database of SMILES30 strings) as input. NMR-TS outputs a list of candidate molecular structures that fit the input spectrum.

Figure 1. Concept of this study and molecular generator scheme. NMR-TS tries to identify an unknown molecule from its NMR spectrum (target NMR spectrum) by designing molecules with NMR spectra as similar as possible to the target NMR spectrum. The NMR spectrum of a generated molecule is simulated by quantum chemical calculation. The Wasserstein distance is used to quantify the proximity between the NMR spectra of the target and generated molecules.

2.1. ChemTS

Before describing NMR-TS, we introduce ChemTS, which is the base algorithm of NMR-TS. The input for ChemTS is a database of SMILES [Citation32] strings and an evaluation function, which quantifies the goodness of a generated molecule. Starting from a root node that represents the beginning of a SMILES string, the ChemTS algorithm builds a search tree, in which each node corresponds to one SMILES symbol. The ChemTS search process consists of four procedures: selection, expansion, simulation, and backpropagation, the details of which are given in the original paper [Citation27]. In the selection step, the tree is traversed from the root to a leaf by recursively choosing the child node that has the maximum upper confidence bound (UCB) based score at each branch. This score is described in detail in the next paragraph. A path from the root to the leaf node becomes a SMILES prefix. In the expansion step, several child nodes are added to the leaf node. Upon tree expansion, a selected prefix serves as the input for the RNN pretrained on the database. With the SMILES prefix as an input, the RNN can predict the next symbol after the prefix and elongate the length of the prefix by one. By repeating this elongation step until a terminal symbol appears, a complete SMILES string is generated [Citation33]. The generated molecule is evaluated using the evaluation function and then the tree is updated accordingly during the backpropagation procedure. The input database for pretraining the RNN can be either a general database with no specific molecular characteristics or a specific database containing field-specific SMILES strings.

In this study, to perform the massive DFT computations, we parallelized the tree search part of ChemTS using Open MPI based on the virtual loss approach [Citation34]. We used the following scoring in the selection step to avoid concentrating the DFT computations on one node.

(1)

u c b_{i} = \frac{S_{i}}{v_{i} + w_{i}} + C P_{i} \frac{\sqrt{v_{p} + w_{p}}}{1 + v_{i} + w_{i}}

(1)

Here, $s_{i}$ is the total score obtained by node $i$ , $v_{i}$ is the total visit number of $i$ , $w_{i}$ is the total virtual visit number of $i$ (virtual loss), $v_{p}$ is the total visit number of parent node $p$ of $i$ , $w_{p}$ is the total virtual visit number of $p$ , $p_{i}$ is the probability of $i$ among the children of $p$ , and $C$ is a constant that controls the exploration–exploitation trade-off.

2.2. NMR-TS

Similar to ChemTS, NMR-TS pretrains an RNN model using the input SMILES database to obtain an RNN model that can generate various valid SMILES strings depending on the input prefixes. NMR-TS takes a target NMR spectrum as input (Step 0 in ). In the generation step of NMR-TS (Step 1 in ), a SMILES prefix determined by MCTS is given to the RNN model to obtain a complete SMILES string. Then, the simulated NMR spectrum of the SMILES string is computed using a quantum-molecular-calculation-based method (Step 2 in ), as described in the following section. Once the simulated NMR spectrum of the generated molecule is obtained, its similarity with the target NMR spectrum is evaluated using the Wasserstein distance (Step 3 in ). In addition, the numbers of hydrogen and carbon atoms in the target molecules are used to constrain the sizes of the molecules generated by NMR-TS. If the numbers of hydrogen or carbon atoms differ between the target and generated molecules, a penalty is added according to the difference. A score is calculated by integrating both the Wasserstein distance and the atom number penalty. The score of each prefix branch is updated using the calculated score to progress the MCTS (Step 4 in ). Once the search tree is updated, a new SMILES prefix is selected, and the above steps are repeated. By repeating these steps, we expect that the tree will eventually explore the chemical space and provide molecules that fit the target spectrum.

2.3. NMR spectrum prediction

To compute the ¹ H NMR spectra, we started by converting the input SMILES string into the canonical SMILES format, which was converted into a 3D molecular structure [Citation35] through the function implemented in the RDkit library [Citation36] with the random seed fixed to 1. Canonizing the SMILES string and fixing the random seed ensured that identical chemical structures produce the same ¹ H NMR spectrum in the prediction step. Once RDkit produced a molecular structure, with the atom positions described by Cartesian coordinates, the ¹ H NMR spectrum was computed using density functional theory (DFT) [Citation37] at the B3LYP/3-21 G* level on the optimized structure at the universal force field (UFF) level. Magnetic shielding tensors at the proton positions were calculated using the gauge-invariant atomic orbital (GIAO) method. The isotropic chemical shift in the ¹ H NMR spectrum was estimated by subtracting that of tetramethylsilane (TMS) calculated at the same level. For temporal convenience, we ignored the degeneracy between protons in this work. Hence, the ¹ H NMR spectrum of a molecule was computed as a line spectrum of all the protons in the molecule. All DFT calculations were performed with the Gaussian 16 package [Citation38].

2.4. Wasserstein distance and evaluation function

The Wasserstein distance [Citation39], also known as the Kantorovich–Rubinstein metric or the earth mover’s distance [Citation40], is a function that describes the distance between two distributions. If we consider two distributions as two piles of dirt, the Wasserstein distance is the minimum amount of work needed to reshape one into the other. Some typical examples of the distances between the calculated NMR spectra of molecules are shown in . In this study, the Wasserstein distance was used to evaluate the difference between the NMR spectra of a newly generated SMILES string and the target NMR spectra. We also used information about the numbers of hydrogen and carbon atoms in our evaluation function to guide the MCTS. We defined the evaluation function $S c o r e (M_{g}, M_{t})$ between a generated molecule $M_{g}$ and a target (unknown) molecule $M_{t}$ as follows:

(2)

\begin{aligned} S c o r e (M_{g}, M_{t}) = 1 - t a n h (W D (M_{g}, M_{t}) \\ + α P e n a l t y (M_{g}, M_{t})) \end{aligned}

(2)

(3)

P e n a l t y (M_{g}, M_{t}) = |C (M_{g}) - C (M_{t})| + |H (M_{g}) - H (M_{t})|

(3)

Figure 2. Examples of using the Wasserstein score (WS) to quantify the difference between the target NMR spectrum and the NMR spectra of SMILES generated molecules. A WS closer to 1 indicates high similarity between the spectra. In this example, the spectrum of Cc1cc(C)on1 is most similar to the target spectrum.

where $W D (M_{g}, M_{t})$ is the Wasserstein distance between the calculated ¹ H NMR spectra of generated molecule $M_{g}$ and target molecule $M_{t}$ , $C (M)$ and $H (M)$ represent the numbers of carbons and hydrogen in molecule $M$ , and α is a parameter indicating the strength of the penalty. Note that the range of the calculated scores is between 0.0 and 1.0. To calculate the Wasserstein distance between two spectra, the SciPy library was implemented [Citation41]. In the current paper, the term Wasserstein score (WS) is used to refer $S c o r e (M_{g}, M_{t})$ .

2.5. Trie enhancement of ChemTS

As mentioned above, ChemTS is a tool that combines a RNN with MCTS. In the context of ChemTS, the MCTS is essentially executed on a prefix search tree. One advantage of ¹ H NMR spectrum identification is that an enormous number of molecular spectra have been recorded and stored in databases. An intuitive way of utilizing such information is to preload the MCTS prefix search tree with the SMILES strings of the molecules in the database and update the scoring of each traversed node with the WS between the database spectrum and the target spectrum. In computer science, such a prefix tree is usually called trie [Citation42]. See Fig. S2 for an example of trie. We implemented this idea by constructing a trie tree as follows. At every iteration, we inserted one database SMILES string into the trie, on which the nodes were defined in the same way as in ChemTS. After each insertion, the WS of the added SMILES string was used to update the weight of each visited node. The number of preloaded molecules is called the trie size. In our experiments, we tested trie sizes of 0, 1, 100, 1000, and 9800. When the trie size was 1, 100, 1000, and 9800, we ranked the WS for each molecule in the database and selected the top 1, 100, 1000 and 9800 candidates, respectively. The algorithm for trie enhancement is shown in Figure S1.

2.6. Database

To show the validity of our concept, we prepared a SMILES database consisting of molecules with relatively small molecular weights. PubChemQC [Citation43], a free and open to the public online database, contains over 3.5 million molecules. Because PubChemQC also provides molecular properties computed by ab initio calculation, the molecular weights of the molecules included in PubChemQC are limited to 500, which was suitable for our purpose.

We downloaded molecules in the form of SMILES strings with PCCDB-IDs from 1 to 138,895. We ran a selection on these 138,895 molecules to pick out the pure organic molecules that consisted of only C, H, N, and O. After selection, 10,548 molecules remained. Eight molecules were removed owing to the failure of the ¹ H NMR spectrum computation. Charged molecules were also excluded. To verify that molecules not included in the database could be identified using NMR-TS, we removed the test molecules, which are described in the next section, from the database. Finally, 9866 molecules were used as the SMILES database. The database contained the following SMILES characters: O, c, 1, (,), C, =, N, #, n, 2, o, 3, and 4. Since we used a middle-sized database to train our model, the diversity of molecules may not be sufficient to generate certain moieties. For further applications, users may need to retrain the RNN to enhance the performance.

2.7. Test set

We manually selected nine small organic molecules with molecular weights of less than 500 as the test set (). We executed parallelized NMR-TS, using 20 cores per execution. The computation time was limited to 100 h for each test molecule. We tested NMR-TS with five different parameter combinations (see ) on the nine test molecules.

Table 1. Correct answer rate and average Wasserstein score (WS) for each trie size.

Download CSV Display Table

Figure 3. Nine test molecules with their chemical structural formulas and SMILES representations.

3. Results and discussion

We performed molecule estimation trials by NMR-TS for the test molecules using 20 CPU cores (Intel® Xeon® Gold 6148) for 100 h. For each run, 5451 molecules were generated on average. We defined the molecule in the database with the highest WS as the baseline molecule and its WS as the baseline score. In , we summarize the correct answer rate and the average WS for each trie size. The baseline molecules and the high-score molecules generated by NMR-TS with various parameters are shown in . Since the test molecules were not contained in the database, the baseline molecules had WS values of less than 1.0. As shown in , NMR-TS succeeded in identifying six molecules (I, III, IV, V, VI, and VIII) out of the nine test molecules from their ¹ H NMR spectra. NMR-TS also suggested other candidates that were close to the target molecules based on the provided spectra.

Figure 4. Test molecules, baseline molecules, and best candidate molecules generated by NMR-TS. The corresponding Wasserstein score (WS) is shown for each baseline and candidate molecule. For test molecules I, III–VI, and VIII, NMR-TS gave the correct structures. For test molecules II, VII, and IX, NMR-TS failed to find the correct structures.

To demonstrate the dependence of the answer speed on the trie size and the computational time, the evolution of the best WS for each test over time is shown in . For all the trie sizes, the WS drastically increased during the initial 10 h. However, when the trie size was zero, NMR-TS could not generate any molecules with a WS of 1.0 within 100 h. Even for the failures (test molecules II, VII, and IX), NMR-TS succeeded in generating some molecules that had higher WS values than the baseline score when the trie size was increased. See Table S1 for total CPU hours to identify the molecules.

Figure 5. NMR-TS search results for target spectra of test molecules I–IX showing the best Wasserstein score (WS) as the function of time with different trie sizes. See for the details of the different parameter sets.

In ), we show the time evolution of the average score of the best candidate for the nine test molecules with different trie sizes. For all the trie sizes, the growth of the WS is mostly saturated within 40 h. Furthermore, NMR-TS typically generated higher-scored candidate molecules when the trie size was increased, with some exceptions.

Figure 6. (a) Evolution of the average Wasserstein score (WS) of the best candidates for the nine test molecules over time with different trie sizes. When the trie size is 0, ChemTS starts with a root node without any expansion. When the trie size is 1, 100, 1000, or 9800, a WS is obtained for each spectrum in the database against the target spectrum and based on this ranking, the top 1, 100, 1000, and 9800 molecules, respectively, are fed into the trie. (b) Evolution of the total number of candidates with scores better than the database baseline for all test molecules over time. (c) Comparison of the best candidate scores from the database search and NMR-TS. C = 1, trie size = 9800.

) shows the total number of candidates that have higher scores than the baseline molecules as a function of the computational time. For each trie size, the total number of candidates monotonically increased over time. As the trie size increased, NMR-TS generated more candidates with better scores than the baseline. A comparison of the results for trie sizes of 0 and 1 in ,) reveals that a trie size of 0 was superior to a trie size of 1 from the viewpoint of generating more candidates with better scores than the baseline. In contrast, a trie size of 1 was superior to a trie size of 0 from the viewpoint of generating higher-scored candidates. A reasonable explanation for this phenomenon is that while the trie highlights the most promising branch in the search tree, the presence of the trie also restricts the exploration of other branches and thus reduces the overall diversity.

In ), we show a comparison between the scores of the best NMR-TS candidates and the baseline scores. The points above the diagonal dotted line correspond to cases where NMR-TS found better candidates than the baseline candidate. A score on the vertical axis of 1.0 (I, III, IV, V, and VI) indicates that NMR-TS succeeded in identifying the exact molecular structure. Although NMR-TS did not reach the baseline score for VII and VIII, these cases mainly fall on the extreme right side of the horizontal axis, which indicates that a good candidate already existed in the database. On the contrary, in cases where the baseline candidates poorly matched the target molecules (middle to left side of the horizontal axis), NMR-TS surpassed the baseline score.

In the current study, we used ¹ H NMR peaks from all types of functional groups to verify the concept of identifying chemical structures from only the NMR spectra despite using advance information about the number of carbon and hydrogen atoms. Considering the specificity of ¹ H NMR spectroscopy, it currently is difficult for NMR-TS to discriminate between hydrogen signals from – NH and – OH groups because their chemical shifts appear in the same range (1–5 ppm). Indeed, for test molecule II, NMR-TS completely misidentified the – OH environment as a – NH environment.

Similarly, the proximity of the chemical shifts for alkane protons might prevent NMR-TS from predicting the position of a phenyl group relative to methyl(propyl)amine, as in the case of VII. Furthermore, NMR-TS does not currently consider the hyperfine coupling resulting from hydrogen spin–spin coupling. Hence, we expect that NMR-TS will not be able to identify benzene derivatives that are characterized by the positions of substituents (i.e., ortho, meta, and para isomers). For test molecules with possible isomers, NMR-TS successfully identified V from its ¹ H NMR spectrum but failed for IX although the position of substituents of the baseline molecule is the same position of the test molecule. We speculate that this result reflects that NMR-TS cannot recognize the position of the substituents on benzene derivatives. Therefore, to improve the accuracy, it might be effective to consider the effect of hyperfine coupling when computing the ¹ H NMR spectra. However, the computation of the hyperfine coupling constant is time-consuming in electronic structure theory. As an alternative, we are planning to combine the ¹ H NMR spectrum with other spectra, such as the ¹³ C NMR or ultraviolet visible (UV-vis) spectrum. As a commonly used NMR technique, ¹³ C NMR spectroscopy provides important structural information about organic molecules. Thus, by coupling ¹ H NMR and ¹³ C NMR spectra, the accuracy of NMR-TS is expected to improve considerably.

4. Conclusion and outlook

In this study, we demonstrated NMR-TS, a technique for molecule identification from NMR spectra that combines a de novo molecular generation method with quantum chemical computations. NMR-TS was shown to identify a better molecular structure from a ¹ H NMR spectrum than the baseline despite receiving less or an equal amount of information. Despite the database not containing any of the test molecule structures, NMR-TS succeeded in utilizing the database information to reach the correct molecules without assistance from a chemist.

When MCTS was originally used in game play, each MCTS simulation consisted of a random playout from the current stage, for which the time cost was small. In the context of NMR-TS, each simulation (playout) required a computationally heavy DFT calculation. To tackle this problem, we designed a trie structure, and the identification performance improved as the size of the prebuilt trie tree increased. This result suggests that (1) NMR-TS generally performs better when incorporating information from an NMR database through a trie structure, and (2) a trie structure could be applied to enhance ChemTS when time-consuming simulations are required and an information database is available.

As a non-knowledge-based method, NMR-TS explores the metric space for a particular target spectrum. We designed the Wasserstein score as the metric in this study. In addition to ¹ H NMR spectra, data from many other spectroscopic measurements, such as ¹³ C NMR, IR, and UV-vis spectra, could be used as inputs. With a different distance metric, NMR-TS variants can potentially be applied to estimate chemical structures from other chemical properties. It is also possible to take into account other types of measurement techniques such as mass spectrometry, if a distance metric is properly designed.

NMR-TS is still in development and has a handful of limitations and possibilities of improvement. First, SMILES cannot represent many features of organic molecules such as axial chirality. It may be resolved by using graph-based representations [Citation24]. In our study, NMR-TS is tested only with computationally generated spectra and still needs to be tested with experimental spectra where peaks are unclear. Impurities are possible obstacles for accurate identification. NMR-TS cannot identify multiple compounds in a mixture, but could be extended by incorporating peak separation techniques presented in [Citation14]. To save computational time, we employed only one conformer per molecule. If k conformers are considered, the accuracy of NMR-TS should improve at expense of almost k-fold increase in computational cost. Also, our DFT-based spectrum computation can be replaced, e.g., by ENSO [Citation44] in pursuit of better accuracy. See Fig. S3 for comparison of our spectrum with that of ENSO. ENSO took 250 minutes to compute a spectrum, while our DFT calculation took 11 minutes. Compared to our DFT calculation, ENSO showed better accuracy in predicting the experimental spectrum, presumably because ENSO uses multiple conformers for spectrum calculation, while our calculation relies on only one conformer. For molecule generation from experimental spectra, we would need a robust method like ENSO. At this point, the application of NMR-TS is limited to relatively small molecules due to high computational cost. To deal with larger molecules, the incorporation of fragment assembly [Citation14] into NMR-TS might be beneficial. Finally, it is difficult for users to understand why NMR-TS succeeds for some molecules and fails for others. In general, interpreting the results of a neural-network-based system is known to be very difficult [Citation45]. Nevertheless, some methods for explainable AI might improve the interpretability of NMR-TS [Citation45].

We believe that the NMR-TS concept has various possibilities. For instance, NMR-TS could be utilized for product identification in an automated robotic synthesis system. As the reactants are often known, this information could be incorporated into the molecular generator to constrain the search space and improve the performance. In this study, we focused on the identification of one target molecule. Future work should focus on identifying individual molecules from a spectrum of multiple molecules, as it is often difficult to decide which peaks attribute to each molecule in a sample.

Supplemental material

Supplemental Material

Download PDF (799.1 KB)

Acknowledgments

The computations in this work were carried out on the RAIDEN supercomputer (RIKEN Center for AIP). K.Tsuda is supported by NEDO P15009, SIP (Technologies for Smart Bio-industry and Agriculture), JST CREST JPMJCR1502 and JST ERATO JPMJER1903.

Disclosure statement

No potential conflict of interest was reported by the authors.

Supplementary material

Supplemental data for this article can be accessed here.

Additional information

Funding

This work was supported by the Cabinet Office, Government of Japan [SIP]; Core Research for Evolutional Science and Technology [JPMJCR1502]; Exploratory Research for Advanced Technology [JPMJER1903]; New Energy and Industrial Technology Development Organization [P15009].

References

Roberts JD, Caserio MC. Basic principles of organic chemistry. second ed. Menlo Park, CA: W. A. Benjamin, Inc.; 1977.
Google Scholar
Lohr LL, Jensen AJ, Sharp TR. NMR characterization of impurities. In: Ahuja S, Alsante KM, editors. Separation Science and Technology. Vol. 5. Cambridge (MA): Academic Press; 2004. p. 301–339.
Google Scholar
Qiu F, Norwood DL. Identification of pharmaceutical impurities. J Liq Chromatogr Relat Technol. 2007;30:877–935.
Web of Science ®Google Scholar
Hotha KK, Patel T, Roychowdhury S, et al. Identification, synthesis, and characterization of unknown impurity in the famotidine powder for oral suspension due to excipient interaction by UPLC-MS/MS and NMR. J Liq Chromatogr Relat Technol. 2015;38:977–985.
Web of Science ®Google Scholar
Steiner S, Wolf J, Glatzel S, et al. Organic synthesis in a modular robotic system driven by a chemical programming language. Science. 2019;363:eaav2211. DOI:https://doi.org/10.1126/science.aav2211.
PubMed Web of Science ®Google Scholar
Roch LM, Häse F, Kreisbeck C, et al. ChemOS: orchestrating autonomous experimentation. Sci Robot. 2018;3:eaat5559.
PubMed Web of Science ®Google Scholar
Sans V, Porwol L, Dragone V, et al. A self optimizing synthetic organic reactor system using real-time in-line NMR spectroscopy. Chem Sci. 2015;6:1258–1264.
PubMed Web of Science ®Google Scholar
Robinette SL, Bruschweiler R, Schroeder FC, et al. NMR in metabolomics and natural products research: two sides of the same coin. Acc Chem Res. 2012;45:288–297.
PubMed Web of Science ®Google Scholar
Wishart DS. NMR metabolomics: A look ahead. J Magn Reson. 2019;306:155–161.
PubMed Web of Science ®Google Scholar
Tsugawa H, Nakabayashi R, Mori T, et al. A cheminformatics approach to characterize metabolomes in stable-isotope-labeled organisms. Nat Methods. 2019;16:295–298.
PubMed Web of Science ®Google Scholar
Chikayama E, Shimbo Y, Komatsu K, et al. The effect of molecular conformation on the accuracy of theoretical 1H and 13C chemical shifts calculated by Ab initio methods for metabolic mixture analysis. J Phys Chem B. 2016;120:3479–3487.
PubMed Web of Science ®Google Scholar
Hoffmann F, Li DW, Sebastiani D, et al. Improved quantum chemical NMR chemical shift prediction of metabolites in aqueous solution toward the validation of unknowns. J Phys Chem A. 2017;121:3071–3078.
PubMed Web of Science ®Google Scholar
Markley JL, Bruschweiler R, Edison AS, et al. The future of NMR-based metabolomics. Curr Opin Biotechnol. 2017;43:34–40.
PubMed Web of Science ®Google Scholar
Ito K, Tsutsumi Y, Date Y, et al. Fragment assembly approach based on graph/network theory with quantum chemistry verifications for assigning multidimensional NMR signals in metabolite mixtures. ACS Chem Biol. 2016;11:1030–1038.
PubMed Web of Science ®Google Scholar
Komatsu T, Ohishi R, Shino A, et al. Structure and metabolic-flow analysis of molecular complexity in a 13C-labeled tree by 2D and 3D NMR. Angew Chem Int Ed Engl. 2016;55:6000–6003.
PubMed Web of Science ®Google Scholar
Misawa T, Komatsu T, Date Y, et al. SENSI: signal enhancement by spectral integration for the analysis of metabolic mixtures. Chem Commun. 2016;52:2964–2967.
PubMed Web of Science ®Google Scholar
Bühl M, Kaupp M, Malkina OL, et al. The DFT route to NMR chemical shifts. J Comput Chem. 1999;20:91–105.
Web of Science ®Google Scholar
Aires-de-Sousa J, Hemmer MC, Gasteiger J. Prediction of 1H NMR chemical shifts using neural networks. Anal Chem. 2002;74:80–90.
PubMed Web of Science ®Google Scholar
Paruzzo FM, Hofstetter A, Musil F, et al. Chemical shifts in molecular solids by machine learning. Nat Commun. 2018;9:4501.
PubMed Web of Science ®Google Scholar
Binev Y, Aires-de-Sousa J. Structure-based predictions of 1H NMR chemical shifts using feed-forward neural networks. J Chem Inf Comput Sci. 2004;44:940–945.
PubMedGoogle Scholar
Bingol K, Bruschweiler-Li L, Yu C, et al. Metabolomics beyond spectroscopic databases: a combined MS/NMR strategy for the rapid identification of new metabolites in complex mixtures. Anal Chem. 2015;87:3864–3870.
PubMed Web of Science ®Google Scholar
Leggett A, Wang C, Li DW, et al. Identification of unknown metabolomics mixture compounds by combining NMR, MS, and cheminformatics. Methods Enzym. 2019;615:407–422.
PubMed Web of Science ®Google Scholar
Ito K, Obuchi Y, Chikayama E, et al. Exploratory machine-learned theoretical chemical shifts can closely predict metabolic mixture signals. Chem Sci. 2018;9:8213–8220.
PubMed Web of Science ®Google Scholar
Sanchez-Lengeling B, Aspuru-Guzik A. Inverse molecular design using machine learning: generative models for matter engineering. Science. 2018;361:360–365.
PubMed Web of Science ®Google Scholar
Jin W, Barzilay R, Jaakkola T. Junction tree variational autoencoder for molecular graph generation. Proceedings of the 35th International Conference on Machine Learning; Stockholm (Sweden); 2018. p. 2323–2332.
Google Scholar
Jensen JH. A graph-based genetic algorithm and generative model/Monte Carlo tree search for the exploration of chemical space. Chem Sci. 2019;10:3567–3572.
PubMed Web of Science ®Google Scholar
Yang X, Zhang J, Yoshizoe K, et al. ChemTS: an efficient python library for de novo molecular generation. Sci Technol Adv Mater. 2017;18:972–976.
PubMed Web of Science ®Google Scholar
Prykhodko O, Johansson SV, Kotsias P, et al. A de novo molecular generation method using latent vector based generative adversarial network. J Cheminformatics. 2019;11:74.
PubMed Web of Science ®Google Scholar
Grisoni F, Moret M, Lingwood R, et al. Bidirectional molecule generation with recurrent neural networks. J Chem Inf Model. 2020;60(3):1175‐1183.
Web of Science ®Google Scholar
Qi Y, Santana-Bonilla A, Zwijnenburg MA, et al. Molecular generation targeting desired electronic properties via deep generative models. Nanoscale. 2020;12:6744–6758.
PubMed Web of Science ®Google Scholar
Sumita M, Yang X, Ishihara S, et al. Hunting for organic molecules with artificial intelligence: molecules optimized for desired excitation energies. ACS Cent Sci. 2018;4:1126–1133.
PubMed Web of Science ®Google Scholar
Weininger D. SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. J Chem Inf Model. 1988;28:31–36.
Web of Science ®Google Scholar
Segler MHS, Kogej T, Tyrchan C, et al. Generating focused molecule libraries for drug discovery with recurrent neural networks. ACS Cent Sci. 2018;4:120–131.
PubMed Web of Science ®Google Scholar
Browne CB, Powley E, Whitehouse D, et al. A survey of monte carlo tree search methods. IEEE Trans Comput Intell AI Games. 2012;4:1–43.
Web of Science ®Google Scholar
Rappe AK, Casewit CJ, Colwell KS, et al. UFF, a full periodic table force field for molecular mechanics and molecular dynamics simulations. J Am Chem Soc. 1992;114:10024–10035.
Web of Science ®Google Scholar
RDKit: open-source cheminformatics [Internet]; [cited 2020 Apr 1]. Available from: http://www.rdkit.org
Google Scholar
Parr RG. Density functional theory of atoms and molecules. Fukui K, Pullman B, editors. Amsterdam: Springer Netherlands; 1980. p. 5–15.
Google Scholar
Frisch MJ, Trucks GW, Schlegel HB, et al. Gaussian 16 revision C.01. 2016.
Google Scholar
Villani C. Optimal transport: old and new. Berlin: Springer-Verlag; 2009.
Google Scholar
Rubner Y, Tomasi C, Guibas L. The earth mover’s distance as a metric for image retrieval. Int J Comput Vis. 2000;40:99–121.
Web of Science ®Google Scholar
Ramdas A, Trillos NG, Cuturi M. On wasserstein two-sample testing and related families of nonparametric tests. Entropy. 2017;19:47.
Web of Science ®Google Scholar
Brass P. Advanced data structures. Cambridge (UK): Cambirdge University Press; 2008.
Google Scholar
Nakata M, Shimazaki T. PubChemQC project: a large-scale first-principles electronic structure database for data-driven chemistry. J Chem Inf Model. 2017;57:1300–1308.
PubMed Web of Science ®Google Scholar
ENSO. [cited 2020 Jun 15]. Available from: https://github.com/grimme-lab/enso/releases/tag/v2.0.1
Google Scholar
Rudin C. Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nat Mach Intell. 2019;1(5):206–215.
PubMedGoogle Scholar

Download PDF

Share icon
Back to Top

Related research

People also read lists articles that other readers of this article have read.

Recommended articles lists articles that we recommend and is powered by our AI driven recommendation engine.

Cited by lists all citing articles based on Crossref citations.
Articles with the Crossref icon will open in a new tab.

People also read
Recommended articles
Cited by

To cite this article:

Reference style: APA Chicago Harvard

Citation copied to clipboard

Reference styles above use APA (6th edition), Chicago (16th edition) & Harvard (10th edition)

Download citation

Download a citation file in RIS format that can be imported by citation management software including EndNote, ProCite, RefWorks and Reference Manager.

Choose format: RIS BibTex RefWorks Direct Export

Choose options: Citation Citation & abstract Citation & references

Your download is now in progress and you may close this window

Did you know that with a free Taylor & Francis Online account you can gain access to the following benefits?

Choose new content alerts to be informed about new research of interest to you
Easy remote access to your institution's subscriptions on any device, from any location
Save your searches and schedule alerts to send you new results
Export your search results into a .csv file to support your research

Have an account?
Login now Don't have an account?
Register for free

Login or register to access this feature

Have an account?
Login now Don't have an account?
Register for free

Choose new content alerts to be informed about new research of interest to you
Easy remote access to your institution's subscriptions on any device, from any location
Save your searches and schedule alerts to send you new results
Export your search results into a .csv file to support your research

NMR-TS: de novo molecule identification from NMR spectra

ABSTRACT

1. Introduction