Crystal structure map for materials classification and modeling


ABSTRACT

For classifying and modeling properties of crystalline materials in terms of structure, a three-step workflow with (1) generation of structure feature vectors, (2) evaluation of distances among the feature vectors as a measure of similarity in structure, and (3) mapping of each structure in a low-dimensional space with principal components using dimension reduction is proposed. The obtained distance and resulting principal components are useful for classifying similar crystal structures for a given set of materials systems and for constructing descriptors in machine-learning analyses of properties. The eigenvector of the principal components indicates which part of the original structure feature vector is contained as important information. Examples are demonstrated for classification and property modeling of Al $_{2}$ O $_{3}$ polymorphs including amorphous structures and of the alloy configurations of Si-doped LaFe $_{13}$ compounds.


IMPACT STATEMENT

Crystal structure map constructed from the feature of structures by a dimension-reduction technique for classification and modeling for a group of related materials such as polymorphs, alloys, and doped systems.


1. Introduction

Today, a wealth of materials information is being generated, collected, and accumulated experimentally, theoretically, and computationally. One of the current hot topics in materials research is materials discovery assisted by data-science approaches that exploit this information [Citation1]. For materials exploration and design, it is important to know in advance where each existing materials system is positioned on a map and, more fundamentally, how the coordinate system of the map is defined for a given group of materials. This raises the novel concept of a ‘materials map’ [Citation2]. Since the materials space is extremely high dimensional, the key to building a materials map for a concrete target system is to extract appropriate features that define the low-dimensional axes of the map, like the latitude and longitude of ordinary geographical maps.

The atomic configuration in space, simply called structure, is the most fundamental attribute of molecular and condensed-matter systems, especially crystalline materials, and together with the composition it determines the Hamiltonian for the electronic states under the Born-Oppenheimer approximation [Citation3]. Our understanding of materials properties is often based on their correlation with the structure. A huge number of structure types exist in nature, and they can be classified by symmetry in terms of point group or space group (SG), which is closely connected to the electronic states and related electronic properties.

For example, the electronic state of condensed matter is often described by analyzing the symmetry of the local structure around the target atom in terms of orbital splitting and hybridization. The electronic band structure of a crystalline system obeys its translational symmetry, with the Brillouin zone defined on the basis of the Bloch theorem. In general, however, the structure dependence of the electronic state and the resulting properties is not easily understood because of the high dimensionality of the structure space. In the desired structure map, the low-dimensional coordinates should classify the large number of related structure types, and the corresponding axes should provide structure descriptors for subsequent materials modeling.

A common language is needed to start the analysis of the huge number of structures for a given general materials group. It should contain all the necessary but unambiguous information of the structure we investigate. In crystalline materials, the components of structural information are uniquely described with the crystallographic information framework (CIF) defined by the dictionary CoreCIF [Citation4] established by International Union of Crystallography (IUCr) [Citation5]. Most of the crystallographic databases such as ICSD [Citation6], COD [Citation7], and AtomWork [Citation8] adopt the CIF format requisite for structure data standardization. Some crystallographic visualization tools also accept the CIF format data [Citation9,Citation10].

Recently, accelerated by automated high-throughput computations with first-principles density-functional-theory (DFT) electronic structure calculation methods, computational materials databases and archives containing optimized, predicted, and designed crystal structure information compatible with CIF have been constructed, such as Materials Project [Citation11], NOMAD [Citation12], OQMD [Citation13], and AFLOW [Citation14]. Together with the experimental structure databases, they form continuously growing big data on crystal structures. To understand materials properties associated with structure for further development and deployment, it is worthwhile to extract the important structural features for a given target group of materials. In general, a crystal structure consists of high-dimensional information, and a dimension-reduction technique is required for this purpose. In this study, a workflow for extracting such important features by dimension reduction is proposed and demonstrated for some examples.

Obtained low-dimensional features of crystal structure can be used for classifying the structures (clustering as a data-science term) among the related materials systems and also for machine-learning modeling of structure-associated electronic properties of materials. In addition, a simple technique to create possible configurational structures by substituting atomic sites of a materials system with different elements in all combinational degrees of freedom is proposed for applications to alloy and impurity problems with supercell models.

2. Methods

A three-step procedure, which starts from a set of CIF files of the target crystal systems and maps the mutual position of each crystal structure in a low-dimensional space, is described in the following sub-sections. Moreover, a method for generating structures with the possible required configurations for a given set of compositions is explained in connection with the three-step workflow. The whole structure of the workflow is shown in Figure 1.

Figure 1. Three-step workflow described in the present study. Each name in capital italics denotes a detailed procedure. CIF2ESC: conversion from a CIF file to electronic-structure-calculation data with output of the fingerprint as a structure feature for subsequent structure analyses; DISTANCE: evaluation of distances between structure feature vectors; DISTANCE_DISTRIBUTION: calculation of the distance distribution; DIAGNOSIS: extraction of inequivalent structures; CMDS: classical multidimensional scaling; PCEV: calculation of principal component eigenvectors; PROJECTION: projection of an arbitrary structure on the dimension-reduced map; ATLS: generation of the atom-type list; FINDSYM$^{*}$: space-group finder ($^{*}$Ref. [Citation15,Citation16]). Open arrows denote data transfers between the steps, and solid (and dotted) arrows denote data transfers inside each step. A dotted arrow applies only if necessary.

2.1. STEP-1: Feature vector of crystal structure (abbreviated as CIF2ESC)

Crystal structure is the starting point of the present analysis, and we assume that the detailed information for a set of materials systems is given in the CIF format. To build a feature for a general structure, one should define a vector that is simple, tractable, invariant under translation and rotation, and possibly high-dimensional enough to distinguish different structures. The coordination number is the simplest piece of information on the local structure around a particular atomic site in a crystal. However, the cutoff distance used to determine the nearest neighbors (NN) is rather subjective and not well defined for general structures. The radial distribution function (RDF) contains complete pair-wise structure information within a given radius and is directly measurable in extended x-ray absorption fine structure (EXAFS) experiments. An extension of the RDF beyond pair correlations may be possible, as in the cluster expansion [Citation17].

Oganov and Valle [Citation18] have proposed $F$ -fingerprint as a structure feature defined for an $AB$ element pair of a system as

$$F_{AB}(R) = \frac{V}{4 \pi R^{2} N_{A} N_{B}} \sum_{A_{i} \in \mathrm{cell}} \sum_{B_{j}} \delta(R - R_{ij}) - 1, \tag{1}$$

where $A_{i}$ is the $i$-th site of element $A$, $R_{ij}$ is the distance between $A_{i}$ and $B_{j}$, $N_{A}$ is the number of sites of $A$ in the cell, and $V$ is the volume of the unit cell. In Equation (1), the first term is nothing but the RDF, and the second term $(-1)$ is introduced so that the $F$-fingerprint becomes short-ranged with the asymptotic behavior $F(\infty) = 0$. The $\delta$ function is smoothed with a Gaussian function of an appropriate broadening width $\sigma$ in the practical numerical process as

$$\delta(R) \approx \frac{1}{\sqrt{2 \pi} \sigma} \exp\left[- \frac{R^{2}}{2 \sigma^{2}}\right]. \tag{2}$$

The $F$-fingerprint in Equation (1) is discretized as $F_{AB}(R_{l})$ $(l = 1, \dots, m)$ with an interval $\Delta$ up to an assumed maximum radius $R_{m} \equiv R_{\max}$. By connecting all independent RDFs for the $p$ element pairs in series, a $pm (\equiv n)$-dimensional ($n$-D) feature vector is finally obtained. The parameters of the feature vector, $R_{\max}$, $\sigma$, and $\Delta$, should be chosen by examining the resulting structure distances according to the user’s target system and purpose. Generally speaking, $\sigma$ corresponds to the resolution of the RDF for distinguishing given structures. Thus, typical values of $\sigma$ are on the order of 0.1 Å, judged from the NN and next-nearest-neighbor (NNN) distances in the compounds. Accordingly, the discretization interval $\Delta$ should be somewhat smaller than $\sigma$ to make the resolution meaningful. The cutoff radius $R_{\max}$ should be chosen to be about the cell size of the system, because the RDF information beyond it may be redundant owing to the periodicity of the model system.
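As an illustration of Equations (1) and (2), the following is a minimal NumPy sketch of a discretized $F$-fingerprint for one element pair in a periodic cell. It is not the CIF2ESC implementation. The function name `f_fingerprint`, the grid convention $R_{l} = l Δ$, and the truncation of the Gaussian tails at $R_{\max} + 5σ$ are assumptions made here for illustration only.

```python
import itertools
import numpy as np

def f_fingerprint(lattice, frac_a, frac_b, r_max=5.0, sigma=0.2, delta=0.1):
    """Discretized F_AB(R_l) of Eq. (1) with the Gaussian smearing of Eq. (2).

    lattice        : (3, 3) array, rows are the cell vectors in angstrom
    frac_a, frac_b : fractional coordinates of the A and B sites in the cell
    """
    lattice = np.asarray(lattice, dtype=float)
    frac_a = np.atleast_2d(np.asarray(frac_a, dtype=float))
    frac_b = np.atleast_2d(np.asarray(frac_b, dtype=float))
    volume = abs(np.linalg.det(lattice))

    # Assumed grid: R_l = l * delta for l = 1..m, so that R_m = r_max.
    m = int(round(r_max / delta))
    grid = delta * np.arange(1, m + 1)

    # Number of periodic images needed to cover r_max plus the Gaussian tail.
    reach = r_max + 5.0 * sigma
    normals = np.cross(lattice[[1, 2, 0]], lattice[[2, 0, 1]])
    heights = volume / np.linalg.norm(normals, axis=1)
    n_img = np.ceil(reach / heights).astype(int) + 1
    shifts = np.array(list(itertools.product(*(range(-n, n + 1) for n in n_img))))

    # All A_i -> B_j separations, including periodic images of B.
    diff = frac_b[None, :, None, :] + shifts[None, None, :, :] - frac_a[:, None, None, :]
    dist = np.linalg.norm(diff @ lattice, axis=-1).ravel()
    dist = dist[(dist > 1e-8) & (dist < reach)]  # drop the zero-distance self pair

    # Sum of Gaussian-broadened delta functions on the radial grid.
    gauss = np.exp(-((grid[:, None] - dist[None, :]) ** 2) / (2.0 * sigma ** 2))
    gauss /= np.sqrt(2.0 * np.pi) * sigma
    prefactor = volume / (4.0 * np.pi * grid ** 2 * len(frac_a) * len(frac_b))
    return grid, prefactor * gauss.sum(axis=1) - 1.0

# Toy example: simple cubic cell (a = 3 angstrom) with a single atom, A = B.
grid, f_aa = f_fingerprint(3.0 * np.eye(3), [[0, 0, 0]], [[0, 0, 0]])
```

For a multi-element system, the arrays returned for each independent pair would be concatenated in series to give the $n$-D feature vector described above.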

The $F$-fingerprint has been successfully used in our previous works [Citation19–21] for accelerating crystal structure prediction with the Bayesian optimization technique. In particular, the efficiency of the crystal structure search was evaluated for different choices of the parameters included in the $F$-fingerprint [Citation20]. Extensions of the $F$-fingerprint that represent higher-order crystal structure information might also be applicable in the present scheme by including angular degrees of freedom as local many-site correlations [Citation22,Citation23].

2.2. STEP-2: Structure distance as a measure of similarity (DISTANCE and DIAGNOSIS)

Although the similarity of crystal structures is not uniquely defined, let us consider the feature-vector distance as a quantity that represents it. The degree of similarity can be considered in a semi-quantitative way: two structures are equivalent if their structure distance $D$ satisfies $D < D_{th}$ for a given threshold $D_{th}$, and a pair of structures 1 and 2 with distance $D (1, 2)$ is closer in structural similarity than another pair 1 and 3 with distance $D (1, 3)$ if the distances satisfy $D (1, 2) < D (1, 3)$. In any case, $D_{th}$ can be determined by inspecting the distance distribution and the error bars in the structure data (typically the significant figures of the lattice constants and atomic coordinates).

Given two $n$ -D vectors $X$ and $Y$ representing the feature of crystal structures, the simplest distance between them is the Euclidean distance defined as

$$d_{E}(X, Y) = \left[\sum_{l = 1}^{n} (X_{l} - Y_{l})^{2}\right]^{1 / 2}. \tag{3}$$

For the present purpose, the Pearson correlation coefficient (PCC) can be another choice as a measure of structure similarity as

$$r(X, Y) = \frac{\sum_{l = 1}^{n} (X_{l} - \bar{X}) (Y_{l} - \bar{Y})}{\left(\sum_{l = 1}^{n} (X_{l} - \bar{X})^{2} \sum_{l = 1}^{n} (Y_{l} - \bar{Y})^{2}\right)^{1 / 2}}, \tag{4}$$

where $\bar{X}$ denotes the average of the vector components. Since the PCC lies within the range $-1 \leq r \leq +1$, it can be transformed into a kind of distance as

$$d_{C}(X, Y) = \frac{1}{2} (1 - r), \tag{5}$$

known as the Cosine distance. In the present case with the vectors given in Equation (1), the Cosine distance between structures 1 and 2 is explicitly computed as

$$d_{C}(1, 2) = \frac{1}{2} \left(1 - \frac{F_{1} \cdot F_{2}}{| F_{1} | | F_{2} |}\right), \tag{6}$$

where the inner product is defined as

$$F_{1} \cdot F_{2} = \sum_{AB} \sum_{l} F_{1, AB}(l) F_{2, AB}(l) w_{AB} \tag{7}$$

with a weight proportional to the pair probability

$$w_{AB} = \frac{N_{A} N_{B}}{\sum_{\mathrm{cell}} N_{A} N_{B}}. \tag{8}$$

This means that the present scheme is limited to equal-composition systems when comparing more than one distance. In the same way as Equation (6), the Euclidean distance between structures 1 and 2 is obtained as

$$d_{E}(1, 2) = \left[\sum_{AB} \sum_{l} \left[F_{1, AB}(l) - F_{2, AB}(l)\right]^{2} w_{AB}\right]^{1 / 2}. \tag{9}$$
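The weighted distances of Equations (6) to (9) can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: each fingerprint is stored as a `(p, m)` array with one row per element pair, the normalization in Equation (8) is taken as a sum over the listed independent pairs, and all function names are hypothetical.

```python
import numpy as np

def pair_weights(counts, pairs):
    """w_AB of Eq. (8): N_A * N_B normalized over the listed element pairs (assumed)."""
    raw = np.array([counts[a] * counts[b] for a, b in pairs], dtype=float)
    return raw / raw.sum()

def weighted_inner(f1, f2, w):
    """Weighted inner product of Eq. (7); f1 and f2 have shape (p, m)."""
    return float(np.sum(w[:, None] * f1 * f2))

def cosine_distance(f1, f2, w):
    """Cosine distance of Eq. (6)."""
    norm = np.sqrt(weighted_inner(f1, f1, w) * weighted_inner(f2, f2, w))
    return 0.5 * (1.0 - weighted_inner(f1, f2, w) / norm)

def euclidean_distance(f1, f2, w):
    """Weighted Euclidean distance of Eq. (9)."""
    return float(np.sqrt(np.sum(w[:, None] * (f1 - f2) ** 2)))

# Toy data: Al2O3 composition with the three pairs O-O, O-Al, Al-Al.
w = pair_weights({"O": 3, "Al": 2}, [("O", "O"), ("O", "Al"), ("Al", "Al")])
f = np.arange(6.0).reshape(3, 2) + 1.0  # placeholder fingerprint, not real data
```

Because the weights depend on $N_{A}$ and $N_{B}$, two structures must share the same composition for their fingerprints to be compared with a common `w`, which is the restriction noted above.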

In high-throughput DFT calculations, a huge set of similar crystal structures is generated for a given composition. Some of them are essentially equivalent but are often judged to be different because of numerical errors or insufficient structural optimization. The distance obtained above can be used, with a given threshold distance, to list the crystallographically inequivalent structures in the structure set in a reasonable manner. This procedure may reduce the total computing time of DFT calculations and can be combined with a subsequent process for primitive-cell and symmetry determination. Without it, one would have to deal with the large number of possible atomic configurations in an impurity or alloying problem, many of which may be equivalent, as we shall see in Sect. 3.2.
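The text does not spell out how DIAGNOSIS selects the inequivalent structures, so the following is only a hypothetical greedy sketch of threshold-based filtering on a precomputed distance matrix. The function name and the first-come choice of representatives are assumptions. The additional check that equivalent structures share the same SG is not included.

```python
import numpy as np

def inequivalent_representatives(dist, d_th):
    """Greedy pass over an (N, N) distance matrix.

    A structure becomes a new representative only if its distance to every
    representative kept so far is at least d_th; otherwise it is assigned
    to the first representative that lies closer than d_th.
    """
    dist = np.asarray(dist, dtype=float)
    kept = []
    members = {}
    for i in range(dist.shape[0]):
        for rep in kept:
            if dist[i, rep] < d_th:
                members[rep].append(i)
                break
        else:
            kept.append(i)
            members[i] = [i]
    return kept, members

# Toy example: four points on a line forming two tight groups.
pos = np.array([0.0, 0.001, 1.0, 1.0005])
toy_dist = np.abs(pos[:, None] - pos[None, :])
kept, members = inequivalent_representatives(toy_dist, d_th=0.01)
```

In this toy case the four entries collapse to two representatives, so any subsequent calculation would need to be run only for indices 0 and 2.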

2.3. STEP-3(1): Structure mapping (CMDS)

For a given Euclidean or non-Euclidean distance matrix $(N \times N) D$ constructed from a set of $N$ structures with $n$-D feature vectors $(N \times n) X$, dimension-reduced coordinates of the structures $(N \times k) X_{+}$ approximated in a low $k (\ll n)$-D space can be obtained by multidimensional scaling (MDS) [Citation24–26]. (For clarity, the size of a matrix is indicated like $(N \times k)$ when it first appears.) The MDS scheme is often called ‘dimension reduction’ in data mapping and is equivalent to principal component analysis (PCA) when the Euclidean distance is applied. In MDS, the pair-wise Euclidean distance matrix $(N \times N) D^{X}$ calculated from $X_{+}$ is the closest approximation to $D$. The algorithm of MDS is quite simple. It starts with the squared distance matrix $(N \times N) D^{(2)}$, whose components are the squares of the components of the original distance matrix. First, the squared distance matrix is transformed by the so-called double centering with $(N \times N) J$ into a matrix $(N \times N) B$ as

$$B = - \frac{1}{2} J D^{(2)} J \quad \text{with} \quad J = I - \frac{1}{N} 1 1^{T}, \tag{10}$$

where $I$ and $1$ are the identity matrix and the column vector of ones, respectively, and the superscript $T$ stands for the matrix transpose. Next, the matrix $B$ is decomposed into its eigenvalues and eigenvectors as

$$B = Q \Lambda Q^{T}. \tag{11}$$

Here, the matrix $(N \times N) Λ$ contains the eigenvalues in the diagonal elements with zeros in the off-diagonal ones and $(N \times N) Q$ is the corresponding eigenvector matrix. Finally, by taking the first $k$ positive-eigenvalue part $(k \times k) Λ_{+}$ and $(N \times k) Q_{+}$ in the order from the positive-largest one, the principal components (PCs) $(N \times k) X_{+}$ approximately representing the structure coordinates can be obtained as

$$X_{+} = Q_{+} \Lambda_{+}^{1 / 2}, \tag{12}$$

where $(k \times k) Λ_{+}^{1 / 2}$ is the matrix of which diagonal elements are the square root of the eigenvalues as $Λ_{+}^{1 / 2} Λ_{+}^{1 / 2} = Λ_{+}$ . Then, the matrix B is approximated in the form of scalar product as

$$B \approx B_{+} = X_{+} X_{+}^{T}. \tag{13}$$

As indicated in Equation (12), the spread of the structure coordinates along each PC is proportional to the square root of the corresponding eigenvalue, since the eigenvector matrix $Q$ in Equation (11) is usually taken in a unitary form. Thus, the size of the eigenvalue determines the importance of each PC obtained by the dimension reduction.
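The classical MDS steps of Equations (10) to (12) map directly onto a few NumPy calls. The sketch below assumes a symmetric distance matrix as input; the function name `classical_mds` and the explicit symmetrization against round-off are choices made here, not part of the original CMDS procedure.

```python
import numpy as np

def classical_mds(dist, k):
    """Return (X_+, eigenvalues, eigenvectors) for an (N, N) distance matrix."""
    dist = np.asarray(dist, dtype=float)
    n = dist.shape[0]
    j = np.eye(n) - np.ones((n, n)) / n       # centering matrix J of Eq. (10)
    b = -0.5 * j @ (dist ** 2) @ j            # double-centered matrix B
    b = 0.5 * (b + b.T)                       # remove round-off asymmetry
    evals, evecs = np.linalg.eigh(b)          # Eq. (11); eigh returns ascending order
    order = np.argsort(evals)[::-1]           # reorder from the largest eigenvalue
    evals, evecs = evals[order], evecs[:, order]
    if evals[k - 1] <= 0.0:
        raise ValueError("fewer than k positive eigenvalues")
    coords = evecs[:, :k] * np.sqrt(evals[:k])  # X_+ = Q_+ Lambda_+^(1/2), Eq. (12)
    return coords, evals, evecs

# Toy example: five points in a plane are recovered exactly with k = 2.
pts = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.5]])
toy_dist = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
coords, evals, _ = classical_mds(toy_dist, k=2)
```

The recovered coordinates differ from `pts` by a rigid rotation, reflection, and translation, but their pair-wise distances reproduce `toy_dist`.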

As Sibson has pointed out [Citation27], the value of $k$ can be determined by judging the magnitude of the positive eigenvalues, since the sum of the eigenvalues in $\Lambda_{+}$ approximates the sum of all eigenvalues in $\Lambda$. This point is of great practical importance when determining the number of PCs and evaluating the performance of the adopted feature vectors and distances [Citation28]. To see how much of the original information remains after the dimension reduction, it is convenient to check the so-called proportion defined by

$$P_{k} \equiv \frac{\sum_{i = 1}^{k} \Lambda_{ii}}{\sum_{i = 1}^{N} \Lambda_{ii}}. \tag{14}$$

Note that the summation in the numerator of Equation (14) is taken in descending order starting from the largest positive eigenvalue.
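A short sketch of Equation (14) is given below, with a hypothetical helper `proportion` and made-up eigenvalue lists. The second list contains negative eigenvalues, as can occur for a non-Euclidean distance, and shows that $P_{k}$ is then no longer bounded by 1.

```python
import numpy as np

def proportion(evals, k):
    """P_k of Eq. (14): k largest eigenvalues over the sum of all eigenvalues."""
    ev = np.sort(np.asarray(evals, dtype=float))[::-1]  # descending order
    return ev[:k].sum() / ev.sum()

euclid_like = [4.0, 3.0, 2.0, 1.0]        # illustrative values only
non_euclid_like = [5.0, 3.0, -1.0, -2.0]  # illustrative values only

p2_euclid = proportion(euclid_like, 2)         # 7 / 10
p2_non_euclid = proportion(non_euclid_like, 2)  # 8 / 5, larger than 1
```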

2.4. STEP-3(2): Projection on map and its inverse operation (PCEV and PROJECTION)

Let us describe a way to obtain the eigenvectors of the PCs in the original $n$-D space and to project a given new structure onto the low-dimensional map obtained above, by considering the relation to PCA [Citation29]. In PCA, the data covariance matrix $(n \times n) S = X^{T} X$ is the target quantity, and the projected variance $u^{T} S u$ is maximized with a unitary matrix $(n \times n) u$ and the Lagrange multiplier $\lambda$ as

$$\frac{\delta}{\delta u^{T}} \left[u^{T} S u + \lambda (1 - u^{T} u)\right] = 0, \tag{15}$$

leading to an eigenvalue problem of

$$S u = u \Lambda. \tag{16}$$

Note that the coordinate matrix $(N \times n) X$ defined in this paper follows Borg-Groenen’s prescription [Citation25], which is the transpose of Bishop’s definition [Citation29]. By combining Equations (12), (13), and (16), the PC eigenvector $(n \times k) u_{+}$ can be computed in a normalized form as

$$u_{+} = (X^{T} - 1 \bar{X}^{T}) Q_{+} \Lambda_{+}^{- 1 / 2} = (X^{T} - 1 \bar{X}^{T}) X_{+} \Lambda_{+}^{- 1} \tag{17}$$

with the center coordinate of the original feature vectors as

$$\bar{X}_{1 l} = \frac{1}{N} \sum_{i = 1}^{N} X_{il} \quad (l = 1, \dots, n). \tag{18}$$

The original coordinate matrix $X$, which is used to calculate the distance, is not necessarily centered and is therefore explicitly centered in Equation (17). By using the obtained PC eigenvector $u_{+}$, any structure feature vector $(1 \times n) Y$ in the original space can be projected onto the reduced $k$-D map as

$$Y_{+} = (Y - \bar{X}) u_{+}. \tag{19}$$

With help of the minimum-error formulation of PCA [Citation29], an inverse projection, that is the transformation of an arbitrary reduced coordinate on the map $(1 \times k) Z_{+}$ into a feature vector $(1 \times n) Z$ in the original structure space, can be made as

$$Z = Z_{+} u_{+}^{T} + \bar{X}. \tag{20}$$

Since a certain amount of information is lost due to the dimension reduction, the inverse projection can only be accomplished approximately. The projection and inverse projection procedures described above are reasonably made if the distance matrix is calculated as the Euclidean distance since MDS and PCA are rigorously equivalent. Even in non-Euclidean distance case, the procedures may possibly be carried out.
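For the Euclidean case, the PC eigenvectors, the projection, and the inverse projection of Equations (17) to (20) can be sketched as follows. The sketch assumes a plain, unweighted Euclidean distance, for which $B$ equals the Gram matrix of the centered feature vectors. The function names and the random test data are hypothetical, and this is not the PCEV or PROJECTION code itself.

```python
import numpy as np

def cmds_from_features(x, k):
    """X_+ and the k largest eigenvalues for plain Euclidean distances."""
    xc = x - x.mean(axis=0, keepdims=True)
    b = xc @ xc.T                      # equals -1/2 J D^(2) J for Euclidean D
    evals, q = np.linalg.eigh(b)
    order = np.argsort(evals)[::-1][:k]
    lam = evals[order]
    return q[:, order] * np.sqrt(lam), lam

def pc_eigenvectors(x, coords, lam):
    """Eq. (17): u_+ = (centered X)^T X_+ Lambda_+^(-1), shape (n, k)."""
    x_bar = x.mean(axis=0, keepdims=True)   # Eq. (18), shape (1, n)
    u = (x - x_bar).T @ coords / lam        # divides column i by eigenvalue i
    return u, x_bar

def project(y, u, x_bar):
    """Eq. (19): map feature vectors onto the k-D map."""
    return (np.atleast_2d(y) - x_bar) @ u

def inverse_project(z, u, x_bar):
    """Eq. (20): approximate feature vectors from map coordinates."""
    return np.atleast_2d(z) @ u.T + x_bar

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 4))               # 6 toy "structures", 4-D features
coords, lam = cmds_from_features(x, k=2)
u, x_bar = pc_eigenvectors(x, coords, lam)
```

Projecting the original feature vectors with `project(x, u, x_bar)` reproduces `coords`, and the inverse projection recovers `x` exactly only when all non-zero PCs are retained.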

2.5. Generation of structures with all possible configurations (ATLS)

In addition to stoichiometric ordered materials, alloyed and impurity-doped systems are often targets in materials research. In such systems, counting all the possible configurations is indispensable for optimizing materials properties. In electronic structure calculations for alloy systems, the coherent-potential approximation (CPA) [Citation30–32] is usually adopted by taking the single-site average of scattering path operators within the Korringa-Kohn-Rostoker (KKR) Green’s function formalism [Citation33,Citation34]. This approximation works quite well for randomly distributed alloy systems. A similar idea with supercell models has been proposed in the special quasi-random structures (SQS) method, where the site configuration is optimized by making multiple site-correlation functions as close as possible to those of a random structure [Citation35,Citation36].

However, partial ordering or site preference is often seen in real materials and is essential in governing their properties. Listing all the possible configurational combinations is quite useful for investigating such properties associated with local site order for a given composition with specific elements. Such a tool has been developed as ATLS. For example, let us consider a $(2 \times 2 \times 2)$ body-centered supercell model containing eight atoms (four Fe and four Co) with a BCC-based lattice structure. There are $_{8} C_{4} = 70$ possible combinations. With given lattice constants, lattice centering types, atomic composition, and corresponding atomic coordinates, the SG and related structural properties can be obtained by using the FINDSYM software [Citation15,Citation16]. By using the workflow described above, only four structures are found to be crystallographically inequivalent (32 $R \bar{3} m$ + 12 $Cmmm$ + 24 $I 4_{1} amd$ + 2 $Pm \bar{3} m$) out of the 70 configurations. In the present work, crystallographically equivalent structures are those whose $F$-fingerprints are equivalent within a reasonable threshold and which have the same SG. Interestingly, the $Cmmm$ and $I 4_{1} amd$ structures extracted above have exactly the same $F$-fingerprint within minor numerical errors and cannot be distinguished by the pair-wise information alone. This is a drawback of the present choice of fingerprint, and structure features beyond the pair-wise correlation should be included if necessary. This situation does not arise in the example of La(Fe,Si)$_{13}$ (see Sect. 3.2). In the case of a $(2 \times 2 \times 2)$ primitive supercell model containing 16 atoms (eight Fe and eight Co) with a BCC-based lattice structure, there are $_{16} C_{8} = 12870$ combinations and 51 crystallographically inequivalent configurations. As in the case above, some configurations have exactly the same fingerprint, resulting in 32 independent pair-wise fingerprints. Some interesting situations may arise if the parent system contains more than one crystallographically inequivalent atomic site, leading to a site-preference problem. Such a case will be given below in Section 3.2.
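The combinatorial part of this step can be sketched with the Python standard library. The generator below is a hypothetical stand-in for the enumeration stage only: it lists every assignment of a given number of substituted sites and does not perform the fingerprint or SG analysis needed to reduce the list to inequivalent structures.

```python
from itertools import combinations
from math import comb

def enumerate_configurations(n_sites, n_sub, host="Fe", dopant="Co"):
    """Yield every site-label tuple with n_sub of the n_sites occupied by dopant."""
    for chosen in combinations(range(n_sites), n_sub):
        labels = [host] * n_sites
        for site in chosen:
            labels[site] = dopant
        yield tuple(labels)

# Eight-site cell with four Fe and four Co, as in the body-centered example.
configs_8 = list(enumerate_configurations(8, 4))
n_configs_8 = len(configs_8)   # equals comb(8, 4)
n_configs_16 = comb(16, 8)     # count for the 16-site cell, without listing them
```

The counts agree with the 70 and 12870 combinations quoted in the text; each tuple would then be combined with the site coordinates to build a structure for fingerprinting.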

2.6. DFT calculations

All the DFT electronic structure calculations in the present study are performed with the all-electron full-potential linearized augmented-plane-wave (FLAPW) method [Citation37–39] implemented in the HiLAPW program [Citation40]. The exchange and correlation are incorporated within the Perdew-Burke-Ernzerhof form [Citation41] of the generalized gradient approximation to DFT. The energy cutoffs for wave function and potential expansions are 20 Ry and 160 Ry, respectively. A set of $Γ$ -centered $k$ -point mesh with the interval of approximately 0.1 Bohr $^{- 1}$ is used to sample the Brillouin zone with the improved tetrahedron integration scheme [Citation42] in all the self-consistent-field DFT calculations.
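How a target $k$-point interval translates into mesh divisions depends on conventions that are not stated here. As a rough sketch only, the helper below assumes reciprocal vectors that include the factor $2π$ and rounds the number of divisions up along each axis; it is hypothetical and does not describe the HiLAPW program.

```python
import numpy as np

def gamma_centered_mesh(lattice_bohr, spacing=0.1):
    """Divisions per reciprocal axis so that the k interval is at most `spacing`.

    lattice_bohr : (3, 3) array, rows are the cell vectors in Bohr
    spacing      : target interval in Bohr^-1 (assumed to include 2*pi)
    """
    a = np.asarray(lattice_bohr, dtype=float)
    b = 2.0 * np.pi * np.linalg.inv(a).T      # rows are reciprocal vectors
    lengths = np.linalg.norm(b, axis=1)
    return tuple(int(n) for n in np.maximum(1, np.ceil(lengths / spacing)))

mesh_cubic = gamma_centered_mesh(10.0 * np.eye(3))
mesh_tetragonal = gamma_centered_mesh(np.diag([10.0, 10.0, 20.0]))
```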

3. Results and discussion

3.1. Classification of Al $_{2}$ O $_{3}$ polymorphs

There often exist several stable and metastable crystal structures with the same composition, called polymorphs, especially in oxide systems. Silica SiO$_{2}$ and alumina Al$_{2}$O$_{3}$ are the most typical such cases in materials science and engineering. For example, there are 15 different crystal structures of Al$_{2}$O$_{3}$ found on the Materials Project site [Citation11]. One of them, mp-1245063, seems to be an amorphous model structure with 100 atoms in a unit cell (SG: $P 1$). In addition, another amorphous model structure including 120 atoms in a unit cell was generated by a first-principles molecular-dynamics simulation with a quenching scheme from a liquid state [Citation43]. Structural analyses of Al$_{2}$O$_{3}$ for several different phases have been carried out in relation to their properties [Citation44–46].

A map of the Al$_{2}$O$_{3}$ structures is created by using the three-step procedure proposed above. It is especially interesting to look at the positions of the two amorphous model structures on the map of the polymorph structures, as shown below. (One structure, mp-985587, is a model for an artificial 2D film and is excluded from the following analysis.)

Calculated DFT total energies and energy gaps are listed in Table 1 together with the corresponding values given in the Materials Project [Citation11]. Here, the same lattice constants and atomic coordinates as in the previous works [Citation11,Citation43] were assumed. The present DFT total energies are almost equivalent to the Materials Project data, while the energy gaps are slightly underestimated as a general tendency. The energy gaps are evaluated from the highest energy of the valence bands and the lowest energy of the conduction bands at the calculated $k$-mesh points.
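The gap evaluation described above amounts to taking a maximum and a minimum over the sampled $k$ points. The following sketch assumes an insulator with a fixed number of fully occupied bands and an array of band energies of shape `(n_k, n_bands)`; the function name and the toy numbers are hypothetical.

```python
import numpy as np

def band_gap(eigenvalues, n_occupied):
    """Gap between the valence-band maximum and conduction-band minimum on the mesh.

    eigenvalues : (n_k, n_bands) band energies in eV at the sampled k points
    n_occupied  : number of fully occupied bands (assumed the same at every k)
    """
    e = np.sort(np.asarray(eigenvalues, dtype=float), axis=1)
    vbm = e[:, n_occupied - 1].max()   # highest valence-band energy
    cbm = e[:, n_occupied].min()       # lowest conduction-band energy
    return max(cbm - vbm, 0.0)         # zero if the bands overlap

# Toy band energies (eV) at two k points; not taken from any calculation.
toy_bands = np.array([[-1.0, 0.0, 5.0],
                      [-0.5, -0.2, 6.0]])
toy_gap = band_gap(toy_bands, n_occupied=2)
```

Because only the sampled mesh points enter, the value depends on whether the band extrema happen to lie on the mesh.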

Table 1. List of Al$_{2}$O$_{3}$ structures in the order of calculated total energy $E$ (meV/atom) relative to the most stable $R \bar{3} c$ structure, and the energy gap $E_{G}$ (eV). Figures in parentheses are values given on the Materials Project site [Citation11]. The color scheme is used for distinguishing different structures in Figures 2 and 4.


To calculate the $F$-fingerprint for all the systems except the 2D film model mp-985587, the parameters $\sigma = 0.2$ Å and $\Delta = 0.1$ Å were set according to the arguments given in Sect. 2.1. Clear dips are commonly seen just beyond the first NN O-Al pair shell in the $F$-fingerprint of all the systems, as shown in Figure 2. The two amorphous structures are quite similar to each other in the $F$-fingerprint, while some recognizable differences can be seen in the O-Al and Al-Al pair regions beyond the NN distance. They are also similar to the crystalline polymorphs within the first NN shell of the O-O, O-Al, and Al-Al pairs and become remarkably broadened beyond it. Such features were found and discussed in detail by Momida et al. [Citation43]. Distances are calculated using the Euclidean and Cosine distance schemes up to radius cutoffs of $R_{\max} = 10$ Å and $R_{\max} = 5$ Å, and the eigenvalues and eigenvectors of the double-centered squared distance matrix are computed to obtain the structure coordinates as described in Sects. 2.2 and 2.3.

Figure 2. $F$-fingerprint of all the Al$_{2}$O$_{3}$ systems listed in Table 1 except for the 2D film model. A radius cutoff of $R_{\max} = 10$ Å is assumed, and the fingerprint dimensions between 0 and 100, 101 and 200, and 201 and 300 correspond to the O-O, O-Al, and Al-Al pairs, respectively. Line colors follow the color scheme in Table 1 for distinguishing different structures.


In Figure 3, calculated eigenvalues are drawn for the two radial cutoffs $R_{\max} = 10$ Å and $R_{\max} = 5$ Å with the Euclidean and Cosine distances (Equations (9) and (6), respectively). Typical lattice constants of the systems we consider are up to about 10 Å, and the cutoff $R_{\max} = 10$ Å should naively be appropriate as the first choice. To see the effects of the assumed cutoff on the structure analyses, $R_{\max} = 5$ Å is also tested. Looking at the Euclidean-distance eigenvalues in Figure 3, $R_{\max} = 10$ Å gives relatively larger eigenvalues than $R_{\max} = 5$ Å. However, the proportion with $R_{\max} = 10$ Å ($P_{5} = 0.91$ and $P_{8} = 0.98$) is not so different from that with $R_{\max} = 5$ Å ($P_{5} = 0.84$ and $P_{8} = 0.95$). In the case of the Cosine distance, the behavior of the eigenvalues closely resembles the Euclidean one, but estimating the retained information from the proportion alone is somewhat ambiguous because of the existence of negative eigenvalues. Instead, it should be assessed by testing the precision of models constructed with the obtained PCs as descriptors, as we shall see below.

Figure 3. Eigenvalues calculated from the fingerprint of Al $_{2}$ O $_{3}$ polymorphs using (a) $R_{\max} = 10$ Å and (b) $R_{\max} = 5$ Å with Euclidean (blue) and cosine (red) distances.

Figure 4 shows the coordinate map of all the Al$_{2}$O$_{3}$ structures on five PCs ($C_{1}$ through $C_{5}$, the first five components of $X_{+}$ given in Equation (12)) with different radial cutoffs and types of distance. The distribution of the structures in the maps is generally quite similar for the different cutoffs and types of distance. Most typically, the three lowest-energy structures (SG: $R \bar{3} c$ (black), $C 2 / m$ (maroon), and $Pna2_{1}$ (blue)) form a line-shaped cluster in the $C_{1}$-$C_{2}$ plane. Relatively low-energy structures up to about 100 meV/atom surround the lowest-energy cluster, forming a basin, except for the $Ia \bar{3}$ structure (pink), which is the most stable structure of In$_{2}$O$_{3}$ and lies far outside along the PC $C_{1}$. The two amorphous structures are located near the edge of the basin, close to the highest-energy structure ($P \bar{1}$). The $Pbcn$ structure is situated close to the lowest-energy one ($R \bar{3} c$) along the $C_{1}$ through $C_{3}$ axes and is found to be distinguishable along $C_{4}$ or $C_{5}$. It is also interesting that the stable structures of Ga$_{2}$O$_{3}$ ($C 2 / m$: maroon) and In$_{2}$O$_{3}$ ($Ia \bar{3}$: pink) have relatively low energies while their positions in the map are far apart, especially along $C_{1}$. Since the PC $C_{1}$ dominantly contains information on the first NN O-Al pairs, as shown below, the coordination number and the length variation of the NN O-Al bonds caused by the local symmetry seem to affect the distance (dissimilarity) in the structure map.

Figure 4. Map of Al$_{2}$O$_{3}$ structures with the five principal components (PCs) with (a) Euclidean-distance $R_{\max} = 10$ Å, (b) cosine-distance $R_{\max} = 10$ Å, (c) Euclidean-distance $R_{\max} = 5$ Å, and (d) cosine-distance $R_{\max} = 5$ Å. Dot colors follow the color scheme listed in Table 1. Square dots in each figure denote the three lowest-energy structures (SG: $R \bar{3} c$ (black), $C 2 / m$ (maroon), and $Pna2_{1}$ (blue)), and small and large red dots denote the amorphous model structures (SG: $P 1$) from the Materials Project [Citation11] and from Momida et al. [Citation43], respectively.

To investigate the information contained in each PC and the corresponding coordinates, the eigenvector of each obtained PC is calculated using Equation (17) and plotted in Figure 5. By searching for the largest amplitude of each PC eigenvector in Figure 5, one can find which radial part of the $F$ -fingerprint, namely the RDF, gives the most important contribution to the PC. The first PC $C_{1}$ has the largest amplitude at the first NN position of the O-Al pair, irrespective of the choice of the radial cutoff and type of distance. The second PC $C_{2}$ shows the largest contribution from the second and third NNs of the O-Al pair and, almost equally, from the first NN of the O-O pair. From these findings, it seems that the characteristics of the energy landscape seen in Figure 4 may be quantitatively explained by the O-Al and O-O neighboring environmental features. However, the PCs beyond $C_{1}$ and $C_{2}$ are also significantly important for reproducing the energy more precisely, as shown below. In this context, the largest amplitudes for $C_{3}$ , $C_{4}$ , and $C_{5}$ are seen in the Al-Al RDF.

Figure 5. Eigenvectors of the five principal components (PCs) with (a) Euclidean-distance $R_{\max} = 10$ Å, (b) cosine-distance $R_{\max} = 10$ Å, (c) Euclidean-distance $R_{\max} = 5$ Å, and (d) cosine-distance $R_{\max} = 5$ Å, calculated by using Equation (17) for the Al $_{2}$ O $_{3}$ polymorphs. Red, lime, blue, maroon, and green lines denote the $C_{1}$ , $C_{2}$ , $C_{3}$ , $C_{4}$ , and $C_{5}$ PCs, respectively. The horizontal axis denotes the dimension of the original fingerprint space (see ). In (a) and (b), the dimensions between 1 and 100, 101 and 200, and 201 and 300 correspond to the O-O, O-Al, and Al-Al pair radius, respectively, and in (c) and (d), the dimensions between 1 and 50, 51 and 100, and 101 and 150 correspond to the O-O, O-Al, and Al-Al pair radius, respectively.
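The following is a minimal sketch of how the map coordinates and the PC eigenvectors could be obtained, assuming that the map is a classical multidimensional scaling of the Euclidean distance matrix (our reading of Equation (12)) and that Equation (17) is applied in its second form, $u_{+} = X_{c}^{T} X_{+} Λ_{+}^{- 1}$ with $X_{c}$ the centered fingerprint matrix. The function name `structure_map` and the random toy fingerprints are hypothetical and are not taken from the present calculations.

```python
import numpy as np

def structure_map(X, k):
    """Classical-MDS map from Euclidean distances between the rows of X.

    Returns the coordinates X_plus (n x k), the k largest eigenvalues,
    and the PC eigenvectors u_plus (d x k) in the fingerprint space.
    """
    n = X.shape[0]
    # Squared Euclidean distances between all pairs of feature vectors.
    sq = np.sum(X**2, axis=1)
    D2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    # Double centering turns squared distances into a Gram matrix.
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ D2 @ J
    # Eigen-decomposition; keep the k largest eigenvalues.
    lam, Q = np.linalg.eigh(B)
    order = np.argsort(lam)[::-1][:k]
    lam_plus, Q_plus = lam[order], Q[:, order]
    X_plus = Q_plus * np.sqrt(lam_plus)      # Eq. (12)
    Xc = X - X.mean(axis=0)                  # centered fingerprints
    u_plus = Xc.T @ X_plus / lam_plus        # Eq. (17), second form
    return X_plus, lam_plus, u_plus

# Toy data (assumption): 12 "structures" with 30 fingerprint bins each.
rng = np.random.default_rng(0)
X = rng.random((12, 30))
X_plus, lam_plus, u_plus = structure_map(X, k=5)

# Fingerprint bin with the largest eigenvector amplitude for each PC.
peak_bins = np.argmax(np.abs(u_plus), axis=0)
```

Under these assumptions the columns of `u_plus` are orthonormal, and `peak_bins` points to the radial bin that contributes most to each PC, which is the quantity read off from the eigenvector plots.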

A sparse modeling technique is useful for identifying the important descriptors among a given set of candidates. In the present study, we prepared 21 descriptors consisting of the zeroth-order (constant) through second-order terms of the obtained five PCs ( $C_{1}$ through $C_{5}$ ), detected and removed their collinearity and multi-collinearity (linear dependence), and then used an exhaustive search or a genetic algorithm in linear regression analysis. The best set of descriptors for a fixed number of descriptors is selected by the leave-one-out (LOO) scheme, a kind of cross-validation technique [Citation47]. Results of the regression analysis for the total energy using the PCs as the descriptors are shown in Figure 6. $R^{2}$ is the coefficient of determination widely used for expressing the fitting performance [Citation48], $Q^{2}$ is the coefficient of determination based on the prediction residual sum of squares (often called PRESS), which gives a measure of generalization capability [Citation49], and RMSE denotes the root mean square error of the regression. The explicit equations of these quantities are given and discussed in our previous study [Citation47]. At a glance, the modeling improves as the number of descriptors is increased in all the cases, and the over-fitting behavior that is often observed in $Q^{2}$ is not seen, possibly because of the use of orthogonal PCs as the descriptors. Although the products of the PCs may not necessarily be orthogonal to the others, no collinearity or multi-collinearity has been detected up to the second order of the PCs. As a more rigorous test for over-fitting, a double cross-validation technique is adopted. Here, one sample is completely hidden during the modeling, and the best model trained by the LOO method with the remaining data is then tested on the hold-out sample to check the generalization capability. Double cross-validation results for the total-energy model constructed from the Euclidean distance with $R_{\max} = 5$ Å are shown in Figure 7. As the number of descriptors is increased, the standard deviation of the prediction error for the hold-out data first declines, reaches a minimum ( $\sim 0.1$ eV) at eight descriptors, and then rises, indicating over-fitting with large numbers of descriptors. The explicit form of the model with eight descriptors is

Figure 6. Regression analysis of the total energy with descriptors up to the second order of the five PCs in Al $_{2}$ O $_{3}$ polymorph structures. (a) Euclidean-distance $R_{\max} = 10$ Å, (b) cosine-distance $R_{\max} = 10$ Å, (c) Euclidean-distance $R_{\max} = 5$ Å, and (d) cosine-distance $R_{\max} = 5$ Å. Blue, red, and green dots and lines denote the coefficient of determination ( $R^{2}$ ), its leave-one-out coefficient ( $Q^{2}$ ), and root mean square error (RMSE), respectively.

Figure 7. Double cross-validation test of the total-energy regression model constructed with the descriptors from the Euclidean distance and $R_{\max} = 5$ Å in Al $_{2}$ O $_{3}$ polymorph structures. Dot colors follow color scheme listed in . The red open circle denotes one of the amorphous structures (mp-1245063). The model with eight descriptors shows the smallest standard deviation ( $\sim$ 0.1 eV).

$$E^{(8)} = 0.049 C_{1} + 0.122 C_{3} + 0.025 C_{1} C_{2} + 0.123 C_{1} C_{5} + 0.110 C_{3}^{2} + 0.638 C_{3} C_{5} + 0.228 C_{4}^{2} + 0.358 C_{5}^{2} . \tag{21}$$

All of the PCs, $C_{1}$ through $C_{5}$ , are selected in the model. The detailed energy landscape is actually much more complicated than it appears from the map (Figure 4). The double cross-validation test shows that the line-shaped dip seen in the $C_{1}$ - $C_{2}$ map is rather shallow in the basin landscape and is not precisely reflected in the regression model, although the structures with energies higher than 0.1 eV, including the amorphous models, are well distinguished in the energy model.
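For readers who wish to evaluate the model at a given point of the map, Equation (21) can be written as a small function. This is only a transcription of the coefficients quoted in Equation (21); the function name `energy_model_8` is hypothetical.

```python
def energy_model_8(c1, c2, c3, c4, c5):
    """Evaluate the eight-descriptor model of Equation (21)."""
    return (0.049 * c1 + 0.122 * c3
            + 0.025 * c1 * c2 + 0.123 * c1 * c5
            + 0.110 * c3**2 + 0.638 * c3 * c5
            + 0.228 * c4**2 + 0.358 * c5**2)
```

Written this way, it is easy to see that $C_{2}$ enters only through the product $C_{1} C_{2}$ and $C_{4}$ only through $C_{4}^{2}$ , and that the model contains no constant term.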

It is found that, in the same way as for the total-energy regression, the energy-gap data given in can also be modeled with the same structural descriptors, namely $C_{1}$ through $C_{5}$ and their products up to the second order plus the constant term, at a similar level of precision.
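The descriptor construction and the LOO-based selection used in this subsection can be sketched as follows. This is a minimal illustration under stated assumptions: ordinary least squares, the common definition $Q^{2} = 1 - \mathrm{PRESS} / SS_{\mathrm{tot}}$ (the definitions actually used are those of [Citation47]), and synthetic toy data. The function names `second_order_descriptors`, `loo_scores`, and `best_subset` are hypothetical.

```python
import itertools
import numpy as np

def second_order_descriptors(C):
    """Constant, linear, and second-order products of the columns of C."""
    n, p = C.shape
    cols, names = [np.ones(n)], ["1"]
    for i in range(p):
        cols.append(C[:, i])
        names.append(f"C{i+1}")
    for i, j in itertools.combinations_with_replacement(range(p), 2):
        cols.append(C[:, i] * C[:, j])
        names.append(f"C{i+1}*C{j+1}")
    return np.column_stack(cols), names

def loo_scores(A, y):
    """R^2, Q^2 (from PRESS), and RMSE of an ordinary least-squares fit."""
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    # Leverages h_ii give the leave-one-out residuals without refitting.
    h = np.diag(A @ np.linalg.pinv(A))
    press = np.sum((resid / (1.0 - h))**2)
    ss_tot = np.sum((y - y.mean())**2)
    r2 = 1.0 - np.sum(resid**2) / ss_tot
    q2 = 1.0 - press / ss_tot
    rmse = np.sqrt(np.mean(resid**2))
    return r2, q2, rmse

def best_subset(D, y, m):
    """Exhaustive search for the m descriptor columns with the highest Q^2."""
    return max(itertools.combinations(range(D.shape[1]), m),
               key=lambda s: loo_scores(D[:, list(s)], y)[1])

# Toy data (assumption): 40 samples, five PCs, and a known two-term model.
rng = np.random.default_rng(1)
C = rng.normal(size=(40, 5))
D, names = second_order_descriptors(C)
y = 0.5 * C[:, 0] + 0.3 * C[:, 2] * C[:, 4] + 0.01 * rng.normal(size=40)
subset = [names.index("C1"), names.index("C3*C5")]
r2, q2, rmse = loo_scores(D[:, subset], y)
```

For five PCs the construction gives 1 + 5 + 15 = 21 descriptors. The exhaustive search is practical only for small numbers of descriptors, which is presumably why a genetic algorithm is used as an alternative for larger ones.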

3.2. Site preference in La(Fe,Si)₁₃

As a second application of the proposed workflow, we consider the classification of alloy-type configurations for analyzing the site preference of doped Si atoms in LaFe $_{13 - x}$ Si $_{x}$ . This compound is well known as a magnetocaloric material for magnetic refrigeration applications [Citation50]. The crystal structure of LaFe $_{13 - x}$ Si $_{x}$ is of the cubic NaZn $_{13}$ type (SG: $Fm \overset{ˉ}{3} c$ ) for $1 \leq x \leq 2.8$ [Citation51–53] and is drawn in Figure 8 with VESTA [Citation9]. There are two kinds of crystallographic sites for Fe: Fe1 ( $8 b$ Wyckoff position) and Fe2 ( $96 i$ ), and doped Si atoms share the $96 i$ site with Fe2. Configurations of Fe2 and Si are discussed as a ‘coloring problem’ [Citation53]. Let us consider below a primitive cell La $_{2}$ Fe $_{26 - y}$ Si $_{y}$ ( $y = 2 x$ ) of the system with fixed experimental lattice constants and atomic positions taken from the database entry COD-1528546 [Citation7].

Figure 8. Crystal structure of La $_{2}$ Fe $_{26 - y}$ Si $_{y}$ . Green, orange, and brown balls denote La, Fe1 and Fe2 atomic sites, respectively. Partial occupation of Si at the Fe2 sites is depicted with blue. The drawing is made with VESTA [Citation9].

Looking at the local coordination around Fe, the Fe1 ( $8 b$ ) site is coordinated by 12 Fe2 atoms at a distance of 2.46 Å, which is similar to the NN distance in pure BCC Fe. This means that the Fe1 site is closely packed with a high coordination number. Around the Fe2 ( $96 i$ ) site, on the other hand, there are two Fe2 at 2.45 Å, one Fe1 at 2.46 Å, and seven Fe2 in a range of 2.50–2.69 Å, showing that it is less densely packed than Fe1. Although the atomic radius of Si (1.17 Å) is slightly smaller than that of Fe (1.24 Å) [Citation54], the site preference of Si at the Fe sites is not easily understandable at first glance.

In the case of La $_{2}$ Fe $_{25}$ Si $_{1}$ , there are only two possible inequivalent configurations, with one Si atom either at the $8 b$ or at the $96 i$ site (Case 1). The situation becomes somewhat more complicated in La $_{2}$ Fe $_{24}$ Si $_{2}$ . There is one configuration with two Si atoms fully occupying the $8 b$ sites (Case 2–1). If the two Si atoms are placed one each at the $8 b$ and $96 i$ sites (Case 2–2), the total number of possible configurations is 48 (= $_{2}$ C $_{1}$ $\times$ $_{24}$ C $_{1}$ ). By analyzing the structural similarity as described above, this is reduced to two crystallographically inequivalent structures. If two Si atoms are accommodated at the $96 i$ sites (Case 2–3), there are 276 (= $_{24}$ C $_{2}$ ) possible configurations, from which one can easily obtain 12 crystallographically inequivalent structures by the distance diagnosis. Similarly, in La $_{2}$ Fe $_{23}$ Si $_{3}$ there is one crystallographically inequivalent structure out of 24 possible configurations for two Si at $8 b$ and one Si at $96 i$ (Case 3–1), 17 crystallographically inequivalent structures out of 552 (= $_{2}$ C $_{1}$ $\times$ $_{24}$ C $_{2}$ ) configurations for one Si at $8 b$ and two Si at $96 i$ (Case 3–2), and 50 inequivalent structures out of 2024 (= $_{24}$ C $_{3}$ ) configurations for three Si at $96 i$ (Case 3–3). In all the cases of this sub-section, the crystallographically inequivalent structures correspond to the distinct pair-wise fingerprints.
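The numbers of possible configurations quoted above follow from simple binomial counting over the two $8 b$ and 24 $96 i$ sites of the primitive cell, as the short check below illustrates. The names `N_8B`, `N_96I`, and `n_configurations` are hypothetical. The reduction to crystallographically inequivalent structures requires the fingerprint distances and is not reproduced here.

```python
from math import comb

N_8B, N_96I = 2, 24   # Fe1 (8b) and Fe2 (96i) sites in the primitive cell

def n_configurations(n_si_8b, n_si_96i):
    """Number of ways to place Si on the 8b and 96i sites."""
    return comb(N_8B, n_si_8b) * comb(N_96I, n_si_96i)

counts = {
    "Case 2-1": n_configurations(2, 0),
    "Case 2-2": n_configurations(1, 1),
    "Case 2-3": n_configurations(0, 2),
    "Case 3-1": n_configurations(2, 1),
    "Case 3-2": n_configurations(1, 2),
    "Case 3-3": n_configurations(0, 3),
}
```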

Heats of formation per atom are evaluated from total energies by first-principles DFT calculations with the FLAPW method for the configuration ( $i$ ) in La $_{2}$ Fe $_{26 - y}$ Si $_{y}$ ( $y = 0 - 3$ ) and most-stable elemental crystal systems, La ( $Fm \overset{ˉ}{3} m$ ), ferromagnetic Fe ( $Im \overset{ˉ}{3} m$ ), and Si ( $Fd \overset{ˉ}{3} m$ ) as

$$H (y; i) = \frac{1}{28} \left\{ E [\mathrm{La}_{2} \mathrm{Fe}_{26 - y} \mathrm{Si}_{y}; i] - 2 E [\mathrm{La}] - (26 - y) E [\mathrm{Fe}] - y E [\mathrm{Si}] \right\} \tag{22}$$

and plotted in Figure 9 together with the spin magnetic moments per Fe atom. To see the configurational energy for a given $y$ , the heats of formation relative to the lowest value at that $y$ , which correspond to the configurational energy defined as

Figure 9. Heats of formation $H$ in eV/atom (Equation (22)) and $ΔH$ in meV/atom (Equation (23)), and spin magnetic moments in $μ_{B}$ per Fe atom calculated for La $_{2}$ Fe $_{26 - y}$ Si $_{y}$ . Red dots, green dots, and blue circles denote values for the configurations with Si occupying only $8 b$ , both $8 b$ and $96 i$ , and only $96 i$ sites, respectively. Black dots indicate the corresponding values for La $_{2}$ Fe $_{26}$ ( $y = 0$ ).

$$ΔH (y; i) = H (y; i) - \min_{j} H (y; j) \tag{23}$$

are also drawn. The calculated heat of formation for non-doped La $_{2}$ Fe $_{26}$ is $H (y = 0) = + 60$ meV/atom, indicating instability with respect to the constituent elemental systems, consistent with the absence of an intermetallic-compound phase in the experimental phase diagram [Citation55]. As the Si content $y$ is increased in La $_{2}$ Fe $_{26 - y}$ Si $_{y}$ , the systems are stabilized nearly linearly with $y$ , with negligibly small configurational variation on the energy scale of eV/atom within the same $y$ . Minor in magnitude but interesting behaviors of the configurational energy $ΔH$ on the 10 meV/atom energy scale are discussed below. The calculated spin magnetic moment per Fe atom decreases monotonically as $y$ is increased, leading to a reduction in the total magnetization per cell that is larger than that caused simply by the decrease in the number of Fe atoms. Interestingly, the spin magnetic moment shows a certain dependence on the configuration within the same Si concentration, possibly due to differences in the local hybridization between Fe and Si atoms.
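Equations (22) and (23) amount to the following bookkeeping, sketched here with hypothetical function names (`heat_of_formation`, `configurational_energy`). The energies in the example are arbitrary placeholder numbers chosen for illustration, not DFT results.

```python
def heat_of_formation(e_cell, y, e_la, e_fe, e_si):
    """Equation (22): heat of formation per atom of La2Fe(26-y)Si(y).

    e_cell is the total energy of the 28-atom cell for one configuration;
    e_la, e_fe, and e_si are the per-atom energies of the elemental crystals.
    """
    return (e_cell - 2 * e_la - (26 - y) * e_fe - y * e_si) / 28

def configurational_energy(h_values):
    """Equation (23): H of each configuration relative to the lowest H."""
    h_min = min(h_values)
    return [h - h_min for h in h_values]

# Placeholder energies (assumption, arbitrary units) for three configurations.
e_la, e_fe, e_si = 1.0, 2.0, 3.0
cells = [57.8, 56.4, 57.1]
h = [heat_of_formation(e, 1, e_la, e_fe, e_si) for e in cells]
dh = configurational_energy(h)
```

Both functions must use a consistent energy unit. The per-atom normalization by 28 is fixed by the La $_{2}$ Fe $_{26 - y}$ Si $_{y}$ cell.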

The configurational energy at $y = 1$ in Figure 9 tells that Si prefers to be accommodated at the Fe2 ( $96 i$ ) site by 11 meV/atom compared with the Fe1 ( $8 b$ ) site, consistent with the experimental data (COD-1528546) [Citation7]. Since the multiplicity of the Fe2 site (24) is significantly larger than that of the Fe1 site (2), the entropic effect must additionally contribute to the site preference of Si at the Fe2 site. A rough estimation of the difference in the configurational entropy gives $ΔS = k_{B} (\ln 24 - \ln 2) = 0.214$ meV/K per cell, implying a contribution to the stability of the same order as the energy difference at temperatures of several hundred K. By inspecting the configurational energy $ΔH$ more carefully in Figure 9, clustering features can be seen for Si occupying the $96 i$ sites (most remarkably in the blue circles at $y = 2$ and $y = 3$ ). Since the clustering features are possibly related to the origin of the configurational stability, a structure map for all the inequivalent configurations may be useful for understanding it, with the benefit of the prescriptions explained in Sec. 2.
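The rough entropy estimate above can be checked numerically. The sketch below uses the Boltzmann constant in meV/K; the names `K_B_MEV`, `delta_s`, and `entropy_term` are hypothetical.

```python
import math

K_B_MEV = 8.617333262e-2   # Boltzmann constant in meV/K

# Difference in configurational entropy between 24 and 2 equivalent sites.
delta_s = K_B_MEV * (math.log(24) - math.log(2))   # meV/K per cell

def entropy_term(temperature):
    """T * Delta S in meV per cell at the given temperature in K."""
    return temperature * delta_s
```

Note that `delta_s` is a per-cell quantity, whereas the energy difference in the text is quoted per atom, so the two need to be put on the same basis before they are compared.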

Figure 10 shows the $F$ -fingerprints and eigenvectors of three PCs on the Si-Si and Fe-Si dimensions of La $_{2}$ Fe $_{24}$ Si $_{2}$ (Case 2–3). The $F$ -fingerprints on the Fe-Si dimension show very minor differences compared with those on the Si-Si one, and the other $F$ -fingerprints are even less distinguishable (not included in the figure). Among the 12 inequivalent configurations, four structures have two Si atoms located at NN $96 i$ sites and are relatively high in energy (+7.9 to +20.7 meV/atom above the lowest energy) in the configurational energies $ΔH$ at $y = 2$ in Figure 9. The rest of the configurations have no NN Si pair, and their energies are below +2.6 meV/atom. The PC $C_{1}$ of the structure map of the Euclidean distance in Figure 11 clearly distinguishes the four high-energy configurations from the remaining eight. The proportion for $k = 3$ is $P_{3} = 0.77$ . The eigenvector of $C_{1}$ shown in Figure 10(b) indeed has the largest amplitude around the Si-Si NN distance. The eigenvectors of the PCs $C_{2}$ and $C_{3}$ have their largest amplitudes around approximately 4 Å and 6 Å, respectively, and are related to the different stability within the structures clustered along $C_{1}$ . The three structures (blue, cyan, and teal) with the NN Si-Si pair are quite similar in the map up to the PC $C_{3}$ , and the PCs beyond must determine the minor energy difference. It is concluded that the interaction between the Si atoms is repulsive in La $_{2}$ Fe $_{24}$ Si $_{2}$ and that they may tend to be distributed apart.

Figure 10. (a) $F$ -fingerprints and (b) Euclidean eigenvectors of three principal components on the Si-Si (0–80) and Fe-Si (80–160) dimensions of La $_{2}$ Fe $_{24}$ Si $_{2}$ (Case 2–3). Line colors used for the $F$ -fingerprints in (a) correspond to those of the dots in Figure 11. Red, lime, and blue lines in (b) denote the eigenvectors of $C_{1}$ , $C_{2}$ , and $C_{3}$ , respectively.

Figure 11. Structure map of La $_{2}$ Fe $_{24}$ Si $_{2}$ (Case 2–3) obtained by dimension reduction of the Euclidean distance. Dot colors denote the different configurations corresponding to the $F$ -fingerprints shown in Figure 10(a).

Some selected $F$ -fingerprints showing distinctive behavior and the eigenvectors of three PCs on the Si-Si and Fe-Si dimensions of La $_{2}$ Fe $_{23}$ Si $_{3}$ (Case 3–3) are drawn in Figure 12. The corresponding structure map of the Euclidean distance with $k = 3$ is shown in Figure 13, resulting in the proportion $P_{3} = 0.80$ . The eigenvectors of the three PCs contain the structure information around essentially the same regions of the feature vector dimension as in the previous case of La $_{2}$ Fe $_{24}$ Si $_{2}$ (Case 2–3). In this case, there are four kinds of $F$ -fingerprint peak height at the NN Si-Si distance. The structures (magenta and lime) possessing the highest RDF at the NN distance have triangular Si trimers, forming a cluster with two more structures (small black) in the structure map. The structures (cyan, yellow, and green) with the second highest RDF have V-shaped Si trimers, those (maroon, teal, red, blue, and black) with the third highest RDF have Si dimers, and those (purple and navy) with the lowest have no NN Si pair. These locally distinct Si structures are clearly classified in the structure map along the PC $C_{1}$ , which is also in accordance with the general repulsive behavior between the Si atoms in the configurational stability of La(Fe,Si) $_{13}$ .

Figure 12. (a) Selected $F$ -fingerprints and (b) eigenvectors of three principal components on the Si-Si (0–80) and Fe-Si (80–160) dimensions of La $_{2}$ Fe $_{23}$ Si $_{3}$ (Case 3–3). Red, lime, and blue lines in (b) denote the eigenvectors of $C_{1}$ , $C_{2}$ , and $C_{3}$ , respectively.

Figure 13. Structure map of La $_{2}$ Fe $_{23}$ Si $_{3}$ (Case 3–3) obtained by dimension reduction of the Euclidean distance. Dot colors denote the different configurations corresponding to the $F$ -fingerprints shown in Figure 12(a), where the $F$ -fingerprints of the small black dots here are not included.

4. Summary

A three-step procedure to draw a crystal structure map for a given set of materials systems with the same composition is proposed and demonstrated to classify the structures in a low-dimensional space and to model the materials properties. Combined with the scheme for counting all the possible combinations for alloy systems within supercell models, the configurational energy preference can be discussed together with the multiplicity of each configuration as the entropic effect. Further applications of the present methods with first-principles DFT calculations and machine-learning analysis are highly expected. In addition, the present procedure may be applicable to any vector-type information, such as spectral and image data, for a given set of related materials systems. A map obtained in a low-dimensional space classifies the existing materials systems in terms of the reduced features and may assist in exploring desired candidate systems by using the projection and inverse-projection techniques shown in Equations (19) and (20).
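The projection and inverse projection of Equations (19) and (20) can be sketched as below. The sketch assumes that the PC eigenvectors are orthonormal and takes them from a singular value decomposition of the centered toy data, which for the Euclidean distance should be equivalent to Equation (17). The function names `project` and `inverse_project` and the random toy data are hypothetical.

```python
import numpy as np

def project(Y, X_mean, u_plus):
    """Equation (19): map fingerprints Y (m x d) onto the PCs."""
    return (Y - X_mean) @ u_plus

def inverse_project(Z_plus, X_mean, u_plus):
    """Equation (20): map PC coordinates (m x k) back to fingerprint space."""
    return Z_plus @ u_plus.T + X_mean

# Toy data (assumption): 12 "structures" with 30 fingerprint bins each.
rng = np.random.default_rng(2)
X = rng.random((12, 30))
X_mean = X.mean(axis=0)

# Orthonormal PC eigenvectors of the centered data (assumed stand-in for u_+).
_, _, Vt = np.linalg.svd(X - X_mean, full_matrices=False)
u_plus = Vt[:3].T

# A chosen point in the map and its image in the fingerprint space.
Z_plus = np.array([[0.1, -0.2, 0.05]])
Z = inverse_project(Z_plus, X_mean, u_plus)
Z_back = project(Z, X_mean, u_plus)
```

With orthonormal eigenvectors, projecting `Z` again returns the chosen map coordinates. The reverse round trip from a fingerprint is exact only for the part of the fingerprint that lies in the retained PCs.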

Statement of novelty

Acknowledgements

The author would like to thank Hiroyoshi Momida for providing the structure data of the amorphous Al $_{2}$ O $_{3}$ model, Yukihiro Makino, Tetsuya Fukushima, and Kazunori Sato for invaluable discussion in the study of La(Fe,Si) $_{13}$ , and Hitoshi Fujii, Tomoki Yamashita, Hiori Kino, and Takashi Miyake for suggestive comments on data-science techniques.

Disclosure statement

The author declares no known competing financial interests.

Data availability statement

The raw data required to reproduce these findings are available by making an e-mail request to the corresponding author.

Additional information

Funding

This work is supported partly by JST-CREST [Grant No. JPMJCR22O2].

References