Influences of non-landslide sample selection strategies on landslide susceptibility mapping by machine learning


Abstract

Landslide susceptibility mapping is crucial in mitigating the risk of regional landslide hazards. The main objective of this study is to design three non-landslide data generation methods and analyze their impact on landslide susceptibility mapping. The three methods generate non-landslide points (1) randomly outside landslide buffers of different distances, (2) within areas of low landslide influence in a landslide condition factor layer, and (3) within the low to medium susceptibility zones delineated by the information value model. Fourteen landslide condition factors were used for the landslide susceptibility mapping, and correlations between landslides and condition factors were analyzed using the Gini index. A logistic regression model and an artificial neural network model were trained on 70% of the dataset. Statistical metrics and AUC values computed on the remaining 30% were then used to compare the predictive performance of the different non-landslide data generation strategies. Based on the validation results, models A6 and B6 performed the best, with AUC values of 0.997 and 0.995, respectively. The findings show that non-landslide samples generated from the low susceptibility interval of the information value model give the best performance.


1. Introduction

Landslide disasters have claimed many lives and damaged much property around the world. In 2020 alone, landslides in China caused 197 casualties and direct economic damages of more than RMB 5 billion. The coastal areas of southeast China are highly vulnerable to landslide disasters due to typhoons and heavy rainfall, which seriously threaten people’s lives and property. Researchers have suggested various solutions to mitigate the adverse consequences of landslides. Landslide susceptibility maps are effective tools for identifying and predicting potential landslides. Research on landslide susceptibility zoning is significant in mitigating the damage to people and property caused by landslides (Zêzere et al. 2017; Huang and Zhao 2018; Azarafza et al. 2021; Yavuz Ozalp et al. 2023).

During the past few decades, several studies have been conducted on landslide susceptibility assessment using various methods, including deterministic coefficients (Chong 2010; Pradhan et al. 2017), linear regression (Onagh et al. 2012), the weight of evidence (Lee and Choi 2004; Ilia and Tsangaratos 2016), analytic hierarchy process (Kayastha et al. 2013), and frequency ratios (Mohammady et al. 2012; Akinci and Yavuz Ozalp 2021). Susceptibility mapping has also used machine learning models such as logistic regression (Umar et al. 2014), support vector machines (Chen et al. 2017; Zhou et al. 2018), random forests (Youssef et al. 2016; Chen et al. 2018), and neural networks (Pradhan and Lee 2010; Xu et al. 2015; Akinci 2022). In addition, hybrid models that combine two methods have become a hot topic in research on susceptibility evaluation models, including bivariate statistical methods (Constantin et al. 2011; Schicker and Moon 2012), neural network fuzzy logic, EBF fuzzy logic, and fuzzy evidence weighting (Hong et al. 2017; Hong et al. 2018).

The above literature review shows that although published research has produced a large body of results on landslide susceptibility mapping models, the influence of sample selection on model prediction accuracy has not been given much attention. It is necessary to prepare modeling and validation datasets before evaluating landslide susceptibility. These dataset samples include landslide points, non-landslide points, and condition factors. The sample selection significantly impacts the accuracy and reasonableness of the susceptibility zoning (Dou et al. 2020; Gaidzik and Ramírez-Herrera 2021). To address this problem, many scholars (Lima et al. 2017; Li et al. 2021; Wang et al. 2022) have conducted experimental studies on non-landslide point selection strategies. The most commonly used methods for selecting non-landslide samples are (1) the random sampling method, in which non-landslide points are selected randomly anywhere in the study area other than at known landslides (Pham et al. 2016), (2) the buffer sampling method, in which non-landslide points are selected outside buffers drawn outward from known landslide samples (Peng et al. 2014; Su et al. 2017), and (3) the river or slope sampling method, in which non-landslide samples are selected in rivers or in areas with slopes of less than 2° (Kavzoglu et al. 2014). Because these methods are relatively random and subjective, non-landslide samples may be placed in landslide-prone locations, which lowers the accuracy of landslide prediction.

In view of this, this study improves the above three methods, proposes non-landslide sample selection methods based on buffers, condition factors, and the information value model, and establishes susceptibility models of the study area using logistic regression and an artificial neural network, respectively. The efficacy of the non-landslide sample selection methods is then evaluated by comparing the accuracy and reasonableness of the resulting susceptibility models using various evaluation methods. Finally, the most reasonable non-landslide sample selection method is presented.

2. Study area and data source

2.1. Study area

The study area is located in Anxi County, Quanzhou City, Fujian Province, China (Figure 1). Anxi County, between 25°50′36″-26°26′30″ north latitude and 117°48′30″-118°40′01″ east longitude, has a total area of 3057.28 km². It is 63 km long from north to south and 74 km wide from east to west, with an altitude ranging from 8 m to 1589 m. The study area lies in a subtropical monsoon climate zone, and its topographic conditions and the strength of marine climatic influences vary across the county. The average annual precipitation is above 1000 mm, and the rainy season extends from April to June. Typhoons and tropical storms make landfall about 1 to 3 times per year owing to the region’s proximity to the southeast coast. Typhoon-induced heavy to very heavy rainfall is the main triggering factor of landslide disasters.

Figure 1. Location of the study area and landslide inventory map: (a) Administrative divisions of China, (b) location of the study, (c) landslide inventory.

2.2. Data source

The geological and historical landslide data were extracted from the list of landslides gathered from the Anxi County Bureau of Land and Resources. A total of 821 historical landslides were identified in the study area (Figure 1c), among which the largest and smallest landslide volumes were 136,000 m³ and 12 m³, respectively. Most of the monitored landslide locations were close to the front and back of residents’ houses, because monitoring focused on the safety of residents and public facilities.

The evaluation index system for landslide susceptibility in the study area should consider various factors. In addition to topographic factors (elevation, slope angle, and curvature) and geological factors (lithology and faults), rainfall characteristics and human engineering activities, such as distance to roads, are taken into account. In total, 14 initial condition factors for landslide susceptibility were chosen for this study (Figure 2). Specific information on each factor is presented in Table 1.

Figure 2. Landslide conditioning factors: (a) slope angle, (b) altitude, (c) aspect, (d) NDVI, (e) plan curvature, (f) profile curvature, (g) distance to faults, (h) lithology, (i) distance to rivers, (j) TWI, (k) rainfall, (l) distance to roads, (m) distance to residences, and (n) land use.

Table 1. Data sources of the landslide condition factor.


Slope gradient is one of the most important factors influencing the occurrence and development of landslides and their morphological characteristics, because it strongly affects surface water runoff, groundwater recharge and discharge, material transport and accumulation, and the stress distribution within the slope body. The slope angle was derived from the DEM and classified into six categories with a spacing of 10° for this investigation (Figure 2a).

The elevation is also vital in inducing landslides due to its control on slope progradation. Landslides are less likely to occur at very low elevations with thick sedimentary soils and at higher elevations with very hard rocks. In contrast, the probability of landslide occurrence is higher in the middle altitude areas, where the slope body is mostly a binary structure of upper soil and lower rocks, with frequent human engineering activities (Figure 2b).

Different slope orientations produce different vegetation covers and soil moisture contents due to changes in the time and intensity of solar illumination. The slope orientation (aspect) was derived from the DEM and divided into nine directional classes, namely flat, north, north-east, east, south-east, south, south-west, west, and north-west (Figure 2c).

The normalized difference vegetation index (NDVI), which represents the vegetation growth status, reflects the sparseness of surface vegetation and is often used to distinguish vegetation cover areas, bare ground, and water bodies. The NDVI of the study area was derived from the near-infrared and red bands of Landsat 8 satellite imagery (Figure 2d).
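As a minimal sketch of how NDVI is conventionally computed, the snippet below applies the standard definition NDVI = (NIR − Red) / (NIR + Red) to single-pixel reflectance values. The function name `ndvi`, the zero-denominator convention, and the sample reflectances are illustrative assumptions; the study does not describe its processing chain beyond the use of Landsat 8 imagery.

```python
def ndvi(nir, red):
    """Return NDVI = (NIR - Red) / (NIR + Red) for one pixel.

    nir and red are surface reflectances. Returning 0.0 when both are
    zero is an assumed convention for no-data pixels.
    """
    total = nir + red
    if total == 0:
        return 0.0
    return (nir - red) / total


# Hypothetical reflectance values, for illustration only.
print(ndvi(0.45, 0.05))  # dense vegetation gives a strongly positive value
print(ndvi(0.10, 0.12))  # water or bare surfaces give values near or below zero
```

In a raster workflow the same expression would be applied cell by cell to the two band grids.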

Curvature is the second-order derivative of the surface and is divided into plan curvature and profile curvature, which represent the curvature perpendicular to and parallel to the direction of maximum slope, respectively. Profile curvature affects flow acceleration and deceleration and therefore erosion and deposition, whereas plan curvature affects flow convergence and dispersion. The DEM of the study area was used to derive the plan and profile curvatures, which were classified into eight categories based on the natural breaks method (Figure 2e, f).

Faults cause the development of rock joints and fractures, which degrades the geotechnical properties of the rock mass and favors landslide development. In general, the likelihood of landslides increases with proximity to a fault. The 1:50,000 geological map was used to calculate fault distances, which were classified into eight categories with 500 m intervals (Figure 2g).

The stratigraphic lithology of an area plays an important role in the formation and development of landslides. It is an intrinsic factor in landslide occurrence, as variations in lithology determine the type and scale of landslides and thus the degree of landslide development. The lithology map of Anxi County was created by vectorization in ArcGIS based on the 1:50,000 geological map of Anxi County. The six lithological units in the study area are alluvial double-layered soil bodies (Q4), volcanic rock formations (J), limestone formations (P), sandstone siltstone formations (P+), schistose mica-quartz schist formations (P + T), and granite formations (γ) (Figure 2h). The highest number of landslides occurred within the volcanic rock formation, followed by the granite formation.

Water is one of the primary exogenous factors inducing landslides. Groundwater infiltration and the softening of mud by water reduce the strength of the geotechnical body, particularly along weak surfaces, which enhances the risk of landslides on slopes. In this paper, the influence of surface water systems on the distribution of regional landslides is illustrated by the frequency of landslides occurring within a specific spatial distance from a surface water system. The distribution map of water systems was derived from the topographic map of the study area. Considering the actual situation, the distance from the water system was divided into eight categories with a spacing of 50 m (Figure 2i).

The Terrain Wetness Index (TWI), which represents the spatial distribution of soil moisture, was derived from the DEM and classified into eight categories with a spacing of 2 (Figure 2j).

Rainfall can weaken rock and soil strength, and this weakening is more evident in soil slopes after waterlogging. Similarly, in rock slopes the weakening causes a significant reduction in shear strength, which reduces slope stability when the rock mass or its soft inclusions are more hydrophilic. This study used five years of precipitation data for Anxi County, from 2016 to 2020, to generate the average annual rainfall using ArcGIS. The annual rainfall in Anxi County was classified into six categories with a spacing of 100 mm (Figure 2k).

Another significant factor impacting the formation of landslides is human activity. Extensive slope cutting during road construction alters the slope structure and disturbs the initial stress state of the geotechnical body, so these cut slopes are highly susceptible to landslide hazards under specific conditions. The geological hazard survey report of Anxi County was used to calculate the distance from roads, which was classified into eight categories with an interval of 50 m (Figure 2l). As the study area is hilly and mountainous, most houses are built on slopes, which increases slope loads and modifies the slope stress distribution. The influence of houses on the distribution of regional landslides is illustrated by the frequency of landslide occurrence within a specific spatial distance from houses. The map of distance from houses was derived from the house map in the geological hazard survey report of Anxi County. Eight categories of distance from houses were used in this investigation, each with a spacing of 50 m (Figure 2m). Anxi County has experienced many landslides due to land use, industrial and agricultural development, and rapid infrastructure construction. Land use data for the study area were downloaded from the National Basic Geographic Information Centre and divided into six categories: arable land, forest land, grassland, water, construction land, and others (Figure 2n). Finally, the landslide evaluation factor maps were resampled into raster format with a spatial resolution of 30 × 30 m.

3. Methodologies

As shown in Figure 3, the study is divided into five steps: (1) selection of suitable condition factors by building a landslide susceptibility evaluation index system based on the Gini index, (2) selection of equal numbers of landslide and non-landslide samples using three different methods and their division into training and test sets with a ratio of 70:30, (3) building the corresponding landslide susceptibility models using the different training sets, (4) validation of the models built with different non-landslide selection methods using multiple statistical validation parameters, and (5) generation of landslide susceptibility zoning maps and their comparative analysis.

Figure 3. Flowchart of the study.

3.1. Preparation of non-landslide datasets

Supervised machine learning methods require labeled data to produce predictions, and the accuracy of these labels determines the prediction results. Several studies have shown that the selection of non-landslide data significantly impacts the final results of a landslide susceptibility assessment. The present study compares machine learning susceptibility models built on three selection strategies: (1) generating landslide buffers of different distances, removing the buffers from the study area, and randomly generating non-landslide points, equal in number to the landslide points, outside the buffers; (2) combining the selected condition factors with the landslide samples to identify areas with low landslide impact in a condition factor layer, and randomly generating non-landslide data in these areas; and (3) randomly generating non-landslide points in the medium-low susceptibility intervals of the Anxi County information value model map.

The 821 landslide points in the study area were randomly divided into two subsets with a ratio of 70:30. The same number of non-landslide points was generated randomly in the non-landslide area using each of the three methods mentioned above and divided with the same 70:30 ratio. Finally, the landslide and non-landslide points were assigned the values ‘1’ and ‘0’, respectively.
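The split described above can be sketched as follows. This is only an illustration under stated assumptions: the helper `split_70_30`, the seeds, and the placeholder sample identifiers are hypothetical, and the study does not say which tool performed the split. Splitting each class separately keeps the 1:1 class balance in both subsets.

```python
import random


def split_70_30(samples, seed=0):
    """Shuffle a list of samples and split it 70:30 (training, test)."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    cut = round(0.7 * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


# Placeholder samples: (identifier, label), with 1 = landslide, 0 = non-landslide.
landslides = [(i, 1) for i in range(821)]
non_landslides = [(i, 0) for i in range(821)]

train_pos, test_pos = split_70_30(landslides, seed=1)
train_neg, test_neg = split_70_30(non_landslides, seed=2)

train = train_pos + train_neg
test = test_pos + test_neg
print(len(train), len(test))
```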

3.2. Information value model

The information value model, a statistical technique based on probability theory, has been widely used in landslide susceptibility mapping. These models use the magnitudes of the entropy values during the occurrence of geological hazards to represent their probabilities. The information value defines the weight of each classification unit. It is calculated as the logarithm of the ratio of the landslide density within a class to the landslide density over the whole study area. The larger the information value, the more likely landslide hazards are to occur within the evaluation unit.

$I_{i} = \log_{2} \frac{N_{i} / N}{S_{i} / S}$ (1)

where S is the total number of evaluation cells in the study area; N is the number of evaluation cells with landslide occurrence; S_i is the number of evaluation cells with evaluation index ‘i’; and N_i is the number of cells with landslide occurrence and evaluation index ‘i’.

The total information value is calculated with the following equation.

$I = \sum_{i = 1}^{n} I_{i} = \sum_{i = 1}^{n} \log_{2} \frac{N_{i} / N}{S_{i} / S}$ (2)
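To make Equations (1) and (2) concrete, the following sketch computes the information value of a single factor class and sums class values for one cell. The function names and all counts are hypothetical and chosen only for easy arithmetic; returning negative infinity for a class without landslides is an assumed convention, not something the study specifies.

```python
import math


def information_value(n_i, n_total, s_i, s_total):
    """Equation (1): I_i = log2((N_i / N) / (S_i / S)) for one factor class."""
    if n_i == 0:
        # Assumed convention: a class with no landslides has no finite value.
        return float("-inf")
    return math.log2((n_i / n_total) / (s_i / s_total))


def total_information_value(class_values):
    """Equation (2): sum of the I_i of the classes a cell falls into."""
    return sum(class_values)


# Hypothetical counts: a class holding 10% of the cells but 40% of the landslides.
dense = information_value(n_i=40, n_total=100, s_i=1000, s_total=10000)
# Hypothetical counts: a class holding 40% of the cells but 10% of the landslides.
sparse = information_value(n_i=10, n_total=100, s_i=4000, s_total=10000)

print(dense, sparse, total_information_value([dense, sparse]))
```

A positive value marks a class in which landslides are over-represented relative to its area, and a negative value marks one in which they are under-represented.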

3.3. Logistic regression model

Logistic regression is a generalized linear model that is often used in statistics for analyzing dichotomous variables. It fits the best function between a dependent variable and multiple independent variables and maps the result into the interval [0, 1] through the logistic function; it has been widely used in the spatial prediction of landslide hazards. The magnitude of landslide susceptibility is evaluated based on the probability of landslide occurrence, which is defined as follows.

$P = \frac{e^{Y}}{1 + e^{Y}}$ (3)

where P and Y represent the probability of landslide occurrence and the linear regression equation, respectively. The latter is defined by the following equation.

$Y = \ln \left(\frac{P}{1 - P}\right) = β_{1} x_{1} + β_{2} x_{2} + \dots + β_{n} x_{n} + α$ (4)

where α, n, β_i, and x_i represent the intercept, the number of variables, the regression coefficient of variable x_i, and the variable factors controlling landslide occurrence, respectively.
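As a small worked illustration of Equations (3) and (4), the sketch below evaluates the linear predictor Y and converts it to a probability. The coefficient and factor values are made up for illustration and are not the fitted coefficients of this study; the function name is likewise hypothetical.

```python
import math


def landslide_probability(x, betas, alpha):
    """Equations (3)-(4): P = e^Y / (1 + e^Y), Y = alpha + sum(beta_i * x_i)."""
    y = alpha + sum(b * xi for b, xi in zip(betas, x))
    # 1 / (1 + e^-Y) is algebraically identical to e^Y / (1 + e^Y).
    return 1.0 / (1.0 + math.exp(-y))


# Hypothetical coefficients for three condition factors.
betas = [1.5, -0.8, 0.4]
alpha = -0.2
p = landslide_probability([0.9, 0.3, 0.5], betas, alpha)
print(round(p, 3))

# Inverting Equation (3) with the logit recovers the linear predictor Y.
print(round(math.log(p / (1 - p)), 3))
```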

3.4. Artificial neural network

Artificial neural networks, which mathematically simulate the function of biological neurons, are frequently used in landslide susceptibility evaluation owing to their fast convergence, strong non-linearity, and high fault tolerance (Figure 4).

Figure 4. Artificial neural network structure.

For a given training dataset (x, y), the output of a neuron is

$y = f \left(\sum_{i = 1}^{n} ω_{i} x_{i} + α\right)$ (5)

where f, ω_i, x_i, and α indicate the activation function, connection weight, input value, and bias, respectively.
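The neuron output in Equation (5) can be sketched as below, together with a forward pass through one hidden layer. The sigmoid activation, the layer sizes, and every weight are assumptions made for illustration; the study does not state which activation function or weights were used.

```python
import math


def sigmoid(z):
    """Assumed activation function f."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron_output(inputs, weights, bias, activation=sigmoid):
    """Equation (5): y = f(sum_i(w_i * x_i) + bias)."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)


def forward(inputs, hidden_layer, output_neuron):
    """Feed inputs through one hidden layer, then a single output neuron."""
    hidden = [neuron_output(inputs, w, b) for w, b in hidden_layer]
    w_out, b_out = output_neuron
    return neuron_output(hidden, w_out, b_out)


# Hypothetical weights: three normalized factors, two hidden neurons.
hidden_layer = [([0.8, -0.4, 0.3], 0.1), ([-0.5, 0.9, 0.2], -0.2)]
output_neuron = ([1.2, -0.7], 0.05)
score = forward([0.6, 0.1, 0.9], hidden_layer, output_neuron)
print(round(score, 3))
```

With a sigmoid activation, a single neuron has the same form as the logistic regression in Equation (3); the hidden layer is what adds non-linearity.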

3.5. Selection of landslide condition factor by Gini index

The Gini index, also called mean decrease impurity, is widely used by researchers to measure the importance of different condition factors. It can be expressed as:

$Gini (n, v_{i}) = \sum_{i = 1}^{m} \frac{a_{i}}{f_{s}} I (d_{u i})$ (6)

where m denotes the number of landslides at node n, f_s denotes the number of input feature vectors used for training, and the Gini impurity I(d_ui) represents the distribution of class labels in the node. For a feature variable v_i ∈ V with m values at node n, $v_{i} = [u_{1}, u_{2}, \dots, u_{m}]$, the value of I(d_ui) can be computed as:

$I (d_{u i}) = 1 - \sum_{i = 0}^{c} \left(\frac{p_{c_{i}}}{a_{i}}\right)^{2}$ (7)

where $p_{c_{i}}$ denotes the number of samples with value u_i that belong to class c_i, and a_i denotes the number of samples with value u_i at node n.
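A minimal sketch of Equations (6) and (7) follows. It treats each value u_i of a factor as a group of per-class sample counts, which is one reading of the notation; the function names and the counts are hypothetical.

```python
def gini_impurity(class_counts):
    """Equation (7): 1 - sum over classes of (p_c / a)^2 for one value u_i."""
    a = sum(class_counts)
    if a == 0:
        return 0.0
    return 1.0 - sum((p / a) ** 2 for p in class_counts)


def weighted_gini(groups, f_s):
    """Equation (6): sum over values u_i of (a_i / f_s) * I(d_ui)."""
    return sum(sum(g) / f_s * gini_impurity(g) for g in groups)


# Hypothetical node with 20 samples split by a two-valued factor.
# Each group lists [landslide count, non-landslide count].
groups = [[8, 2], [1, 9]]
print(gini_impurity([5, 5]))     # evenly mixed group: maximum impurity for two classes
print(weighted_gini(groups, 20))
```

A factor that separates landslide from non-landslide samples well yields a low weighted impurity, which is why the decrease in impurity can serve as an importance score.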

3.6. Accuracy evaluation and comparison

In this study, the sensitivity, specificity, accuracy, and receiver operating characteristic (ROC) curve of each landslide susceptibility model are assessed using the entries of the confusion matrix.

The sensitivity, which is the ratio of the number of samples correctly predicted as positive to the number of actual positive samples, is calculated using the following relation.

$Sensitivity = \frac{T P}{T P + F N}$ (8)

The specificity, which is the ratio of the number of samples correctly predicted as negative to the total number of actual negative samples, is calculated using the following relation.

$Specificity = \frac{T N}{T N + F P}$ (9)

The accuracy of the model, which is the ratio of the number of correct predictions to the total number of predictions, is calculated using the following relation.

$Accuracy = \frac{T P + T N}{T P + T N + F P + F N}$ (10)

where TP and TN are the numbers of correctly predicted landslides and non-landslides, respectively, while FP and FN are the numbers of samples incorrectly predicted as landslides and as non-landslides, respectively.

The receiver operating characteristic curve is the most commonly used quantitative evaluation technique for the prediction accuracy of landslide susceptibility models. The calculations and results of this method are simple and clear. Furthermore, the area under the curve (AUC) can be used to quickly test the accuracy of the model. For a useful model, the AUC ranges from 0.5 (no better than random guessing) to 1; the closer the AUC of a model is to 1, the higher its accuracy.
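Equations (8) to (10) and the AUC can be illustrated with the short sketch below. The confusion-matrix counts are hypothetical: they were picked for a test set of 246 landslide and 246 non-landslide points so that the resulting percentages are of the same order as those reported later, and they are not counts taken from the study. The AUC is computed with the rank-based (Mann-Whitney) formulation, which is one standard way to obtain it.

```python
def confusion_metrics(tp, tn, fp, fn):
    """Return (sensitivity, specificity, accuracy) per Equations (8)-(10)."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return sensitivity, specificity, accuracy


def auc(labels, scores):
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical counts for 246 landslide and 246 non-landslide test points.
sens, spec, acc = confusion_metrics(tp=241, tn=243, fp=3, fn=5)
print(f"{sens:.1%} {spec:.1%} {acc:.1%}")

# Tiny hypothetical example: one landslide scores below one non-landslide.
print(auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1]))
```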

4. Results and discussion

4.1. Landslide condition factor selection

In this study, the weights (AM) of 20 commonly used landslide impact factors were calculated using the Gini index. The factors with AM < 0 were excluded, and the 14 factors with AM > 0 were ranked from largest to smallest. All 14 selected factors have positive values (AM > 0), as shown in Figure 5. The distance to residences has the greatest value (AM = 0.414), because most historical landslide events obtained from the Anxi County Natural Resources Bureau occurred either in front of or behind residential structures. It is followed by distance to rivers (AM = 0.210), distance to roads (AM = 0.201), land use (AM = 0.089), altitude (AM = 0.058), NDVI (AM = 0.058), slope angle (AM = 0.048), aspect (AM = 0.032), TWI (AM = 0.025), distance to faults (AM = 0.021), rainfall (AM = 0.007), profile curvature (AM = 0.006), plan curvature (AM = 0.004), and lithology (AM = 0.003).

Figure 5. Average IG of conditioning factors.

4.2. Non-landslide data generation

4.2.1. Landslide buffer generates non-landslide data

Based on the historical landslide data, landslide buffers of 600, 900, 1200, and 1500 m were created. Using the GIS clipping tool, the buffers for each distance were clipped out of the base map, which is the regional map of Anxi County, and non-landslide data were randomly generated in the remaining area (Figure 6a-d).
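The buffer strategy can be sketched with simple rejection sampling: draw random points and keep only those farther than the buffer distance from every landslide. This is a simplified stand-in for the GIS workflow, assuming projected coordinates in metres and a rectangular extent instead of the county boundary; the function name, extent, and landslide location are hypothetical.

```python
import math
import random


def sample_outside_buffers(landslides, buffer_m, n_points, bounds,
                           seed=0, max_tries=100000):
    """Draw random points lying more than buffer_m from every landslide.

    landslides: list of (x, y) in metres; bounds: (xmin, ymin, xmax, ymax).
    """
    xmin, ymin, xmax, ymax = bounds
    rng = random.Random(seed)
    points = []
    tries = 0
    while len(points) < n_points and tries < max_tries:
        tries += 1
        x, y = rng.uniform(xmin, xmax), rng.uniform(ymin, ymax)
        if all(math.hypot(x - lx, y - ly) > buffer_m for lx, ly in landslides):
            points.append((x, y))
    return points


# Hypothetical 10 km x 10 km extent with a single landslide at its centre.
landslides = [(5000.0, 5000.0)]
non_landslides = sample_outside_buffers(landslides, 600, 50, (0, 0, 10000, 10000))
print(len(non_landslides))
```

Larger buffers shrink the area available for sampling, which is one reason the choice among 600, 900, 1200, and 1500 m can change the resulting models.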

Figure 6. Non-landslide inventory map: (a), (b), (c), and (d) represent 600, 900, 1200, and 1500 m, respectively, outside the landslide buffer zone. (e) Non-landslide inventory map: 250 m outside the residences buffer zone. (f) Non-landslide inventory map: with very low and low susceptibility areas of the IV model.

4.2.2. Conditioning factor generation for non-landslide data

To avoid choosing locations with significant historical landslide occurrence when generating non-landslide data, the factor class used to generate non-landslide data must contain few historical landslides. To this end, the following principles for non-landslide areas were developed: (a) the regions selected as non-landslide areas should contain a relatively low proportion of the historical landslides; (b) the non-landslide areas need to be large enough to prevent clustering of the non-landslide data and to represent the rest of the study area comprehensively. Based on these principles and the statistical analysis of the evaluation factors, the areas satisfying the first condition were identified as those with (i) slope above 40°, (ii) elevation greater than 1000 m, (iii) distance from houses above 250 m, (iv) TWI above 14, (v) limestone formation as stratigraphic lithology, (vi) rainfall below 1200 mm, (vii) distance from roads above 300 m, and (viii) land use type of water or others. Among these, two areas also met the second condition: (i) the area more than 250 m away from houses, which covers 64% of the area, and (ii) the area more than 300 m away from roads, which covers 20% of the area. The percentage of landslides in the former is only 3.5%, which is half of that in the latter, while the former area is about three times larger than the latter. Therefore, 821 non-landslide points were randomly generated in the area more than 250 m from houses (Figure 6e).

4.2.3. Information value model generation for non-landslide data

Even with the two principles for selecting non-landslide areas, non-landslide data generated from an area selected on a single layer may still be overly dominated by that single factor. This can be overcome by using the zoning results of the Anxi County information value model: the very low and low susceptibility intervals are selected as non-landslide areas, and the non-landslide points are then randomly generated within them. The selected landslide evaluation factor layers were analyzed by extracting the number of landslides and the zone area for each factor class using GIS. The ratio of the number of landslides to the number of raster cells in each factor class was then used to derive the information value of each class based on the information value model formula. The calculation results are shown in Table 2.

Table 2. Information values of landslide evaluation factors.


An overlay analysis based on the information value formula was performed with the weighted overlay tool in GIS to generate the landslide susceptibility map of Anxi County, which was divided into five susceptibility classes using the natural breaks method (Figure 7). Finally, an equal number of non-landslide points was randomly generated using the very low and low susceptibility zones of the information value model as non-landslide areas (Figure 6f).
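The final sampling step can be sketched as drawing random cells from a classified susceptibility grid, restricted to the two lowest classes. The integer encoding (1 = very low to 5 = very high), the toy grid, and the function name are assumptions for illustration; the study performs this step with GIS tools on a 30 m raster.

```python
import random


def sample_cells(class_grid, allowed_classes, n_points, seed=0):
    """Randomly pick n_points distinct (row, col) cells whose class is allowed."""
    candidates = [
        (r, c)
        for r, row in enumerate(class_grid)
        for c, cls in enumerate(row)
        if cls in allowed_classes
    ]
    if len(candidates) < n_points:
        raise ValueError("not enough low-susceptibility cells to sample from")
    return random.Random(seed).sample(candidates, n_points)


# Toy susceptibility grid with the assumed 1-5 class encoding.
grid = [
    [1, 2, 3],
    [4, 5, 1],
    [2, 2, 5],
]
cells = sample_cells(grid, allowed_classes={1, 2}, n_points=3)
print(cells)
```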

Figure 7. Landslide susceptibility map generated by IV model.

4.3. Different non-landslide data based on logistic regression models

The dataset required to build each model was generated by combining the landslide sample set with the non-landslide sample set produced by one of the three selection methods mentioned above and extracting the values of the 14 condition factors at every sample. The dataset was then divided into training and test sets in a 70:30 ratio using stratified sampling. Finally, multiple logistic regression models were developed, and the regression coefficients of the condition factors are listed in Table 3. Models A1, A2, A3, and A4 are the logistic regression models for landslide buffer distances of 600, 900, 1200, and 1500 m, respectively, while A5 and A6 are the logistic regression models for non-landslide points more than 250 m away from houses and in the low susceptibility area of the information value model, respectively.

Table 3. Condition factor logistic regression coefficients.


Further, the susceptibility index of Anxi County was computed for each logistic regression model by substituting the model's coefficients into Equation (4) using the raster calculator. Based on the natural breaks method, the susceptibility index was classified into five categories: very low, low, medium, high, and very high susceptibility zones, as shown in Figure 8.

Figure 8. Landslide susceptibility maps using: (a) model A1, (b) model A2, (c) model A3, (d) model A4, (e) model A5, and (f) model A6.

4.4. Different non-landslide data based on artificial neural network

The artificial neural network model used in the study had an input layer with 14 evaluation factors, nine hidden layers, and two output layers. The artificial neural network model uses a heuristic algorithm to self-optimize parameters during training; the momentums were set as 0.5, 0.9, and 0.99, and the learning rates were set as 0.01 and 0.1. The artificial neural network model parameters for training on the other datasets in this study include a learning rate of 0.01 and a momentum of 0.5, with 300 cycles. The connection weights were calculated between each neuron in the input and hidden layers and between the hidden layer and the output layer. The landslide susceptibility indices were calculated by combining the neural network weights with the activation function and were classified into five categories using the reclassification tool (Figure 9). Models B1, B2, B3, and B4 are the ANN models for landslide buffer distances of 600, 900, 1200, and 1500 m, respectively, while models B5 and B6 are the ANN models for non-landslide points 250 m or more from houses and in the low susceptibility zone of the information value model, respectively.

Figure 9. Landslide susceptibility maps using: (a) model B1, (b) model B2, (c) model B3, (d) model B4, (e) model B5, and (f) model B6.

4.5. Model validation

4.5.1. Validation of logistic regression model

Because the landslide susceptibility indices were classified with the natural breaks method, the number of raster cells in each zone differs between models. The zone areas therefore differ, which affects the number of historical landslides counted in each zone by the zonal statistics. Several statistical parameters were introduced to compare the advantages and disadvantages of the non-landslide samples selected by each method and to evaluate each model’s accuracy. The estimated evaluation indices are shown in Table 4.

Table 4. Accuracy validation parameters of logistic regression models.


As evident from Table 4, the best performance in landslide unit classification is shown by model A6 (sensitivity = 98.0%), followed by model A5 (sensitivity = 96.7%), models A2 and A3 (sensitivity = 90.2%), model A1 (sensitivity = 89.8%), and model A4 (sensitivity = 89.4%). Similarly, the best performance in the classification of non-landslide units is shown by model A5 (specificity = 99.2%), followed by model A6 (specificity = 98.8%), model A3 (specificity = 89.4%), model A2 (specificity = 89.0%), model A4 (specificity = 86.6%), and model A1 (specificity = 85.8%). The overall best performance in the classification of landslide and non-landslide units is shown by model A6 with the highest accuracy of 98.4%, followed by model A5 (accuracy = 98.0%), model A3 (accuracy = 89.8%), model A2 (accuracy = 89.6%), model A4 (accuracy = 88.0%), and model A1 (accuracy = 87.8%).

The ROC curves corresponding to the six models were generated to compare the prediction accuracy and stability of the models more intuitively. The ROC curves are shown in Figure 10, and the area under the curve (AUC) is given in Table 5. Based on the LR model, all the non-landslide sampling strategies show high performance (AUC > 0.9). However, the best performance is shown by model A6 (AUC = 0.997), followed by model A5 (AUC = 0.991), model A3 (AUC = 0.954), model A2 (AUC = 0.950), model A4 (AUC = 0.946), and model A1 (AUC = 0.936).
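
For readers unfamiliar with the AUC, the sketch below computes it as the probability that a randomly chosen landslide unit receives a higher susceptibility score than a randomly chosen non-landslide unit, which is equivalent to the area under the ROC curve. The labels and scores are hypothetical, and `auc_score` is an illustrative helper rather than the tool used in the study.

```python
def auc_score(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count one half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical validation units: 1 = landslide, 0 = non-landslide.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.95, 0.80, 0.70, 0.40, 0.60, 0.30, 0.20, 0.10]

auc = auc_score(labels, scores)
print(f"AUC = {auc:.4f}")  # 15 of 16 pairs are ranked correctly: 0.9375
```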

Figure 10. ROC curves of the six logistic regression models.

Table 5. AUC analysis for the six logistic regression models.


Overall, all the non-landslide sampling methods are acceptable for mapping landslide susceptibility in the study area. Based on the study’s findings, it can be concluded that, for the LR model, the non-landslide samples generated from the low susceptibility area of the information value model exhibited the best performance.
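
The best-performing strategy can be sketched as a random draw of non-landslide units restricted to cells that the information value model places in low susceptibility classes. The cell records, the class names, the default `allowed_classes`, and the function `sample_non_landslides` are all assumptions made for illustration; they do not reproduce the authors' GIS workflow.

```python
import random


def sample_non_landslides(cells, n_samples,
                          allowed_classes=("very low", "low"), seed=42):
    """Randomly pick non-landslide cells from the allowed susceptibility classes.

    `cells` is a list of (cell_id, susceptibility_class, is_landslide) tuples.
    """
    candidates = [cell_id for cell_id, cls, is_landslide in cells
                  if cls in allowed_classes and not is_landslide]
    if n_samples > len(candidates):
        raise ValueError("not enough candidate cells in the allowed classes")
    # A seeded generator keeps the draw reproducible.
    return random.Random(seed).sample(candidates, n_samples)


# Hypothetical grid of 50 cells with cycling classes and a few landslide cells.
classes = ["very low", "low", "moderate", "high", "very high"]
cells = [(i, classes[i % 5], i % 7 == 0) for i in range(50)]

non_landslides = sample_non_landslides(cells, n_samples=10)
print(sorted(non_landslides))
```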

4.5.2. Validation of artificial neural network model

Statistical parameters were used to validate the accuracy of the six models obtained and to compare the advantages and disadvantages of the three non-landslide sampling methods under the ANN-based model. The validation results are shown in Table 6.

Table 6. Accuracy validation parameters of artificial neural network models.


As evident from Table 6, the best performance in landslide unit classification was shown by model B6 (sensitivity = 98.0%), followed by model B5 (sensitivity = 95.9%), model B4 (sensitivity = 91.6%), and models B1, B2, and B3 (sensitivity = 90.7%). Similarly, the best performance in the classification of non-landslide units was shown by model B5 (specificity = 98.8%), followed by model B6 (specificity = 98.4%), model B3 (specificity = 88.6%), model B2 (specificity = 86.2%), model B4 (specificity = 85.8%), and model B1 (specificity = 84.6%). The overall best performance in the classification of landslide and non-landslide units was shown by model B6, which had the highest accuracy of 98.2%, followed by model B5 (accuracy = 97.0%), model B3 (accuracy = 89.6%), model B4 (accuracy = 88.7%), model B2 (accuracy = 88.4%), and model B1 (accuracy = 87.6%).

The general performance of the landslide models, assessed using ROC curves and the validation data, is shown in Figure 11 and Table 7. Based on the ANN models, all the non-landslide sampling strategies exhibit high performance (AUC > 0.9). The highest performance was shown by model B6 (AUC = 0.995), followed by model B5 (AUC = 0.989), model B3 (AUC = 0.956), model B2 (AUC = 0.952), model B4 (AUC = 0.946), and model B1 (AUC = 0.942).

Figure 11. ROC curves of the six artificial neural network models.

Table 7. AUC analysis for the six artificial neural network models.


Overall, all the non-landslide sampling methods are suitable for mapping landslide susceptibility in the study area. Based on the above analysis, it can be concluded that, for the ANN model, sampling non-landslide units from the low susceptibility area of the information value model gave the best performance.

4.5.3. Comparison with models from other studies

Other researchers have used different methods to study the selection strategy for non-landslide samples (Table 8). For example, Wang et al. (2022) varied the spatial extent from which non-landslide samples were selected and used a logistic regression algorithm to model landslide susceptibility; they found that the accuracy of the model sampled only in the mountainous area (AUC = 0.732) was lower than that of the model sampled in the whole research area (AUC = 0.830). Li et al. (2021) extracted non-landslide samples using random, buffer, and information value methods, respectively, and applied the resulting non-landslide samples to an SVM model for landslide hazard mapping in the western area of Tumen City. The results show that the prediction accuracy of all three methods is above 80%. Chang et al. (2023) constructed logistic regression and support vector machine models based on slope units using 16 condition factors in Chongyi County and discussed the effect of the number of extracted samples on non-landslide uncertainty. The results show that the accuracy and rationality of the landslide susceptibility evaluation increase with the number of non-landslide samples.

Table 8. Comparison of model prediction accuracy in different non-landslide point selection strategies (ROC)


On the basis of the above research, this paper improves on some previous methods and puts forward non-landslide sample selection methods based on the buffer method, the condition factor method, and the information value method. Landslide susceptibility models of the study area were then established using logistic regression and an artificial neural network, respectively. The results show that the accuracy of the proposed method is clearly higher than that of the previous methods. It must be recognized, however, that this could be related to differences in the study areas and their historical landslide samples.

5. Conclusion

This paper develops three non-landslide sampling methods, based on landslide buffers of different distances, condition factors, and the information value model, to examine the influence of non-landslide point selection strategies on the results of landslide susceptibility evaluation. The performance of the three non-landslide sampling methods in landslide susceptibility modeling and mapping was then evaluated using a logistic regression model and an artificial neural network model. The main conclusions are as follows:

Non-landslide sample selection significantly affects the prediction performance of landslide susceptibility models; adopting a reasonable selection strategy for non-landslide points enhances the accuracy of landslide susceptibility mapping.
The AUC values of all the models developed in this paper are greater than 0.9, indicating the high accuracy and reliability of the machine learning susceptibility evaluation models constructed in combination with the non-landslide sample selection methods.
The approach that uses the low susceptibility area of the information value model as the non-landslide generation area has the highest accuracy of the non-landslide selection methods compared. It minimizes the probability that the selected non-landslide samples fall in high-risk areas, enhancing landslide prediction accuracy.
Although the current study identified, through comparison, the non-landslide selection strategy with the highest prediction accuracy, it does not explore the influence of parameters on the evaluation results because of limitations such as insufficient spatial and temporal data. It is therefore suggested that future work carry out dynamic risk assessment of regional landslides based on time series data and vulnerability assessment, and further examine how the parameters of the non-landslide selection strategy influence the accuracy of the evaluation model.

Data availability

Data will be made available on request.

Disclosure statement

All authors disclosed no relevant relationships, and the authors have no conflict of interest to declare.

Additional information

Funding

The authors would like to acknowledge the financial support from the National Natural Science Foundation of China (No. U2005205, No. 42007235) and the Science and Natural Science Foundation of Fujian Province (No. 2023J01423).

References

Akinci H, Yavuz Ozalp A. 2021. Landslide susceptibility mapping and hazard assessment in Artvin (Turkey) using frequency ratio and modified information value model. Acta Geophys. 69(3):725–745. doi: 10.1007/s11600-021-00577-7.
Akinci H. 2022. Assessment of rainfall-induced landslide susceptibility in Artvin, Turkey using machine learning techniques. J Afr Earth Sci. 191:104535. doi: 10.1016/j.jafrearsci.2022.104535.
Azarafza M, Azarafza M, Akgün H, Atkinson PM, Derakhshani R. 2021. Deep learning-based landslide susceptibility mapping. Sci Rep. 11(1):24112. doi: 10.1038/s41598-021-03585-1.
Chen W, Pourghasemi HR, Panahi M, Kornejady A, Wang J, Xie X, Cao S. 2017. Spatial prediction of landslide susceptibility using an adaptive neuro-fuzzy inference system combined with frequency ratio, generalized additive model, and support vector machine techniques. Geomorphology. 297:69–85. doi: 10.1016/j.geomorph.2017.09.007.
Chen W, Xie X, Peng J, Shahabi H, Hong H, Bui DT, Duan Z, Li S, Zhu A-X. 2018. GIS-based landslide susceptibility evaluation using a novel hybrid integration approach of bivariate statistical based random forest method. Catena. 164:135–149. doi: 10.1016/j.catena.2018.01.012.
Chong XU. 2010. GIS platform and certainty factor analysis method based Wenchuan earthquake-induced landslide susceptibility evaluation. J Eng Geology. 18(1):15.
Constantin M, Bednarik M, Jurchescu MC, Vlaicu M. 2011. Landslide susceptibility assessment using the bivariate statistical analysis and the index of entropy in the Sibiciu Basin (Romania). Environ Earth Sci. 63(2):397–406. doi: 10.1007/s12665-010-0724-y.
Chang Z, Catani F, Huang F, Liu G, Meena SR, Huang J, Zhou C. 2023. Landslide susceptibility prediction using slope unit-based machine learning models considering the heterogeneity of conditioning factors. J Rock Mech Geotech Eng. 15(5):1127–1143. doi: 10.1016/j.jrmge.2022.07.009.
Dou J, Yunus AP, Merghadi A, Shirzadi A, Nguyen H, Hussain Y, Avtar R, Chen Y, Pham BT, Yamagishi H, et al. 2020. Different sampling strategies for predicting landslide susceptibilities are deemed less consequential with deep learning. Sci Total Environ. 720:137320. doi: 10.1016/j.scitotenv.2020.137320.
Gaidzik K, Ramírez-Herrera MT. 2021. The importance of input data on landslide susceptibility mapping. Sci Rep. 11(1):19334. doi: 10.1038/s41598-021-98830-y.
Hong H, Ilia I, Tsangaratos P, Chen W, Xu C. 2017. A hybrid fuzzy weight of evidence method in landslide susceptibility analysis on the Wuyuan area, China. Geomorphology. 290:1–16. doi: 10.1016/j.geomorph.2017.04.002.
Hong H, Tsangaratos P, Ilia I, Liu J, Zhu A, Chen W. 2018. Application of fuzzy weight of evidence and data mining techniques in construction of flood susceptibility map of Poyang County, China. Sci Total Environ. 625:575–588. doi: 10.1016/j.scitotenv.2017.12.256.
Huang Y, Zhao L. 2018. Review on landslide susceptibility mapping using support vector machines. Catena. 165:520–529. doi: 10.1016/j.catena.2018.03.003.
Ilia I, Tsangaratos P. 2016. Applying weight of evidence method and sensitivity analysis to produce a landslide susceptibility map. Landslides. 13(2):379–397. doi: 10.1007/s10346-015-0576-3.
Kavzoglu T, Sahin EK, Colkesen I. 2014. Landslide susceptibility mapping using GIS-based multi-criteria decision analysis, support vector machines, and logistic regression. Landslides. 11(3):425–439. doi: 10.1007/s10346-013-0391-7.
Kayastha P, Dhital MR, De Smedt F. 2013. Application of the analytical hierarchy process (AHP) for landslide susceptibility mapping: a case study from the Tinau watershed, west Nepal. Comput Geosci. 52:398–408. doi: 10.1016/j.cageo.2012.11.003.
Lee S, Choi J. 2004. Landslide susceptibility mapping using GIS and the weight-of-evidence model. Int J Geograph Inform Sci. 18(8):789–814. doi: 10.1080/13658810410001702003.
Li X, Cheng J, Yu D, Han Y. 2021. Research on non-landslide selection method for landslide hazard mapping.
Lima P, Steger S, Glade T. 2017. Comparison of non-landslide sampling strategies to counteract inventory-based biases within national-scale statistical landslide susceptibility models. EGU General Assembly Conference Abstracts 13523.
Mohammady M, Pourghasemi HR, Pradhan B. 2012. Landslide susceptibility mapping at Golestan Province, Iran: a comparison between frequency ratio, Dempster–Shafer, and weights-of-evidence models. J Asian Earth Sci. 61:221–236. doi: 10.1016/j.jseaes.2012.10.005.
Onagh M, Kumra VK, Rai PK. 2012. Landslide susceptibility mapping in a part of Uttarkashi district (India) by multiple linear regression method. Int J Geology, Earth Environ Sci. 2:102–120.
Peng L, Niu R, Huang B, Wu X, Zhao Y, Ye R. 2014. Landslide susceptibility mapping based on rough set theory and support vector machines: a case of the Three Gorges area, China. Geomorphology. 204:287–301. doi: 10.1016/j.geomorph.2013.08.013.
Pham BT, Pradhan B, Bui DT, Prakash I, Dholakia MB. 2016. A comparative study of different machine learning methods for landslide susceptibility assessment: a case study of Uttarakhand area (India). Environ Modell Software. 84:240–250. doi: 10.1016/j.envsoft.2016.07.005.
Pradhan B, Xu C, Dieu T, et al. 2017. Rainfall-induced landslide susceptibility assessment at the Chongren area (China) using frequency ratio, certainty factor, and index of entropy. Geocarto Int. 32(2):139–154.
Pradhan B, Lee S. 2010. Regional landslide susceptibility analysis using back-propagation neural network model at Cameron Highland, Malaysia. Landslides. 7(1):13–30. doi: 10.1007/s10346-009-0183-2.
Schicker R, Moon V. 2012. Comparison of bivariate and multivariate statistical approaches in landslide susceptibility mapping at a regional scale. Geomorphology. 161–162:40–57. doi: 10.1016/j.geomorph.2012.03.036.
Su Q, Zhang J, Zhao S, Wang L, Liu J, Guo J. 2017. Comparative assessment of three nonlinear approaches for landslide susceptibility mapping in a coal mine area. IJGI. 6(7):228. doi: 10.3390/ijgi6070228.
Umar Z, Pradhan B, Ahmad A, Jebur MN, Tehrany MS. 2014. Earthquake induced landslide susceptibility mapping using an integrated ensemble frequency ratio and logistic regression models in West Sumatera Province, Indonesia. Catena. 118:124–135. doi: 10.1016/j.catena.2014.02.005.
Wang C, Lin Q, Wang L, Jiang T, Su B, Wang Y, Mondal SK, Huang J, Wang Y. 2022. The influences of the spatial extent selection for non-landslide samples on statistical-based landslide susceptibility modelling: a case study of Anhui Province in China. Nat Hazards. 112(3):1967–1988. doi: 10.1007/s11069-022-05252-8.
Xu K, Guo Q, Li Z, Xiao J, Qin Y, Chen D, Kong C. 2015. Landslide susceptibility evaluation based on BPNN and GIS: a case of Guojiaba in the Three Gorges Reservoir Area. Int J Geographical Inform Sci. 29(7):1111–1124. doi: 10.1080/13658816.2014.992436.
Yavuz Ozalp A, Akinci H, Zeybek M. 2023. Comparative analysis of tree-based ensemble learning algorithms for landslide susceptibility mapping: a case study in Rize, Turkey. Water. 15(14):2661. doi: 10.3390/w15142661.
Youssef AM, Pourghasemi HR, Pourtaghi ZS, Al-Katheeri MM. 2016. Erratum to: landslide susceptibility mapping using random forest, boosted regression tree, classification and regression tree, and general linear models and comparison of their performance at Wadi Tayyah Basin, Asir Region, Saudi Arabia. Landslides. 13(5):1315–1318. doi: 10.1007/s10346-015-0667-1.
Zêzere JL, Pereira S, Melo R, Oliveira SC, Garcia RA. 2017. Mapping landslide susceptibility using data-driven methods. Sci Total Environ. 589:250–267. doi: 10.1016/j.scitotenv.2017.02.188.
Zhou C, Yin K, Cao Y, Ahmed B, Li Y, Catani F, Pourghasemi HR. 2018. Landslide susceptibility modeling applying machine learning methods: a case study from Longju in the Three Gorges Reservoir area, China. Comput Geosci. 112:23–37. doi: 10.1016/j.cageo.2017.11.019.
