Agronomy & Crop Ecology

Exploring relevant wavelength regions for estimating soil total carbon contents of rice fields in Madagascar from Vis-NIR spectra with sequential application of backward interval PLS

Pages 1-14 | Received 07 Nov 2019, Accepted 08 Jun 2020, Published online: 28 Jun 2020

ABSTRACT

Laboratory visible and near-infrared (Vis-NIR) spectroscopy with partial least squares (PLS) regression can be used to determine the soil carbon (C) content, and waveband selection procedures can refine its predictive ability. However, the individually selected wavebands differ depending on the location, scale, and approach. To simplify the variable selection issue, some methods for selecting wavelength regions instead of individual wavebands have been proposed. In this study, we explore relevant wavelength regions for predicting the total carbon (TC) content of lowland and upland soils in Madagascar from Vis-NIR spectroscopy using a dynamic version of backward interval PLS (biPLS) regression. The predictive ability of dynamic biPLS was compared with that of standard full-spectrum PLS (FS-PLS) using the cross-validated coefficient of determination (R2), root mean squared error (RMSE), and ratio of performance to interquartile distance (RPIQ). The biPLS models using reflectance (R2 = 0.877, RMSE = 0.690) and first derivative reflectance (FDR) (R2 = 0.940, RMSE = 0.494) data sets showed better predictive accuracy than the FS-PLS models using reflectance (R2 = 0.826, RMSE = 0.809) and FDR (R2 = 0.933, RMSE = 0.518) data sets, and the spectral efficiency was improved. With biPLS, the model for predicting soil TC was simplified to only four selected wavelength regions in the reflectance (400–490, 1402–1440, 1846–1980 and 2151–2283 nm) and FDR (652–687, 1322–1443, 1856–1985, and 2290–2400 nm) data sets, which yielded reliable (RPIQ > 2.5) predictions.


1. Introduction

The soil carbon (C) content is one of the most important properties in assessments of general soil fertility. Timely assessments of soil C can therefore support effective and sustainable fertilizer management practices, particularly in Sub-Saharan Africa (SSA), where smallholder farmers still rely on indigenous nutrient supplies from soils for crop production. Rice cultivation in Madagascar is typical of farming systems in SSA in which smallholder farmers are impoverished by stagnant yields resulting from infertile soil conditions with minimal external inputs (Tsujimoto et al., Citation2019). In Madagascar, rice is uniquely important not only as the staple food of the country but also as the major income source for rural livelihoods. However, rice yields have been stagnant at less than 3 t ha–1 for the last few decades despite relatively favorable water conditions, with 70% of rice cropping areas categorized as irrigated (Partnership, Citation2013). In a survey of several rice fields in the central highland of Madagascar, Tsujimoto et al. (Citation2009) showed a significant and linear relationship between rice yield and the soil organic carbon (SOC) content in relation to the N-supplying capacity of soils, strongly indicating that soil fertility management is critical to improving rice yields in the region.

Visible and near-infrared (Vis-NIR) spectroscopy, as a rapid and non-destructive technology, has been widely used to perform quantitative analyses of complex samples in agricultural sciences. The Vis-NIR reflectance spectra (400–2500 nm) obtained by laboratory spectral measurements include wavebands that have been related to physical and chemical properties of soil. The prediction of soil properties requires the development of a spectral library relating spectra to reference data. To date, soil spectral libraries have been developed by the Vis-NIR community around the world at country (Li et al., Citation2015; Romero et al., Citation2018), continental (Johnson et al., Citation2019; Stevens et al., Citation2013), and global (Viscarra Rossel et al., Citation2016) scales. These libraries can be used to develop calibration models for the prediction of soil properties. For Madagascar, however, only a small number of qualified samples have been recorded in these libraries: n = 82 at the continental scale (Johnson et al., Citation2019) and n = 18 at the global scale (Viscarra Rossel et al., Citation2016).

Partial least squares (PLS) regression is the most widely used multivariate calibration method because it can extract information on the target component from a spectral matrix with hundreds or even thousands of wavebands (Conforti et al., Citation2013, Citation2015). However, because PLS is a linear multivariate calibration, its accuracy tends to decrease when the relationship between the spectral data and the dependent variable is non-linear (Araújo et al., Citation2014). Among data-mining approaches, machine learning techniques such as artificial neural networks (ANN) (Kuang et al., Citation2015), support-vector machines (SVM) (Morellos et al., Citation2016), and random forest (Cipullo et al., Citation2019; Douglas et al., Citation2018; De Santana et al., Citation2018) have outperformed PLS analysis for predicting soil properties because they can account for the non-linearity associated with soil spectral responses. More recently, deep learning, a rapidly developing frontier in machine learning, has also been tested for calibrating soil spectra (Ng et al., Citation2019; Padarian et al., Citation2019). Although these studies reported that machine learning outperformed PLS regression, they did not suggest that it suits every situation, because machine learning, and especially deep learning, is a data-hungry approach that requires a large amount of data to make good predictions (Ng et al., Citation2019).

In Madagascar, no spectral library (data set) is available to support machine learning for estimating soil properties. In the present study, we focused on the development of a robust PLS model using a waveband selection approach based on a local data set that we collected in the central highland of Madagascar. In general, Vis-NIR spectra contain thousands of wavebands, and such a large number of spectral variables often contributes collinearity and redundancy rather than relevant information. Waveband selection is an important step not only for developing a robust calibration model but also for better understanding the relationship between spectra and soil properties. Indeed, it is increasingly evident that the inclusion of uninformative or redundant wavebands from the Vis-NIR spectrum degrades the PLS model and leads to inaccurate predictions (Andersen & Bro, Citation2010; H. D. Li et al., Citation2012). Thus, additional waveband selection methods based on PLS regression are necessary for NIR spectral analysis to refine the predictive ability.

To date, a large number of waveband selection methods for PLS analysis have been proposed, including individual waveband selection and wavelength region selection. For individual waveband selection, many methods have been developed, such as iterative stepwise elimination PLS (ISE-PLS) (Boggia et al., Citation1997), uninformative variable elimination PLS (UVE-PLS) (Centner et al., Citation1996), and genetic algorithm PLS (GA-PLS) (Leardi et al., Citation1992). Among the waveband selection methods, GA-PLS has been used as a suitable method in chemometrics (Leardi, Citation2000). Earlier studies reported that, after suitable modifications, GA-PLS shows better predictive performance and yields more interpretable results because the selected wavelengths are less dispersed than those with other methods (Kawamura et al., Citation2010; Leardi & González, Citation1998; Lucasius & Kateman, Citation1994). In our previous research (Kawamura et al., Citation2017), we developed a PLS model to estimate the total carbon (TC) contents of paddy soils from laboratory Vis-NIR measurements of soil samples collected from various rice fields in the central highland of Madagascar. The results indicated improvements in the predictive ability by applying individual waveband selection with ISE-PLS. Additionally, our previous study (Kawamura et al., Citation2019) indicated that GA-PLS obtained better solutions than ISE-PLS when estimating the oxalate-extractable P content of paddy soils in Madagascar. However, the computational cost of GA-PLS is very high when the number of wavebands is large. Another considerable issue with GA-PLS is over-fitting when using a large number of wavebands (>200) (Leardi & Nørgaard, Citation2004).

One solution to simplifying the problem of variable selection is to reduce the number of variables involved in the optimization (Zhang et al., Citation2017). Some methods for selecting wavelength regions instead of individual wavebands have been proposed, such as moving window PLS (MWPLS) (Jiang et al., Citation2002; Kasemsumran et al., Citation2004), interval PLS (iPLS) (Nørgaard et al., Citation2000), and backward interval PLS (biPLS) (Leardi & Nørgaard, Citation2004). MWPLS (Jiang et al., Citation2002; Kasemsumran et al., Citation2004) searches for informative spectral regions using a moving window, which moves over the whole spectral region to identify useful spectral intervals. However, the sub-region selected by the window with a fixed size does not always supply the best predictions. Therefore, Du et al. (Citation2004) developed changeable size moving window PLS (CSMWPLS) to optimize the informative regions and their combinations to further improve the predictive ability of the PLS models. Both iPLS (Nørgaard et al., Citation2000) and biPLS (Leardi & Nørgaard, Citation2004) calculate local PLS models using equally sized subintervals of the full spectrum region and identify the optimal combinations of regions by forward and backward selection, respectively. Similar to MWPLS, however, they encounter problems when the border between two contiguous intervals of equally spaced spectral regions falls inside the same spectral feature, such as when the main part of a reflectance peak is in one interval and its tail in the next interval. The solution can be found by running biPLS several times, with a different number of intervals with different interval sizes each time. To overcome these problems, a dynamic version of biPLS has been developed by Leardi and Nørgaard (Citation2004). Dynamic biPLS runs several times using a different composition of the deletion groups (determined by randomizing the order of the samples) and with a different number of intervals (e.g. from 16 to 25).
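The interval-border problem described above can be illustrated with a minimal sketch. It assumes 2001 bands at 1 nm spacing (400–2400 nm) and a simple rule for distributing leftover bands among intervals; the rule actually used by the biPLS implementation is not stated in the text, and the 1900–1930 nm feature is a hypothetical example. The helper names `interval_edges` and `interval_containing` are illustrative only.

```python
def interval_edges(n_bands, n_intervals):
    """Split band indices 0..n_bands-1 into contiguous, near-equal intervals.

    Returns a list of (start, stop) index pairs with stop exclusive.
    Assumption: leftover bands are given to the first intervals.
    """
    base, extra = divmod(n_bands, n_intervals)
    edges = []
    start = 0
    for k in range(n_intervals):
        size = base + (1 if k < extra else 0)
        edges.append((start, start + size))
        start += size
    return edges


def interval_containing(edges, index):
    """Return the position of the interval that holds the given band index."""
    for k, (start, stop) in enumerate(edges):
        if start <= index < stop:
            return k
    raise ValueError("index outside spectrum")


FIRST_NM = 400   # first wavelength of the analysed range
N_BANDS = 2001   # one band per nm from 400 to 2400 nm

# Hypothetical absorption feature spanning 1900-1930 nm (illustration only).
feature_lo = 1900 - FIRST_NM
feature_hi = 1930 - FIRST_NM

for n_intervals in range(16, 26):
    edges = interval_edges(N_BANDS, n_intervals)
    split = (interval_containing(edges, feature_lo)
             != interval_containing(edges, feature_hi))
    status = "split across a border" if split else "inside one interval"
    print(n_intervals, "intervals: feature", status)
```

Because the borders move as the number of intervals changes, a feature that is cut by a border in one run can fall inside a single interval in another, which is the motivation for repeating biPLS over a range of interval counts.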

Here, we adopted this dynamic biPLS approach to explore relevant wavelength regions for prediction of the TC content of upland and lowland soils in Madagascar. To evaluate the performance of the selected wavelength regions for TC calibration, this study compares the predictive ability of dynamic biPLS with that of standard full-spectrum PLS (FS-PLS).

2. Materials and methods

2.1. Data set

This study used the same data set as our previous studies (Kawamura et al., Citation2019, Citation2017); the data set was generated based on laboratory Vis-NIR spectroscopy using soil samples (n = 162) collected from upland and lowland rice fields in the central highland of Madagascar (Figure 1). This area is dominated by inherently nutrient-poor soil types that are mainly classified into Ferralsols and Acrisols (IUSS Working Group, WRB, Citation2014) or into Oxisols of semiarid to humid climates (Soil Survey Staff, Citation2014). In 2015 and 2016, soil sampling was conducted in 158 rice fields. Surface soil samples were collected from a 0–10 cm depth as composites of three to four cores in each field. Within three fields, sub-surface samples (10–20 cm depth in one field; 10–20, 20–30, and 30–40 cm depths in two fields) were also collected to evaluate the effect of the depth of soil layers. Thus, 165 soil samples were obtained. The soil samples were sieved to <2 mm and air-dried for 7 days in the laboratory. The TC contents of soils were determined using the dry combustion method with an automatic NC analyser, SUMIGRAPH NC-220 F (Sumika Chemical Analysis Service, Ltd., Osaka, Japan).

Figure 1. Location of study area and soil sampling points. Source for background images in (a), (b) and (d): Esri, DigitalGlobe, GeoEye, Earthstar Geographics, CNES/Airbus DS, USDA, USGS, AeroGRID, IGN, and the GIS User Community


The Vis-NIR spectra for dry soil samples were recorded using an ASD FieldSpec 4 Hi-Res spectroradiometer (ASD Inc., Longmont, CO, USA) and an ASD contact-probe in a dark room. Preprocessing, including noise reduction by standard normal variate (SNV) and outlier detection, was performed on the reflectance and first derivative reflectance (FDR) spectra over a wavelength range from 400 to 2400 nm (2001 bands). The outliers were detected based on the Mahalanobis distance H > 3 from principal component analysis (Kawamura et al., Citation2017). As a result, three samples were considered outliers, leaving 162 samples for further analyses.
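As a rough illustration of the preprocessing steps named above, the sketch below applies SNV to a single spectrum and approximates its first derivative. The derivative algorithm used in the study is not stated here, so the central-difference scheme, the 1 nm step, and the function names `snv` and `first_difference` are assumptions made for illustration.

```python
import math


def snv(spectrum):
    """Standard normal variate: centre and scale one spectrum by its own mean and SD."""
    n = len(spectrum)
    mean = sum(spectrum) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in spectrum) / (n - 1))
    return [(v - mean) / sd for v in spectrum]


def first_difference(spectrum, step_nm=1.0):
    """Approximate the first derivative with respect to wavelength.

    Assumption: central differences for interior bands and one-sided
    differences at the two ends, so the output keeps the input length.
    """
    n = len(spectrum)
    deriv = [0.0] * n
    deriv[0] = (spectrum[1] - spectrum[0]) / step_nm
    deriv[-1] = (spectrum[-1] - spectrum[-2]) / step_nm
    for i in range(1, n - 1):
        deriv[i] = (spectrum[i + 1] - spectrum[i - 1]) / (2.0 * step_nm)
    return deriv


# Toy reflectance values, not measured data.
toy = [0.10, 0.12, 0.15, 0.19, 0.24]
print(snv(toy))
print(first_difference(toy))
```

In practice a smoothing derivative filter is often preferred for noisy spectra; the plain difference is used here only to keep the sketch short.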

2.2. PLS calibrations

PLS calibrations were performed based on reflectance and FDR spectra data sets using ‘PLS_Toolbox’ in MATLAB software ver. 9.3 (MathWorks Herborn, MA, USA). The dynamic biPLS (Leardi & Nørgaard, Citation2004) was computed using the ‘iToolbox’ revision released in March 2013 (http://www.models.kvl.dk/iToolbox).

The dynamic biPLS was performed in 50 runs with the number of intervals varying from 16 to 25 and with three runs (each with a different composition of the deletion groups) for each number of intervals. After 50 runs of biPLS, the wavelength regions were selected by a backward stepwise selection procedure based on the frequency of selections. The final output is a plot showing how many times each waveband was retained after 50 runs with the threshold value (Leardi & Nørgaard, Citation2004).
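The frequency plot described above can be mimicked with a small sketch that counts how often each waveband is retained across runs and then groups frequently retained bands into contiguous regions. The fixed threshold is a simplification of the backward stepwise selection used in the study, and the function names and toy runs are hypothetical.

```python
def selection_frequency(n_bands, runs):
    """Count, for each band index, in how many runs it was retained.

    `runs` is a list of runs; each run is a list of (start, stop) index
    pairs (stop exclusive) for the intervals that run kept. Intervals
    within one run are assumed not to overlap.
    """
    counts = [0] * n_bands
    for kept_intervals in runs:
        for start, stop in kept_intervals:
            for i in range(start, stop):
                counts[i] += 1
    return counts


def regions_above(counts, threshold):
    """Merge consecutive bands whose count reaches the threshold into regions."""
    regions = []
    start = None
    for i, c in enumerate(counts):
        if c >= threshold and start is None:
            start = i
        elif c < threshold and start is not None:
            regions.append((start, i))
            start = None
    if start is not None:
        regions.append((start, len(counts)))
    return regions


# Three toy runs over a 10-band spectrum (illustration only).
toy_runs = [
    [(0, 3), (6, 10)],
    [(1, 4)],
    [(2, 5), (7, 9)],
]
counts = selection_frequency(10, toy_runs)
print(counts)                    # [1, 2, 3, 2, 1, 0, 1, 2, 2, 1]
print(regions_above(counts, 2))  # [(1, 4), (7, 9)]
```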

2.3. Predictive ability of the PLS models

To evaluate the predictive ability of the FS-PLS and dynamic biPLS, a k-fold cross-validation procedure based on independent training and test data sets was performed (Emmert-Streib & Dehmer, Citation2019). Initially, the data were divided randomly into training (n = 120) and test (n = 42) data sets. Next, the training data were split randomly into k-folds. Here, we used k = 5; therefore, each k-fold has n = 24 samples. The PLS model was built on k – 1 folds of the training data set (n = 96), and then the error of the kth fold was recorded as validation data (n = 24). This process was repeated until each of the k-folds served as the validation data set. The coefficient of determination (R2) and root mean squared error (RMSE) values were used to assess model accuracy. Finally, the model was applied to the test data set, and then the predictive ability was evaluated from the R2, the RMSE and the ratio of performance to interquartile range (RPIQ) (Bellon-Maurel et al., Citation2010) in the test data set.
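The sample counts in this procedure can be checked with a short sketch of the split. The random seed and the helper names are assumptions, and the PLS fitting step is left as a comment because it depends on the modelling software.

```python
import random


def split_train_test(n_samples, n_test, seed=0):
    """Randomly split sample indices into training and test index lists."""
    rng = random.Random(seed)
    indices = list(range(n_samples))
    rng.shuffle(indices)
    return indices[n_test:], indices[:n_test]


def k_folds(train_indices, k, seed=0):
    """Randomly partition the training indices into k folds."""
    rng = random.Random(seed)
    shuffled = list(train_indices)
    rng.shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]


train, test = split_train_test(162, 42)
folds = k_folds(train, 5)

for held_out in range(5):
    validation = folds[held_out]
    calibration = [i for j, fold in enumerate(folds) if j != held_out for i in fold]
    # Fit the PLS model on `calibration`, then record R2 and RMSE on `validation`.
    print(len(calibration), len(validation))  # 96 24

# The final model is then applied once to the 42 held-out test samples.
```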

The R2 is calculated as:

$$R^2 = 1 - \frac{\mathrm{RSS}}{\mathrm{TSS}} \qquad (1)$$

where RSS is the residual sum of squares and TSS is the total sum of squares. The RMSE is defined as:

$$\mathrm{RMSE} = \sqrt{\frac{\sum_{i=1}^{n} (y_i - y_p)^2}{n}} \qquad (2)$$

where yi and yp represent the measured and predicted soil TC contents for sample i, respectively, and n is the number of samples in the test data set (n = 42). R2 is a measure of how well the variation in one variable explains the variation in another variable and is presented as the percentage of the variation explained by a best-fit regression line. RMSE indicates the total prediction error of the model. In general, high R2 and low RMSE values reflect models that can better predict the soil TC content (Kusumo et al., Citation2008).

The RPIQ is defined as follows:

$$\mathrm{RPIQ} = \frac{\mathrm{IQ}}{\mathrm{RMSEP}} \qquad (3)$$

where IQ is the interquartile distance between Q3 and Q1 of the observed values and RMSEP is the RMSE of prediction. In terms of the performance ability of the calibration model, RPIQ values >2.5 are considered to reflect excellent models, RPIQ values between 2.0 and 2.5 indicate a very good model with predictive ability, RPIQ values between 1.7 and 2.0 indicate a good model, RPIQ values between 1.4 and 1.7 indicate a fair model in need of some improvement, and RPIQ values <1.4 indicate that a model has a very poor predictive ability (Nawar & Mouazen, Citation2017).
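The three statistics can be computed as in the sketch below, which takes R2 as 1 − RSS/TSS with RSS the residual sum of squares, and RPIQ as the interquartile distance of the observed values divided by the RMSE of the predictions. The quartile convention (linear interpolation, `method="inclusive"`) is an assumption, since the text does not say which one was used, and the observed and predicted values are toy numbers.

```python
import math
import statistics


def r_squared(observed, predicted):
    """Coefficient of determination, 1 - RSS/TSS."""
    mean_obs = sum(observed) / len(observed)
    rss = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    tss = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - rss / tss


def rmse(observed, predicted):
    """Root mean squared error of the predictions."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)


def rpiq(observed, predicted):
    """Interquartile distance (Q3 - Q1) of the observed values divided by the RMSE."""
    q1, _, q3 = statistics.quantiles(observed, n=4, method="inclusive")
    return (q3 - q1) / rmse(observed, predicted)


# Toy values for illustration only, not data from the study.
observed = [1.0, 2.0, 3.0, 4.0, 5.0]
predicted = [1.1, 1.9, 3.2, 3.8, 5.0]

print(round(r_squared(observed, predicted), 3))  # 0.99
print(round(rmse(observed, predicted), 3))       # 0.141
print(round(rpiq(observed, predicted), 2))       # 14.14
```

Unlike a ratio based on the standard deviation, the interquartile distance does not assume normally distributed observations, which is relevant for skewed soil data.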

3. Results and discussions

3.1. Soil TC statistics

The descriptive statistics of the soil TC (%) in the 162 samples (Kawamura et al., Citation2019, Citation2017) are shown in Table 1, and the data distribution is illustrated in Figure 2. The mean (and standard deviation (SD)) value was 3.05% (1.72%), with a range of 0.65–10.15%. The TC distribution was right-skewed, with a higher mean value (3.05%) than the median value (2.70%). The coefficient of variation (CV) was relatively high (56.40%), indicating a high degree of variation and a heterogeneous distribution. The SD and range of the sample affect the accuracy of soil property predictions using Vis-NIR spectroscopy, and the wide range of variability indicated that this site is a reasonably optimal case study site (Kuang & Mouazen, Citation2011). In the present study, the range of soil TC values was considered sufficiently large to develop calibration models using PLS regression analyses. Soil TC was generally higher in surface soils than in sub-surface soils at the respective fields where sub-surface samples were collected. Meanwhile, the variation in TC content between surface and sub-surface layers within each field was much smaller than that among the surface soils of all 158 fields.

Table 1. Descriptive statistics of soil TC data

Figure 2. (a) Box-and-whiskers plot with outliers and (b) a histogram of soil TC


3.2. Soil reflectance and FDR spectra

Figure 3 shows the soil reflectance and FDR spectra. Large variations in the reflectance spectra were observed in the heterogeneous soil samples, which were collected from upland and lowland fields under various rice-based cropping systems. In the reflectance spectra, three strong absorption features were found in the 1400-, 1900- and 2200-nm wavelength regions (Figure 3a). The FDR spectra also showed some peaks in the same regions and in visible regions (Figure 3b). The Vis-NIR spectra show the general characteristics of absorption wavebands associated with color (400–700 nm), the bending (1413 nm) and stretching (1916 nm) of the O-H bonds of free water, and lattice minerals (approximately 2210 nm) (Viscarra Rossel et al., Citation2006; Ben-Dor, Citation2002; Knadel et al., Citation2013; Stenberg et al., Citation2010).

Figure 3. Raw reflectance spectra (a) and FDR spectra on a log10 scale (b) of the soil samples


In general, soil reflectance decreases with increasing organic matter (Ben-Dor et al., Citation1997) and water content (Whiting et al., Citation2004). Wavelengths centered at approximately 400, 450, 510, 550, 700, 870 and 1000 nm are characteristic of the presence of ferrous and ferric iron oxides and are due to the electronic transitions of the iron cations (Ben-Dor et al., Citation1999). In addition to soil components, physical soil properties, such as particle size distribution and aggregate size and density, also affect both the reflectance intensity and shape of the soil spectra through the phenomena of light scattering and reflection (Bellon-Maurel & McBratney, Citation2011; Conforti et al., Citation2018). Thus, the soil spectral behavior can be considered a combination of the chemical and physical properties of soil (Clark, Citation1999). Conforti et al. (Citation2018) reported that soil reflectance was relatively high for loamy sand soils because of the high amount of quartz in the sand fraction, whereas reflectance decreased as the clay content, dominated by phyllosilicates, increased and, consequently, as the SOC concentration increased.

3.3. Selected wavelength regions from dynamic biPLS

The selected wavelength regions and the frequency of selections after 50 runs of dynamic biPLS using reflectance and FDR spectra to estimate soil TC are shown in Figure 4. Table 2 summarizes the selected wavelength regions together with previously known wavebands related to soil components to assist in considering the importance of the selected wavelengths.

Table 2. Selected wavelength regions from dynamic biPLS for estimating the soil TC content using reflectance and FDR data sets and possible soil components

Figure 4. Selected wavelength regions (red bars) from dynamic biPLS for estimating the TC content of paddy soils using reflectance (a) and FDR (b) spectra with the frequencies (count number (N); blue line) of the selected wavebands in dynamic biPLS. Specific absorption bands for the different bonds in soil are specified in the top x-axis (modified from Katuwal et al. (Citation2018))


In the reflectance data set, four regions of 400–490, 1402–1440, 1846–1980 and 2151–2283 nm were selected in the model. Judging from the high selection frequency, the 1846–1980 nm region was considered the most important region in the reflectance data sets for soil TC predictions. The regions of 1402–1440 and 1846–1980 nm include several wavebands known to be relevant to soil free water and to vary with the soil organic matter content (Knadel et al., Citation2013). Our current data set also showed a significant correlation between the air-dried soil water content and TC content (p < 0.1%, r = 0.625). Since soil organic matter increases soil water retention (Rawls et al., Citation2003), we assumed that the increase in soil water reflected a co-occurrence relation with the increase in soil organic matter. The wavelength region of 2151–2283 nm mainly consisted of the wavebands associated with organic matter and Al hydroxides. Meanwhile, the wavelength region of 400–490 nm is related to Fe oxides, which are mainly associated with soil color as well as soil organic carbon. Soils in the tropics are rich in Fe and Al (hydr)oxides because of intensive weathering and leaching (Ramaroson et al., Citation2018), and Fe and Al (hydr)oxides are well known to increase the stability of organic matter in soils through the formation of organo-metal complexes (Van De Vreken et al., Citation2016). Organic matter is spectrally active in large regions of the Vis-NIR spectrum due to overtones and combinations of NH, CH, and CO groups (Ben-Dor et al., Citation1997).

In the FDR data set, four different wavelength regions in terms of wavelength and width (652–687, 1322–1443, 1856–1985, and 2290–2400 nm) were selected in the dynamic biPLS model. However, the two wavelength regions (1322–1443 and 1856–1985 nm) overlap with those selected in the reflectance data set (1402–1440 and 1846–1980 nm), suggesting the constant significance of soil free water for soil TC predictions. The 2290–2400 nm region contains many wavebands related to soil components, such as organic matter and Fe hydroxides. The wavelengths associated with organic matter composition were selected in the higher wavelength regions with precise information on the structural and functional groups in this data set rather than the reflectance data set because the FDR process enhances the narrow absorption features of organic matter. The wavelength region of 652–687 nm includes the wavebands related to Fe oxides (goethite and hematite), which are important for the stabilization of organic matter in soils (Saidy et al., Citation2012). These selected wavelengths were also identified as potentially important wavebands for soil TC prediction in our previous study using part of the current sample set (Kawamura et al., Citation2017).

The selected wavelength regions did not vary substantially between the surface and sub-surface samples, while the spectral response of reflectance and FDR simply followed the soil TC gradient of the samples. This suggests that the important wavelengths for estimating soil TC content did not change with soil depth and that the developed model could be applicable to a wide range of soil samples from rice fields in the central highlands of Madagascar. Nevertheless, in the current study, only a limited number of soil samples were collected from both surface and sub-surface layers, and the difference in soil TC content between surface and sub-surface soil samples ranged from 0.76 to 2.95%, which is much narrower than the variation among all the 165 samples (). Further study is needed to confirm the effect of soil depth on the wavelengths associated with soil TC content.

3.4. Evaluation of the predictive ability

Table 3 summarizes the number of selected wavebands (NW), the NW as a percentage of all available bands (NW% = NW/2001 bands × 100%), the optimal number of latent variables (NLVs), the cross-validated mean R2 and RMSE in the validation data set (n = 24), and the R2, RMSE and RPIQ values from the model on the independent test data set (n = 42). The optimal NLVs, which were determined as those giving the lowest RMSE values calculated from leave-one-out cross-validation (LOO-CV) to avoid over-fitting of the model, were lower with FS-PLS (10 in the reflectance data set and 8 in the FDR data set) than with biPLS (12 and 10, respectively). The NW (NW%) remaining after 50 runs of dynamic biPLS was 398 (19.9%) for the reflectance data set and 399 (19.9%) for the FDR data set, suggesting that more than 80% of the waveband information from the soil reflectance spectrum was redundant and either did not contribute to the prediction or disturbed it. These results are consistent with previous findings suggesting that the spectral efficiency of PLS models can be improved through waveband selection and that the most useful information in the Vis-NIR region (400–2400 nm) is contained in less than 20% of the spectral data (Kawamura et al., Citation2017, Citation2010; Wang et al., Citation2017).
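The reported NW and NW% values can be reproduced from the selected wavelength regions with the short check below, assuming 1 nm band spacing and inclusive region endpoints; the variable and function names are illustrative.

```python
N_BANDS = 2001  # 400-2400 nm at 1 nm spacing

# Selected wavelength regions (nm) as reported for each data set.
regions = {
    "reflectance": [(400, 490), (1402, 1440), (1846, 1980), (2151, 2283)],
    "FDR": [(652, 687), (1322, 1443), (1856, 1985), (2290, 2400)],
}


def count_bands(region_list):
    """Number of 1 nm bands covered by inclusive (low, high) regions."""
    return sum(high - low + 1 for low, high in region_list)


for name, region_list in regions.items():
    nw = count_bands(region_list)
    nw_percent = 100.0 * nw / N_BANDS
    print(name, nw, round(nw_percent, 1))
# reflectance 398 19.9
# FDR 399 19.9
```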

Table 3. The mean R2 and RMSE from the 5-fold cross-validation using the training data sets based on PLS analyses and the R2, RMSE and RPIQ based on the model applied to the test data sets

Figure 5 shows the relationship between observed and predicted soil TC contents in the test data set (n = 42) from the FS-PLS and dynamic biPLS models using the reflectance and FDR spectra data sets. Clearly, the FDR data sets subjected to FS-PLS (R2 = 0.933, RMSE = 0.518) and biPLS (R2 = 0.940, RMSE = 0.494) yielded better predictive accuracies than the reflectance data sets subjected to FS-PLS (R2 = 0.826, RMSE = 0.809) and biPLS (R2 = 0.877, RMSE = 0.690). Based on the RPIQ values from the FDR data set (RPIQ > 2.5), the models can be considered to have excellent predictive ability. First derivative processing is a key preprocessing step in analytical chemistry for reducing the background signal (e.g. soil color or particle size) and enhancing the narrow absorption features related to soil properties (Brunet et al., Citation2007; Reeves et al., Citation2002). Thus, many researchers have used FDR spectra to estimate soil C contents (Chang et al., Citation2001; Kawamura et al., Citation2017; Reeves et al., Citation2002; Russell, Citation2003; Shepherd & Walsh, Citation2002).

Figure 5. Observed and predicted soil TC contents from the FS-PLS (blue) and dynamic biPLS (red) models using original reflectance (a) and FDR (b) data


The dynamic biPLS models showed better predictive ability than the FS-PLS models while using fewer variables (selected wavelength regions), which suggests that a simpler and cheaper spectrophotometer could be used. These findings confirm previous results showing that the performance of PLS models can be improved through wavelength selection (Cramer et al., Citation2008; Du et al., Citation2004; Goicoechea & Olivieri, Citation2003; Jiang et al., Citation2002; Kasemsumran et al., Citation2004). Moreover, previous researchers have suggested that reducing large spectral datasets is valuable for more efficient storage, computation, and transmission (Yang et al., Citation2012) and for increasing the ease of spectral analysis (Viscarra Rossel & Lark, Citation2009).

In the present study, our results confirmed that soil TC could be rapidly predicted using selected wavelength regions with better predictive accuracy than FS-PLS, and sequential application of dynamic biPLS may be a feasible strategy for local assessments of soil TC. However, we note that our results were derived from a small and heterogeneous set of soil samples (n = 162), which were collected from upland and lowland soils under various rice-based cropping systems, including a wide range of soil types, in the central highland of Madagascar. Several researchers consider the reliability of predictions questionable when studying heterogeneous sample sets (Brunet et al., Citation2007). Particle size and arrangement may also affect the calibration due to the light transmission path (Chang et al., Citation2001). Moreover, if the calibration model is to be widely implemented, then large-scale data sets with regional or global content must be considered because the nature of the target function strongly affects the performance of the different prediction approaches, and different studies therefore provide different results (Gholizadeh et al., Citation2018). Compared with previous studies using large spectral data sets (Hermansen et al., Citation2016; Stevens et al., Citation2013), our data set covered a relatively small diversity of soil carbon contents, which makes it difficult to assess robustness and applicability at a larger spatial scale. To map and assess the spatial distribution of the carbon stock at a larger spatial scale in Madagascar, an appropriate spatial scale needs to be evaluated with a larger data set (Ramifehiarivo et al., Citation2017; Saiano et al., Citation2013). Meanwhile, a study by Stevens et al. (Citation2013) using a large-scale EU soil survey data set (n = 20,000) reported that the predictive ability of SOC calibrations was related to the different levels of SOC and to variations in other soil properties (sand and clay content). They also suggested that a large spectral data set can be valuable for building local and more accurate models that are specific to a given geographical entity or soil type. Therefore, to apply the methodology to soil characterization of the whole island of Madagascar, the calibration models should be evaluated for the effect of the heterogeneous data set and updated using a larger data set collected from various regions in Madagascar in the future.

4. Conclusions

Wavelength region selection rather than individual waveband selection is one approach to simplifying variable selection. In this study, we explored relevant wavelength regions for prediction of the TC content in paddy soils in Madagascar using dynamic biPLS. Our results confirmed that a large range of soil TC (0.65–10.15%) can be rapidly and non-destructively predicted by Vis-NIR spectroscopy, and that the predictive ability was improved by wavelength region selection with dynamic biPLS. Rapid estimations of soil TC can be used to assess soil fertility and support farmers in implementing suitable fertilizer management practices to improve crop production. Sequential application of biPLS suggested that the important wavelength regions for estimating soil TC were 400–490, 1402–1440, 1846–1980 and 2151–2283 nm in the reflectance data sets (398 bands, 19.9%) and 652–687, 1322–1443, 1856–1985, and 2290–2400 nm in the FDR data sets (399 bands, 19.9%). The selected wavelength regions were considered to be associated with organic matter and with Fe and Al oxides, which are common in tropical soils and effective for sorbing and stabilizing soil organic matter. These findings are consistent with previously known soil TC-related absorption features. Thus, the selected wavelength regions should be considered informative wavelength regions for estimating soil TC. Based on the selected FDR wavelength regions in the biPLS model, the soil TC predictions were considered to be excellent (RPIQ > 2.5), with an RMSE of 0.494% in the independent test data set. These findings indicate that sequential application of biPLS is a feasible approach for optimizing wavelength region selection and the combinations of regions for soil TC prediction with Vis-NIR spectroscopy. To up-scale the soil TC calibrations, future analyses will be expanded to examine the effects of heterogeneous samples and extended to the whole island of Madagascar.

Acknowledgments

We would like to especially thank Dr. Naoki Moritsuka, Graduate School of Agriculture, Kyoto University in Japan, for his valuable comments on this manuscript.

Disclosure statement

No potential conflict of interest was reported by the authors.

Additional information

Funding

This research was supported by the Science and Technology Research Partnership for Sustainable Development (SATREPS), Japan Science and Technology Agency (JST)/Japan International Cooperation Agency (JICA) (Grant No. JPMJSA1608).

References