ENVIRONMENTAL MANAGEMENT & CONSERVATION

Imputation methods for recovering streamflow observation: A methodological review

Fatimah Bibi Hamzah, Firdaus Mohamad Hamzah, Siti Fatin Mohd Razali, Othman Jaafar & Norhayati Abdul Jamil | (Reviewing editor)
Article: 1745133 | Received 16 Oct 2019, Accepted 05 Mar 2020, Published online: 06 Apr 2020

Abstract

Missing values are a common problem in hydrological studies and have long been discussed by researchers. Various patterns and mechanisms of “missingness” can occur, and these affect how the researcher should treat the gaps before analyzing the data. If the consequences of missing values are disregarded, the outcomes of statistical analyses will be distorted and the variability in the data will not be properly represented. The aim of this paper is to outline the patterns and mechanisms of missing data, review several infilling techniques suited to time series analysis of streamflow, and discuss the practical advantages and drawbacks of these approaches. The simplest infilling approaches are discussed alongside more developed techniques, such as model-based deterministic imputation methods and machine learning methods. We conclude that careful attention should be given to the method chosen to handle gaps in hydrological data, since mishandled missing data can lead to misinterpretation of the resulting statistics.

Public Interest Statement

Missing data is one of the most common problems in almost all research areas, and there are many reasons why data may be missing. Generally, if the consequences of missing values are not taken into consideration, the outcomes of statistical analyses will be distorted and the amount of variability in the data will not be properly represented, leading to invalid conclusions. This manuscript reviews the problems and types of missing data, along with techniques for handling them. We summarize the patterns and mechanisms of missing data, review several infilling techniques suited to time series analysis, and discuss the practical advantages and disadvantages of these approaches. The goal of this review is to raise awareness of the various imputation techniques available in the hydrological context. The paper concludes with recommendations for handling missing data.

Competing Interests

The authors declare no competing interests.

1. Introduction

Streamflow data play a significant role in understanding the hydrological functioning of watersheds. The availability of complete, good-quality streamflow data of sufficiently long duration is a major requirement for defining flow characteristics, estimating the occurrence of extreme events such as high or low flows, and addressing problems such as landslides, droughts and floods (Tencaliec et al., Citation2015). Nevertheless, streamflow studies generally face the missing data issue, and it is not rare to obtain records that are affected by gaps, of dubious quality, or of short duration (Norliyana et al., Citation2017). Every so often the information is simply not accessible, and in many cases incomplete observations, missing values or outliers dramatically complicate the work of the analyst (Kim et al., Citation2015).

Datasets are often incomplete for several reasons, such as interruptions in data recording or loss in storage. According to Žliobaite et al. (Citation2014), statistical models for environmental monitoring depend strongly on automatic data acquisition systems that employ numerous physical sensing devices. Frequently, sensor readings are incomplete for long stretches of time, even though model outputs need to be continuously available in real time (Santosa et al., Citation2014). The physical sensing device is exposed to several risks, including harsh environmental conditions, physical damage, and battery drainage (Tencaliec et al., Citation2015). Because of technical or maintenance issues, unfavourable weather conditions, instrumental failures or apparatus errors during data collection, human error during data entry or calibration, and loss of data due to malfunctioning storage equipment, compiling and organizing long hydrometric records becomes a hard task and, over time, gaps in the dataset arise (Gao, Citation2017; Johnston, Citation1999; Peña-angulo et al., Citation2019; Tencaliec, Citation2017).

Beyond Malaysia, such conditions are even more common in developing countries, and the repercussion is a large proportion of unpredictability in the measured characteristics of water operation systems and, eventually, their weak performance (Kim et al., Citation2015; N.A. Rahman et al., Citation2017). According to Fontaine (Citation1982), in a study of the USGS stream-gaging program in Maine, an average of 5.6 percent of the stream-gages malfunctioned during the ten-year study period, while a 1983 study of Virginia stream-gages recorded a total breakdown rate of 1.95 percent leading to missing data (Carpenter, Citation1985). Shields and Sanders (Citation1986) reported that the longest stretch of unavailable streamflow data was 16 consecutive months within a seven-year programmed observation period. A study of 58 USGS Alabama stilling-well and bubbler-gage stations revealed a total breakdown rate of 4 to 11 percent for the years 1979 through 1983 (Meadows & Jeffcoat, Citation1990), and a nationwide study of 1,009 stream-gages for the period 1948 to 1988 reported that five percent or more of streamflow data may be flagged as missing (Wallis et al., Citation1991).

Missing values in a time series usually reduce the power and accuracy of statistical analysis approaches (Roth et al., Citation1999) and introduce biased estimates of the relationships among variables (Pigott, Citation2001). Both issues, reduced power and biased estimates, may lead to imprecise conclusions when exploring a dataset that contains incomplete data (Graham, Citation2009). For this reason, reconstruction and treatment of missing data should be a first priority in the data preparation procedure. According to Tencaliec (Citation2017), the imputation of missing streamflow data is an issue that has been investigated for decades and, up to the present day, it remains a challenge. In the literature, the computation of missing data is also called imputation (Schneider, Citation2001), reconstruction (Kim & Pachepsky, Citation2010), and infilling (Goa, Citation2017; Harvey et al., Citation2012). Imputing means substituting each missing datum with a reasonable value (single imputation) or a vector of reasonable values (multiple imputation); the aim is not merely to substitute every missing value but to retain the features of their distribution and the associations among different variables (Rubin, Citation1976).

Several researchers have studied the imputation of missing streamflow data with various statistical methods (Regonda et al., Citation2013). Numerous data estimation methods have been suggested to tackle the problem and are discussed in depth in the literature, from simple traditional statistical methods, such as substituting each missing value for a given variable with the mean, median or values from other stations, to more advanced techniques. Among these, Kim et al. (Citation2015), who employed the Soil and Water Assessment Tool (SWAT) and two machine learning (ML) approaches, Artificial Neural Network (ANN) and Self-Organizing Map (SOM), to correct erroneous values and fill missing streamflow data, concluded that the ML models were mostly superior in capturing high flows, whereas SWAT was preferable for representing low flows. Other studies, such as Santosa et al. (Citation2014), suggested that methods derived from information entropy principles have the potential to be developed for forecasting missing monthly average discharge. A more recent study by Norliyana et al. (Citation2017) recommends the normal ratio method for reconstructing missing streamflow data over the arithmetic average, inverse distance and coefficient of correlation techniques, or methods based on artificial neural networks as presented in (Elshorbagy et al., Citation2002; Gill et al., Citation2007; Mispan et al., Citation2015).

A recent review by Gao et al. (Citation2018) summarizes and compares several widely known techniques employed for imputing missing hydrological datasets, including arithmetic mean imputation, principal component analysis, regression-based approaches, and multiple imputation. In Gupta and Srinivasan (Citation2011), Two-Directional Exponential Smoothing (TES) was used to forecast the missing data in a raw streamflow dataset, while the Exponentially Weighted Moving Average (EWMA) was used for forecasting by utilizing the known data values of the prior two seasons. The initial stage in the TES technique is to create a complete dataset using the Average Nearest Observation (ANO) approach (Huo, Citation2005). The ANO technique substitutes the missing values with the mean of the adjacent pre-existing and subsequent observations, i.e. the values are estimated by a weighted mean of the adjoining observations with more weight given to the nearer observation (Gupta & Srinivasan, Citation2011). Their results show that both approaches estimated the data inside and outside the period space with excellent outcomes.

However, according to Tabachnick and Fidell (Citation2014), the suitable technique for recovering missing data rests on the patterns and mechanisms of the missingness and, in fact, both patterns and mechanisms have a greater influence on the results than the proportion of missing data itself. The method chosen has a significant influence on the accuracy of any hydrological model output. The patterns and mechanisms of missingness and the imputation methods are described in the following sections.

The paper aims to provide an overview of various infilling techniques that are suitable for time series analysis of streamflow. According to Gill et al. (Citation2007), it has become common practice when pre-processing data for hydrological modeling to disregard observations with any missing variable values at a given time step, even if only one of the independent variables is missing. Mostly, these rows of data are considered poor and are not used in either model building or subsequent model testing and verification. This is certainly not an ideal approach for handling missing data, since important information may be lost once incomplete rows of data are dropped. Even very small data gaps may result in markedly different analysis outcomes (Tencaliec et al., Citation2015). Therefore, work on these issues is a crucial exercise in hydrological analyses, and the goal of this review is to raise awareness of the various infilling practices available in the hydrological context.

2. Percentage of missingness

The percentage of missingness is directly linked to the quality of statistical inferences. To date, there is no established cut-off in the literature for the tolerable proportion of missing streamflow data that still supports convincing statistical inference. Schafer (Citation1997) stated that a missing rate of 5 percent or less is insignificant. Bennett (Citation2001) asserted that whenever the percentage of missing data exceeds 10 percent, the statistical analysis is likely to be biased. Dong and Peng (Citation2013) noted that missing rates of 20 percent are common in research, while Widaman (Citation2006) categorized missing data according to the percentage of missingness as outlined in Table 1.

Table 1. Missing data rule of thumb according to (Widaman, Citation2006)

Moreover, the percentage of missing data is not the only criterion by which a researcher should weigh the missing value issue. Tabachnick and Fidell (Citation2014) affirmed that the missing data mechanisms and patterns have a greater influence on research outcomes than the percentage of missing data.

3. Patterns of missingness

Even though a wide range of reconstruction methods for handling missing data problems has been developed, each method suffers from various limitations and may not perform well in practice under some circumstances (Gao et al., Citation2018). One reason for this is that most of these methods make assumptions about how the missing values are distributed within the dataset. It is therefore worthwhile to know the underlying missingness pattern and mechanism before deciding which method to use to handle missing data (Tabachnick & Fidell, Citation2014).

Collins et al. (Citation1991) first described and divided the pattern of missingness into two groups: general (random) patterns and special patterns, including univariate missing data, unit nonresponse, and monotone missing data. The general or random pattern of missingness is exactly as the name implies: the missing data may occur in any of the variables at any position. If there is only one variable with missingness while the other variables are completely recorded, the pattern is called univariate missing data. When a multivariate pattern is detected, the missing values arise in more than one variable. According to Ingsrisawang and Potawee (Citation2012), the unit nonresponse pattern has missing values on a block of variables for the same set of cases, while the remaining variables are complete. The pattern is said to be monotone whenever the observations are ordered and, if item k is missing, all items k+1, …, n are also missing. Figure 1 illustrates the missing data patterns discussed above.

Figure 1. Missing data patterns: (a) General (random); (b) Univariate missing data; (c) Unit non-response; (d) Monotone missing data


4. Types of missing data

The mechanism by which data are missing is very important when determining the efficacy and appropriateness of imputation strategies (Sakke et al., Citation2016). Kamaruzaman et al. (Citation2017) suggested that the missing data mechanism should be identified prior to the imputation process because the effectiveness of an imputation technique depends entirely on its assumptions. However, it is generally impossible to carry out a definitive test of whether these assumptions are valid. Ignoring the missingness may not only lose efficiency but also induce biased outcomes and misleading inferences (N.A. Rahman et al., Citation2017).

Little and Rubin (Citation2002) classified three possible ways that data may go missing: Missing Completely at Random (MCAR), Missing at Random (MAR) and Missing Not at Random (MNAR). MCAR describes data where the gaps are unrelated to any of the variables in the dataset. If the missingness is related to other observed values, but not to the missing values themselves, the missingness is assumed to be MAR. Missing data whose probability of being missing depends on the unobserved (missing) values themselves are known as MNAR. Missing data can be described by a probabilistic process that captures the association between the measured variables and the probability of a value being missing (Gao et al., Citation2018). Each mechanism is discussed below.

4.1. Missing completely at random (MCAR)

In MCAR there is no systematic pattern in the way the data are missing. The missingness occurs entirely at random and depends neither on observed nor on missing values. Simply put, the probability of an observation being missing is independent of the values of other variables as well as of the value of the observation itself. In a univariate time series such as a streamflow dataset, there are no other variables present except time as an implicit variable. In this case, under MCAR the probability of a certain observation being missing is independent of the point in time of that observation in the series.

(1) P(m | Y_{observed}, Y_{missing}) = P(m)

4.2. Missing at random (MAR)

Under MAR, as under MCAR, the probability of an observation being missing is independent of the value of the observation itself, but it may depend on other observed variables. For a univariate time series, where there is no variable other than time (implicitly given), the probability of an observation being missing under MAR depends on the point in time of that observation in the series.

(2) P(m | Y_{observed}, Y_{missing}) = P(m | Y_{observed})

4.3. Missing not at random (MNAR)

Under MNAR, observations are not missing in a random manner; the data are neither MCAR nor MAR. The probability of an observation being missing depends on the unobserved value itself, possibly in addition to observed variables such as the point in time for a univariate series.

(3) P(m | Y_{observed}, Y_{missing}) = P(m | Y_{observed}, Y_{missing})

Based on the definition of Little and Rubin (Citation2002), missing values in streamflow studies are often treated as MCAR, since the occurrence of missingness in the streamflow data of an area is not influenced by the data of that or any other area. There are also studies on streamflow data imputation under the MAR assumption (Gill et al., Citation2007). However, Moritz and Bartz-Beielstein (Citation2017) have noted that, for univariate time series, imputation under MCAR and MAR is nearly the same.
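The three mechanisms can be made concrete with a short simulation. The sketch below, a minimal illustration rather than a procedure from any of the cited studies, generates a synthetic daily streamflow series with NumPy (all values, thresholds and missing fractions are invented) and masks it under MCAR, MAR and MNAR rules for a univariate series whose only auxiliary variable is time.

import numpy as np

rng = np.random.default_rng(42)
n = 365
t = np.arange(n)
# Synthetic daily streamflow: seasonal signal plus noise (illustrative only).
flow = 50 + 30 * np.sin(2 * np.pi * t / 365) + rng.gamma(2.0, 5.0, n)

# MCAR: every observation has the same 10% chance of being missing,
# regardless of time or flow magnitude.
mcar = flow.copy()
mcar[rng.random(n) < 0.10] = np.nan

# MAR: missingness depends only on the observed auxiliary variable (time),
# e.g. the gauge fails more often in the second half of the year.
mar = flow.copy()
p_mar = np.where(t > 182, 0.20, 0.02)
mar[rng.random(n) < p_mar] = np.nan

# MNAR: missingness depends on the unobserved value itself,
# e.g. very high flows damage the sensor and go unrecorded.
mnar = flow.copy()
mnar[flow > np.quantile(flow, 0.9)] = np.nan

for name, series in [("MCAR", mcar), ("MAR", mar), ("MNAR", mnar)]:
    print(name, "missing:", int(np.isnan(series).sum()), "of", n)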

5. Imputation Methods

Imputation is the replacement or substitution of missing values by plausible estimates (Ahmat Zainuri et al., Citation2015). It utilizes observed supporting information for cases with missing values in order to maintain high accuracy. In this paper, the imputation methods are classified into four categories according to their characteristics: deletion techniques, single imputation, model-based deterministic imputation methods, and machine learning techniques.

5.1. Deletion technique

Deletion (list-wise and pair-wise) approaches are the default settings for missing data problems in most statistical software packages, and they are arguably the most elementary approaches to dealing with missing data (Gao et al., Citation2018; Marsh, Citation1998). According to (Gill et al., Citation2007; Kabir et al., Citation2019), deletion techniques are among the most popular and common methods for treating missing values in the hydrological field. However, in studies using time series data, deletion methods may cause the record to become discontinuous.

5.1.1. List-wise deletion

McDonald et al. (Citation2000) define list-wise deletion as the removal of every case (observation) that has a missing value in at least one of the selected variables. Its primary advantages are that it is fast and simple to understand and apply, and hence it has become the default option for analysis in most statistical software, since no special computational methods are required (Dong & Peng, Citation2013). Several researchers note that it may lead to bias in the estimation of parameters (Dong & Peng, Citation2013; Honaker & King, Citation2010; Marsh, Citation1998; McKnight, Citation2007). With large datasets, where statistical power is not a concern and the MCAR assumption is fulfilled, list-wise deletion may be a practical method; for a small sample, or when the MCAR assumption is not fulfilled, it is not an ideal approach (Gao et al., Citation2018). If the data are not MCAR, there is a lack of comparability across time points, which can introduce severely biased outcomes.

5.1.2. Pair-wise deletion

Pair-wise deletion can be seen as an improved version of list-wise deletion in that it conserves considerably more information by minimizing the number of observations dropped (Marsh, Citation1998). It is convenient and less influential for MCAR and MAR datasets, provided that suitable variables are included as covariates (Honaker & King, Citation2010). According to Croninger and Douglas (Citation2005), one of the most important disadvantages of the pair-wise deletion technique is the poor comparability of different analyses, since the number of cases differs among distinct pair-wise comparisons. The technique is suitable only when the data contain a proportionately small percentage of observations with missing values.
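As a rough sketch of the two deletion strategies, consider a small pandas DataFrame of daily flows at three hypothetical stations (station names and values are invented for illustration): list-wise deletion drops any day with a gap at any station, while pair-wise deletion uses all available observation pairs when computing, for example, a correlation matrix.

import numpy as np
import pandas as pd

# Hypothetical daily flows (m3/s) at three stations; NaN marks missing records.
df = pd.DataFrame({
    "station_A": [12.1, 13.4, np.nan, 15.0, 14.2, 13.8],
    "station_B": [8.3, np.nan, 9.1, 9.8, np.nan, 9.0],
    "station_C": [20.5, 21.0, 21.7, np.nan, 22.3, 21.9],
})

# List-wise deletion: keep only rows (days) with no missing value at all.
listwise = df.dropna()
print("rows kept after list-wise deletion:", len(listwise), "of", len(df))

# Pair-wise deletion: each statistic uses all cases available for that pair,
# so different correlations may rest on different numbers of days.
pairwise_corr = df.corr()  # pandas skips NaN pair-wise by default
print(pairwise_corr)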

5.2. Single imputation

Single imputation means that each missing datum is substituted by a single value, and the sample size is thereby recovered. The reconstructed values are treated as the actual values that would have been observed had the data been complete (Plaia & Bondì, Citation2006). This approach is very convenient since it generates a complete dataset easily (Hasan et al., Citation2017). Its drawback is that it reconstructs the same missing value every time. Consequently, a statistical analysis that treats imputed values exactly like observed values will consistently underestimate the variance, even assuming that the exact cause of the missingness is known. Single imputation cannot represent the extra variation that arises when the reasons for missing data are unknown (Gómez-Carracedo et al., Citation2014). Single imputation methods have been widely used to recover missing values in hydrology, for example in (Ben Aissia et al., Citation2017; Gao et al., Citation2018; Kabir et al., Citation2019; Norliyana et al., Citation2017; N.A. Rahman et al., Citation2017). Despite their simplicity, these researchers agree that reconstructing the missing value with the same “number” does not reflect the variation that would probably occur if the variables were actually observed; the true figures may differ from the imputed ones, and the variance of those variables is therefore underestimated.

5.2.1. Arithmetic average method

According to Schneider (Citation2001), the simplest way to reconstruct a missing value is to substitute it with the average of the observed values for that variable. The average is calculated from the non-missing values and is used to fill the gaps in that variable. This method is commonly used to impute missing meteorological and hydrological data (Norliyana et al., Citation2017). The missing streamflow data are obtained as the average of selected nearby stations around the target station, or of the same date in different years. The estimated missing value is given by

(4) \hat{y}_t = \frac{1}{n} \sum_{i=1}^{n} x_i

where \hat{y}_t is the estimated value of the missing datum at the target station or date t, x_i is the observed value at the ith nearby station (or on the same date in a different year), and n is the number of nearby stations or years. This method presumes the data are MCAR but is not recommended, as it tends to underestimate the variance.
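A minimal sketch of Equation (4), assuming the concurrent neighbouring-station readings are held in a NumPy array (the numbers are illustrative): the missing value at the target station is simply the mean of the n concurrent observations.

import numpy as np

# Concurrent observations x_i at n nearby stations on the day the target
# station record is missing (illustrative values, m3/s).
neighbours = np.array([14.2, 15.1, 13.8, 14.9])

# Equation (4): y_hat_t = (1/n) * sum(x_i)
y_hat = neighbours.mean()
print(f"imputed value at target station: {y_hat:.2f} m3/s")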

5.2.2. Median imputation

Median imputation substitutes missing values with the median of the observed values of the same variable (McKnight, Citation2007). It does not require normality assumptions and is better suited to skewed distributions (Gómez-Carracedo et al., Citation2014). Although simple to understand and apply, it can affect the correlation between variables and the probability distribution of the imputed variable. This method also underestimates the variance of the estimators (Gao et al., Citation2018).

5.2.3. Hot-deck method

This technique substitutes missing data with values drawn from the current dataset based on matching covariates (Chen & Shao, Citation2001). The process identifies records that are similar to the one with the missing value. The approach is convenient, quite straightforward, and preserves the measurement levels of the recorded covariates (Aljuaid & Sasi, Citation2017). In other words, the values imputed under hot-deck imputation have a similar distribution to the observed data (Rubin, Citation1976). It is normally less biased than the arithmetic average method or deletion techniques and presumes that the missing data are MAR (Blend & Marwala, Citation2008). Its weakness is that multiple matching algorithms occasionally need to be combined to suit the data.

5.2.4. Last observation carried forward (LOCF)

The LOCF is one form of hot-deck imputation. This method is usually used for longitudinal data in which a value is found to be missing at a specific “visit”, or at a given time for a certain record. The researcher then carries the last available value forward, i.e. from the previous visit or time point, and uses it to substitute the missing value. Some values may be used several times for infilling if several missing values occur in a row, and others may not be used at all. The most crucial limitation of this technique is the implicit assumption that there is no change from one visit to the next. It is sensible if the data are MCAR; otherwise, the results can be extremely biased (Gao et al., Citation2018).
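A sketch of LOCF with pandas (the series is invented): each gap is filled with the last observed value, which can be reused several times when gaps occur in a row.

import numpy as np
import pandas as pd

flow = pd.Series([14.2, np.nan, np.nan, 15.8, np.nan, 13.1],
                 index=pd.date_range("2020-01-01", periods=6, freq="D"))

# Last observation carried forward: propagate the last valid value into the gap.
locf = flow.ffill()
print(locf)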

5.2.5. Cold-deck method

Cold-deck imputation is very similar to the hot-deck method, except that the criterion for assessing subject similarity is derived from external information or prior knowledge instead of the information available in the existing dataset (Kumar et al., Citation2017). In hydrometeorology, researchers may reconstruct missing values using their knowledge of prior research on related variables to estimate individual data points. A prominent weakness of this technique is that it depends on the quality of the available external information (Nishanth & Ravi, Citation2013).

5.2.6. Linear interpolation

Linear interpolation is among the simplest methods to impute missing data; it consists of drawing a straight line between the observed values before and after the gap and then estimating the missing values along that line. The method is quick and easy to use and may be adequate for well-resolved data. Fleig et al. (Citation2011) used this method in univariate regional hydrological frequency analysis, but Ben Aissia et al. (Citation2017) note that it has not been used to estimate missing values in multivariate hydrological frequency analysis. To interpolate the value of M_2, the following formula is used.

(5) M_2 = \frac{(N_2 - N_1)(M_3 - M_1)}{N_3 - N_1} + M_1
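A sketch of Equation (5), interpreting M_1 and M_3 as the observed values bounding the gap and N_1, N_2, N_3 as their time indices (an interpretation consistent with linear interpolation, since the symbols are not spelled out above); the same result is obtained with pandas' built-in linear interpolation on an equally spaced index.

import numpy as np
import pandas as pd

def linear_interp(n1, m1, n3, m3, n2):
    # Equation (5): value at time n2 on the straight line joining
    # (n1, m1) and (n3, m3).
    return (n2 - n1) * (m3 - m1) / (n3 - n1) + m1

print(linear_interp(n1=1, m1=12.0, n3=4, m3=18.0, n2=2))  # 14.0

# Equivalent with pandas on a series containing a gap.
flow = pd.Series([12.0, np.nan, np.nan, 18.0], index=[1, 2, 3, 4])
print(flow.interpolate(method="linear"))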

5.3. Model-based deterministic imputation method

Model-based methods produce more precise imputations provided the model assumptions are satisfied. The difficulty is that those assumptions are frequently unverifiable in practice, and consequently it may not be easy to specify a suitable model-based infilling technique for the missingness at hand (Nishanth & Ravi, Citation2013). A good model-based method should operate efficiently for various underlying data distributions and missing mechanisms. Norliyana et al. (Citation2017) studied daily and monthly streamflow estimation using model-based deterministic imputation methods. Several statistical indicators were used to evaluate performance, such as the root mean square error (RMSE), mean absolute error (MAE) and correlation coefficient (R). The NR method was found to be the best estimation technique for streamflow data in their studies. A comparison between the performance of regression and time-series methods was also conducted by Beauchamp et al. (Citation1989). The results indicated that both methods give fairly good estimates and forecasts of the flow at the Foresta Bridge gage. However, the authors found that the regression model leaves a significant amount of autocorrelation in the residuals, leading to the conclusion that the standard errors and confidence limits on the flow estimated from the regression model would not be appropriate and could be misleading.

5.3.1. Normal ratio method (NR)

The NR approach weights the simultaneous observations at the neighbouring stations by the ratio of the average (normal) streamflow at the target station to that at the ith neighbouring station (N.A. Rahman et al., Citation2017). In other words, the simultaneous streamflow data at the neighbouring stations are weighted by the ratios of total streamflow at the target and neighbouring stations. The estimated missing value is given by

(6) \hat{Y}_t = \frac{1}{n} \sum_{i=1}^{n} \frac{N_t}{N_i} y_i

where N_t is the total streamflow at the target station and N_i is the total streamflow at each neighbouring station. This method is used when the normal streamflow at any neighbouring station differs from that of the target station by more than 10% (De Silva et al., Citation2007).
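A sketch of Equation (6), assuming the long-term totals N_t and N_i and the concurrent neighbour observations y_i are available (all numbers invented for illustration).

import numpy as np

# Concurrent observations y_i at n neighbouring stations (m3/s).
y = np.array([14.2, 15.1, 13.8])
# Long-term total (normal) streamflow at each neighbouring station.
N_i = np.array([5200.0, 5600.0, 4900.0])
# Long-term total (normal) streamflow at the target station.
N_t = 5100.0

# Equation (6): Y_hat_t = (1/n) * sum((N_t / N_i) * y_i)
y_hat = np.mean((N_t / N_i) * y)
print(f"normal-ratio estimate: {y_hat:.2f} m3/s")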

5.3.2. Inverse distance method (ID)

The ID method is based on the idea of distance weighting between the target station and the neighbouring stations: the closest stations are assumed to be better correlated with the target station than more distant ones. The estimated missing value is given by

(7) \hat{Y}_t = \frac{\sum_{i=1}^{n} y_i / d_{it}}{\sum_{i=1}^{n} 1 / d_{it}}

where d_{it} is the distance between the target station and the ith neighbouring station. The main advantage of ID is that it is simple and easy to use, yields sensible results for a wide range of data, and does not produce estimates beyond the range of the observed values (Chen & Liu, Citation2012). On the other hand, the ID approach is very sensitive to the weighting function and can be distorted by an uneven distribution of data points (Caruso & Quarta, Citation1998).

According to Norliyana et al. (Citation2017), the ID technique is the most frequently used in the computation of missing rainfall and streamflow data. The technique gives good results provided the researcher has data from neighbouring stations for the same period (Teegavarapu & Chandramouli, Citation2005). However, Chen and Liu (Citation2012) argue that using neighbouring-station data of the same period is valid for climatic parameters such as temperature and rainfall, but for river flow it is unwise to use data from a neighbouring station that controls another watershed, because the surfaces, slopes, permeability and overall morphology are not the same. They propose using the ID method only when the researcher has two adjacent stations on the same river.
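A sketch of Equation (7) with invented distances and observations; the weight of each neighbouring station is the reciprocal of its distance to the target station.

import numpy as np

# Concurrent observations y_i at neighbouring stations (m3/s) and their
# distances d_it (km) to the target station (illustrative values).
y = np.array([14.2, 15.1, 13.8])
d = np.array([3.5, 8.0, 12.5])

# Equation (7): Y_hat_t = sum(y_i / d_it) / sum(1 / d_it)
y_hat = np.sum(y / d) / np.sum(1.0 / d)
print(f"inverse-distance estimate: {y_hat:.2f} m3/s")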

5.3.3. The coefficient of correlation method (CC)

The CC method is motivated by the success of the ID method (N.A. Rahman et al., Citation2017). The distance is replaced by the coefficient of correlation between the target and the neighbouring station as the weighting value. The missing value is estimated by

(8) \hat{Y}_t = \frac{\sum_{i=1}^{n} y_i r_{it}}{\sum_{i=1}^{n} r_{it}}

where r_{it} is the coefficient of correlation of the daily time series between the target and the ith neighbouring station. The advantages of the CC method are that it is easy to compute and easy to interpret. However, correlation does not imply causality: it may show that two stations are strongly correlated, but it does not mean that one drives the other (Norliyana et al., Citation2017).
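The same weighting idea, with correlations instead of distances, can be sketched as follows; the day-to-day correlation coefficients r_it between the target and each neighbour are assumed to have been computed from the overlapping record (values invented).

import numpy as np

# Concurrent observations y_i at neighbouring stations (m3/s) and their
# correlation r_it with the target station over the common record.
y = np.array([14.2, 15.1, 13.8])
r = np.array([0.92, 0.81, 0.74])

# Equation (8): Y_hat_t = sum(y_i * r_it) / sum(r_it)
y_hat = np.sum(y * r) / np.sum(r)
print(f"correlation-weighted estimate: {y_hat:.2f} m3/s")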

5.3.4. Regression method

Reconstruction using regression on one or several variables may generate sharper estimates. The researcher fits a regression model with the variable of interest as the response and other related variables as covariates (Gao et al., Citation2018). The coefficients are estimated, and the missing values are then predicted with the fitted model. The regression model fitted on complete data is given by

(9) y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \cdots + \beta_n x_{ni} + \epsilon_i

and, for a case k with missing y, the imputed value is

(10) \hat{y}_k = \hat{\beta}_0 + \hat{\beta}_1 x_{1k} + \hat{\beta}_2 x_{2k} + \cdots + \hat{\beta}_n x_{nk} + \epsilon_k

where βˆj’s are estimates based on complete cases.

This method appears more sensible than imputation with the mean. Nevertheless, the approach inflates the correlation coefficient among the variables, and the variability of the imputed data is underestimated. Another possible drawback of such a parametric method is that it may be sensitive to misspecification of the regression model (Schenker & Taylor, Citation1996). If the regression model is not a good fit, the predictive power of the model may be poor (Little & Rubin, Citation2002).
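A sketch of regression imputation in the spirit of Equations (9) and (10), using scikit-learn's LinearRegression (one possible implementation, not the one used in the studies cited; station names and data are invented): the model is fitted on complete cases and the fitted values replace the gaps in the target series.

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical daily flows: 'target' has gaps, the two neighbours are complete.
df = pd.DataFrame({
    "target":     [12.0, 13.5, np.nan, 15.2, np.nan, 13.0],
    "neighbour1": [10.1, 11.0, 11.8, 12.5, 11.2, 10.7],
    "neighbour2": [8.2, 8.9, 9.5, 10.1, 9.0, 8.6],
})

complete = df.dropna(subset=["target"])          # complete cases only
missing = df[df["target"].isna()]

# Equation (9): fit y = b0 + b1*x1 + b2*x2 on the complete cases.
model = LinearRegression().fit(complete[["neighbour1", "neighbour2"]],
                               complete["target"])

# Equation (10): deterministic imputation with the fitted coefficients.
df.loc[df["target"].isna(), "target"] = model.predict(
    missing[["neighbour1", "neighbour2"]])
print(df)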

5.3.5. Nearest-neighbor technique

The nearest-neighbor approach, also known as distance function matching, is a donor technique in which the donor is chosen by minimizing a defined “distance” (Lee & Kang, Citation2015; Chen & Shao, Citation2001; Rajagopalan & Lall, Citation1999; Yakowitz & Karlsson, Citation1987; Kalton & Kish, Citation1984). The process involves specifying an appropriate distance measure, where the distance is a function of auxiliary variables. The observed unit with the shortest distance to the case with the missing value is identified, and its value is used to infill the missing item of the variable of interest. The strength of the nearest-neighbor technique is that genuinely observed values are used for reconstruction and geographical effects are represented, but its weakness is that the outcome can depend on the sorting order of the file. Chen and Shao (Citation2001) demonstrate that the nearest-neighbor method, although deterministic, estimates the distribution accurately. The variance under the nearest-neighbor approach may be inflated if some donors are used far more often than others (Bagus & Narinda, Citation2016).

5.3.6. Multiple imputations

Multiple imputation substitutes every missing value by N plausible values to generate N complete datasets. The researcher then analyses each of these “new” datasets and merges the outcomes into a single summary result. This summary reflects the additional uncertainty due to the missing data (Bertsimas et al., Citation2018). The technique works well on longitudinal data and is robust to violations of non-normality of the variables used in the analysis. A review of multiple imputation can be found in (Little & Rubin, Citation2002).

5.3.7. Expectation maximization (EM)

According to Schneider (Citation2001), EM techniques are a form of maximum likelihood approach that provides estimates of the means, variances and covariance matrices, which can be used to obtain consistent estimates of the parameters of interest. The algorithm consists of an expectation step and a maximization step, which are repeated until maximum likelihood estimates are obtained. In more detail, in the expectation step the parameters such as the mean, variance and covariance are estimated; those estimates are then used to build a regression equation to predict the missing values, and the equations are used in the maximization step to impute them. The expectation step is then repeated with the new parameters, and new regression equations are determined to impute the missing data. The expectation and maximization steps are repeated until the algorithm converges, that is, when the covariance matrix for the subsequent iteration is essentially the same as that of the preceding iteration.

The advantage of EM reconstruction is that, once a full dataset with no missing values has been reconstructed, a random error term can be added to each reconstructed value to reflect the uncertainty associated with the infilling (Aljuaid & Sasi, Citation2017). On the other hand, the disadvantage of EM imputation is that it can take a long time to converge, particularly when a large fraction of the data is missing. This can result in biased parameter estimates and can underestimate the standard errors (Gómez-Carracedo et al., Citation2014).

5.3.8. Spline interpolation

Spline interpolation fits a piecewise polynomial function through the sampled points. The spline represents a two-dimensional curve on a three-dimensional surface (Hutchinson & Gessler, Citation1994). Spline interpolation is often preferred to polynomial interpolation because it avoids the oscillating effect that may be observed with polynomial functions of high degree (Little & An, Citation2004). The method produces smooth and easily interpretable surfaces using low-degree polynomials for the spline. The spline function is constrained at defined points (a local technique): only a specific number of neighbouring values is considered, so the spline can adjust to local anomalies without affecting the interpolated values at other points in the global area. The constraints r are related to the degree m of the polynomial function as follows.

  • if r=0 there are no restraints.

  • if r=1 the function has to be continuous.

  • when r=m+1 the mth derivative of the function has to be continuous for all points.

The spline for m=1 is called linear, for m=2 quadratic, and for m=3 cubic.
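A sketch of cubic (m = 3) spline interpolation for a single gauged series with SciPy's CubicSpline (one possible implementation; the data are invented): the spline is fitted through the observed points and evaluated at the missing time steps.

import numpy as np
from scipy.interpolate import CubicSpline

t = np.arange(10, dtype=float)
flow = np.array([12.0, 13.1, np.nan, np.nan, 15.8, 16.2, np.nan, 14.9, 13.7, 13.0])

observed = ~np.isnan(flow)
# Fit a cubic spline through the observed points only.
spline = CubicSpline(t[observed], flow[observed])

# Evaluate the spline at the missing time steps to fill the gaps.
filled = flow.copy()
filled[~observed] = spline(t[~observed])
print(np.round(filled, 2))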

5.4. Machine learning (ML) techniques

ML is a branch of artificial intelligence (AI) that employs statistical approaches to give computer systems the ability to learn from data without being explicitly programmed. The term machine learning was coined in 1959 by Arthur Samuel (Samuel, Citation1959). Many researchers agree that approaches relying on machine learning are among the best suited for the reconstruction of missing values and usually lead to a significant improvement in prediction accuracy over imputation methods based on classical statistical approaches (Krysanova & White, Citation2015; Minns & Hall, Citation1996; Varga et al., Citation2016). ML techniques have been widely used by a variety of authors to recover missing streamflow values, as in (Allawi et al., Citation2018, Citation2017; Erdal & Karakurt, Citation2013; Kalteh et al., Citation2007; Karakurt et al., Citation2013; Kim & Pachepsky, Citation2010). Although ML has been transformative in the hydrological field, effective ML is hard: detecting patterns is difficult and sufficient training data are frequently unavailable; consequently, many ML approaches fail to deliver the expected value and, above all, can suffer from various data biases (Worland et al., Citation2018).

5.4.1. Self-organizing map (SOM)

The SOM technique, also called the feature map or Kohonen map, is the most broadly used of the ANN algorithms intended for unsupervised pattern recognition applications (Kohonen et al., Citation1996). The ability of the SOM method to estimate missing univariate and multivariate hydrological data has been demonstrated in a number of studies, such as (Adeloye & Rustum, Citation2012; Mwale et al., Citation2012). It is a non-linear technique for dimensionality reduction and visualization of data that can also be used for missing data reconstruction (Miró et al., Citation2017). SOM defines a systematic two-dimensional discrete mapping, projecting a set of data items onto a topologically ordered network. In other words, the SOM network captures the clusters of the available input data, which may serve as a starting point for infilling the missing values (Kalteh et al., Citation2007). The technique preserves the most significant relationships of the original data elements, which implies that little information is lost during the mapping, making the SOM approach a very good tool for estimation. However, the SOM technique cannot be used to estimate values outside the range of the training data, mostly because, as is the case with most data-driven approaches, SOM is a very poor extrapolator (Adeloye et al., Citation2011). It has a limited capacity to forecast values that have not been observed in the past (Ben Aissia et al., Citation2017).

5.4.2. Decision trees

Decision trees make supervised groupings from categorical or discretized data. The decision tree algorithm computes a splitting criterion for each potential split and chooses the one with the highest gain in purity. The algorithm stops splitting once a subset contains only objects belonging to the same class or another stopping criterion is fulfilled. In the tree, a node represents a subset, a branch represents a split, and the terminal nodes are called leaves (Han et al., Citation2011).

A popular implementation of decision trees is Quinlan's C4.5 algorithm (Quinlan & Kaufmann, Citation1994), which by its nature handles missing values by excluding them when calculating the information gain. Thus, as an infilling technique, C4.5 can be trained even when missing data are present in the predictor variables, increasing the potential training set available in the multivariate missing value setting.

5.4.3. Bayesian networks

Bayesian networks learn probabilistic associations among variables compactly by exploiting conditional independence constraints (Uusitalo, Citation2007). They can be structured through a heuristic search using a Bayesian scoring function such as K2 (Cooper et al., Citation1992). Using Bayesian networks for the reconstruction of missing data has some benefits: first, they are more efficient than EM-based multiple imputation methods for datasets with a large number of variables; second, they preserve the joint probability distribution of the variables, something that methods like k-NN do not guarantee (Kim & Lee, Citation2009). On the downside, a lot of data is generally required to learn a network precisely, and discretization of all variables is necessary unless conditional probability densities are explicitly modeled and parameterized, often at great computational expense (Chen & Pollino, Citation2012; John & Langley, Citation1995).

5.4.4. Soil and water assessment tool (SWAT)

The Soil and Water Assessment Tool (SWAT) is a continuous-time, physically-based, semi-distributed, watershed-scale hydrologic model that can be used to simulate long-term impacts of climate, topography, soils, land use, and management operations on water, sediment, and chemical yields, without massive investments of resources (e.g., time, money and labor) (Arnold et al., Citation2012; Neitsch et al., Citation2011). SWAT uses a two-level disaggregation scheme: a preliminary sub-basin identification is conducted based on topographic criteria, followed by further discretization using land use and soil type considerations (Zeiger & Hubbart, Citation2018).

SWAT is an open-source tool and detailed online documentation, user groups, video tutorials, international conferences, and a unique literature database are available. This all makes the tool user-friendly, which can explain, at least partly, the fact that it is one of the best known and most commonly used tools to develop water quality models at the watershed scale (Gassman et al., Citation2010; Refsgaard et al., Citation2010; Varga et al., Citation2016). The tool is continuously improved, supported by the core development team and as a response to shortcomings demonstrated by the many users (Neitsch et al., Citation2011). This results in the development of new tools, e.g., GIS interface tools, pre- and post-processing tools and statistical evaluation tools (Gassman et al., Citation2010). SWAT is also proven to be the most practical and flexible tool for a variety of applications, watershed scales, and environmental conditions (Gassman et al., Citation2014; Krysanova & White, Citation2015; Tuppad et al., Citation2011). Moreover, the semi-distributed structure makes the model computationally efficient and enables us to generate spatially explicit outputs. The tool is really suitable for large, complex watersheds (Gassman et al., Citation2014).

However, every tool has its shortcomings and these are often linked with its advantages. The constant improvements, for example, have led to difficult code and a high number of parameters, requiring expertise to run the model and complicating the calibration process (Gassman et al., Citation2010; Vigerstol & Aukema, Citation2011). Additionally, the tool is highly data-intensive. Although SWAT is said to run on readily available input data this is not always the case, especially in developing countries. Certainly, the data accuracy and precision might be an issue, as expressed by the rule “garbage in is garbage out” (Oosthuizen et al., Citation2018).

5.4.5. Exponentially weighted moving average (EWMA)

The EWMA is frequently applied to a time-ordered sequence of random variables (Perry, Citation2011). It is a statistic for monitoring a process that averages the data in a way that gives progressively less weight to data as they are further removed in time (Čisar & Čisar, Citation2011). In other words, more recent observations receive greater weight.

The method introduces a smoothing parameter lambda. The value of λ must be less than one and is usually set between 0.2 and 0.3 (Perry, Citation2011), even though this choice is somewhat arbitrary. By the choice of weighting factor λ, the EWMA control technique can be made sensitive to a small or gradual drift in the process. The equation for the estimated missing value was established by Roberts (Citation1959) and is given by

(11) EWMA_t = \lambda Y_t + (1 - \lambda)\, EWMA_{t-1}, \quad t = 1, 2, 3, \ldots, n

where EWMA_0 is the mean of the historical data, Y_t is the observation at time t, n is the number of observations to be monitored together with EWMA_0, and 0 < \lambda \le 1 is a constant that determines the depth of memory.

A possible weakness of an EWMA chart with a small λ is that a very large abrupt change in a parameter may not be detected rapidly, since data obtained immediately after the shift are averaged with in-control data obtained before the shift; the in-control data thus tend to mask the effect of the shift. Another possible drawback is that the EWMA statistic may be in a disadvantageous position when the shift occurs. For instance, when there is an upward shift in a process parameter, the EWMA statistic may happen to be close to the lower control limit immediately before the shift, in which case it may take a relatively long time to reach the upper control limit. This latter problem is usually referred to as the “inertia problem” of the EWMA chart (Spiring, Citation2007; Woodall & Mahmoud, Citation2005; Yashchin, Citation1987, Citation1993).
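A minimal sketch of Equation (11) (the series and λ = 0.25 are illustrative): the EWMA is updated recursively, starting from the mean of the historical data, and the running value can serve as an estimate for a missing observation.

import numpy as np

def ewma(y, lam, start=None):
    # Equation (11): EWMA_t = lam*Y_t + (1 - lam)*EWMA_{t-1},
    # with EWMA_0 set to the mean of the historical data by default.
    ewma_t = np.nanmean(y) if start is None else start
    out = []
    for y_t in y:
        if not np.isnan(y_t):                 # update only on observed values
            ewma_t = lam * y_t + (1 - lam) * ewma_t
        out.append(ewma_t)                    # current EWMA, usable as estimate
    return np.array(out)

flow = np.array([14.2, 15.1, np.nan, 13.8, np.nan, 14.9])
print(np.round(ewma(flow, lam=0.25), 2))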

5.4.6. Artificial neural network (ANN)

ANN is one of the main tools employed in machine learning and is inspired by the human brain. ANNs are mathematical models capable of learning complex relationships (Tsintikidis et al., Citation1997). The method involves input and output layers along with, in most cases, a hidden layer containing units that transform the input into something the output layer can use. This process is illustrated in Figure 2. Although ANNs do not need to be fully connected, they often are.

Figure 2. A fully connected neural network with one hidden layer


Jain and Indurthy (Citation2003) claimed that, among black-box models, the ANN model has gained wider applicability because the functional relationship between the input variables and the output does not need to be specified in advance, and it requires minimal understanding of the underlying process to model such hydrologic issues. A number of studies on the application of ANNs in modeling hydrologic input-output relationships are discussed in the literature, as in (Hipel, Citation1995; Hsu et al., Citation1995; Jeong & Kim, Citation2009; Kim & Pachepsky, Citation2010; Minns & Hall, Citation1996; N.F.A. Rahman et al., Citation2015; Regonda et al., Citation2013; Smith & Eli, Citation1995; Somwanshi & Chaware, Citation2014; Zeiger & Hubbart, Citation2018).

The ANN model, even with only three layers, yields a nonlinear relationship between input and output through a very large number of connections from the input layer to the hidden layer and on to the output layer (Eash et al., Citation2013). Owing to this nonlinearity, the ANN model is very sensitive to the values of the input neurons. Clearly, if the input values are subject to large errors, the functional form that the ANN learns at the training stage may perform poorly at the validation stage (Elshorbagy et al., Citation2002). This calls for careful selection of input neurons for such complex studies.
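A sketch of an ANN imputation workflow using scikit-learn's MLPRegressor with one hidden layer (a stand-in for the ANN implementations in the cited studies; data and station setup are synthetic): the network learns the mapping from neighbouring-station flows to the target station on days with complete records and then predicts the gaps.

import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic flows at two neighbouring stations and a nonlinearly related target.
n = 300
x = rng.gamma(2.0, 5.0, size=(n, 2))
y = 0.6 * x[:, 0] + 0.3 * np.sqrt(x[:, 1]) * x[:, 0] + rng.normal(0, 0.5, n)

# Mark 10% of the target values as missing.
missing = rng.random(n) < 0.10

scaler = StandardScaler().fit(x[~missing])
net = MLPRegressor(hidden_layer_sizes=(10,), max_iter=5000, random_state=0)
net.fit(scaler.transform(x[~missing]), y[~missing])   # train on complete days

y_filled = y.copy()
y_filled[missing] = net.predict(scaler.transform(x[missing]))  # impute the gaps
print("imputed", int(missing.sum()), "values")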

5.4.7. Two-directional exponential smoothing (TES)

The TES method was developed by Huo et al. (Citation2010) to substitute missing data. The technique is based on exponential smoothing (ES) and was built on Holt's linear trend algorithm. The TES process estimates missing values based on the autocorrelations of the time-recorded data, exploiting the fact that the missing values arise at non-random periods (Gupta & Srinivasan, Citation2011). The method is designed to capture both forward and backward autocorrelations in the time series, which can reduce the variance caused by the two directions. The initial stage in the TES approach is to reconstruct the complete dataset using the average nearest observation (ANO) method (Huo, Citation2005).

The ANO approach substitutes the missing values with the mean of the nearest preceding and subsequent observations; in other words, the values are estimated by a weighted average of the nearest observations, with higher weight given to the nearer observation. After the dataset has been reconstructed using the ANO technique, the missing values are forecast using Holt's linear trend approach in both forward and backward directions. Holt's linear trend algorithm can be written as

(12) F_{t+k} = a_t + b_t k
(13) a_t = \delta Y_t + (1 - \delta)(a_{t-1} + b_{t-1})
(14) b_t = \gamma (a_t - a_{t-1}) + (1 - \gamma) b_{t-1}

where F_{t+k} is the forecast value at period t+k; Y_t is the actual value at time t; a_t and a_{t-1} are the intercepts (levels) at times t and t-1, respectively; b_t and b_{t-1} are the slopes at times t and t-1, respectively; and \delta and \gamma are smoothing constants between 0 and 1 (Gupta & Srinivasan, Citation2011; Moahmed et al., Citation2014). The smoothing constants govern the weight given to the most recent observations and thus control the degree of smoothing or averaging: values close to 1 give more weight to recent data, while values close to 0 shift the weight toward data from the more distant past.
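A sketch of Holt's linear trend recursion, Equations (13) and (14), with the k-step-ahead forecast of Equation (12); the smoothing constants δ = 0.5 and γ = 0.3 and the series are illustrative. A full TES pass would run this both forward and backward over the ANO-completed series and combine the two forecasts.

import numpy as np

def holt_linear(y, delta=0.5, gamma=0.3):
    # Holt's linear trend: returns the level a_t and slope b_t after the
    # last observation, so F_{t+k} = a_t + b_t * k (Equations 12-14).
    a = y[0]            # initial level
    b = y[1] - y[0]     # initial slope
    for y_t in y[1:]:
        a_prev = a
        a = delta * y_t + (1 - delta) * (a + b)        # Equation (13)
        b = gamma * (a - a_prev) + (1 - gamma) * b     # Equation (14)
    return a, b

flow = np.array([12.0, 12.6, 13.1, 13.9, 14.4])
a_t, b_t = holt_linear(flow)
for k in (1, 2, 3):                                    # Equation (12)
    print(f"forecast {k} step(s) ahead: {a_t + b_t * k:.2f}")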

6. Conclusions and discussion

Imputation methods reduce the loss of information that may otherwise lead to suboptimal outcomes and, in turn, to misleading conclusions concerning, for example, the risk estimation of an extreme event. Several techniques have long been suggested in the literature for handling missing values, and the choice of a suitable approach depends, in particular, on the missing data pattern and mechanism. The simplest, but also the least powerful, technique for recovering missing values is deletion, which is based on simply removing the cases or variables that have missing values. Deletion results in an unreasonable loss of valuable data, and it is a well-known fact in statistics that smaller sample sizes reduce the statistical power and precision of standard statistical procedures (Little & Rubin, Citation2002). A simulation by Raaijmakers (Citation1999) showed that statistical power is reduced by between 35% (with 10% missing data) and 98% (with 30% missing data) when deletion techniques are used. Regardless of their weaknesses, most statistical software packages offer deletion methods as the default option for handling missing data, and these methods remain the simplest approaches to treating missing data (Marsh, Citation1998). However, this is surely not the best way of dealing with missing data problems in hydrological fields.

In the single imputation approach, on the other hand, rather than removing the case entirely, the researcher fills in (computes) a value for it and uses the resulting complete dataset for the forecast model. Single imputation methods generate a single replacement for each missing value prior to the actual analysis of the data. Even under the strong MCAR assumption, these methods distort the resulting parameter estimates (Gao, Citation2017). In particular, they attenuate the standard deviation and the variance of estimates obtained from analyses of singly imputed variables, since the imputed values are identical and lie at the center of the distribution, which reduces the variability of the data (Little & Rubin, Citation2002).

A more complex group of methods for reconstructing hydrological datasets is the model-based deterministic imputation methods. This group produces more precise imputations than the deletion techniques and single imputation methods. The main idea behind these techniques is to use the information from all observations with complete values in the variables of interest to reconstruct the missing values, which is intuitively appealing. In a hydrological context, the assumptions of the model-based deterministic approaches are well established but may be too restrictive (Machiwal & Jha, Citation2008).

A more powerful group of methods is the machine learning techniques. Many researchers conclude that imputation approaches relying on machine learning algorithms outperform imputation techniques based on classical statistical procedures in the prediction of missing streamflow data (see e.g., Krysanova & White, Citation2015; Minns & Hall, Citation1996; Varga et al., Citation2016). Generally, machine learning techniques are significantly more flexible than standard statistical models and can capture higher-order interactions in the data, which leads to better predictions. However, because the predictions are made on the basis of complex relationships within the data, the interpretability of the outcomes is sometimes harder, even though there are tools to extract the knowledge embedded in these models. As a result, these alternative models are often criticized.

From these facts, one may conclude that a good imputation method should work well for a variety of underlying data distributions and missing mechanisms. In general, imputation in streamflow datasets often lacks a clear conceptual framework and a sound selection of methods based on the statistical properties of the respective observable and the respective research question, so existing imputation techniques have room for further improvement. As discussed earlier, researchers facing missing data in their studies have numerous choices when deciding how to handle this common issue. More attention should be given to missing data in the design and conduct of studies and in the analysis of the resulting data. Sophisticated statistical analysis techniques should only be applied after maximal efforts have been made to reduce missing data through design and prevention. It may also be valuable to perform a sensitivity analysis using different methods for managing the missing data in order to assess the robustness of the outcomes, and to take into account other critical contributors to streamflow characteristics such as rainfall, temperature, topography or other parameters of the study area.

Author Contributions

writing—original draft preparation, Fatimah Bibi Hamzah; writing—review and editing, Firdaus Mohamad Hamzah, Siti Fatin Mohd Razali, Othman Jaafar and Norhayati Abdul Jamil.

Conflicts of Interest

The authors declare no conflict of interest.

Additional information

Funding

This work was supported by the Kolej Universiti Poly-Tech MARA Kuala Lumpur Micro-Grant.

Notes on contributors

Fatimah Bibi Hamzah

Fatimah Bibi Hamzah is a Ph.D. candidate at the Faculty of Engineering and Built Environment, Universiti Kebangsaan Malaysia, and a Senior Lecturer at Kolej Universiti Poly-Tech MARA Kuala Lumpur. She holds a Master's degree in applied statistics and a Bachelor's degree in pure mathematics from UPM, Malaysia. Her research interests include mathematical problems in engineering, hydrology, multi-criteria decision analysis, and the development of methods to overcome missing data and outlier problems.

References