
Discussion of “Evapotranspiration modelling using support vector machines”*

Pages 1442-1450 | Published online: 29 Nov 2010

INTRODUCTION

The data-driven modelling paradigm is special in the extent to which input–output data sets, and the structures that exist within them, are the fundamental drivers of the entire modelling process. Consequently, those engaged in data-driven modelling must recognise that an adequate description and assessment of any data sets used is an essential requirement for the proper application of the paradigm. This is as important as the description of the particular mathematical algorithm that is used to model the data. It is also essential if data-driven modelling is to comply with the principles of the “scientific method”, i.e. experiments must be repeatable and results must be reproducible. However, in many published papers, the description of the data sets involved and of the particular algorithms used to model the data is unbalanced, with brief descriptions of data sets and pre-processing operations contrasting with long and detailed descriptions of algorithmic and computational approaches. As a result, criticisms of data-driven models are emerging whose foci include a lack of clarity in, and justification of, data set pre-processing steps, input variable selection, repeatability and independent verification of modelling applications, as well as the need to perform data-driven modelling in the first place. Dealing with these criticisms has sometimes taken the form of extensive post-publication discussions, in which discussants offer piecemeal criticisms of different aspects of different papers and of the errors or omissions in their reporting of data sets, including pre- and post-processing operations; examples are provided below. This is an onerous process that is unlikely to deliver a sound or comprehensive “code of behaviour”.

As a case in point, two recent papers on modelling reference crop evapotranspiration have been heavily criticised in subsequent discussions for failing to provide sufficient information about the data sets involved. Aksoy et al. (2007) raised the issue of unreported missing records in data sets used by Kisi (2006) to model reference crop evapotranspiration in California, USA. It was argued that if a particular data set had some missing observations, that fact should be reported clearly (particularly in journal papers) even if the number of missing cases was small, so that the reported results are comparable with other studies. Kisi (2007), in response to this challenge, subsequently disclosed that a linear regression model had actually been used to provide estimated records for a 12-day period and that this too should have been reported in the original paper, not only to ensure that others could repeat the work, but also because, through this data-infilling process, the authors had introduced an untested assumption of linearity into the data. Abrahart et al. (2009) questioned the operational processes involved in the removal of incomplete entries by Aytek et al. (2008), who were also modelling reference crop evapotranspiration in California. In particular, the description of their “data cleansing” operation, a process that was specified only in terms of involving “some missing records”, was criticised as being insufficient. Aytek et al. (2009), in response to this challenge, subsequently stated that removal comprised not just records that had missing observations but also the removal of items flagged in the data set as “ignored” (I), “far out of normal range” (R), and “moderately out of range” (Y). Few would challenge the removal of “unreliable records”, since such actions can be justified, but such removals should always be reported so that other interested parties are able to check or repeat the ensuing experiments.

The questioning and debating of poor data reporting procedures are clear signs of a healthy and emerging field. Further evidence of maturation is provided by recent hydrological dialogues on other pertinent issues (e.g. Koutsoyiannis, 2007; Aksoy et al., 2008), including the need for a stronger explanation of why different types of activity should or should not be categorised as “hydroinformatics” (See et al., 2007). This is not the place to re-open past discussions, but it is nevertheless important to consider the limited power of submitted comments as a driving force for change, since no real progress appears to have occurred during the two-year period that started with the first paper of Kisi (2006) and ended with the second paper of Aytek et al. (2008): both papers in this sequence were criticised for exhibiting similar shortcomings. It would also be improper at this point to conduct a full audit of the issue by identifying recorded instances of analogous problems occurring in the papers of critics or criticised. It is of course important that scientists both advocate, and do, the right thing. Yet our analysis has demonstrated the inability of post-publication discussions to deliver the “code of behaviour” that is needed to establish what is right. The alternative to post-publication, piecemeal criticism is to assemble a checklist of “best practice guidelines”, which could be applied at the pre-publication stage by authors, reviewers or editors. It would need to be interpreted as an aid to the reporting of model development and testing data sets, and should be designed to cover both restricted records and records downloaded from open-access internet sites.

DEVELOPMENT OF BEST PRACTICE GUIDELINES

In this discussion we use the recent evapotranspiration modelling work of Kisi & Cimen (2009) (K&C) as a case study, and adopt this as the basis for developing a set of best practice guidelines. K&C applied support vector machines and artificial neural networks to a non-complex hydrological modelling problem: prediction of pre-calculated daily reference crop evapotranspiration (ET0) estimates. Four meteorological input measurements were used: solar radiation (Rs), air temperature (T), relative humidity (RH) and wind speed (U2). Individual models were developed on downloaded data sets for three automated weather stations in California, USA, belonging to the California Irrigation Management Information System (CIMIS) (http://www.cimis.water.ca.gov/): Windsor (#103), Oakville (#077) and Santa Rosa (#083). K&C failed to report the results of traditional statistical benchmarking operations, making it difficult to ascertain the extent and nature of the underlying challenges involved and whether or not the application of more powerful computational approaches was justified by the complexity of the relations in the data (Abrahart & See, 2007; Mount & Abrahart, 2010). Consequently, the discussers sought to independently verify the need for support vector machine and neural network modelling, initially by checking for linearity through a multiple linear regression analysis of the relationships between the input predictors and the output predictand.
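For readers who wish to run a comparable linearity check, a minimal sketch of such a multiple linear regression benchmark is given below. It assumes that the four predictors and the ET0 predictand have already been assembled into a comma-separated file; the file name and column labels are illustrative placeholders, not those used by K&C or CIMIS.

    import numpy as np
    import pandas as pd

    # Illustrative file and column names; substitute the actual downloaded data set.
    data = pd.read_csv("station_103_daily.csv")         # e.g. Windsor (#103)
    X = data[["Rs", "T", "RH", "U2"]].to_numpy(float)    # solar radiation, temperature, humidity, wind speed
    y = data["ET0"].to_numpy(float)                      # pre-calculated reference crop evapotranspiration

    # Ordinary least-squares multiple linear regression, with an explicit intercept column.
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)

    # Coefficient of determination: a high R2 suggests that a simple linear benchmark
    # already captures most of the input-output relationship.
    residuals = y - A @ coef
    r2 = 1.0 - residuals.var() / y.var()
    print("MLR coefficients:", coef)
    print("R2 =", round(r2, 3))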

To this end, the descriptions of the data sources and processing provided by K&C, as detailed on p. 920, were used as the basis for generating a faithful counterpart data set, but this proved difficult owing to potential misunderstandings about the download site involved and a lack of clarity with respect to any data pre-processing operations employed. K&C also claim to have generated daily values for reference crop evapotranspiration using the FAO-56 PM method of Allen et al. (1998), hereinafter referred to as F-ET0. It is important that this variable is not confused with daily values for reference crop evapotranspiration calculated using the “CIMIS Penman Equation” (C-ET0; Snyder & Pruitt, 1992; Eching & Moellenberndt, 1998), a product that the original authors downloaded at the same time as Rs, T, RH and U2. FAO-56 PM estimation of F-ET0 requires four primary input variables, namely T, RH, net radiation (Rn) (see Note 1) and soil heat flux density (G), the latter often being negligible for daily calculations. K&C did not report how their four downloaded variables (Rs, T, RH and U2), only two of which correspond to these primary inputs, were used to deliver the required output predictand, or how the following additional variables required by the FAO-56 PM method were accounted for: the slope of the saturation vapour pressure curve as a function of temperature (Δ); the psychrometric constant (γ), which is not really a constant but varies with temperature and pressure; the saturation vapour pressure (ea); and the actual vapour pressure (ed). There are different ways by which some of these variables can be calculated and this would inevitably introduce some degree of uncertainty with regard to the repeatability of the reported work of K&C. For example, was G assumed to have a zero value? As a result, it was not possible to construct the same F-ET0 data set as outlined and used by K&C. Similarly, for data sets that could be downloaded from cited sources and compared directly, discrepancies between the descriptive statistics of those data and the statistics reported by K&C indicated that some additional input data processing must have occurred, but no details were provided.
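For reference, the daily form of the FAO-56 PM equation of Allen et al. (1998) is reproduced below, written with the vapour pressure deficit as (ea − ed) to match the notation used in this discussion; it makes explicit which quantities must be measured, downloaded or derived before F-ET0 can be computed:

    ET_0 = \frac{0.408\,\Delta\,(R_n - G) + \gamma\,\frac{900}{T + 273}\,U_2\,(e_a - e_d)}{\Delta + \gamma\,(1 + 0.34\,U_2)}

in which ET0 is in mm/day, Rn and G are in MJ m-2 day-1, T is the mean daily air temperature in °C, U2 is the wind speed at 2 m height in m/s, the vapour pressure terms are in kPa, and Δ and γ are in kPa/°C.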

Based on the concerns that were raised in trying to construct similar data sets to those used in the K&C paper, we have developed a set of best practice guidelines. These guidelines form a seven-point checklist which details the major steps that are needed to deliver more rigorous scientific reporting of data in data-driven modelling papers. The present implementation of our pre-publication checklist as a post-publication instrument also serves to identify numerous issues in K&C's paper that still need to be clarified by the original authors.

SEVEN-POINT CHECKLIST

  • General Description: Each paper should supply a general description of the data set, including a clear statement of content(s) and original purpose of collection, as well as listing the spatial and temporal resolution of records. If a subset of some larger set of records is used, the reasons for choosing that subset should also be stated.

K&C: Ten years of daily meteorological observations for three automated weather stations located at Windsor, Oakville and Santa Rosa in California (1998–2007) were used. The modelling was restricted to a single decade but the reason for limiting the reported experiments to this particular subset was not stated. “Is there anything interesting or challenging or special about the data of that chosen decade?” remains an open question which requires addressing.

  • Quality Control Procedures: Each paper should supply some relevant information about the agencies responsible and, if possible, on their data quality assurance procedures. Any further attempts by the authors to check data quality should also be reported. Checking data quality is an issue that is very often ignored in hydrological studies. Suffice it to say that the old information-processing maxim of “garbage in, garbage out” is still alive. There is a perception, seldom stated openly, that developing techniques to deal with data quality issues is a less scientific and less exciting topic of investigation. Our view is that this is an important area of research and one in which data-driven modelling could potentially play a considerable part in advancing the field.

K&C: The reported data sets were obtained from a network of automated weather stations run by the California Department of Water Resources and the University of California at Davis. The overall intention is to assist irrigators in managing their water resources more efficiently. The different types of equipment used to gather the data, and where they were located, are listed in the original paper, but there is no indication of data reliability, such as whether dubious records that could be identified from quality control flags were omitted from the data set (e.g. Aytek et al., 2008, 2009).
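Where quality control flags are available, a screening step of the kind reported by Aytek et al. (2009) can be documented and reproduced along the lines of the sketch below; the file layout, with one value column and one single-character flag column per variable, is an assumption made for illustration and is not a description of the actual CIMIS or IPM formats.

    import pandas as pd

    data = pd.read_csv("station_077_daily_flagged.csv")   # e.g. Oakville (#077), illustrative layout

    BAD_FLAGS = {"I", "R", "Y"}   # ignored / far out of normal range / moderately out of range
    flag_cols = [c for c in data.columns if c.endswith("_flag")]

    # Keep only records in which no variable carries one of the listed flags,
    # and report how many records were removed so that the step can be published.
    mask = ~data[flag_cols].isin(BAD_FLAGS).any(axis=1)
    removed = int((~mask).sum())
    clean = data[mask].reset_index(drop=True)
    print(f"Removed {removed} flagged records out of {len(data)}")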

  • Data Acquisition Procedures: Each paper should provide a list of web sites and/or other sources for each data set used and list the date of download, or date of acquisition. Exact “date stamping” is important, since updates and corrections are sometimes applied to data sets.

K&C: The reported data sets were downloaded from the “CIMIS web server”, but the cited web address is that of the University of California Statewide Integrated Pest Management Program (IPM) (http://www.ipm.ucdavis.edu/WEATHER/wxretrieve.html). No date of download is provided. In our failed attempt to repeat their experiments, all available data sets were downloaded from the IPM web server on 16 September 2009. Initial shortcomings in that material led to an additional set of pertinent records being downloaded from the CIMIS web server (http://www.cimis.water.ca.gov/) on 6 October 2009. Table 1 provides a list of equivalent downloaded material; a sketch of a date-stamped download manifest is given after the table.

Table 1  IPM holdings and CIMIS equivalent downloads for automated weather stations at Windsor, Oakville and Santa Rosa. (INT: integer. 1DP/2DP/3DP: recorded to one/two/three decimal places)
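Exact date stamping of downloads is also straightforward to automate. The sketch below records, for each retrieved file, the source URL, the retrieval date and an MD5 checksum, so that later readers can confirm they are working with the same material; the URLs are placeholders and no particular server interface is implied.

    import csv
    import hashlib
    import urllib.request
    from datetime import date

    # Placeholder download targets; substitute the actual query URLs used.
    downloads = {
        "windsor_103.csv": "http://example.org/placeholder/windsor_103.csv",
        "oakville_077.csv": "http://example.org/placeholder/oakville_077.csv",
        "santarosa_083.csv": "http://example.org/placeholder/santarosa_083.csv",
    }

    with open("download_manifest.csv", "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["file", "source_url", "download_date", "md5"])
        for filename, url in downloads.items():
            urllib.request.urlretrieve(url, filename)   # fetch and save the file
            digest = hashlib.md5(open(filename, "rb").read()).hexdigest()
            writer.writerow([filename, url, date.today().isoformat(), digest])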

  • Data Tagging Procedures: Each paper should provide measurement units for all input and output variables that are mentioned in the text, and assign a unique identifier or subscript to each individual variable concerned.

K&C: The downloaded variables are listed but no units are reported, except in the results section, where ET0 is reported in mm. For this reason, in our failed attempt to repeat their experiments, data sets were downloaded in both English and metric units. K&C solar radiation was found to be recorded in English units (i.e. Langleys per day); all other observations in the original paper are in standard SI units (degrees Celsius, °C; metres per second, m/s; millimetres, mm) or percentages. Hence, the modelling appears to have involved fusing a mixed set of measurement systems, something that should be avoided. The other important point concerns ET0. This particular variable is mentioned in the text, but no actual ET0 measurements are included in the analysis: everything that is reported in the paper relates to a calculated output. Each individual method of calculating ET0 is different, and the various formulations possess inherent dissimilarities and potential errors. Thus, to prevent misunderstandings, each set of outputs should be treated as a different data set and assigned a unique identifier, e.g. C-ET0, F-ET0, Hargreaves ET0, Ritchie ET0 or Turc ET0.
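Mixing measurement systems can be avoided by converting all downloaded values to SI units before any modelling takes place. A minimal sketch for the solar radiation case is given below, using the standard equivalence 1 Langley = 0.04184 MJ m-2; the variable name is an assumption.

    LANGLEY_TO_MJ_PER_M2 = 0.04184   # 1 Langley = 41 840 J/m2

    def langleys_per_day_to_mj(rs_langleys_per_day):
        """Convert daily solar radiation from Langleys/day to MJ m-2 day-1."""
        return rs_langleys_per_day * LANGLEY_TO_MJ_PER_M2

    # Example: a daily total of 500 Ly/day is approximately 20.9 MJ m-2 day-1.
    print(langleys_per_day_to_mj(500))   # 20.92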

  • Data Cleansing Procedures: Each paper should specify the type and nature of error detection procedures and data cleansing operations applied, including the use of different filtering mechanisms for the removal of records that possessed missing or incomplete entries, or confirm that no original records were deleted. If removal procedures were applied, it is also important to record whether or not such procedures were applied in a consistent manner to the full data set, or to specific variables in a processed subset of the material. This should include a clear statement about the measures that were used to deal with negative numbers and zeros (if applicable), or confirm that no such procedures were applied. The use of quality control flags to support additional removals or the application of logical rules relating to physical factors, such as meaningful minimum and maximum end points, would be other instances that call for purposeful modifications to be implemented.

K&C: No missing numbers or incomplete records are reported. However, in our failed attempt to repeat their experiments, pertinent data sets downloaded from their cited website came complete with a report of both missing and “locally infilled” records: infilling can be from nearby stations and/or long-term averages, and in our download operations we opted for the default method of infilling. Table 2 provides a quantification of such operations, from which it is observed that numerous instances of missing RH, U2 and C-ET0 records occur at IPM. In K&C no data cleansing operations are reported. This suggests that each pair of training and testing records contained a full complement of data, i.e. one record per day for each individual variable concerned, less any instances of missing or incomplete records. It is difficult to accept that no checking was performed; however, since the minimum amount of incoming daily solar radiation at all stations was reported as zero, it may well be that no checks were applied. A sketch of a consistently applied, reportable filtering step is given after Table 2.

Table 2  IPM default infilling of missing records (1998–2007)
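A consistently applied and reportable cleansing step, covering the full record rather than selected variables, might take the form sketched below; again the file and column names are illustrative.

    import pandas as pd

    data = pd.read_csv("station_083_daily.csv")   # e.g. Santa Rosa (#083), illustrative file
    variables = ["Rs", "T", "RH", "U2", "ET0"]

    # Report missing values per variable, then apply complete-case filtering so that
    # every retained day holds one value for every variable.
    missing_per_variable = data[variables].isna().sum()
    print(missing_per_variable)

    complete = data.dropna(subset=variables).reset_index(drop=True)
    print(f"Retained {len(complete)} of {len(data)} daily records")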

  • Data Processing Procedures: Each paper should specify the type and nature of data pre- and post-processing operations that were applied to the original records, or confirm that no such procedures were applied, e.g. production of calculated numbers such as daily means, derived from daily minimum and daily maximum records, operational infilling using linear regression procedures, developing moving averages, rounding, etc.

K&C: No pre- or post-processing operations are reported. However, in our failed attempt to repeat their experiments, the data sets that could be downloaded from their cited website did not include mean daily temperature or mean daily relative humidity values. Such records are not directly available as an original data set: the site instead provides daily maximum and minimum records, from which a coarse daily mean can be estimated by simple arithmetic, although a mean calculated from the two end points does not account for daily patterns or statistical distributions.
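Where only end-point values are available, the coarse daily mean referred to above reduces to simple arithmetic, as sketched below; a mean computed in this way ignores the within-day distribution, which is precisely why such a processing step needs to be reported. Column names are assumptions.

    import pandas as pd

    data = pd.read_csv("station_103_minmax.csv")   # illustrative file of daily extremes

    # Coarse daily means estimated from the two end points only.
    data["T_mean"] = (data["T_max"] + data["T_min"]) / 2.0
    data["RH_mean"] = (data["RH_max"] + data["RH_min"]) / 2.0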

  • Data Summation Procedures: Each paper should include a numerical description of every variable that is used in the model, i.e. counts, appropriate standard statistics, a cross-correlation matrix and, if applicable, start/end dates.

K&C: The original paper contained a set of statistical descriptors. For each station, the mean, minimum, maximum and number of missing records in our downloaded data sets are compared against the original descriptors in Tables 3–5; these tables also correct a simple typographical error that resulted in two of the original numbers being positioned in the wrong rows. Numerous inconsistencies are observed. K&C reported solar radiation values that exceed the upper and lower boundaries of the data downloaded from IPM, but such values are identical to the ones downloaded from CIMIS. This suggests that the authors downloaded some of their records from CIMIS and not, as reported in the original paper, from IPM. The statistics for the temperature and relative humidity records were nevertheless a much closer match to IPM than to CIMIS, which suggests that a coarse daily mean was calculated from maximum and minimum records. The paired input records required for such calculations could have been downloaded from either IPM or CIMIS, although use of the latter web site would be slightly odd, since superior daily means are available as a direct download from that same server. K&C reported minimum and maximum wind speed as real numbers, but in the downloaded IPM data set these values are integers. CIMIS wind speed records are, however, recorded as real numbers and the reported values are a much closer match to CIMIS than to IPM, again suggesting a download from CIMIS. K&C provided descriptive statistics on what is assumed to be downloaded C-ET0. This variable is not used in the modelling process, other than to assess their results, which is in itself rather odd, since C-ET0 is an alternative method of computing ET0: it should not be regarded as an erroneous model of F-ET0, the pre-calculated variable that is identified in the text as the modelling output. F-ET0 itself is not subject to statistical description in the paper, an obvious oversight. Instead, C-ET0 is reported and observed to possess a minimum of −0.36 mm and a maximum of 11.2 mm. However, none of our downloaded records contained negative C-ET0 values, and the scatter plots of observed vs predicted values in the original paper appear to span a range of 0–10 mm; it might be that some kind of outlier removal procedure, not reported in the paper, was applied during the development of their scatter plots. The final point of major interest relates to the reported calculation of the output variable F-ET0 (K&C: equation (9)). It is not possible, as mentioned earlier, to calculate this particular variable from the data sets that can be downloaded from either IPM or CIMIS, since a complete set of inputs to the equation is not present, although a larger number of the required inputs, and a direct download of pre-calculated values, can nevertheless be obtained from CIMIS, as recorded in Tables 1 and 3–5. A short sketch showing how such a numerical description can be generated in practice follows the tables.

Table 3  Data sets for Windsor (1998–2007)

Table 4  Data sets for Oakville (1998–2007)

Table 5  Data sets for Santa Rosa (1998–2007)
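The kind of numerical description called for under Data Summation Procedures can be generated in a few lines, as sketched below; the file name and column labels are again placeholders.

    import pandas as pd

    data = pd.read_csv("station_077_daily.csv", parse_dates=["date"])
    variables = ["Rs", "T", "RH", "U2", "ET0"]

    print("Record count:", len(data))
    print("Start/end dates:", data["date"].min(), data["date"].max())
    print(data[variables].describe())       # counts, means, standard deviations, minima, maxima, quartiles
    print(data[variables].isna().sum())     # missing-value counts
    print(data[variables].corr().round(3))  # cross-correlation matrix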

CONCLUDING REMARKS

Four main conclusions arise. First, several numbers in our reported downloaded data sets differ slightly from those listed in K&C, perhaps due to unreported cleansing, removal, rounding or infilling; this information should have been reported in the original paper. Second, the data sets appear to come from one or more different sources, which should be explicitly stated. Third, it is unclear how the authors computed F-ET0 from material that could be downloaded from either IPM or CIMIS. Fourth, no recognised international standard exists for reporting hydrological modelling data sets, and some attempt at developing guidelines is essential if experiments are to be repeated in later studies. From the above analysis, it is clear that a simple procedural list offers one potential way forward. The seven-point list can doubtless be expanded and refined: potential collaborators and other interested parties who might be willing to help us develop the list further should contact the discussers.

Abrahart et al. (2008) identified eight desirable characteristics that papers should possess in order to build stronger scientific foundations. The need to expand upon that initial set of recommendations is evident. There is a pressing need for more scientific rigour in the application and reporting of data-driven hydrological modelling experiments. The focus of most published papers in this field is directed towards the trialling and testing of different tools or methods, e.g. the development of neural network, neuro-fuzzy or support vector machine solutions. Limited attention is paid to the specific nature of the data sets that are used, except, perhaps, for the provision of some standard statistical descriptors delivered in tabulated format. Yet understanding the nature of the data is fundamental to the use of data-driven modelling techniques. The use of open-access data sets in several recent papers for testing algorithms and developing models is to be commended, since it encourages others to perform similar comparison exercises. However, without a full reporting of the data sets used, and of the pre-processing operations that were applied, it is not possible to replicate the reported modelling applications, and the overall impact of important pioneering experiments is thereby reduced or perhaps lost.

Achieving the goal of the scientific method, in terms of repeatability and the reproduction of published results, is a laudable endeavour. We recognise that this is a journey which will be fraught with many difficulties. Data access will be a major barrier. In many countries, hydrological data are regarded as a commodity which can be expensive to purchase. In some countries, obtaining access to data is politically sensitive, especially in the case of trans-national and transboundary rivers. For testing novel modelling paradigms, having a standard database containing a diversity of challenging data would help in benchmarking results. This would be good news for consumers (users) and in the long term would help towards achieving a step change in hydrological modelling practice. The database could contain both real and synthetic data, albeit that the latter type of data can be controversial. In the field of hydraulic engineering, it is standard practice to test the performance of new finite difference solutions against standard solutions. The point we are making here is that there are many good modelling practices in other fields which can be adapted and used in developing “best hydrological modelling practice” for reporting data-driven hydrological modelling research activities.

The term “data-driven modelling” explicitly recognises the equal importance of both the modelling approach and the data that drive the model, but this balance of recognition is often lacking in published studies. In their desire to ensure that readers can understand and replicate the modelling approaches, most authors rightly provide full and detailed descriptions of the computations underlying their modelling techniques. However, the detail afforded to descriptions of the data sets which drive the models is often lacking, and replication of reported results becomes impossible. Consequently, those who should arguably be the greatest proponents of the data-driven modelling paradigm are failing to properly address large parts of its requirements. At best, this results in frustration for those wishing to independently verify published research. At worst, it results in a general lack of trust in the paradigm, fuelled by a lack of independent validation of the work of others and, consequently, a failure in the full application of accepted scientific methodology.

Notes

*Kisi, O. & Cimen, M. (2009) Evapotranspiration modelling using support vector machines. Hydrol. Sci. J. 54(5), 918–928.

1 FAO-56 PM uses Rn not Rs and both variables can be downloaded from CIMIS. For a full listing of their measured and calculated data sets, see: http://www.cimis.water.ca.gov/cimis/dataInfoType.jsp. Rn can also be obtained from Rs using estimated or measured net long-wave radiation values. Estimates of the latter can be produced using standard textbook equations.

REFERENCES

  • Abrahart, R. J., Ab Ghani, N. and Swan, J. (2009) Discussion of “An explicit neural network formulation for evapotranspiration”. Hydrol. Sci. J. 54(2), 382–388.
  • Abrahart, R. J. and See, L. M. (2007) Neural network modelling of non-linear hydrological relationships. Hydrol. Earth Syst. Sci. 11(5), 1563–1579.
  • Abrahart, R. J., See, L. M. and Dawson, C. W. (2008) Neural network hydroinformatics: maintaining scientific rigour. Ch. 3 in: Practical Hydroinformatics: Computational Intelligence and Technological Developments in Water Applications (R. J. Abrahart, L. M. See and D. P. Solomatine, eds), 33–47. Berlin: Springer-Verlag, Water Science and Technology Library, vol. 68.
  • Aksoy, H., Guven, A., Aytek, A., Yuce, M. I. and Unal, N. E. (2007) Discussion of “Generalized regression neural networks for evapotranspiration modelling”. Hydrol. Sci. J. 52(4), 825–828.
  • Aksoy, H., Guven, A., Aytek, A., Yuce, M. I. and Unal, N. E. (2008) Comment on “Kisi, O., 2007. Evapotranspiration modelling from climatic data using a neural computing technique, Hydrol. Processes 21, 1925–1934”. Hydrol. Processes 22(14), 2715–2717.
  • Allen, R. G., Pereira, L. S., Raes, D. and Smith, M. (1998) Crop evapotranspiration: guidelines for computing crop water requirements. Rome: Food and Agriculture Organization of the United Nations, FAO Irrig. Drainage Paper no. 56.
  • Aytek, A., Guven, A., Yuce, M. I. and Aksoy, H. (2008) An explicit neural network formulation for evapotranspiration. Hydrol. Sci. J. 53(4), 893–904.
  • Aytek, A., Guven, A., Yuce, M. I. and Aksoy, H. (2009) Reply to Discussion of “An explicit neural network formulation for evapotranspiration”. Hydrol. Sci. J. 54(2), 389–393.
  • Eching, S. and Moellenberndt, D. (1998) Technical Elements of CIMIS, the California Irrigation Management Information System. Sacramento, CA: State of California, Resources Agency, Department of Water Resources, Division of Planning and Local Assistance.
  • Kisi, O. (2006) Generalized regression neural networks for evapotranspiration modelling. Hydrol. Sci. J. 51(6), 1092–1105.
  • Kisi, O. (2007) Reply to Discussion of “Generalized regression neural networks for evapotranspiration modelling”. Hydrol. Sci. J. 52(4), 829–831.
  • Kisi, O. and Cimen, M. (2009) Evapotranspiration modelling using support vector machines. Hydrol. Sci. J. 54(5), 918–928.
  • Koutsoyiannis, D. (2007) Discussion of “Generalized regression neural networks for evapotranspiration modelling”. Hydrol. Sci. J. 52(4), 832–839.
  • Mount, N. J. and Abrahart, R. J. (2010) Discussion of “River flow estimation from upstream flow records by artificial intelligence methods” by M. E. Turan & M. A. Yurdusev (2009, J. Hydrol. 369, 71–77). J. Hydrol., in press.
  • See, L. M., Solomatine, D. P., Abrahart, R. J. and Toth, E. (2007) Hydroinformatics: computational intelligence and technological developments in water science applications – Editorial. Hydrol. Sci. J. 52(3), 391–396.
  • Snyder, R. L. and Pruitt, W. O. (1992) Evapotranspiration data management in California. In: Proceedings of Water Forum 1992, Irrigation and Drainage Session, 2–6 August 1992, Baltimore, MD, 128–133. American Society of Civil Engineers.
