
Plant Signaling & Behavior Volume 6, 2011 - Issue 1


Article Addendum

To transform or not to transform

That is the dilemma in the statistical analysis of plant volatiles

Yuvaraj Ranganathan Centre for Ecological Sciences, Indian Institute of Science, Bangalore, India

Renee M. Borges Centre for Ecological Sciences, Indian Institute of Science, Bangalore, India (corresponding author)

Pages 113-116 | Received 13 Nov 2010, Accepted 13 Nov 2010, Published online: 01 Jan 2011

https://doi.org/10.4161/psb.6.1.14191


Abstract

Chemical ecology, be it the study of plant volatiles or insect cuticular hydrocarbons, largely involves the analysis of compositions or “blends” of a mixture of compounds. Compositional data have intrinsic properties such as a “constant-sum constraint,” which should be taken into account when statistically analyzing these data. The field of compositional data analysis has greatly improved our understanding of the nature of such compositions and has provided us with insights on statistically rigorous ways of analyzing such constrained data. Employment of standard multivariate statistical procedures on compositional data necessitates the use of appropriate transformation procedures, which remove the non-independence of data points, thus rendering the data suitable for such analysis. Here we present the current situation of the analysis of compositional data in chemical ecology; the awareness of this constraint of compositional data; and alternative ways of analyzing such constrained data using Random Forests, a data-mining algorithm that has many features that facilitate the analysis of such data. Two such features of particular relevance to compositional data are that Random Forests does not incorporate implicit assumptions about the distribution of the data and can deal with auto-correlations between data points.


Compositional Data in Chemical Ecology

Plant volatile bouquets or insect cuticular hydrocarbons are usually analyzed as relative proportions or percentages that are always bounded, i.e., all the data points add to a constant of 1 or 100%. Thus any increase in the value of one data point automatically requires the other data points to decrease, demonstrating the “constant-sum constraint” of such data.[1] This non-independence of data points makes the data unsuitable for analysis using standard statistical procedures such as multiple pairwise correlations, principal component analysis (PCA), multivariate analysis of variance (MANOVA) and multiple regressions. This is because all these procedures implicitly assume a data distribution, independence of data points, as well as absence of interactions between data points. Additional problems encountered in such data in chemical ecology include log-level differences in the percentage values of the data points, presence of a large number of zeroes and auto-correlations between data points.[2] These features are natural constraints in chemical ecology since many compounds could share common biosynthetic pathways, have isomeric forms, and also be selectively regulated based on the ecological context, resulting in large absences or large presences based on context.[3,4]
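The constant-sum constraint can be illustrated with a minimal sketch. The data here are hypothetical: raw amounts of three compounds are drawn independently from an assumed lognormal distribution and then rescaled to proportions. The helper names `close` and `covariance` are illustrative only. Because every blend sums to 1, each row of the covariance matrix of the proportions sums to zero, so every compound must covary negatively with at least one other compound even though the raw amounts were independent.

```python
import random


def close(parts):
    """Rescale non-negative amounts so that they sum to 1 (proportions)."""
    total = sum(parts)
    return [p / total for p in parts]


def covariance(xs, ys):
    """Sample covariance of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)


rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Hypothetical raw amounts of three compounds, drawn independently.
raw = [[rng.lognormvariate(0.0, 0.5) for _ in range(3)] for _ in range(200)]

# Closure: every sample is turned into a blend of proportions.
blends = [close(sample) for sample in raw]

# Covariance matrix of the proportions, compound by compound.
columns = list(zip(*blends))
cov = [[covariance(a, b) for b in columns] for a in columns]

# Each row sums to (numerically) zero because the parts sum to a constant.
row_sums = [sum(row) for row in cov]

for row, total in zip(cov, row_sums):
    print(["%+.4f" % c for c in row], "row sum: %.1e" % total)
```

The negative off-diagonal entries in this sketch arise from closure alone, which is the kind of spurious correlation that makes pairwise correlations of raw proportions hard to interpret.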

The statistical analysis of compositional data saw a surge of improvement, borrowing heavily from the field of geological chemistry. The study of mineral compositions usually involved categorisation into “major” elements that are present in percent to tens of percent values, “minor” elements that are present at around 1% concentrations and “trace” elements that are present in parts per million or parts per billion levels.[5] Such data were analyzed using standard statistical procedures without awareness that the basic assumptions of normality, among others, were being violated. Although this was pointed out as early as 1897 by Karl Pearson writing on spurious correlations, it was not until the 1960s that such pitfalls were acknowledged and taken into consideration.[1] Several transformations were proposed to render the data suitable for analysis. These include the centered log ratio transformation (clr), additive log ratio transformation (alr) and isometric log ratio transformation (ilr), of which clr is most often used.[6,7]
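The three log-ratio transformations can be sketched as follows for a single blend with strictly positive parts. This is a minimal illustration, not the implementation of any particular package: the function names are hypothetical, the alr reference part defaults to the last compound by assumption, and the ilr is computed in pivot coordinates, which is only one of many possible orthonormal bases.

```python
import math


def clr(x):
    """Centered log ratio: log of each part over the geometric mean of all parts."""
    logs = [math.log(v) for v in x]
    mean_log = sum(logs) / len(logs)
    return [l - mean_log for l in logs]


def alr(x, ref=-1):
    """Additive log ratio: log of each part over a chosen reference part."""
    logs = [math.log(v) for v in x]
    ref = ref % len(x)
    return [l - logs[ref] for i, l in enumerate(logs) if i != ref]


def ilr(x):
    """Isometric log ratio in pivot coordinates (an assumed choice of basis)."""
    logs = [math.log(v) for v in x]
    d = len(logs)
    coords = []
    for i in range(d - 1):
        k = d - 1 - i                        # number of parts after part i
        rest_mean = sum(logs[i + 1:]) / k    # log geometric mean of those parts
        coords.append(math.sqrt(k / (k + 1)) * (logs[i] - rest_mean))
    return coords


def norm(v):
    """Euclidean length of a vector."""
    return math.sqrt(sum(c * c for c in v))


blend = [0.70, 0.25, 0.05]  # hypothetical proportions of three compounds

print("clr:", clr(blend))   # D values that sum to zero
print("alr:", alr(blend))   # D - 1 values relative to the last compound
print("ilr:", ilr(blend))   # D - 1 values with the same length as the clr vector
```

All three functions fail on zero parts because the logarithm of zero is undefined, which is why zero values need separate treatment before any log-ratio transformation.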

Transformation of Compositional Data in Chemical Ecology

To understand the extent of transformations of compositional data in chemical ecology, we performed a literature survey using Google Scholar. We limited our search period to 1986–2010, since it was in 1986 that J. Aitchison published the seminal work titled “The statistical analysis of compositional data,” which advocated the use of data transformation.[8] We employed the key words (plant + volatiles + “GC-MS”) and (cuticular + hydrocarbons + “GC-MS”) to retrieve citations, which we used as surrogates for published literature in this area of chemical ecology. We restricted our search with the keyword (GC-MS) as this would capture the specific subset of studies that identify and analyze compounds in chemical ecology. Within these results, we separately retrieved literature that contained the keyword (Aitchison) and literature that did not. The survey revealed a disproportionately small number of studies that contained the keyword (Aitchison) and thus, by proxy, have cited Aitchison's work and transformed their data as recommended by Aitchison (Fig. 1). We repeated this survey using the phrase (“Random Forests”) to retrieve literature that has used this relatively new algorithm. We found just five results with “plant volatiles” and none with “cuticular hydrocarbons” (Fig. 1).

Although dedicated software packages for analyzing compositional data exist, e.g., compositions, robCompositions and MixeR for R software, as well as CoDa developed by Aitchison, many studies use square-root transformations or log transformations with the addition of a constant (ranging from 0.01 to 0.00001) to accommodate zero data points. The addition of such seemingly arbitrary constant values can greatly alter the projection of such data points in multivariate space.[9] Thus, if one sets out to study compositional data within the framework of standard multivariate procedures, it is imperative that the researcher be aware of the limitations and assumptions of such procedures and use appropriate transformation procedures to incorporate statistical rigor into the analysis. If the researcher prefers not to use such model-based methods with built-in assumptions, alternative algorithm-based methods such as Random Forests are at the researcher's disposal.
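The sensitivity to the added constant can be illustrated with a small sketch. The two blends are hypothetical, and the choice of a clr transformation followed by Euclidean distance is an assumption made for illustration; the studies surveyed used a variety of log and square-root transformations. The function name `clr_distance` is likewise illustrative.

```python
import math


def clr(x):
    """Centered log ratio: log of each part over the geometric mean of all parts."""
    logs = [math.log(v) for v in x]
    mean_log = sum(logs) / len(logs)
    return [l - mean_log for l in logs]


def clr_distance(a, b, constant):
    """Euclidean distance between two blends after adding `constant` and applying clr.

    The blends are not re-closed after the addition because clr is unaffected
    by rescaling all parts of a blend by the same factor.
    """
    ca = clr([v + constant for v in a])
    cb = clr([v + constant for v in b])
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(ca, cb)))


# Hypothetical blends: the third compound is absent from blend_a.
blend_a = [0.5, 0.5, 0.0]
blend_b = [0.4, 0.4, 0.2]

# The same pair of blends, with constants spanning the range reported above.
distances = {c: clr_distance(blend_a, blend_b, c) for c in (1e-2, 1e-3, 1e-4, 1e-5)}

for constant, distance in distances.items():
    print("constant %g -> distance %.2f" % (constant, distance))
```

In this sketch the distance between the same two blends grows steadily as the constant shrinks, so the apparent separation of samples depends on an arbitrary choice rather than on the chemistry.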

Random Forests and Compositional Data

Random Forests[10] is a data-mining algorithm that has many features which make it suitable for analyzing complex data sets.[11] For example, Random Forests is increasingly used in the analysis of complex microarray data: the numbers of microarray studies citing this approach, retrieved by year using the keywords (microarray + “random forest”), were 2002: 10, 2003: 30, 2004: 70, 2005: 130, 2006: 280, 2007: 472, 2008: 706, 2009: 1021, 2010: 1300. This indicates an increasing adoption of this method by molecular biologists. Of particular interest to chemical ecologists are two features of Random Forests: no implicit assumptions on the structure of the data points and accommodation of any interactions and/or correlations between data points. As Random Forests is a non-parametric method,[12] it can also deal with data points varying in log-scales and with zeroes. Random Forests constructs decision trees, each from a subset of samples and variables selected at random. This, combined with bootstrap aggregation, gives estimates of classification errors. Such attractive features make these algorithms promising for data sets in chemical ecology, which have the additional constraint of being compositional.
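The way bootstrap aggregation yields an error estimate can be sketched with a deliberately simplified stand-in. This is not Breiman's Random Forests: each "tree" below is a single-split decision stump on one randomly chosen variable, fitted to a bootstrap sample, and the error is estimated only from the samples left out of each bootstrap (the out-of-bag samples). The data and all function names are hypothetical.

```python
import random


def fit_stump(rows, labels, feature):
    """Find the threshold on one feature that misclassifies the fewest rows."""
    best = None
    values = sorted({r[feature] for r in rows})
    for lo, hi in zip(values, values[1:]):
        cut = (lo + hi) / 2
        for left, right in ((0, 1), (1, 0)):
            errors = sum((left if r[feature] <= cut else right) != y
                         for r, y in zip(rows, labels))
            if best is None or errors < best[0]:
                best = (errors, cut, left, right)
    if best is None:  # all values identical: always predict the majority label
        majority = round(sum(labels) / len(labels))
        return (feature, 0.0, majority, majority)
    return (feature, best[1], best[2], best[3])


def predict(stump, row):
    """Classify one row with a fitted stump."""
    feature, cut, left, right = stump
    return left if row[feature] <= cut else right


def bagged_stumps_oob(rows, labels, n_trees=25, seed=0):
    """Bag stumps over bootstrap samples and random variables; return the OOB error."""
    rng = random.Random(seed)
    n, n_features = len(rows), len(rows[0])
    votes = [[0, 0] for _ in range(n)]
    for _ in range(n_trees):
        picked = [rng.randrange(n) for _ in range(n)]   # bootstrap sample
        feature = rng.randrange(n_features)             # randomly chosen variable
        stump = fit_stump([rows[i] for i in picked],
                          [labels[i] for i in picked], feature)
        for i in set(range(n)) - set(picked):           # out-of-bag samples only
            votes[i][predict(stump, rows[i])] += 1
    # Majority vote per sample, skipping samples that were never out-of-bag.
    scored = [(v, y) for v, y in zip(votes, labels) if sum(v) > 0]
    wrong = sum((0 if v[0] >= v[1] else 1) != y for v, y in scored)
    return wrong / len(scored)


# Hypothetical, well separated two-class data with three variables.
data_rng = random.Random(1)
rows = ([[data_rng.uniform(0, 1) for _ in range(3)] for _ in range(10)]
        + [[data_rng.uniform(2, 3) for _ in range(3)] for _ in range(10)])
labels = [0] * 10 + [1] * 10

oob_error = bagged_stumps_oob(rows, labels)
print("out-of-bag error estimate:", oob_error)
```

Nothing in this procedure assumes a distribution for the variables, since each split depends only on the rank order of the values. A full Random Forests implementation would grow deeper trees and sample variables at every split.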

We reanalyzed data on volatile organic compounds (VOCs) produced by ripe figs of three species and two sexes within these species (Ficus hispida male and female figs, Ficus exasperata male and female figs, and Ficus tsjahela monoecious figs) that we had analyzed using Random Forests in an earlier paper,[2,13] this time by transforming the data by adding 0.0001 to all values. A PCA with transformed VOC values gave better separation between species and sexes (Fig. 2A) than the earlier PCA plot of untransformed VOC values (Fig. 4a of the earlier publication[13]). Furthermore, a multidimensional scaling plot using the MDSplot function in the Random Forests package with untransformed proportions showed the same separation as did the PCA plot with transformed proportions (Fig. 2B). This indicates that, with these data, a PCA with transformed proportions is equivalent to a multidimensional scaling (MDS) plot with untransformed proportions (the MDSplot function does not provide stress values as in other MDS analyses). Furthermore, we used the varSelRF routine[11] with Random Forests on transformed data to separate the five classes of figs and found some interesting similarities to and differences from our earlier results (Table 1). In the case of male F. hispida and F. tsjahela, there were no differences from our earlier predictor VOC compounds. In the case of F. hispida female, we found that Random Forests had substituted 2-heptyl acetate for iso-amyl acetate as a predictor compound (Table 1). In female F. exasperata, Random Forests substituted undecane for p-cymene and β-caryophyllene, with a lower model frequency of 83% compared to the earlier model frequency of 98% (Table 1). In male F. exasperata, Random Forests added allo-aromadendrene, γ-terpinene and terpinolene to the previous list of predictor VOCs, with a much higher model frequency of 82% compared to the earlier 31% (Table 1).

Should a researcher be more comfortable with the results from transformed or untransformed data in this case? We suggest that since Random Forests coupled with varSelRF employs bootstrapping, in which various compounds are selected at random many times over in various combinations, it should not be necessary to transform the data to employ such algorithms in the search for predictor variables. However, this suggestion needs to be examined and verified statistically. We urge statisticians such as John Aitchison and Leo Breiman to turn their attention to such specific problems that will help to shed light on the genuine dilemma facing researchers in this area: to transform or not to transform?

Figures and Tables

Figure 1 Literature survey using Google Scholar from 1986–2010 to retrieve publications in chemical ecology which transformed their proportion data as recommended by Aitchison in the fields of plant volatile or insect cuticular hydrocarbon analysis. The number of publications citing both the Aitchison transformation and Breiman's Random Forests is also shown.

Figure 2 Unsupervised classification of fruit-dispersal volatile organic compounds (VOCs) of three sympatric Ficus species using proportional abundance of VOCs. (A) A PCA plot of VOC proportions after transformation employing the clr (centered log ratio) method as recommended by Aitchison. (B) An MDS plot of the untransformed proportions of the same VOCs using Random Forests.

Table 1 Comparison of results from Random Forests on ripe fig fruit volatile organic compounds (VOCs) using untransformed and transformed data


References

1. Aitchison J, Egozcue JJ. Compositional data analysis: where are we and where should we be heading? Math Geol 2005; 37:829-850; http://dx.doi.org/10.1007/s11004-005-7383-7
2. Ranganathan Y, Borges RM. Reducing the babel in plant volatile communication: Using the forest to see the trees. Plant Biol 2010; 12:735-742; http://dx.doi.org/10.1111/j.1438-8677.2009.00278.x
3. Pichersky E, Gang DR. Genetics and biochemistry of secondary metabolites in plants: an evolutionary perspective. Trends Plant Sci 2000; 5:439-445
4. Blomquist GJ, Bagnères AG. Insect Hydrocarbons: Biology, Biochemistry and Chemical Ecology. Cambridge, UK: Cambridge University Press; 2010
5. Templ M, Filzmoser P, Reimann C. Cluster analysis applied to regional geochemical data: problems and possibilities. Appl Geochem 2008; 23:2198-2213; http://dx.doi.org/10.1016/j.apgeochem.2008.03.004
6. Aitchison J. The single principle of compositional data analysis, continuing fallacies, confusions and misunderstandings and some suggested remedies. CODAWORK'08, Girona, Spain, 2008; http://hdl.handle.net/10256/706
7. Aitchison J. The statistical analysis of compositional data. J R Stat Soc Ser B Stat Methodol 1982; 44:139-177
8. Aitchison J. The Statistical Analysis of Compositional Data. London: Chapman & Hall Ltd; 1986
9. Martin SJ, Drijfhout FP. How reliable is the analysis of complex cuticular hydrocarbon profiles by multivariate statistical methods? J Chem Ecol 2009; 35:375-382; http://dx.doi.org/10.1007/s10886-009-9610-z
10. Breiman L. Random forests. Mach Learn 2001; 45:5-32; http://dx.doi.org/10.1023/A:1010933404324
11. Díaz-Uriarte R, Alvarez de Andrés S. Gene selection and classification of microarray data using random forest. BMC Bioinformatics 2006; 7:3; PMID: 16398926; http://dx.doi.org/10.1186/1471-2105-7-3
12. Lunetta K, Hayward LB, Segal J, Van Eerdewegh P. Screening large-scale association study data: exploiting interactions using random forests. BMC Genetics 2004; 5:32; PMID: 15588316; http://dx.doi.org/10.1186/1471-2156-5-32
13. Borges RM, Bessière JM, Hossaert-McKey M. The chemical ecology of seed dispersal in monoecious and dioecious figs. Func Ecol 2008; 22:484-493; http://dx.doi.org/10.1111/j.1365-2435.2008.01383.x
