Abstract
This article addresses the problem of identifying differences between populations of trees. Balding et al. (2009) recently proposed a sophisticated test, the BFFS test, a Kolmogorov-type test that maximizes the differences between the information of the samples. Its statistic cannot be computed naively, since it involves a search over a set of trees that grows exponentially fast. Balding et al. (2009) devised an algorithm for computing the test statistic that searches for a minimum cut over a transport network with a Ford-Fulkerson type routine. The test was shown to be powerful, but at the time it was complex to apply in practice. In contrast, we propose a very simple statistical test based on the distance between empirical mean trees, as an analog of the two-sample Z statistic for comparing two means. Despite its simplicity, the test is quite powerful at separating distributions with different means, but it does not distinguish between different populations with the same mean. In that case, the BFFS test should be applied. Nevertheless, on a real data set from proteomics, also discussed in Balding et al. (2009), our test obtained the same results, making it a valuable preliminary evaluation tool for discriminating between populations of random trees.
Acknowledgments
We thank Antonio Galves for illuminating discussions about the discriminative power of context trees. We also thank Ricardo Fraiman for his continuous encouragement and support, and the Universidad de San Andres for its hospitality when part of this article was produced. We would like to thank Florencia Leonardi for providing the data used in our example of determination of protein functionality, which was also analyzed in Leonardi (2007). This work was partially supported by PICT 2005-31659 and Secyt grant 69/08 and 05/B352.