Abstract
To recognize functional sites within a protein sequence, the non-numerical attributes of the sequence must be encoded before a pattern recognition algorithm can be applied. The success of recognition depends on how efficiently the biological information contained in the sequence is encoded. In this regard, a bio-basis function maps a non-numerical sequence space to a numerical feature space, based on an amino acid mutation matrix, so that the biological content of a sequence can be maximally utilized for analysis. An important issue for the bio-basis function is how to select a minimum set of bio-bases with maximum information. In this paper, we present two relational soft clustering algorithms, named rough c-medoids and fuzzy-possibilistic c-medoids, to select the most informative bio-bases. The combined fuzzy and possibilistic memberships of fuzzy-possibilistic c-medoids avoid both the noise sensitivity of fuzzy c-medoids and the coincident clusters problem of possibilistic c-medoids, while the lower and upper boundaries of rough c-medoids deal with uncertainty, vagueness, and incompleteness in the class definitions of biological data. The concept of ‘degree of resemblance’, based on the non-gapped pairwise homology alignment score, circumvents the initialization and local minima problems of both c-medoids algorithms and thereby enables efficient selection of a minimum set of the most informative bio-bases. The effectiveness of the algorithms, along with a comparison with other algorithms, is demonstrated on HIV (human immunodeficiency virus) protein datasets.
Acknowledgements
The authors would like to thank the anonymous referees for providing helpful comments and valuable criticisms which have greatly improved the presentation of the paper.