A novel approach to text clustering using genetic algorithm based on the nearest neighbour heuristic

D. Mustafi, A. Mustafi & G. Sahoo
Pages 291-303 | Received 24 Sep 2019, Accepted 13 Feb 2020, Published online: 10 Mar 2020
 

Abstract

In this paper, we propose a novel clustering algorithm that uses a weighted combination of several criteria as its fitness function, and we demonstrate its suitability for clustering text documents. The proposed algorithm leverages the concept of nearest neighbour separation (NNS) to enhance the separation of the clusters, and we outline a heuristic to compute the NNS. The new parameterized fitness function can be tuned to give more weight either to the traditional metrics based on inter- and intra-cluster distances or to the NNS. A genetic algorithm performs the actual clustering, and the results obtained have been compared with those of the traditional K-Means algorithm. The performance of the algorithm has been tested on different standard datasets, and the results are presented.
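To make the idea of a tunable fitness function concrete, the following is a minimal sketch of how a distance-based term and a nearest-neighbour term might be blended with a single weight. The abstract does not give the paper's formulas, so everything here is an assumption made for illustration: the function names, the weight `alpha`, the ratio used for the distance term, and the stand-in for NNS (the fraction of points whose nearest neighbour shares their cluster label) are hypothetical and should not be read as the authors' definitions.

```python
import math
from itertools import combinations


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def centroids(points, labels):
    """Map each cluster label to the mean vector of its points."""
    groups = {}
    for p, lab in zip(points, labels):
        groups.setdefault(lab, []).append(p)
    return {lab: [sum(c) / len(ps) for c in zip(*ps)] for lab, ps in groups.items()}


def distance_term(points, labels):
    """Assumed distance criterion in [0, 1]: inter / (inter + intra).

    intra = mean distance from each point to its own centroid,
    inter = mean pairwise distance between centroids.
    """
    cents = centroids(points, labels)
    intra = sum(euclidean(p, cents[lab]) for p, lab in zip(points, labels)) / len(points)
    pairs = list(combinations(cents.values(), 2))
    inter = sum(euclidean(a, b) for a, b in pairs) / len(pairs) if pairs else 0.0
    total = inter + intra
    return inter / total if total > 0 else 0.0


def nns_term(points, labels):
    """Assumed stand-in for NNS: share of points whose nearest
    neighbour carries the same cluster label."""
    if len(points) < 2:
        return 0.0
    hits = 0
    for i, p in enumerate(points):
        nearest = min(
            (j for j in range(len(points)) if j != i),
            key=lambda j: euclidean(p, points[j]),
        )
        hits += labels[nearest] == labels[i]
    return hits / len(points)


def fitness(points, labels, alpha=0.5):
    """Weighted blend: alpha favours the distance term, 1 - alpha the NNS term."""
    return alpha * distance_term(points, labels) + (1 - alpha) * nns_term(points, labels)


# Toy 2-D vectors standing in for document embeddings (hypothetical data).
docs = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
print(fitness(docs, [0, 0, 1, 1]) > fitness(docs, [0, 1, 0, 1]))
```

In a genetic algorithm, each chromosome could encode one assignment of labels to documents, and a function of this shape would score it; setting `alpha` close to 1 makes the search behave more like a centroid-distance objective, while values close to 0 emphasise neighbourhood agreement.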

Disclosure statement

No potential conflict of interest was reported by the author(s).

Additional information

Notes on contributors

D. Mustafi

Debjani Mustafi is affiliated with Birla Institute of Technology, Mesra, India, where she is currently a faculty member in the Department of Computer Science and Engineering. She has authored and co-authored multiple peer-reviewed journal and conference papers. Her research interests include text mining, data analysis and visualization, and evolutionary computing.

A. Mustafi

Abhijit Mustafi is affiliated with Computer Science and Engineering, Birla Institute of Technology, Mesra, India, where he currently serves as an Associate Professor. He has authored and co-authored multiple peer-reviewed scientific papers and presented work at many national and international conferences. He has received several awards and research funding during his academic career. His research interests include information retrieval from web corpora, dynamic data visualization and blind source separation of images.

G. Sahoo

Gadadhar Sahoo received his M.Sc. degree in Mathematics from Utkal University in 1980 and his Ph.D. degree in Computational Mathematics from Indian Institute of Technology, Kharagpur in 1987. He is currently working as a Professor in the Department of Computer Science and Engineering. He has approximately 300 publications in national and international journals. His areas of interest include Soft and Evolutionary Computing, Grid Computing, ML, Image Processing, Wireless Sensor Networks, Bio-Informatics and Cloud Computing.
