ABSTRACT
With the recent increase in the size of data, compression has become essential tool for handling massive data. Compression is usually more helpful in handling huge data over the networks, large text collections, and storing massive biological data. It is also useful in searching a pattern (of any kind) inside the huge data set. Compressed pattern matching (CPM) is the task of performing string matching in a compressed text without decompressing it. In this type of matching, pattern may or may not be compressed. This paper presents an efficient algorithm (WBTC_WT) for matching a pattern directly inside the compressed text. The proposed algorithm uses the tagged sub optimal code category of algorithm called word-based tagged code (WBTC) and a self-indexed data structure called wavelet tree. WBTC is used for encoding the text and wavelet tree provides fast searching over the compressed text. WBTC_WT removes the problem of false matches, encountered in some of the previous approaches and is able to match arbitrary portion of text without decompressing it. The proposed algorithm is implemented, analyzed, and compared with existing approaches on huge data set. From the simulation results, it is found that our algorithm outperforms the existing algorithms in most of the cases.
ACKNOWLEDGEMENTS
Authors wish to acknowledge Dr. Hari Mohan Singh, Assistant Professor, Department of Computer Science and IT, SHUATS, Naini, Allahabad, India, for his continuous support.
DISCLOSURE STATEMENT
No potential conflict of interest was reported by the authors.
Additional information
Notes on contributors
![](/cms/asset/4b548baf-6e8e-4298-bbb5-0cb49f229179/tijr_a_1347071_uf0001_oc.jpg)
Surya Prakash Mishra
Surya Prakash Mishra is working as an assistant professor at the Department of Computer Science & Information Technology, SHUATS, Naini, Allahabad, India. He received his BSc in PCM from University of Allahabad, Allahabad, India and MCA from, UP Technical University, Lucknow, India. He has more than 10 years of teaching experience at undergraduate and graduate levels. He has published more than seven research papers in International journals and conferences. His research interests include pattern matching, information retrieval, computer algorithms and data compression.
E-mail: [email protected]
![](/cms/asset/8f05cd94-3a4b-495c-983f-f931133b7126/tijr_a_1347071_uf0002_oc.jpg)
Rajesh Prasad
Rajesh Prasad is currently working as an assistant professor at the School of IT and Computing, American University of Nigeria, Yola, Nigeria. He received his MTech in software engineering and PhD in computer science and engineering from MNNIT, Allahabad, India. He has more than 17 years of teaching experience at undergraduate and graduate levels. He has published more than 50 research papers in international journals and conferences. His research interests include pattern matching, information retrieval, data mining, bioinformatics and data compression.
E-mail: [email protected]
![](/cms/asset/5bdab186-198a-4a13-b3c7-6dbc5626ce69/tijr_a_1347071_uf0003_oc.jpg)
Gurmit Singh
Gurmit Singh is working as Professor Emeritus at the Department of Computer Science & Information Technology, SHUATS, Naini, Allahabad, India. He received his BTech in electronics from JNU, New Delhi, India and MTech in electrical engineering from IIT, Kanpur, India. He has more than 20 years of teaching experience at undergraduate and graduate levels. He has published more than 35 research papers in international journals and conferences. His research interests include information retrieval, data communication, image processing and software engineering.
E-mail: [email protected]