Communications

Efficient Keyword Spotting System Using Deformable Convolutional Network

Huu Binh Nguyen, Van Hai Duong, Anh Xuan Tran Thi & Quoc Cuong Nguyen
 

Abstract

Keyword spotting (or wake-word detection) is a major component of human–machine interaction. The goals of a keyword spotting system are to maximize detection accuracy at a low false alarm rate while minimizing footprint size and computation. To better satisfy these requirements, we propose an end-to-end neural architecture that combines deformable convolution with an attention mechanism. The deformable convolution layer drives the model to focus on the human speech region, while the attention mechanism further focuses on the part of the speech segment that matters most for keyword spotting. Our experimental results on the real-world “Hey Snips” dataset show that our system significantly outperforms recent approaches in detection quality, model size, and complexity. With only 78K parameters, the model achieves a false rejection rate (FRR) of 0.005% on clean samples and 0.531% in noisy conditions, at 0.5 false alarms (FA) per hour.
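To make the reported metric concrete, the following is a minimal sketch of how an FRR at a fixed FA-per-hour operating point can be computed from detector scores. It is not the authors' evaluation code. The function name `frr_at_fa_per_hour`, its parameters, and the toy scores are all hypothetical, and the sketch assumes one score per clip, so each negative clip contributes at most one false alarm.

```python
def frr_at_fa_per_hour(pos_scores, neg_scores, neg_hours, target_fa_per_hour):
    """Return (threshold, frr) at a fixed false-alarm budget.

    Assumptions (illustrative, not taken from the article):
    - pos_scores: one detector score per clip that contains the keyword
    - neg_scores: one detector score per clip without the keyword
    - neg_hours: total duration of the negative audio, in hours
    - a detection fires when score > threshold
    """
    # Number of false alarms the budget allows over all negative audio.
    max_fa = int(target_fa_per_hour * neg_hours)

    # Sort negative scores from highest to lowest. Using the score at
    # index max_fa as the threshold leaves at most max_fa negatives
    # strictly above it.
    neg_sorted = sorted(neg_scores, reverse=True)
    if max_fa >= len(neg_sorted):
        threshold = float("-inf")  # budget covers every negative clip
    else:
        threshold = neg_sorted[max_fa]

    # A keyword clip is falsely rejected when it does not exceed the threshold.
    misses = sum(1 for s in pos_scores if s <= threshold)
    return threshold, misses / len(pos_scores)


# Toy data: 4 keyword clips, 4 negative clips spanning 2 hours of audio.
threshold, frr = frr_at_fa_per_hour(
    pos_scores=[0.9, 0.8, 0.4, 0.95],
    neg_scores=[0.1, 0.7, 0.2, 0.5],
    neg_hours=2,
    target_fa_per_hour=0.5,
)
print(threshold, frr)
```

In this toy case the budget allows one false alarm, so the threshold settles at the second-highest negative score and one of the four keyword clips is rejected. A streaming evaluation would count false alarms per detection event rather than per clip.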

Additional information

Notes on contributors

Huu Binh Nguyen

Huu Binh Nguyen received his engineering degree (1996) and MS degree (2006) in electrical engineering from Hanoi University of Science and Technology (HUST), Vietnam. He is a PhD student at the Department of Instrumentation and Industrial Informatics – School of Electrical Engineering – HUST. His research interests include signal processing and speech recognition.

Van Hai Duong

Van Hai Duong received his engineering degree (2019) in control engineering and automation from Hanoi University of Science and Technology (HUST). He is a master's student at the Department of Instrumentation and Industrial Informatics, HUST. His current interests include signal processing, keyword spotting, automatic speech recognition, and embedded systems.

Anh Xuan Tran Thi

Anh Xuan Tran Thi received her engineering degree (2008) and MS degree (2010) in electrical engineering from Hanoi University of Science and Technology (HUST), Vietnam, and a PhD degree in signal, image, speech, and telecommunications from INP Grenoble (Institut National Polytechnique de Grenoble), France, in 2016. She is a lecturer/researcher at the Department of Instrumentation and Industrial Informatics – School of Electrical Engineering – HUST. Her research focuses on signal and speech processing and applications, speech recognition, and embedded systems.

Quoc Cuong Nguyen

Quoc Cuong Nguyen received his engineering degree (1996) and MS degree (1998) in electrical engineering from Hanoi University of Science and Technology (HUST), Vietnam, and a PhD degree in signal, image, speech, and telecommunications from INP Grenoble (Institut National Polytechnique de Grenoble), France, in 2002. He is a lecturer/researcher at the Department of Instrumentation and Industrial Informatics – School of Electrical Engineering – HUST. His research interests include signal processing, speech recognition, embedded systems, and RF communications. Corresponding author.
