Abstract
Keyword spotting (or wake-word detection) is a major component of human–machine interaction. The goals of a keyword spotting system are to maximize detection accuracy at a low false alarm rate while minimizing model footprint and computation. To better satisfy these requirements, we propose an end-to-end neural architecture that combines deformable convolution with an attention mechanism. The deformable convolution layer drives the model to focus on the regions of the input that contain human speech, while the attention mechanism further focuses on the part of the speech segment that is most important for keyword spotting. Our experimental results on the real-world “Hey Snips” dataset show that our system significantly outperforms recent approaches in detection quality, model size, and complexity. With only 78 K parameters, the model achieves a false rejection rate (FRR) of 0.005% on clean samples and 0.531% in noisy conditions, at 0.5 false alarms (FA) per hour.
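The operating point quoted above (FRR at a fixed number of false alarms per hour) can be illustrated with a minimal sketch. It assumes one detection score per positive utterance and one peak score per negative segment; the function name `frr_at_fa_per_hour`, its parameters, and the toy scores are hypothetical and are not taken from the paper's evaluation code.

```python
import math


def frr_at_fa_per_hour(pos_scores, neg_scores, neg_hours, target_fa_per_hour):
    """Return (frr, threshold) at a false-alarm budget.

    Assumed decision rule (illustrative): the keyword is detected
    when score > threshold.
    """
    # Number of false alarms allowed over the whole negative set.
    allowed = math.floor(target_fa_per_hour * neg_hours)
    ranked = sorted(neg_scores, reverse=True)

    if allowed >= len(ranked):
        # Every negative may fire, so any positive score is accepted.
        threshold = float("-inf")
    else:
        # Setting the threshold at the (allowed + 1)-th highest negative
        # score lets at most `allowed` negatives exceed it.
        threshold = ranked[allowed]

    # A positive is falsely rejected when it does not exceed the threshold.
    rejected = sum(1 for s in pos_scores if s <= threshold)
    return rejected / len(pos_scores), threshold


# Toy data: 4 keyword utterances and 2 hours of negative audio.
frr, thr = frr_at_fa_per_hour(
    pos_scores=[0.9, 0.8, 0.4, 0.95],
    neg_scores=[0.1, 0.7, 0.3, 0.5],
    neg_hours=2.0,
    target_fa_per_hour=0.5,
)
print(f"threshold={thr}, FRR={frr:.2%}")
```

With these toy scores the budget is one false alarm in total, so the threshold lands at the second-highest negative score and one of the four positives is rejected.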
Additional information
Notes on contributors
Huu Binh Nguyen
Huu Binh Nguyen received his engineering degree (1996) and MS degree (2006) in electrical engineering from Hanoi University of Science and Technology (HUST), Vietnam. He is a PhD student at the Department of Instrumentation and Industrial Informatics – School of Electrical Engineering – HUST. His research interests include signal processing and speech recognition. Email: [email protected]
Van Hai Duong
Van Hai Duong received his engineering degree (2019) in control engineering and automation from Hanoi University of Science and Technology (HUST). He is a master's student at the Department of Instrumentation and Industrial Informatics, HUST. His current interests include signal processing, keyword spotting, automatic speech recognition, and embedded systems. Email: [email protected]
Anh Xuan Tran Thi
Anh Xuan Tran Thi received her engineering degree (2008) and MS degree (2010) in electrical engineering from Hanoi University of Science and Technology (HUST), Vietnam, and her PhD degree in signal, image, speech, and telecommunications from INP Grenoble (Institut National Polytechnique de Grenoble), France, in 2016. She is a lecturer/researcher at the Department of Instrumentation and Industrial Informatics – School of Electrical Engineering – HUST. Her research focuses on signal and speech processing and their applications, speech recognition, and embedded systems. Email: [email protected]
Quoc Cuong Nguyen
Quoc Cuong Nguyen received his engineering degree (1996) and MS degree (1998) in electrical engineering from Hanoi University of Science and Technology (HUST), Vietnam, and his PhD degree in signal, image, speech, and telecommunications from INP Grenoble (Institut National Polytechnique de Grenoble), France, in 2002. He is a lecturer/researcher at the Department of Instrumentation and Industrial Informatics – School of Electrical Engineering – HUST. His research interests include signal processing, speech recognition, embedded systems, and RF communications. Corresponding author. Email: [email protected]