Spotting Voice Keywords and Beyond – Harnessing Audio Data in Deep Learning

Technical talk | English

Theatre 16: Track 4

Thursday - 13.10 to 13.50 - Technical


Voice-based natural-language interfaces are a good example of how AI is transforming modern audio applications. Taken together, audio, speech, and acoustics are often regarded as the fastest-growing area for deep learning adoption after computer vision.

How is developing intelligent voice interfaces different from building vision systems for self-driving cars? Deep learning for speech and audio applications still trails the pervasiveness it has reached in computer vision. Speech and audio engineers not only have a smaller body of literature to rely on, but also far less data available to train new, complex models effectively. Audio data also typically requires much more pre-processing than images and video, owing to a combination of data sampling rates, the complexity of published reference models, and the computational resources typically available.

In this talk, we offer a deep dive into these challenges. The session follows the practical development workflow for a keyword-spotting system, such as those used to detect the wake-up phrase in mobile devices or voice assistants (e.g. “Hey Siri”, “Ok Google”). We present a suitable deep learning model and review technical best practices specific to working with audio data.

Throughout this presentation, we use MATLAB code examples to reinforce the key concepts. We aim to make the ideas easier to reuse in practice for audio, acoustics, and speech processing practitioners, who hold the data-related expertise that the applications discussed depend on. After quickly selecting a viable deep learning model based on long short-term memory (LSTM) layers, we shift our focus to the actual data used to train the network. We discuss different strategies for creating labeled datasets tailored to the needs of a specific learning task, and we show key principles for importing and working with large audio datasets. We then present commonly used signal processing methods for transforming raw audio recordings into input data that typical deep networks can consume. Finally, we show a few data augmentation techniques and review how they help improve system robustness and manage network complexity. Across all topics, we keep an eye on computational implications and sustainable implementation strategies, including the use of efficient algorithms and parallel hardware.
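As a rough illustration of two of the steps mentioned above (turning a raw recording into a sequence of per-frame features, and augmenting data by mixing in noise), the following minimal sketch may help. It is not the talk's MATLAB code: it uses plain Python, and the function names, the 25 ms / 10 ms framing at 16 kHz, the log-energy feature, and the white Gaussian noise model are all assumptions chosen for brevity. A practical keyword-spotting front end would more likely use spectral features.

```python
import math
import random


def frame_log_energy(samples, frame_len=400, hop=160):
    """Split a waveform into overlapping frames and return one
    log-energy value per frame (hypothetical, simplified feature)."""
    feats = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len]
        energy = sum(x * x for x in frame) / frame_len
        feats.append(math.log(energy + 1e-10))  # small floor avoids log(0)
    return feats


def add_noise_at_snr(samples, snr_db, rng):
    """Return a copy of `samples` with white Gaussian noise added,
    scaled so the signal-to-noise ratio equals `snr_db` decibels."""
    signal_power = sum(x * x for x in samples) / len(samples)
    noise = [rng.gauss(0.0, 1.0) for _ in samples]
    noise_power = sum(n * n for n in noise) / len(noise)
    # Choose scale so that signal_power / (scale**2 * noise_power)
    # equals 10 ** (snr_db / 10).
    scale = math.sqrt(signal_power / (noise_power * 10 ** (snr_db / 10)))
    return [x + scale * n for x, n in zip(samples, noise)]


def measured_snr_db(clean, noisy):
    """Measure the SNR in decibels of `noisy` relative to `clean`."""
    signal_power = sum(x * x for x in clean) / len(clean)
    noise_power = sum((y - x) ** 2 for x, y in zip(clean, noisy)) / len(clean)
    return 10 * math.log10(signal_power / noise_power)


# Assumed test signal: one second of a 440 Hz tone sampled at 16 kHz.
sample_rate = 16000
clean = [math.sin(2 * math.pi * 440 * n / sample_rate)
         for n in range(sample_rate)]

rng = random.Random(0)  # fixed seed so the augmentation is repeatable
noisy = add_noise_at_snr(clean, snr_db=10.0, rng=rng)

features = frame_log_energy(noisy)
print(len(features), "frames;", "SNR =", round(measured_snr_db(clean, noisy), 3), "dB")
```

The resulting list of per-frame values is the kind of sequence a recurrent layer such as an LSTM consumes one time step at a time. Generating several noisy copies of each recording at different SNR values is one simple way to enlarge a small training set.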

Despite the specific technical focus on audio data and applications, we believe the perspective and the topics covered are relevant to the wider range of industries that are increasing their adoption of deep learning technologies. Engineers can count on high-quality research publications to learn about new neural network architectures and their evaluation for a variety of applications. However, most industry investment in deep learning currently goes into creating and managing the large amounts of data needed to train and evaluate those same neural models. Deep networks need spectacularly large amounts of data to train: even the most advanced network cannot be expected to learn a given task unless it has been trained with enough data samples to capture all the different real-world scenarios involved in that task. This idea has interesting repercussions for the balance of engineering roles, technical competencies, and computational tools involved in developing deep learning systems for new engineering domains. On the one hand, the need for expertise in deep learning models is unquestionably set to increase. On the other, the role of domain-specific tools and expertise is also set to become increasingly important. Only by creating high-value data and enabling high-quality data processing can one train and develop highly complex deep learning systems.