Projects

Speaker Recognition From Degraded Audio Samples

Detecting human speech and identifying its source, i.e., the speaker, from speech audio is an active area of research in the machine learning and biometrics communities. As with other types of digital signals such as images and video, an audio signal can undergo degradations during its generation, propagation, and recording. Identifying the speaker from such degraded speech data is a challenging task and an open research problem. In this research project, we develop deep learning-based algorithms for speaker recognition from degraded audio signals. We use speech features such as Mel-Frequency Cepstral Coefficients (MFCC) and Linear Predictive Coding (LPC) to represent the audio signals. We design one-dimensional convolutional neural networks (1D-CNNs) that learn speaker-dependent features from the MFCC- and LPC-based speech representations to perform speaker recognition. We have also developed a 1D-CNN-based audio filterbank for extracting robust speaker-dependent speech features directly from raw speech audio.
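To make the pipeline concrete, here is a minimal sketch (not the project's actual code) of the general approach: MFCC features are extracted with librosa and fed to a small 1D-CNN classifier in PyTorch. The layer sizes, the 40-coefficient MFCC setting, and the number of speakers are illustrative assumptions.

```python
import librosa
import torch
import torch.nn as nn

def extract_mfcc(wav_path, sr=16000, n_mfcc=40):
    """Load an audio file and return an MFCC matrix of shape (n_mfcc, frames)."""
    audio, _ = librosa.load(wav_path, sr=sr)
    return librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=n_mfcc)

class Speaker1DCNN(nn.Module):
    """1D-CNN that convolves along the time axis of the MFCC representation."""
    def __init__(self, n_mfcc=40, n_speakers=100):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(n_mfcc, 128, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.MaxPool1d(2),
            nn.Conv1d(128, 256, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),   # pool over time -> fixed-length embedding
        )
        self.classifier = nn.Linear(256, n_speakers)

    def forward(self, x):              # x: (batch, n_mfcc, frames)
        emb = self.features(x).squeeze(-1)
        return self.classifier(emb)

# Usage (hypothetical file name):
# mfcc = extract_mfcc("sample.wav")
# logits = Speaker1DCNN()(torch.tensor(mfcc).unsqueeze(0))
```

The same classifier shape applies to LPC-based inputs by swapping the feature extractor; the learned filterbank variant replaces `extract_mfcc` with convolutional layers applied directly to the raw waveform.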

DeepTalk: Vocal Style Encoding for Speaker Recognition and Speech Synthesis

Speaker recognition is the task of determining a person's identity from their voice. The human voice, as a biometric modality, is a combination of physiological and behavioral characteristics. The physical traits of the voice production system determine the voice's physiological characteristics, while prosodic traits (pitch, timbre) and high-level traits (lexicon) impart its behavioral characteristics. In this work, we develop deep learning-based models for extracting speaker-dependent behavioral speech characteristics. These behavioral characteristics are then combined with speaker-dependent physiological speech characteristics to improve speaker recognition performance on degraded audio signals. We further design a deep learning-based speech synthesis framework that uses these behavioral speech characteristics to generate highly realistic synthetic speech samples.
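The sketch below illustrates one simple way such a combination could work; it is not the DeepTalk implementation. A behavioral (vocal style) embedding is fused with a physiological speaker embedding by normalization and concatenation, and verification is done by cosine scoring. The embedding dimensions and the concatenation-based fusion are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def fuse_embeddings(physiological: torch.Tensor, behavioral: torch.Tensor) -> torch.Tensor:
    """L2-normalize each embedding, then concatenate them into one speaker representation."""
    phys = F.normalize(physiological, dim=-1)
    behav = F.normalize(behavioral, dim=-1)
    return torch.cat([phys, behav], dim=-1)

def verification_score(enroll: torch.Tensor, probe: torch.Tensor) -> torch.Tensor:
    """Cosine similarity between enrollment and probe representations."""
    return F.cosine_similarity(enroll, probe, dim=-1)

# Usage with dummy embeddings (hypothetical sizes: 192-d physiological, 256-d behavioral):
enroll = fuse_embeddings(torch.randn(192), torch.randn(256))
probe = fuse_embeddings(torch.randn(192), torch.randn(256))
print(verification_score(enroll, probe).item())
```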

Fusing Face and Voice Modalities for Improved Biometric Recognition

Indoor video surveillance systems often use the face modality to establish the identity of a person of interest. However, the face image may not offer sufficient discriminatory information in many scenarios due to substantial variations in pose, illumination, expression, resolution, and distance between the subject and the camera. In such cases, the inclusion of an additional biometric modality can benefit the recognition process. In this regard, we consider the fusion of the voice and face modalities to enhance recognition accuracy. We use current state-of-the-art deep learning-based methods to perform face and speaker recognition and establish the baseline performance of the individual modalities. We explore multiple fusion schemes, at the data, feature, score, and decision levels, to combine the face and speaker modalities for effective biometric recognition in video data.
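As a concrete illustration of one of the fusion schemes mentioned above, the following is a minimal score-level fusion sketch: the face and voice match scores are min-max normalized and combined with a weighted sum. The weights and the example score values are illustrative assumptions, not results from the project.

```python
import numpy as np

def min_max_normalize(scores: np.ndarray) -> np.ndarray:
    """Map raw matcher scores to [0, 1] so the two modalities are comparable."""
    return (scores - scores.min()) / (scores.max() - scores.min() + 1e-12)

def fuse_scores(face_scores: np.ndarray, voice_scores: np.ndarray,
                w_face: float = 0.6, w_voice: float = 0.4) -> np.ndarray:
    """Weighted-sum fusion of normalized face and voice match scores."""
    return w_face * min_max_normalize(face_scores) + w_voice * min_max_normalize(voice_scores)

# Example: one probe scored against a small gallery of candidate identities.
face = np.array([0.91, 0.40, 0.35])
voice = np.array([0.55, 0.80, 0.20])
fused = fuse_scores(face, voice)
print("Predicted identity index:", int(np.argmax(fused)))
```

Feature-level fusion would instead concatenate the face and speaker embeddings before matching, while decision-level fusion combines the per-modality identity decisions, e.g., by majority vote.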