SPEECH & AUDIO PROCESSING & RECOGNITION - 2024/5
Module code: EEEM030
Expected prior learning: Module EEE3008 – Fundamentals of Digital Signal Processing, or equivalent learning about signal processing.
Module purpose: The module discusses the basic concepts, signal processing methods and human-computer interaction applications of speech processing and recognition, including auditory perception and psychoacoustics. You will be taught how to extract salient features from speech signals, how to design a model of spoken language, and how to perform recognition and training, and you will be given an insight into current research on spontaneous speech recognition, such as speaker adaptation and solutions for robustness to noise. Demonstrations, interesting illustrations and working examples will be given. Successful students can either proceed to do PhDs or get jobs in the R&D departments of industry, i.e. jobs that are at a higher level than mere software package operators. The presented techniques have many other applications beyond speech, including expert systems and financial modelling.
Module EEEM030 contributes to the development of students' knowledge in audio and speech processing and recognition, which may be useful when taking other modules such as:
- EEEM071 Advanced Topics in Computer Vision and Deep Learning
- EEEM004 60 Credit Standard Project
- EEEM005 AI and AI Programming
- EEEM066 Fundamentals of Machine Learning
- EEEM067 AR, VR and Metaverse
- EEEM068 Applied Machine Learning
Module EEEM030 contributes to students' knowledge of audio and speech processing and is thus useful for students taking a 60-credit project (EEEM004) related to audio and speech processing and recognition. EEEM030 is related to EEEM005, EEEM066 and EEEM068 because machine learning/AI techniques are used for speech and speaker recognition; it therefore develops students' knowledge of machine learning/AI, which is beneficial when taking those modules. One application of audio and speech processing is spatial sound production and reproduction, a key enabling technology for AR, VR and the Metaverse. Knowledge gained from EEEM030 would therefore be useful when taking module EEEM067.
Module EEEM030 also benefits from knowledge gained from other modules such as:
- EEE3008 Digital Signal Processing
- EEE1033 Computer and Digital Logic
- EEE1035 Programming in C
- EEE3042 Audio and Video Processing
- EEE3032 Computer Vision and Pattern Recognition
The modules EEE1033 and EEE1035 provide students with useful programming skills, which will help them complete the programming-based coursework components by turning signal processing theories and methods into working program code. The module EEE3008 provides students with fundamental digital signal processing knowledge and skills, which are essential for understanding their application to audio and speech data. EEE3042 covers both audio and video processing and coding, and the audio-related material is highly relevant and thus useful for the EEEM030 module. The pattern recognition skills gained from EEE3032 would be useful for understanding how pattern recognition algorithms are applied to speech data to achieve speech and speaker recognition.
Module provider: Computer Science and Electronic Eng
Module leader: WANG Wenwu (CS & EE)
Number of Credits: 15
ECTS Credits: 7.5
Framework: FHEQ Level 7
JACS code: I410
Module cap (Maximum number of students): 90
Overall student workload
Independent Learning Hours: 85
Lecture Hours: 11
Tutorial Hours: 10
Laboratory Hours: 8
Guided Learning: 10
Captured Content: 26
Prerequisites / Co-requisites
Indicative content includes the following.
Lecture Component: Speech and Audio Processing
Introduction: Speech and language. Digital speech processing. Speech processing applications. Characteristics of speech signals.
Speech Production: Vocal tract description. Source-filter model. Origin of periodicity, formants and anti-resonance in terms of physical model. All-pole digital model of vocal tract. Relationship between physical model and phonemes.
Speech Perception: The structure of the ear. Frequency and amplitude response of ear. Perception units.
Signal Processing Techniques: Autocorrelation of speech signals. Pitch estimation from speech signals. Fourier analysis of speech signal. Spectrogram and power spectral density. Spectral analysis of voiced and unvoiced speech. Spectral analysis of formants and anti-resonances. Harmonic structure of speech.
Linear Prediction: Z-transform. Vocal tract transfer function. Stability of transfer function. Concept and model of linear prediction. All-pole source filter. Order selection and its relation to prediction error. LPC coefficients estimation. Speech synthesis from the LPC coefficients.
Inverse Filtering of Speech Signal: Separating source from excitation. Vocal tract response – formant estimation. Pitch estimation from the residual. Robust linear prediction.
Cepstral Deconvolution: Definition of real cepstrum. Transforming convolution to sum by non-linear operation. The complex logarithm. The complex cepstrum. The quefrency unit. Pitch estimation via the cepstrum. Comparison of spectral envelope with that derived from linear prediction.
Audio recording and acoustics: Microphone types and directivity patterns, digital audio acquisition, wave propagation and acoustics, effects of reflections and reverberation.
Psychoacoustics: Loudness perception, pitch perception, auditory masking, timbre perception, spatial hearing.
Lecture Component: Automatic speech recognition
Introduction: Human speech communication. The role of ASR in human-computer interaction. Fundamentals of phonetics and speech perception.
Feature extraction: Vocal tract acoustics and linear prediction. Mel-frequency cepstrum. Difference features.
Template matching: Dynamic time warping. Isolated-word and connected-word recognition. Search pruning.
Hidden Markov models: Markov models and state topologies. HMM formulation. Discrete and continuous output pdfs.
Recognition and Viterbi decoding: Trellis diagrams. Forward and backward probabilities. Cumulative likelihoods and trace back.
Machine learning by Expectation maximization: Baum-Welch training: derivation and implementation.
Large-vocabulary continuous speech recognition: Language modeling and discounting. Context-sensitivity and parameter tying.
Adaptation and robustness: Speaker adaptation: MLLR and MAP methods. Noise robustness: spectral subtraction and parallel model combination.
|Assessment type||Unit of assessment||Weighting (%)|
|Coursework||COURSEWORK 1 - SPEECH PROCESSING||15|
|Coursework||COURSEWORK 2 - SPEECH RECOGNITION||15|
|Examination||2HR INVIGILATED CLOSED BOOK EXAM||70|
The assessment strategy for this module is designed to provide students with the opportunity to demonstrate the learning outcomes. The computer simulation assignment will assess the students’ technical skills and expertise in designing a simple speech synthesis/recognition system by applying the methods and concepts discussed in the lectures. The written examination will assess students’ knowledge and understanding of the main concepts in speech and audio processing and recognition, and their ability to use such knowledge (such as linear predictive coding) to solve some basic problems in speech modelling and analysis (such as formant frequency estimation).
Thus, the summative assessment for this module consists of the following.
Speech Processing assignment (15%) is a computer-based experiment on speech synthesis. The students are required to submit a report (in electronic copy) of at least 1000 words, excluding figures, plots, and tables (the page count is flexible, 5-30 A4 pages), together with programming code (such as MATLAB code) and synthesized audio samples in electronic copy, by the deadline of Tuesday of Week 7.
Speech Recognition assignment (15%) is a computer-based experiment on speech recognition. The students are required to submit a report (in electronic copy) with solutions to the questions set out for the speech recognition task, by the deadline of Tuesday of Week 11.
The examination (70%) provides a limited choice of topics to ensure that good coverage of specialist knowledge is tested in a written examination. The questions are constructed to assess the outcomes at various cognitive levels, often beginning with relating knowledge, then formulating a problem, performing an analysis, and reflecting on the result.
These deadlines are indicative. For confirmation of exact date and time, please check the assessment calendar issued to you.
Formative assessment and feedback
For the module, students will receive formative assessment/feedback in the following ways.
- During lectures, by question and answer sessions
- During lectures, by group discussions
- During worked example/revision classes
- By means of unassessed tutorial problems (with answers/model solutions)
- Via the marking of the assignment, both electronic file submissions and written reports
Module aims
- Educate students in the particular aspects of speech processing and recognition, with concepts, engineering problems, worked examples and computer simulations.
- The module also aims to provide opportunities for students to learn about the Surrey Pillars listed below.
|001||Demonstrate a systematic understanding of the main concepts in speech and audio processing and recognition.||K||M1|
|002||Apply the concepts and methods learned to some speech processing problems, such as pitch estimation and speech synthesis.||KC||M2|
|003||Describe and explain the principles of pattern recognition in relation to speech recognition, including feature extraction, dynamic time warping, hidden Markov modelling, Gaussian mixture models, expectation maximization, language models and their application to large-vocabulary continuous speech recognition.||KPT||M3, M4|
|004||Formulate and analyse solutions to HMM problems, such as simple likelihood calculation, optimal state-sequence identification and parameter re-estimation.||KCT||M1, M3|
|005||Apply HMM theory to practical speech recognition tasks.||KP||M2, M3|
|006||Evaluate a speaker verification system based on objective measures of its operating characteristics and report the outcomes in written format.||KCPT||M5, M6, M16, M17|
C - Cognitive/analytical
K - Subject knowledge
T - Transferable skills
P - Professional/Practical skills
Methods of Teaching / Learning
The learning and teaching strategy is designed to achieve the following aims.
- To provide a broad engineering education in speech processing, machine learning, spoken language processing, pattern recognition and psychoacoustics.
- To develop analytical and computational competence using advanced techniques.
- To promote technical confidence through elaborating specialist techniques associated with speech processing and recognition.
- To provide experience of commonly used software tools relevant to speech and audio signal processing and to certain machine learning techniques.
- To cultivate transferable skills in note taking, knowledge representation, technical writing, time management and professional conduct.
Learning and teaching methods include the following.
- Class discussion integrated within lecture
- Designed in-class problems
- Assignment in the form of computer simulations and reports
- Timetabled revision classes which demonstrate the principles of the theory in quantitative worked examples and prepare students for the written examination.
Indicated Lecture Hours (which may also include seminars, tutorials, workshops and other contact time) are approximate and may include in-class tests where one or more of these are an assessment on the module. In-class tests are scheduled/organised separately from taught content and will be published on students' personal timetables, where they apply to the modules taken, as soon as they are finalised by central administration. This will usually be after the initial publication of the teaching timetable for the relevant semester.
Upon accessing the reading list, please search for the module using the module code: EEEM030
This module has a capped number and may not be available to exchange students. Please check with the International Engagement Office.
The Curriculum Framework at Surrey is committed to developing graduates with strengths in five pillars: Digital Capabilities, Employability, Sustainability, Global and Cultural Capabilities, and Resourcefulness and Resilience. This module is designed to allow students to develop knowledge, skills and capabilities in the following areas:
Digital capabilities: Students will develop skills in applying digital signal processing and machine learning methods to audio and speech, which are key digital technologies in electronic engineering. They will gain practical experience in audio and speech processing through coursework components that involve implementing signal processing algorithms in programming languages such as MATLAB and Python.
Employability: This module provides foundational skills in digital signal processing (such as sampling, quantization, spectral analysis, convolution, correlation, and linear prediction), audio and speech processing (such as time-frequency analysis, speech production modelling, power spectral density, spectrogram analysis, cepstrum analysis, liftering, mel-frequency cepstral coefficients), speech and speaker recognition (such as hidden Markov models, neural networks), and audio and speech perception (such as temporal and spectral masking, loudness, pitch, and spatial perception), which are all important topics for a wide range of industry applications such as speech processing, digital communications, human-computer interaction, computational/robotic audition, machine listening, and artificial intelligence/machine learning.
Sustainability: This module discusses efficiency in audio and speech production modelling, in particular how to reduce the number of parameters in the model, which is important for the sustainable use of digital storage space and of computing resources through compressed models.
Resourcefulness and Resilience: This module develops students' skills in using the audio and speech processing methods they have learned in lecture material to solve practical problems set in the tutorial questions, exams, and programming-based coursework.
Programmes this module appears in
|Computer and Internet Engineering MEng||1||Optional||A weighted aggregate mark of 50% is required to pass the module|
|Artificial Intelligence MSc||1||Optional||A weighted aggregate mark of 50% is required to pass the module|
|Electronic Engineering with Computer Systems MEng||1||Optional||A weighted aggregate mark of 50% is required to pass the module|
|Electronic Engineering MEng||1||Optional||A weighted aggregate mark of 50% is required to pass the module|
|Computer Vision, Robotics and Machine Learning MSc||1||Optional||A weighted aggregate mark of 50% is required to pass the module|
|Electronic Engineering MSc||1||Optional||A weighted aggregate mark of 50% is required to pass the module|
|Communications Networks and Software MSc||1||Optional||A weighted aggregate mark of 50% is required to pass the module|
|Electronic Engineering with Professional Postgraduate Year MSc||1||Optional||A weighted aggregate mark of 50% is required to pass the module|
|Computer Science MEng||1||Optional||A weighted aggregate mark of 50% is required to pass the module|
|Biomedical Engineering MEng||1||Optional||A weighted aggregate mark of 50% is required to pass the module|
Please note that the information detailed within this record is accurate at the time of publishing and may be subject to change. This record contains information for the most up to date version of the programme / module for the 2024/5 academic year.