SPEECH & AUDIO PROCESSING & RECOGNITION - 2021/2
Module code: EEEM030
Expected prior learning: Module EEE3008 – Digital Signal Processing A (6-dpA), or equivalent learning about signal processing.
Module purpose: The module covers the basic concepts, signal processing methods and human-computer interaction applications of speech processing and recognition, including auditory perception and psychoacoustics. You will be taught how to extract salient features from speech signals, how to design a model of spoken language, and how to perform recognition and training, and you will be given an insight into current research on spontaneous speech recognition, such as speaker adaptation and solutions for robustness to noise. Demonstrations, illustrations and working examples will be given. Successful students can either proceed to do PhDs or take up jobs in industrial R&D departments, i.e. roles at a higher level than operating software packages. The presented techniques have many other applications beyond speech, including expert systems and financial modelling.
Module provider: Electrical and Electronic Engineering
Module leader: WANG Wenwu (Elec Elec En)
Number of Credits: 15
ECTS Credits: 7.5
Framework: FHEQ Level 7
JACS code: I410
Module cap (Maximum number of students): N/A
Prerequisites / Co-requisites
Indicative content includes the following.
Lecture Component: Speech and Audio Processing
Lecturer: Dr W Wang
Hours: 15 lecture hours with interspersed problem classes
1 Introduction - Speech and language. Digital speech processing. Speech processing applications. Characteristics of speech signals.
2 Speech Production - Vocal tract description. Source-filter model. Origin of periodicity, formants and anti-resonance in terms of physical model. All-pole digital model of vocal tract. Relationship between physical model and phonemes.
3 Speech Perception - The structure of the ear. Frequency and amplitude response of ear. Perception units.
4-5 Signal Processing Techniques - Autocorrelation of speech signals. Pitch estimation from speech signals. Fourier analysis of speech signals. Spectrogram and power spectral density. Spectral analysis of voiced and unvoiced speech. Spectral analysis of formants and anti-resonances. Harmonic structure of speech.
6-7 Linear Prediction – Z-transform. Vocal tract transfer function. Stability of transfer function. Concept and model of linear prediction. All-pole source filter. Order selection and its relation to prediction error. LPC coefficient estimation. Speech synthesis from the LPC coefficients.
8 Inverse Filtering of Speech Signal - Separating the excitation source from the vocal tract filter. Vocal tract response – formant estimation. Pitch estimation from the residual. Robust linear prediction.
9-10 Cepstral Deconvolution - Definition of real cepstrum. Transforming convolution to sum by non-linear operation. The complex logarithm. The complex cepstrum. The quefrency unit. Pitch estimation via the cepstrum. Comparison of spectral envelope with that derived from linear prediction.
11-12 Audio recording and acoustics – Microphone types and directivity patterns, digital audio acquisition, wave propagation and acoustics, effects of reflections and reverberation.
13-15 Psychoacoustics – Loudness perception, pitch perception, auditory masking, timbre perception, spatial hearing.
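As an illustration of the topic "Pitch estimation from speech signals" in lectures 4-5, the following is a minimal sketch of autocorrelation-based pitch estimation. It is not taken from the module materials: the function name `estimate_pitch_autocorr`, the 50-400 Hz search range, the 8 kHz sampling rate and the synthetic test frame are all assumptions made for the example. The idea is that a voiced frame is nearly periodic, so its autocorrelation peaks at a lag equal to the pitch period.

```python
import math

def estimate_pitch_autocorr(frame, sample_rate, f_min=50.0, f_max=400.0):
    """Estimate the fundamental frequency (Hz) of one frame by autocorrelation.

    Hypothetical helper for illustration; f_min and f_max bound the search.
    """
    n = len(frame)
    mean = sum(frame) / n
    x = [s - mean for s in frame]                    # remove any DC offset
    lag_min = int(sample_rate / f_max)               # shortest period searched
    lag_max = min(int(sample_rate / f_min), n - 1)   # longest period searched
    best_lag, best_r = lag_min, float("-inf")
    for lag in range(lag_min, lag_max + 1):
        # Unnormalised autocorrelation at this lag
        r = sum(x[i] * x[i + lag] for i in range(n - lag))
        if r > best_r:
            best_lag, best_r = lag, r
    return sample_rate / best_lag

# Synthetic "voiced" frame (assumed values): 125 Hz fundamental plus its
# second harmonic, sampled at 8 kHz, 512 samples long.
fs = 8000
f0 = 125.0
frame = [math.sin(2 * math.pi * f0 * t / fs)
         + 0.5 * math.sin(2 * math.pi * 2 * f0 * t / fs)
         for t in range(512)]
pitch = estimate_pitch_autocorr(frame, fs)
print(f"Estimated pitch: {pitch:.1f} Hz")
```

On real speech the frame would normally be windowed and a voicing decision made first; this sketch omits both steps for brevity.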
Lecture Component: Automatic Speech Recognition
Lecturer: Dr P Jackson
Hours: 15 lecture hours
16-17 Introduction - Human speech communication. The role of ASR in human-computer interaction. Fundamentals of phonetics and speech perception.
18-19 Feature extraction - Vocal tract acoustics and linear prediction. Mel-frequency cepstrum. Difference features.
20 Template matching - Dynamic time warping. Isolated-word and connected-word recognition. Search pruning.
21-22 Hidden Markov models - Markov models and state topologies. HMM formulation. Discrete and continuous output pdfs.
23-24 Recognition and Viterbi decoding - Trellis diagrams. Forward and backward probabilities. Cumulative likelihoods and trace back.
25-26 Machine learning by expectation maximization - Baum-Welch training: derivation and implementation.
27-28 Large-vocabulary continuous speech recognition - Language modelling and discounting. Context-sensitivity and parameter tying.
29-30 Adaptation and robustness - Speaker adaptation: MLLR and MAP methods. Noise robustness: spectral subtraction and parallel model combination.
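To give a flavour of the topic "Recognition and Viterbi decoding" in lectures 23-24 (cumulative likelihoods and trace back), the following is a minimal sketch of Viterbi decoding for a discrete-output HMM, computed in the log domain. It is an illustrative example rather than the module's own code: the function names and the two-state, three-symbol toy model with its probabilities are assumptions chosen so that the result can be checked by hand.

```python
import math

def log_matrix(rows):
    """Element-wise natural log of a matrix given as a list of lists."""
    return [[math.log(p) for p in row] for row in rows]

def viterbi(obs, log_pi, log_a, log_b):
    """Return the most likely state sequence and its log-probability.

    obs    : list of observation symbol indices
    log_pi : log initial state probabilities, log_pi[i]
    log_a  : log transition probabilities, log_a[i][j] for i -> j
    log_b  : log output probabilities, log_b[j][symbol]
    """
    n_states = len(log_pi)
    # Initialisation: cumulative log-likelihood of each state at time 0
    delta = [log_pi[s] + log_b[s][obs[0]] for s in range(n_states)]
    backptr = []
    # Recursion: extend the best partial path into each state
    for o in obs[1:]:
        prev = delta
        delta, ptr = [], []
        for j in range(n_states):
            best_i = max(range(n_states), key=lambda i: prev[i] + log_a[i][j])
            delta.append(prev[best_i] + log_a[best_i][j] + log_b[j][o])
            ptr.append(best_i)
        backptr.append(ptr)
    # Termination and trace back through the stored pointers
    last = max(range(n_states), key=lambda s: delta[s])
    path = [last]
    for ptr in reversed(backptr):
        path.append(ptr[path[-1]])
    path.reverse()
    return path, delta[last]

# Toy model (assumed values): 2 states, 3 output symbols
pi = [0.6, 0.4]
a = [[0.7, 0.3],
     [0.4, 0.6]]
b = [[0.5, 0.4, 0.1],
     [0.1, 0.3, 0.6]]
observations = [0, 1, 2]

best_path, best_logp = viterbi(observations,
                               [math.log(p) for p in pi],
                               log_matrix(a), log_matrix(b))
print(best_path, math.exp(best_logp))
```

For this toy model the best path is state 0, state 0, state 1, with probability 0.6 x 0.5 x 0.7 x 0.4 x 0.3 x 0.6 = 0.01512. Working in the log domain avoids numerical underflow on long observation sequences.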
|Assessment type||Unit of assessment||Weighting|
|Examination||2 HOUR EXAM||80|
Not applicable: students failing a unit of assessment resit the assessment in its original format.
The assessment strategy for this module is designed to provide students with the opportunity to demonstrate the learning outcomes. The 2-hour closed-book written examination will assess students’ knowledge and understanding of the main concepts in speech and audio processing and recognition, and their ability to use such knowledge (such as linear predictive coding) to solve some basic problems in speech modelling and analysis (such as formant frequency estimation). The computer simulation assignment will assess the students’ technical skills and expertise in designing a simple speech synthesis/recognition system by applying the methods and concepts discussed in the lectures.
Thus, the summative assessment for this module consists of the following.
· The examination (80%) provides a limited choice of topics to ensure that good coverage of specialist knowledge is tested in a 2-hour closed-book written examination. The questions are constructed to assess the outcomes at various cognitive levels, often beginning with relating knowledge, then formulating a problem, performing an analysis, and reflecting on the result.
· The Speech Processing experiment (20%) is a computer-based experiment on speech synthesis/recognition. Students are required to submit a report, in both printed and electronic copy, of at least 1000 words (excluding figures, plots and tables; the page count is flexible, between 5 and 30 A4 pages), together with programming code (such as Matlab code) and synthesized/recognized audio samples in electronic copy, by the deadline of Tuesday of Week 9.
These deadlines are indicative. For confirmation of exact date and time, please check the Departmental assessment calendar issued to you.
Formative assessment and feedback
For the module, students will receive formative assessment/feedback in the following ways.
· During lectures, by question and answer sessions
· During lectures, by group discussions
· During worked example/revision classes
· By means of unassessed tutorial problems (with answers/model solutions)
· Via the marking of the assignment, both electronic file submissions and written reports
- Educate students in the particular aspects of speech processing and recognition, with concepts, engineering problems, worked examples and computer simulations.
|1||Demonstrate a systematic understanding of the main concepts in speech and audio processing and recognition.||K|
|2||Apply the concepts and methods learned to some speech processing problems, such as pitch estimation and speech synthesis.||KCP|
|3||Describe and explain the principles of pattern recognition in relation to speech recognition, including feature extraction, dynamic time warping, hidden Markov modelling, Gaussian mixture models, expectation maximization, language models and their application to large-vocabulary continuous speech recognition.||KPT|
|4||Formulate and analyse solutions to HMM problems, such as simple likelihood calculation, optimal state-sequence identification and parameter re-estimation.||KCT|
|5||Apply HMM theory to practical speech recognition tasks.||KP|
|6||Evaluate a speaker verification system based on objective measures of its operating characteristics.||KCPT|
C - Cognitive/analytical
K - Subject knowledge
T - Transferable skills
P - Professional/Practical skills
Overall student workload
Independent Study Hours: 117
Lecture Hours: 33
Methods of Teaching / Learning
The learning and teaching strategy is designed to achieve the following aims.
- To provide a broad engineering education in speech processing, machine learning, spoken language processing, pattern recognition and psychoacoustics.
- To develop analytical and computational competence using advanced techniques.
- To promote technical confidence through elaborating specialist techniques associated with speech processing and recognition.
- To provide experience of commonly used software tools relevant to speech and audio signal processing and to certain machine learning techniques.
- To cultivate transferable skills in note taking, knowledge representation, technical writing, time management and professional conduct.
Learning and teaching methods include the following.
- Lectures: 3 hours per week for 10 weeks
- Class discussion integrated within lecture (approximately 15 minutes per week)
- Designed in-class problems (approximately 15 minutes per week)
- Assignment in the form of computer simulations and reports (collectively 22.5 hours spread over 5 weeks)
- Timetabled revision classes (3 hours) which demonstrate the principles of the theory in quantitative worked examples and prepare students for the written examination.
Indicated Lecture Hours (which may also include seminars, tutorials, workshops and other contact time) are approximate and may include in-class tests where one or more of these are an assessment on the module. In-class tests are scheduled/organised separately from taught content and will be published on students' personal timetables, where they apply to the modules taken, as soon as they are finalised by central administration. This will usually be after the initial publication of the teaching timetable for the relevant semester.
Reading list for SPEECH & AUDIO PROCESSING & RECOGNITION : http://aspire.surrey.ac.uk/modules/eeem030
Programmes this module appears in
|Electronic Engineering with Computer Systems MEng||1||Optional||A weighted aggregate mark of 50% is required to pass the module|
|Electronic Engineering MEng||1||Optional||A weighted aggregate mark of 50% is required to pass the module|
|Computer and Internet Engineering MEng||1||Optional||A weighted aggregate mark of 50% is required to pass the module|
|Communication Systems MEng||1||Optional||A weighted aggregate mark of 50% is required to pass the module|
|Electronic Engineering with Professional Postgraduate Year MSc||1||Optional||A weighted aggregate mark of 50% is required to pass the module|
|Computer Vision, Robotics and Machine Learning MSc||1||Optional||A weighted aggregate mark of 50% is required to pass the module|
|Electronic Engineering MSc||1||Optional||A weighted aggregate mark of 50% is required to pass the module|
|Communications Networks and Software MSc||1||Optional||A weighted aggregate mark of 50% is required to pass the module|
|Artificial Intelligence MSc||1||Optional||A weighted aggregate mark of 50% is required to pass the module|
|Biomedical Engineering MEng||1||Optional||A weighted aggregate mark of 50% is required to pass the module|
Please note that the information detailed within this record is accurate at the time of publishing and may be subject to change. This record contains information for the most up to date version of the programme / module for the 2021/2 academic year.