SPEECH & AUDIO PROCESSING & RECOGNITION - 2020/1

Module code: EEEM030

Module Overview

Expected prior learning: Module EEE3008–Digital Signal Processing A (6-dpA), or equivalent learning about signal processing.

Module purpose: The module covers the basic concepts, signal processing methods and human-computer interaction applications of speech processing and recognition, including auditory perception and psychoacoustics. You will be taught how to extract salient features from speech signals, how to design a model of spoken language, and how to perform recognition and training. You will also be given an insight into current research on spontaneous speech recognition, such as speaker adaptation and solutions for robustness to noise. Demonstrations, interesting illustrations and working examples will be given. Successful students can proceed to a PhD or to jobs in industrial R&D departments, i.e. roles at a higher level than operating software packages. The techniques presented have many applications beyond speech, including expert systems and financial modelling.

Module provider

Electrical and Electronic Engineering

Module Leader

WANG Wenwu (Elec Elec En)

Number of Credits: 15

ECTS Credits: 7.5

Framework: FHEQ Level 7

Module cap (Maximum number of students): 90

Overall student workload

Independent Learning Hours: 117

Lecture Hours: 33

Module Availability

Semester 1

Prerequisites / Co-requisites

None.

Module content

Indicative content includes the following.

Lecture Component: Speech and Audio Processing

Lecturer: Dr W Wang

Hours: 15 lecture hours with interspersed problem classes

1 Introduction - Speech and language. Digital speech processing. Speech processing applications. Characteristics of speech signals.

2 Speech Production - Vocal tract description. Source-filter model. Origin of periodicity, formants and anti-resonance in terms of physical model. All-pole digital model of vocal tract. Relationship between physical model and phonemes.

3 Speech Perception - The structure of the ear. Frequency and amplitude response of ear. Perception units.

4-5 Signal Processing Techniques - Autocorrelation of speech signals. Pitch estimation from speech signals. Fourier analysis of speech signal. Spectrogram and power spectrum density. Spectral analysis of voiced and unvoiced speech. Spectral analysis of formants and antiresonances. Harmonic structure of speech.

6-7 Linear Prediction – Z-transform. Vocal tract transfer function. Stability of transfer function. Concept and model of linear prediction. All-pole source filter. Order selection and its relation to prediction error. LPC coefficients estimation. Speech synthesis from the LPC coefficients.

8 Inverse Filtering of Speech Signal - Separating the excitation source from the vocal tract filter. Vocal tract response – formant estimation. Pitch estimation from the residual. Robust linear prediction.

9-10 Cepstral Deconvolution - Definition of real cepstrum. Transforming convolution to sum by non-linear operation. The complex logarithm. The complex cepstrum. The quefrency unit. Pitch estimation via the cepstrum. Comparison of spectral envelope with that derived from linear prediction.

11-12 Audio recording and acoustics – Microphone types and directivity patterns, digital audio acquisition, wave propagation and acoustics, effects of reflections and reverberation.

13-15 Psychoacoustics – Loudness perception, pitch perception, auditory masking, timbre perception, spatial hearing.
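As an illustration of the "pitch estimation from speech signals" topic in lectures 4-5, the following is a minimal sketch of autocorrelation-based pitch estimation. It is not taken from the module materials: the function name `estimate_pitch_autocorr`, the search range `fmin`/`fmax`, and the synthetic three-harmonic test frame are all assumptions made for this example.

```python
import numpy as np

def estimate_pitch_autocorr(frame, fs, fmin=60.0, fmax=400.0):
    """Estimate the pitch (Hz) of one voiced frame from its autocorrelation peak.

    fmin and fmax are assumed bounds on the plausible pitch range.
    """
    frame = frame - np.mean(frame)          # remove any DC offset
    # Full autocorrelation; keep the non-negative lags only.
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lag_min = int(fs / fmax)                # shortest period considered
    lag_max = min(int(fs / fmin), len(r) - 1)
    # The strongest peak in the allowed lag range gives the pitch period.
    lag = lag_min + int(np.argmax(r[lag_min:lag_max + 1]))
    return fs / lag

# Synthetic "voiced" frame: a 125 Hz fundamental plus two weaker harmonics.
fs = 8000
t = np.arange(int(0.04 * fs)) / fs          # 40 ms frame
frame = sum(np.sin(2 * np.pi * 125.0 * k * t) / k for k in (1, 2, 3))
pitch = estimate_pitch_autocorr(frame, fs)
print(f"Estimated pitch: {pitch:.1f} Hz")
```

Real speech would normally need a voicing decision and some smoothing across frames; this sketch only handles a single clean voiced frame.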

Lecture Component: Automatic Speech Recognition

Lecturer: Dr P Jackson

Hours: 15 lecture hours

16-17 Introduction - Human speech communication. The role of ASR in human-computer interaction. Fundamentals of phonetics and speech perception.

18-19 Feature extraction - Vocal tract acoustics and linear prediction. Mel-frequency cepstrum. Difference features.

20 Template matching - Dynamic time warping. Isolated-word and connected-word recognition. Search pruning.

21-22 Hidden Markov models - Markov models and state topologies. HMM formulation. Discrete and continuous output pdfs.

23-24 Recognition and Viterbi decoding - Trellis diagrams. Forward and backward probabilities. Cumulative likelihoods and trace back.

25-26 Machine learning by expectation maximization - Baum-Welch training: derivation and implementation.

27-28 Large-vocabulary continuous speech recognition - Language modelling and discounting. Context-sensitivity and parameter tying.

29-30 Adaptation and robustness - Speaker adaptation: MLLR and MAP methods. Noise robustness: spectral subtraction and parallel model combination.
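As an illustration of the "Recognition and Viterbi decoding" topic in lectures 23-24, the following is a minimal sketch of log-domain Viterbi decoding for a discrete-output HMM, with cumulative likelihoods and trace back. It is not taken from the module materials: the function name `viterbi` and the two-state toy model (its initial, transition and output probabilities) are assumptions chosen only to make the example runnable.

```python
import numpy as np

def viterbi(log_pi, log_A, log_B, obs):
    """Return the most likely state sequence and its log-likelihood.

    log_pi: (N,) log initial state probabilities
    log_A:  (N, N) log transition probabilities, log_A[i, j] = log P(j | i)
    log_B:  (N, M) log output probabilities, log_B[j, k] = log P(symbol k | j)
    obs:    sequence of observed symbol indices
    """
    n_states = log_pi.shape[0]
    T = len(obs)
    delta = np.full((T, n_states), -np.inf)   # cumulative log-likelihoods
    psi = np.zeros((T, n_states), dtype=int)  # back-pointers for trace back
    delta[0] = log_pi + log_B[:, obs[0]]
    for t in range(1, T):
        # scores[i, j]: best path ending in state i at t-1, then moving to j
        scores = delta[t - 1][:, None] + log_A
        psi[t] = np.argmax(scores, axis=0)
        delta[t] = scores[psi[t], np.arange(n_states)] + log_B[:, obs[t]]
    # Trace back from the best final state.
    path = [int(np.argmax(delta[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t, path[-1]]))
    path.reverse()
    return path, float(np.max(delta[-1]))

# Hypothetical two-state, three-symbol model (illustrative values only).
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3],
              [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1],
              [0.1, 0.3, 0.6]])
path, log_p = viterbi(np.log(pi), np.log(A), np.log(B), [0, 1, 2])
print(path, np.exp(log_p))
```

Working in the log domain avoids numerical underflow on long observation sequences, which is why sums of log-probabilities replace the products of the textbook formulation.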

Assessment pattern

Assessment type	Unit of assessment	Weighting (%)
Examination	2 HOUR EXAM	80
Coursework	COURSEWORK	20

Alternative Assessment

Not applicable: students failing a unit of assessment resit the assessment in its original format.

Assessment Strategy

The assessment strategy for this module is designed to provide students with the opportunity to demonstrate the learning outcomes. The 2-hour closed-book written examination will assess students’ knowledge and understanding of the main concepts in speech and audio processing and recognition, and their ability to use such knowledge (such as linear predictive coding) to solve some basic problems in speech modelling and analysis (such as formant frequency estimation). The computer simulation assignment will assess the students’ technical skills and expertise in designing a simple speech synthesis/recognition system by applying the methods and concepts discussed in the lectures.

Thus, the summative assessment for this module consists of the following.

· The examination (80%) provides a limited choice of topics to ensure that good coverage of specialist knowledge is tested in a 2-hour closed-book written examination. The questions are constructed to assess the outcomes at various cognitive levels, often beginning with relating knowledge, then formulating a problem, performing an analysis, and reflecting on the result.

· Speech Processing experiment (20%) is a computer-based experiment on speech synthesis/recognition. Students are required to submit a report of at least 1000 words (excluding figures, plots and tables; the page count is flexible, between 5 and 30 A4 pages) in both printed and electronic copy, together with programming code (such as MATLAB code) and synthesized/recognized audio samples in electronic copy, by the deadline of Tuesday of Week 9.

These deadlines are indicative. For confirmation of exact date and time, please check the Departmental assessment calendar issued to you.

Formative assessment and feedback

For the module, students will receive formative assessment/feedback in the following ways.

· During lectures, by question and answer sessions

· During lectures, by group discussions

· During worked example/revision classes

· By means of unassessed tutorial problems (with answers/model solutions)

· Via the marking of the assignment, both electronic file submissions and written reports

Module aims

Educate students in the particular aspects of speech processing and recognition, with concepts, engineering problems, worked examples and computer simulations.

Learning outcomes

		Attributes Developed
1	Demonstrate a systematic understanding of the main concepts in speech and audio processing and recognition.	K
2	Apply the concepts and methods learned to some speech processing problems, such as pitch estimation and speech synthesis.	KCP
3	Describe and explain the principles of pattern recognition in relation to speech recognition, including feature extraction, dynamic time warping, hidden Markov modelling, Gaussian mixture models, expectation maximization, language models and their application to large-vocabulary continuous speech recognition.	KPT
4	Formulate and analyse solutions to HMM problems, such as simple likelihood calculation, optimal state-sequence identification and parameter re-estimation.	KCT
5	Apply HMM theory to practical speech recognition tasks.	KP
6	Evaluate a speaker verification system based on objective measures of its operating characteristics.	KCPT

Attributes Developed

C - Cognitive/analytical

K - Subject knowledge

T - Transferable skills

P - Professional/Practical skills

Methods of Teaching / Learning

The learning and teaching strategy is designed to achieve the following aims.

To provide a broad engineering education in speech processing, machine learning, spoken language processing, pattern recognition and psychoacoustics.

To develop analytical and computational competence using advanced techniques.

To promote technical confidence through elaborating specialist techniques associated with speech processing and recognition.

To provide experience of commonly used software tools relevant to speech and audio signal processing and to certain machine learning techniques.

To cultivate transferable skills in note taking, knowledge representation, technical writing, time management and professional conduct.

Learning and teaching methods include the following.

Lectures: 3 hours per week for 10 weeks

Class discussion integrated within lecture (approximately 15 minutes per week)

Designed in-class problems (approximately 15 minutes per week)

Assignment in the form of computer simulations and reports (collectively 22.5 hours spread over 5 weeks)

Timetabled revision classes (3 hours) which demonstrate the principles of the theory in quantitative worked examples and prepare students for the written examination.

Indicated Lecture Hours (which may also include seminars, tutorials, workshops and other contact time) are approximate and may include in-class tests where one or more of these are an assessment on the module. In-class tests are scheduled and organised separately from taught content and will be published on students' personal timetables, where they apply to taken modules, as soon as they are finalised by central administration. This will usually be after the initial publication of the teaching timetable for the relevant semester.

Reading list

https://readinglists.surrey.ac.uk
Upon accessing the reading list, please search for the module using the module code: EEEM030

Other information

This module has a capped number and may not be available to ERASMUS and other international exchange students. Please check with the International Engagement Office by email: ieo.incoming@surrey.ac.uk

Programmes this module appears in

Programme	Semester	Classification	Qualifying conditions
Computer Vision, Robotics and Machine Learning MSc	1	Optional	A weighted aggregate mark of 50% is required to pass the module
Electronic Engineering MSc	1	Optional	A weighted aggregate mark of 50% is required to pass the module
Communications Networks and Software MSc	1	Optional	A weighted aggregate mark of 50% is required to pass the module
Artificial Intelligence MSc	1	Optional	A weighted aggregate mark of 50% is required to pass the module
Computer and Internet Engineering MEng	1	Optional	A weighted aggregate mark of 50% is required to pass the module
Communication Systems MEng	1	Optional	A weighted aggregate mark of 50% is required to pass the module
Electronic Engineering with Communications MEng	1	Optional	A weighted aggregate mark of 50% is required to pass the module
Electronic Engineering with Audio-Visual Systems MEng	1	Optional	A weighted aggregate mark of 50% is required to pass the module
Electronic Engineering with Computer Systems MEng	1	Optional	A weighted aggregate mark of 50% is required to pass the module
Electronic Engineering MEng	1	Optional	A weighted aggregate mark of 50% is required to pass the module
Electronic Engineering with Professional Postgraduate Year MSc	1	Optional	A weighted aggregate mark of 50% is required to pass the module
Biomedical Engineering MEng	1	Optional	A weighted aggregate mark of 50% is required to pass the module

Please note that the information detailed within this record is accurate at the time of publishing and may be subject to change. This record contains information for the most up to date version of the programme / module for the 2020/1 academic year.