Video-Realistic Speech Visualisation

A new hybrid speech visualisation technique based upon appearance models has been developed at the University of East Anglia. It can be used to create almost video-realistic sequences of a talking face enunciating arbitrary phrases.

Seeing the face of a talker provides visual information that can significantly influence the perception and understanding of speech signals. This is especially true when the auditory signal is degraded, for example by hearing impairment. Multimedia interfaces that use speech synthesisers should therefore also take this visual information into account, that is, they should perform audio-visual speech synthesis.

Visual speech synthesis requires a realistic model of the position and movement of the visible articulators (the lips, teeth and tongue). Traditionally this was done using computer graphics techniques, in which points on the surface of the face are represented as vertices in 3D and the surface itself is approximated by connecting the vertices. More recently, image processing techniques have been used to increase the video-realism: images of real faces serve as reference frames, which are either concatenated or morphed to create realistic sequences.
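
As a rough illustration of the image-based approach, the sketch below blends one reference frame into another. It is a simplification, not the method of any particular system: true morphing also warps the geometry of the face, whereas this is only a linear cross-dissolve of pixel values. The function name cross_dissolve and the four-pixel "frames" are hypothetical and chosen purely for illustration.

```python
def cross_dissolve(frame_a, frame_b, steps):
    """Return `steps` intermediate frames blending frame_a into frame_b.

    Frames are flat lists of grey-level pixel values of equal length.
    This is a plain cross-dissolve (assumption): no geometric warping is done.
    """
    if len(frame_a) != len(frame_b):
        raise ValueError("frames must have the same number of pixels")
    frames = []
    for k in range(1, steps + 1):
        t = k / (steps + 1)  # blend weight, strictly between 0 and 1
        frames.append([(1 - t) * a + t * b for a, b in zip(frame_a, frame_b)])
    return frames


# Hypothetical toy reference frames (four pixels each).
closed_mouth = [0.0, 0.0, 0.0, 0.0]
open_mouth = [1.0, 1.0, 0.0, 2.0]

# One in-between frame, halfway between the two reference frames.
in_between = cross_dissolve(closed_mouth, open_mouth, steps=1)
```

With steps=1 the single generated frame lies halfway between the two reference images; larger values give a smoother transition.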

The technique developed at the University of East Anglia uses statistical models of the shape and appearance variation of a face, trained from video sequences of a person talking. Given a model learnt from a video sequence, the face in each frame of the video is projected into the model space, giving a set of model parameters for that frame. These parameters can then be used in a concatenative synthesis scheme, in which the parameter trajectories corresponding to each synthesis unit (e.g. a triphone) are extracted from a corpus and concatenated.
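
The projection and concatenation steps can be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation used at the University of East Anglia: it assumes a linear model with a mean vector and orthonormal modes of variation (as a principal component analysis would give), it treats each frame as a short vector of numbers, and it joins units end to end without any smoothing at the joins. All names, the triphone labels and the toy data are hypothetical.

```python
def project(frame, mean, basis):
    """Project a frame into model space: b_i = basis_i . (frame - mean)."""
    centred = [x - m for x, m in zip(frame, mean)]
    return [sum(p * c for p, c in zip(mode, centred)) for mode in basis]


def reconstruct(params, mean, basis):
    """Map model parameters back to a frame: mean + sum_i b_i * basis_i."""
    frame = list(mean)
    for b, mode in zip(params, basis):
        for j, p in enumerate(mode):
            frame[j] += b * p
    return frame


def concatenate_units(units, corpus):
    """Join the stored parameter trajectories of each unit end to end."""
    trajectory = []
    for unit in units:
        trajectory.extend(corpus[unit])
    return trajectory


# Hypothetical toy model: 3-value "frames", two orthonormal modes (assumption).
mean = [1.0, 1.0, 1.0]
basis = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]

# Hypothetical training video: frames grouped by triphone label.
video = {
    "h-e+l": [[2.0, 1.0, 1.0], [3.0, 2.0, 1.0]],
    "e-l+o": [[1.0, 3.0, 1.0]],
}

# Corpus of parameter trajectories: every frame projected into model space.
corpus = {
    unit: [project(f, mean, basis) for f in unit_frames]
    for unit, unit_frames in video.items()
}

# Synthesis: concatenate the trajectories for the requested unit sequence,
# then map each parameter vector back to a frame for display.
params = concatenate_units(["h-e+l", "e-l+o"], corpus)
frames = [reconstruct(b, mean, basis) for b in params]
```

Working in the parameter space rather than on raw images keeps each frame compact, which is what makes storing and joining trajectories per synthesis unit practical. A real system would presumably also need to smooth the trajectories where units meet, which this sketch omits.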