Mouther is an experimental
interactive system I built with Malcolm Slaney and Tom Ngo in which
an animated cartoon face is lip-synched in real time with the user's
speech. Malcolm implemented Mouther by connecting his special
speech-parsing technologies to Tom's "Embedded-Constraint Graphic"
(ECG) animation engine. I was responsible for the prototype
concept and artwork.
Mouther works by driving the state of a smoothly interpolating
vector-based image with the output of a phoneme recognizer. This
phoneme recognizer was built using mel-frequency cepstral coefficients
(MFCCs) as features, with maximum-likelihood selection based
on Gaussian mixture models (GMMs) of each phoneme. Depending on
the amount and diversity of the training data, either speaker-dependent
or speaker-independent GMMs could be formed for each phoneme. To reduce
the system's sensitivity to microphone and room acoustics, the MFCCs
were filtered by RASTA (a widely accepted method for reducing the
dependence of acoustic features on channel characteristics) prior
to classification.
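The sketch below gives the flavor of this recognition pipeline in
contemporary terms; it is not the original implementation. It assumes
librosa and scikit-learn, trains one diagonal-covariance GMM per phoneme
on labeled MFCC frames, and normalizes the per-frame likelihoods into
confidences (the RASTA step is omitted, as these libraries have no
standard implementation of it):

    import numpy as np
    import librosa
    from sklearn.mixture import GaussianMixture

    def extract_mfcc(signal, sr=16000, n_mfcc=13):
        # One row of cepstral coefficients per analysis frame.
        # (RASTA filtering of these feature trajectories would be
        # applied here, before training and classification.)
        return librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc).T

    def train_phoneme_models(training_frames, n_components=4):
        # training_frames: dict of phoneme label -> (n_frames, n_mfcc)
        # array of labeled example frames for that phoneme.
        models = {}
        for phoneme, frames in training_frames.items():
            gmm = GaussianMixture(n_components=n_components,
                                  covariance_type="diag")
            models[phoneme] = gmm.fit(frames)
        return models

    def classify_frame(models, frame):
        # Score one MFCC frame against every phoneme's GMM, then
        # normalize the log-likelihoods (softmax) into confidences;
        # maximum-likelihood selection is simply the argmax of the result.
        log_liks = {p: m.score_samples(frame[None, :])[0]
                    for p, m in models.items()}
        peak = max(log_liks.values())
        exps = {p: np.exp(v - peak) for p, v in log_liks.items()}
        total = sum(exps.values())
        return {p: v / total for p, v in exps.items()}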
Cartoon visemes for Mouther were developed
in Tom Ngo's ECG graphics engine, a system optimized for the creation
of smooth interpolations between vector-based graphics in high-dimensional
spaces. With this technology, weighted interpolations of facial
exemplars could be generated directly from the phoneme classifier's
confidence ratings. For example, a phoneme classified as "ee"
with 70% confidence, but also as "eh" with 30% confidence,
would result in a 70/30 morph between the "ee" and "eh"
cartoon graphics.
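The blend itself is just a confidence-weighted sum of exemplar
geometry. The toy sketch below stands in for the ECG engine's
interpolation, representing each viseme as an array of mouth-contour
control points (the names and shapes are illustrative):

    import numpy as np

    def blend_visemes(visemes, confidences):
        # visemes: dict of phoneme label -> (n_points, 2) control-point
        # array, with corresponding points across all exemplars.
        # confidences: dict of phoneme label -> weight, summing to 1.
        blended = np.zeros_like(next(iter(visemes.values())), dtype=float)
        for phoneme, weight in confidences.items():
            blended += weight * visemes[phoneme]
        return blended

    # The 70/30 example from the text:
    # mouth = blend_visemes({"ee": ee_points, "eh": eh_points},
    #                       {"ee": 0.7, "eh": 0.3})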
The result is a talking cartoon face whose animation
is driven by the output of the classifier, and in which different
mouth positions are displayed for the different phonemes as they are
spoken. While further work would be needed to increase the reliability
of the classifier and the realism of the transitions between
visemes, the current result is amusing and could be sufficiently
responsive for the quality level needed in children's computer games.