Mouther
Golan Levin, Malcolm Slaney and Tom Ngo
Interval Research Corporation, Summer 1995
 

Mouther is an experimental interactive system, built with Malcolm Slaney and Tom Ngo, in which an animated cartoon face is lip-synched in real time to the user's speech. Malcolm implemented Mouther by connecting his speech-parsing technologies to Tom's "Embedded-Constraint Graphic" (ECG) animation engine. I was responsible for the prototype concept and artwork.

Mouther works by driving the state of a smoothly interpolating vector-based image with the output of a phoneme recognizer. The recognizer uses mel-frequency cepstral coefficients (MFCCs) as features and selects the maximum-likelihood phoneme under Gaussian mixture models (GMMs), with one model per phoneme. Depending on the amount and diversity of training data, the GMMs could be either speaker-dependent or speaker-independent. To reduce the system's sensitivity to microphone and room acoustics, the MFCCs were filtered with RASTA (a widely accepted method for reducing the dependence of acoustic features on channel characteristics) prior to classification.
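The recognition step described above can be sketched roughly as follows. This is a minimal illustration, not Mouther's actual code: the function names, the toy two-dimensional "MFCC" vectors, the diagonal-covariance mixtures, and the normalization of likelihoods into confidences (which assumes equal phoneme priors) are all assumptions made for the example. The RASTA filter uses the commonly cited transfer function; the pole value of 0.98 is likewise an assumption, and other values appear in practice.

```python
import math

def rasta_filter(track, pole=0.98):
    """Band-pass filter one coefficient's trajectory (one value per frame).

    Commonly cited RASTA transfer function:
    H(z) = 0.1 * (2 + z^-1 - z^-3 - 2 z^-4) / (1 - pole * z^-1)
    The numerator taps sum to zero, so a constant offset (such as a fixed
    microphone or room response in the cepstral domain) decays away.
    """
    numer = [0.2, 0.1, 0.0, -0.1, -0.2]
    out, prev = [], 0.0
    for n in range(len(track)):
        fir = sum(b * track[n - k] for k, b in enumerate(numer) if n - k >= 0)
        prev = fir + pole * prev
        out.append(prev)
    return out

def log_gaussian(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at feature vector x."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))

def gmm_log_likelihood(x, components):
    """Log-likelihood of x under a mixture of (weight, mean, var) components."""
    logs = [math.log(w) + log_gaussian(x, m, v) for w, m, v in components]
    peak = max(logs)
    return peak + math.log(sum(math.exp(l - peak) for l in logs))  # log-sum-exp

def classify(x, models):
    """Return (most likely phoneme, {phoneme: normalized confidence})."""
    scores = {ph: gmm_log_likelihood(x, comps) for ph, comps in models.items()}
    peak = max(scores.values())
    unnorm = {ph: math.exp(s - peak) for ph, s in scores.items()}
    total = sum(unnorm.values())
    confidences = {ph: u / total for ph, u in unnorm.items()}
    return max(scores, key=scores.get), confidences

# Hypothetical toy models: real GMMs would be trained on recorded speech.
MODELS = {
    "ee": [(0.6, [1.0, 2.0], [0.5, 0.5]), (0.4, [1.5, 2.5], [0.5, 0.5])],
    "eh": [(1.0, [-1.0, 0.0], [0.5, 0.5])],
}

label, confidences = classify([1.1, 2.1], MODELS)  # label is "ee" here
```

In a full pipeline, each MFCC dimension would be passed through `rasta_filter` across frames before the per-frame vectors reach `classify`.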

Cartoon visemes for Mouther were developed in Tom Ngo's ECG graphics engine, a system optimized for creating smooth interpolations between vector-based graphics in high-dimensional spaces. With this technology, weighted interpolations of facial exemplars could be generated directly from the phoneme classifier's confidence ratings. For example, a sound classified as "ee" with 70% confidence and as "eh" with 30% confidence would produce a 70/30 morph between the "ee" and "eh" cartoon graphics.
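The 70/30 example can be illustrated with a simple confidence-weighted average of control points. This is only a sketch under stated assumptions: the ECG engine's actual interpolation scheme is not described here, so plain linear blending stands in for it, and the function name `blend_visemes` and the exemplar coordinates are hypothetical.

```python
def blend_visemes(confidences, exemplars):
    """Confidence-weighted average of matching control points.

    confidences: {phoneme: weight}, e.g. classifier confidence ratings.
    exemplars:   {phoneme: [(x, y), ...]}, with the same number of points
                 in the same order for every phoneme.
    """
    total = sum(confidences.values())
    if total <= 0:
        raise ValueError("confidences must sum to a positive value")
    names = list(confidences)
    n_points = len(exemplars[names[0]])
    blended = []
    for i in range(n_points):
        x = sum(confidences[p] * exemplars[p][i][0] for p in names) / total
        y = sum(confidences[p] * exemplars[p][i][1] for p in names) / total
        blended.append((x, y))
    return blended

# Hypothetical mouth outlines: two control points per viseme.
EXEMPLARS = {
    "ee": [(0.0, 0.0), (10.0, 0.0)],
    "eh": [(0.0, 10.0), (10.0, 10.0)],
}

# 70% "ee" and 30% "eh" places each point 30% of the way from "ee" to "eh".
mouth = blend_visemes({"ee": 0.7, "eh": 0.3}, EXEMPLARS)
```

Dividing by the total weight keeps the blend well defined even when the confidences do not sum exactly to one.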

The result is a talking cartoon face whose animation is driven by the output of the classifier, with a different mouth position displayed for each phoneme that is spoken. While further work would be needed to increase the reliability of the classifier and the realism of the transitions between visemes, the current result is amusing and may already be responsive enough for children's computer games.

 

 

Viseme exemplars: ooh, uh, ee, rr, ss, ay, mm, ff, ll, oh