Mouther is an experimental interactive system I built with Malcolm Slaney and Tom Ngo in which an animated cartoon face is lip-synched in real time with the user's speech. Malcolm implemented Mouther by connecting his speech-parsing technology to Tom's "Embedded-Constraint Graphic" (ECG) animation engine. I was responsible for the prototype concept and artwork.

Mouther works by driving the state of a smoothly interpolating vector-based image with the output of a phoneme recognizer. The recognizer uses mel-frequency cepstral coefficients (MFCCs) as features and performs maximum-likelihood selection over Gaussian mixture models (GMMs), one per phoneme. Depending on the amount and diversity of the training data, the GMM for each phoneme could be made speaker-dependent or speaker-independent. To reduce the system's sensitivity to microphone and room acoustics, the MFCCs were filtered by RASTA (a widely accepted method for reducing the dependence of acoustic features on channel characteristics) prior to classification.
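As a concrete illustration, here is a minimal sketch of that pipeline in Python, assuming librosa for MFCC extraction, scipy for the RASTA band-pass filter, and scikit-learn for the per-phoneme GMMs. The filter coefficients follow one common formulation of RASTA, and normalizing the per-phoneme log-likelihoods into confidence weights is my own assumption about how the classifier's confidence ratings were produced; the phoneme labels and training-data layout are hypothetical.

    import numpy as np
    import librosa
    from scipy.signal import lfilter
    from sklearn.mixture import GaussianMixture

    def rasta_filter(cepstra):
        # Band-pass filter each cepstral trajectory over time. These are
        # the coefficients of one common RASTA formulation (an FIR
        # differentiator followed by a single pole near 1) -- an assumption,
        # not necessarily the exact filter Mouther used.
        numer = np.array([0.2, 0.1, 0.0, -0.1, -0.2])
        denom = np.array([1.0, -0.94])
        return lfilter(numer, denom, cepstra, axis=1)  # filter along frames

    def mfcc_features(audio, sr):
        # Extract MFCC frames, then RASTA-filter them prior to classification.
        mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=13)  # (13, n_frames)
        return rasta_filter(mfcc).T                             # (n_frames, 13)

    def train_phoneme_gmms(training_data, n_components=4):
        # Fit one GMM per phoneme; training_data maps a phoneme label to an
        # (n_frames, 13) array of labeled feature frames (hypothetical layout).
        return {ph: GaussianMixture(n_components).fit(frames)
                for ph, frames in training_data.items()}

    def classify_frame(frame, gmms):
        # Maximum-likelihood selection: score one feature frame under every
        # phoneme's GMM, then normalize the likelihoods into per-phoneme
        # confidence weights (the weights that later drive the viseme morph).
        log_liks = {ph: gmm.score_samples(frame[None, :])[0]
                    for ph, gmm in gmms.items()}
        peak = max(log_liks.values())
        exps = {ph: np.exp(ll - peak) for ph, ll in log_liks.items()}
        total = sum(exps.values())
        return {ph: e / total for ph, e in exps.items()}

The most likely phoneme is simply the highest-weighted entry of the returned dictionary, while the full set of weights feeds the morphing stage described next.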
Cartoon visemes for Mouther were developed in Tom Ngo's ECG graphics engine, a system optimized for the creation of smooth interpolations between vector-based graphics in high-dimensional spaces. With this technology, weighted interpolations of facial exemplars could be generated directly from the phoneme classifier's confidence ratings. For example, a phoneme classified as "ee" with 70% confidence, but also classified as "eh" with 30% confidence, would result in a 70/30 morph between the "ee" and "eh" cartoon graphics.
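The morph itself is just a confidence-weighted average of the exemplars. Below is a minimal sketch in which each viseme is represented as an array of 2-D control points with a consistent point ordering; the actual ECG representation is richer than this, and the placeholder shapes are hypothetical.

    import numpy as np

    def morph_visemes(confidences, visemes):
        # Blend viseme control points by the classifier's confidence weights.
        # confidences: {phoneme: weight}, with weights summing to 1
        # visemes:     {phoneme: (n_points, 2) array of control points}
        blend = np.zeros_like(next(iter(visemes.values())), dtype=float)
        for phoneme, weight in confidences.items():
            blend += weight * visemes[phoneme]
        return blend

    # The 70/30 example from the text: two hypothetical placeholder shapes
    # blended with 70% "ee" and 30% "eh" confidence.
    visemes = {"ee": np.array([[0.0, 0.0], [1.0, 0.2]]),
               "eh": np.array([[0.0, 0.4], [1.0, 0.6]])}
    frame = morph_visemes({"ee": 0.7, "eh": 0.3}, visemes)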
The result is a talking cartoon face whose animation is driven by the output of the classifier, and in which different mouth positions are displayed for the different phonemes that are spoken. While further work would be needed to increase the reliability of the classifier and the realism of the transitions between visemes, the current result is amusing and could be sufficiently responsive for the quality level needed in children's computer games.