Modeling lip movements and extracting the contents of speech

Background and Purpose of This Study
In recent years, many applications have been developed as voice recognition technology has progressed. With this progress, the interface to software applications has changed from hand-operated to voice-operated. A voice-operated interface lets users enter commands by voice alone, which leaves their hands free for other tasks. Users can therefore operate several pieces of equipment simultaneously.
However, voice recognition cannot be used reliably in noisy environments, so a method is needed that recognizes the content of an utterance without using the voice.
We therefore construct an utterance recognition method that does not rely on voice recognition. This study focuses on modeling lip movement, and its main purpose is to establish a way to analyze lip movement during speech. We capture video of the lips with a video camera and analyze the footage to build the model. For accurate detection, we use the Fourier transform in the analysis.
Pronunciation and Lip Movement
When we speak Japanese, the lips move in five patterns that are determined by the vowels of the spoken words and phrases. These patterns correspond to the five vowels [a, i, u, e, o].
Japanese sounds are organized into a vowel file, [a, i, u, e, o], and a consonant line, [a, ka, sa, ta, na, ha, ma, ya, ra, wa, ga, za, da, ba, pa]. When Japanese is pronounced, the lips make a characteristic movement for each vowel in the pronounced words. This movement depends on the vowel of each sound and is not changed by the consonant line.
Estimating the Vowel with Lip Movements
As the first step in estimating vowels from lip movements, we classify the lip movement for each vowel. In short, we define a characteristic movement for each of the vowels [a, i, u, e, o], extract that movement from a video of the speaker, and estimate the vowels of the spoken words. For example, when [a] is spoken, the top and bottom parts of the lips move conspicuously. For [i], the right and left edges of the lips move conspicuously, but the top and bottom parts move little. These movements characterize each vowel. The best points at which to detect these features are the top and bottom parts and the right and left edges of the lips: they move the most during pronunciation and are easy to detect with image processing. We call these points the base points.
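The contrast between [a] and [i] described above can be illustrated with a minimal sketch. It assumes the base points have already been located in each video frame; the dictionary keys, the function names, and the pixel coordinates are all hypothetical and are not taken from the study. It only compares how much the vertical and horizontal lip openings vary over a clip, and is not the classification method itself.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]   # (x, y) in image pixels; y grows downward
Frame = Dict[str, Point]      # hypothetical keys: "top", "bottom", "left", "right"


def movement_ranges(frames: List[Frame]) -> Tuple[float, float]:
    """Return (vertical_range, horizontal_range) of the lip opening over a clip."""
    # Opening height and width in every frame.
    heights = [f["bottom"][1] - f["top"][1] for f in frames]
    widths = [f["right"][0] - f["left"][0] for f in frames]
    # How much each opening varies while the sound is pronounced.
    return max(heights) - min(heights), max(widths) - min(widths)


def dominant_direction(frames: List[Frame]) -> str:
    """Report which pair of base points moves more; ties count as horizontal."""
    vertical, horizontal = movement_ranges(frames)
    return "vertical" if vertical > horizontal else "horizontal"


# Made-up coordinates: the mouth opens vertically while its width stays fixed,
# which is the pattern the text describes for [a].
open_mouth_clip = [
    {"top": (50, 40), "bottom": (50, 50), "left": (30, 45), "right": (70, 45)},
    {"top": (50, 35), "bottom": (50, 65), "left": (30, 45), "right": (70, 45)},
]

# Made-up coordinates: the mouth widens while its height stays fixed,
# which is the pattern the text describes for [i].
wide_mouth_clip = [
    {"top": (50, 40), "bottom": (50, 50), "left": (30, 45), "right": (70, 45)},
    {"top": (50, 40), "bottom": (50, 50), "left": (25, 45), "right": (75, 45)},
]

print(dominant_direction(open_mouth_clip))
print(dominant_direction(wide_mouth_clip))
```

Separating the five vowels would need more than this single comparison, which is why the study goes on to analyze the full movement history.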
Next, we extract the locus drawn by the base points while the words are pronounced. We call this locus the lip movement history graph. The lip movement history graphs are classified by vowel. Because their shapes change according to what is pronounced, we apply a Fourier transform to each graph to obtain its frequency components. By analyzing how these frequency components change with lip movement, we find components that differ for every vowel, and we use them to classify the vowel of the pronounced sound. We perform this vowel analysis for each sound in the pronounced word, and finally we identify the word by referring to the language table.
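The pipeline in this paragraph can be sketched as follows, under several stated assumptions. The lip movement history is reduced to a single series of numbers (for example, the lip opening height per frame), each vowel is represented by a reference spectrum, and the vowel is chosen by the smallest Euclidean distance to those references. The study does not specify its classifier or the layout of its language table, so `nearest_vowel`, `candidate_words`, the template values, and the tiny romanized `LANGUAGE_TABLE` are hypothetical illustrations only.

```python
import cmath
import math
from typing import Dict, List


def dft_magnitudes(history: List[float]) -> List[float]:
    """Magnitudes of the discrete Fourier transform for bins 0 .. N // 2."""
    n = len(history)
    mean = sum(history) / n
    # Subtract the mean so the resting lip position does not dominate bin 0.
    centered = [value - mean for value in history]
    magnitudes = []
    for k in range(n // 2 + 1):
        coeff = sum(
            x * cmath.exp(-2j * cmath.pi * k * t / n)
            for t, x in enumerate(centered)
        )
        magnitudes.append(abs(coeff) / n)
    return magnitudes


def nearest_vowel(magnitudes: List[float],
                  templates: Dict[str, List[float]]) -> str:
    """Pick the vowel whose reference spectrum is closest (assumed rule)."""
    def distance(a: List[float], b: List[float]) -> float:
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    return min(templates, key=lambda vowel: distance(magnitudes, templates[vowel]))


# Hypothetical language table: vowel sequence -> romanized candidate words.
LANGUAGE_TABLE: Dict[str, List[str]] = {
    "aua": ["sakura"],
    "aaa": ["atama", "sakana"],
}


def candidate_words(vowels: List[str]) -> List[str]:
    """Look up the words whose vowel sequence matches the estimated vowels."""
    return LANGUAGE_TABLE.get("".join(vowels), [])


# Synthetic history: two full oscillations over 16 frames (not measured data).
history = [math.sin(2 * math.pi * 2 * t / 16) for t in range(16)]
spectrum = dft_magnitudes(history)

# Made-up reference spectra with a single peak each, 9 bins to match N = 16.
templates = {
    "a": [0.0, 0.0, 0.5] + [0.0] * 6,
    "i": [0.0, 0.5] + [0.0] * 7,
}
print(nearest_vowel(spectrum, templates))
print(candidate_words(["a", "u", "a"]))
```

Because the consonants do not change the lip movement, several words can share one vowel sequence, so a lookup of this kind returns a list of candidates rather than a single word.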

References
T. Yanagi, A. Sakamoto, M. Yamada: Modeling lip movements and Extracting of the Contents of the Speech, MJISAT2007, Kuala Lumpur (2008)
