Phoneme extraction

Thu Oct 12, 2017 9:10 am

Hi Folks.
I have a project that involves real-time animatronics. I want the application to run on a small, self-contained system (no network connection) that listens to a microphone and extracts a stream of phonemes from what is said.
These phonemes can then be used to articulate servos to give an approximation of the lip and tongue movements.
The key word is approximate: I am not trying to provide lip-reading accuracy. Rather, the lips of, say, a werewolf or orc should purse, widen, and narrow in response to the actor's speech.
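
As a rough illustration of the phoneme-to-servo step described above, here is a minimal Python sketch. It assumes the recognizer emits ARPAbet-style phoneme labels (e.g. "AA1", "M"), which is an assumption and not something confirmed for any particular engine. The mouth-shape groups, the pose values, the 0 to 90 degree servo range, and all names (`MouthPose`, `pose_for`, `to_angle`, `smooth`) are hypothetical and would need tuning on the real creature head.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MouthPose:
    jaw: float    # 0.0 = closed, 1.0 = fully open
    width: float  # 0.0 = pursed, 1.0 = wide


# Hypothetical coarse mouth shapes; values are placeholders to be tuned.
VISEMES = {
    "closed": MouthPose(0.0, 0.5),
    "open": MouthPose(0.9, 0.5),
    "wide": MouthPose(0.4, 1.0),
    "pursed": MouthPose(0.3, 0.0),
    "rest": MouthPose(0.1, 0.5),
}

# Assumed grouping of a few ARPAbet phonemes; anything unlisted maps to "rest".
PHONEME_TO_VISEME = {
    "M": "closed", "B": "closed", "P": "closed",
    "AA": "open", "AE": "open", "AH": "open",
    "IY": "wide", "EH": "wide", "S": "wide",
    "UW": "pursed", "OW": "pursed", "W": "pursed",
}


def pose_for(phoneme: str) -> MouthPose:
    """Return the target mouth pose for one phoneme label."""
    base = phoneme.rstrip("012").upper()  # drop ARPAbet stress digits
    return VISEMES[PHONEME_TO_VISEME.get(base, "rest")]


def to_angle(fraction: float, min_deg: float = 0.0, max_deg: float = 90.0) -> float:
    """Convert a 0..1 pose value to a servo angle, clamping out-of-range input."""
    fraction = min(1.0, max(0.0, fraction))
    return min_deg + fraction * (max_deg - min_deg)


def smooth(prev: MouthPose, target: MouthPose, alpha: float = 0.5) -> MouthPose:
    """Move part of the way toward the target so the servos do not snap."""
    return MouthPose(
        prev.jaw + alpha * (target.jaw - prev.jaw),
        prev.width + alpha * (target.width - prev.width),
    )


if __name__ == "__main__":
    pose = VISEMES["rest"]
    for ph in ["HH", "EH1", "L", "OW0"]:  # roughly "hello"
        pose = smooth(pose, pose_for(ph))
        print(ph, round(to_angle(pose.jaw), 1), round(to_angle(pose.width), 1))
```

The angles printed here would then be handed to whatever servo driver the GPIO side uses. Because only approximate lip shapes are needed, a handful of groups like this may be enough; the grouping is deliberately lossy.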

My target platform is an RPi 3 or similar, running headless except for the microphone and GPIO output.

Let's say you have a werewolf in a movie. It can talk, growl, snarl, move its ears, and generate other expressions. What you don't see is the 5 or 6 people off to one side: the puppeteers who control all of the servos to make the creature 'act'.
This is great; however, if you want to take the same creature and place it in a convention or cosplay environment, it is impractical to have the puppeteers trying to handle unscripted situations.
I can use the actor/wearer's face to pick up various movements such as the eyebrows. However, if the suit could also pick up the actor's speech and approximate the enunciation, that would be brilliant.

Many thanks for any advice etc.

Lead Software Architect
Re: Phoneme extraction

Fri Oct 13, 2017 3:50 pm

Syn.Speech, being based on the Sphinx-4 architecture, was designed from the ground up for recognizing streams of words rather than phonemes. Being a traditional implementation, the ASR tracks words rather than individual phoneme tokens.

Although phoneme recognition would assist in pronunciation modelling, I am unsure whether the current code base can be channelled into phoneme recognition without considerable changes.

There could be ways around this limitation but I haven't put much thought into it.

