nsh - Speech Recognition With CMU Sphinx

Blog about speech technologies: recognition, synthesis, identification. Mostly it's about the scientific part of it, the core design of the engines, the new methods and machine learning, and about the technical part, like the architecture of the recognizer and the design decisions behind it.

Dealing with pruning issues

I spent a holiday looking into issues with pocketsphinx decoding in fwdflat mode. Initially I thought it was a bug, but it turned out to be just a pruning issue. The result looked like this:

INFO: ngram_search.c(1045): bestpath 0.00 wall 0.000 xRT
INFO: <s> 0 5 1.000 -94208 0 1
INFO: par..grafo 6 63 1.000 -472064 -467 2
INFO: terceiro 64 153 1.000 -1245184 -115 3
INFO: as 154 176 0.934 -307200 -172 3
INFO: emendas 177 218 1.000 -452608 -292 3
INFO: ao 219 226 1.000 -208896 -181 3
INFO: projeto 227 273 1.000 -342016 -152 3
INFO: de 274 283 1.000 -115712 -75 3
INFO: lei 284 3059 1.000 -115712 -79 3


Speech recognition is essentially a search for the globally best path in a graph. Beam pruning drops nodes during the search if a node's score is worse than the best node's score by more than the beam width.
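To make the effect of the beam concrete, here is a minimal sketch in Python. It is not pocketsphinx code: the two hypotheses, their per-frame log scores and the function name `decode` are all made up for illustration. Hypothesis "y" starts badly but wins overall, so a narrow beam drops it before it can recover.

```python
def decode(frame_scores, log_beam):
    """Accumulate per-frame log scores for each hypothesis, pruning every frame.

    frame_scores maps a hypothesis name to its list of per-frame log scores
    (hypothetical toy data). log_beam is a negative number: hypotheses scoring
    worse than best + log_beam are dropped and never come back.
    """
    active = {name: 0.0 for name in frame_scores}
    n_frames = len(next(iter(frame_scores.values())))
    for t in range(n_frames):
        # Extend every surviving hypothesis by one frame.
        active = {name: score + frame_scores[name][t]
                  for name, score in active.items()}
        # Beam pruning relative to the best score in this frame.
        best = max(active.values())
        active = {name: score for name, score in active.items()
                  if score >= best + log_beam}
    return max(active, key=active.get), active


# "x" looks better for two frames, "y" is better over the whole utterance.
scores = {"x": [-1.0, -1.0, -20.0], "y": [-5.0, -5.0, -1.0]}
wide_best, wide_active = decode(scores, log_beam=-20.0)
narrow_best, narrow_active = decode(scores, log_beam=-5.0)
print(wide_best, narrow_best)
```

With the wide beam both hypotheses survive to the end and the globally best one is returned. With the narrow beam the eventual winner is pruned in the second frame, so the search can only return the locally best path.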


If the beam is too narrow, the result might not be the globally best one even though it is locally the best. In practice this can lead to complex issues like the one described above: the word "lei" spans almost 2.8k frames (284 to 3059), which is almost 28 seconds at 100 frames per second. Another sign of overpruning is the number of words scored per frame.
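A quick way to spot this symptom is to scan the segmentation output for words that last implausibly long. The sketch below assumes the layout seen in the log above, with the word in the second field and the start and end frames in the third and fourth. The function name `long_words` and the 500 frame threshold are arbitrary choices for illustration.

```python
def long_words(log_lines, max_frames=500):
    """Return (word, duration_in_frames) for suspiciously long segments.

    Assumes lines of the form "INFO: <word> <start> <end> ...", as in the
    output quoted above; all other lines are skipped.
    """
    flagged = []
    for line in log_lines:
        fields = line.split()
        if len(fields) < 4 or fields[0] != "INFO:":
            continue
        word, start, end = fields[1], fields[2], fields[3]
        if not (start.isdigit() and end.isdigit()):
            continue  # not a segment line, e.g. the bestpath timing line
        duration = int(end) - int(start) + 1
        if duration > max_frames:
            flagged.append((word, duration))
    return flagged


sample = [
    "INFO: ngram_search.c(1045): bestpath 0.00 wall 0.000 xRT",
    "INFO: <s> 0 5 1.000 -94208 0 1",
    "INFO: de 274 283 1.000 -115712 -75 3",
    "INFO: lei 284 3059 1.000 -115712 -79 3",
]
flagged = long_words(sample)
print(flagged)
```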


Magic Words of Interspeech 2011

Interspeech 2011 is coming. It is going to be an amazing event, I suppose. If you are interested in what is going on there, let's figure that out.

To keep things simple we will use Unix command line tools. Sometimes text processing can be fun even with simple commands. Text is still the most convenient form of presenting information, way better than HTML or databases. Of course, more advanced things like stopword filtering or named entity recognition are lacking. Let's hope one day the Unix command line will have them.

1. Download the full printable programs of Interspeech 2010 and Interspeech 2011 with wget, dump them to text with lynx and clean up punctuation with sed.

2. Dump word counts with the SRILM tool ngram-count and cut the list for 2011 down to the 1000 most frequent words with sort and head. Leave all words in the 2010 list.

3. Figure out which of the words in the 2011 list are new and do not appear in the 2010 list with sort and uniq.

Surprisingly, there are only 2 new words: i-vector and crowdsourcing.
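For those who prefer a script to a pipeline, steps 2 and 3 can be sketched in a few lines of Python. This is only an equivalent of the idea, not the commands actually used. The two input strings are toy stand-ins for the dumped conference programs, and the function names are made up.

```python
import re
from collections import Counter


def word_counts(text):
    """Count lowercase word tokens; hyphens are kept so "i-vector" stays whole."""
    return Counter(re.findall(r"[a-z][a-z'-]*", text.lower()))


def new_frequent_words(text_new, text_old, top_n=1000):
    """Words among the top_n most frequent in text_new that never occur in text_old."""
    top_new = [word for word, _ in word_counts(text_new).most_common(top_n)]
    old_vocabulary = set(word_counts(text_old))
    return sorted(word for word in top_new if word not in old_vocabulary)


# Toy stand-ins for the two programs (hypothetical data).
program_2010 = "Speech recognition and speaker verification with neural networks"
program_2011 = "Crowdsourcing speech recognition and i-vector speaker verification"
new_words = new_frequent_words(program_2011, program_2010)
print(new_words)
```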

When Language Models Fail

Language modeling still has many challenging problems.


Comic by Jim Benton

Decoders And Features

CMUSphinx decoders at a glance, so one can compare. The table is incomplete and imprecise, of course.




Feature                           sphinx2  sphinx3  sphinx4  pocketsphinx
Acoustic Lookahead                -        -        +        -
Alignment                         +        +        +        -
Flat Forward Search               +        +        -        +
Finite Grammar Confidence         +        -        -        -
Full n-gram History Tree Search   -        -        +        -
HTK Features                      -        +        +        +
Phonetic Loop Decoder             +        +        -        -
Phonetic Lookahead                +        +        -        +
PLP features                      -        -        +        -
PTM Models                        -        -        -        +
Score Quantization                +        -        -        +
Semi-Continuous Models            +        +        -        +
Single Tree Search                +        -        -        +
Subvector Quantization            +        +        -        +
Time-Switching Tree Search        -        +        -        -
Tree Search Smear                 -        +        +        -
Word-Switching Tree Search        -        +        -        -
Thread Safety                     -        -        +        +
Keyword Spotting                  -        -        +        -

And here is the description of the entries:

Specific Applications

Phonetic Loop Decoder. Phonetic loop decoding requires a specialized search algorithm. It's not implemented in Sphinx4, for example.

Alignment. Given the audio and its transcription, get the word timings.

Keyword spotting. Searching for a keyword requires a separate search space and a different search approach.

Finite Grammar Confidence. Get a confidence estimate for a finite state grammar. This is a complex problem which requires additional operations during search, for example a phone loop pass.

Effective pruning

Acoustic Lookahead. Using the acoustic score for the current frame we can predict the score for the next frame and thus prune tokens early.

Phonetic Lookahead. Using a phonetic loop decoder we can predict possible phones and thus restrict the large vocabulary search.

Features

HTK Features. CMUSphinx feature extraction is different from HTK's (different filterbank and transform). To provide HTK compatibility one needs a specific HTK-style feature extraction.

PLP features. A type of features different from the traditional MFCC. They are more popular now.

Search Space

Flat Forward Search. A search space where word paths aren't joined in a lextree. Separate paths let us apply the language model probability earlier, so the search is more accurate. But because the search space is bigger, it's also slower. Usually flat search is applied as a second pass after tree search.

Full n-gram History Tree Search. Tokens which have different n-gram histories are tracked separately. For example, the token for "how are UW" and the token for "hello are UW" are tracked separately. In pocketsphinx such tokens are just joined and only the best one survives. Full history search is more accurate but slower and more complex to implement.
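The difference between the two strategies comes down to the key used when tokens meet in the same state. Here is a small sketch of that idea with made-up scores. The function `recombine` and the tuple layout are hypothetical and not taken from any of the decoders.

```python
def recombine(tokens, keep_history):
    """Keep the best token per key.

    tokens is a list of (history, state, log_score) tuples. With
    keep_history=True tokens are merged only if both the state and the word
    history match; otherwise the state alone decides and histories compete.
    """
    best = {}
    for history, state, score in tokens:
        key = (history, state) if keep_history else state
        if key not in best or score > best[key][2]:
            best[key] = (history, state, score)
    return sorted(best.values())


tokens = [
    (("how", "are"), "UW", -120.0),
    (("hello", "are"), "UW", -125.0),
]
merged = recombine(tokens, keep_history=False)    # history dropped: one survivor
separate = recombine(tokens, keep_history=True)   # full history: both survive
print(len(merged), len(separate))
```

In the merged case the "hello are" token is lost for good, even if a later language model score would have favored it. That is the accuracy cost of the cheaper search.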

Word-Switching Tree Search. Separate lextrees are kept for each unigram history. This search is in the middle between the one that keeps the full history and the one that drops the history altogether.

Single Tree Search. Lextree tokens don't care about word history. This is a faster but less accurate way.

Time-Switching Tree Search. Lextree states don't care about word history, but several lextrees are kept in memory (3-5). In this time-switching approach lextrees are switched every frame. Because of that there is a higher chance to track both histories.

Tree Search Smear. Lextree contains unigram probability and thus it's possible to prune token earlier based on the language score.
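As an illustration of the smear idea, each lextree node can store the best unigram log probability of any word reachable below it, so a token gets a language score estimate before the word identity is known. This is only a sketch, with a made-up lexicon and a flat dictionary of phone prefixes instead of a real tree; the actual decoders differ in detail.

```python
import math


def build_smear(lexicon, unigram):
    """Map every phone prefix to the best unigram log probability below it.

    lexicon maps word -> list of phones, unigram maps word -> probability.
    Both are hypothetical toy inputs.
    """
    smear = {}
    for word, phones in lexicon.items():
        log_prob = math.log(unigram[word])
        for i in range(1, len(phones) + 1):
            prefix = tuple(phones[:i])
            smear[prefix] = max(smear.get(prefix, -math.inf), log_prob)
    return smear


lexicon = {
    "how": ["HH", "AW"],
    "hello": ["HH", "AH", "L", "OW"],
    "are": ["AA", "R"],
}
unigram = {"how": 0.5, "hello": 0.1, "are": 0.4}
smear = build_smear(lexicon, unigram)
# The shared first phone carries the score of the likeliest word below it.
print(smear[("HH",)], smear[("HH", "AH")])
```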

Acoustic Scoring

PTM Models. Models where Gaussians are shared across senones with the same central phone. So we don't need to calculate Gaussian values for each senone, just a few values for each central phone. Then, using different mixture weights, we get the senone score. This approach reduces the computation required but keeps accuracy at a reasonable level. It's similar to semi-continuous models, where Gaussians are shared across all senones, not just across senones with the same central phone.
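The saving is easy to see in a sketch: the Gaussians of each central phone are evaluated once per frame, and every senone of that phone only applies its own mixture weights. The example uses one-dimensional Gaussians and made-up parameters to stay short. The names `codebooks` and `senones` are hypothetical and this is not how pocketsphinx stores its models.

```python
import math


def gaussian_logpdf(x, mean, var):
    """Log density of a one-dimensional Gaussian."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)


def logsumexp(values):
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))


def score_senones(x, codebooks, senones):
    """codebooks: phone -> list of (mean, var); senones: name -> (phone, weights)."""
    # Each phone's shared Gaussians are evaluated only once per frame.
    densities = {phone: [gaussian_logpdf(x, mean, var) for mean, var in gaussians]
                 for phone, gaussians in codebooks.items()}
    scores = {}
    for name, (phone, weights) in senones.items():
        # A senone differs from its siblings only in the mixture weights.
        scores[name] = logsumexp([math.log(w) + d
                                  for w, d in zip(weights, densities[phone])])
    return scores


codebooks = {"AA": [(0.0, 1.0), (2.0, 1.0)]}
senones = {"AA_1": ("AA", [0.9, 0.1]), "AA_2": ("AA", [0.1, 0.9])}
scores = score_senones(0.0, codebooks, senones)
print(scores)
```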

Score Quantization. Acoustic scores in some cases can be represented by just 2 bytes (semi-continuous models and a specific feature set). Usually scores are in the log domain and shifted by 10 bits. This reduces the memory required for the acoustic model and for scoring, and speeds up the computation, in particular on CPUs without an FPU.
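A rough sketch of what such quantization could look like: take an integer log score, shift it right by 10 bits and keep it within the 16-bit range. The log base of 1.0001 and the helper names are assumptions made for this illustration, not a description of the actual decoder code.

```python
import math

LOG_BASE = 1.0001    # assumed base of the integer log domain
SHIFT = 10           # the 10-bit shift mentioned above
INT16_MIN = -32768


def quantize(prob):
    """Map a probability in (0, 1] to a score that fits in 2 bytes."""
    raw = int(math.log(prob) / math.log(LOG_BASE))  # integer log score
    return max(raw >> SHIFT, INT16_MIN)             # coarse 16-bit score


def dequantize(score):
    """Approximate probability recovered from the quantized score."""
    return LOG_BASE ** (score << SHIFT)


q = quantize(1e-20)
approx = dequantize(q)
print(q, approx)
```

The price of the compact representation is resolution: with these assumed constants, one quantization step corresponds to a factor of roughly 1.1 in probability.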

Semi-Continuous Models. Gaussians are shared across all senones, only mixture weights are different. Such models are fast and usually quite accurate. Usually they are multistream (s2_4x or 1s_c_d_dd with subvector 0-12/13-25/26-38) since separate streams could be better quantized.
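The subvector split mentioned above is simple to show. This is a minimal sketch which assumes a 39-dimensional feature vector; the function name is made up.

```python
# Inclusive index ranges of the three streams: 0-12, 13-25, 26-38.
STREAMS = [(0, 12), (13, 25), (26, 38)]


def split_streams(feature):
    """Split one 39-dimensional feature vector into three 13-dimensional streams."""
    if len(feature) != 39:
        raise ValueError("expected a 39-dimensional feature vector")
    return [feature[low:high + 1] for low, high in STREAMS]


streams = split_streams(list(range(39)))
print([len(stream) for stream in streams])
```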

Subvector Quantization. A Gaussian selection approach to reduce the cost of acoustic scoring. Basically, a continuous model after training is decomposed into several subvector Gaussians which are shared across senones and thus scored efficiently.


Cars Controlled By Speech

Being a speech recognition guy, I'm looking for a car with speech recognition included. It sounds strange to select a car just because of that, but I'm just kidding. So far the list is:

  • Honda Accord
  • Any Ford 2011
  • Mazda 6

Not listing something expensive like BMW or Mercedes. Hm, it looks like almost everyone is doing that. Any others? Which is the most advanced one?

Some details on particular implementations

Ford SYNC 2011

Quite an advanced system. Command-based. Supports many types of commands, to control the DVD or get baseball scores. Supports user profiles but doesn't seem to have a specific training procedure. With current speaker recognition capabilities it could in theory adapt to users automatically without profiles.

Mazda 6 2011

Pretty interesting system, but limited compared to the previous one. According to the owner's manual it supports a very limited list of commands to manage calls and get incoming messages. Among its interesting capabilities, it supports training and voice entry for contacts. Three languages: English, French, Spanish. Looks like it's using a single microphone. Looks like the voice navigation system has a separate speech recognition subsystem.

    Honda Fit 2009


    Many commands mostly related to navigation but no user adaptation and no profiles. Alphanumeric entry as a backup to vocabulary search. This one is very simple.


    Mitsubishi/Hyundai 2011


    I didn't manage to find the manuals for them. The feature name "Bluetooth hands-free phone system with voice recognition and phonebook download" makes me think it's the same system as in Mazda.


    Talkmatic

    Doesn't seem like this is deployed, but the presentation looks impressive.

    KIA

    According to SpeechTechMag, Microsoft and Kia codeveloped the UVO multimedia and infotainment system, which the Korean automaker rolled out in its new Sportage, Sorento, and Optima models late last year. UVO lets users access media content and connect with people through quick voice commands without having to navigate hierarchical menus.


    ICASSP 2011 Part 1 - Thoughts

    It seems like ICASSP this year was a great event; it is a pity I missed it. Just comparing the keynote lists, ICASSP beats Interspeech 4:0. ICASSP is very technical, Interspeech is for linguists. Compare the two:

    Making Sense of a Zettabyte World vs Neural Representations of Word Meanings


    New section formats like technical tracks and trends discussions are interesting though I am not sure how they felt in practice.

    So this was the reason to spend a few days reading. 1000 papers on speech technology! Huh. Thanks to all the authors for their hard work! Well, I found several duplicates in the end.

    The main thing I noted is that the research topics are very unevenly covered, for example:
    • Everyone does speaker recognition. An appealing problem statement here is to detect a synthetic speaker. The paper titled "DETECTION OF SYNTHETIC SPEECH FOR THE PROBLEM OF IMPOSTURE" by De Leon et al. hints that there is no solution for that.
    • I got tired of skipping pursuits, bandits and compressive sensing.
    • On the other hand, the increased portion of papers on non-speech signals, the cocktail party problem and signal recovery is very interesting to read.
    • Things like DBN features or the SCARF decoder are widely represented. You can read about applications of CRF from g2p algorithms to dialogs. But traditional things like search algorithms and adaptation are almost uncovered.
    • It was surprising to find a session dedicated to multimedia security, which must be a gold mine of ideas, in particular if you need a topic for a paper. Is there a company selling such products?

    Overall I found several original problem statements as well as inspiring ideas covering very important technology issues. For example, it would be nice to implement a meeting transcription application with several iPhones to combine streams and later transcribe them using multichannel environment compensation. Several meeting transcription setups and channel separation methods are described in the conference proceedings.

    After reading a number of papers I found that conference papers are too short. When you see a nice title and an abstract, you expect a detailed insight into the problem, with historical background and everything explained in detail, a deep investigation of the problem. But you get just a description of the technology and a few figures from experiments. On the other hand, I would not be able to read 100 papers of 20 pages each.

    It is very interesting that this year's awards are not related to speech technology. That will be the content of Part 2. I just need to go through the last 50 papers.


    Chicken-And-Egg in Sphinxbase

    Recently Shea Levy pointed me to an issue with a verbose output during pocketsphinx initialization. Basically every time you start pocketsphinx, you get something like


    INFO: cmd_ln.c(691): Parsing command line:
    pocketsphinx_continuous 
    Current configuration:
    [NAME] [DEFLT] [VALUE]
    -adcdev
    -agc none none
    -agcthresh 2.0 2.000000e+00
    -alpha 0.97 9.700000e-01
    -argfile

    It's ok for a tool but not a nice thing for a library, which should be a small horse in the rig of an application. Not every user is happy seeing all this stuff dumped on the screen. And the worst thing is that there is no way to turn it off, because "-logfn /dev/null" works only for the output after initialization. So we are looking to make pocketsphinx completely silent.

    It turned out to be a more complex issue than I thought. It's a classic chicken-and-egg issue: you use the configuration framework to configure logging, but the configuration framework itself needs to log. We just hardcoded the initialization, but thinking about it afterwards I found a way more complex but more rigorous approach in the log4j description at http://articles.qos.ch/internalLogging.html

    Since log4j never sets up a configuration without explicit input from the user, log4j internal logging may occur before the log4j environment is set up. In particular, internal logging may occur while a configurator is processing a configuration file.

    We could have simplified things by ignoring logging events generated during the configuration phase. However, the events generated during the configuration phase contain information useful in debugging the log4j configuration file. Under many circumstances, this information is considered more useful than all the subsequent logging events put together.

    In order to capture the logs generated during configuration phase, log4j simply collects logging events in a temporary appender. At the end of the configuration phase, these recorded events are replayed within the context of the new log4j environment, (the one which was just configured). The temporary appender is then closed and detached from the log4j environment.

    Oh-woh, I will never get enough passion to implement this properly ;) Let it be as is for now.
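Just to show the shape of the buffer-and-replay idea from the log4j description, here is a minimal sketch with Python's standard logging module. It is not sphinxbase or log4j code. The class `TemporaryAppender`, the function `configure` and the `config` dictionary are hypothetical names used only for this example.

```python
import io
import logging


class TemporaryAppender(logging.Handler):
    """Collects records emitted before the real logging setup is known."""

    def __init__(self):
        super().__init__()
        self.records = []

    def emit(self, record):
        self.records.append(record)

    def replay(self, logger):
        # Replay buffered records under the newly configured level and handlers.
        for record in self.records:
            if logger.isEnabledFor(record.levelno):
                logger.handle(record)
        self.records.clear()


def configure(logger, config):
    temporary = TemporaryAppender()
    logger.addHandler(temporary)
    logger.setLevel(logging.DEBUG)         # capture everything for now
    logger.info("Parsing configuration")   # logged before the sink exists

    # Now the user's wishes are known: switch to the real sink and level.
    logger.removeHandler(temporary)
    logger.addHandler(logging.StreamHandler(config["stream"]))
    logger.setLevel(config["level"])
    temporary.replay(logger)
    temporary.close()


verbose_out, quiet_out = io.StringIO(), io.StringIO()
verbose_log = logging.getLogger("demo_verbose")
quiet_log = logging.getLogger("demo_quiet")
verbose_log.propagate = quiet_log.propagate = False  # keep the demo isolated
configure(verbose_log, {"stream": verbose_out, "level": logging.INFO})
configure(quiet_log, {"stream": quiet_out, "level": logging.WARNING})
print(repr(verbose_out.getvalue()), repr(quiet_out.getvalue()))
```

The early message shows up only for the logger configured at INFO level. The one configured at WARNING stays completely silent, which is the behaviour wanted from a library.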

    Sphinxbase command line options are still not good. I really miss a proper --help, --version and many more nifty getopt things. One day someone should do this.

    About Me

    Moscow, Russia
    Nowadays I mostly work on open source projects in speech recognition and synthesis like Festival, CMU Sphinx and Voxforge. I also support the Russian parts of those projects, providing the leading product in ASR and TTS in Russian. In the past I used to participate in GNOME, work on embedded Linux devices and on software development technologies related to automatic software verification and modelling. If you have any questions feel free to contact me by mail nshmyrev at nexiwave dot com or find me in jabber/irc.