<?xml version='1.0' encoding='UTF-8'?><?xml-stylesheet href="http://www.blogger.com/styles/atom.css" type="text/css"?><feed xmlns='http://www.w3.org/2005/Atom' xmlns:openSearch='http://a9.com/-/spec/opensearchrss/1.0/' xmlns:georss='http://www.georss.org/georss' xmlns:gd='http://schemas.google.com/g/2005' xmlns:thr='http://purl.org/syndication/thread/1.0'><id>tag:blogger.com,1999:blog-5075537609830514591</id><updated>2012-02-07T02:39:00.491+04:00</updated><category term='dictation'/><category term='sphinx4'/><category term='voting'/><category term='articles'/><category term='engines'/><category term='conditional random fields'/><category term='sphinxtrain'/><category term='business'/><category term='cuda'/><category term='java'/><category term='juicer'/><category term='debugging'/><category term='pocketsphinx'/><category term='zyxel'/><category term='sphinx'/><category term='language models'/><category term='frama-c eclipse git'/><category term='experiments'/><category term='noise filter'/><category term='TTS'/><category term='srilm'/><category term='voxforge'/><category term='blizzard'/><category term='ideas'/><category term='wfst'/><category term='g2p'/><category term='adaptation'/><category term='wiener filter'/><category term='mary'/><category term='interpolated language model'/><category term='MRF'/><category term='acoustic model training'/><category term='asterisk'/><category term='gpu'/><category term='CRF'/><category term='cmusphinx'/><category term='htk'/><category term='speech recognition'/><category term='festival'/><category term='random stuff'/><category term='summer of code'/><category term='ivr'/><category term='scarf'/><category term='testing'/><category term='sphinx4 configuration'/><category term='nexiwave'/><category term='conferences'/><title type='text'>nsh - Speech Recognition With CMU Sphinx</title><subtitle type='html'>Blog about speech technologies - recognition, synthesis, identification. 
It is mostly about the scientific side: the core design of the engines, new methods and machine learning, and about the technical side, such as the architecture of the recognizer and the design decisions behind it.</subtitle><link rel='http://schemas.google.com/g/2005#feed' type='application/atom+xml' href='http://nsh.nexiwave.com/feeds/posts/default'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default?max-results=100'/><link rel='alternate' type='text/html' href='http://nsh.nexiwave.com/'/><link rel='hub' href='http://pubsubhubbub.appspot.com/'/><link rel='next' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default?start-index=101&amp;max-results=100'/><author><name>Nickolay V. Shmyrev</name><uri>http://www.blogger.com/profile/11220369315272283124</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://4.bp.blogspot.com/-mLPNJlEdMV8/TbZcF3qUEqI/AAAAAAAAAKA/6PXGkO0cmAk/s220/me.jpg'/></author><generator version='7.00' uri='http://www.blogger.com'>Blogger</generator><openSearch:totalResults>111</openSearch:totalResults><openSearch:startIndex>1</openSearch:startIndex><openSearch:itemsPerPage>100</openSearch:itemsPerPage><entry><id>tag:blogger.com,1999:blog-5075537609830514591.post-7862667600828059167</id><published>2012-01-15T01:12:00.002+04:00</published><updated>2012-01-15T01:16:55.409+04:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='speech recognition'/><title type='text'>Dealing with pruning issues</title><content type='html'>I spent a holiday looking into issues with pocketsphinx decoding in fwdflat mode. Initially I thought it was a bug, but it turned out to be just a pruning issue. 
The result looked like this:&lt;br /&gt;&lt;br /&gt;&lt;code&gt; INFO: ngram_search.c(1045): bestpath 0.00 wall 0.000 xRT&lt;br /&gt;INFO:   &amp;lt;s&amp;gt;                  0     5     1.000 -94208     0          1&lt;br /&gt;INFO:   par..grafo           6     63    1.000 -472064    -467       2&lt;br /&gt;INFO:   terceiro             64    153   1.000 -1245184   -115       3&lt;br /&gt;INFO:   as                   154   176   0.934 -307200    -172       3&lt;br /&gt;INFO:   emendas              177   218   1.000 -452608    -292       3&lt;br /&gt;INFO:   ao                   219   226   1.000 -208896    -181       3&lt;br /&gt;INFO:   projeto              227   273   1.000 -342016    -152       3&lt;br /&gt;INFO:   de                   274   283   1.000 -115712    -75        3&lt;br /&gt;INFO:   lei                  284   3059  1.000 -115712    -79        3&lt;br /&gt;&lt;/code&gt;&lt;br /&gt;&lt;br /&gt;Speech recognition is essentially a search for the globally best path in a graph. Beam pruning drops nodes during the search when a node's score is worse than the best node's score by more than the beam width, as in this picture:&lt;br /&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://3.bp.blogspot.com/-Za_5a7CXVoQ/TxHjPBLnOuI/AAAAAAAAAPk/Qz-V06Cof9I/s1600/pruning.png" imageanchor="1" style="margin-left:1em; margin-right:1em"&gt;&lt;img border="0" height="178" width="400" src="http://3.bp.blogspot.com/-Za_5a7CXVoQ/TxHjPBLnOuI/AAAAAAAAAPk/Qz-V06Cof9I/s400/pruning.png" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;If the beam is too narrow, the result might not be the globally best one even though it is locally the best. In practice this can lead to complex issues like the one described above. Note that the word "lei" spans frames 284 to 3059, almost 2.8k frames, which at 100 frames per second means about 28 seconds. 
Another sign of overpruning is the number of words scored per frame:&lt;br /&gt;&lt;br /&gt;&lt;a name='more'&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;code&gt;&lt;br /&gt;INFO: ngram_search_fwdflat.c(940):     2931 words recognized (1/fr)&lt;br /&gt;INFO: ngram_search_fwdflat.c(942):    48013 senones evaluated (16/fr)&lt;br /&gt;INFO: ngram_search_fwdflat.c(944):     9586 channels searched (3/fr)&lt;br /&gt;INFO: ngram_search_fwdflat.c(946):     3849 words searched (1/fr)&lt;br /&gt;INFO: ngram_search_fwdflat.c(948):     9602 word transitions (3/fr)&lt;br /&gt;&lt;/code&gt;&lt;br /&gt;&lt;br /&gt;If you have just one word per frame, it's likely an issue.&lt;br /&gt;&lt;br /&gt;More detailed behaviour can be seen if debugging is enabled in the sources:&lt;br /&gt;&lt;code&gt; #define __CHAN_DUMP__           1&lt;br /&gt;&lt;/code&gt;&lt;br /&gt;&lt;br /&gt;You'll see something like:&lt;br /&gt;&lt;code&gt;&lt;br /&gt;BEFORE:&lt;br /&gt;SSID          2866         610         611 (2608)&lt;br /&gt;SENSCR        -604        -215        -371&lt;br /&gt;SCORES    -1014874     -583095     -583097     -583223&lt;br /&gt;HISTID         170         170         170         170&lt;br /&gt;AFTER:&lt;br /&gt;SSID          2866         610         611 (2608)&lt;br /&gt;SENSCR        -604        -215        -371&lt;br /&gt;SCORES    -1015481     -583315     -583317     -583489&lt;br /&gt;HISTID         170         170         170         170&lt;br /&gt;BEFORE:&lt;br /&gt;SSID          2866         610         611 (2608)&lt;br /&gt;SENSCR        -568        -122        -358&lt;br /&gt;SCORES    -1015481     -583315     -583317     -583489&lt;br /&gt;HISTID         170         170         170         170&lt;br /&gt;AFTER:&lt;br /&gt;SSID          2866         610         611 (2608)&lt;br /&gt;SENSCR        -568        -122        -358&lt;br /&gt;SCORES    -1016052     -583442     -583444     -583696&lt;br /&gt;HISTID         170         170         170         170&lt;br /&gt;&lt;/code&gt;&lt;br 
/&gt;&lt;br /&gt;So you see that only one HMM per frame is scored and it doesn't generate any other HMMs.&lt;br /&gt;&lt;br /&gt;Since those issues are hard to notice, starting today the decoder will also issue a warning in the log. It will look like this:&lt;br /&gt;&lt;br /&gt;&lt;code&gt; WARNING: "ngram_search.c", line 404: Word 'lei' survived for 2764 frames, potential overpruning&lt;br /&gt;WARNING: "ngram_search.c", line 404: Word 'lei' survived for 2765 frames, potential overpruning&lt;br /&gt;&lt;/code&gt;&lt;br /&gt;&lt;br /&gt;So you'll be warned if something goes wrong.&lt;br /&gt;&lt;br /&gt;It's very easy to forget about pruning issues because they are not really visible. You'll only get a drop in accuracy and you might not notice it, or you might attribute it to the model rather than to the search. In practice you always need to remember this:&lt;br /&gt;&lt;br /&gt;&lt;i&gt;Search space configuration and settings have a definite effect on the final accuracy and speed.&lt;br /&gt;&lt;br /&gt;Default settings are often wrong for modified models. If you have a new model, you need to review all the configuration parameters in order to make sure they work. If there are many parameters, you need to check all of them.&lt;br /&gt;&lt;br /&gt;If pruning errors in your decoder have a very small effect, it means you haven't optimized your search space properly. 
You can definitely do better.&lt;br /&gt;&lt;br /&gt;&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;At least we might want to report more useful metrics about pruning in the future.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5075537609830514591-7862667600828059167?l=nsh.nexiwave.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nsh.nexiwave.com/feeds/7862667600828059167/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://nsh.nexiwave.com/2012/01/dealing-with-pruning-issues.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/7862667600828059167'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/7862667600828059167'/><link rel='alternate' type='text/html' href='http://nsh.nexiwave.com/2012/01/dealing-with-pruning-issues.html' title='Dealing with pruning issues'/><author><name>Nickolay V. Shmyrev</name><uri>http://www.blogger.com/profile/11220369315272283124</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://4.bp.blogspot.com/-mLPNJlEdMV8/TbZcF3qUEqI/AAAAAAAAAKA/6PXGkO0cmAk/s220/me.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://3.bp.blogspot.com/-Za_5a7CXVoQ/TxHjPBLnOuI/AAAAAAAAAPk/Qz-V06Cof9I/s72-c/pruning.png' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5075537609830514591.post-7069692559002513011</id><published>2011-08-24T18:15:00.000+04:00</published><updated>2011-08-24T18:15:06.334+04:00</updated><title type='text'>Magic Words of Interspeech 2011</title><content type='html'>&lt;a href="http://interspeech2011.org"&gt;Interspeech 2011&lt;/a&gt; is coming. 
It is going to be an amazing event, I suppose. If you are interested in what is going on there, let's figure it out.&lt;br /&gt;&lt;br /&gt;To keep things simple we will use Unix command line tools. Sometimes text processing can be fun even with simple commands. Text is still the most convenient form of presenting information, way better than HTML or databases. Of course, more advanced things like stopword filtering or named entity recognition are lacking. Let's hope one day the Unix command line will have them. &lt;br /&gt;&lt;br /&gt;1. Download the full printable programs of Interspeech 2010 and Interspeech 2011 with wget, dump them to text with lynx and clean up punctuation with sed.&lt;br /&gt;&lt;br /&gt;2. Dump word counts with the SRILM tool ngram-count and cut the list for 2011 to the 1000 most frequent words with sort and head. Keep all words in the 2010 list.&lt;br /&gt;&lt;br /&gt;3. Figure out which of the words in the 2011 list are new and do not appear in the 2010 list with sort and uniq.&lt;br /&gt;&lt;br /&gt;Surprisingly, there are only 2 new words. 
They are: &lt;b&gt;i-vector&lt;/b&gt; and &lt;b&gt;crowdsourcing&lt;/b&gt;.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5075537609830514591-7069692559002513011?l=nsh.nexiwave.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nsh.nexiwave.com/feeds/7069692559002513011/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://nsh.nexiwave.com/2011/08/magic-words-of-interspeech-2011.html#comment-form' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/7069692559002513011'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/7069692559002513011'/><link rel='alternate' type='text/html' href='http://nsh.nexiwave.com/2011/08/magic-words-of-interspeech-2011.html' title='Magic Words of Interspeech 2011'/><author><name>Nickolay V. 
Shmyrev</name><uri>http://www.blogger.com/profile/11220369315272283124</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://4.bp.blogspot.com/-mLPNJlEdMV8/TbZcF3qUEqI/AAAAAAAAAKA/6PXGkO0cmAk/s220/me.jpg'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5075537609830514591.post-6029136826831341560</id><published>2011-07-26T09:30:00.002+04:00</published><updated>2012-01-15T01:24:31.386+04:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='random stuff'/><title type='text'>When Language Models Fail</title><content type='html'>Language modeling still has many challenging problems.&lt;br /&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://jimbenton.com" imageanchor="1" style="clear:left; float:left;margin-right:1em; margin-bottom:1em"&gt;&lt;img border="0" height="650" width="500" src="http://laughingsquid.com/wp-content/uploads/yoda-pizza-20110717-142433.jpg" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;Comic by &lt;a href="http://jimbenton.com/"&gt;Jim Benton&lt;/a&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5075537609830514591-6029136826831341560?l=nsh.nexiwave.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nsh.nexiwave.com/feeds/6029136826831341560/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://nsh.nexiwave.com/2011/07/when-language-models-fail.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/6029136826831341560'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/6029136826831341560'/><link rel='alternate' type='text/html' 
href='http://nsh.nexiwave.com/2011/07/when-language-models-fail.html' title='When Language Models Fail'/><author><name>Nickolay V. Shmyrev</name><uri>http://www.blogger.com/profile/11220369315272283124</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://4.bp.blogspot.com/-mLPNJlEdMV8/TbZcF3qUEqI/AAAAAAAAAKA/6PXGkO0cmAk/s220/me.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5075537609830514591.post-6501086594894308370</id><published>2011-07-23T02:58:00.011+04:00</published><updated>2011-07-26T09:33:47.371+04:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='cmusphinx'/><title type='text'>Decoders And Features</title><content type='html'>&lt;div dir="ltr" style="text-align: left;" trbidi="on"&gt;CMUSphinx decoders at a glance, so one can compare. The table is incomplete and imprecise, of course.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;table border="1" bordercolor="#E0E0E0" cellpadding="4" cellspacing="0"&gt;&lt;col width="51*"&gt;&lt;/col&gt;  &lt;col width="51*"&gt;&lt;/col&gt;  &lt;col width="51*"&gt;&lt;/col&gt;  &lt;col width="51*"&gt;&lt;/col&gt;  &lt;col width="51*"&gt;&lt;/col&gt;  &lt;tbody&gt;&lt;tr valign="TOP"&gt;   &lt;td height="54" width="20%"&gt;&lt;br /&gt;&lt;br /&gt;&lt;/td&gt;   &lt;td width="20%"&gt;&lt;b&gt;sphinx2&lt;/b&gt;&lt;/td&gt;   &lt;td width="20%"&gt;&lt;b&gt;sphinx3&lt;/b&gt;&lt;/td&gt;   &lt;td width="20%"&gt;&lt;b&gt;sphinx4&lt;/b&gt;&lt;/td&gt;   &lt;td width="20%"&gt;&lt;b&gt;pocketsphinx&lt;/b&gt;&lt;/td&gt;  &lt;/tr&gt;&lt;tr&gt;   &lt;td height="55" valign="TOP" width="20%"&gt;Acoustic Lookahead&lt;/td&gt;   &lt;td width="20%"&gt;&lt;div align="CENTER"&gt;-&lt;/div&gt;&lt;/td&gt;   &lt;td width="20%"&gt;&lt;div align="CENTER"&gt;-&lt;/div&gt;&lt;/td&gt;   &lt;td width="20%"&gt;&lt;div align="CENTER"&gt;+&lt;/div&gt;&lt;/td&gt;   &lt;td width="20%"&gt;&lt;div align="CENTER"&gt;-&lt;/div&gt;&lt;/td&gt;  
&lt;/tr&gt;&lt;tr&gt;   &lt;td height="55" valign="TOP" width="20%"&gt;Alignment&lt;/td&gt;   &lt;td width="20%"&gt;&lt;div align="CENTER"&gt;+&lt;/div&gt;&lt;/td&gt;   &lt;td width="20%"&gt;&lt;div align="CENTER"&gt;+&lt;/div&gt;&lt;/td&gt;   &lt;td width="20%"&gt;&lt;div align="CENTER"&gt;+&lt;/div&gt;&lt;/td&gt;   &lt;td width="20%"&gt;&lt;div align="CENTER"&gt;-&lt;/div&gt;&lt;/td&gt;  &lt;/tr&gt;&lt;tr&gt;   &lt;td height="55" valign="TOP" width="20%"&gt;Flat Forward Search&lt;/td&gt;   &lt;td width="20%"&gt;&lt;div align="CENTER"&gt;+&lt;/div&gt;&lt;/td&gt;   &lt;td width="20%"&gt;&lt;div align="CENTER"&gt;+&lt;/div&gt;&lt;/td&gt;   &lt;td width="20%"&gt;&lt;div align="CENTER"&gt;-&lt;/div&gt;&lt;/td&gt;   &lt;td width="20%"&gt;&lt;div align="CENTER"&gt;+&lt;/div&gt;&lt;/td&gt;  &lt;/tr&gt;&lt;tr&gt;   &lt;td height="55" valign="TOP" width="20%"&gt;Finite Grammar Confidence&lt;/td&gt;   &lt;td width="20%"&gt;&lt;div align="CENTER"&gt;+&lt;/div&gt;&lt;/td&gt;   &lt;td width="20%"&gt;&lt;div align="CENTER"&gt;-&lt;/div&gt;&lt;/td&gt;   &lt;td width="20%"&gt;&lt;div align="CENTER"&gt;-&lt;/div&gt;&lt;/td&gt;   &lt;td width="20%"&gt;&lt;div align="CENTER"&gt;-&lt;/div&gt;&lt;/td&gt;  &lt;/tr&gt;&lt;tr&gt;   &lt;td height="55" valign="TOP" width="20%"&gt;Full n-gram History Tree Search&lt;/td&gt;   &lt;td width="20%"&gt;&lt;div align="CENTER"&gt;-&lt;/div&gt;&lt;/td&gt;   &lt;td width="20%"&gt;&lt;div align="CENTER"&gt;-&lt;/div&gt;&lt;/td&gt;   &lt;td width="20%"&gt;&lt;div align="CENTER"&gt;+&lt;/div&gt;&lt;/td&gt;   &lt;td width="20%"&gt;&lt;div align="CENTER"&gt;-&lt;/div&gt;&lt;/td&gt;  &lt;/tr&gt;&lt;tr&gt;   &lt;td height="55" valign="TOP" width="20%"&gt;HTK Features&lt;/td&gt;   &lt;td width="20%"&gt;&lt;div align="CENTER"&gt;-&lt;/div&gt;&lt;/td&gt;   &lt;td width="20%"&gt;&lt;div align="CENTER"&gt;+&lt;/div&gt;&lt;/td&gt;   &lt;td width="20%"&gt;&lt;div align="CENTER"&gt;+&lt;/div&gt;&lt;/td&gt;   &lt;td width="20%"&gt;&lt;div 
align="CENTER"&gt;+&lt;/div&gt;&lt;/td&gt;  &lt;/tr&gt;&lt;tr&gt;   &lt;td height="55" valign="TOP" width="20%"&gt;Phonetic Loop Decoder&lt;/td&gt;   &lt;td width="20%"&gt;&lt;div align="CENTER"&gt;+&lt;/div&gt;&lt;/td&gt;   &lt;td width="20%"&gt;&lt;div align="CENTER"&gt;+&lt;/div&gt;&lt;/td&gt;   &lt;td width="20%"&gt;&lt;div align="CENTER"&gt;-&lt;/div&gt;&lt;/td&gt;   &lt;td width="20%"&gt;&lt;div align="CENTER"&gt;-&lt;/div&gt;&lt;/td&gt;  &lt;/tr&gt;&lt;tr&gt;   &lt;td height="55" valign="TOP" width="20%"&gt;Phonetic&lt;br /&gt;Lookahead&lt;/td&gt;   &lt;td width="20%"&gt;&lt;div align="CENTER"&gt;+&lt;/div&gt;&lt;/td&gt;   &lt;td width="20%"&gt;&lt;div align="CENTER"&gt;+&lt;/div&gt;&lt;/td&gt;   &lt;td width="20%"&gt;&lt;div align="CENTER"&gt;-&lt;/div&gt;&lt;/td&gt;   &lt;td width="20%"&gt;&lt;div align="CENTER"&gt;+&lt;/div&gt;&lt;/td&gt;  &lt;/tr&gt;&lt;tr&gt;   &lt;td height="55" valign="TOP" width="20%"&gt;PLP features&lt;/td&gt;   &lt;td width="20%"&gt;&lt;div align="CENTER"&gt;-&lt;/div&gt;&lt;/td&gt;   &lt;td width="20%"&gt;&lt;div align="CENTER"&gt;-&lt;/div&gt;&lt;/td&gt;   &lt;td width="20%"&gt;&lt;div align="CENTER"&gt;+&lt;/div&gt;&lt;/td&gt;   &lt;td width="20%"&gt;&lt;div align="CENTER"&gt;-&lt;/div&gt;&lt;/td&gt;  &lt;/tr&gt;&lt;tr&gt;   &lt;td height="55" valign="TOP" width="20%"&gt;PTM Models&lt;/td&gt;   &lt;td width="20%"&gt;&lt;div align="CENTER"&gt;-&lt;/div&gt;&lt;/td&gt;   &lt;td width="20%"&gt;&lt;div align="CENTER"&gt;-&lt;/div&gt;&lt;/td&gt;   &lt;td width="20%"&gt;&lt;div align="CENTER"&gt;-&lt;/div&gt;&lt;/td&gt;   &lt;td width="20%"&gt;&lt;div align="CENTER"&gt;+&lt;/div&gt;&lt;/td&gt;  &lt;/tr&gt;&lt;tr&gt;   &lt;td height="55" valign="TOP" width="20%"&gt;Score Quantization&lt;/td&gt;   &lt;td width="20%"&gt;&lt;div align="CENTER"&gt;+&lt;/div&gt;&lt;/td&gt;   &lt;td width="20%"&gt;&lt;div align="CENTER"&gt;-&lt;/div&gt;&lt;/td&gt;   &lt;td width="20%"&gt;&lt;div align="CENTER"&gt;-&lt;/div&gt;&lt;/td&gt;   &lt;td 
width="20%"&gt;&lt;div align="CENTER"&gt;+&lt;/div&gt;&lt;/td&gt;  &lt;/tr&gt;&lt;tr&gt;   &lt;td height="55" valign="TOP" width="20%"&gt;Semi-Continuous Models&lt;/td&gt;   &lt;td width="20%"&gt;&lt;div align="CENTER"&gt;+&lt;/div&gt;&lt;/td&gt;   &lt;td width="20%"&gt;&lt;div align="CENTER"&gt;+&lt;/div&gt;&lt;/td&gt;   &lt;td width="20%"&gt;&lt;div align="CENTER"&gt;-&lt;/div&gt;&lt;/td&gt;   &lt;td width="20%"&gt;&lt;div align="CENTER"&gt;+&lt;/div&gt;&lt;/td&gt;  &lt;/tr&gt;&lt;tr&gt;   &lt;td height="55" valign="TOP" width="20%"&gt;Single Tree&lt;br /&gt;Search&lt;/td&gt;   &lt;td width="20%"&gt;&lt;div align="CENTER"&gt;+&lt;/div&gt;&lt;/td&gt;   &lt;td width="20%"&gt;&lt;div align="CENTER"&gt;-&lt;/div&gt;&lt;/td&gt;   &lt;td width="20%"&gt;&lt;div align="CENTER"&gt;-&lt;/div&gt;&lt;/td&gt;   &lt;td width="20%"&gt;&lt;div align="CENTER"&gt;+&lt;/div&gt;&lt;/td&gt;  &lt;/tr&gt;&lt;tr&gt;   &lt;td height="55" valign="TOP" width="20%"&gt;Subvector     &lt;br /&gt;Quantization&lt;/td&gt;   &lt;td width="20%"&gt;&lt;div align="CENTER"&gt;+&lt;/div&gt;&lt;/td&gt;   &lt;td width="20%"&gt;&lt;div align="CENTER"&gt;+&lt;/div&gt;&lt;/td&gt;   &lt;td width="20%"&gt;&lt;div align="CENTER"&gt;-&lt;/div&gt;&lt;/td&gt;   &lt;td width="20%"&gt;&lt;div align="CENTER"&gt;+&lt;/div&gt;&lt;/td&gt;  &lt;/tr&gt;&lt;tr&gt;   &lt;td height="55" valign="TOP" width="20%"&gt;Time-Switching&lt;br /&gt;Tree Search&lt;/td&gt;   &lt;td width="20%"&gt;&lt;div align="CENTER"&gt;-&lt;/div&gt;&lt;/td&gt;   &lt;td width="20%"&gt;&lt;div align="CENTER"&gt;+&lt;/div&gt;&lt;/td&gt;   &lt;td width="20%"&gt;&lt;div align="CENTER"&gt;-&lt;/div&gt;&lt;/td&gt;   &lt;td width="20%"&gt;&lt;div align="CENTER"&gt;-&lt;/div&gt;&lt;/td&gt;  &lt;/tr&gt;&lt;tr&gt;   &lt;td height="55" valign="TOP" width="20%"&gt;Tree Search Smear&lt;/td&gt;   &lt;td width="20%"&gt;&lt;div align="CENTER"&gt;-&lt;/div&gt;&lt;/td&gt;   &lt;td width="20%"&gt;&lt;div align="CENTER"&gt;+&lt;/div&gt;&lt;/td&gt;   &lt;td 
width="20%"&gt;&lt;div align="CENTER"&gt;+&lt;/div&gt;&lt;/td&gt;   &lt;td width="20%"&gt;&lt;div align="CENTER"&gt;-&lt;/div&gt;&lt;/td&gt;  &lt;/tr&gt;&lt;tr&gt;   &lt;td height="54" valign="TOP" width="20%"&gt;Word-Switching&lt;br /&gt;Tree Search&lt;/td&gt;   &lt;td width="20%"&gt;&lt;div align="CENTER"&gt;-&lt;/div&gt;&lt;/td&gt;   &lt;td width="20%"&gt;&lt;div align="CENTER"&gt;+&lt;/div&gt;&lt;/td&gt;   &lt;td width="20%"&gt;&lt;div align="CENTER"&gt;-&lt;/div&gt;&lt;/td&gt;   &lt;td width="20%"&gt;&lt;div align="CENTER"&gt;-&lt;/div&gt;&lt;/td&gt;  &lt;/tr&gt;&lt;tr&gt;   &lt;td height="54" valign="TOP" width="20%"&gt;Thread Safety&lt;/td&gt;   &lt;td width="20%"&gt;&lt;div align="CENTER"&gt;-&lt;/div&gt;&lt;/td&gt;   &lt;td width="20%"&gt;&lt;div align="CENTER"&gt;-&lt;/div&gt;&lt;/td&gt;   &lt;td width="20%"&gt;&lt;div align="CENTER"&gt;+&lt;/div&gt;&lt;/td&gt;   &lt;td width="20%"&gt;&lt;div align="CENTER"&gt;+&lt;/div&gt;&lt;/td&gt;  &lt;/tr&gt;&lt;tr&gt;   &lt;td height="54" valign="TOP" width="20%"&gt;Keyword Spotting&lt;/td&gt;   &lt;td width="20%"&gt;&lt;div align="CENTER"&gt;-&lt;/div&gt;&lt;/td&gt;   &lt;td width="20%"&gt;&lt;div align="CENTER"&gt;-&lt;/div&gt;&lt;/td&gt;   &lt;td width="20%"&gt;&lt;div align="CENTER"&gt;+&lt;/div&gt;&lt;/td&gt;   &lt;td width="20%"&gt;&lt;div align="CENTER"&gt;-&lt;/div&gt;&lt;/td&gt;  &lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;br /&gt;And here is a description of the entries:&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Specific Applications&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;Phonetic Loop Decoder. Phonetic loop decoding requires a specialized search algorithm. It's not implemented in Sphinx4, for example.&lt;br /&gt;&lt;br /&gt;Alignment. Given the audio and its transcription, get the word timings.&lt;br /&gt;&lt;br /&gt;Keyword spotting. Searching for a keyword requires a separate search space and a different search approach.&lt;br /&gt;&lt;br /&gt;Finite Grammar Confidence. Get a confidence estimate for a finite state grammar. 
This is a complex problem which requires additional operations during search, for example a phone loop pass.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Effective pruning&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;Acoustic Lookahead. Using the acoustic score for the current frame, we can predict the score for the next frame and thus prune tokens early.&lt;br /&gt;&lt;br /&gt;Phonetic Lookahead. Using a phonetic loop decoder, we can predict the possible phones and thus restrict the large vocabulary search.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Features&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;HTK Features. CMUSphinx feature extraction is different from HTK (different filterbank and transform). To provide HTK compatibility, one needs HTK-specific feature extraction.&lt;br /&gt;&lt;br /&gt;PLP features. A feature type different from the traditional MFCC. They are more popular now.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Search Space&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;Flat Forward Search. A search space where word paths aren't joined in a lextree. Separate paths let us apply the language model probability earlier, so the search is more accurate. But because the search space is bigger, it's also slower. Usually flat search is applied as a second pass after tree search.&lt;br /&gt;&lt;br /&gt;Full n-gram History Tree Search. Tokens which have different n-gram histories are tracked separately. For example, the token for "how are UW " and the token for "hello are UW.." are tracked separately. In pocketsphinx such tokens are just joined and only the best one survives. Full history search is more accurate but slower and more complex to implement.&lt;br /&gt;&lt;br /&gt;Word-Switching Tree Search. Separate lextrees are kept for each unigram history. This search sits between keeping the full history and dropping the history altogether.&lt;br /&gt;&lt;br /&gt;Single Tree Search. Lextree tokens don't care about word history. This is a faster but less accurate way.&lt;br /&gt;&lt;br /&gt;Time-Switching Tree Search. 
Lextree states don't care about word history, but several lextrees (3-5) are kept in memory. In this time-switching approach lextrees are switched every frame. Because of that, there is a higher chance of tracking both histories.&lt;br /&gt;&lt;br /&gt;Tree Search Smear. The lextree contains unigram probabilities, and thus it's possible to prune tokens earlier based on the language score.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Acoustic Scoring&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;PTM Models. Models where gaussians are shared across senones with the same central phone. So we don't need to calculate gaussian values for each senone, just a few values for each central phone. Then, using different mixture weights, we get the senone score. This approach reduces the computation required but keeps accuracy at a reasonable level. It's similar to semi-continuous models, where gaussians are shared across all senones, not just across senones with the same central phone.&lt;br /&gt;&lt;br /&gt;Score Quantization. Acoustic scores in some cases can be represented by just 2 bytes (semi-continuous models and a specific feature set). Usually scores are in the log domain and shifted by 10 bits. This reduces the memory required for the acoustic model and for scoring and speeds up the computation, in particular on CPUs without an FPU.&lt;br /&gt;&lt;br /&gt;Semi-Continuous Models.  Gaussians are shared across all senones; only the mixture weights are different. Such models are fast and usually quite accurate. Usually they are multistream (s2_4x or 1s_c_d_dd with subvector 0-12/13-25/26-38) since separate streams can be quantized better.&lt;br /&gt;&lt;br /&gt;Subvector Quantization. A gaussian selection approach to reduce the cost of acoustic scoring. 
Basically, a continuous model is decomposed after training into several subvector gaussians which are shared across senones and thus scored efficiently.&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5075537609830514591-6501086594894308370?l=nsh.nexiwave.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nsh.nexiwave.com/feeds/6501086594894308370/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://nsh.nexiwave.com/2011/07/decoders-and-features.html#comment-form' title='4 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/6501086594894308370'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/6501086594894308370'/><link rel='alternate' type='text/html' href='http://nsh.nexiwave.com/2011/07/decoders-and-features.html' title='Decoders And Features'/><author><name>Nickolay V. Shmyrev</name><uri>http://www.blogger.com/profile/11220369315272283124</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://4.bp.blogspot.com/-mLPNJlEdMV8/TbZcF3qUEqI/AAAAAAAAAKA/6PXGkO0cmAk/s220/me.jpg'/></author><thr:total>4</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5075537609830514591.post-5210561226339266502</id><published>2011-06-28T02:51:00.007+04:00</published><updated>2011-07-01T16:22:04.530+04:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='random stuff'/><title type='text'>Cars Controlled By Speech</title><content type='html'>&lt;div dir="ltr" style="text-align: left;" trbidi="on"&gt;Being a speech recognition guy, I'm looking for a car with speech recognition included. It sounds strange to select a car just because of that, but I'm only kidding. 
So far the list is:&lt;br /&gt;&lt;br /&gt;&lt;ul style="text-align: left;"&gt;&lt;li&gt;Honda Accord&lt;/li&gt;&lt;li&gt;Any Ford 2011&lt;/li&gt;&lt;li&gt;Mazda 6&lt;/li&gt;&lt;/ul&gt;&lt;div&gt;Not listing something expensive like BMW or Mercedes. Hm, it looks like almost everyone is doing that. Any others? Which is the most advanced one?&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;Some details on particular implementations:&lt;br /&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;Ford SYNC 2011&lt;/b&gt;&lt;br /&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;A quite advanced, command-based system. It supports many types of commands, for example to control the DVD or get baseball scores. It supports user profiles, but it doesn't seem to have a specific training procedure. With current speaker recognition capabilities it could in theory adapt to users automatically without profiles.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;Mazda 6&amp;nbsp;2011&lt;/b&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;A pretty interesting system, but limited compared to the previous one. According to the owner's manual, it supports a very limited list of commands to manage calls and get incoming messages. Among the interesting capabilities, it supports training and voice entry for contacts. Three languages: English, French, Spanish. It looks like it's using a single microphone. 
It looks like the voice navigation system has a separate speech recognition subsystem.&lt;br /&gt;&lt;ul style="text-align: left;"&gt;&lt;/ul&gt;&lt;b&gt;Honda Fit 2009&lt;/b&gt;&lt;br /&gt;&lt;b&gt;&lt;br /&gt;&lt;/b&gt;&lt;br /&gt;&lt;object class="BLOGGER-youtube-video" classid="clsid:D27CDB6E-AE6D-11cf-96B8-444553540000" codebase="http://download.macromedia.com/pub/shockwave/cabs/flash/swflash.cab#version=6,0,40,0" data-thumbnail-src="http://1.gvt0.com/vi/yvLcQOeBAJE/0.jpg" height="266" style="clear: right; float: right;" width="320"&gt;&lt;param name="movie" value="http://www.youtube.com/v/yvLcQOeBAJE&amp;fs=1&amp;source=uds" /&gt;&lt;param name="bgcolor" value="#FFFFFF" /&gt;&lt;embed width="320" height="266"  src="http://www.youtube.com/v/yvLcQOeBAJE&amp;fs=1&amp;source=uds" type="application/x-shockwave-flash"&gt;&lt;/embed&gt;&lt;/object&gt;Many commands, mostly related to navigation, but no user adaptation and no profiles. Alphanumeric entry as a backup to vocabulary search. This one is very simple.&lt;br /&gt;&lt;b&gt;&lt;br /&gt;&lt;/b&gt;&lt;br /&gt;&lt;b&gt;Mitsubishi/Hyundai 2011&lt;/b&gt;&lt;br /&gt;&lt;b&gt;&lt;br /&gt;&lt;/b&gt;&lt;br /&gt;I didn't manage to find the manual for them. The feature name "Bluetooth hands-free phone system with voice recognition and phonebook download" makes me think it's the same system as in the Mazda.&lt;br /&gt;&lt;b&gt;&lt;br /&gt;&lt;/b&gt;&lt;br /&gt;&lt;b&gt;Talkmatic&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;It doesn't seem like this is deployed, but the presentation looks impressive.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;KIA&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;According to SpeechTechMag, &lt;a href="http://www.speechtechmag.com/Articles/PrintArticle.aspx?ArticleID=76345"&gt;Microsoft and Kia codeveloped the UVO multimedia and infotainment system&lt;/a&gt;, which the Korean automaker rolled out in its new Sportage, Sorento, and Optima models late last year. 
UVO lets users access media content and connect with people through&amp;nbsp; quick voice commands without having to navigate hierarchical menus.&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5075537609830514591-5210561226339266502?l=nsh.nexiwave.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nsh.nexiwave.com/feeds/5210561226339266502/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://nsh.nexiwave.com/2011/06/cars-controlled-by-speech.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/5210561226339266502'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/5210561226339266502'/><link rel='alternate' type='text/html' href='http://nsh.nexiwave.com/2011/06/cars-controlled-by-speech.html' title='Cars Controlled By Speech'/><author><name>Nickolay V. 
Shmyrev</name><uri>http://www.blogger.com/profile/11220369315272283124</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://4.bp.blogspot.com/-mLPNJlEdMV8/TbZcF3qUEqI/AAAAAAAAAKA/6PXGkO0cmAk/s220/me.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5075537609830514591.post-1547788091218214552</id><published>2011-06-21T03:33:00.002+04:00</published><updated>2011-06-21T03:42:23.048+04:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='speech recognition'/><category scheme='http://www.blogger.com/atom/ns#' term='conferences'/><title type='text'>ICASSP 2011 Part 1 - Thoughts</title><content type='html'>&lt;div dir="ltr" style="text-align: left;" trbidi="on"&gt;It seems like &lt;a href="http://www.icassp2011.com/"&gt;ICASSP&lt;/a&gt; this year was a great event, it is a&amp;nbsp;pity I missed it. Just comparing the keynotes list, ICASSP beats Interspeech 4:0. ICASSP is very technical, Interspeech is for linguists. Compare the two:&lt;br /&gt;&lt;br /&gt;&lt;i&gt;Making Sense of a Zettabyte World&lt;/i&gt; vs&amp;nbsp;&lt;i&gt;Neural Representations of Word Meanings&lt;/i&gt;&lt;br /&gt;&lt;i&gt;&lt;br /&gt;&lt;/i&gt;&lt;br /&gt;&lt;i&gt;&lt;span class="Apple-style-span" style="font-style: normal;"&gt;New section formats like technical tracks and trends discussions are interesting, though I am not sure how they felt in practice.&lt;/span&gt;&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;So this was the reason to spend a few days reading. 1000 papers on speech technology! Huh.&amp;nbsp;Thanks to all authors for their hard work! Well, I found several duplicates in the end.&lt;br /&gt;&lt;br /&gt;The main thing I noted is that the topics of the research are very sparse, for example&lt;br /&gt;&lt;ul style="text-align: left;"&gt;&lt;li&gt;Everyone does speaker recognition. An appealing problem statement here is to detect a synthetic speaker. 
The paper titled "DETECTION OF SYNTHETIC SPEECH FOR THE PROBLEM OF&amp;nbsp;IMPOSTURE" by De Leon et al.&amp;nbsp;hints that there is no solution for that.&lt;/li&gt;&lt;li&gt;I&amp;nbsp;got tired of skipping pursuits, bandits and compressive sensing&lt;/li&gt;&lt;li&gt;On the other hand, the increased share of papers on non-speech signals, the cocktail party problem and signal recovery is very interesting to read.&lt;/li&gt;&lt;li&gt;Things like DBN features or the SCARF decoder are widely represented. You can read about applications of CRF from g2p algorithms to dialogs. But traditional things like search algorithms and adaptation are hardly covered.&amp;nbsp;&lt;/li&gt;&lt;li&gt;It was surprising to find a session dedicated to multimedia security, which must be a gold mine of ideas, in particular if you need a topic for a paper. Is there a company selling such products?&amp;nbsp;&lt;/li&gt;&lt;/ul&gt;&lt;div&gt;Overall I found several original problem statements as well as inspiring ideas covering very important technology issues. For example, it would be nice to implement a meeting transcription application with several iPhones to combine streams and later transcribe them using multichannel environment compensation. Several meeting transcription setups and channel separation methods are described in the conference proceedings.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;After reading a number of papers I found that conference papers are too short. When you see a nice title and an abstract, you expect to read a detailed insight into the problem, with historical background and everything explained in detail, a deep investigation of the problem. But you get just a description of the technology and a few figures from experiments. 
On the other side, I will not be able to read 100 papers 20 pages each.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;Very interesting that this year &lt;a href="http://www.icassp2011.com/en/awards"&gt;awards&lt;/a&gt; are not related to speech technology. That will be the contents of Part 2. I just need to go through last 50 papers left.&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5075537609830514591-1547788091218214552?l=nsh.nexiwave.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nsh.nexiwave.com/feeds/1547788091218214552/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://nsh.nexiwave.com/2011/06/icassp-2011-part-1-thoughts.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/1547788091218214552'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/1547788091218214552'/><link rel='alternate' type='text/html' href='http://nsh.nexiwave.com/2011/06/icassp-2011-part-1-thoughts.html' title='ICASSP 2011 Part 1 - Thoughts'/><author><name>Nickolay V. 
Shmyrev</name><uri>http://www.blogger.com/profile/11220369315272283124</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://4.bp.blogspot.com/-mLPNJlEdMV8/TbZcF3qUEqI/AAAAAAAAAKA/6PXGkO0cmAk/s220/me.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5075537609830514591.post-8709266703635548390</id><published>2011-05-02T23:51:00.001+04:00</published><updated>2011-05-02T23:52:22.098+04:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='pocketsphinx'/><title type='text'>Chicken-And-Egg in Sphinxbase</title><content type='html'>&lt;div dir="ltr" style="text-align: left;" trbidi="on"&gt;Recently Shea Levy pointed me to an issue with a verbose output during pocketsphinx initialization. Basically every time you start pocketsphinx, you get something like&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;INFO: cmd_ln.c(691): Parsing command line:&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;pocketsphinx_continuous&amp;nbsp;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;Current configuration:&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;[NAME]&lt;/span&gt;&lt;span class="Apple-tab-span" style="white-space: pre;"&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;  &lt;/span&gt;&lt;/span&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;[DEFLT]&lt;/span&gt;&lt;span class="Apple-tab-span" style="white-space: pre;"&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;  &lt;/span&gt;&lt;/span&gt;&lt;span class="Apple-style-span" 
style="font-family: 'Courier New', Courier, monospace;"&gt;[VALUE]&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;-adcdev&lt;/span&gt;&lt;span class="Apple-tab-span" style="white-space: pre;"&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;    &lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;-agc&lt;/span&gt;&lt;span class="Apple-tab-span" style="white-space: pre;"&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;  &lt;/span&gt;&lt;/span&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;none&lt;/span&gt;&lt;span class="Apple-tab-span" style="white-space: pre;"&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;  &lt;/span&gt;&lt;/span&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;none&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;-agcthresh&lt;/span&gt;&lt;span class="Apple-tab-span" style="white-space: pre;"&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt; &lt;/span&gt;&lt;/span&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;2.0&lt;/span&gt;&lt;span class="Apple-tab-span" style="white-space: pre;"&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;  &lt;/span&gt;&lt;/span&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;2.000000e+00&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;-alpha&lt;/span&gt;&lt;span class="Apple-tab-span" style="white-space: 
pre;"&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;  &lt;/span&gt;&lt;/span&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;0.97&lt;/span&gt;&lt;span class="Apple-tab-span" style="white-space: pre;"&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;  &lt;/span&gt;&lt;/span&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;9.700000e-01&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;-argfile&lt;/span&gt;&lt;span class="Apple-tab-span" style="white-space: pre;"&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;   &lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;div&gt;It's ok for a tool but not a nice thing for a library, which should be a small horse in the rig of an application. Not every user is happy seeing all this stuff dumped on the screen. And the worst thing is that there is no way to turn it off, because "-logfn /dev/null" works only for the output after initialization. So we are looking to have pocketsphinx completely silent.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;It turned out to be a more complex issue than I thought. It's a classical chicken-and-egg issue: you use the configuration framework to configure logging, but the configuration framework needs to log itself. We just hardcoded the initialization, but thinking about it afterwards I found a way more complex but more rigorous approach in the log4j description from&amp;nbsp;&lt;a href="http://articles.qos.ch/internalLogging.html"&gt;http://articles.qos.ch/internalLogging.html&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;&lt;blockquote&gt;Since log4j never sets up a configuration without explicit input from the user, log4j internal logging may occur before the log4j environment is set up. 
In particular, internal logging may occur while a configurator is processing a configuration file.&lt;br /&gt;&lt;br /&gt;We could have simplified things by ignoring logging events generated during the configuration phase. However, the events generated during the configuration phase contain information useful in debugging the log4j configuration file. Under many circumstances, this information is considered more useful than all the subsequent logging events put together.&lt;br /&gt;&lt;br /&gt;In order to capture the logs generated during configuration phase, log4j simply collects logging events in a temporary appender. At the end of the configuration phase, these recorded events are replayed within the context of the new log4j environment, (the one which was just configured). The temporary appender is then closed and detached from the log4j environment.&lt;/blockquote&gt;&lt;br /&gt;&lt;div&gt;Oh-woh, I will never get enough passion to implement this properly ;) Let it be as is for now.&lt;/div&gt;&lt;br /&gt;&lt;div&gt;Sphinxbase command line options are still not good. I pretty much miss a proper --help, --version and many more nifty getopt things. 
One day someone should do this.&lt;/div&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5075537609830514591-8709266703635548390?l=nsh.nexiwave.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nsh.nexiwave.com/feeds/8709266703635548390/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://nsh.nexiwave.com/2011/05/chicken-and-egg-in-sphinxbase.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/8709266703635548390'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/8709266703635548390'/><link rel='alternate' type='text/html' href='http://nsh.nexiwave.com/2011/05/chicken-and-egg-in-sphinxbase.html' title='Chicken-And-Egg in Sphinxbase'/><author><name>Nickolay V. Shmyrev</name><uri>http://www.blogger.com/profile/11220369315272283124</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://4.bp.blogspot.com/-mLPNJlEdMV8/TbZcF3qUEqI/AAAAAAAAAKA/6PXGkO0cmAk/s220/me.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5075537609830514591.post-1698681709629508495</id><published>2011-04-26T17:17:00.004+04:00</published><updated>2011-04-26T17:27:37.792+04:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='asterisk'/><title type='text'>Voicemail transcription with Pocketsphinx and Asterisk (Part 2)</title><content type='html'>&lt;div dir="ltr" style="text-align: left;" trbidi="on"&gt;&lt;i&gt;This is a second part which describes voicemail transcription for Asterisk administrators. 
See the previous part, which describes how to set up Pocketsphinx, &lt;a href="http://nsh.nexiwave.com/2010/09/voicemail-transcription-with.html"&gt;here&lt;/a&gt;&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;So you have configured the recognizer to transcribe voicemails and now want to improve the recognizer accuracy. Honestly, I can tell you that you will not get perfect transcription results for free unless you send voicemails to some human-assisted transcription company. You will not get them from Google either. Though there are several commercial services to try, like Yap or Phonetag, which specialize in voicemails specifically. Our proprietary &lt;a href="http://nexiwave.com/"&gt;Nexiwave&lt;/a&gt; technology, for example, uses way more advanced algorithms and way bigger speech databases than those distributed with Pocketsphinx. And it's a really visible difference.&lt;br /&gt;&lt;br /&gt;However, even the result you can get with Pocketsphinx can be very usable for you. I estimate you can easily get 80-90% accuracy with little effort, provided the language of your voicemails is simple.&lt;br /&gt;&lt;br /&gt;&lt;a name='more'&gt;&lt;/a&gt;&lt;br /&gt;Now, the core components of the recognizer are:&lt;br /&gt;&lt;br /&gt;&lt;ul style="text-align: left;"&gt;&lt;li&gt;Language model, which controls the sequence of words&lt;/li&gt;&lt;li&gt;Acoustic model, which describes how each phone sounds&lt;/li&gt;&lt;li&gt;Phonetic dictionary, which maps words to their phonetic representation&lt;/li&gt;&lt;/ul&gt;To get better accuracy you need to improve those three. 
By default the following models are used:&lt;br /&gt;&lt;ul style="text-align: left;"&gt;&lt;li&gt;Dictionary - &lt;span style="font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;"&gt;pocketsphinx/model/lm/en_US/cmu07a.dic&lt;/span&gt;&lt;/li&gt;&lt;li&gt;Language model - &lt;span style="font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;"&gt;pocketsphinx/model/lm/en_US/hub4.5000.DMP&lt;/span&gt;&lt;/li&gt;&lt;li&gt;Acoustic model - &lt;span style="font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;"&gt;pocketsphinx/model/hmm/en_US/hub4wsj_sc_8k&lt;/span&gt;&lt;/li&gt;&lt;/ul&gt;So let's try to improve them step by step, in order of importance&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Language model&lt;/b&gt;&lt;br /&gt;The core reason voicemail transcription is bad is that the language model is built for a completely different domain. HUB4 is a DARPA task to transcribe broadcast news, so you see it's very different from the voicemail language. It's perfect for recognizing voicemail about NATO or democracy, but not about your wife's problems. We need to change the language model.&lt;br /&gt;&lt;br /&gt;1) Transcribe some amount of your existing voicemails. A hundred will already be good. 
Put the transcription in a text file line by line:&lt;br /&gt;&lt;br /&gt;&lt;div style="font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;"&gt;hello jim it's steve let's meet at five p m&lt;/div&gt;&lt;div style="font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;"&gt;hello jim buy some milk&lt;/div&gt;&lt;div style="font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;"&gt;jim it's bob i should catch you tomorrow&lt;/div&gt;&lt;div style="font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;"&gt;you fired jim &lt;/div&gt;&lt;div style="font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;"&gt;....&lt;/div&gt;&lt;br /&gt;It is important that the text is all in lower case, one sentence per line, without any punctuation.&lt;br /&gt;&lt;br /&gt;2) Then, you can find some domain-specific texts on your computer. For example, if you work as a system administrator in a chemical company, some chemical texts will help to improve the quality of the language model. Take a few books and convert them to the same simple text form: strip out punctuation and formatting, and add them to the text of the transcribed voicemails. 
Consider your email archives too, they can also be good.&lt;br /&gt;&lt;br /&gt;3) Then, you can just use the MITLM toolkit to convert the texts you've collected to the language model&lt;br /&gt;&lt;br /&gt;Download the MITLM language model toolkit here&lt;br /&gt;&lt;br /&gt;&lt;a href="http://code.google.com/p/mitlm/"&gt;http://code.google.com/p/mitlm/&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;Run it as&lt;br /&gt;&lt;br /&gt;&lt;div style="font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;"&gt;estimate-ngram -text voicemail.txt -write-lm your_model.lm&lt;/div&gt;&lt;div style="font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;"&gt;&lt;br /&gt;&lt;/div&gt;It will create the language model &lt;span style="font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;"&gt;your_model.lm&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;4) Sometimes it makes sense to mix your specific model with a generic model. It may help if your training text is small or your model is not good enough. To do that, download a generic model here:&lt;br /&gt;&lt;br /&gt;&lt;a href="http://keithv.com/software/giga/lm_giga_5k_nvp_3gram.zip"&gt;http://keithv.com/software/giga/lm_giga_5k_nvp_3gram.zip&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;Then unpack it and interpolate with your voicemail model using MITLM tools:&lt;br /&gt;&lt;br /&gt;&lt;div style="font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;"&gt;interpolate-ngram -lm "your_model.lm, lm_giga_5k_nvp_3gram.arpa" -interpolation LI -op voicemail.txt -wl Lectures+Textbook.LI.lm&lt;/div&gt;&lt;br /&gt;See the MITLM tutorial for details &lt;a href="http://code.google.com/p/mitlm/wiki/Tutorial"&gt;http://code.google.com/p/mitlm/wiki/Tutorial&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;The lm_giga model is quite big; you can also pick the hub4 language model for interpolation. 
To do that you need to convert it from the binary form to text form first:&lt;br /&gt;&lt;br /&gt;&lt;div style="font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;"&gt;sphinx_lm_convert -ifmt dmp -ofmt arpa hub4.5000.DMP hub4.lm&lt;/div&gt;&lt;br /&gt;One day you will be able to work with a language model using the CMU language model toolkit CMUCLMTK, but for now it's more complicated than MITLM. So MITLM is the recommended tool for language model operations.&lt;br /&gt;&lt;br /&gt;5) To speed up the startup of the recognizer, sort the model and convert it to a binary format:&lt;br /&gt;&lt;br /&gt;&lt;div style="font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;"&gt;sphinx_lm_sort &amp;lt; your_model.lm &amp;gt; your_model_sorted.lm&lt;/div&gt;&lt;div style="font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;"&gt;sphinx_lm_convert -i your_model_sorted.lm -o your_model_sorted.lm.dmp&lt;/div&gt;&lt;br /&gt;6) In the Pocketsphinx script, use your language model for transcription by adding the following argument:&lt;br /&gt;&lt;br /&gt;&lt;div style="font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;"&gt;-lm your_model_sorted.lm.dmp&lt;/div&gt;&lt;br /&gt;That's it.&lt;br /&gt;&lt;br /&gt;By the way, the Google API is mostly trained on search queries. While it is perfectly suitable for voice search, it's not good for voicemail transcription either. Voicemail transcription texts are usually quite sensitive information and it's very hard to get free access to them.&lt;br /&gt;&lt;br /&gt;I think after this step the accuracy of the transcription is already good enough. You will be able to collect transcription results, fix them and use them to improve the language model.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Acoustic model&lt;/b&gt;&lt;br /&gt;Sometimes it's useful to update the acoustic model. This step will require you to compile and set up Sphinxtrain. Again, transcribe a few voicemails you've recorded, then organize them into a database. 
Then follow the acoustic model adaptation HOWTO as described in the CMUSphinx wiki:&lt;br /&gt;&lt;br /&gt;&lt;a href="http://cmusphinx.sourceforge.net/wiki/tutorialadapt"&gt;http://cmusphinx.sourceforge.net/wiki/tutorialadapt&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;Acoustic model adaptation always makes sense, but it's quite a time-consuming process. Maybe one day someone will automate it to make it really effortless. For example, we have started a project to help train and adapt the model from a set of long files accompanied by text, not from a carefully drafted database. Once this project is completed it will be way easier to train and adapt the acoustic models. Any help on this is appreciated.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Dictionary&lt;/b&gt;&lt;br /&gt;There can be cases when you need to add a few words which are missing from the dictionary. For example, in step 1 when you adapted the language model you've got a few words which are missing in cmu07a.dic. Then it makes sense to add them. Just open the dictionary with a text editor, find the appropriate place and add or edit the phonetic pronunciation of the word. For example, the CMU dictionary is missing the word "twitter"&lt;br /&gt;&lt;br /&gt;&lt;div style="font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;"&gt;twitter T W IH T ER&lt;/div&gt;&lt;br /&gt;Usually this step is not needed, but if you have, for example, accented words or some other unusual words it may help.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Test the model&lt;/b&gt;&lt;br /&gt;After you have adapted the model, retranscribe the files you have already collected. Check whether the accuracy is good or not.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;&lt;span style="font-size: small;"&gt;Follow up&lt;/span&gt;&lt;/b&gt;&lt;br /&gt;So here are the directions to take. I understand it's some work, but maybe you will consider it worth the effort. 
We are really trying to make this process easier and your comments on that will be very appreciated.&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5075537609830514591-1698681709629508495?l=nsh.nexiwave.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nsh.nexiwave.com/feeds/1698681709629508495/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://nsh.nexiwave.com/2011/04/voicemail-transcription-with.html#comment-form' title='7 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/1698681709629508495'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/1698681709629508495'/><link rel='alternate' type='text/html' href='http://nsh.nexiwave.com/2011/04/voicemail-transcription-with.html' title='Voicemail transcription with Pocketsphinx and Asterisk (Part 2)'/><author><name>Nickolay V. 
Shmyrev</name><uri>http://www.blogger.com/profile/11220369315272283124</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://4.bp.blogspot.com/-mLPNJlEdMV8/TbZcF3qUEqI/AAAAAAAAAKA/6PXGkO0cmAk/s220/me.jpg'/></author><thr:total>7</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5075537609830514591.post-6830695850058820076</id><published>2011-03-19T22:42:00.001+03:00</published><updated>2011-03-19T22:43:17.585+03:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='summer of code'/><category scheme='http://www.blogger.com/atom/ns#' term='cmusphinx'/><title type='text'>CMUSphinx accepted at Google Summer Of Code 2011</title><content type='html'>&lt;div dir="ltr" style="text-align: left;" trbidi="on"&gt;&lt;a href="http://upload.wikimedia.org/wikipedia/ru/1/1f/GSOC_198x128.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="128" src="http://upload.wikimedia.org/wikipedia/ru/1/1f/GSOC_198x128.png" width="198" /&gt;&lt;/a&gt;So we are in. Great to know that. &lt;br /&gt;&lt;br /&gt;For more information see&lt;br /&gt;&lt;br /&gt;&lt;a href="http://cmusphinx.sourceforge.net/2011/03/cmusphinx-at-gsoc-2011/"&gt;http://cmusphinx.sourceforge.net/2011/03/cmusphinx-at-gsoc-2011/&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;I think it's a big responsibility and a big opportunity as well. Of course we don't consider this as a way to improve CMUSphinx itself or as something that will allow us to get features coded for free. Instead, we are looking for new people to join CMUSphinx, becoming the part of it. Maybe it's a great opportunity for Nexiwave as well.&lt;br /&gt;&lt;br /&gt;For now the task is to prepare the list of ideas for the projects. I know they need to be drafted carefully. If you want to help, please jump in. 
I definitely need some help.&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5075537609830514591-6830695850058820076?l=nsh.nexiwave.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nsh.nexiwave.com/feeds/6830695850058820076/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://nsh.nexiwave.com/2011/03/cmusphinx-accepted-at-google-summer-of.html#comment-form' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/6830695850058820076'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/6830695850058820076'/><link rel='alternate' type='text/html' href='http://nsh.nexiwave.com/2011/03/cmusphinx-accepted-at-google-summer-of.html' title='CMUSphinx accepted at Google Summer Of Code 2011'/><author><name>Nickolay V. Shmyrev</name><uri>http://www.blogger.com/profile/11220369315272283124</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://4.bp.blogspot.com/-mLPNJlEdMV8/TbZcF3qUEqI/AAAAAAAAAKA/6PXGkO0cmAk/s220/me.jpg'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5075537609830514591.post-5505201939626749490</id><published>2011-03-13T05:11:00.000+03:00</published><updated>2011-03-13T05:11:12.182+03:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='wfst'/><title type='text'>Fillers in WFST</title><content type='html'>Another practical question is - how do you integrate fillers? 
There is a silence class introduced in &lt;br /&gt;&lt;br /&gt;&lt;a href="http://www.cs.nyu.edu/~mohri/pub/dmk.pdf"&gt;A GENERALIZED CONSTRUCTION OF INTEGRATED SPEECH RECOGNITION TRANSDUCERS by Cyril Allauzen, Mehryar Mohri, Michael Riley and Brian Roark&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;and implemented in &lt;a href="http://code.google.com/p/transducersaurus"&gt;transducersaurus&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;But you know, each practical model has more than just a silence. Fillers like noise, silence, breath and laughter all go to specific senones in the model. I usually try to minimize them during training, for example by joining all the ums, hmms, and mhms into a single phone, but I still think they are needed. How do you integrate them when you build a WFST recognizer?&lt;br /&gt;&lt;br /&gt;So I tried a few approaches. For example, instead of adding just a &amp;lt;sil&amp;gt; class in the T transducer I tried to create many branches, one for each filler. As a result the final cascade expands into a huge monster. For example, if the cascade was 50mb, after combination with 1 silence class it is 100mb, but after 3-4 classes it's 300mb. 
Not a nice thing to do.&lt;br /&gt;&lt;br /&gt;So I ended up with dynamic expansion of silence transitions like this:&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;if edge is silence:&lt;br /&gt;   for filler in fillers:&lt;br /&gt;      from_node.add_edge(filler)&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;This seems to work well.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5075537609830514591-5505201939626749490?l=nsh.nexiwave.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nsh.nexiwave.com/feeds/5505201939626749490/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://nsh.nexiwave.com/2011/03/fillers-in-wfst.html#comment-form' title='4 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/5505201939626749490'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/5505201939626749490'/><link rel='alternate' type='text/html' href='http://nsh.nexiwave.com/2011/03/fillers-in-wfst.html' title='Fillers in WFST'/><author><name>Nickolay V. 
Shmyrev</name><uri>http://www.blogger.com/profile/11220369315272283124</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://4.bp.blogspot.com/-mLPNJlEdMV8/TbZcF3qUEqI/AAAAAAAAAKA/6PXGkO0cmAk/s220/me.jpg'/></author><thr:total>4</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5075537609830514591.post-4295834893011912837</id><published>2011-03-03T19:56:00.002+03:00</published><updated>2011-03-17T00:56:54.685+03:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='wfst'/><category scheme='http://www.blogger.com/atom/ns#' term='sphinxtrain'/><title type='text'>Word position context dependency of Sphinxtrain and WFST</title><content type='html'>Interesting thing about Sphinxtrain models is that it uses word position as a context when looking for a senone for a particular word sequence. That means that in theory a senone for the start word phones could be different from senones for the middle-word phones and senones for the end-word phones. It's actually sometimes the case:&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;ZH  UW  ER b    n/a   48   4141   4143   4146 N&lt;br /&gt;ZH  UW  ER e    n/a   48   4141   4143   4146 N&lt;br /&gt;ZH  UW  ER i    n/a   48   4141   4143   4146 N&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;but &lt;br /&gt;&lt;br /&gt;&lt;pre&gt;AA  AE   F b    n/a    9    156    184    221 N&lt;br /&gt;AA  AE   F s    n/a    9    149    184    221 N&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;Here in the WSJ model definition from sphinx4 a symbol in a fourth column means "beginning", "end", "internal" or "single" and the other characters are transition matrix ids and senone ids.&lt;br /&gt;&lt;br /&gt;However, if you want to build WFST cascade from the model, it's kind of an issue how to embed the word position into context-dependent part of the cascade. My solution was to ignore position. 
You can ignore position in an already prebuilt model, since the differences caused by word position are small, but to do it consistently it's better to retrain a word-position-independent model.&lt;br /&gt;&lt;br /&gt;As of today you can do this easily: the mk_mdef_gen tool supports the -ignorewpos option, which you can set in the scripts. Basically, everything is counted as an internal triphone. My tests show that this model is not worse than the original one, at least for conversational speech. Enjoy.&lt;br /&gt;&lt;br /&gt;P.S. If you want to learn more about WFST, read Paul Dixon's blog &lt;a href="http://edobashira.com"&gt;http://edobashira.com&lt;/a&gt; and Josef Novak's blog &lt;a href="http://probablekettle.wordpress.com"&gt;http://probablekettle.wordpress.com&lt;/a&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5075537609830514591-4295834893011912837?l=nsh.nexiwave.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nsh.nexiwave.com/feeds/4295834893011912837/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://nsh.nexiwave.com/2011/03/word-position-context-dependency-of.html#comment-form' title='4 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/4295834893011912837'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/4295834893011912837'/><link rel='alternate' type='text/html' href='http://nsh.nexiwave.com/2011/03/word-position-context-dependency-of.html' title='Word position context dependency of Sphinxtrain and WFST'/><author><name>Nickolay V. 
Shmyrev</name><uri>http://www.blogger.com/profile/11220369315272283124</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://4.bp.blogspot.com/-mLPNJlEdMV8/TbZcF3qUEqI/AAAAAAAAAKA/6PXGkO0cmAk/s220/me.jpg'/></author><thr:total>4</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5075537609830514591.post-3560692434366119513</id><published>2011-02-21T22:29:00.002+03:00</published><updated>2011-03-17T00:55:41.181+03:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='wfst'/><title type='text'>Openfst troubleshooting</title><content type='html'>&lt;div dir="ltr" style="text-align: left;" trbidi="on"&gt;A bit of openfst troubleshooting when you try to build WFST with Juicer. Say you are running&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;fstcompose ${OUTLEXBFSM} ${OUTGRAMBFSM} | \&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;fstepsnormalize | \&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;fstdeterminize | \&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;fstencode --encode_labels - $CODEX | \&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;fstminimize - | \&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;fstencode --decode - $CODEX | \&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;fstpush --push_weights | \&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;fstarcsort&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;and get 
this&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;FATAL: StringWeight::Plus: unequal arguments (non-functional FST?)&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;Huh? Which arguments are not equal? What caused this? How to fix this? Definitely it should be more self-explaining. That's basically quite a common issue. You get just a short message that nobody including the author could understand. Go find out how to fix it.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;a name='more'&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;In this particular case you go to the openfst sources and change the following line:&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&amp;nbsp;&amp;nbsp;if (w1 != w2)&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp;LOG(FATAL) &amp;lt;&amp;lt; "StringWeight::Plus: unequal arguments "&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;lt;&amp;lt; "(non-functional FST?) " &amp;lt;&amp;lt; w1 &amp;lt;&amp;lt; " " &amp;lt;&amp;lt; w2;&lt;/span&gt;&lt;br /&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Wait another half an hour for it to compile (who decided to make it with pure templates!). See that it outputs arguments now at least. You run again and get&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;FATAL: StringWeight::Plus: unequal arguments (non-functional FST?) 
833_9 832_9&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Heh, also not very descriptive, but at least it's a hint. Looking at the output states 833 and 832, you see that they have identical pronunciations. That's it. Your dictionary shouldn't have words with identical pronunciations. Moreover, it shouldn't have identically pronounced trigrams. Things pronounced like "a b cd" vs "ab c d" make the WFST non-deterministic. Why didn't it warn about the issue when it converted the dictionary? Who knows. Anyway, now you can read about lexgen and find the option to fight identical pronunciations:&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&amp;nbsp;&amp;nbsp;-outputAuxPhones &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; -&amp;gt; indicates that auxiliary phones should be added to pronunciations in the lexicon in order to disambiguate distinct words with identical pronunciations&amp;nbsp;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;This option should make things better.&lt;br /&gt;&lt;br /&gt;I must admit CMUSphinx is also full of this: bad error messages which neither describe the problem nor hint at the solution. Compare that to the output of a recent Maven:&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;[ERROR] No goals have been specified for this build. You must specify a valid lifecycle phase or a goal in the format &amp;lt;plugin-prefix&amp;gt;:&amp;lt;goal&amp;gt; or &amp;lt;plugin-group-id&amp;gt;:&amp;lt;plugin-artifact-id&amp;gt;[:&amp;lt;plugin-version&amp;gt;]:&amp;lt;goal&amp;gt;. 
Available lifecycle phases are: validate, initialize, generate-sources, process-sources, generate-resources, process-resources, compile, process-classes, generate-test-sources, process-test-sources, generate-test-resources, process-test-resources, test-compile, process-test-classes, test, prepare-package, package, pre-integration-test, integration-test, post-integration-test, verify, install, deploy, pre-clean, clean, post-clean, pre-site, site, post-site, site-deploy. -&amp;gt; [Help 1]&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;[ERROR]&amp;nbsp;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;[ERROR] Re-run Maven using the -X switch to enable full debug logging.&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;[ERROR]&amp;nbsp;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;[ERROR] For more information about the errors and possible solutions, please read the following articles:&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/NoGoalSpecifiedException&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;br /&gt;&lt;br /&gt;Maybe it's too verbose, but I think it's the right way to do it. So if you see something that is not clear in CMUSphinx, please report it. 
We'll happily fix it.&lt;br /&gt;&lt;br /&gt;Coming up next: what to do when openfst hangs or takes all your memory.&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5075537609830514591-3560692434366119513?l=nsh.nexiwave.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nsh.nexiwave.com/feeds/3560692434366119513/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://nsh.nexiwave.com/2011/02/openfst-troubleshooting.html#comment-form' title='4 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/3560692434366119513'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/3560692434366119513'/><link rel='alternate' type='text/html' href='http://nsh.nexiwave.com/2011/02/openfst-troubleshooting.html' title='Openfst troubleshooting'/><author><name>Nickolay V. Shmyrev</name><uri>http://www.blogger.com/profile/11220369315272283124</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://4.bp.blogspot.com/-mLPNJlEdMV8/TbZcF3qUEqI/AAAAAAAAAKA/6PXGkO0cmAk/s220/me.jpg'/></author><thr:total>4</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5075537609830514591.post-5066309718680779587</id><published>2011-02-18T00:13:00.001+03:00</published><updated>2011-02-18T00:14:24.462+03:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='speech recognition'/><title type='text'>Looking on the waves</title><content type='html'>&lt;div dir="ltr" style="text-align: left;" trbidi="on"&gt;Here is the question: a perfect-looking sound file that is transcribed with 10% accuracy. Sounds crazy, doesn't it? Click on it to enlarge. 
No noise, no accent.&lt;br /&gt;&lt;br /&gt;&lt;a href="http://4.bp.blogspot.com/-t2JKdIHfHEk/TV2IfHksFGI/AAAAAAAAAJc/F14XEyj_3nY/s1600/waveform.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="280" src="http://4.bp.blogspot.com/-t2JKdIHfHEk/TV2IfHksFGI/AAAAAAAAAJc/F14XEyj_3nY/s400/waveform.png" width="400" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;Because of that I'm looking at the state of the art in channel normalization, especially for non-linear channel distortions. No good solution yet; I've only found a description of the problem in a very old paper:&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;a href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.62.1803&amp;amp;rep=rep1&amp;amp;type=pdf"&gt;SOURCES OF DEGRADATION OF SPEECH RECOGNITION&amp;nbsp;IN THE TELEPHONE NETWORK Pedro J. Moreno and Richard M. Stern From the Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Adelaide, Australia, Vol. I., pp. 109 - 112, April, 1994.&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;There is CDCN normalization, a few CMN improvements, RASTA and even the recently invented HN normalization. CDCN is surprisingly available in Sphinxtrain, but nobody uses it. Well, it gives no improvement, but it's an interesting approach worth documenting one day. 
The idea of collecting statistics from the speech and applying them later sounds nice.&lt;br /&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://2.bp.blogspot.com/-vsIJ3vQZWeg/TV2HsuBImBI/AAAAAAAAAJU/ouPrZgZ7gzs/s1600/a.jpg" imageanchor="1" style="clear: right; display: inline !important; float: right; margin-bottom: 1em; margin-left: 1em;"&gt;&lt;img border="0" height="240" src="http://2.bp.blogspot.com/-vsIJ3vQZWeg/TV2HsuBImBI/AAAAAAAAAJU/ouPrZgZ7gzs/s320/a.jpg" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;There are model-level approaches, various feature transforms and adaptations. They do not really look that attractive. Most papers now deal with channel compensation for speaker recognition, not speech recognition. I must admit the topic is too large to survey in a few weeks.&lt;br /&gt;&lt;br /&gt;Luckily, I can also spend time looking at waves like the one on the right. Somewhat more pleasant, I would say.&lt;br /&gt;&lt;div dir="ltr" style="text-align: left;" trbidi="on"&gt;&lt;br /&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5075537609830514591-5066309718680779587?l=nsh.nexiwave.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nsh.nexiwave.com/feeds/5066309718680779587/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://nsh.nexiwave.com/2011/02/looking-on-waves.html#comment-form' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/5066309718680779587'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/5066309718680779587'/><link rel='alternate' type='text/html' href='http://nsh.nexiwave.com/2011/02/looking-on-waves.html' title='Looking on the waves'/><author><name>Nickolay V. 
Shmyrev</name><uri>http://www.blogger.com/profile/11220369315272283124</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://4.bp.blogspot.com/-mLPNJlEdMV8/TbZcF3qUEqI/AAAAAAAAAKA/6PXGkO0cmAk/s220/me.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://4.bp.blogspot.com/-t2JKdIHfHEk/TV2IfHksFGI/AAAAAAAAAJc/F14XEyj_3nY/s72-c/waveform.png' height='72' width='72'/><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5075537609830514591.post-1952966137117883179</id><published>2011-01-15T01:05:00.001+03:00</published><updated>2011-01-15T01:08:56.055+03:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='sphinxtrain'/><title type='text'>Some more optimization</title><content type='html'>In addition to the previous post, here are two more tricks for log_diag_eval.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Floats instead of doubles&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;If the accumulator is a float, SSE can be used more effectively, since the loop no longer has to convert every value to double.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Hardcode vector length&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;The most common optimization is loop unrolling. It helps to optimize memory access and eliminates jump instructions. But the issue here is that the number of iterations in log_diag_eval can differ between stages. GCC has an interesting profile-based optimization for this case; see the -fprofile-generate option. You run the instrumented program, and GCC can then derive a few specific optimizations from the runtime profile. The good point is that we can be almost sure about the usage patterns of our target loop, so we can optimize without profiling. 
So, turn&lt;br /&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;br /&gt;for (i=0;i&amp;lt;veclen;i++) {&lt;br /&gt;&amp;nbsp;&amp;nbsp; do work&lt;br /&gt;}&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;to&lt;br /&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;if (veclen == 40) { // Commonly used value, 40 floats in each frame&lt;br /&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp;for (i=0;i&amp;lt;40;i++) {&lt;br /&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;do work // This will be unrolled&lt;br /&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp;}&lt;br /&gt;} else {&lt;br /&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp;for (i=0;i&amp;lt;veclen;i++) {&lt;br /&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;do work&lt;br /&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp;}&lt;br /&gt;}&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;GCC does the same trick with a profile, but since our feature frame size is fixed, we can hardcode it. 
As a result GCC will unroll the first loop and it will be as fast as the wind.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5075537609830514591-1952966137117883179?l=nsh.nexiwave.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nsh.nexiwave.com/feeds/1952966137117883179/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://nsh.nexiwave.com/2011/01/some-more-optimization.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/1952966137117883179'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/1952966137117883179'/><link rel='alternate' type='text/html' href='http://nsh.nexiwave.com/2011/01/some-more-optimization.html' title='Some more optimization'/><author><name>Nickolay V. Shmyrev</name><uri>http://www.blogger.com/profile/11220369315272283124</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://4.bp.blogspot.com/-mLPNJlEdMV8/TbZcF3qUEqI/AAAAAAAAAKA/6PXGkO0cmAk/s220/me.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5075537609830514591.post-7290467349205044799</id><published>2011-01-13T00:48:00.003+03:00</published><updated>2011-01-13T03:57:16.331+03:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='sphinxtrain'/><title type='text'>Optimization in SphinxTrain</title><content type='html'>I spend quite a significant amount of time training various models. It feels like alchemy: you add this and tune that and you get nice results. And while training you can read Twitter ;) I've also spent 10 years in a group that creates optimizing compilers, so in theory I should know a lot about them. I rarely apply that in practice though. 
But when you are bored after several weeks of training, you can apply some of that knowledge here. &lt;br /&gt;&lt;br /&gt;&lt;a name='more'&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;So the algorithm:&lt;br /&gt;&lt;br /&gt;1) Train a model for a month and become bored&lt;br /&gt;2) Get the idea that SphinxTrain is compiled without enough optimization&lt;br /&gt;3) Go to SphinxTrain/config and change the compilation option from -O2 to -O3&lt;br /&gt;4) Measure the run time of a simple bw run with the time command&lt;br /&gt;5) See that the time doesn't really change&lt;br /&gt;6) Add the -pg option to CFLAGS and LDFLAGS to collect a profile&lt;br /&gt;7) See that most of the time is spent in the log_diag_eval function, which is a simple weighted dot product computation&lt;br /&gt;8) Look at the assembler code of log_diag_eval&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;0x42c3b0 log_diag_eval: unpcklps %xmm0,%xmm0&lt;br /&gt;0x42c3b3 log_diag_eval+3: test   %ecx,%ecx&lt;br /&gt;0x42c3b5 log_diag_eval+5: cvtps2pd %xmm0,%xmm0&lt;br /&gt;0x42c3b8 log_diag_eval+8: je     0x42c3fd log_diag_eval+77&lt;br /&gt;0x42c3ba log_diag_eval+10: sub    $0x1,%ecx&lt;br /&gt;0x42c3bd log_diag_eval+13: xor    %eax,%eax&lt;br /&gt;0x42c3bf log_diag_eval+15: lea    0x4(,%rcx,4),%rcx&lt;br /&gt;0x42c3c7 log_diag_eval+23: nopw   0x0(%rax,%rax,1)&lt;br /&gt;0x42c3d0 log_diag_eval+32: movss  (%rdi,%rax,1),%xmm1&lt;br /&gt;0x42c3d5 log_diag_eval+37: subss  (%rsi,%rax,1),%xmm1&lt;br /&gt;0x42c3da log_diag_eval+42: unpcklps %xmm1,%xmm1&lt;br /&gt;0x42c3dd log_diag_eval+45: cvtps2pd %xmm1,%xmm2&lt;br /&gt;0x42c3e0 log_diag_eval+48: movss  (%rdx,%rax,1),%xmm1&lt;br /&gt;0x42c3e5 log_diag_eval+53: add    $0x4,%rax&lt;br /&gt;0x42c3e9 log_diag_eval+57: cmp    %rcx,%rax&lt;br /&gt;0x42c3ec log_diag_eval+60: cvtps2pd %xmm1,%xmm1&lt;br /&gt;0x42c3ef log_diag_eval+63: mulsd  %xmm2,%xmm1&lt;br /&gt;0x42c3f3 log_diag_eval+67: mulsd  %xmm2,%xmm1&lt;br /&gt;0x42c3f7 log_diag_eval+71: subsd  %xmm1,%xmm0&lt;br /&gt;0x42c3fb log_diag_eval+75: jne    0x42c3d0 
log_diag_eval+32&lt;br /&gt;0x42c3fd log_diag_eval+77: repz retq &lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;9) Understand that it's not really as good here as it can be&lt;br /&gt;&lt;br /&gt;10) Run &lt;br /&gt;&lt;br /&gt;&lt;pre&gt;gcc -DPACKAGE_NAME=\"SphinxTrain\" -DPACKAGE_TARNAME=\"sphinxtrain\" \&lt;br /&gt;-DPACKAGE_VERSION=\"1.0.99\" -DPACKAGE_STRING=\"SphinxTrain\ 1.0.99\" \&lt;br /&gt;-DPACKAGE_BUGREPORT=\"\" -DSTDC_HEADERS=1 -DHAVE_SYS_TYPES_H=1 -DHAVE_SYS_STAT_H=1 \&lt;br /&gt;-DHAVE_STDLIB_H=1 -DHAVE_STRING_H=1 -DHAVE_MEMORY_H=1 -DHAVE_STRINGS_H=1 \&lt;br /&gt;-DHAVE_INTTYPES_H=1 -DHAVE_STDINT_H=1 -DHAVE_UNISTD_H=1 -DHAVE_LIBM=1 \&lt;br /&gt;-I/home/nshmyrev/SphinxTrain/../sphinxbase/include \&lt;br /&gt;-I/home/nshmyrev/SphinxTrain/../sphinxbase/include   -I../../../include -O3 \&lt;br /&gt;-g -Wall -fPIC -DPIC -c gauden.c -o obj.x86_64-unknown-linux-gnu/gauden.o \&lt;br /&gt;-ftree-vectorizer-verbose=2&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;to see that the log_diag_eval loop isn't vectorized&lt;br /&gt;&lt;br /&gt;11) Add -ffast-math and see that it doesn't help&lt;br /&gt;&lt;br /&gt;12) Rewrite the function from&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;float64&lt;br /&gt;log_diag_eval(vector_t obs,&lt;br /&gt;    float32 norm,&lt;br /&gt;    vector_t mean,&lt;br /&gt;    vector_t var_fact,&lt;br /&gt;    uint32 veclen)&lt;br /&gt;{&lt;br /&gt;    float64 d, diff;&lt;br /&gt;    uint32 l;&lt;br /&gt;&lt;br /&gt;    d = norm;   /* log (1 / 2 pi |sigma^2|) */&lt;br /&gt;&lt;br /&gt;    for (l = 0; l &amp;lt; veclen; l++) {&lt;br /&gt;        diff = obs[l] - mean[l];&lt;br /&gt;        d -= var_fact[l] * diff * diff; /* compute -1 / (2 sigma ^2) * (x - m) ^ 2 terms */&lt;br /&gt;    }&lt;br /&gt;&lt;br /&gt;    return d;&lt;br /&gt;}&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;to&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;float64&lt;br /&gt;log_diag_eval(vector_t obs,&lt;br /&gt;    float32 norm,&lt;br /&gt;    vector_t mean,&lt;br /&gt;    vector_t var_fact,&lt;br /&gt;    uint32 veclen)&lt;br /&gt;{&lt;br /&gt;    
float64 d, diff;&lt;br /&gt;    uint32 l;&lt;br /&gt;&lt;br /&gt;    d = 0.0;&lt;br /&gt;&lt;br /&gt;    for (l = 0; l &amp;lt; veclen; l++) {&lt;br /&gt;        diff = obs[l] - mean[l];&lt;br /&gt;        d += var_fact[l] * diff * diff; /* accumulate 1 / (2 sigma ^2) * (x - m) ^ 2 terms */&lt;br /&gt;    }&lt;br /&gt;&lt;br /&gt;    return norm - d;    /* norm is log (1 / 2 pi |sigma^2|) */&lt;br /&gt;}&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;to turn the subtraction, which hurts vectorization, into an accumulation. &lt;br /&gt;&lt;br /&gt;13) See that the loop is now vectorized. Enjoy the speed!!!&lt;br /&gt;&lt;br /&gt;The key thing to understand here is that programming is rather flexible and compilers are rather dumb. But you have to cooperate. So you need to use very simple constructs to let the compiler do its work. Moreover, this idea of using simple constructs in the code has other benefits, since it helps to keep the code style clean and enables automated static analysis with tools like splint.&lt;br /&gt;&lt;br /&gt;Maybe the same applies to speech recognition. We need to help computers in their efforts to understand us. 
Speak slowly and articulate clearly, and both we and the computers will enjoy the result.&lt;br /&gt;&lt;br /&gt;If you are interested in loop vectorization in GCC, see &lt;a href="http://gcc.gnu.org/projects/tree-ssa/vectorization.html"&gt;http://gcc.gnu.org/projects/tree-ssa/vectorization.html&lt;/a&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5075537609830514591-7290467349205044799?l=nsh.nexiwave.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nsh.nexiwave.com/feeds/7290467349205044799/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://nsh.nexiwave.com/2011/01/optimization-in-sphinxtrain.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/7290467349205044799'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/7290467349205044799'/><link rel='alternate' type='text/html' href='http://nsh.nexiwave.com/2011/01/optimization-in-sphinxtrain.html' title='Optimization in SphinxTrain'/><author><name>Nickolay V. 
Shmyrev</name><uri>http://www.blogger.com/profile/11220369315272283124</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://4.bp.blogspot.com/-mLPNJlEdMV8/TbZcF3qUEqI/AAAAAAAAAKA/6PXGkO0cmAk/s220/me.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5075537609830514591.post-1279628397926461672</id><published>2010-12-22T01:59:00.000+03:00</published><updated>2010-12-22T01:59:28.587+03:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='mary'/><category scheme='http://www.blogger.com/atom/ns#' term='TTS'/><title type='text'>Mary TTS 4.3.0 released</title><content type='html'>With a Russian voice from the Voxforge DB. Yay! Try it on the web:&lt;br /&gt;&lt;br /&gt;&lt;a href="http://mary.dfki.de:59125/"&gt;http://mary.dfki.de:59125/&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;Mary is definitely a superior system compared to Festival. A graphical UI, a modern design and support for various things like automatic dictionary creation make it really easy to build support for a language. And thanks to the modular and stable codebase, one can easily add support for a new feature, integrate with an external package as was done with OpenNLP, or just fix a bug. And your fix will be accepted!&lt;br /&gt;&lt;br /&gt;There are two Voxforge TTS datasets pending, by the way, German and Dutch, and there is also a Polish voice. 
If anyone wants to try that, it should be really easy to add another language to Mary.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5075537609830514591-1279628397926461672?l=nsh.nexiwave.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nsh.nexiwave.com/feeds/1279628397926461672/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://nsh.nexiwave.com/2010/12/mary-tts-430-released.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/1279628397926461672'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/1279628397926461672'/><link rel='alternate' type='text/html' href='http://nsh.nexiwave.com/2010/12/mary-tts-430-released.html' title='Mary TTS 4.3.0 released'/><author><name>Nickolay V. Shmyrev</name><uri>http://www.blogger.com/profile/11220369315272283124</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://4.bp.blogspot.com/-mLPNJlEdMV8/TbZcF3qUEqI/AAAAAAAAAKA/6PXGkO0cmAk/s220/me.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5075537609830514591.post-3289795331871534417</id><published>2010-12-15T15:26:00.000+03:00</published><updated>2010-12-15T15:26:59.369+03:00</updated><title type='text'>Phoneset with stress</title><content type='html'>So I finally finished testing the stress-aware model. It took me a few months, and in the end I can say that lexical stress is definitely better. It provides better accuracy and, more importantly, more robustness than a model with a non-stressed phoneset.&lt;br /&gt;&lt;br /&gt;I hope we will retrain all the other models we have with the stressed phoneset. 
It's great that CMUDict provides enough information to do that. The story of how I tested it was quite interesting. I believed in stress for a long time but wasn't able to prove it. In theory it's clear why it helps: when the speech rate changes, stressed syllables remain less corrupted than unstressed ones, and we get better control over the data. Additional information like lexical stress is important. Of course, the issue is the increased number of model parameters to train. That's why, I think, early investigations concluded that a phoneset without stress is better. A discussion about it on cmusphinx-devel this summer also confirmed that Nuance moved to a model with stress in their automotive decoder.&lt;br /&gt;&lt;br /&gt;It's interesting how long I tested that. I made numerous attempts and each one had bugs:&lt;br /&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;The first attempt used bad features (adapted for 3gp) and didn't show any improvement&lt;/li&gt;&lt;li&gt;The number of senones in the second training was too small, since I didn't know the reason for the first failure&lt;/li&gt;&lt;li&gt;The third attempt had an issue with the automatic questions, which were accidentally used instead of the manual ones I wrote, and it went unnoticed&lt;/li&gt;&lt;li&gt;The fourth attempt was rejected because there were issues with the dictionary format in Sphinx4. Never use FastDictionary, by the way; use FullDictionary. FastDictionary expects a specific dictionary format with variants numbered like (2) (3) (4), and not (1), or (2) and (4).&lt;/li&gt;&lt;li&gt;Only the fifth attempt was good, but it showed improvement only on the big test set and not on the small one&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;So basically, to check every fact you need to be &lt;b&gt;very&lt;/b&gt; careful and double- or triple-check everything. Bugs are everywhere: in language model training, the decoder, the trainer, the configuration. From run to run, bugs can lead to different results; even a small change can break everything. 
I think the optimal way to do research could be to check the same proposition in independent teams using independent decoders and probably different data. Not sure if it's doable in the short term.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5075537609830514591-3289795331871534417?l=nsh.nexiwave.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nsh.nexiwave.com/feeds/3289795331871534417/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://nsh.nexiwave.com/2010/12/phoneset-with-stress.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/3289795331871534417'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/3289795331871534417'/><link rel='alternate' type='text/html' href='http://nsh.nexiwave.com/2010/12/phoneset-with-stress.html' title='Phoneset with stress'/><author><name>Nickolay V. Shmyrev</name><uri>http://www.blogger.com/profile/11220369315272283124</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://4.bp.blogspot.com/-mLPNJlEdMV8/TbZcF3qUEqI/AAAAAAAAAKA/6PXGkO0cmAk/s220/me.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5075537609830514591.post-9040559125380135929</id><published>2010-11-29T18:25:00.005+03:00</published><updated>2010-12-06T12:00:14.382+03:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='language models'/><title type='text'>On CMUCLMTK</title><content type='html'>I've rebuilt the Nexiwave language models and met some issues which would be nice to solve one day. 
The CMU language model toolkit is a nice, simple piece of software, but it definitely lacks many features which are required to build a good language model. So, thinking about the features a language modelling toolkit could provide, I created a list. &lt;br /&gt;&lt;br /&gt;&lt;a name='more'&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;None of the toolkits has all the features in place, which leads to unnecessary Java, Perl and Python coding every time I rebuild the model. It would be nice to see that list turned into software one day.&lt;br /&gt;&lt;br /&gt;Language is a live, constantly changing structure. It's very interesting that in 2006 nobody used words like "twitter", "obama", "facebook". New words like "ipad" arrive every day and become very actively used. That makes it senseless to collect any static database to train the language model. While pronunciation and acoustics change slowly (well, not so slowly now: with Skype used everywhere, the acoustics of speech have changed very significantly over the last years!), the language model needs to be adapted to the current date very quickly. Another problem is that Sphinx4 is not very robust to unknown words. If an unknown word is met, it basically screws up the whole utterance around it. That's why it's important to have an up-to-date vocabulary. Maybe it can be a small dynamic language model combined with a huge static one, I'm not sure.&lt;br /&gt;&lt;br /&gt;It can be clearly seen that Gigaword doesn't have enough coverage of modern terms. Models like lm_giga are nice, but they only work for old books. We need something live.&lt;br /&gt;&lt;br /&gt;Another issue is where to find the texts. Unfortunately, really relevant transcriptions of spoken speech aren't available. Only companies which manually transcribe speech have them, I think. Every written text is very different from a spoken one.&lt;br /&gt;&lt;br /&gt;So we need to collect data in real time, from Twitter, Facebook, Google, Wikipedia, Youtube, from the net. 
We also need to be able to process this data, classify it, convert it to a spoken form and train the language model on the result. The issues here are that crawled text is often corrupted, has spelling errors and contains spam. Making it usable is a huge task.&lt;br /&gt;&lt;br /&gt;Google and Bing use a brute-force approach here: they just collect everything and hope it will be good enough. That can be seen in their n-gram data. Not sure if this approach is helpful for ASR.&lt;br /&gt;&lt;br /&gt;So the features I'd like to see:&lt;br /&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Automated detection and correction of spelling errors to get a unified written form. That includes automated expansion of abbreviations and numbers, for which you must have a good NLP component able to identify part of speech and other properties. What's interesting here is that the unified written form must be spelling-oriented. For example, "going to" and "gonna" are pronounced differently, while from the language point of view they are identical. I haven't decided what to do with "gonna" yet.&lt;/li&gt;&lt;li&gt;Automatic vocabulary selection. While in theory the decoder should operate with an unlimited vocabulary, in practice it's better to have a smaller one with good coverage. It's very important to be able to filter out common spelling errors here.&lt;/li&gt;&lt;li&gt;The toolkit should support crawling from major sources like Twitter, Wikipedia and others.&lt;/li&gt;&lt;li&gt;Though NLP is mentioned above, I think it shouldn't be used only for expansion. The toolkit should support many NLP features to be able to create a more complex language model than simple n-grams.&lt;/li&gt;&lt;/ul&gt;I think it's a myth that n-gram models describe language well. The language model is not effective enough at rejecting transcriptions that aren't possible in the language at all. 
We already changed the decoder to penalize trigrams which aren't common in the language model and require backoff, but this change appeared to be not effective enough.&lt;br /&gt;&lt;br /&gt;There is a nice idea of discriminative correction in the Joshua machine translation toolkit, for example:&lt;br /&gt;&lt;br /&gt;&lt;a href="http://www.cs.jhu.edu/%7Ezfli/pubs/discriminative_lm_for_smt_zhifei_amta_08.pdf"&gt;Large-scale Discriminative n-gram Language Models for Statistical Machine Translation&lt;br /&gt;Zhifei Li and Sanjeev Khudanpur&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;The paper has quite an interesting problem statement, to correct decoding results which are not possible in the language, but a solution that gives only 5% improvement is certainly not worth attention. We need to embed language knowledge deep in the search.&lt;br /&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;The toolkit must be designed with parallel hardware in mind. Everything is getting distributed, and with current information volumes it's a hard requirement to be able to process data in parallel.&lt;/li&gt;&lt;/ul&gt;Quite a long list, to be honest. A few years of coding on top of cmuclmtk. 
It needs to be done anyway.&lt;br /&gt;&lt;br /&gt;Paper on subject:&lt;br /&gt;&lt;br /&gt;&lt;a href="http://dx.doi.org/10.1145/1322391.1322392"&gt;Ivan Bulyko, Mari Ostendorf, Manhung Siu, Tim Ng, Andreas Stolcke Web resources for language modeling in conversational speech recognition&lt;/a&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5075537609830514591-9040559125380135929?l=nsh.nexiwave.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nsh.nexiwave.com/feeds/9040559125380135929/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://nsh.nexiwave.com/2010/11/on-cmuclmtk.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/9040559125380135929'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/9040559125380135929'/><link rel='alternate' type='text/html' href='http://nsh.nexiwave.com/2010/11/on-cmuclmtk.html' title='On CMUCLMTK'/><author><name>Nickolay V. 
Shmyrev</name><uri>http://www.blogger.com/profile/11220369315272283124</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://4.bp.blogspot.com/-mLPNJlEdMV8/TbZcF3qUEqI/AAAAAAAAAKA/6PXGkO0cmAk/s220/me.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5075537609830514591.post-6514745359735837007</id><published>2010-11-27T17:44:00.003+03:00</published><updated>2010-11-29T18:44:22.988+03:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='sphinxtrain'/><category scheme='http://www.blogger.com/atom/ns#' term='speech recognition'/><title type='text'>Decoding of Compressed Low-Bitrate Speech</title><content type='html'>&lt;div&gt;I've spent some time optimizing accuracy for 3gp speech recordings from mobile phones. 3gp is a container format used on most mobile devices nowadays, with speech compressed using &lt;a href="http://en.wikipedia.org/wiki/Adaptive_Multi-Rate_audio_codec"&gt;AMR-NB&lt;/a&gt; inside. I converted audio to AMR-NB and back, extracted PLP features and then trained a few models on that. The result is not encouraging - accuracy is worse than the stock model both on original and on compressed/decompressed audio. Not much worse, but significantly worse.&lt;/div&gt;&lt;br /&gt;It looks like traditional HMM issues such as the frame independence assumption play a role here, which is confirmed by the papers I found. 
This paper is quite useful for example:&lt;br /&gt;&lt;br /&gt;&lt;a href="http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=5273968"&gt;Vladimir Fabregas Surigué de Alencar and Abraham Alcaim. On the Performance of ITU-T G.723.1 and AMR-NB Codecs for Large Vocabulary Distributed Speech Recognition in Brazilian Portuguese&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;And this paper is good too:&lt;br /&gt;&lt;br /&gt;&lt;a href="http://www.lrec-conf.org/proceedings/lrec2010/pdf/285_Paper.pdf"&gt;Patrick Bauer, David Scheler, Tim Fingscheidt. WTIMIT: The TIMIT Speech Corpus Transmitted Over the 3G AMR Wideband Mobile Network&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;I need to research the subject more. Surprisingly, there are only a few papers on it, way fewer than on reverberation. It looks like we have to build a specialized frontend specifically targeted at decoding low-bitrate compressed speech. Or we need to move to features more robust than PLP.&lt;br /&gt;&lt;br /&gt;For now I would state the problem as developing a speech recognition framework that provides good accuracy on:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Unmodified speech&lt;/li&gt;&lt;li&gt;Noise-corrupted speech&lt;/li&gt;&lt;li&gt;Music-corrupted speech&lt;/li&gt;&lt;li&gt;Codec-corrupted speech&lt;/li&gt;&lt;li&gt;Long-distance speech&lt;/li&gt;&lt;/ul&gt;&lt;div&gt;A good system should decode well in all cases.&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5075537609830514591-6514745359735837007?l=nsh.nexiwave.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nsh.nexiwave.com/feeds/6514745359735837007/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://nsh.nexiwave.com/2010/11/decoding-of-compressed-low-bitrate.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' 
href='http://www.blogger.com/feeds/5075537609830514591/posts/default/6514745359735837007'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/6514745359735837007'/><link rel='alternate' type='text/html' href='http://nsh.nexiwave.com/2010/11/decoding-of-compressed-low-bitrate.html' title='Decoding of Compressed Low-Bitrate Speech'/><author><name>Nickolay V. Shmyrev</name><uri>http://www.blogger.com/profile/11220369315272283124</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://4.bp.blogspot.com/-mLPNJlEdMV8/TbZcF3qUEqI/AAAAAAAAAKA/6PXGkO0cmAk/s220/me.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5075537609830514591.post-8776963822228200008</id><published>2010-11-23T01:07:00.000+03:00</published><updated>2010-11-23T01:07:55.564+03:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='sphinxtrain'/><title type='text'>Updates in SphinxTrain</title><content type='html'>Being tired to explain build issues over and over I found the passion to step in and start a sequence of major changes in SphinxTrain&lt;br /&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Ported sphinxtrain to automake, development branch you can try is here:&amp;nbsp;&lt;a href="https://cmusphinx.svn.sourceforge.net/svnroot/cmusphinx/branches/sphinxtrain-automake"&gt;https://cmusphinx.svn.sourceforge.net/svnroot/cmusphinx/branches/sphinxtrain-automake&lt;/a&gt;&lt;/li&gt;&lt;li&gt;Will increase SphinxTrain dependency on sphinxbase, unifying the duplicated sources.&lt;/li&gt;&lt;li&gt;Will make training use external SphinxTrain installation, no setup in training folder will be required, only configuration. All scripts will be in share and in libdir, they will be installed systemwide. 
To try a new version one will just need to change the path to sphinxtrain.&lt;/li&gt;&lt;li&gt;Will modify scripts to be able to build and test the database using a single command. No possibility to miss anything!&lt;/li&gt;&lt;li&gt;Will include automation for language weight optimization on a development set, so better model training scripts will do everything required.&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;I know Autotools aren't the best build system, but they are pretty straightforward. More importantly, the tools will follow common Unix practices and thus will be easier to use and understand.&lt;br /&gt;&lt;br /&gt;Comments are welcome!&lt;br /&gt;&lt;br /&gt;P.S.&lt;br /&gt;&lt;br /&gt;We've made great progress on &lt;a href="http://nexiwave.com/"&gt;Nexiwave&lt;/a&gt; too. Check it &lt;a href="http://nexiwave.com/index.php/news"&gt;out&lt;/a&gt;!&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5075537609830514591-8776963822228200008?l=nsh.nexiwave.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nsh.nexiwave.com/feeds/8776963822228200008/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://nsh.nexiwave.com/2010/11/updates-in-sphinxtrain.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/8776963822228200008'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/8776963822228200008'/><link rel='alternate' type='text/html' href='http://nsh.nexiwave.com/2010/11/updates-in-sphinxtrain.html' title='Updates in SphinxTrain'/><author><name>Nickolay V. 
Shmyrev</name><uri>http://www.blogger.com/profile/11220369315272283124</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://4.bp.blogspot.com/-mLPNJlEdMV8/TbZcF3qUEqI/AAAAAAAAAKA/6PXGkO0cmAk/s220/me.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5075537609830514591.post-2072514043726643248</id><published>2010-11-16T19:42:00.001+03:00</published><updated>2010-11-16T19:45:13.703+03:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='sphinx4'/><title type='text'>Backward Compatibility Issues</title><content type='html'>Just today I spent a few hours trying to figure out why the changed output of a new makeinfo version broke the binutils build. Well, it's an old bug, but we all get mad when backward compatibility breaks. Especially when it affects our software. Especially when we have neither time nor passion to fix it. My complaints rose to the roof, or probably even higher.&lt;br /&gt;&lt;br /&gt;Life is a strange thing. Right after that I went ahead and broke sphinx4 backward compatibility in model packaging (again!). Now the models distributed with sphinx4 follow the SphinxTrain output format: all files are in a single folder, the model definition is named simply "mdef" and there is a feat.params file. 
Things are very &lt;br /&gt;straightforward:&lt;br /&gt;&lt;br /&gt;&lt;div style="font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;"&gt;[shmyrev@gnome sphinx4]$ ls models/acoustic/wsj&lt;/div&gt;&lt;div style="font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;"&gt;dict         license.terms  means            noisedict  transition_matrices&lt;/div&gt;&lt;div style="font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;"&gt;feat.params  mdef           mixture_weights  README     variances&lt;/div&gt;&lt;br /&gt;It will certainly help to avoid confusion when new developers change the model, adapt the model or train their own one.&lt;br /&gt;&lt;br /&gt;In the future I hope to get feat.params used better in order to automatically build frontend, derive feature extraction properties, hold metadata about model and similar things. Shiny future is getting closer.&lt;br /&gt;&lt;br /&gt;I also removed RM1 model from the distribution. I don't think anybody is using it.&lt;br /&gt;&lt;br /&gt;So please don't complain, let's better fix that until it's too late to fix. One day we'll get everything in place and we'll release final version sphinx4-1.0. And after that we'll certainly be backward-compatible. I really like Java and Windows because of their long-term backward-compatible policy. 
We can do even better.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5075537609830514591-2072514043726643248?l=nsh.nexiwave.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nsh.nexiwave.com/feeds/2072514043726643248/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://nsh.nexiwave.com/2010/11/backward-compatibility-issues.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/2072514043726643248'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/2072514043726643248'/><link rel='alternate' type='text/html' href='http://nsh.nexiwave.com/2010/11/backward-compatibility-issues.html' title='Backward Compatibility Issues'/><author><name>Nickolay V. Shmyrev</name><uri>http://www.blogger.com/profile/11220369315272283124</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://4.bp.blogspot.com/-mLPNJlEdMV8/TbZcF3qUEqI/AAAAAAAAAKA/6PXGkO0cmAk/s220/me.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5075537609830514591.post-8072677265591980646</id><published>2010-10-27T16:43:00.001+04:00</published><updated>2010-10-27T16:51:22.830+04:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='speech recognition'/><category scheme='http://www.blogger.com/atom/ns#' term='business'/><title type='text'>Speech Like WWW</title><content type='html'>Talking about custom speech application development I've got a thought. There are quite many speech companies already. Speech application development is actually quite similar to UI design or to web design in sense that you need to be specialized expert in order to create speech interface. 
What if speech developers become like web designers, thousands of whom build customized websites all over the world every day? What if the market is so huge that it will be possible to run hundreds of shops, each working on customer needs?&lt;br /&gt;&lt;br /&gt;To be honest I don't quite like web development. It sounds very strange to me that you MUST pay at least $1000 to build something that looks pleasant. And for big websites it's way more. Whoever created this market didn't think about business, he designed HTML in order to drain money from small and big companies. I tried to create a few websites myself, for example the CMUSphinx website. Even with all the modern tools, CMS platforms, themes and stuff, the reality is that you need to be an expert. Otherwise the result will not be satisfactory enough. Menus will overlap, regions will not be aligned, pictures will be blurry and colors will not match. Can it be different? Certainly it can, but not in this world. I can understand that creativity can't be automated, but I can't understand why creativity is required for every company.&lt;br /&gt;&lt;br /&gt;There are some similar things in software development. For example, if you want to develop a telco app you probably want to hire an Asterisk developer. And there are thousands of Asterisk developers out there. Or if you want JBoss you could find JBoss experts. But I think if you know how to develop applications you can configure Asterisk properly and you can create a bean to access the database.&lt;br /&gt;&lt;br /&gt;Are we interested in the creation of a huge and diverse speech industry? Can CMUSphinx be the foundation for it? 
No definite answer for now.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5075537609830514591-8072677265591980646?l=nsh.nexiwave.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nsh.nexiwave.com/feeds/8072677265591980646/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://nsh.nexiwave.com/2010/10/speech-like-www.html#comment-form' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/8072677265591980646'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/8072677265591980646'/><link rel='alternate' type='text/html' href='http://nsh.nexiwave.com/2010/10/speech-like-www.html' title='Speech Like WWW'/><author><name>Nickolay V. Shmyrev</name><uri>http://www.blogger.com/profile/11220369315272283124</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://4.bp.blogspot.com/-mLPNJlEdMV8/TbZcF3qUEqI/AAAAAAAAAKA/6PXGkO0cmAk/s220/me.jpg'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5075537609830514591.post-3255853914420658151</id><published>2010-10-22T02:54:00.001+04:00</published><updated>2010-10-22T02:55:40.661+04:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='cmusphinx'/><title type='text'>Recent issues</title><content type='html'>Heh, this month I discovered few critical issues in CMUSphinx.&lt;br /&gt;&lt;ol&gt;&lt;li&gt;Pocketsphinx doesn't properly decode short silences in FSG/JSGF mode&lt;/li&gt;&lt;li&gt;Sphinx4 doesn't really work with OOV loop in grammar&lt;/li&gt;&lt;li&gt;Pocketsphinx n-best lists are useless because of too many repeated entries&lt;/li&gt;&lt;li&gt;Pocketsphinx accuracy is way lower than sphinx3 
one&lt;/li&gt;&lt;li&gt;Supposedly-working sphinxbase LM stuff doesn't work with 32-bit DMP, thus no MMIE training for very large vocabulary&lt;/li&gt;&lt;li&gt;MMIE itself doesn't improve accuracy (tested on Voxforge and Fisher)&lt;/li&gt;&lt;li&gt;It's impossible to extract mixture_weights from recent sendumps in pocketsphinx models, python scripts in SphinxTrain are outdated&lt;/li&gt;&lt;li&gt;PTM model adaptation doesn't work&lt;/li&gt;&lt;li&gt;TextAligner demo from sphinx4 requires way more work to align properly&lt;/li&gt;&lt;/ol&gt;&lt;br /&gt;That's getting crazy, I wonder if I'll be able to find the time to fix all that.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5075537609830514591-3255853914420658151?l=nsh.nexiwave.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nsh.nexiwave.com/feeds/3255853914420658151/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://nsh.nexiwave.com/2010/10/recent-issues.html#comment-form' title='11 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/3255853914420658151'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/3255853914420658151'/><link rel='alternate' type='text/html' href='http://nsh.nexiwave.com/2010/10/recent-issues.html' title='Recent issues'/><author><name>Nickolay V. 
Shmyrev</name><uri>http://www.blogger.com/profile/11220369315272283124</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://4.bp.blogspot.com/-mLPNJlEdMV8/TbZcF3qUEqI/AAAAAAAAAKA/6PXGkO0cmAk/s220/me.jpg'/></author><thr:total>11</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5075537609830514591.post-1178531958842044669</id><published>2010-10-08T00:34:00.000+04:00</published><updated>2010-10-08T00:34:24.486+04:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='speech recognition'/><title type='text'>Do You Want To Talk To Your Computer?</title><content type='html'>Thanks to everyone who voted at&lt;br /&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://2.bp.blogspot.com/_p33_0koWXHA/TK4rdwaPqgI/AAAAAAAAAH4/Sev7zqoFTcc/s1600/voting-application.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="91" src="http://2.bp.blogspot.com/_p33_0koWXHA/TK4rdwaPqgI/AAAAAAAAAH4/Sev7zqoFTcc/s640/voting-application.png" width="640" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;To be honest I was surprised, because my opinion is just the reverse of this result. I strongly disagree that command and control will ever be usable. Dictation probably will, but definitely not command and control. I have a hobby - collecting complaints about voice control. Here are a few:&lt;br /&gt;&lt;br /&gt;As an article in &lt;a href="http://www.pcworld.com/article/202729/rim_adobe_and_microsoft_grasping_at_mobile_straws.html?tk=hp_new"&gt;PCWorld&lt;/a&gt; says:&lt;br /&gt;&lt;br /&gt;&lt;blockquote&gt;If so, you'll love what Microsoft is offering: voice recognition over the air, in which your commands are processed by a server in the clouds and converted into action on your smartphone. Boy, let's burn up those minutes and data plans! And &lt;em&gt;waaait&lt;/em&gt; for the slow, usually incorrect response. 
Android has a similar capability for search, and it's amazingly frustrating to use, not to mention inaccurate.&lt;br /&gt;&lt;br /&gt;The one good thing about Microsoft's fantasy about voice-command interfaces: You'll be able to identify a Windows Phone 7 user easily. Just listen for the person pleading with the phone to do what he asked. While the rest of us are quietly computing and communicating, he'll be hard to miss.&lt;/blockquote&gt;&lt;br /&gt;Another post from &lt;a href="http://news.cnet.com/8618-30684_3-20015118.html?communityId=2139&amp;amp;targetCommunityId=2139&amp;amp;blogId=265&amp;amp;messageId=9802471"&gt;CNET&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;blockquote&gt;One of the major reasons why speech recognition has not caught on or been seriously looked at in terms of major finances is because people, if given the option of accurate speech recognition too, would still not wanna go for the voice commands, but would rather just use touchscreen. This is because voicing out takes more energy off a person than smoothly running yer fingers on the screen in your hand. Intuitive touchscreens and cleaner interfaces are far better a tool to invest in than making people accustomed to say out words that computers need to understand, process and then implement. Its way easier for the user (in terms of energy used to say it out) to just press the button on touchscreen. There will be certain exceptions, but I'm talking on a mass consumer adoption assumption.&lt;/blockquote&gt;&lt;br /&gt;I strongly believe that when we want to communicate with a computer, there are better ways than giving it voice commands. Yes, speech is a natural way of communication, but it's communication between people. When you communicate with a machine you don't necessarily need to speak to it, there are more efficient ways. Even if you are driving.&lt;br /&gt;&lt;br /&gt;On the other hand, I think that analytics, speech mining and similar stuff do have a very bright future. 
According to &lt;a href="http://www.speechtechmag.com/Articles/News/News-Feature/DMG-Consulting-Report-Speech-Analytics-Continues-To-Grow-Rapidly-58213.aspx"&gt;DMG consulting&lt;/a&gt; the growth of this market will be 42 percent in 2011. That's true potential. Speech recognition should seamlessly plug into communication between people and extract value from it. Being non-intrusive, it doesn't break patterns but helps to create information. That's why we invest so much into mining and not into command and control. That's also the reason I don't want to invest too much time in gnome-voice-control.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5075537609830514591-1178531958842044669?l=nsh.nexiwave.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nsh.nexiwave.com/feeds/1178531958842044669/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://nsh.nexiwave.com/2010/10/do-you-want-to-talk-to-your-computer.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/1178531958842044669'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/1178531958842044669'/><link rel='alternate' type='text/html' href='http://nsh.nexiwave.com/2010/10/do-you-want-to-talk-to-your-computer.html' title='Do You Want To Talk To Your Computer?'/><author><name>Nickolay V. 
Shmyrev</name><uri>http://www.blogger.com/profile/11220369315272283124</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://4.bp.blogspot.com/-mLPNJlEdMV8/TbZcF3qUEqI/AAAAAAAAAKA/6PXGkO0cmAk/s220/me.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://2.bp.blogspot.com/_p33_0koWXHA/TK4rdwaPqgI/AAAAAAAAAH4/Sev7zqoFTcc/s72-c/voting-application.png' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5075537609830514591.post-1608442213687249340</id><published>2010-10-03T00:35:00.000+04:00</published><updated>2010-10-03T00:35:40.661+04:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='speech recognition'/><title type='text'>No More Word Error Rate</title><content type='html'>Reading &lt;a href="http://delong.typepad.com/sdj/2010/09/when-speech-recognition-software-attacks.html"&gt;http://delong.typepad.com/sdj/2010/09/when-speech-recognition-software-attacks.html&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;blockquote&gt;What he said:&lt;br /&gt;&lt;br /&gt;Hi Brad, it's Mike. I had a lunchtime appointment go long and I am bolting back to Evans. I'll be there shortly. See you soon. Thanks.&lt;br /&gt;&lt;br /&gt;What Google Voice heard:&lt;br /&gt;&lt;br /&gt;That it's mike. I had a list of women go a long and I am old thing. Back evidence. I'll be there for me to you soon. 
Thanks.&lt;br /&gt;&lt;br /&gt;The interesting thing is that it got 17 out of the 26 words right--but those 17 words convey almost none of the information in the message...&lt;/blockquote&gt;&lt;br /&gt;I found this paper&lt;br /&gt;&lt;br /&gt;&lt;a href="http://research.microsoft.com/apps/pubs/default.aspx?id=75332"&gt;Is Word Error Rate a Good Indicator for Spoken Language Understanding Accuracy&lt;/a&gt;&lt;br /&gt;Ye-Yi Wang and Alex Acero&lt;br /&gt;2003&lt;br /&gt;&lt;br /&gt;&lt;blockquote&gt;It is a conventional wisdom in the speech community that better speech recognition accuracy is a good indicator for better spoken language understanding accuracy, given a fixed understanding component. The findings in this work reveal that this is not always the case. More important than word error rate reduction, the language model for recognition should be trained to match the optimization objective for understanding. In this work, we applied a spoken language understanding model as the language model in speech recognition. The model was obtained with an example-based learning algorithm that optimized the understanding accuracy. 
Although the speech recognition word error rate is 46% higher than the trigram model, the overall slot understanding error can be reduced by as much as 17%.&lt;/blockquote&gt;&lt;br /&gt;We definitely need to address it in sphinx4.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5075537609830514591-1608442213687249340?l=nsh.nexiwave.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nsh.nexiwave.com/feeds/1608442213687249340/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://nsh.nexiwave.com/2010/10/no-more-word-error-rate.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/1608442213687249340'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/1608442213687249340'/><link rel='alternate' type='text/html' href='http://nsh.nexiwave.com/2010/10/no-more-word-error-rate.html' title='No More Word Error Rate'/><author><name>Nickolay V. Shmyrev</name><uri>http://www.blogger.com/profile/11220369315272283124</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://4.bp.blogspot.com/-mLPNJlEdMV8/TbZcF3qUEqI/AAAAAAAAAKA/6PXGkO0cmAk/s220/me.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5075537609830514591.post-2250912547599453374</id><published>2010-10-01T03:12:00.003+04:00</published><updated>2010-10-01T03:20:36.698+04:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='speech recognition'/><category scheme='http://www.blogger.com/atom/ns#' term='conferences'/><title type='text'>Reading Interspeech 2010 Program</title><content type='html'>Luckily speech people don't have so many conferences. 
In the machine learning world it seems to be getting crazy. You can have a conference every month. Researchers travel more than sales managers. In speech there are ASRU and ICASSP, but they don't really matter. It's enough to track Interspeech. Since Tokyo is too far, I'm just reading the abstract list from the program. First impressions are:&lt;br /&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Keynotes are all boring&lt;/li&gt;&lt;li&gt;Interesting rise of the subject "automatic error detection in unit selection". At least &lt;b&gt;three!&lt;/b&gt; papers are presented&amp;nbsp;on the subject while I haven't seen any on it before. Looks like the idea appeared in less than a year! Are they spying on each other?&lt;/li&gt;&lt;li&gt;RWTH Aachen presented an enormous number of papers, LIUM is also quite fruitful&lt;/li&gt;&lt;li&gt; Well, IBM T. J. Watson Research Center is active as well, but that's more of a tradition&lt;br /&gt;&lt;/li&gt;&lt;li&gt;I came across this in one paper:&amp;nbsp;"yields modest gains on the English broadcast news RT-04 task, reducing the word error rate from 14.6% to 14.4%" Was it worth writing an article?&lt;/li&gt;&lt;li&gt;Cognitive status assessment from speech is important in dialogs. SRI is doing that&lt;/li&gt;&lt;li&gt;Strange that reverberation issues are treated as a separate class of problems and are so widely covered. The problem as a whole looks&amp;nbsp;rather generic - create features that are stable to noise and corruption. Not sure how reverberation is special here&lt;/li&gt;&lt;li&gt;WFST is loudly mentioned&lt;/li&gt;&lt;li&gt;Andreas Stolcke on SRILM noted that pruning doesn't work with KN-smoothed models! Damn, I was using it&lt;/li&gt;&lt;li&gt;Only 2 Russian papers at the whole conference. Well, it's 50% growth over the previous year. And one of them is on speech recognition, that's definitely progress&lt;/li&gt;&lt;li&gt;Surprisingly not so much research on confidence measures! 
Confidence is a &lt;b&gt;REALLY IMPORTANT THING&lt;/b&gt;&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;While reading the abstracts I also selected some papers which could be interesting for Nexiwave. You'll probably find this list easier to read than the 200 papers in the original program. Let's hope this list will be useful for me as well. To be honest, I didn't manage to read the papers I selected last year from Interspeech 2009.&lt;br /&gt;&lt;br /&gt;&lt;a name='more'&gt;&lt;/a&gt;&lt;br /&gt;&lt;b&gt;Unsupervised Discovery and Training of Maximally Dissimilar Cluster Models&lt;/b&gt;&lt;br /&gt;Francoise Beaufays (Google)&lt;br /&gt;Vincent Vanhoucke (Google)&lt;br /&gt;Brian Strope (Google)&lt;br /&gt;One of the difficult problems of acoustic modeling for Automatic Speech Recognition (ASR) is how to adequately model the wide variety of acoustic conditions which may be present in the data. The problem is especially acute for tasks such as Google Search by Voice, where the amount of speech available per transaction is small, and adaptation techniques start showing their limitations. As training data from a very large user population is available however, it is possible to identify and jointly model subsets of the data with similar acoustic qualities. We describe a technique which allows us to perform this modeling at scale on large amounts of data by learning a tree-structured partition of the acoustic space, and we demonstrate that we can significantly improve recognition accuracy in various conditions through unsupervised Maximum Mutual Information (MMI) training. 
Being fully unsupervised, this technique scales easily to increasing numbers of conditions.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Techniques for topic detection based processing in spoken dialog systems&lt;/b&gt;&lt;br /&gt;Rajesh Balchandran (IBM T J Watson Research Center)&lt;br /&gt;Leonid Rachevsky (IBM T J Watson Research Center)&lt;br /&gt;Bhuvana Ramabhadran (IBM T J Watson Research Center)&lt;br /&gt;Miroslav Novak (IBM T J Watson Research Center)&lt;br /&gt;In this paper we explore various techniques for topic detection in the context of conversational spoken dialog systems and also propose variants over known techniques to address the constraints of memory, accuracy and scalability associated with their practical implementation of spoken dialog systems. Tests were carried out on a multiple-topic spoken dialog system to compare and analyze these techniques. Results show benefits and compromises with each approach suggesting that the best choice of technique for topic detection would be dependent on the specific deployment requirements.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;A Hybrid Approach to Robust Word Lattice Generation Via Acoustic-Based Word Detection&lt;/b&gt;&lt;br /&gt;Icksang Han (Samsung Advanced Institute of Technology, Samsung Electronics Co., Ltd.)&lt;br /&gt;Chiyoun Park (Samsung Advanced Institute of Technology, Samsung Electronics Co., Ltd.)&lt;br /&gt;Jeongmi Cho (Samsung Advanced Institute of Technology, Samsung Electronics Co., Ltd.)&lt;br /&gt;Jeongsu Kim (Samsung Advanced Institute of Technology, Samsung Electronics Co., Ltd.)&lt;br /&gt;A large-vocabulary continuous speech recognition (LVCSR) system usually utilizes a language model in order to reduce the complexity of the algorithm. However, the constraint also produces side-effects including low accuracy of the out-of-grammar sentences and the error propagation of misrecognized words. 
In order to compensate for the side-effects of the language model, this paper proposes a novel lattice generation method that adopts the idea from the keyword detection method. By combining the word candidates detected mainly from the acoustic aspect of the signal to the word lattice from the ordinary speech recognizer, a hybrid lattice is constructed. The hybrid lattice shows 33% improvement in terms of the lattice accuracy under the condition where the lattice density is the same. In addition, it is observed that the proposed model shows less sensitivity to the out-of-grammar sentences and to the error propagation due to misrecognized words.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Time Conditioned Search in Automatic Speech Recognition Reconsidered&lt;/b&gt;&lt;br /&gt;David Nolden (RWTH Aachen)&lt;br /&gt;Hermann Ney (RWTH Aachen)&lt;br /&gt;Ralf Schlueter (RWTH Aachen)&lt;br /&gt;In this paper we re-investigate the time conditioned search (TCS) method in comparison to the well known word conditioned search (WCS), and analyze its applicability on state-of-the-art large vocabulary continuous speech recognition tasks. In contrast to current standard approaches, time conditioned search offers theoretical advantages particularly in combination with huge vocabularies and huge language models, but it is difficult to combine with across word modelling, which was proven to be an important technique in automatic speech recognition. Our novel contributions for TCS are a pruning step during the recombination called Early Word End Pruning, an additional recombination technique called Context Recombination, the idea of a Startup Interval to reduce the number of started trees, and a mechanism to combine TCS with across word modelling. 
We show that, with these techniques, TCS can outperform WCS on a current task.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Direct Construction of Compact Context-Dependency Transducers From Data&lt;/b&gt;&lt;br /&gt;David Rybach (RWTH Aachen University, Germany)&lt;br /&gt;Michael Riley (Google Inc., USA)&lt;br /&gt;This paper describes a new method for building compact context-dependency transducers for finite-state transducer-based ASR decoders. Instead of the conventional phonetic decision-tree growing followed by FST compilation, this approach incorporates the phonetic context splitting directly into the transducer construction. The objective function of the split optimization is augmented with a regularization term that measures the number of transducer states introduced by a split. We give results on a large spoken-query task for various n-phone orders and other phonetic features that show this method can greatly reduce the size of the resulting context-dependency transducer with no significant impact on recognition accuracy. This permits using context sizes and features that might otherwise be unmanageable.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;On the relation of Bayes Risk, Word Error, and Word Posteriors in ASR&lt;/b&gt;&lt;br /&gt;Ralf Schlueter (Lehrstuhl fuer Informatik 6 - Computer Science Department, RWTH Aachen University)&lt;br /&gt;Markus Nussbaum-Thom (Lehrstuhl fuer Informatik 6 - Computer Science Department, RWTH Aachen University)&lt;br /&gt;Hermann Ney (Lehrstuhl fuer Informatik 6 - Computer Science Department, RWTH Aachen University)&lt;br /&gt;In automatic speech recognition, we are faced with a well-known inconsistency: Bayes decision rule is usually used to minimize sentence (word sequence) error, whereas in practice we want to minimize word error, which also is the usual evaluation measure. Recently, a number of speech recognition approaches to approximate Bayes decision rule with word error (Levenshtein/edit distance) cost were proposed. 
Nevertheless, experiments show that the decisions often remain the same and that the effect on the word error rate is limited, especially at low error rates. In this work, further analytic evidence for these observations is provided. A set of conditions is presented, for which Bayes decision rule with sentence and word error cost function leads to the same decisions. Furthermore, the case of word error cost is investigated and related to word posterior probabilities. The analytic results are verified experimentally on several large vocabulary speech recognition tasks.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Efficient Data Selection for Speech Recognition Based on Prior Confidence Estimation Using Speech and Context Independent Models&lt;/b&gt;&lt;br /&gt;Satoshi KOBASHIKAWA (NTT Cyber Space Laboratories, NTT Corporation)&lt;br /&gt;Taichi ASAMI (NTT Cyber Space Laboratories, NTT Corporation)&lt;br /&gt;Yoshikazu YAMAGUCHI (NTT Cyber Space Laboratories, NTT Corporation)&lt;br /&gt;Hirokazu MASATAKI (NTT Cyber Space Laboratories, NTT Corporation)&lt;br /&gt;Satoshi TAKAHASHI (NTT Cyber Space Laboratories, NTT Corporation)&lt;br /&gt;This paper proposes an efficient data selection technique to identify well recognized texts in massive volumes of speech data. Conventional confidence measure techniques can be used to obtain this accurate data, but they require speech recognition results to estimate confidence. Without a significant level of confidence, considerable computer resources are wasted since inaccurate recognition results are generated only to be rejected later. The technique proposed herein rapidly estimates the prior confidence based on just an acoustic likelihood calculation by using speech and context independent models before speech recognition processing; it then recognizes data with high confidence selectively. 
Simulations show that it matches the data selection performance of the conventional posterior confidence measure with less than 2 % of the computation time.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Discovering an Optimal Set of Minimally Contrasting Acoustic Speech Units: A Point of Focus for Whole-Word Pattern Matching&lt;/b&gt;&lt;br /&gt;Guillaume Aimetti (University of Sheffield)&lt;br /&gt;Roger Moore (University of Sheffield)&lt;br /&gt;Louis ten Bosch (Radboud University)&lt;br /&gt;This paper presents a computational model that can automatically learn words, made up from emergent sub-word units, with no prior linguistic knowledge. This research is inspired by current cognitive theories of human speech perception, and therefore strives for ecological plausibility with the desire to build more robust speech recognition technology. Firstly, the particulate structure of the raw acoustic speech signal is derived through a novel acoustic segmentation process, the `acoustic DP-ngram algorithm'. Then, using a cross-modal association learning mechanism, word models are derived as a sequence of the segmented units. An efficient set of sub-word units emerge as a result of a general purpose lossy compression mechanism and the algorithm's sensitivity to discriminate acoustic differences. The results show that the system can automatically derive robust word representations and dynamically build re-usable sub-word acoustic units with no pre-defined language-specific rules.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Modeling pronunciation variation using context-dependent articulatory feature decision trees&lt;/b&gt;&lt;br /&gt;Samuel Bowman (Linguistics, The University of Chicago)&lt;br /&gt;Karen Livescu (TTI-Chicago)&lt;br /&gt;We consider the problem of predicting the surface pronunciations of a word in conversational speech, using a feature-based model of pronunciation variation. 
We build context-dependent decision trees for both phone-based and feature-based models, and compare their perplexities on conversational data from the Switchboard Transcription Project. We find that feature-based decision trees using feature bundles based on articulatory phonology outperform phone-based decision trees, and are much more robust to reductions in training data. We also analyze the usefulness of various context variables.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Accelerating Hierarchical Acoustic Likelihood Computation on Graphics Processors&lt;/b&gt;&lt;br /&gt;Pavel Kveton (IBM)&lt;br /&gt;Miroslav Novak (IBM)&lt;br /&gt;The paper presents a method for performance improvements of a speech recognition system by moving a part of the computation - acoustic likelihood computation - onto a Graphics Processor Unit (GPU). In the system, GPU operates as a low cost powerful coprocessor for linear algebra operations. The paper compares GPU implementation of two techniques of acoustic likelihood computation: full Gaussian computation of all components and a significantly faster Gaussian selection method using hierarchical evaluation. The full Gaussian computation is an ideal candidate for GPU implementation because of its matrix multiplication nature. The hierarchical Gaussian computation is a technique commonly used on a CPU since it leads to much better performance by pruning the computation volume. Pruning techniques are generally much harder to implement on GPUs, nevertheless, the paper shows that hierarchical Gaussian computation can be efficiently implemented on GPUs.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;The AMIDA 2009 Meeting Transcription System&lt;/b&gt;&lt;br /&gt;Thomas Hain (Univ Sheffield)&lt;br /&gt;Lukas Burget (Brno Univ. of Technology)&lt;br /&gt;John Dines (Idiap)&lt;br /&gt;Philip N. Garner (Idiap)&lt;br /&gt;Asmaa El Hannani (Univ. Sheffield)&lt;br /&gt;Marijn Huijbregts (Univ. Twente)&lt;br /&gt;Martin Karafiat (Brno Univ. 
of Technology)&lt;br /&gt;Mike Lincoln (Univ. of Edinburgh)&lt;br /&gt;Wan Vincent (Univ. Of Sheffield)&lt;br /&gt;We present the AMIDA 2009 system for participation in the NIST RT'2009 STT evaluations. Systems for close-talking, far field and speaker attributed STT conditions are described. Improvements to our previous systems are: segmentation and diarisation; stacked bottle-neck posterior feature extraction; fMPE training of acoustic models; adaptation on complete meetings; improvements to WFST decoding; automatic optimisation of decoders and system graphs. Overall these changes gave a 6-13% relative reduction in word error rate while at the same time reducing the real-time factor by a factor of five and using considerably less data for acoustic model training.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;A FACTORIAL SPARSE CODER MODEL FOR SINGLE CHANNEL SOURCE SEPARATION&lt;/b&gt;&lt;br /&gt;Robert Peharz (Graz University of Technology)&lt;br /&gt;Michael Stark (Graz University of Technology)&lt;br /&gt;Franz Pernkopf (Graz University of Technology)&lt;br /&gt;Yannis Stylianou (University of Crete)&lt;br /&gt;We propose a probabilistic factorial sparse coder model for single channel source separation in the magnitude spectrogram domain. The mixture spectrogram is assumed to be the sum of the sources, which are assumed to be generated frame-wise as the output of sparse coders plus noise. For dictionary training we use an algorithm which can be described as non-negative matrix factorization with ℓ0 sparseness constraints. In order to infer likely source spectrogram candidates, we approximate the intractable exact inference by maximizing the posterior over a plausible subset of solutions. We compare our system to the factorial-max vector quantization model, where the proposed method shows a superior performance in terms of signal-to-interference ratio. 
Finally, the low computational requirements of the algorithm allows close to real time applications.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;ORIENTED PCA METHOD FOR BLIND SPEECH SEPARATION OF CONVOLUTIVE MIXTURES&lt;/b&gt;&lt;br /&gt;Yasmina Benabderrahmane (INRS-EMT Telecommunications Canada)&lt;br /&gt;Sid Ahmed Selouani (Université de Moncton Canada)&lt;br /&gt;Douglas O’Shaughnessy (INRS-EMT Telecommunications Canada)&lt;br /&gt;This paper deals with blind speech separation of convolutive mixtures of sources. The separation criterion is based on Oriented Principal Components Analysis (OPCA) in the frequency domain. OPCA is a (second order) extension of standard Principal Component Analysis (PCA) aiming at maximizing the power ratio of a pair of signals. The convolutive mixing is obtained by modeling the Head Related Transfer Function (HRTF). Experimental results show the efficiency of the proposed approach in terms of subjective and objective evaluation, when compared to the Degenerate Unmixing Evaluation Technique (DUET) and the widely used C-FICA (Convolutive Fast-ICA) algorithm&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Speaker Adaptation Based on System Combination Using Speaker-Class Models&lt;/b&gt;&lt;br /&gt;Tetsuo Kosaka (Yamagata University)&lt;br /&gt;Takashi Ito (Yamagata University)&lt;br /&gt;Masaharu Kato (Yamagata University)&lt;br /&gt;Masaki Kohda (Yamagata University)&lt;br /&gt;In this paper, we propose a new system combination approach for an LVCSR system using speaker-class (SC) models and a speaker adaptation technique based on these SC models. The basic concept of the SC-based system is to select speakers who are acoustically similar to a target speaker to train acoustic models. One of the major problems regarding the use of the SC model is determining the selection range of the speakers. In other words, it is difficult to determine the number of speakers that should be selected. 
In order to solve this problem, several SC models, which are trained by a variety of number of speakers are prepared in advance. In the recognition step, acoustically similar models are selected from the above SC models, and the scores obtained from these models are merged using a word graph combination technique. The proposed method was evaluated using the Corpus of Spontaneous Japanese (CSJ), and showed significant improvement in a lecture speech recognition task.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Feature versus Model Based Noise Robustness&lt;/b&gt;&lt;br /&gt;Kris Demuynck (Katholieke Universiteit Leuven, &amp;nbsp;dept. ESAT)&lt;br /&gt;Xueru Zhang (Katholieke Universiteit Leuven, &amp;nbsp;dept. ESAT)&lt;br /&gt;Dirk Van Compernolle (Katholieke Universiteit Leuven, &amp;nbsp;dept. ESAT)&lt;br /&gt;Hugo Van hamme (Katholieke Universiteit Leuven, &amp;nbsp;dept. ESAT)&lt;br /&gt;Over the years, the focus in noise robust speech recognition has shifted from noise robust features to model based techniques such as parallel model combination and uncertainty decoding. In this paper, we contrast prime examples of both approaches in the context of large vocabulary recognition systems such as used for automatic audio indexing and transcription. We look at the approximations the techniques require to keep the computational load reasonable, the resulting computational cost, and the accuracy measured on the Aurora4 benchmark. 
The results show that a well designed feature based scheme is capable of providing recognition accuracies at least as good as the model based approaches at a substantially lower computational cost.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;The role of higher-level linguistic features in HMM-based speech synthesis&lt;/b&gt;&lt;br /&gt;Oliver Watts (Centre for Speech Technology Research, University of Edinburgh, UK)&lt;br /&gt;Junichi Yamagishi (Centre for Speech Technology Research, University of Edinburgh, UK)&lt;br /&gt;Simon King (Centre for Speech Technology Research, University of Edinburgh, UK)&lt;br /&gt;We analyse the contribution of higher-level elements of the linguistic specification of a data-driven speech synthesiser to the naturalness of the synthetic speech which it generates. The system is trained using various subsets of the full feature-set, in which features relating to syntactic category, intonational phrase boundary, pitch accent and boundary tones are selectively removed. Utterances synthesised by the different configurations of the system are then compared in a subjective evaluation of their naturalness. The work presented forms background analysis for an on-going set of experiments in performing text-to-speech (TTS) conversion based on shallow features: features that can be trivially extracted from text. By building a range of systems, each assuming the availability of a different level of linguistic annotation, we obtain benchmarks for our on-going work.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Latent Perceptual Mapping: A New Acoustic Modeling Framework for Speech Recognition&lt;/b&gt;&lt;br /&gt;Shiva Sundaram (Deutsche Telekom Laboratories, Ernst-Reuter-Platz-7, Berlin 10587. Germany)&lt;br /&gt;Jerome Bellegarda (Apple Inc., 3 Infinite Loop, Cupertino, 95014 California. 
USA.)&lt;br /&gt;While hidden Markov modeling is still the dominant paradigm for speech recognition, in recent years there has been renewed interest in alternative, template-like approaches to acoustic modeling. Such methods sidestep usual HMM limitations as well as inherent issues with parametric statistical distributions, though typically at the expense of large amounts of memory and computing power. This paper introduces a new framework, dubbed latent perceptual mapping, which naturally leverages a reduced dimensionality description of the observations. This allows for a viable parsimonious template-like solution where models are closely aligned with perceived acoustic events. Context-independent phoneme classification experiments conducted on the TIMIT database suggest that latent perceptual mapping achieves results comparable to conventional acoustic modeling but at potentially significant savings in online costs.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;State-based labelling for a sparse representation of speech and its application to robust speech recognition&lt;/b&gt;&lt;br /&gt;Tuomas Virtanen (Department of Signal Processing, Tampere University of Technology, Finland)&lt;br /&gt;Jort F. Gemmeke (Centre for Language and Speech Technology, Radboud University Nijmegen, The Netherlands)&lt;br /&gt;Antti Hurmalainen (Department of Signal Processing, Tampere University of Technology, Finland)&lt;br /&gt;This paper proposes a state-based labeling for acoustic patterns of speech and a method for using this labelling in noise-robust automatic speech recognition. Acoustic time-frequency segments of speech, exemplars, are obtained from a training database and associated with time-varying state labels using the transcriptions. In the recognition phase, noisy speech is modeled by a sparse linear combination of noise and speech exemplars. 
The likelihoods of states are obtained by linear combination of the exemplar weights, which can then be used to estimate the most likely state transition path. The proposed method was tested in the connected digit recognition task with noisy speech material from the Aurora-2 database where it is shown to produce better results than the existing histogram-based labeling method.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Single-channel speech enhancement using Kalman filtering in the modulation domain&lt;/b&gt;&lt;br /&gt;Stephen So (Signal Processing Laboratory, Griffith University)&lt;br /&gt;Kamil K. Wojcicki (Signal Processing Laboratory, Griffith University)&lt;br /&gt;Kuldip K. Paliwal (Signal Processing Laboratory, Griffith University)&lt;br /&gt;In this paper, we propose the modulation-domain Kalman filter (MDKF) for speech enhancement. In contrast to previous modulation domain-enhancement methods based on bandpass filtering, the MDKF is an adaptive and linear MMSE estimator that uses models of the temporal changes of the magnitude spectrum for both speech and noise. Also, because the Kalman filter is a joint magnitude and phase spectrum estimator, under non-stationarity assumptions, it is highly suited for modulation-domain processing, as modulation phase tends to contain more speech information than acoustic phase. Experimental results from the NOIZEUS corpus show the ideal MDKF (with clean speech parameters) to outperform all the acoustic and time-domain enhancement methods that were evaluated, including the conventional time-domain Kalman filter with clean speech parameters. 
A practical MDKF that uses the MMSE-STSA method to enhance noisy speech in the acoustic domain prior to LPC analysis was also evaluated and showed promising results.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Metric Subspace Indexing for Fast Spoken Term Detection&lt;/b&gt;&lt;br /&gt;Taisuke Kaneko (Toyohashi University of Technology)&lt;br /&gt;Tomoyosi Akiba (Toyohashi University of Technology)&lt;br /&gt;In this paper, we propose a novel indexing method for Spoken Term Detection (STD). The proposed method can be considered as using metric space indexing for the approximate string-matching problem, where the distance between a phoneme and a position in the target spoken document is defined. The proposed method does not require the use of thresholds to limit the output, instead being able to output the results in increasing order of distance. It can also deal easily with the multiple candidates obtained via Automatic Speech Recognition (ASR). The results of preliminary experiments show promise for achieving fast STD.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Discriminative Language Modeling Using Simulated ASR Errors&lt;/b&gt;&lt;br /&gt;Preethi Jyothi (Department of Computer Science and Engineering, The Ohio State University, USA)&lt;br /&gt;Eric Fosler-Lussier (Department of Computer Science and Engineering, The Ohio State University, USA)&lt;br /&gt;In this paper, we approach the problem of discriminatively training language models using a weighted finite state transducer (WFST) framework that does not require acoustic training data. The phonetic confusions prevalent in the recognizer are modeled using a confusion matrix that takes into account information from the pronunciation model (word-based phone confusion log likelihoods) and information from the acoustic model (distances between the phonetic acoustic models). 
This confusion matrix, within the WFST framework, is used to generate confusable word graphs that serve as inputs to the averaged perceptron algorithm to train the parameters of the discriminative language model. Experiments on a large vocabulary speech recognition task show significant word error rate reductions when compared to a baseline using a trigram model trained with the maximum likelihood criterion.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Learning a Language Model from Continuous Speech&lt;/b&gt;&lt;br /&gt;Graham Neubig (Graduate School of Informatics, Kyoto University)&lt;br /&gt;Masato Mimura (Graduate School of Informatics, Kyoto University)&lt;br /&gt;Shinsuke Mori (Graduate School of Informatics, Kyoto University)&lt;br /&gt;Tatsuya Kawahara (Graduate School of Informatics, Kyoto University)&lt;br /&gt;This paper presents a new approach to language model construction, learning a language model not from text, but directly from continuous speech. A phoneme lattice is created using acoustic model scores, and Bayesian techniques are used to robustly learn a language model from this noisy input. A novel sampling technique is devised that allows for the integrated learning of word boundaries and an n-gram language model with no prior linguistic knowledge. The proposed techniques were used to learn a language model directly from continuous, potentially large-vocabulary speech. This language model was able to significantly reduce the ASR phoneme error rate over a separate set of test data, and the proposed lattice processing and lexical acquisition techniques were found to be important factors in this improvement.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;New Insights into Subspace Noise Tracking&lt;/b&gt;&lt;br /&gt;Mahdi Triki (Philips Research Laboratories)&lt;br /&gt;Various speech enhancement techniques rely on the knowledge of the clean signal and noise statistics. 
In practice, however, these statistics are not explicitly available, and the overall enhancement accuracy critically depends on the estimation quality of the unknown statistics. The estimation of noise (and speech) statistics is particularly challenging under non-stationary noise conditions. In this respect, subspace-based approaches have been shown to provide a good tracking vs. final misadjustment tradeoff. Subspace-based techniques hinge critically on both rank-limited and spherical assumptions of the speech and the noise DFT matrices, respectively. The speech rank-limited assumption was previously experimentally tested and validated. In this paper, we will investigate the structure of nuisance sources. We will discuss the validity of the spherical assumption for a variety of nuisance sources (environmental noise, reverberation), and preprocessing (overlapping segmentation).&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Acoustic Correlates of Meaning Structure in Conversational Speech&lt;/b&gt;&lt;br /&gt;Alexei V. Ivanov (DISI, University of Trento, Italy)&lt;br /&gt;Giuseppe Riccardi (DISI, University of Trento, Italy)&lt;br /&gt;Sucheta Ghosh (DISI, University of Trento, Italy)&lt;br /&gt;Sara Tonelli (FBK-IRST, Trento, Italy)&lt;br /&gt;Evgeny Stepanov (DISI, University of Trento, Italy)&lt;br /&gt;We are interested in the problem of extracting meaning structures from spoken utterances in human communication. In SLU systems, parsing of meaning structures is carried over the word hypotheses generated by the ASR. This approach suffers from high word error rates and ad-hoc conceptual representations. In contrast, in this paper we aim at discovering meaning components from direct measurements of acoustic and non-verbal linguistic features. The meaning structures are taken from the frame semantics model proposed in FrameNet. We give a quantitative analysis of meaning structures in terms of speech features across human--human dialogs from the manually annotated LUNA corpus. 
We show that the acoustic correlations between pitch, formant trajectories, intensity and harmonicity and meaning features are statistically significant over the whole corpus as well as relevant in classifying the target words evoked by a semantic frame.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Using Harmonic Phase Information to Improve ASR Rate&lt;/b&gt;&lt;br /&gt;Ibon Saratxaga (Aholab Signal Processing Laboratory, University of the Basque Country)&lt;br /&gt;Inma Hernáez (Aholab Signal Processing Laboratory, University of the Basque Country)&lt;br /&gt;Igor Odriozola (Aholab Signal Processing Laboratory, University of the Basque Country)&lt;br /&gt;Eva Navas (Aholab Signal Processing Laboratory, University of the Basque Country)&lt;br /&gt;Iker Luengo (Aholab Signal Processing Laboratory, University of the Basque Country)&lt;br /&gt;Daniel Erro (Aholab Signal Processing Laboratory, University of the Basque Country)&lt;br /&gt;Spectral phase information is usually discarded in automatic speech recognition (ASR). The Relative Phase Shift (RPS), a novel representation of the phase information of the speech, has features which seem to be appropriate to improve the ASR recognition rate. In this paper we describe the RPS representation, discuss different ways to parameterize this information in a suitable way for the HMM modelling, and present the results of the evaluation experiments. WER improvements ranging from 12 to 22% open promising perspectives for the use of this information jointly with the classical MFCC parameterization. Index Terms: ASR, phase spectrum, harmonic analysis&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Using Dependency Parsing and Machine Learning for Factoid Question Answering on Spoken Documents&lt;/b&gt;&lt;br /&gt;Pere R. 
Comas (TALP Research Center, Technical University of Catalonia (UPC))&lt;br /&gt;Lluís Màrquez (TALP Research Center, Technical University of Catalonia (UPC))&lt;br /&gt;Jordi Turmo (TALP Research Center, Technical University of Catalonia (UPC))&lt;br /&gt;This paper presents our experiments in question answering for speech corpora. These experiments focus on improving the answer extraction step of the QA process. We present two approaches to answer extraction in question answering for speech corpora that apply machine learning to improve the coverage and precision of the extraction. The first one is a reranker that uses only lexical information, the second one uses dependency parsing to score similarity between syntactic structures. Our experimental results show that the proposed learning models improve our previous results using only hand-made ranking rules with small syntactic information. We evaluate the system on manual transcripts of speech from EPPS English corpus and a set of questions transcribed from spontaneous oral questions. This data belongs to the CLEF 2009 evaluation track on QA on speech transcripts (QAst).&lt;br /&gt;&lt;br /&gt;&lt;b&gt;A Spoken Term Detection Framework for Recovering Out-of-Vocabulary Words Using the Web&lt;/b&gt;&lt;br /&gt;Carolina Parada (Johns Hopkins University)&lt;br /&gt;Abhinav Sethy (IBM TJ Watson Research Center)&lt;br /&gt;Mark Dredze (Johns Hopkins University)&lt;br /&gt;Frederick Jelinek (Johns Hopkins University)&lt;br /&gt;Vocabulary restrictions in large vocabulary continuous speech recognition (LVCSR) systems mean that out-of-vocabulary (OOV) words are lost in the output. However, OOV words tend to be information rich terms (often named entities) and their omission from the transcript negatively affects both usability and downstream NLP technologies, such as machine translation or knowledge distillation. We propose a novel approach to OOV recovery that uses a spoken term detection (STD) framework. 
Given an identified OOV region in the LVCSR output, we recover the uttered OOVs by utilizing contextual information and the vast and constantly updated vocabulary on the Web. Discovered words are integrated into the system output, recovering up to 40% of the OOV terms and resulting in a reduction in system error.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Boosting Systems for LVCSR&lt;/b&gt;&lt;br /&gt;George Saon (IBM T.J. Watson Research Center)&lt;br /&gt;Hagen Soltau (IBM T.J. Watson Research Center)&lt;br /&gt;We employ a variant of the popular Adaboost algorithm to train multiple acoustic models such that the aggregate system exhibits improved performance over the individual recognizers. Each model is trained sequentially on re-weighted versions of the training data. At each iteration, the weights are decreased for the frames that are correctly decoded by the current system. These weights are then multiplied with the frame-level statistics for the decision trees and Gaussian mixture components of the next iteration system. The composite system uses a log-linear combination of HMM state observation likelihoods. We report experimental results on several broadcast news transcription setups which differ in the language being spoken (English and Arabic) and amounts of training data. Our findings suggest that significant gains can be obtained for small amounts of training data even after feature and model-space discriminative training.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Incorporating Sparse Representation Phone Identification Features in Automatic Speech Recognition using Exponential Families&lt;/b&gt;&lt;br /&gt;Vaibhava Goel (IBM T.J. Watson Research Center)&lt;br /&gt;Tara Sainath (IBM T.J. Watson Research Center)&lt;br /&gt;Bhuvana Ramabhadran (IBM T.J. Watson Research Center)&lt;br /&gt;Peder Olsen (IBM T.J. Watson Research Center)&lt;br /&gt;David Nahamoo (IBM T.J. Watson Research Center)&lt;br /&gt;Dimitri Kanevsky (IBM T.J. 
Watson Research Center)&lt;br /&gt;Sparse representation phone identification features (SPIF) is a recently developed technique to obtain an estimate of phone posterior probabilities conditioned on an acoustic feature vector. In this paper, we explore incorporating SPIF phone posterior probability estimates in large vocabulary continuous speech recognition (LVCSR) task by including them as additional features of exponential densities that model the HMM state emission likelihoods. We compare our proposed approach to a number of other well known methods of combining feature streams or multiple LVCSR systems. Our experiments show that using exponential models to combine features results in a word error rate reduction of 0.5% absolute (18.7% down to 18.2%); this is comparable to best error rate reduction obtained from system combination methods, but without having to build multiple systems or tune the system combination weights.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Improving ASR-based topic segmentation of TV programs with confidence measures and semantic relations&lt;/b&gt;&lt;br /&gt;Camille Guinaudeau (INRIA/IRISA)&lt;br /&gt;Guillaume Gravier (IRISA/CNRS)&lt;br /&gt;Pascale Sébillot (IRISA/INSA)&lt;br /&gt;The increasing quantity of video material requires methods to help users navigate such data, among which topic segmentation techniques. The goal of this article is to improve ASR-based topic segmentation methods to deal with peculiarities of professional-video transcripts (transcription errors and lack of repetitions) while remaining generic enough. To this end, we introduce confidence measures and semantic relations in a segmentation method based on lexical cohesion. We show significant improvements of the F1-measure, +1.7 and +1.9 when integrating confidence measures and semantic relations respectively. 
Such improvement demonstrates that simple clues can counteract errors in automatic transcripts and lack of repetitions.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;A Novel text-independent phonetic segmentation algorithm based on the Microcanonical Multiscale Formalism&lt;/b&gt;&lt;br /&gt;Vahid Khanagha (INRIA Bordeaux Sud-Ouest)&lt;br /&gt;Khalid Daoudi (INRIA Bordeaux Sud-Ouest)&lt;br /&gt;Oriol Pont (INRIA Bordeaux Sud-Ouest)&lt;br /&gt;Hussein Yahia (INRIA Bordeaux Sud-Ouest)&lt;br /&gt;We propose a radically novel approach to analyze speech signals from a statistical physics perspective. Our approach is based on a new framework, the Microcanonical Multiscale Formalism (MMF), which is based on the computation of singularity exponents, defined at each point in the signal domain. The latter allows nonlinear analysis of complex dynamics and, particularly, characterizes the intermittent signature. We study the validity of the MMF for the speech signal and show that singularity exponents convey indeed valuable information about its local dynamics. We define an accumulative measure on the exponents which reveals phoneme boundaries as the breaking points of a piecewise linear-like curve. We then develop a simple automatic phonetic segmentation algorithm using piecewise linear curve fitting. We present experiments on the full TIMIT database. 
The results show that our algorithm yields considerably better accuracy than recently published ones.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Exploring Recognition Network Representations for Efficient Speech Inference on Highly Parallel Platforms&lt;/b&gt;&lt;br /&gt;Jike Chong (University of California, Berkeley; Parasians, LLC)&lt;br /&gt;Ekaterina Gonina (University of California, Berkeley)&lt;br /&gt;Kisun You (Seoul National University)&lt;br /&gt;Kurt Keutzer (University of California, Berkeley)&lt;br /&gt;The emergence of highly parallel computing platforms is enabling new trade-offs in algorithm design for automatic speech recognition. It naturally motivates the following investigation: Do the most computationally efficient sequential algorithms lead to the most computationally efficient parallel algorithms? In this paper we explore two contending recognition network representations for speech inference engines: the linear lexical model (LLM) and the weighted finite state transducer (WFST). We demonstrate that while an inference engine using the simpler LLM representation evaluates 22x more transitions per second than the advanced WFST representation, the simple structure of the LLM representation allows 4.7-6.4x faster evaluation and 53-65x faster operand-gathering for each state transition. We use the 5k Wall Street Journal Corpus to experiment on the NVIDIA GTX480 (Fermi) and the NVIDIA GTX285 Graphics Processing Units (GPUs), and illustrate that the performance of a speech inference engine based on the LLM representation is competitive with the WFST representation on highly parallel implementation platforms.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Speech Recognizer Optimization under Speed Constraints&lt;/b&gt;&lt;br /&gt;Ivan Bulyko (Raytheon BBN Technologies)&lt;br /&gt;We present an efficient algorithm for optimizing parameters of a speech recognizer aimed at obtaining maximum accuracy at a specified decoding speed. 
This algorithm is not tied to any particular decoding architecture or type of tunable parameter being used. It can also be applied to any performance metric (e.g. WER, keyword search or topic ID accuracy) and thus allows tuning to the target application. We demonstrate the effectiveness of this approach by tuning BBN’s Byblos recognizer to run at 15 times faster than real time while maximizing keyword search accuracy.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;The 2010 CMU GALE Speech-to-Text System&lt;/b&gt;&lt;br /&gt;Florian Metze (Carnegie Mellon University)&lt;br /&gt;Roger Hsiao (Carnegie Mellon University)&lt;br /&gt;Qin Jin (Carnegie Mellon University)&lt;br /&gt;Udhyakumar Nallasamy (Carnegie Mellon University)&lt;br /&gt;Tanja Schultz (Karlsruhe Institute of Technology)&lt;br /&gt;This paper describes the latest Speech-to-Text system developed for the Global Autonomous Language Exploitation ("GALE") domain by Carnegie Mellon University (CMU). This system uses discriminative training, bottle-neck features and other techniques that were not used in previous versions of our system, and is trained on 1150 hours of data from a variety of Arabic speech sources. 
In this paper, we show how different lexica, pre-processing, and system combination techniques can be used to improve the final output, and provide analysis of the improvements achieved by the individual techniques.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;The RWTH 2009 Quaero ASR Evaluation System for English and German&lt;/b&gt;&lt;br /&gt;Markus Nußbaum-Thom (RWTH Aachen University)&lt;br /&gt;Simon Wiesler (RWTH Aachen University)&lt;br /&gt;Martin Sundermeyer (RWTH Aachen University)&lt;br /&gt;Christian Plahl (RWTH Aachen University)&lt;br /&gt;Stefan Hahn (RWTH Aachen University)&lt;br /&gt;Ralf Schlüter (RWTH Aachen University)&lt;br /&gt;Hermann Ney (RWTH Aachen University)&lt;br /&gt;In this work, the RWTH automatic speech recognition systems for English and German for the second Quaero evaluation campaign 2009 are presented. The systems are designed to transcribe web data, European parliament plenary sessions and broadcast news data. Another challenge in the 2009 evaluation is that almost no in-domain training data is provided and the test data contains a large variety of speech types. The RWTH participates for the English and German languages with the best results for German and competitive results for the English. Contributing to the enhancements are the systematic use of hierarchical neural network based posterior features, system combination, speaker adaptation, cross speaker adaptation, domain dependent modeling and the usage of additional training data.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;The Impact of ASR on Abstractive vs. Extractive Meeting Summaries&lt;/b&gt;&lt;br /&gt;Gabriel Murray (University of British Columbia)&lt;br /&gt;Giuseppe Carenini (University of British Columbia)&lt;br /&gt;Raymond Ng (University of British Columbia)&lt;br /&gt;In this paper we describe a complete abstractive summarizer for meeting conversations, and evaluate the usefulness of the automatically generated abstracts in a browsing task. 
We contrast these abstracts with extracts for use in a meeting browser and investigate the effects of manual versus ASR transcripts on both summary types.&lt;br /&gt;&lt;br /&gt;&lt;i&gt;This is a comparison that I feel is wrong&lt;/i&gt;&lt;br /&gt;&lt;b&gt;An Empirical Comparison of the T3, Juicer, HDecode and Sphinx3 Decoders&lt;/b&gt;&lt;br /&gt;Josef R. Novak (Tokyo Institute of Technology)&lt;br /&gt;Paul R. Dixon (National Institute of Information and Communications Technology)&lt;br /&gt;Sadaoki Furui (Tokyo Institute of Technology)&lt;br /&gt;In this paper we perform a cross-comparison of the T3 WFST decoder against three different speech recognition decoders on three separate tasks of variable difficulty. We show that the T3 decoder performs favorably against several established veterans in the field, including the Juicer WFST decoder, Sphinx3, and HDecode in terms of RTF versus Word Accuracy. In addition to comparing decoder performance, we evaluate both Sphinx and HTK acoustic models on a common footing inside T3, and show that the speed benefits that typically accompany the WFST approach increase with the size of the vocabulary and other input knowledge sources. In the case of T3, we also show that GPU acceleration can significantly extend these gains.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;CRF-based Combination of Contextual Features to Improve A Posteriori Word-level Confidence Measures&lt;/b&gt;&lt;br /&gt;Julien Fayolle (IRISA/INRIA Rennes, France)&lt;br /&gt;Fabienne Moreau (University of Rennes 2/IRISA Rennes, France)&lt;br /&gt;Christian Raymond (IRISA/INSA Rennes, France)&lt;br /&gt;Guillaume Gravier (IRISA/CNRS Rennes, France)&lt;br /&gt;Patrick Gros (IRISA/INRIA Rennes, France)&lt;br /&gt;This paper addresses the issue of confidence measure reliability provided by automatic speech recognition systems for use in various spoken language processing applications. 
We propose a method based on conditional random field to combine contextual features to improve word-level confidence measures. The method consists in combining various knowledge sources (acoustic, lexical, linguistic, phonetic and morphosyntactic) to enhance confidence measures, explicitly exploiting context information. Experiments were conducted on a large French broadcast news corpus from the ESTER benchmark. Results demonstrate the added-value of our method with a significant improvement of the normalized cross entropy and of the equal error rate.&lt;br /&gt;&lt;br /&gt;&lt;i&gt;Interesting, do they mention Voxforge in this paper&lt;/i&gt;&lt;br /&gt;&lt;b&gt;Building transcribed speech corpora quickly and cheaply for many languages&lt;/b&gt;&lt;br /&gt;Thad Hughes (Google)&lt;br /&gt;Kaisuke Nakajima (Google)&lt;br /&gt;Linne Ha (Google)&lt;br /&gt;Atul Vasu (Google)&lt;br /&gt;Pedro Moreno (Google)&lt;br /&gt;Mike LeBeau (Google)&lt;br /&gt;We present a system for quickly and cheaply building transcribed speech corpora containing utterances from many speakers in a variety of acoustic conditions. The system consists of a client application running on an Android mobile device with an intermittent Internet connection to a server. The client application collects demographic information about the speaker, fetches textual prompts from the server for the speaker to read, records the speaker’s voice, and uploads the audio and associated metadata to the server. 
The system has so far been used to collect over 3000 hours of transcribed audio in 17 languages around the world.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;On Generating Combilex Pronunciations via Morphological Analysis&lt;/b&gt;&lt;br /&gt;Korin Richmond (Centre for Speech Technology Research, Edinburgh University)&lt;br /&gt;Robert Clark (Centre for Speech Technology Research)&lt;br /&gt;Sue Fitt (Centre for Speech Technology Research)&lt;br /&gt;Combilex is a high-quality lexicon that has been developed specifically for speech technology purposes and recently released by CSTR. Combilex benefits from many advanced features. This paper explores one of these: the ability to generate fully-specified transcriptions for morphologically derived words automatically. This functionality was originally implemented to encode the pronunciations of derived words in terms of their constituent morphemes, thus accelerating lexicon development and ensuring a high level of consistency. In this paper, we propose this method of modelling pronunciations can be exploited further by combining it with a morphological parser, thus yielding a method to generate full transcriptions for unknown derived words. Not only could this accelerate adding new derived words to Combilex, but it could also serve as an alternative to conventional letter-to-sound rules. This paper presents preliminary work indicating this is a promising direction.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Content-Based Advertisement Detection&lt;/b&gt;&lt;br /&gt;Patrick Cardinal (CRIM)&lt;br /&gt;Vishwa Gupta (CRIM)&lt;br /&gt;Gilles Boulianne (CRIM)&lt;br /&gt;Television advertising is widely used by companies to promote their products among the public but it is hard for an advertiser to know if its advertisements are broadcast as they should. 
For this reason, some companies are specialized in the monitoring of audio/video streams for validating that ads are broadcast according to what was requested and paid for by the advertiser. The procedure for searching specific ads in an audio stream is very similar to the copy detection task for which we have developed very efficient algorithms. This work reports results of applying our copy detection algorithms to the advertisement detection task. Compared to a commercial software, we detected 18% more advertisements and the system runs at 0.003x of real-time.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Continuous Speech Recognition with a TF-IDF Acoustic Model&lt;/b&gt;&lt;br /&gt;Geoffrey Zweig (Microsoft)&lt;br /&gt;Patrick Nguyen (Microsoft)&lt;br /&gt;Jasha Droppo (Microsoft)&lt;br /&gt;Alex Acero (Microsoft)&lt;br /&gt;Information retrieval methods are frequently used for indexing and retrieving spoken documents, and more recently have been proposed for voice-search amongst a pre-defined set of business entries. In this paper, we show that these methods can be used in an even more fundamental way, as the core component in a continuous speech recognizer. Speech is initially processed and represented as a sequence of discrete symbols, specifically phoneme or multi-phone units. Recognition then operates on this sequence. The recognizer is segment-based, and the acoustic score for labeling a segment with a word is based on the TF-IDF similarity between the subword units detected in the segment, and those typically seen in association with the word. We present promising results on both a voice search task and the Wall Street Journal task. 
The development of this method brings us one step closer to being able to do speech recognition based on the detection of sub-word audio attributes.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Improved topic classification and keyword discovery using an HMM-based speech recognizer trained without supervision&lt;/b&gt;&lt;br /&gt;Man-Hung Siu (Raytheon BBN Technologies)&lt;br /&gt;Herbert Gish (Raytheon BBN Technologies)&lt;br /&gt;Arthur Chan (Raytheon BBN Technologies)&lt;br /&gt;William Belfield (Raytheon BBN Technologies)&lt;br /&gt;In our previous publication, we presented a new approach to HMM training, viz., training without supervision. We used an HMM trained without supervision for transcribing audio into self-organized units (SOUs) for the purpose of topic classification. In this paper we report improvements made to the system, including the use of context dependent acoustic models and lattice based features that together reduce the topic verification equal error rate from 12% to 7%. In addition to discussing the effectiveness of the SOU approach we describe how we analyzed some selected SOU n-grams and found that they were highly correlated with keywords, demonstrating the ability of the SOU technology to discover topic relevant keywords.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5075537609830514591-2250912547599453374?l=nsh.nexiwave.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nsh.nexiwave.com/feeds/2250912547599453374/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://nsh.nexiwave.com/2010/10/reading-interspeech-2010-program.html#comment-form' title='3 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/2250912547599453374'/><link rel='self' type='application/atom+xml' 
href='http://www.blogger.com/feeds/5075537609830514591/posts/default/2250912547599453374'/><link rel='alternate' type='text/html' href='http://nsh.nexiwave.com/2010/10/reading-interspeech-2010-program.html' title='Reading Interspeech 2010 Program'/><author><name>Nickolay V. Shmyrev</name><uri>http://www.blogger.com/profile/11220369315272283124</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://4.bp.blogspot.com/-mLPNJlEdMV8/TbZcF3qUEqI/AAAAAAAAAKA/6PXGkO0cmAk/s220/me.jpg'/></author><thr:total>3</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5075537609830514591.post-7466096869089115604</id><published>2010-09-25T06:32:00.005+04:00</published><updated>2011-04-26T17:22:37.947+04:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='pocketsphinx'/><category scheme='http://www.blogger.com/atom/ns#' term='asterisk'/><title type='text'>Voicemail transcription with Pocketsphinx and Asterisk</title><content type='html'>&lt;div dir="ltr" style="text-align: left;" trbidi="on"&gt;This post is for admins who know that pocketsphinx exists and want to try it. It describes how to quickly set up voicemail transcription using pocketsphinx and &lt;a href="http://asterisk.org/"&gt;Asterisk&lt;/a&gt;.&amp;nbsp;The process is extremely simple;&amp;nbsp;I promise it will not take more than 5 minutes.&lt;br /&gt;&lt;br /&gt;We'll use an external shell command that is invoked when a voicemail arrives; it transcribes any voicemails that haven't been transcribed yet. We won't postprocess or clean up the result. The goal is just to show how to do it quickly and how an Asterisk interface can be built.&lt;br /&gt;&lt;br /&gt;&lt;a name='more'&gt;&lt;/a&gt;&lt;br /&gt;So, let's start.&lt;br /&gt;&lt;br /&gt;1. Set up Asterisk. I hope it will run smoothly, it's really easy. 
Set up samples with &lt;span class="Apple-style-span" style="font-family: 'Courier New',Courier,monospace;"&gt;make samples&lt;/span&gt;. Our demo will be based on them.&lt;br /&gt;&lt;br /&gt;2. Set up pocketsphinx. You need to download pocketsphinx and sphinxbase from the &lt;a href="http://cmusphinx.sourceforge.net/wiki/download"&gt;download &lt;/a&gt;page. You need at least pocketsphinx version 0.7; previous versions will not work because some of the required features are only available in this release or later.&lt;br /&gt;&lt;br /&gt;3. Check that pocketsphinx works. Just run &lt;span class="Apple-style-span" style="font-family: 'Courier New',Courier,monospace;"&gt;pocketsphinx_continuous&lt;/span&gt; and try to say something when &lt;span class="Apple-style-span" style="font-family: 'Courier New',Courier,monospace;"&gt;READY&lt;/span&gt; appears. The decoding result will appear before the next &lt;span class="Apple-style-span" style="font-family: 'Courier New',Courier,monospace;"&gt;READY&lt;/span&gt;.&lt;br /&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New',Courier,monospace;"&gt;000000000: hello world&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;4. Download the &lt;a href="https://sourceforge.net/projects/cmusphinx/files/Acoustic%20and%20Language%20Models/US%20English%20Telephone%20Model%20Communicator/communicator_semi_6000_20080321.tar.gz/download"&gt;Communicator&lt;/a&gt; acoustic model for telephone speech and unpack it into some location, for example into &lt;span class="Apple-style-span" style="font-family: 'Courier New',Courier,monospace;"&gt;$prefix/var/lib/asterisk/communicator&lt;/span&gt;. The folder must contain files like &lt;span class="Apple-style-span" style="font-family: 'Courier New',Courier,monospace;"&gt;mdef, variances&lt;/span&gt;, etc.&lt;br /&gt;&lt;br /&gt;5. 
Edit &lt;span class="Apple-style-span" style="font-family: 'Courier New',Courier,monospace;"&gt;$prefix/etc/asterisk/voicemail.conf&lt;/span&gt; and configure the external callback script:&lt;br /&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New',Courier,monospace;"&gt;externnotify=$prefix/sbin/voicemail-notify.sh&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;6. Now let's create the script &lt;span class="Apple-style-span" style="font-family: 'Courier New',Courier,monospace;"&gt;voicemail-notify.sh&lt;/span&gt; in the folder&amp;nbsp;&lt;span class="Apple-style-span" style="font-family: 'Courier New',Courier,monospace;"&gt;$prefix/sbin&lt;/span&gt;&amp;nbsp;where all other Asterisk binaries reside. Copy-paste it from below, change its permissions to 755, and don't forget to update the prefix to point to the Asterisk installation folder.&lt;br /&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New',Courier,monospace;"&gt;#!/bin/bash&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New',Courier,monospace;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New',Courier,monospace;"&gt;prefix=&amp;lt;PUT YOUR ASTERISK PREFIX HERE&amp;gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New',Courier,monospace;"&gt;# Asterisk passes the voicemail context as $1 and the mailbox as $2&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New',Courier,monospace;"&gt;voicemaildir="$prefix/var/spool/asterisk/voicemail/$1/$2/INBOX"&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New',Courier,monospace;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New',Courier,monospace;"&gt;for audiofile in "$voicemaildir"/*.wav; do&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New',Courier,monospace;"&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp;# Skip the unexpanded pattern when the folder contains no wav files&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New',Courier,monospace;"&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp;[ -e "$audiofile" ] || continue&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New',Courier,monospace;"&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp;transcriptfile="${audiofile%.wav}.transcript"&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New',Courier,monospace;"&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp;# For each message.wav we check if message.transcript exists&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New',Courier,monospace;"&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp;if [ ! -f "$transcriptfile" ]; then&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New',Courier,monospace;"&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;# If not, we create it&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New',Courier,monospace;"&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;pocketsphinx_continuous -infile "$audiofile" \&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New',Courier,monospace;"&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;-hmm "$prefix/var/lib/asterisk/communicator" \&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New',Courier,monospace;"&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;-samprate 8000 2&amp;gt; /dev/null &amp;gt; "$transcriptfile"&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New',Courier,monospace;"&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;# Now we can do whatever we want with the new transcription&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New',Courier,monospace;"&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;# Send it by mail for example&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New',Courier,monospace;"&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;# mail $user &amp;lt; $transcriptfile&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New',Courier,monospace;"&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp;fi&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New',Courier,monospace;"&gt;done&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;7. Start Asterisk or reload the configuration with "voicemail reload".&lt;br /&gt;&lt;br /&gt;8. Dial extension 1234&lt;br /&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New',Courier,monospace;"&gt;*CLI&amp;gt; console dial 1234&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;and leave a voicemail.&lt;br /&gt;&lt;br /&gt;9. Check that your voicemail was transcribed automatically and that the transcription sits next to the wav file in the voicemail folder:&lt;br /&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New',Courier,monospace;"&gt;ls&amp;nbsp;$prefix/var/spool/asterisk/voicemail/default/1234/INBOX/*.transcript&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;You can also send the transcript by mail or do with it whatever you want.&amp;nbsp;Easy, isn't it? 
Well, I didn't mention that you need a better language model and all the tricks to improve the transcription accuracy for your voicemails; that's a separate story.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Update&lt;/b&gt;&lt;br /&gt;The second part is here: &lt;a href="http://nsh.nexiwave.com/2011/04/voicemail-transcription-with.html"&gt;http://nsh.nexiwave.com/2011/04/voicemail-transcription-with.html&lt;/a&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5075537609830514591-7466096869089115604?l=nsh.nexiwave.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nsh.nexiwave.com/feeds/7466096869089115604/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://nsh.nexiwave.com/2010/09/voicemail-transcription-with.html#comment-form' title='18 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/7466096869089115604'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/7466096869089115604'/><link rel='alternate' type='text/html' href='http://nsh.nexiwave.com/2010/09/voicemail-transcription-with.html' title='Voicemail transcription with Pocketsphinx and Asterisk'/><author><name>Nickolay V. 
Shmyrev</name><uri>http://www.blogger.com/profile/11220369315272283124</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://4.bp.blogspot.com/-mLPNJlEdMV8/TbZcF3qUEqI/AAAAAAAAAKA/6PXGkO0cmAk/s220/me.jpg'/></author><thr:total>18</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5075537609830514591.post-6239869038170773136</id><published>2010-09-01T02:10:00.000+04:00</published><updated>2010-09-01T02:10:34.227+04:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='sphinx4'/><title type='text'>Next Routine Release</title><content type='html'>So today we released Sphinx4-Beta5. I'm not very pleased with the number of features that went in, I wanted to do more, but I'm very pleased by the increasing number of people who contributed to this release. From release to release the thanks list is definitely getting longer.&lt;br /&gt;&lt;br /&gt;Over these six months I finally went into the deep blue area of the LexTree linguist. So the major feature of this release is a significantly reworked LVCSR search which is supposed to get faster, at least sometimes. Careful testing would be welcome here.&lt;br /&gt;&lt;br /&gt;Another important but not so visible thing is the first bit of the application-oriented public API. There will be no public XML configs anymore! The Aligner demo that comes with sphinx4 is the first step towards that. The configs for typical applications will be stored inside the jar and are meant to be used by developers who want to tweak the engine. No demo will use XML anymore.&amp;nbsp;This rework of the public API together with optimization of the data flow inside the engine is the top priority for us for the next six months. 
We shall discuss in February how it goes.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5075537609830514591-6239869038170773136?l=nsh.nexiwave.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nsh.nexiwave.com/feeds/6239869038170773136/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://nsh.nexiwave.com/2010/09/next-routine-release.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/6239869038170773136'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/6239869038170773136'/><link rel='alternate' type='text/html' href='http://nsh.nexiwave.com/2010/09/next-routine-release.html' title='Next Routine Release'/><author><name>Nickolay V. Shmyrev</name><uri>http://www.blogger.com/profile/11220369315272283124</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://4.bp.blogspot.com/-mLPNJlEdMV8/TbZcF3qUEqI/AAAAAAAAAKA/6PXGkO0cmAk/s220/me.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5075537609830514591.post-6562829666111539388</id><published>2010-08-23T02:52:00.000+04:00</published><updated>2010-08-23T02:52:16.109+04:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='speech recognition'/><title type='text'>Patents in ASR</title><content type='html'>I was pleased to read &lt;a href="http://patentlyobvious.m-cam.com/blog/wp-trackback.php?p=160"&gt;this review of the Nuance vs Vlingo litigation&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;blockquote&gt;Based on this analysis, we can find no evidence supporting the “seven figure” price which Vlingo paid to acquire the ‘354&amp;nbsp;patent from Intellectual Ventures.
It is unclear how, armed with a broad spectrum of information, Vlingo would have&amp;nbsp;justified their purchase. The implications of the Vlingo v. Nuance litigation hold tremendous implications for corporate&amp;nbsp;IP strategies.&lt;/blockquote&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5075537609830514591-6562829666111539388?l=nsh.nexiwave.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nsh.nexiwave.com/feeds/6562829666111539388/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://nsh.nexiwave.com/2010/08/patents-in-asr.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/6562829666111539388'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/6562829666111539388'/><link rel='alternate' type='text/html' href='http://nsh.nexiwave.com/2010/08/patents-in-asr.html' title='Patents in ASR'/><author><name>Nickolay V. Shmyrev</name><uri>http://www.blogger.com/profile/11220369315272283124</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://4.bp.blogspot.com/-mLPNJlEdMV8/TbZcF3qUEqI/AAAAAAAAAKA/6PXGkO0cmAk/s220/me.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5075537609830514591.post-7034047436526982023</id><published>2010-07-30T16:17:00.003+04:00</published><updated>2010-07-30T16:20:38.264+04:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='sphinx4'/><title type='text'>Senone Tree Implementation For Sphinx4</title><content type='html'>I spent last month working on senone tree linguist for sphinx4 as a part of Nexiwave's sphinx4 performance project. 
Well, mostly I was fixing bugs in my initial implementation. The core idea of the senone tree, which was suggested to me by Bhiksha, is the following. The lextree is a representation of all possible words in a dictionary, built from triphones. The lextree is used to explore the search space during decoding. The very good thing is that, since the number of HMMs is rather small compared to the number of triphones (40000 vs 100000), the lextree is a rather compact representation of the search space.&lt;br /&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://1.bp.blogspot.com/_p33_0koWXHA/TFLA0t7-ruI/AAAAAAAAAHA/k_QDye0L3Fk/s1600/d1.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" src="http://1.bp.blogspot.com/_p33_0koWXHA/TFLA0t7-ruI/AAAAAAAAAHA/k_QDye0L3Fk/s320/d1.png" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;&lt;br /&gt;&lt;a name='more'&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;The senone tree takes advantage of the internal structure of each triphone. Indeed, a triphone is built from 3 senones, and the number of senones is even smaller: just 10000 for a big model. Accordingly, we can transform our graph into a far more compact structure. There are also some technical advantages of our senone tree implementation: an efficient hash map that doesn't waste memory, and a simple search space structure, since we don't use helper states like LexTreeEndUnitState for word end nodes and LexTreeNonEmittingHMMState for non-emitting states.
Simple code is also an advantage.&lt;br /&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://1.bp.blogspot.com/_p33_0koWXHA/TFLCExWF-PI/AAAAAAAAAHI/R_zRtc8Ah2M/s1600/d2.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" src="http://1.bp.blogspot.com/_p33_0koWXHA/TFLCExWF-PI/AAAAAAAAAHI/R_zRtc8Ah2M/s320/d2.png" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;&lt;br /&gt;Experiments also show that the senone tree gives nice results. It's a little more accurate (by a few insignificant percent) and significantly faster (2x faster in growth, 20% faster overall, 20k active tokens vs 40k before). That's a nice improvement for the Nexiwave engine.&lt;br /&gt;&lt;br /&gt;This work makes me think that we need to give a different answer to our core question: "How to improve the accuracy?" Of course the traditional answers, like optimizing parameters and training a better model, are still valid, but they are secondary. The primary answer should be: implement proper features in the engine and you'll get the accuracy improvement you need.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5075537609830514591-7034047436526982023?l=nsh.nexiwave.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nsh.nexiwave.com/feeds/7034047436526982023/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://nsh.nexiwave.com/2010/07/senone-tree-implementation-for-sphinx4.html#comment-form' title='4 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/7034047436526982023'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/7034047436526982023'/><link rel='alternate' type='text/html'
href='http://nsh.nexiwave.com/2010/07/senone-tree-implementation-for-sphinx4.html' title='Senone Tree Implementation For Sphinx4'/><author><name>Nickolay V. Shmyrev</name><uri>http://www.blogger.com/profile/11220369315272283124</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://4.bp.blogspot.com/-mLPNJlEdMV8/TbZcF3qUEqI/AAAAAAAAAKA/6PXGkO0cmAk/s220/me.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://1.bp.blogspot.com/_p33_0koWXHA/TFLA0t7-ruI/AAAAAAAAAHA/k_QDye0L3Fk/s72-c/d1.png' height='72' width='72'/><thr:total>4</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5075537609830514591.post-3178020644328262858</id><published>2010-07-17T19:00:00.002+04:00</published><updated>2010-07-17T19:50:31.571+04:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='scarf'/><category scheme='http://www.blogger.com/atom/ns#' term='engines'/><category scheme='http://www.blogger.com/atom/ns#' term='speech recognition'/><category scheme='http://www.blogger.com/atom/ns#' term='CRF'/><category scheme='http://www.blogger.com/atom/ns#' term='MRF'/><category scheme='http://www.blogger.com/atom/ns#' term='conditional random fields'/><title type='text'>Speech Decoding Engines, Part 2. SCARF, The Next Big Thing In Machine Learning</title><content type='html'>It seems that HMMs will not stay forever. If you aren't tied to speech and you track big things in machine learning, you have probably heard about that new thing - Conditional Random Fields. According to the recently started but very promising &lt;a href="http://metaoptimize.com/qa/questions/867/most-influential-ideas-1995-2005"&gt;Metaoptimize&lt;/a&gt;, it's one of the most influential ideas in machine learning.&lt;br /&gt;&lt;br /&gt;And, surprisingly, you can already apply this thing to speech recognition, thanks to Microsoft Research, including Geoffrey Zweig and Patrick Nguyen.
It's SCARF, a Segmental Conditional Random Field Speech Recognition Toolkit, currently at version 0.5. You can download its sources from the &lt;a href="http://research.microsoft.com/en-us/downloads/4dbcaad1-40a1-43a6-ab84-8a063fcd97fd/"&gt;Microsoft Research Website&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;/div&gt;&lt;a name='more'&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;The idea behind SCARF is very elegant, I would say. In an HMM we use the joint probability distribution of observation features and state labels to estimate the probability of a label sequence. The showstopper here is the assumed independence of state distributions.&lt;br /&gt;&lt;br /&gt;In a CRF we consider a different thing - the conditional probability of the label sequence given the observation sequence. Conditional models are used to label a novel observation sequence x by selecting the label sequence that maximizes the conditional probability p(y|x). The conditional nature of such models means that no effort is wasted on modeling the observations, and one is free from having to make unwarranted independence assumptions about these sequences; arbitrary attributes of the observation data may be captured by the model, without the modeler having to worry about how these attributes are related.&lt;br /&gt;&lt;br /&gt;Applied to speech, the labels are states in a language model FST or just words, and the features can be an arbitrary set, including pitch, spectral features or phonetic recognizer posteriors. But in practice SCARF doesn't operate on acoustic features; instead it's used as a postprocessing step over posteriors predicted by a conventional recognizer or some other detector. &lt;a href="http://www.clsp.jhu.edu/workshops/ws10/documents/Zweig-June-21.pdf"&gt;This presentation&lt;/a&gt; has more information on that.
The use of high-level events makes it similar to other postprocessing decoders, like the consensus decoding of lattices that recently landed in CMUSphinx SVN.&lt;br /&gt;&lt;br /&gt;&lt;a href="http://4.bp.blogspot.com/_p33_0koWXHA/TEGRPZw3RAI/AAAAAAAAAG4/wWhfwu8kcRI/s1600/scarf.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" src="http://4.bp.blogspot.com/_p33_0koWXHA/TEGRPZw3RAI/AAAAAAAAAG4/wWhfwu8kcRI/s320/scarf.png" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;The whole point is that it's possible to provide efficient training and decoding even for such a complex model. Taking into account all the nice properties of CRFs, it all sounds very promising.&lt;br /&gt;&lt;br /&gt;The whole SCARF code is very small; the codebase for the trainer and decoder is just 6 KLOC. The included manual is very good, pretty simple, and describes everything needed in detail. The one little issue is obtaining the data to train and test the model. The data formats are rather clear but still require some effort to produce. At least I didn't manage to prepare the input, so I had no luck testing it in action.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;So if you are interested, you definitely need to try to create a dataset for SCARF and train something.&lt;br /&gt;&lt;br /&gt;Related posts:&lt;br /&gt;&lt;br /&gt;&lt;a href="http://nsh.nexiwave.com/2010/05/speech-decoding-engines-part-1-juicer.html"&gt;Speech Decoding Engines Part 1.
Juicer, the WFST recognizer&lt;/a&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5075537609830514591-3178020644328262858?l=nsh.nexiwave.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nsh.nexiwave.com/feeds/3178020644328262858/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://nsh.nexiwave.com/2010/07/speech-decoding-engines-part-2-scarf.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/3178020644328262858'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/3178020644328262858'/><link rel='alternate' type='text/html' href='http://nsh.nexiwave.com/2010/07/speech-decoding-engines-part-2-scarf.html' title='Speech Decoding Engines, Part 2. SCARF, The Next Big Thing In Machine Learning'/><author><name>Nickolay V. 
Shmyrev</name><uri>http://www.blogger.com/profile/11220369315272283124</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://4.bp.blogspot.com/-mLPNJlEdMV8/TbZcF3qUEqI/AAAAAAAAAKA/6PXGkO0cmAk/s220/me.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://4.bp.blogspot.com/_p33_0koWXHA/TEGRPZw3RAI/AAAAAAAAAG4/wWhfwu8kcRI/s72-c/scarf.png' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5075537609830514591.post-6071648932747431007</id><published>2010-07-03T01:12:00.005+04:00</published><updated>2010-07-03T01:38:29.804+04:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='testing'/><category scheme='http://www.blogger.com/atom/ns#' term='cmusphinx'/><title type='text'>Testing CMUSphinx with Hudson</title><content type='html'>&lt;a href="http://3.bp.blogspot.com/_p33_0koWXHA/TC5TeFTKuwI/AAAAAAAAAGo/x7uIdPOOHdA/s1600/hudson1.png" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"&gt;&lt;img border="0" src="http://3.bp.blogspot.com/_p33_0koWXHA/TC5TeFTKuwI/AAAAAAAAAGo/x7uIdPOOHdA/s320/hudson1.png" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;Like every high-quality product, CMUSphinx spends a lot of effort on testing. That isn't really a trivial task, because you need to make sure that all the important parameters are improved or at least don't regress. That includes decoding accuracy, speed and API specs. Sometimes a change improves one thing and makes another worse. Things are going to change with the deployment of the continuous integration system &lt;a href="http://hudson-ci.org/"&gt;Hudson&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;A quite sophisticated system of tests was created to track changes. It included Perl scripts, various shell bits, a MySQL database and even commits to a CVS repository. It was also spamming the mailing list all day with long and unreadable emails.
Another bad thing was that it was based on private commercial data like the WSJ or TIDIGITS databases, but now everything is changing with the &lt;a href="http://nsh.nexiwave.com/2010/04/testing-asr-with-voxforge-database.html"&gt;Voxforge test set&lt;/a&gt;. Our goal is to let you test and optimize the system yourself.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;a name='more'&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;Luckily, it went down recently. I was wondering why I hadn't received any mails for almost a month, and I'm pretty sure the whole thing is just not working anymore. There is a good side to that, since we now have more reasons to move to something developed specifically to support a testing workflow. With Peter's hint I downloaded and installed Hudson.&lt;br /&gt;&lt;br /&gt;It's a nice thing in open source that you can get a really professional product, designed specifically to solve your problems by experts who understand what your problems are. I very much disagree with people who think they can write something themselves from scratch. I never understood this point of view.&lt;br /&gt;&lt;br /&gt;Anyway, Hudson is our testing guard right now, so you shouldn't worry about regressions anymore. It's running all the CMUSphinx tests and some accuracy tests as well. It also builds nice graphs (see on the right) and can show you statistics about recent builds.&lt;br /&gt;&lt;br /&gt;&lt;a href="http://4.bp.blogspot.com/_p33_0koWXHA/TC5Tkm1ZKAI/AAAAAAAAAGw/Mq0kR6qV1Lk/s1600/hudson2.png" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"&gt;&lt;/a&gt;&lt;a href="http://4.bp.blogspot.com/_p33_0koWXHA/TC5Tkm1ZKAI/AAAAAAAAAGw/Mq0kR6qV1Lk/s1600/hudson2.png" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"&gt;&lt;img border="0" src="http://4.bp.blogspot.com/_p33_0koWXHA/TC5Tkm1ZKAI/AAAAAAAAAGw/Mq0kR6qV1Lk/s320/hudson2.png" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&amp;nbsp;This piece of software is very good.
What I like about it:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Slick interface&lt;/li&gt;&lt;li&gt;Many plugins (I hate plugins to be honest, but Hudson's plugins are great)&lt;/li&gt;&lt;li&gt;Customizable build process without any restrictions&amp;nbsp; - you can run shell scripts, you can run your own apps, basically you can do everything&lt;/li&gt;&lt;li&gt;Intelligent notification. Thanks Hudson, you only send me email when something is broken, no need to read mails every day!&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;It's definitely a step forward in testing.&lt;br /&gt;This server is private right now but we hope to make it public soon. Oh, and you can probably see there was a regression on July 1st that I committed to sphinx4. Accuracy dropped significantly and decoding became a little bit faster. If you are tracking cmusphinx-devel, you probably know what it is about. Let's hope we'll fix it soon, and then it will be a really great thing to have.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5075537609830514591-6071648932747431007?l=nsh.nexiwave.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nsh.nexiwave.com/feeds/6071648932747431007/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://nsh.nexiwave.com/2010/07/testing-cmusphinx-with-apache-hudson.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/6071648932747431007'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/6071648932747431007'/><link rel='alternate' type='text/html' href='http://nsh.nexiwave.com/2010/07/testing-cmusphinx-with-apache-hudson.html' title='Testing CMUSphinx with Hudson'/><author><name>Nickolay V.
Shmyrev</name><uri>http://www.blogger.com/profile/11220369315272283124</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://4.bp.blogspot.com/-mLPNJlEdMV8/TbZcF3qUEqI/AAAAAAAAAKA/6PXGkO0cmAk/s220/me.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://3.bp.blogspot.com/_p33_0koWXHA/TC5TeFTKuwI/AAAAAAAAAGo/x7uIdPOOHdA/s72-c/hudson1.png' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5075537609830514591.post-7022657376015817723</id><published>2010-07-01T02:48:00.000+04:00</published><updated>2010-07-01T02:48:08.722+04:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='htk'/><category scheme='http://www.blogger.com/atom/ns#' term='voting'/><title type='text'>HTK Competition Voting</title><content type='html'>&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://1.bp.blogspot.com/_p33_0koWXHA/TCvGtpRppJI/AAAAAAAAAGg/HBjN84z2750/s1600/voting.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="97" src="http://1.bp.blogspot.com/_p33_0koWXHA/TCvGtpRppJI/AAAAAAAAAGg/HBjN84z2750/s640/voting.png" width="640" /&gt;&lt;/a&gt;&lt;/div&gt;Thanks everyone for your feedback; the results are really interesting to see.&lt;br /&gt;&lt;br /&gt;HTK competition is something that I have been worrying about for a long time. One key issue that I see is that the htk-users mailing list definitely has much deeper discussions about ASR than we have on our forum. Hopefully, the situation will change.&lt;br /&gt;&lt;br /&gt;Anyway, our goal is still to provide very accurate speech recognition, and this is not yet a solved task; there are many issues in both usability and accuracy.
So we can definitely learn from each other and improve our projects.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5075537609830514591-7022657376015817723?l=nsh.nexiwave.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nsh.nexiwave.com/feeds/7022657376015817723/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://nsh.nexiwave.com/2010/07/htk-competition-voting.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/7022657376015817723'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/7022657376015817723'/><link rel='alternate' type='text/html' href='http://nsh.nexiwave.com/2010/07/htk-competition-voting.html' title='HTK Competition Voting'/><author><name>Nickolay V. Shmyrev</name><uri>http://www.blogger.com/profile/11220369315272283124</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://4.bp.blogspot.com/-mLPNJlEdMV8/TbZcF3qUEqI/AAAAAAAAAKA/6PXGkO0cmAk/s220/me.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://1.bp.blogspot.com/_p33_0koWXHA/TCvGtpRppJI/AAAAAAAAAGg/HBjN84z2750/s72-c/voting.png' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5075537609830514591.post-7805690828765204132</id><published>2010-06-13T16:34:00.000+04:00</published><updated>2010-06-13T16:34:25.613+04:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='sphinx4'/><category scheme='http://www.blogger.com/atom/ns#' term='speech recognition'/><title type='text'>Sphinx4 Powers Contemporary Art</title><content type='html'>&lt;a 
href="http://1.bp.blogspot.com/_p33_0koWXHA/TBTK7ewpBkI/AAAAAAAAAGE/hT9IB-Dy4hI/s1600/HeatherDewey-Hagborg005.jpg" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"&gt;&lt;img border="0" height="320" src="http://1.bp.blogspot.com/_p33_0koWXHA/TBTK7ewpBkI/AAAAAAAAAGE/hT9IB-Dy4hI/s320/HeatherDewey-Hagborg005.jpg" width="212" /&gt;&lt;/a&gt;Did you think that sphinx4 could only be used to build another keyboard, to help you track a sales manager blaming the product, or to transcribe medical dictation? Working with computers on a daily basis, one starts to consider them just a tool.&amp;nbsp; I was thinking this way too, not taking into account the fact that the speech act itself, powered by computers, probably has a sacral meaning. Communication was the thing that created our mind, and keyboards aren't important when we create communication systems.&lt;br /&gt;&lt;br /&gt;The thing that pushed me to this is &lt;a href="http://deweyhagborg.wordpress.com/"&gt;Heather Dewey-Hagborg's blog&lt;/a&gt;. In particular it was the &lt;a href="http://www.deweyhagborg.com/listeningPost/"&gt;Listening Post&lt;/a&gt;, an art piece from the CEPA gallery. If you are interested, please also check &lt;a href="http://www.breakthruradio.com/index.php?show=10455"&gt;Heather's interview&lt;/a&gt; on BTR Radio. Check also the &lt;a href="http://www.cepagallery.org/exhibitions/Conversation_Pieces/gallery/HeatherDewey-Hagborg/content/index.html"&gt;gallery's site&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;An important point here is that I think we should not consider this some kind of futurism - talking computers, HAL and all that stuff. Instead, such things help us to change ourselves and to change our vision of the world around us.
Probably next time you'll look on sphinx4 sources from a different point of view.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5075537609830514591-7805690828765204132?l=nsh.nexiwave.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nsh.nexiwave.com/feeds/7805690828765204132/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://nsh.nexiwave.com/2010/06/sphinx4-powers-contemporary-art.html#comment-form' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/7805690828765204132'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/7805690828765204132'/><link rel='alternate' type='text/html' href='http://nsh.nexiwave.com/2010/06/sphinx4-powers-contemporary-art.html' title='Sphinx4 Powers Contemporary Art'/><author><name>Nickolay V. 
Shmyrev</name><uri>http://www.blogger.com/profile/11220369315272283124</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://4.bp.blogspot.com/-mLPNJlEdMV8/TbZcF3qUEqI/AAAAAAAAAKA/6PXGkO0cmAk/s220/me.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://1.bp.blogspot.com/_p33_0koWXHA/TBTK7ewpBkI/AAAAAAAAAGE/hT9IB-Dy4hI/s72-c/HeatherDewey-Hagborg005.jpg' height='72' width='72'/><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5075537609830514591.post-8940355356970439246</id><published>2010-06-08T22:09:00.002+04:00</published><updated>2010-06-08T22:11:31.180+04:00</updated><title type='text'>Great Overview Article</title><content type='html'>&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://4.bp.blogspot.com/_p33_0koWXHA/TA6GLSTKZ3I/AAAAAAAAAF8/2SJvV1E9fgs/s1600/Lemmings.jpg" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"&gt;&lt;img border="0" src="http://4.bp.blogspot.com/_p33_0koWXHA/TA6GLSTKZ3I/AAAAAAAAAF8/2SJvV1E9fgs/s320/Lemmings.jpg" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;Today Dr. Tony Robinson gave me a present by mentioning this great article on &lt;a href="http://groups.google.com/group/comp.speech.research/"&gt;comp.speech.research&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;Janet M. 
Baker, Li Deng,&lt;br /&gt;James Glass, Sanjeev Khudanpur,&lt;br /&gt;Chin-Hui Lee, Nelson Morgan, and&lt;br /&gt;Douglas O’Shaughnessy&lt;br /&gt;&lt;br /&gt;&lt;a href="http://dspace.mit.edu/handle/1721.1/51891"&gt;Research Developments and Directions in Speech Recognition and Understanding, Part 1&lt;/a&gt;&lt;br /&gt;&lt;a href="http://dspace.mit.edu/handle/1721.1/51879"&gt;Research Developments and Directions in Speech Recognition and Understanding, Part 2&lt;/a&gt; &lt;br /&gt;&lt;br /&gt;This article is the “MINDS 2006–2007 Report of the Speech Understanding Working Group,” one of five reports emanating from two workshops titled “Meeting of the MINDS: Future Directions for Human Language Technology,” sponsored by the U.S. Disruptive Technology Office (DTO). For me it was striking that spontaneous events are so important; I had never thought about them from this point of view.&lt;br /&gt;&lt;br /&gt;The whole state of things is also nicely described in Mark Gales's talk &lt;a href="http://mi.eng.cam.ac.uk/%7Emjfg/ASRU_talk09.pdf"&gt;Acoustic Modelling for Speech Recognition: Hidden Markov Models and Beyond?&lt;/a&gt; The picture on the left is taken from it.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5075537609830514591-8940355356970439246?l=nsh.nexiwave.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nsh.nexiwave.com/feeds/8940355356970439246/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://nsh.nexiwave.com/2010/06/great-overview-article.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/8940355356970439246'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/8940355356970439246'/><link rel='alternate' type='text/html'
href='http://nsh.nexiwave.com/2010/06/great-overview-article.html' title='Great Overview Article'/><author><name>Nickolay V. Shmyrev</name><uri>http://www.blogger.com/profile/11220369315272283124</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://4.bp.blogspot.com/-mLPNJlEdMV8/TbZcF3qUEqI/AAAAAAAAAKA/6PXGkO0cmAk/s220/me.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://4.bp.blogspot.com/_p33_0koWXHA/TA6GLSTKZ3I/AAAAAAAAAF8/2SJvV1E9fgs/s72-c/Lemmings.jpg' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5075537609830514591.post-4682068409117699751</id><published>2010-06-04T22:58:00.000+04:00</published><updated>2010-06-04T22:58:26.462+04:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='TTS'/><category scheme='http://www.blogger.com/atom/ns#' term='blizzard'/><title type='text'>Blizzard Challenge 2010</title><content type='html'>Since I was in TTS for a long time and am still interested in it, I've been waiting a long time for this - the Blizzard Challenge team is ready to accept speech experts and volunteer listeners for the Blizzard Challenge 2010.&lt;br /&gt;&lt;br /&gt;The challenge was devised in order to better understand and compare research techniques in building corpus-based speech synthesizers on the same data. The basic challenge is to take the released speech database, build a synthetic voice from the data and synthesize a prescribed set of test sentences. The sentences from each synthesizer will then be evaluated through listening tests. &lt;br /&gt;&lt;br /&gt;After the evaluation, participants submit papers where they describe the methods used and problems solved.
You can find more information on the webpage &lt;a href="http://festvox.org/blizzard/"&gt;http://festvox.org/blizzard/&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;a name='more'&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;They want as many listeners as possible over the next few weeks and we can help! So, please distribute the "Speech Experts" link within your group, and distribute the "Volunteers" link as widely as possible - mailing lists, blogs, noticeboards, your students, friends, family... anywhere you can think of. Remember - you don't have to be a native speaker to take part.&lt;br /&gt;&lt;br /&gt;For English:&lt;br /&gt;&amp;nbsp; Speech Experts: &lt;a href="http://groups.inf.ed.ac.uk/blizzard/blizzard2010/english/register-ES.html"&gt;http://groups.inf.ed.ac.uk/blizzard/blizzard2010/english/register-ES.html&lt;/a&gt;&lt;br /&gt;&amp;nbsp; Volunteers: &lt;a href="http://groups.inf.ed.ac.uk/blizzard/blizzard2010/english/register-ER.html"&gt;http://groups.inf.ed.ac.uk/blizzard/blizzard2010/english/register-ER.html&lt;/a&gt;&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;br /&gt;For Chinese:&lt;br /&gt;&amp;nbsp; Speech Experts: &lt;a href="http://groups.inf.ed.ac.uk/blizzard/blizzard2010/mandarin/register-MS.html"&gt;http://groups.inf.ed.ac.uk/blizzard/blizzard2010/mandarin/register-MS.html&lt;/a&gt;&lt;br /&gt;&amp;nbsp; Volunteers: &lt;a href="http://groups.inf.ed.ac.uk/blizzard/blizzard2010/mandarin/register-MR.html"&gt;http://groups.inf.ed.ac.uk/blizzard/blizzard2010/mandarin/register-MR.html&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;Although it's the fourth challenge in which I'm participating as a listener, I still find it interesting to go through. The reason is probably that, instead of synthesized utterances, I see the people and teams behind them. Though the organizers stress that the challenge is not a competition, I think that's the most interesting thing about it.
I even think that the tradition of hiding the results behind letters, which requires you to read all the papers and decipher who got which result, makes the challenge worth participating in. That is a kind of ritual for me. I'm looking forward to reading about the results.&lt;br /&gt;&lt;br /&gt;As for non-experts, the challenge could be interesting just because it's a unique way to compare different technologies and get an idea of what is possible. TTS is used in many different ways, but too often it is considered fine as long as you can understand the speech. That should change. Demand for high-quality text-to-speech conversion is huge and will grow as intelligent human-computer interaction becomes more common. It's important to have landmarks for the quality that is possible. Against that background, things like the enormous noisy section that is very hard to get through or the tedious QuickTime installation process look minor.&lt;br /&gt;&lt;br /&gt;And all this reminded me again that I need to push Ben to deploy a better TTS voice on searchmymeetings. 
The current one is not acceptable in my opinion.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5075537609830514591-4682068409117699751?l=nsh.nexiwave.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nsh.nexiwave.com/feeds/4682068409117699751/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://nsh.nexiwave.com/2010/06/blizzard-challenge-2010.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/4682068409117699751'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/4682068409117699751'/><link rel='alternate' type='text/html' href='http://nsh.nexiwave.com/2010/06/blizzard-challenge-2010.html' title='Blizzard Challenge 2010'/><author><name>Nickolay V. Shmyrev</name><uri>http://www.blogger.com/profile/11220369315272283124</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://4.bp.blogspot.com/-mLPNJlEdMV8/TbZcF3qUEqI/AAAAAAAAAKA/6PXGkO0cmAk/s220/me.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5075537609830514591.post-518127705265368720</id><published>2010-06-01T00:16:00.001+04:00</published><updated>2010-06-01T00:16:48.282+04:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='speech recognition'/><title type='text'>KISS Principle</title><content type='html'>Still think that you can take the sphinx4 engine and make a state-of-the-art recognizer? 
Check what the AMI RT'09 entry does for meeting transcription, as presented at the RT'09 workshop in "&lt;a href="http://www.itl.nist.gov/iad/mig/tests/rt/2009/workshop/AMI-STT.pdf"&gt;The AMI RT’09 STT and SASTT Systems&lt;/a&gt;":&lt;br /&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://4.bp.blogspot.com/_p33_0koWXHA/TAQYmc1QQoI/AAAAAAAAAF0/J1vmBEy14d0/s1600/ami.jpg" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"&gt;&lt;img border="0" height="320" src="http://4.bp.blogspot.com/_p33_0koWXHA/TAQYmc1QQoI/AAAAAAAAAF0/J1vmBEy14d0/s320/ami.jpg" width="145" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;ol&gt;&lt;li&gt;Segmentation&lt;/li&gt;&lt;li&gt;Initial decoding of full meeting with&lt;br /&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;4g LM based on 50K vocabulary and weak acoustic model (ML) M1&lt;/li&gt;&lt;li&gt;7g LM based on 6K vocabulary and strong acoustic model (MPE) M2&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;Intersect output and adapt (CMLLR)&lt;/li&gt;&lt;li&gt;Decode using M2 models and 4g LM on 50K vocabulary&lt;/li&gt;&lt;li&gt;Compute VTLN/SBN/fMPE&lt;/li&gt;&lt;li&gt;Adapt SBN/fMPE/MPE models M3 using CMLLR&lt;/li&gt;&lt;li&gt;Adapt LCRCBN/fMPE/MPE models M4 using CMLLR and output of previous stage&lt;/li&gt;&lt;li&gt;Generate 4g lattices with adapted M4 models&lt;/li&gt;&lt;li&gt;Rescore using M1 models and CMLLR + MLLR adaptation&lt;/li&gt;&lt;li&gt;Compute confusion networks&lt;/li&gt;&lt;/ol&gt;Click on the image to see the details of the process.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5075537609830514591-518127705265368720?l=nsh.nexiwave.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nsh.nexiwave.com/feeds/518127705265368720/comments/default' title='Post Comments'/><link rel='replies' type='text/html' 
href='http://nsh.nexiwave.com/2010/06/kiss-principle.html#comment-form' title='3 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/518127705265368720'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/518127705265368720'/><link rel='alternate' type='text/html' href='http://nsh.nexiwave.com/2010/06/kiss-principle.html' title='KISS Principle'/><author><name>Nickolay V. Shmyrev</name><uri>http://www.blogger.com/profile/11220369315272283124</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://4.bp.blogspot.com/-mLPNJlEdMV8/TbZcF3qUEqI/AAAAAAAAAKA/6PXGkO0cmAk/s220/me.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://4.bp.blogspot.com/_p33_0koWXHA/TAQYmc1QQoI/AAAAAAAAAF0/J1vmBEy14d0/s72-c/ami.jpg' height='72' width='72'/><thr:total>3</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5075537609830514591.post-6151843065843284999</id><published>2010-05-26T00:10:00.002+04:00</published><updated>2010-05-28T01:36:00.025+04:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='sphinx4'/><category scheme='http://www.blogger.com/atom/ns#' term='gpu'/><category scheme='http://www.blogger.com/atom/ns#' term='nexiwave'/><title type='text'>Campaign For Decoding Performance</title><content type='html'>We spent some time making the speech recognition backend faster. Ben &lt;a href="http://ben.nexiwave.com/2010/05/gpu-and-speech-processing.html"&gt;reports&lt;/a&gt; in his blog on the results of moving scoring to the GPU with CUDA/&lt;a href="http://www.jcuda.de/"&gt;jCUDA&lt;/a&gt;, which reduced scoring time dramatically. That's an improvement we are happy to apply in our production environment.&lt;br /&gt;&lt;br /&gt;We believe the GPU is not just a computation speedup, it's a paradigm shift. 
Historically, search has been optimized to keep the number of scored tokens small, which affected accuracy. Now scoring is almost immediate, which means that other parts have to change. There are a few issues to solve along the way:&lt;br /&gt;&lt;br /&gt;We aim to make it even faster; in particular, we would really like to solve the problem of the grow step.&lt;br /&gt;&lt;br /&gt;&lt;a name='more'&gt;&lt;/a&gt;&lt;br /&gt;In the classical score/prune/grow scheme, unfortunately, scoring is not the only step that takes significant time. In sphinx4 in particular, growing branches is also a bottleneck. When sphinx4 was first optimized for LVCSR, grow time was also a problem. That's why a whole set of workarounds was developed: grow skipping, skew pruning, arc caching and acoustic lookahead. They were successful at the time, but not as successful as they could have been. At least, they don't scale to the GPU.&lt;br /&gt;&lt;br /&gt;Among the papers I've found there are several publications about GPU-based speech recognition; in particular I would like to note the interesting research by &lt;a href="http://chongjike.net/node/3"&gt;Jike Chong and his colleagues&lt;/a&gt;. Thanks to Tao for the link! But the issue is that complex grow algorithms are not considered there either. They write about bigram search, which basically means they explore a very simple state space. In their results they compare themselves with HVite. That's the same situation as with WFST, where the attempt to bring the complexity of a large vocabulary into the first pass fails, and one has to stick with bigrams and hope that all the important things will be done later in subsequent passes. I tend to think that it's a waste of processor resources.&lt;br /&gt;&lt;br /&gt;The next issue is more technical. We haven't found a good GPU cloud service yet. Though such services are certainly very promising because GPUs are more energy-efficient, they aren't common yet. 
Let's hope this situation will improve soon.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5075537609830514591-6151843065843284999?l=nsh.nexiwave.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nsh.nexiwave.com/feeds/6151843065843284999/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://nsh.nexiwave.com/2010/05/campain-for-decoding-performance.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/6151843065843284999'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/6151843065843284999'/><link rel='alternate' type='text/html' href='http://nsh.nexiwave.com/2010/05/campain-for-decoding-performance.html' title='Campaign For Decoding Performance'/><author><name>Nickolay V. Shmyrev</name><uri>http://www.blogger.com/profile/11220369315272283124</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://4.bp.blogspot.com/-mLPNJlEdMV8/TbZcF3qUEqI/AAAAAAAAAKA/6PXGkO0cmAk/s220/me.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5075537609830514591.post-328000088029626965</id><published>2010-05-18T01:00:00.000+04:00</published><updated>2010-05-18T01:00:47.953+04:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='sphinx4'/><title type='text'>Want To Learn How Sphinx4 Works? Help With Wiki!</title><content type='html'>A long time ago, when sphinx4 development was active, the team used a twiki hosted at CMU. Unlike in many open source projects, this wiki was not just a collection of random stuff, but a complete project support system. It contained meeting notes, design decisions and prototyping results. 
There you could find diagrams and explanations of what &lt;a href="http://cmusphinx.sourceforge.net/wiki/sphinx4:growskipping"&gt;grow skipping&lt;/a&gt; or &lt;a href="http://cmusphinx.sourceforge.net/wiki/sphinx4:frameskewpruning"&gt;skew pruning&lt;/a&gt; is.&lt;br /&gt;&lt;br /&gt;This wiki died some time ago due to administrative issues, but that was even for the better, since its content was merged into our current &lt;a href="http://cmusphinx.sourceforge.net/wiki"&gt;main wiki&lt;/a&gt; at sourceforge under the &lt;a href="http://cmusphinx.sourceforge.net/wiki/start?do=index"&gt;sphinx4 namespace&lt;/a&gt;. Unfortunately, the formatting was lost during the transition, since dokuwiki formats aren't always the same as twiki ones. That's actually not so bad either, because the content needs to be updated anyway to match the current state of sphinx4.&lt;br /&gt;&lt;br /&gt;So right now there are about 170 pages in total about sphinx4. Some of them are useful, some aren't. They definitely contain deep knowledge of sphinx4 internals, something that will probably help you the next time you &lt;a href="http://cmusphinx.sourceforge.net/wiki/sphinx4:largevocabularyperformanceoptimization"&gt;optimize the performance of large vocabulary recognition&lt;/a&gt; with sphinx4. I'm slowly sorting them out, but that will take a lot of time. 
It's your chance to join, help and learn!&lt;br /&gt;&amp;nbsp;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5075537609830514591-328000088029626965?l=nsh.nexiwave.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nsh.nexiwave.com/feeds/328000088029626965/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://nsh.nexiwave.com/2010/05/want-to-learn-how-sphinx4-works-help.html#comment-form' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/328000088029626965'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/328000088029626965'/><link rel='alternate' type='text/html' href='http://nsh.nexiwave.com/2010/05/want-to-learn-how-sphinx4-works-help.html' title='Want To Learn How Sphinx4 Works? Help With Wiki!'/><author><name>Nickolay V. 
Shmyrev</name><uri>http://www.blogger.com/profile/11220369315272283124</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://4.bp.blogspot.com/-mLPNJlEdMV8/TbZcF3qUEqI/AAAAAAAAAKA/6PXGkO0cmAk/s220/me.jpg'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5075537609830514591.post-9006102401053821072</id><published>2010-05-13T01:32:00.001+04:00</published><updated>2010-06-06T01:19:04.691+04:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='adaptation'/><category scheme='http://www.blogger.com/atom/ns#' term='speech recognition'/><title type='text'>Not All Speaker Adaptations Are Equally Useful</title><content type='html'>&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://2.bp.blogspot.com/_p33_0koWXHA/S-sccPegSwI/AAAAAAAAAFo/sOCEKUoJHXo/s1600/sig00490x.gif" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"&gt;&lt;img border="0" src="http://2.bp.blogspot.com/_p33_0koWXHA/S-sccPegSwI/AAAAAAAAAFo/sOCEKUoJHXo/s320/sig00490x.gif" /&gt;&lt;/a&gt;&lt;/div&gt;Some time ago I was rather encouraged by VTLN, which stands for vocal tract length normalization. Through so-called frequency warping it tries to unify the vocal tract length of all speakers and thus build a better model. It's done by shifting and adjusting the mel filter frequencies. It is implemented in Sphinxtrain/Pocketsphinx/Sphinx4. Basically all you need is to enable it in sphinx_train.cfg&lt;br /&gt;&lt;code&gt;&lt;br /&gt;$CFG_VTLN = 'yes';&lt;/code&gt;&lt;br /&gt;&lt;br /&gt;and run the training. It will extract features for all frequency warp parameters with some step (mind the disk space) and will find the best one by forced alignment of each utterance. 
Then it will create new fileids and transcription files referring to the files with the proper warp parameter.&lt;br /&gt;&lt;br /&gt;To decode with a VTLN model you need to guess the warp parameter. Several algorithms have been suggested for that. One analyses pitch, others employ a GMM for classification. Then you need to re-extract features with the predicted warp parameter. It gives a visible improvement in performance.&lt;br /&gt;&lt;br /&gt;But recently a set of articles like &lt;br /&gt;&lt;br /&gt;&lt;a href="http://www.ee.ucla.edu/%7Espapl/paper/cui_euro05.pdf"&gt;MLLR-LIKE SPEAKER ADAPTATION BASED ON LINEARIZATION OF VTLN WITH MFCC FEATURES by Xiaodong Cui and Abeer Alwan&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;came to my attention thanks to antonsrv8. The simple idea is that any transform we apply to features, especially a smooth one, can mostly be replaced by a linear transform of the MFCC coefficients, basically by an MLLR transformation. This rather obvious fact makes me wonder whether we really need other transformations if MLLR is generic enough. It's not harder to estimate MLLR than to estimate the warp factor, especially if there is enough data, which is usually the case. Any other transformation applied will just conflict with MLLR. On large data sets this is confirmed by the experimental results in the article above.&lt;br /&gt;&lt;br /&gt;Of course a non-linear transform could be better than a linear one, but it seems that VTLN is certainly not that transform. I hope the latest state of the art in voice conversion can suggest something better.&lt;br /&gt;&lt;br /&gt;Update: this point was of course largely covered in research papers. 
Good coverage with math and results is provided in Luís Uebel's thesis:&lt;br /&gt;&lt;br /&gt;&lt;a href="http://mi.eng.cam.ac.uk/%7Elfu20/lfu20.thesis.ps%20%20"&gt;Speaker Normalisation and Adaptation in Large Vocabulary Speech Recognition by Luís Felipe Uebel&lt;/a&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5075537609830514591-9006102401053821072?l=nsh.nexiwave.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nsh.nexiwave.com/feeds/9006102401053821072/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://nsh.nexiwave.com/2010/05/not-all-speaker-adaptations-are-equally.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/9006102401053821072'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/9006102401053821072'/><link rel='alternate' type='text/html' href='http://nsh.nexiwave.com/2010/05/not-all-speaker-adaptations-are-equally.html' title='Not All Speaker Adaptations Are Equally Useful'/><author><name>Nickolay V. 
Shmyrev</name><uri>http://www.blogger.com/profile/11220369315272283124</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://4.bp.blogspot.com/-mLPNJlEdMV8/TbZcF3qUEqI/AAAAAAAAAKA/6PXGkO0cmAk/s220/me.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://2.bp.blogspot.com/_p33_0koWXHA/S-sccPegSwI/AAAAAAAAAFo/sOCEKUoJHXo/s72-c/sig00490x.gif' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5075537609830514591.post-1932811876036003838</id><published>2010-05-12T02:33:00.000+04:00</published><updated>2010-05-12T02:33:56.523+04:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='voting'/><title type='text'>Recognizer Language Voting Over</title><content type='html'>&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://3.bp.blogspot.com/_p33_0koWXHA/S-na5Y1i_wI/AAAAAAAAAFI/drOaLVCVlKk/s1600/language-voting.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="81" src="http://3.bp.blogspot.com/_p33_0koWXHA/S-na5Y1i_wI/AAAAAAAAAFI/drOaLVCVlKk/s640/language-voting.jpg" width="640" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;So, language voting is over. It seems that despite performance issues we currently face Java gets enough attention. 
Thanks for sharing your opinion, it's very important for us.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5075537609830514591-1932811876036003838?l=nsh.nexiwave.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nsh.nexiwave.com/feeds/1932811876036003838/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://nsh.nexiwave.com/2010/05/recognizer-language-voting-over.html#comment-form' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/1932811876036003838'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/1932811876036003838'/><link rel='alternate' type='text/html' href='http://nsh.nexiwave.com/2010/05/recognizer-language-voting-over.html' title='Recognizer Language Voting Over'/><author><name>Nickolay V. 
Shmyrev</name><uri>http://www.blogger.com/profile/11220369315272283124</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://4.bp.blogspot.com/-mLPNJlEdMV8/TbZcF3qUEqI/AAAAAAAAAKA/6PXGkO0cmAk/s220/me.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://3.bp.blogspot.com/_p33_0koWXHA/S-na5Y1i_wI/AAAAAAAAAFI/drOaLVCVlKk/s72-c/language-voting.jpg' height='72' width='72'/><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5075537609830514591.post-569402352978067332</id><published>2010-05-07T00:27:00.004+04:00</published><updated>2010-05-12T02:42:32.925+04:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='wfst'/><category scheme='http://www.blogger.com/atom/ns#' term='juicer'/><category scheme='http://www.blogger.com/atom/ns#' term='engines'/><category scheme='http://www.blogger.com/atom/ns#' term='speech recognition'/><title type='text'>Speech Decoding Engines Part 1. Juicer, the WFST recognizer</title><content type='html'>ASR today is quite diverse. While in 1998 there was only the HTK package and some in-house toolkits like CMUSphinx, which was released in 2000, now there are a dozen very interesting recognizers released to the public and available under open source licenses. Today we are starting a review series about them.&lt;br /&gt;&lt;br /&gt;So, the first one is &lt;a href="http://juicer.amiproject.org/juicer/"&gt;Juicer&lt;/a&gt;, the WFST recognizer from IDIAP, University of Edinburgh and University of Sheffield. &lt;br /&gt;&lt;br /&gt;Weighted finite state transducers (WFST) are a very popular trend in modern ASR, with very famous adopters like Google, the IBM Watson center and many others. 
Basically the idea is that you convert everything into the same format, which allows not just a unified representation but also more advanced operations, like merging to build a shared search space or reducing to make that search space smaller (with operations like determinization and minimization). The format also gives you interpolation properties; for example, you don't need to care about g2p anymore, it's done automatically by the transducer. For WFST itself, I found a good tutorial by &lt;a href="http://www.cs.nyu.edu/%7Emohri/pub/hbka.pdf"&gt;Mohri, Pereira and Riley, "Speech Recognition with Weighted Finite-State Transducers".&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;Juicer can do very efficient decoding with the standard set of ASR resources - an ARPA language model (bigram due to memory requirements), a dictionary and cross-word triphone models that can be trained with HTK. The BSD license makes Juicer very attractive. Juicer is part of the AMI project, which targets meeting transcription; other AMI deliverables are a subject for separate posts though.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;a name='more'&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;So here is a description of how to try it. Don't expect it to be straightforward though, it's not a trivial process. Well, one day we'll put everything on a live CD to make setting up an ASR development environment easier. Right now you can follow this step-by-step howto, as many of our young friends call such a thing. I wonder where people get the idea that there is a detailed step-by-step howto for everything. 
&lt;br /&gt;&lt;br /&gt;So, let's start. Download Juicer and dependencies:&lt;br /&gt;&lt;code&gt;&lt;br /&gt;&lt;a href="http://www.torch.ch/archives/Torch3src.tgz"&gt;Torch3src.tgz&lt;/a&gt;&lt;br /&gt;&lt;a href="http://juicer.amiproject.org/tracter/sources/tracter-0.6.0.tar.bz2"&gt;tracter-0.6.0.tar.bz2&lt;/a&gt;&lt;br /&gt;&lt;a href="http://juicer.amiproject.org/juicer/sources/juicer-0.12.0.tar.bz2"&gt;juicer-0.12.0.tar.bz2&lt;/a&gt;&lt;br /&gt;&lt;a href="http://sourceforge.net/projects/kissfft/files/"&gt;kiss_fft-v1.2.8.tar.gz&lt;/a&gt;&lt;br /&gt;&lt;a href="http://mohri-lt.cs.nyu.edu/twiki/pub/FST/FstDownload/openfst-1.1.tar.gz"&gt;openfst-1.1.tar.gz&lt;/a&gt;&lt;br /&gt;&lt;/code&gt;&lt;br /&gt;Unpack and build torch&lt;br /&gt;&lt;code&gt;&lt;br /&gt;tar xf Torch3src.tgz&lt;br /&gt;cd Torch3&lt;br /&gt;cp config/Linux.cfg .&lt;br /&gt;&lt;/code&gt;&lt;br /&gt;Edit Linux.cfg to include packages: distributions gradients kernels&lt;br /&gt;speech datasets decoder&lt;br /&gt;&lt;code&gt;&lt;br /&gt;# Packages you want to use&lt;br /&gt;packages = distributions gradients kernels speech datasets decoder&lt;br /&gt;&lt;/code&gt;&lt;br /&gt;Continue with the build&lt;br /&gt;&lt;code&gt;&lt;br /&gt;./xmake all&lt;br /&gt;cd ..&lt;br /&gt;&lt;/code&gt;&lt;br /&gt;Unpack kiss_fft:&lt;br /&gt;&lt;code&gt;&lt;br /&gt;tar xf kiss_fft-v1.2.8.tar.gz&lt;br /&gt;&lt;/code&gt;&lt;br /&gt;There is no need to build kiss, its build is included in the next step.&lt;br /&gt;&lt;br /&gt;Unpack and build tracter&lt;br /&gt;&lt;code&gt;&lt;br /&gt;tar xf tracter-0.6.0.tar.bz2&lt;br /&gt;cd tracter-0.6.0&lt;br /&gt;aclocal &amp;amp;&amp;amp; libtoolize &amp;amp;&amp;amp; automake -a &amp;amp;&amp;amp; autoconf&lt;br /&gt;mkdir m4&lt;br /&gt;./configure \&lt;br /&gt;&amp;nbsp; --with-kiss-fft=/&lt;i&gt;current_folder&lt;/i&gt;/kiss_fft_v1_2_8 \&lt;br /&gt;&amp;nbsp; --with-htk-includes="-I/&lt;i&gt;htk_folder&lt;/i&gt;/HTKLib" \&lt;br /&gt;&amp;nbsp; 
--with-htk-libs="/&lt;i&gt;htk_folder&lt;/i&gt;/HTKLib/HTKLib.a" \&lt;br /&gt;&amp;nbsp; --with-torch3=/&lt;i&gt;current_folder&lt;/i&gt;/Torch3&lt;br /&gt;make &amp;amp;&amp;amp; make install&lt;br /&gt;cd ..&lt;br /&gt;&lt;/code&gt;&lt;br /&gt;Make sure you give full paths to the dependencies, since relative paths&lt;br /&gt;will not work. Also note that for htk you need to provide compiler&lt;br /&gt;options, not folders. Alternatively you can increase your pain trying &lt;br /&gt;to build tracter with cmake as the readme describes.&lt;br /&gt;&lt;br /&gt;Unpack and build juicer&lt;br /&gt;&lt;br /&gt;Make sure PKG_CONFIG_PATH makes tracter.pc reachable.&lt;br /&gt;&lt;code&gt;&lt;br /&gt;export PKG_CONFIG_PATH=$PKG_CONFIG_PATH:/usr/local/lib/pkgconfig&lt;br /&gt;tar xf juicer-0.12.0.tar.bz2&lt;br /&gt;cd juicer-0.12.0&lt;br /&gt;aclocal &amp;amp;&amp;amp; libtoolize &amp;amp;&amp;amp; automake -a &amp;amp;&amp;amp; autoconf&lt;br /&gt;mkdir m4&lt;br /&gt;./configure \&lt;br /&gt;&amp;nbsp; --with-kiss-fft=/&lt;i&gt;current_folder&lt;/i&gt;/kiss_fft_v1_2_8 \&lt;br /&gt;&amp;nbsp; --with-htk-includes="-I/&lt;i&gt;htk_folder&lt;/i&gt;/HTKLib" \&lt;br /&gt;&amp;nbsp; --with-htk-libs="/&lt;i&gt;htk_folder&lt;/i&gt;/HTKLib/HTKLib.a" \&lt;br /&gt;&amp;nbsp; --with-torch3=/&lt;i&gt;current_folder&lt;/i&gt;/Torch3&lt;br /&gt;make &amp;amp;&amp;amp; make install&lt;br /&gt;cd ..&lt;br /&gt;&lt;/code&gt;&lt;br /&gt;Build openfst:&lt;br /&gt;&lt;code&gt;&lt;br /&gt;tar xf openfst-1.1.tar.gz&lt;br /&gt;cd openfst&lt;br /&gt;./configure &amp;amp;&amp;amp; make &amp;amp;&amp;amp; make install&lt;br /&gt;make&lt;br /&gt;cd ../..&lt;br /&gt;&lt;/code&gt;&lt;br /&gt;Setup environment variables:&lt;br /&gt;&lt;code&gt;&lt;br /&gt;export JUTOOLS=/&lt;i&gt;current_folder&lt;/i&gt;/juicer-0.12.0/bin&lt;br /&gt;&lt;/code&gt;&lt;br /&gt;At this point juicer and the required tools are built, so let's try it with the HTK&lt;br /&gt;wsj model from &lt;a href="http://www.keithv.com/"&gt;Keith 
Vertanen&lt;/a&gt;. Download the model &lt;a href="http://www.keithv.com/software/htk/us/htk_wsj_si84_2750_8.zip"&gt;htk_wsj_si84_2750_8.zip&lt;/a&gt;&lt;br /&gt;and unpack it&lt;br /&gt;&lt;code&gt;&lt;br /&gt;unzip htk_wsj_si84_2750_8.zip&lt;br /&gt;&lt;/code&gt;&lt;br /&gt;Convert the model to ascii&lt;br /&gt;&lt;code&gt;&lt;br /&gt;mkdir ascii &lt;br /&gt;touch empty&lt;br /&gt;HHEd -D -T 1 -H hmmdefs -H macros -M ascii empty tiedlist&lt;br /&gt;&lt;/code&gt;&lt;br /&gt;Convert the dmp turtle model from pocketsphinx to the ARPA model turtle.lm&lt;br /&gt;&lt;code&gt;&lt;br /&gt;sphinx_lm_convert -i turtle.DMP -o turtle.lm -ifmt dmp -ofmt arpa&lt;br /&gt;&lt;/code&gt;&lt;br /&gt;Remove alternative pronunciation numbers from turtle.dic and build the phoneset&lt;br /&gt;&lt;code&gt;&lt;br /&gt;sed 's/([0-9])//g' turtle.dic | tr '[:upper:]' '[:lower:]' &amp;gt; turtle.dic.lower&lt;br /&gt;mv turtle.dic.lower turtle.dic&lt;br /&gt;echo "&amp;lt;s&amp;gt;&amp;nbsp;&amp;nbsp; sil" &amp;gt;&amp;gt; turtle.dic&lt;br /&gt;echo "&amp;lt;/s&amp;gt;&amp;nbsp;&amp;nbsp; sil" &amp;gt;&amp;gt; turtle.dic&lt;br /&gt;for w in `cat turtle.dic | cut -d" " -f 2-`; do echo $w; done | sort | uniq &amp;gt; turtle.phone&lt;br /&gt;echo sp &amp;gt;&amp;gt; turtle.phone&lt;br /&gt;&lt;/code&gt;&lt;br /&gt;&lt;br /&gt;Due to some script limitations, different words can't share the same pronunciation. 
So open turtle.dic and remove the line with the entry "two t uw", because it conflicts with "to t uw".&lt;br /&gt;&lt;br /&gt;Now let's convert everything into WFST&lt;br /&gt;&lt;code&gt;&lt;br /&gt;gramgen -gramType ngram -lmFName turtle.lm -lexFName turtle.dic \&lt;br /&gt;-fsmFName gram.fsm -inSymsFName gram.insyms -outSymsFName gram.outsyms \&lt;br /&gt;-sentStartWord "&amp;lt;s&amp;gt;" -sentEndWord "&amp;lt;/s&amp;gt;"&lt;br /&gt;&lt;br /&gt;lexgen -lexFName turtle.dic&amp;nbsp; -monoListFName turtle.phone \&lt;br /&gt;-fsmFName dic.fsm -inSymsFName dic.insyms -outSymsFName dic.outsyms \&lt;br /&gt;-sentStartWord "&amp;lt;s&amp;gt;" -sentEndWord "&amp;lt;/s&amp;gt;" -pauseMonphone sp -addPronunsWithEndPause&lt;br /&gt;&lt;br /&gt;cdgen -htkModelsFName wsj_si84_2750_8/ascii/hmmdefs -tiedListFName \&lt;br /&gt;wsj_si84_2750_8/tiedlist&amp;nbsp; -monoListFName turtle.phone -fsmFName wsj.fsm \&lt;br /&gt;-inSymsFName wsj.insyms -outSymsFName wsj.outsyms&lt;br /&gt;&lt;/code&gt;&lt;br /&gt;To work around a juicer bug, comment out the following lines in juicer-0.12.0/bin/aux2eps.pl:&lt;br /&gt;&lt;code&gt;&lt;br /&gt;#if ( ! %AUXSYMS )&lt;br /&gt;#{&lt;br /&gt;#&amp;nbsp;&amp;nbsp; print "no aux syms in symbol file - nothing to do\n" ;&lt;br /&gt;#&amp;nbsp;&amp;nbsp; exit 0 ;&lt;br /&gt;#}&lt;br /&gt;&lt;/code&gt;&lt;br /&gt;Now let's compose it into a single WFST&lt;br /&gt;&lt;code&gt;&lt;br /&gt;build-wfst-openfst gram.fsm dic.fsm wsj.fsm&lt;br /&gt;&lt;/code&gt;&lt;br /&gt;Everything is ready for decoding. 
Let's try with goforward.raw from pocketsphinx&lt;br /&gt;&lt;code&gt;&lt;br /&gt;sox -r 16000 -2 -s goforward.raw goforward.wav&lt;br /&gt;&lt;/code&gt;&lt;br /&gt;Create an HTK config&lt;br /&gt;&lt;code&gt;&lt;br /&gt;SOURCEFORMAT = WAV&lt;br /&gt;TARGETKIND = MFCC_0_D_A_Z&lt;br /&gt;TARGETRATE = 100000.0&lt;br /&gt;WINDOWSIZE = 250000.0&lt;br /&gt;SAVECOMPRESSED = F&lt;br /&gt;SAVEWITHCRC = F&lt;br /&gt;USEHAMMING = T&lt;br /&gt;PREEMCOEF = 0.97&lt;br /&gt;NUMCHANS = 26&lt;br /&gt;CEPLIFTER = 22&lt;br /&gt;NUMCEPS = 12&lt;br /&gt;ENORMALISE = T&lt;br /&gt;ZMEANSOURCE = T&lt;br /&gt;USEPOWER = T&lt;br /&gt;&lt;/code&gt;&lt;br /&gt;Convert to mfcc&lt;br /&gt;&lt;code&gt;&lt;br /&gt;HCopy -C config goforward.wav goforward.mfc&lt;br /&gt;&lt;/code&gt;&lt;br /&gt;Create the control file:&lt;br /&gt;&lt;code&gt;&lt;br /&gt;echo goforward.mfc &amp;gt; train.scp&lt;br /&gt;&lt;/code&gt;&lt;br /&gt;Decode&lt;br /&gt;&lt;code&gt;&lt;br /&gt;juicer -inputFormat htk -lexFName turtle.dic -inputFName train.scp -fsmFName final.fsm -inSymsFName final.insyms -outSymsFName final.outsyms&amp;nbsp; -htkModelsFName wsj_si84_2750_8/ascii/hmmdefs&lt;br /&gt;&lt;/code&gt;&lt;br /&gt;Get the result:&lt;br /&gt;&lt;code&gt;&lt;br /&gt;&amp;lt;s&amp;gt; go four are &amp;lt;s&amp;gt; ten meters &amp;lt;/s&amp;gt;&lt;br /&gt;&lt;/code&gt;&lt;br /&gt;It's not accurate for some reason. Probably the feature extraction is not the same as the one used for the acoustic model. Probably I should also use a word insertion penalty.&lt;br /&gt;&lt;br /&gt;Of course not everything is perfect. The main issues with the WFST decoder are very well described in the documentation. Basically they are the memory requirements for first-pass decoding (that's why Juicer can't run trigram models on commodity hardware) and the lack of more straightforward dynamic search optimizations. Anyway, the WFST framework has a lot of applications going beyond just recognition. 
It's applied for speech indexing, open vocabulary decoding, simplifies confidence scoring.&lt;br /&gt;&lt;br /&gt;That's it, you can count it works and embed it into your software. Overall, it's an interesting package demonstrating how simple things could be when you put everything into flexible format. I'm sure CMUSphinx will follow this direction and will implement WFST decoding soon. At least we ultimately need to introduce FST tools in our framework.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5075537609830514591-569402352978067332?l=nsh.nexiwave.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nsh.nexiwave.com/feeds/569402352978067332/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://nsh.nexiwave.com/2010/05/speech-decoding-engines-part-1-juicer.html#comment-form' title='18 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/569402352978067332'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/569402352978067332'/><link rel='alternate' type='text/html' href='http://nsh.nexiwave.com/2010/05/speech-decoding-engines-part-1-juicer.html' title='Speech Decoding Engines Part 1. Juicer, the WFST recognizer'/><author><name>Nickolay V. 
Shmyrev</name><uri>http://www.blogger.com/profile/11220369315272283124</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://4.bp.blogspot.com/-mLPNJlEdMV8/TbZcF3qUEqI/AAAAAAAAAKA/6PXGkO0cmAk/s220/me.jpg'/></author><thr:total>18</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5075537609830514591.post-6647662185336816674</id><published>2010-05-01T04:20:00.001+04:00</published><updated>2010-05-13T01:36:30.480+04:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='ideas'/><category scheme='http://www.blogger.com/atom/ns#' term='speech recognition'/><title type='text'>Face Recognizers, Bloom filters and Application to Speech Recognition</title><content type='html'>&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://www.cc.gatech.edu/%7Ewujx/RareEvent/cascade.JPG" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"&gt;&lt;img border="0" height="255" src="http://www.cc.gatech.edu/%7Ewujx/RareEvent/cascade.JPG" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;In scientific paper waterfall we have today I continuously face the issue of&amp;nbsp; selection of high-level important approaches to the problem. Many ideas are definitely important and lead to accuracy improvement but they are certainly not counted as core ones. Like another feature extraction algorithm that could bring you 2% of performance improvement. I definitely miss some high-level up-to-date reviews that could lead into the world of possible approaches taken and their advantages and disadvantages. I was counting on books in that, but unfortunately they aren't as accessible as papers.&lt;br /&gt;&lt;br /&gt;Some time ago I went into reading the core face detection paper by &lt;a href="http://research.microsoft.com/%7Eviola/Pubs/Detect/violaJones_IJCV.pdf"&gt;Viola and Jones&lt;/a&gt; about Haar cascades for object detection. 
It struck me that their method, which proved to be very fruitful in face and object detection, didn't get into common practice in speech recognition.&lt;br /&gt;&lt;br /&gt;Basically the idea of their method is that it's possible to reduce the search space significantly with a set of very weak classifiers. For example, you can easily find out that there is no face on the green grass and thus skip this region. The fruitful idea here is that you can classify negatives much more accurately than positives. Putting the classifiers into a cascade makes the search space tiny and the recognition fast and efficient. Certainly it's not the only algorithm of this type; another one I met recently is &lt;a href="http://en.wikipedia.org/wiki/Bloom_filter"&gt;Bloom filters&lt;/a&gt;, which use almost the same method for efficient hash search.&lt;br /&gt;&lt;br /&gt;The transfer of this into ASR is rather straightforward. We need to train weak classifiers that reject phone hypotheses for a given set of frames. That's actually quite easy with an SVM or something built on top of an existing HMM segmentation. Next, we could also apply this to a language model and reject hypotheses which aren't possible in the language.&lt;br /&gt;&lt;br /&gt;I haven't seen any papers on that; probably I need to search more. 
This idea is certainly worth to try and it should get into common ASR practices like discriminative training, adaptation with linear regression or multipass search.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5075537609830514591-6647662185336816674?l=nsh.nexiwave.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nsh.nexiwave.com/feeds/6647662185336816674/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://nsh.nexiwave.com/2010/05/face-recognizers-bloom-filters-and.html#comment-form' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/6647662185336816674'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/6647662185336816674'/><link rel='alternate' type='text/html' href='http://nsh.nexiwave.com/2010/05/face-recognizers-bloom-filters-and.html' title='Face Recognizers, Bloom filters and Application to Speech Recognition'/><author><name>Nickolay V. 
Shmyrev</name><uri>http://www.blogger.com/profile/11220369315272283124</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://4.bp.blogspot.com/-mLPNJlEdMV8/TbZcF3qUEqI/AAAAAAAAAKA/6PXGkO0cmAk/s220/me.jpg'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5075537609830514591.post-728776641958070371</id><published>2010-04-27T00:23:00.001+04:00</published><updated>2010-04-27T00:26:36.995+04:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='nexiwave'/><title type='text'>Great Move To Nexiwave</title><content type='html'>We decided to move all our blogs like &lt;a href="http://ben.nexiwave.com/"&gt;Ben's blog&lt;/a&gt;, news about &lt;a href="http://searchmymeetings.com/"&gt;SearchMyMeetings&lt;/a&gt; and others to&amp;nbsp; &lt;a href="http://nexiwave.com/"&gt;http://nexiwave.com&lt;/a&gt;. Such consolidation will help us manage our resources and will improve our presence on the web. 
Being more officially placed, we will be more responsible for the content as well, so I hope to see more useful materials about speech recognition, CMUSphinx and other related things here soon.&lt;br /&gt;&lt;br /&gt;Sorry for the inconvenience.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5075537609830514591-728776641958070371?l=nsh.nexiwave.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nsh.nexiwave.com/feeds/728776641958070371/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://nsh.nexiwave.com/2010/04/great-move-to-nexiwave.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/728776641958070371'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/728776641958070371'/><link rel='alternate' type='text/html' href='http://nsh.nexiwave.com/2010/04/great-move-to-nexiwave.html' title='Great Move To Nexiwave'/><author><name>Nickolay V. 
Shmyrev</name><uri>http://www.blogger.com/profile/11220369315272283124</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://4.bp.blogspot.com/-mLPNJlEdMV8/TbZcF3qUEqI/AAAAAAAAAKA/6PXGkO0cmAk/s220/me.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5075537609830514591.post-721311657852462222</id><published>2010-04-24T02:37:00.002+04:00</published><updated>2010-05-13T01:37:47.332+04:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='speech recognition'/><category scheme='http://www.blogger.com/atom/ns#' term='testing'/><category scheme='http://www.blogger.com/atom/ns#' term='articles'/><title type='text'>Intelligent Testing In ASR</title><content type='html'>To continue the previous topic about testing, I want to share a nice paper I read some time ago, which I would like to bring into our daily practice.&lt;br /&gt;&lt;br /&gt;The issue is that the current way we test our systems is far from optimal; at least there is no real theory behind it. In practice I usually apply the 1/10th rule, splitting the data into a 9/10 training set and a 1/10 test set. This was done in Voxforge as well. That's not so good, since with 70 hours of Voxforge data the test set grows to 7 hours and takes ages to decode. I took this rule from festival's traintest script. It's more or less common practice in ASR, while things like &lt;a href="http://en.wikipedia.org/wiki/Cross-validation_%28statistics%29"&gt;10-fold cross-validation &lt;/a&gt;aren't popular, mostly for computational reasons. Surprisingly, problems like that can be easily solved if only we focus on the real goal of testing - estimating the recognition performance. All our test sets are oversized; one can easily see that by looking at the decoder results during testing. 
They tend to stabilize very quickly unless there is some data inconsistency.&lt;br /&gt;&lt;br /&gt;Speech recognition practice unfortunately doesn't cover this, even in scientific papers. Help comes from character recognition. The nice paper I found is:&lt;br /&gt;&lt;br /&gt;Isabelle Guyon, John Makhoul, Richard Schwartz, and Vladimir Vapnik&lt;br /&gt;&lt;a href="http://people.sabanciuniv.edu/berrin/cs512/reading/guyon-datasize.pdf"&gt;What Size Test Set Gives Good Error Rate Estimates? &lt;/a&gt;&lt;br /&gt;IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 20, NO. 1, JANUARY 1998&lt;br /&gt;&lt;br /&gt;The authors address the problem of determining what size test set guarantees statistically significant results in a character recognition task, as a function of the expected error rate. The paper is well written and actually rather easy to understand. There is no complex model behind the testing, nothing speech-specific. There are two valuable points there:&lt;br /&gt;&lt;ol&gt;&lt;li&gt;The approach that puts reasoning behind the test process&lt;/li&gt;&lt;li&gt;The formulae themselves&lt;/li&gt;&lt;/ol&gt;To put it simply: &lt;b&gt;&amp;nbsp;&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;The test set for a medium vocabulary task can be small. If the word error rate is expected to be around 10%, then by the table on page 9, to compare two configurations that differ by 0.5% absolute you need only about 13k words of data.&amp;nbsp;&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;That's four times smaller than the current Voxforge test set. I think this estimate can be improved even further if we specialize it for speech. 
I really hope this result will be useful for us and will help us to speedup the process of application testing and optimization.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5075537609830514591-721311657852462222?l=nsh.nexiwave.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nsh.nexiwave.com/feeds/721311657852462222/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://nsh.nexiwave.com/2010/04/intelligent-testing-in-asr.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/721311657852462222'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/721311657852462222'/><link rel='alternate' type='text/html' href='http://nsh.nexiwave.com/2010/04/intelligent-testing-in-asr.html' title='Intelligent Testing In ASR'/><author><name>Nickolay V. Shmyrev</name><uri>http://www.blogger.com/profile/11220369315272283124</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://4.bp.blogspot.com/-mLPNJlEdMV8/TbZcF3qUEqI/AAAAAAAAAKA/6PXGkO0cmAk/s220/me.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5075537609830514591.post-4772289259542327926</id><published>2010-04-18T14:33:00.003+04:00</published><updated>2010-05-12T02:43:33.000+04:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='voxforge'/><category scheme='http://www.blogger.com/atom/ns#' term='sphinx4'/><title type='text'>Testing ASR with Voxforge Database</title><content type='html'>In development and research the critical issue is proper testing. 
There was some buzz about that recently, for example at the &lt;a href="http://mloss.org/community/"&gt;MLoss blog&lt;/a&gt;, where the pros of using open data are considered. One interesting resource that started some time ago is &lt;a href="http://mlcomp.org/"&gt;http://mlcomp.org/&lt;/a&gt;, which combines open data and open algorithms, automatically selecting the best method for a common data set. I think this idea is not that easy to implement because "best" is often different. Sometimes you need speed, sometimes generalization. &lt;br /&gt;&lt;br /&gt;In our case, by using open data you can easily solve the following problems:&lt;br /&gt;&lt;br /&gt;&lt;ol&gt;&lt;li&gt;Test the changes you've made in the speech decoder and trainer on a practical large-vocabulary database.&lt;/li&gt;&lt;li&gt;Estimate how the recognition engine performs. It's not just about estimating the accuracy but also about other critical parameters like confidence score quality, decoding speed, lattice variability, noise robustness and so on.&lt;/li&gt;&lt;li&gt;Share the bugs you've found. The situation is that we can definitely fix minor problems that are easy to reproduce, but any serious problem ultimately requires a reproducible test example.&lt;/li&gt;&lt;/ol&gt;&lt;br /&gt;I actually wanted to describe how this works in practice right now. The solution we propose for CMUSphinx developers is the &lt;a href="http://voxforge.org/"&gt;Voxforge&lt;/a&gt; database. It's not the only open data source out there, but I think it's the most permissive one. The old an4 is good for quick tests, but it definitely doesn't satisfy our needs, because anything except a large vocabulary recognizer makes little sense nowadays.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;a name='more'&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;The database itself is about 75 hours of read speech taken from various sources. The speech was collected by a web collection application, from public audiobooks and so on. 
There are a number of accents, and the sound quality also varies. The information on the source is not very reliable, but that's something we have to live with. The speech is segmented into utterances and a transcription is provided for each utterance. The Voxforge DB has some disadvantages, like a very limited vocabulary (around 5000 words) and a rather narrow focus on read texts, but we need to work on them. I believe these issues will be fixed soon.&lt;br /&gt;&lt;br /&gt;The Voxforge model for CMUSphinx is trained periodically; the most recent one can be downloaded here:&lt;br /&gt;&lt;br /&gt;&lt;a href="http://www.repository.voxforge1.org/downloads/Main/Trunk/AcousticModels/Sphinx/"&gt;http://www.repository.voxforge1.org/downloads/Main/Trunk/AcousticModels/Sphinx/&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;The model was trained with SphinxTrain; the config can be found inside, as well as the other scripts used to train it. Test results are also inside, in the README file of the model. The corresponding performance test is provided in sphinx4.&lt;br /&gt;&lt;br /&gt;The steps to start with it are simple:&lt;br /&gt;&lt;br /&gt;&lt;ol&gt;&lt;li&gt;Download the model from the link above and unpack it.&lt;/li&gt;&lt;li&gt;Check the scripts in the script subfolder to set up training:&lt;/li&gt;&lt;ul&gt;&lt;li&gt;Download audio with wget&lt;/li&gt;&lt;li&gt;Unpack it&lt;/li&gt;&lt;li&gt;Convert flac to wav&lt;/li&gt;&lt;li&gt;Make transcription file from PROMPTS&lt;/li&gt;&lt;li&gt;Extract features&lt;/li&gt;&lt;li&gt;Run the training with ./scripts_pl/RunAll.pl&lt;/li&gt;&lt;/ul&gt;&lt;li&gt;Test the recognition performance and set up other testing environments for the various numbers you need.&lt;/li&gt;&lt;/ol&gt;The space required to train Voxforge is about 10 GB on disk. 
On a dual-core machine training should take about a day.&lt;br /&gt;&lt;br /&gt;Here are the results of the sphinx4 performance test of the acoustic model voxforge-en-r0.1.3:&lt;br /&gt;&lt;pre&gt;[java] # --------------- Summary statistics ---------&lt;br /&gt;[java]    Accuracy: 90,613%    Errors: 4679  (Sub: 2904  Ins: 559  Del: 1216)&lt;br /&gt;[java]    Words: 43889   Matches: 39769    WER: 10,661%&lt;br /&gt;[java]    Sentences: 4682   Matches: 3050   SentenceAcc: 65,143%&lt;br /&gt;[java]    Total Time Audio: 22090,44s  Proc: 116616,04s  Speed: 5,28 X real time&lt;br /&gt;[java]    Mem  Total: 1993,75 Mb  Free: 806,02 Mb&lt;br /&gt;[java]    Used: This: 1187,73 Mb  Avg: 1112,13 Mb  Max: 1962,46 Mb&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;To run this test do the following:&lt;br /&gt;&lt;br /&gt;&lt;ol&gt;&lt;li&gt;Check out the latest sphinx4&lt;/li&gt;&lt;li&gt;Go to sphinx4/tests/performance/voxforge_en&lt;/li&gt;&lt;li&gt;Download and unpack the audio files into the wav folder using the script build.sh from voxforge-en-r0_1_3/scripts&lt;/li&gt;&lt;li&gt;Download the acoustic model and unpack it, creating the etc folder with the lm and the voxforge_en_sphinx.cd_cont_3000 folder with the model files&lt;/li&gt;&lt;li&gt;Start the test by simply typing ant on the command line&lt;/li&gt;&lt;/ol&gt;&lt;br /&gt;As you can see, the decoding takes about 32 hours. Well, it can be faster on a multicore machine, but you need to change the configuration of ThreadedScorer to explicitly start multiple scoring threads. Unfortunately automatic detection of the number of cores doesn't work in Sun's JVM.&lt;br /&gt;&lt;br /&gt;Speaking about Voxforge, we definitely need to thank Ken MacLean (great work, Ken!) who has been running Voxforge for several years already.&amp;nbsp; It's also worth mentioning that without the contributors who submitted their speech this project would not be thriving like this. So, start using Voxforge for your developments, report bugs and send us comments. 
That would be appreciated.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5075537609830514591-4772289259542327926?l=nsh.nexiwave.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nsh.nexiwave.com/feeds/4772289259542327926/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://nsh.nexiwave.com/2010/04/testing-asr-with-voxforge-database.html#comment-form' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/4772289259542327926'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/4772289259542327926'/><link rel='alternate' type='text/html' href='http://nsh.nexiwave.com/2010/04/testing-asr-with-voxforge-database.html' title='Testing ASR with Voxforge Database'/><author><name>Nickolay V. Shmyrev</name><uri>http://www.blogger.com/profile/11220369315272283124</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://4.bp.blogspot.com/-mLPNJlEdMV8/TbZcF3qUEqI/AAAAAAAAAKA/6PXGkO0cmAk/s220/me.jpg'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5075537609830514591.post-3057536924793042143</id><published>2010-03-26T00:34:00.002+03:00</published><updated>2010-05-18T01:33:05.500+04:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='sphinx'/><category scheme='http://www.blogger.com/atom/ns#' term='conferences'/><title type='text'>Finally Sorted Out Workshop Materials</title><content type='html'>&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://lh6.ggpht.com/_rcCh_jaqOVM/S6aL7buq1OI/AAAAAAAACko/NrnZv7qXkg4/s1600/sn204352.jpg" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"&gt;&lt;img 
border="0" height="240" src="http://lh6.ggpht.com/_rcCh_jaqOVM/S6aL7buq1OI/AAAAAAAACko/NrnZv7qXkg4/s320/sn204352.jpg" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;Since they are more like official CMUSphinx documents, I posted the notes about the workshop and the meetings after it on the website:&lt;br /&gt;&lt;br /&gt;&lt;a href="http://cmusphinx.sourceforge.net/2010/03/sphinx-users-and-developers-workshop-2010-results/" rel="bookmark" title="Sphinx Users And Developers Workshop 2010 Results"&gt;Sphinx Users And Developers Workshop 2010 Results&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;a href="http://cmusphinx.sourceforge.net/2010/03/development-meeting-notes/" rel="bookmark"&gt;Development Meeting Notes&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;I'm pleased that we got so many important things planned and a few very important issues cleared up. For example, I didn't completely understand before why lattices in sphinx4 are so bad. I hope other participants had some productive results too. &lt;br /&gt;&lt;br /&gt;It would be nice to get some pictures from the workshop, but unfortunately none are available now. Probably sometime later. 
So, just Dallas one.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5075537609830514591-3057536924793042143?l=nsh.nexiwave.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nsh.nexiwave.com/feeds/3057536924793042143/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://nsh.nexiwave.com/2010/03/finally-sorted-out-workshop-mateials.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/3057536924793042143'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/3057536924793042143'/><link rel='alternate' type='text/html' href='http://nsh.nexiwave.com/2010/03/finally-sorted-out-workshop-mateials.html' title='Finally Sorted Out Workshop Materials'/><author><name>Nickolay V. Shmyrev</name><uri>http://www.blogger.com/profile/11220369315272283124</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://4.bp.blogspot.com/-mLPNJlEdMV8/TbZcF3qUEqI/AAAAAAAAAKA/6PXGkO0cmAk/s220/me.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://lh6.ggpht.com/_rcCh_jaqOVM/S6aL7buq1OI/AAAAAAAACko/NrnZv7qXkg4/s72-c/sn204352.jpg' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5075537609830514591.post-89307659865079559</id><published>2010-03-22T22:22:00.006+03:00</published><updated>2010-05-18T01:32:32.654+04:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='conferences'/><title type='text'>ICASSP 2010</title><content type='html'>&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://2.bp.blogspot.com/_p33_0koWXHA/S6e3bqGCedI/AAAAAAAAADk/IFy2kctYk2A/s1600-h/icassp.jpg" 
imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"&gt;&lt;img border="0" height="170" src="http://2.bp.blogspot.com/_p33_0koWXHA/S6e3bqGCedI/AAAAAAAAADk/IFy2kctYk2A/s400/icassp.jpg" width="400" /&gt;&lt;/a&gt;&lt;/div&gt;So, I'm back from &lt;a href="http://www.icassp2010.org/"&gt;ICASSP&lt;/a&gt; in Dallas, TX. It was a very impressive conference with lots of interesting and inspiring presentations, meetings and discussions. Amazingly, everyone was there, and I've finally met all the speech people who guided me for such a long time. I've met ASR people Bhiksha Raj, David Huggins-Daines, Rita Singh, Richard Stern and TTS people Alan W. Black, Keiichi Tokuda, Simon King, Heiga Zen. I was pleased to meet wonderful guys like Evandro and Peter for the second time. It's worth mentioning that I was able to listen to talks by famous people like Hynek Hermansky.&lt;br /&gt;&lt;br /&gt;We had the &lt;a href="http://www.cs.cmu.edu/%7Esphinx/Sphinx2010"&gt;Sphinx Users and Developers Workshop&lt;/a&gt;&amp;nbsp; there and also two CMU Sphinx development planning meetings. But they are a subject for another post. This one is just about interesting ideas presented at the conference by other people. I didn't have time to attend every presentation out there; I think it was impossible. You have to find time for sightseeing, and there were often two or three parallel lecture sessions and also poster presentations, which I liked most. I think a poster presentation is the best way to reach the author, ask him questions and get feedback. Many posters were so popular it was almost impossible to get to the stand.&lt;br /&gt;&lt;br /&gt;Anyway, the amount of talks I've got already exceeds what can be consumed in a week. It would be nice one day to get all the information about current research collected into a structured or wiki-like resource. 
That's a huge job for the future though.&lt;br /&gt;&lt;br /&gt;So here are some presentations and ideas I met there and found worth attention:&lt;br /&gt;&lt;br /&gt;&lt;b&gt; Robust Speaking Rate Estimation Using Broad Phonetic Class Recognition&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;Authors: Jiahong Yuan, Mark Liberman (University of Pennsylvania)&lt;br /&gt;&lt;br /&gt;The presented work is about using an easy classifier to get some specific data about speech, for example to estimate deletions in syllables and thus speech quality. This is actually a very promising approach, which for some reason is ignored in most places where it seems to be practical. For example, it's not clear to me why a speaker identification framework doesn't try to find phonetic classes first and build the GMM only after that. It seems to be a natural approach to improve SID performance.&lt;br /&gt;&lt;br /&gt;Broad phonetic classes remind me of the idea from the famous face recognition algorithm by Viola and Jones about applying a cascade for fast classification. I think this idea could be applied to speech in some form, like the authors suggest.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Clap Your Hands! Calibrating Spectral Subtraction for Reverberation&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;Authors: Uwe Zaeh, Korbinian Riedhammer, Tobias Bocklet, Elmar Noeth&lt;br /&gt;&lt;br /&gt;Reverberation was a very popular topic at this conference, and it's especially important for meetings. Different speech systems require different noise cancellation. Far microphones need to fix reverberation from the room, close microphones need to fix clicks and so on. Far microphones sometimes do a calibration for reverberation estimation. This defines the set of components sphinx4 could have to deal with various environment conditions. 
Right now they are simply missing.&lt;br /&gt;&lt;br /&gt;&lt;b&gt; Detecting Local Semantic Concepts in Environmental Sounds using Markov Model&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;Authors: Keansub Lee, Daniel Ellis, Alexander Loui&lt;br /&gt;&lt;br /&gt;It's interesting that the classification database for this task is available at &lt;a href="http://labrosa.ee.columbia.edu/projects/consumervideo/"&gt;http://labrosa.ee.columbia.edu/projects/consumervideo/&lt;/a&gt;; this could be a base for non-speech recognition research.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Learning Task-Dependent Speech Variability In Discriminative&amp;nbsp;Acoustic Model Adaptation&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;Authors: Shoei Sato, Takahiro Oku, Shinichi Homma, Akio Kobayashi, Toru Imai&lt;br /&gt;&lt;br /&gt;Discriminative approaches are popular nowadays. Direct optimization of the cost function could serve at various stages of the training process. In this work, for example, the set of subword units is selected to minimize the decoding error rate.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;An Improved Consensus-Like method for Minimum Bayes Risk Decoding and Lattice Combination &lt;/b&gt;&lt;br /&gt;&lt;br /&gt;Authors: Haihua Xu, Daniel Povey, Lidia Mangu, Jie Zhu&lt;br /&gt;&lt;br /&gt;This deals with a specific criterion for lattice decoding. Not just the best path could be chosen; other criteria like consensus could also apply. For me personally it would be very interesting to formalize and apply a criterion that ensures grammatical correctness of the result. 
I haven't found anything on this yet.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Discriminative training based on an integrated view of MPE and MMI in margin and error space&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;Authors: Erik McDermott, Shinji Watanabe, Atsushi Nakamura&lt;br /&gt;&lt;br /&gt;It's interesting to find out that real math goes into ASR. Basically it was a long-awaited thing, and it seems it was started by Georg Heigold with his works on MMI and other methods. It would be nice to review this area to get some idea of what the outcome is. Heigold was cited in almost every presentation, so it's really getting popular.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Balancing False Alarms and Hits in Spoken Term Detection&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;Authors: Carolina Parada, Abhinav Sethy, Bhuvana Ramabhadran&lt;br /&gt;&lt;br /&gt;It's interesting to see what tools are used. WFSTs are very convenient and used by everyone: IBM, Google, AT&amp;amp;T. This is also a topic for a separate post.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Bayesian Analysis of Finite Gaussian Mixtures&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;Authors: Mark Morelande, Branko Ristic&lt;br /&gt;&lt;br /&gt;A rather old idea (there are similar papers from 1998) to use Bayesian learning to estimate the number of mixtures in the model. I'm in favor of the approach of estimating all model parameters at once, including the number of mixtures, the language weight and the number of senones.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Improving Speech Recognition by Explicit Modeling of Phone Deletions&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;Authors: Tom Ko&lt;br /&gt;&lt;br /&gt;Pronunciation variation by phone deletion looks very promising, since traditional linguists mostly complain about the sequential HMM model, which doesn't handle deletions correctly. Unfortunately, the effect of this seems to be small. 
The improvement cited is only from 91.5% to 92%.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;An Efficient Beam Pruning With A Reward Considering The Potential To Reach Various Words.&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;Authors: Tsuneo Kato, Kengo Fujita, Nobuyuki Nishizawa&lt;br /&gt;&lt;br /&gt;Beam pruning according to the number of reachable words or to another risk function. A good idea to implement in sphinx4 to speed up recognition. The speedup factor cited is 1.2 for a large vocabulary.&lt;br /&gt;&lt;br /&gt;That's it. Unfortunately I missed Friday and most early mornings, so there could have been something interesting there as well. I'm sure you could select your own set. It would be interesting to look at it.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5075537609830514591-89307659865079559?l=nsh.nexiwave.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nsh.nexiwave.com/feeds/89307659865079559/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://nsh.nexiwave.com/2010/03/icassp-2010.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/89307659865079559'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/89307659865079559'/><link rel='alternate' type='text/html' href='http://nsh.nexiwave.com/2010/03/icassp-2010.html' title='ICASSP 2010'/><author><name>Nickolay V. 
Shmyrev</name><uri>http://www.blogger.com/profile/11220369315272283124</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://4.bp.blogspot.com/-mLPNJlEdMV8/TbZcF3qUEqI/AAAAAAAAAKA/6PXGkO0cmAk/s220/me.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://2.bp.blogspot.com/_p33_0koWXHA/S6e3bqGCedI/AAAAAAAAADk/IFy2kctYk2A/s72-c/icassp.jpg' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5075537609830514591.post-4997793817042141647</id><published>2010-03-01T22:56:00.005+03:00</published><updated>2010-03-22T22:24:00.331+03:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='sphinx4'/><title type='text'>Sphinx4 1.0 beta4 Is Released. What's next?</title><content type='html'>&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://farm3.static.flickr.com/2448/3557849329_01152c2854_m.jpg" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"&gt;&lt;img border="0" src="http://farm3.static.flickr.com/2448/3557849329_01152c2854_m.jpg" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;&lt;br /&gt;So, almost according to schedule, sphinx4 was released yesterday. Check the notes at&lt;br /&gt;&lt;br /&gt;&lt;a href="http://cmusphinx.sourceforge.net/2010/03/sphinx4-1-0-beta-4-released/"&gt;http://cmusphinx.sourceforge.net/2010/03/sphinx4-1-0-beta-4-released/&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;The most notable improvements were already discussed here, so let me try to plan what the next release will be. Trying to be realistic in my plans, I don't want to promise everything at once. Here is an attempt to forecast the next release notes.&lt;br /&gt;&lt;br /&gt;The biggest issue with sphinx4 is actually documentation. The current &lt;a href="http://cmusphinx.sourceforge.net/pollsarchive/"&gt;poll&lt;/a&gt; on the CMUSphinx website clearly shows that.
Personally I sometimes think that perfect documentation will not help if the system doesn't work, but at least it will make the product attractive and easy to use. My idea is that we need more developer-level documentation: tutorials, examples, task-oriented howtos. It's unlikely we'll be able to write something good enough to serve as a textbook on speech technologies. But we need to prove the point that it's possible to build an ASR system without knowing who Welch is.&lt;br /&gt;&lt;br /&gt;On the code side, we face the biggest challenge since sphinx4 was designed. We need to move to a multipass system. It's not just about rescoring; it's about plugging in the &lt;a href="http://cmusphinx.sourceforge.net/wiki/speakerdiarization"&gt;diarization framework from LIUM&lt;/a&gt;, and it's also about making sphinx4 suitable for both batch and live applications. That's a serious issue. &lt;br /&gt;&lt;br /&gt;The reason is that the sphinx4 architecture is currently flow-oriented. It's built like a single pipe of components, each passing audio to the next. This is good for live applications, but not so good for batch ones. You run into trouble when you need to split the pipe or merge it later. In a batch application one could benefit hugely from looking at the recording as a whole and returning to it multiple times. For example, you could estimate the noise level properly and clean up the audio on the second pass. Such multipass decoding doesn't fit well into the pipe paradigm. On the other hand, changing it to purely batch will create issues for live applications.&lt;br /&gt;&lt;br /&gt;So we are in trouble. We probably have to invent some combined scheme and create a hybrid of the pipe and batch approaches. I was thinking about a knowledge base scheme where information about the stream is stored in some database as processing goes. Database cleanup policies could emulate both the pipe approach (when the database is immediately cleaned) and the batch approach (when the database is kept even over sessions).
By the way, Festival utterances remind me of such a data processing scheme. Anyway, this idea is not finalized yet.&lt;br /&gt;&lt;br /&gt;We also expect to see a lot of movement from the CMUSphinx Workshop in Dallas and from Google Summer of Code participation. I hope the issues described above and some more interesting issues will be resolved by the next release in August. Let's discuss the rest then!&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5075537609830514591-4997793817042141647?l=nsh.nexiwave.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nsh.nexiwave.com/feeds/4997793817042141647/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://nsh.nexiwave.com/2010/03/sphinx4-10-beta4-is-release-whats-next.html#comment-form' title='3 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/4997793817042141647'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/4997793817042141647'/><link rel='alternate' type='text/html' href='http://nsh.nexiwave.com/2010/03/sphinx4-10-beta4-is-release-whats-next.html' title='Sphinx4 1.0 beta4 Is Released. What&apos;s next?'/><author><name>Nickolay V. 
Shmyrev</name><uri>http://www.blogger.com/profile/11220369315272283124</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://4.bp.blogspot.com/-mLPNJlEdMV8/TbZcF3qUEqI/AAAAAAAAAKA/6PXGkO0cmAk/s220/me.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://farm3.static.flickr.com/2448/3557849329_01152c2854_t.jpg' height='72' width='72'/><thr:total>3</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5075537609830514591.post-8471960103989694648</id><published>2010-02-27T02:16:00.003+03:00</published><updated>2011-03-19T22:43:40.998+03:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='summer of code'/><category scheme='http://www.blogger.com/atom/ns#' term='sphinx'/><title type='text'>Speech Recognition in GSoC Done Right</title><content type='html'>&lt;div dir="ltr" style="text-align: left;" trbidi="on"&gt;From year to year many end-user projects try to push ASR forward with the help of Google and the students of the Summer Of Code program. If the CMUSphinx team knows all about ASR, why should we stay away from that?&lt;br /&gt;&lt;br /&gt;I have had varied experience with Google Summer Of Code before, but I still like this process and enjoy communicating with new people. I think we have a good chance to succeed here. So I started and filed an application proposal and the initial list of ideas&lt;br /&gt;&lt;br /&gt;&lt;a href="http://cmusphinx.sourceforge.net/wiki/summerofcode2010"&gt;http://cmusphinx.sourceforge.net/wiki/summerofcode2010&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;I will submit this proposal on March 8 after the program starts. We need more ideas now, as many as you can generate&lt;br /&gt;&lt;br /&gt;&lt;a href="http://cmusphinx.sourceforge.net/wiki/summerofcodeideas"&gt;http://cmusphinx.sourceforge.net/wiki/summerofcodeideas&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;We need to have a more or less representative list.
If you want to be a mentor, don't hesitate to write down your IRC nick as well.&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5075537609830514591-8471960103989694648?l=nsh.nexiwave.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nsh.nexiwave.com/feeds/8471960103989694648/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://nsh.nexiwave.com/2010/02/speech-recognition-in-gsoc-done-right.html#comment-form' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/8471960103989694648'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/8471960103989694648'/><link rel='alternate' type='text/html' href='http://nsh.nexiwave.com/2010/02/speech-recognition-in-gsoc-done-right.html' title='Speech Recognition in GSoC Done Right'/><author><name>Nickolay V. 
Shmyrev</name><uri>http://www.blogger.com/profile/11220369315272283124</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://4.bp.blogspot.com/-mLPNJlEdMV8/TbZcF3qUEqI/AAAAAAAAAKA/6PXGkO0cmAk/s220/me.jpg'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5075537609830514591.post-2654667811481785121</id><published>2010-02-12T04:24:00.000+03:00</published><updated>2010-02-12T04:24:31.667+03:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='sphinx4'/><category scheme='http://www.blogger.com/atom/ns#' term='wiener filter'/><category scheme='http://www.blogger.com/atom/ns#' term='noise filter'/><title type='text'>Noise reduction filtering in sphinx4</title><content type='html'>There is a huge gap between stock sphinx4 and a real ASR system, since critical parts like noise filtering, speaker diarization and postprocessing are missing. Not to mention online adaptation. The default frontend is less than optimal for several reasons. For example, it doesn't handle DC offset at all, and it uses an energy-based endpointer in the time domain, so it's not very robust to additive noise.&lt;br /&gt;&lt;br /&gt;As of today sphinx4 includes an implementation of the &lt;a href="http://en.wikipedia.org/wiki/Wiener_deconvolution"&gt;Wiener filter&lt;/a&gt; that reduces noise and helps the voice activity detector as well.
To try it, check out the latest trunk and change the frontend pipeline as follows:&lt;br /&gt;&lt;br /&gt; &amp;lt;item&amp;gt;audioFileDataSource &amp;lt;/item&amp;gt;&lt;br /&gt; &amp;lt;item&amp;gt;dataBlocker &amp;lt;/item&amp;gt;&lt;br /&gt; &amp;lt;item&amp;gt;preemphasizer &amp;lt;/item&amp;gt;&lt;br /&gt; &amp;lt;item&amp;gt;windower &amp;lt;/item&amp;gt;&lt;br /&gt; &amp;lt;item&amp;gt;fft &amp;lt;/item&amp;gt;&lt;br /&gt; &amp;lt;item&amp;gt;wiener &amp;lt;/item&amp;gt;&lt;br /&gt; &amp;lt;item&amp;gt;speechClassifier &amp;lt;/item&amp;gt;&lt;br /&gt; &amp;lt;item&amp;gt;speechMarker &amp;lt;/item&amp;gt;&lt;br /&gt; &amp;lt;item&amp;gt;nonSpeechDataFilter &amp;lt;/item&amp;gt;&lt;br /&gt; &amp;lt;item&amp;gt;melFilterBank &amp;lt;/item&amp;gt;&lt;br /&gt; &amp;lt;item&amp;gt;dct &amp;lt;/item&amp;gt;&lt;br /&gt; &amp;lt;item&amp;gt;liveCMN &amp;lt;/item&amp;gt;&lt;br /&gt; &amp;lt;item&amp;gt;featureExtraction &amp;lt;/item&amp;gt;&lt;br /&gt;&lt;br /&gt;Then define the wiener component:&lt;br /&gt;&lt;br /&gt;    &amp;lt;component name="wiener"  &lt;br /&gt;        type="edu.cmu.sphinx.frontend.endpoint.WienerFilter"&amp;gt;&lt;br /&gt;        &amp;lt;property name="classifier" value="speechClassifier"/&amp;gt;&lt;br /&gt;    &amp;lt;/component&amp;gt;&lt;br /&gt;&lt;br /&gt;This frontend is stable to DC and also handles noise better. To try noisy input, you could mix in white noise with sox:&lt;br /&gt;&lt;br /&gt;&lt;pre&gt; sox 10001-90210-01803.wav noise.wav synth white&lt;br /&gt; sox noise.wav smallnoise.wav vol -45d&lt;br /&gt; sox -m 10001-90210-01803.wav smallnoise.wav 10001-90210-01803-noisy.wav&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;It would be nice to try it with the Aurora database as well.&lt;br /&gt;&lt;br /&gt;This filter is very simple and has a number of disadvantages. For example, it sometimes corrupts the spectrum with harmonic noise and thus makes recognition even worse. But it definitely helps in the presence of noise.
Let's hope that one day more sophisticated implementations like the &lt;a href="http://ieeexplore.ieee.org/Xplore/login.jsp?url=http%3A%2F%2Fieeexplore.ieee.org%2Fiel5%2F9248%2F29343%2F01326146.pdf%3Farnumber%3D1326146&amp;authDecision=-203"&gt;Ephraim-Malah&lt;/a&gt; filter, or even noise reduction with &lt;a href="http://www.cs.cmu.edu/afs/cs/user/robust/www/Papers/icassp96-vts.pdf"&gt;vector Taylor series&lt;/a&gt;, will be made available in the default configurations.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5075537609830514591-2654667811481785121?l=nsh.nexiwave.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nsh.nexiwave.com/feeds/2654667811481785121/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://nsh.nexiwave.com/2010/02/noise-reduction-filtering-in-sphinx4.html#comment-form' title='14 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/2654667811481785121'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/2654667811481785121'/><link rel='alternate' type='text/html' href='http://nsh.nexiwave.com/2010/02/noise-reduction-filtering-in-sphinx4.html' title='Noise reduction filtering in sphinx4'/><author><name>Nickolay V. 
Shmyrev</name><uri>http://www.blogger.com/profile/11220369315272283124</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://4.bp.blogspot.com/-mLPNJlEdMV8/TbZcF3qUEqI/AAAAAAAAAKA/6PXGkO0cmAk/s220/me.jpg'/></author><thr:total>14</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5075537609830514591.post-3004807425976745670</id><published>2010-01-31T19:08:00.002+03:00</published><updated>2010-02-01T00:45:51.386+03:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='random stuff'/><title type='text'>All ideas are already generated</title><content type='html'>After seeing Flash websites take an enormous amount of my CPU, I got a cool idea today about using Flash for distributed computing. Basically everything is already in place. You set up a webserver and share content with Flash; it runs on the client computer and does calculations, uploading the result from time to time. Certainly I wasn't the first to invent that, see for example&lt;br /&gt;&lt;br /&gt;&lt;a href="http://www.vershun.com/computers/hidden-flash-applications-as-distributed-computing-clients.html"&gt;http://www.vershun.com/computers/hidden-flash-applications-as-distributed-computing-clients.html&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;and&lt;br /&gt;&lt;br /&gt;&lt;a href="http://www.csc.villanova.edu/%7Etway/courses/csc3990/f2009/csrs2009/Kevin_Berry_Grid_Computing_CSRS_2009.pdf"&gt;http://www.csc.villanova.edu/~tway/courses/csc3990/f2009/csrs2009/Kevin_Berry_Grid_Computing_CSRS_2009.pdf&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;Such ideas are rather recent, though, and the question is how to make this framework widely used.
Looking at the current load of the computer at sourceforge, it's most likely already used by some websites :)&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5075537609830514591-3004807425976745670?l=nsh.nexiwave.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nsh.nexiwave.com/feeds/3004807425976745670/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://nsh.nexiwave.com/2010/01/all-ideas-are-already-generated.html#comment-form' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/3004807425976745670'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/3004807425976745670'/><link rel='alternate' type='text/html' href='http://nsh.nexiwave.com/2010/01/all-ideas-are-already-generated.html' title='All ideas are already generated'/><author><name>Nickolay V. Shmyrev</name><uri>http://www.blogger.com/profile/11220369315272283124</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://4.bp.blogspot.com/-mLPNJlEdMV8/TbZcF3qUEqI/AAAAAAAAAKA/6PXGkO0cmAk/s220/me.jpg'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5075537609830514591.post-2398026450449460643</id><published>2010-01-31T03:48:00.000+03:00</published><updated>2010-01-31T03:48:46.518+03:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='sphinxtrain'/><title type='text'>Training process</title><content type='html'>What I really like about Sphinxtrain is that it provides a straightforward way of training an acoustic model. It remains unclear to me why everyone bothers with the HTKBook while there is a clean and easy way to train the model.
One should just define the dictionary and transcription and put the files in the proper folder. Anyway, I'm continuously thinking about how the sphinxtrain process could be improved. Currently it indeed lacks a lot of critical information on training, and that makes it look incomplete.&lt;br /&gt;&lt;br /&gt;Basically here is what I would like to put into the next versions of sphinxtrain and the sphinxtrain tutorial:&lt;br /&gt;&lt;br /&gt;&lt;ol&gt;&lt;li&gt;A description of how to prepare the data&lt;/li&gt;&lt;li&gt;Building the database transcription. By the way, what has bothered me for the last month is the requirement to have fileids. I really think the file with fileids could be silently dropped. What's the problem with getting the id of the file from the transcription labels?&lt;/li&gt;&lt;li&gt;Automatic splitting into training data, testing data and development data. I see the presence of development data as a hard requirement for the training process. Unfortunately, the current documentation lacks it. There could be code to do that, but for most databases it's automatic of course.&lt;/li&gt;&lt;li&gt;Bootstrapping from hand-labelled data. I think of this as an important part of training, and HTK results confirm that. In general it repeats human language learning, so I think it's natural as well. &lt;/li&gt;&lt;li&gt;Training&lt;/li&gt;&lt;li&gt;Optimizing the number of senones and mixtures on a development set&lt;/li&gt;&lt;li&gt;Optimizing the most important parameters, like language weight, on the development set. This part is complicated as I see it. First of all, the reasoning behind proper language weight scaling is still unclear to me; I could one day write a separate post on it. Basically it depends on everything, even on the decoder&lt;/li&gt;&lt;li&gt;Testing on the test set&amp;nbsp;&lt;/li&gt;&lt;/ol&gt;&amp;nbsp;If it's possible to keep this as straightforward as it is now, that would be just perfect.
If I start writing the chapter in a week, it could probably be ready by summer.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5075537609830514591-2398026450449460643?l=nsh.nexiwave.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nsh.nexiwave.com/feeds/2398026450449460643/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://nsh.nexiwave.com/2010/01/training-process.html#comment-form' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/2398026450449460643'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/2398026450449460643'/><link rel='alternate' type='text/html' href='http://nsh.nexiwave.com/2010/01/training-process.html' title='Training process'/><author><name>Nickolay V. Shmyrev</name><uri>http://www.blogger.com/profile/11220369315272283124</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://4.bp.blogspot.com/-mLPNJlEdMV8/TbZcF3qUEqI/AAAAAAAAAKA/6PXGkO0cmAk/s220/me.jpg'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5075537609830514591.post-2317437272004589748</id><published>2010-01-21T02:19:00.002+03:00</published><updated>2010-05-13T01:36:44.725+04:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='speech recognition'/><title type='text'>Moving Beyond the `Beads-On-A-String'</title><content type='html'>Recently I've become interested in quite a large domain of speech recognition research where old-school linguistics meets modern speech recognition. Basically the idea is that in spontaneous speech the variability is so huge that the phonetic transcription from the dictionary doesn't apply well.
In a plain CMUSphinx setup, linguistic information about phones is almost lost: we don't care whether a phone is labial or dental. It is used in decision tree building, but it's not clear if such usage helps. It's definitely not good to drop such a huge amount of information that could help with classification. So this idea is being actively developed, and you can probably find there everything you miss: distinctive phone features, landmarks, spectrogram recognition.&lt;br /&gt;&lt;br /&gt;I went through the following articles; the number of methods, approaches and implementations described there is really huge. In other articles it's going to be even bigger:&lt;br /&gt;&lt;br /&gt;S.&amp;nbsp;King, J.&amp;nbsp;Frankel, K.&amp;nbsp;Livescu, E.&amp;nbsp;McDermott, K.&amp;nbsp;Richmond, and M.&amp;nbsp;Wester. Speech production knowledge in automatic speech recognition. &lt;i&gt;Journal of the Acoustical Society of America&lt;/i&gt;, 121(2):723-742, February 2007. &lt;a href="http://www.cstr.ed.ac.uk/publications/users/simonk_bib.html#king07:JASA2007"&gt;&lt;/a&gt; &lt;a href="http://www.cstr.ed.ac.uk/downloads/publications/2007/King_et_al_review.pdf"&gt;PDF&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;Moving Beyond the `Beads-On-A-String' Model of Speech by M. Ostendorf &lt;a href="http://ssli.ee.washington.edu/papers/abstracts/asru99-ostendorf.ps"&gt;PDF&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;Speaking In Shorthand - A Syllable-Centric Perspective For Understanding Pronunciation Variation by Steven Greenberg &lt;a href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.48.811&amp;amp;rep=rep1&amp;amp;type=pdf"&gt;PDF&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;To be honest, the only idea from the articles that stuck in my mind is that reductions in fast speech are the root of the problem.
I also noticed it in the early days and was experimenting with skip states. Skips didn't give any improvements, only reduced speed. It will probably help to automatically increase lexicon variability and use forced alignment to get the proper pronunciation, at least at the training stage. As I understand it, I just need to take a dictionary with syllabification and create a dictionary with a lot of reduced variants where onsets are kept as is and codas are reduced in some form. Then we force align, then train. The acoustic model will probably be better then.&lt;br /&gt;&lt;br /&gt;Another striking point was that I haven't found any significant accuracy improvement in the articles I read. An improvement like the 20% seen with discriminative training could make any method widely adopted, but nothing like that is mentioned. This research is probably at a very early stage.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5075537609830514591-2317437272004589748?l=nsh.nexiwave.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nsh.nexiwave.com/feeds/2317437272004589748/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://nsh.nexiwave.com/2010/01/moving-beyond-beads-on-string.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/2317437272004589748'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/2317437272004589748'/><link rel='alternate' type='text/html' href='http://nsh.nexiwave.com/2010/01/moving-beyond-beads-on-string.html' title='Moving Beyond the `Beads-On-A-String&apos;'/><author><name>Nickolay V. 
Shmyrev</name><uri>http://www.blogger.com/profile/11220369315272283124</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://4.bp.blogspot.com/-mLPNJlEdMV8/TbZcF3qUEqI/AAAAAAAAAKA/6PXGkO0cmAk/s220/me.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5075537609830514591.post-6672931118844396007</id><published>2010-01-16T05:10:00.004+03:00</published><updated>2010-01-25T22:27:47.062+03:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='ivr'/><title type='text'>Three Generation of IVR Systems</title><content type='html'>Recently I invented a nice new concept for marketing people. Basically there are three generations of IVR systems right now:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Generation 1.0 - Static systems based on VoiceXML. It was surprising to me that they are in wide use now and that a lot of products are dedicated to their optimization/development. There are IDEs and a lot of testing tools, and recommendations on how to build proper VoiceXML. Come on, it's impossible to do that. It's something like the static HTML websites that were popular in 1995. I don't believe any changes like JavaScript inside VXML 3.0 will stop its slow death.&lt;/li&gt;&lt;li&gt;Generation 2.0 - Dynamic systems like &lt;a href="http://tropo.com/"&gt;Tropo&lt;/a&gt; from Voxeo. Much easier, much better. More control over content, more integration with the business logic. I really believe it's the next generation because it gives the developer much more control over the dialog. At least with the power of a real scripting language like Python you'll be able to implement something non-trivial with just several lines of code. That's the AJAX or RoR of the speech world.&lt;/li&gt;&lt;li&gt;Generation 3.0 - Semantic-based IVR. This consists of three components: a large vocabulary recognizer, a semantic recognizer on top of it and event-based actions on top of that.
Probably also emotion recognition and more intelligent dialog tracking. As I see it, the developer has to define the structure of the dialog and provide handlers. &lt;a href="http://wiki.speech.cs.cmu.edu/olympus/index.php/Olympus"&gt;Such a system&lt;/a&gt; was described and developed at CMU a long time ago, and it's also described in all ASR textbooks. But I'm not aware of any widely known platform that allows this kind of IVR. Once again it shows how big the gap is between academia and software developers.&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;If you are planning to create an IVR application with CMUSphinx, please consider IVR generation 3 as your base technology ;) And don't forget to share the code.&lt;br /&gt;&lt;br /&gt;Update:&lt;br /&gt;&lt;br /&gt;Very much on the same topic from the wonderful Nu Echo blog:&lt;br /&gt;&lt;br /&gt;&lt;a href="http://blog.nuecho.com/2010/01/25/voice-apis-back-to-basics/"&gt;http://blog.nuecho.com/2010/01/25/voice-apis-back-to-basics/&lt;/a&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5075537609830514591-6672931118844396007?l=nsh.nexiwave.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nsh.nexiwave.com/feeds/6672931118844396007/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://nsh.nexiwave.com/2010/01/three-generation-of-ivr-systems.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/6672931118844396007'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/6672931118844396007'/><link rel='alternate' type='text/html' href='http://nsh.nexiwave.com/2010/01/three-generation-of-ivr-systems.html' title='Three Generation of IVR Systems'/><author><name>Nickolay V. 
Shmyrev</name><uri>http://www.blogger.com/profile/11220369315272283124</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://4.bp.blogspot.com/-mLPNJlEdMV8/TbZcF3qUEqI/AAAAAAAAAKA/6PXGkO0cmAk/s220/me.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5075537609830514591.post-4289662638647351336</id><published>2010-01-12T03:07:00.000+03:00</published><updated>2010-01-12T03:07:22.743+03:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='sphinx'/><title type='text'>PLP is going to be default soon</title><content type='html'>It looks like MFCC features are going to become history. Everyone is using 9 combined PLP frames followed by an LDA projection to 40-50 values. A few examples include &lt;a href="http://www.computer.org/portal/web/csdl/doi/10.1109/ICASSP.2009.4960723"&gt;Google in its audio indexing system&lt;/a&gt;, IBM and BBN (&lt;a href="http://www.itl.nist.gov/iad/mig/tests/std/"&gt;see system description in results&lt;/a&gt;), &lt;a href="http://www.cslu.ogi.edu/~zak/std07.pdf"&gt;OGI/ICSI&lt;/a&gt; and many others.&lt;br /&gt;&lt;br /&gt;The issue right now is that the sphinx4 PLP implementation seems to be broken: it produces garbage features which don't give enough accuracy after training. Luckily there is HTK. Once this issue gets fixed, I think I'll retrain the PLP + MLLT model for Voxforge.
Unfortunately I don't have any definite plan for implementing PLP in sphinxbase.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5075537609830514591-4289662638647351336?l=nsh.nexiwave.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nsh.nexiwave.com/feeds/4289662638647351336/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://nsh.nexiwave.com/2010/01/plp-is-going-to-be-default-soon.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/4289662638647351336'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/4289662638647351336'/><link rel='alternate' type='text/html' href='http://nsh.nexiwave.com/2010/01/plp-is-going-to-be-default-soon.html' title='PLP is going to be default soon'/><author><name>Nickolay V. Shmyrev</name><uri>http://www.blogger.com/profile/11220369315272283124</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://4.bp.blogspot.com/-mLPNJlEdMV8/TbZcF3qUEqI/AAAAAAAAAKA/6PXGkO0cmAk/s220/me.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5075537609830514591.post-5323571991377324530</id><published>2010-01-05T01:18:00.001+03:00</published><updated>2010-05-13T01:37:00.375+04:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='speech recognition'/><title type='text'>Post-biological ASR</title><content type='html'>Recently I had a discussion with ionel on the #cmusphinx channel on freenode about what a perfect speech recognition engine should do.
Though I didn't quite understand the purpose of the question, the answer I could give was to make it as good as a native speaker at converting audio into text. &lt;br /&gt;&lt;br /&gt;I watched an &lt;a href="http://www.youtube.com/watch?v=QROMNOEI3PQ"&gt;interview with Ray Kurzweil&lt;/a&gt; today; nothing really interesting there except the idea of a post-biological future where computers replace humans. I understood that my definition of ASR is not a very good one, simply because it's a well-established idea that computers will soon become far better than humans at most tasks, like they are already better at playing chess. I tend to forget this over and over, but it's perfectly reasonable to try to be better than humans and not just mimic human functions. Automatic recognizers could be better in terms of speed, energy consumption and accuracy. Let's hope this year will bring us closer to such a future. What will speech be then, that's the question.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5075537609830514591-5323571991377324530?l=nsh.nexiwave.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nsh.nexiwave.com/feeds/5323571991377324530/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://nsh.nexiwave.com/2010/01/post-biological-asr.html#comment-form' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/5323571991377324530'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/5323571991377324530'/><link rel='alternate' type='text/html' href='http://nsh.nexiwave.com/2010/01/post-biological-asr.html' title='Post-biological ASR'/><author><name>Nickolay V. 
Shmyrev</name><uri>http://www.blogger.com/profile/11220369315272283124</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://4.bp.blogspot.com/-mLPNJlEdMV8/TbZcF3qUEqI/AAAAAAAAAKA/6PXGkO0cmAk/s220/me.jpg'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5075537609830514591.post-5590775352542194643</id><published>2010-01-04T01:02:00.001+03:00</published><updated>2010-01-04T01:06:48.406+03:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='random stuff'/><title type='text'>Greetings and Random Thoughts</title><content type='html'>So 2010 is here, Happy New Year everyone. Wish you all success and happiness and of course increased decoder accuracy! Now we have a long 10-day vacation in Russia, time to travel, eat, drink, sort out bookmarks, read the books on the shelf and watch pending Google tech talks. Santa also promised me to make some great changes in sphinx4, so I'm waiting for that as well.&lt;br /&gt;&lt;br /&gt;Though &lt;a href="https://www.ohloh.net/p/534"&gt;Ohloh&lt;/a&gt; doesn't confirm that, I have a strong feeling that last year the activity around CMUSphinx definitely increased and its usage is going to grow. &lt;br /&gt;&lt;br /&gt;I was thinking a little about what the direction of sphinx4 development should be, and I think we should consider several factors here. I would be happy to see it as a widely used enterprise-level speech recognition engine with a great list of features, but I completely understand that due to the lack of resources it's naive to think we'll be able to do it all. We definitely need to find a market sector for the sphinx project and grow using it. There are already well-established projects like HTK that are widely used, each with its own set of strengths and weaknesses. Julius is widely used as a large vocabulary speech recognition engine with HTK models. 
It's hard for us to compete with HTK just because it would take years to add flexibility that we probably don't even need. Consider a variable or adjustable number of states per phone, something that has only been proven useful for small vocabulary tasks, which we aren't really interested in and, I hope, will not be interested in in the near future. What could be different is our practical orientation.&lt;br /&gt;&lt;br /&gt;Many projects in the speech domain and related areas have grown out of research projects and, though sometimes flexible, are often really unusable in applications since they weren't designed for that. Usually a research project isn't well documented and has a lot of ways to implement the same thing, some of which are obsolete. Bugs are rarely fixed and documentation is almost missing. Releases are not stable. It's definitely a large field for a commercial support company.&lt;br /&gt;&lt;br /&gt;There is a different side: many projects are created to solve user needs, are more or less well documented and have stable interfaces and a large open community, but they do things so wrong internally that I always wonder how they are used at all. &lt;a href="http://espeak.sourceforge.net"&gt;Espeak&lt;/a&gt; is one example, with its amazingly bad speech synthesis quality and even more amazing popularity. Its out-of-date synthesis method doesn't let it become good with any possible modifications. Another striking example of this is &lt;a href="http://lucene.apache.org/java/docs/"&gt;Lucene&lt;/a&gt;. The &lt;a href="http://www.lucidimagination.com/blog/2009/12/24/the-apache-lucene-ecosystem-my-view-of-2009/"&gt;lucidimagination blog states&lt;/a&gt; that the Lucene community is thriving, but that's definitely not true. Research articles like &lt;a href="http://trec.nist.gov/pubs/trec16/papers/ibm-haifa.mq.final.pdf"&gt;Lucene and Juru at Trec 2007: 1-Million Queries Track&lt;/a&gt; definitely show there is something wrong with Lucene. 
Basically that article lists several trivial changes, well known in the research community, that make Lucene perform two times better on a standard test. I can't understand why they weren't integrated into the stock version in the three years since the article was published.&lt;br /&gt;&lt;br /&gt;Let's hope CMUSphinx will find its place somewhere in the middle. Also, let's hope this year will bring more useful posts, decreasing the information overload that is certainly going to be a problem in the near future.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5075537609830514591-5590775352542194643?l=nsh.nexiwave.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nsh.nexiwave.com/feeds/5590775352542194643/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://nsh.nexiwave.com/2010/01/greetings-and-random-thoughts.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/5590775352542194643'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/5590775352542194643'/><link rel='alternate' type='text/html' href='http://nsh.nexiwave.com/2010/01/greetings-and-random-thoughts.html' title='Greetings and Random Thoughts'/><author><name>Nickolay V. 
Shmyrev</name><uri>http://www.blogger.com/profile/11220369315272283124</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://4.bp.blogspot.com/-mLPNJlEdMV8/TbZcF3qUEqI/AAAAAAAAAKA/6PXGkO0cmAk/s220/me.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5075537609830514591.post-860184033641249157</id><published>2009-12-16T04:31:00.002+03:00</published><updated>2009-12-16T04:44:06.820+03:00</updated><title type='text'>Pocketsphinx Success Story</title><content type='html'>&lt;div style="text-align: justify;"&gt;I was pleased to discover the renovated website of Keith Vertanen and an amazing real-life example of building a pocketsphinx application: Parakeet, a dictation app with correction for the Nokia N800:&lt;br /&gt;&lt;/div&gt;&lt;br /&gt;&lt;a href="http://www.keithv.com/software/parakeet/n800/"&gt;http://www.keithv.com/software/parakeet/n800/&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;Keith's website and models are an invaluable resource for Sphinx developers; in particular, his lm_giga models are still the ones I would recommend for adaptation. 
But seeing this application in action and reading about its development should really give good insight into the process of building a speech recognition application, with all the good practices described.&lt;br /&gt;&lt;br /&gt;&lt;div style="text-align: center;"&gt;&lt;object height="344" width="425"&gt;&lt;param name="movie" value="http://www.youtube.com/v/DV-WNQFe-LM&amp;fs=1&amp;"&gt;&lt;/param&gt;&lt;param name="allowFullScreen" value="true"&gt;&lt;/param&gt;&lt;param name="allowscriptaccess" value="always"&gt;&lt;/param&gt;&lt;embed src="http://www.youtube.com/v/DV-WNQFe-LM&amp;hl=ru_RU&amp;fs=1&amp;" type="application/x-shockwave-flash" allowscriptaccess="always" allowfullscreen="true" width="425" height="344"&gt;&lt;/embed&gt;&lt;/object&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5075537609830514591-860184033641249157?l=nsh.nexiwave.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nsh.nexiwave.com/feeds/860184033641249157/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://nsh.nexiwave.com/2009/12/pocketsphinx-success-story.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/860184033641249157'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/860184033641249157'/><link rel='alternate' type='text/html' href='http://nsh.nexiwave.com/2009/12/pocketsphinx-success-story.html' title='Pocketsphinx Success Story'/><author><name>Nickolay V. 
Shmyrev</name><uri>http://www.blogger.com/profile/11220369315272283124</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://4.bp.blogspot.com/-mLPNJlEdMV8/TbZcF3qUEqI/AAAAAAAAAKA/6PXGkO0cmAk/s220/me.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5075537609830514591.post-5065762494543453917</id><published>2009-12-12T20:07:00.002+03:00</published><updated>2009-12-12T21:09:42.163+03:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='speech recognition'/><title type='text'>Core Ideas Behind Speech Recognition</title><content type='html'>While tuning the acoustic model I again got 40% WER and the following in the log:&lt;br /&gt;&lt;br /&gt;THEY'RE ONLY ALLOWED TEN&amp;nbsp; TO&amp;nbsp;&amp;nbsp;&amp;nbsp; a&amp;nbsp;&amp;nbsp; CLASS&amp;nbsp;&amp;nbsp;&amp;nbsp; (a100)&lt;br /&gt;***&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; HAD&amp;nbsp; BARELY&amp;nbsp; LEAD CANDY a&amp;nbsp;&amp;nbsp; CLASSIC&amp;nbsp; (a100)&lt;br /&gt;Words: 7 Correct: 1 Errors: 6 Percent correct = 14.29% Error = 85.71% Accuracy = 14.29&lt;br /&gt;&lt;br /&gt;If you check this recognition error you'll find that it's almost impossible to find the reason for it and fix it. Perhaps some senone was trained incorrectly, perhaps CMN gave an error or clipping made the MFCC wrong. Perhaps some noise in the middle broke the search. There is nothing you can do about it. That made me think about the foundations of ASR.&lt;br /&gt;&lt;br /&gt;Considering a speech recognition engine like sphinx4, one could extract the set of core ideas that lie behind it. The same ideas are usually described in speech recognition textbooks. 
Basically they are:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;MFCC feature extraction from periodic frames (or PLP, doesn't matter)&lt;br /&gt;&lt;/li&gt;&lt;li&gt;HMM classifier for acoustic scoring (with state tying)&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Trigram word-based language model (higher-order n-grams aren't effective, lower-order ones are not so precise)&lt;/li&gt;&lt;li&gt;Dynamic search with pruning&lt;/li&gt;&lt;/ul&gt;Surely commercial systems have a lot of improvements over this baseline, but the core is still the same. Such foundations are certainly reasonable and have been checked in practice over the years. It's hard to argue against them. Newbies often say that something is wrong here, but basically that's because they don't really understand how it works. Criticism also comes from old-school linguists, who do everything with rules and are more interested in individual cases like the &lt;a href="http://phonetic-blog.blogspot.com/2009/11/schedule.html"&gt;pronunciation of "schedule"&lt;/a&gt; than in theory.&lt;br /&gt;&lt;br /&gt;The only issue is that a growing number of unsolvable, unexplainable problems like the accuracy problem above breaks this theory. This is quite unusual for me as a mathematician, since in mathematics theories rarely become invalid. They transform and grow, but usually all of them are stated once and forever. In natural sciences like physics it's usual. The aether theory and&amp;nbsp;&lt;a href="http://en.wikipedia.org/wiki/Mechanical_explanations_of_gravitation"&gt;mechanical explanation of gravitation&lt;/a&gt; are good examples that come to my mind. So there is nothing wrong if this ideology of speech recognition is reviewed and modified according to recent findings.&lt;br /&gt;&lt;br /&gt;What I would put into such a modified theory:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;&lt;b&gt;Multiresolution feature extraction&lt;/b&gt;. Starting from RASTA to fMPE and &lt;a href="http://portal.acm.org/citation.cfm?id=1453653"&gt;spikes&lt;/a&gt;. 
The idea is that signals are sparse and nonperiodic; they range from 10 ms to more than 10 seconds, and all of them need to be passed to the classifier.&lt;/li&gt;&lt;li&gt;&lt;b&gt;An acoustic classifier without selected states.&lt;/b&gt; The idea of a phone is probably natural in slow speech or in teaching, but I have heard so many complaints about it. Dropping it seems promising indeed, since speech is a process, not a sequence of states. Unfortunately I haven't found any article on this yet. Another promising idea here is &lt;a href="http://nlp.stanford.edu/IR-book/html/htmledition/soft-margin-classification-1.html"&gt;margins&lt;/a&gt;, which could help with out-of-model sounds.&lt;/li&gt;&lt;li&gt;&lt;b&gt;Subword stage.&lt;/b&gt; I increasingly think that languages with developed morphology like Turkish are more the rule than the exception. Being able to recognize a large set of words in the language is a core capability of a usable recognizer, and that forces it to operate on subword units. Even an English recognizer could benefit from this.&lt;/li&gt;&lt;li&gt;&lt;b&gt;Language model without backoff.&lt;/b&gt; I recently had a discussion with David about that and would like to thank him for this idea. Indeed, the counts of the model seem to be a reasonable statistic one could keep and use. But the further calculation of the language weight should be modified completely. Again, there must be a margin to strip combinations that will never appear in the language. This idea of using prohibitive rules has been on my mind for a long time. It would also be nice to find recent articles on this. But there must be a component that will invalidate output like "barely lead candy".&lt;/li&gt;&lt;li&gt;&lt;b&gt;Machine learning for backoff calculations.&lt;/b&gt; In continuation of the previous point, the backoff weight should have a much more complex structure. 
Not only trigrams containing the words need to be taken into account, a semantic class should be counted, trigrams with similar class of words ought to be considered. Today I even had idea to apply machine learning to calculate the backoffs. I'm sure someone did this before, also need to look at articles about using machine learning methods to restrict search.&lt;/li&gt;&lt;/ul&gt;As for tree search, it luckily will stay as is, nothing to argue against it right now. Not sure that such modifications are breaking the initial theory, one could say they aren't really different. I still think they could explain the speech better and help to build better speech recognizer.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5075537609830514591-5065762494543453917?l=nsh.nexiwave.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nsh.nexiwave.com/feeds/5065762494543453917/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://nsh.nexiwave.com/2009/12/core-ideas-behind-speech-recognition.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/5065762494543453917'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/5065762494543453917'/><link rel='alternate' type='text/html' href='http://nsh.nexiwave.com/2009/12/core-ideas-behind-speech-recognition.html' title='Core Ideas Behind Speech Recognition'/><author><name>Nickolay V. 
Shmyrev</name><uri>http://www.blogger.com/profile/11220369315272283124</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://4.bp.blogspot.com/-mLPNJlEdMV8/TbZcF3qUEqI/AAAAAAAAAKA/6PXGkO0cmAk/s220/me.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5075537609830514591.post-67785718121741688</id><published>2009-12-04T23:11:00.000+03:00</published><updated>2009-12-04T23:11:21.845+03:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='sphinx'/><title type='text'>New CMUSphinx Website Alpha</title><content type='html'>Most CMU Sphinx websites are outdated. The problems with the one at sourceforge are: &lt;br /&gt;&lt;ul&gt;&lt;li&gt;Not so modern style&lt;/li&gt;&lt;li&gt;No interactivity&lt;/li&gt;&lt;li&gt;Loosely organized outdated information&lt;/li&gt;&lt;li&gt;Hard to manage/update&lt;/li&gt;&lt;li&gt;No CMS/search&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;Also there is a generic problem with the quality of documentation available. A lot is quite outdated and just confusing.&lt;br /&gt;&lt;br /&gt;So I wanted to build a new website for a long time. This site is supposed to be central point for all sphinx tools, including pocketsphinx, sphinx4, cmuclmtk and sphinxtrain.&lt;br /&gt;&lt;br /&gt;New website is supposed to be interesting. This site is going to bring more interactivity (sharing, blog  posts, voting, comments). It looks a little bit bloggish, but I think it's even better. It would be harder to write more interesting posts, so I invite everyone to participate. I'm sure you have something to say. 
&lt;br /&gt;&lt;br /&gt;So here is the proposed demo version&lt;br /&gt;&lt;a href="http://cmusphinx.sourceforge.net/wordpress"&gt;http://cmusphinx.sourceforge.net/wordpress&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;We are in the process of transferring the information to the new website, so I really hope to see it running very soon.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5075537609830514591-67785718121741688?l=nsh.nexiwave.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nsh.nexiwave.com/feeds/67785718121741688/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://nsh.nexiwave.com/2009/12/new-cmusphinx-website-alpha.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/67785718121741688'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/67785718121741688'/><link rel='alternate' type='text/html' href='http://nsh.nexiwave.com/2009/12/new-cmusphinx-website-alpha.html' title='New CMUSphinx Website Alpha'/><author><name>Nickolay V. 
Shmyrev</name><uri>http://www.blogger.com/profile/11220369315272283124</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://4.bp.blogspot.com/-mLPNJlEdMV8/TbZcF3qUEqI/AAAAAAAAAKA/6PXGkO0cmAk/s220/me.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5075537609830514591.post-8513644059370702627</id><published>2009-11-30T23:42:00.002+03:00</published><updated>2009-11-30T23:45:11.255+03:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='speech recognition'/><category scheme='http://www.blogger.com/atom/ns#' term='dictation'/><title type='text'>How to create a speech recognition application for your needs</title><content type='html'>Sometimes people ask why there are no high-quality open source speech recognition applications (dictation applications, IVR applications, closed-caption alignment, language acquisition and so on). The answer obviously is that nobody has written them and made them public. It's often noted, for example by Voxforge, that we lack a database for the acoustic model. I admit Voxforge has its reasons to state that we need a database. But that's only a small part of the problem, not the problem as a whole.&lt;br /&gt;&lt;br /&gt;And as always happens, the way the question is stated doesn't allow a constructive answer. To get a constructive answer you need to ask a different question: how do I create a speech recognition application?&lt;br /&gt;&lt;br /&gt;To answer this, let me provide an example. Suppose we want to develop a Flash-based dictation website. 
The dictation application consists of the following parts which should be created&lt;br /&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Website, user accounting, user-dependent information storage&amp;nbsp;&lt;/li&gt;&lt;li&gt;Initial acoustic and language models trained with Voxforge audio and other free sources transmitted through Flash codecs&lt;/li&gt;&lt;li&gt;Recognizer setup to convert incoming streams into text. Distributed computation framework for the recognizer&lt;/li&gt;&lt;li&gt;Recognizer frontend with noise cancellation and VAD&lt;/li&gt;&lt;li&gt;Acoustic model adaptation framework to let user adapt the generic acoustic model to their pronunciation&amp;nbsp; &lt;/li&gt;&lt;li&gt;Language model adaptation framework&lt;/li&gt;&lt;li&gt;Transcription control package that will process commands during dictation like error correction ones or punctuation ones.&lt;/li&gt;&lt;li&gt;Post-processing package to put punctuation and capitalization, date and acronym post-processing&lt;/li&gt;&lt;li&gt;Test framework for dictation with dictation recordings and ability to check dictation effectiveness&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;Everything above could be done with open source tools and have approximately equal complexity and require minimum specialized knowledge. Performance-wise this system should be usable for a large vocabulary dictation for a wide range of users. The core components are:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Red5 streaming server&lt;/li&gt;&lt;li&gt;Adobe Flex SDK&lt;/li&gt;&lt;li&gt;Sphinx4&lt;/li&gt;&lt;li&gt;Sphinxtrain&lt;/li&gt;&lt;li&gt;Language model toolkit&lt;/li&gt;&lt;li&gt;Voxforge acoustic database&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;So you see mostly it's just an implementation of the existing algorithms and technologies. No rocket science. 
This makes me think that such application is just a matter of time.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5075537609830514591-8513644059370702627?l=nsh.nexiwave.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nsh.nexiwave.com/feeds/8513644059370702627/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://nsh.nexiwave.com/2009/11/how-to-create-speech-recognition.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/8513644059370702627'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/8513644059370702627'/><link rel='alternate' type='text/html' href='http://nsh.nexiwave.com/2009/11/how-to-create-speech-recognition.html' title='How to create a speech recognition application for your needs'/><author><name>Nickolay V. 
Shmyrev</name><uri>http://www.blogger.com/profile/11220369315272283124</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://4.bp.blogspot.com/-mLPNJlEdMV8/TbZcF3qUEqI/AAAAAAAAAKA/6PXGkO0cmAk/s220/me.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5075537609830514591.post-6984533869092257393</id><published>2009-11-27T23:51:00.002+03:00</published><updated>2009-11-28T00:01:17.503+03:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='speech recognition'/><title type='text'>Multiview Representations On Interspeech</title><content type='html'>From my experience, in every activity it's important to have a multilevel view of it; interestingly, that is both part of &lt;a href="http://en.wikipedia.org/wiki/Getting_Things_Done"&gt;Getting Things Done&lt;/a&gt; and just a good practice in &lt;a href="http://en.wikipedia.org/wiki/Model-driven_engineering"&gt;software development&lt;/a&gt;. Multiple models of the process or just different views help to understand what's going on. The only problem is to make those views consistent. That reminds me of the &lt;a href="http://en.wikipedia.org/wiki/Matryoshka_doll"&gt;Russian model of the world&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;So it's actually very interesting to get a high-level overview of what's going on in speech recognition. Luckily, to do that you just need to review some conference materials or journal articles. The latter is more complicated, while the former is feasible. So here are some topics from the plenary talks at Interspeech. Surprisingly, they are rather consistent with each other and I hope they really present trends, not just selected topics.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Speech To Information&lt;/b&gt;&lt;br /&gt;by Mari Ostendorf&lt;br /&gt;&lt;br /&gt;Multilevel representation gets more and more important, in particular in speech recognition. 
The most complicated task, recognition of spontaneous meeting recordings, requires unification of the recognition efforts on all levels from the acoustic representation to the semantic one. It's nice to call this approach "Speech To Information": as a result of speech recognition, not just the words are recovered but also the syntactic and semantic structure of the talk. One interesting task is, for example, restoration of punctuation and capitalization, something that SRILM&amp;nbsp;&lt;a href="http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.30.1772"&gt;does&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;The good thing is that a test database for such material is already &lt;a href="http://corpus.amiproject.org/"&gt;available&lt;/a&gt; for free download. It is very uncommon to have such a representative database freely accessible. The AMI corpus looks like an amazing piece of work.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Single Method&lt;/b&gt;&lt;br /&gt;by Sadaoki Furui&lt;br /&gt;&lt;br /&gt;The WFST-based T3 decoder looks quite impressive. A single method of data representation used everywhere, which more importantly allows combination of models, gives a wonderful opportunity. For example, consider building a high-quality Icelandic ASR system by combining the WFST for an English system with a very basic Icelandic one. I imagine the decoder is really simple since basically all structures, including G2P rules and language and acoustic models, could be weighted finite-state automata.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Bayesian Learning&lt;/b&gt;&lt;br /&gt;by Tom Griffiths&lt;br /&gt;&lt;br /&gt;&lt;a href="http://en.wikipedia.org/wiki/Hierarchical_Bayes_model"&gt;Hierarchical Bayesian learning&lt;/a&gt; and things like &lt;a href="http://en.wikipedia.org/wiki/Compressed_sensing"&gt;compressed sensing&lt;/a&gt; seem to be hot topics in machine learning. Google does &lt;a href="http://www.youtube.com/watch?v=FO0fgVS9OmE"&gt;that&lt;/a&gt;. 
There are already some efforts to implement a speech recognizer based on hierarchical Bayesian learning. Indeed, it looks impressive to just feed audio to the recognizer and make it understand you.&lt;br /&gt;&lt;br /&gt;Though the probabilistic point of view has always been questionable as opposed to precise discriminative methods like MPE, I'm still looking forward to seeing progress here. Although a huge amount of audio is required (I remember estimates of about 100000 hours), I think it's feasible nowadays. For example, &lt;a href="http://www.youtube.com/watch?v=AyzOUbkUf3M"&gt;it already recognizes&lt;/a&gt; written digits, so success looks really close. And again, it's also multilevel!&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5075537609830514591-6984533869092257393?l=nsh.nexiwave.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nsh.nexiwave.com/feeds/6984533869092257393/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://nsh.nexiwave.com/2009/11/multiview-representations-on.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/6984533869092257393'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/6984533869092257393'/><link rel='alternate' type='text/html' href='http://nsh.nexiwave.com/2009/11/multiview-representations-on.html' title='Multiview Representations On Interspeech'/><author><name>Nickolay V. 
Shmyrev</name><uri>http://www.blogger.com/profile/11220369315272283124</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://4.bp.blogspot.com/-mLPNJlEdMV8/TbZcF3qUEqI/AAAAAAAAAKA/6PXGkO0cmAk/s220/me.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5075537609830514591.post-2021195801200829987</id><published>2009-11-25T02:01:00.001+03:00</published><updated>2009-11-25T04:15:47.463+03:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='speech recognition'/><title type='text'>Few open source speech projects</title><content type='html'>It's interesting that a lot of activity around speech software has happened recently. I'm probably too impatient trying to track everything interesting. Even though ISCA-students added a &lt;a href="https://twitter.com/ISCAStudents"&gt;twitter feed&lt;/a&gt; recently, their website still needs a lot of care. Hopefully, &lt;a href="http://voxforge.org/"&gt;Voxforge&lt;/a&gt; will become such a resource one day. There is a growing number of packages, tools, projects and events.&lt;br /&gt;&lt;br /&gt;For example, I recently got in touch with the &lt;a href="http://www.semaine-project.eu/"&gt;SEMAINE&lt;/a&gt; project led by DFKI, an effort to build a multimodal dialogue system which can interact with humans through a virtual character, sustain an interaction with a user for some time and react appropriately to the user's non-verbal behaviour. The sources are available and the new release is expected in December as far as I understand, so I'm definitely looking forward to it. The interesting thing is that SEMAINE incorporates an emotion recognition framework with libSVM as a classifier; such a framework would be useful in sphinx4, for example. 
Actually a lot of news comes now from European research institutes; projects from &lt;a href="http://www-i6.informatik.rwth-aachen.de/rwth-asr/"&gt;RWTH&lt;/a&gt; or &lt;a href="http://www.talp.cat/talp/index.php/ca/recursos/eines"&gt;TALP&lt;/a&gt; promise a lot.&lt;br /&gt;&lt;br /&gt;Another example is that I was pleased to find out that in 2009 there was a &lt;a href="http://www.itl.nist.gov/iad/mig/tests/rt/2009/index.html"&gt;rich transcription evaluation.&lt;/a&gt; I wonder why the results still aren't available and what progress has been made on the meeting transcription task since 2007.&lt;br /&gt;&lt;br /&gt;Probably I would sleep better if I didn't know all of the above :)&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5075537609830514591-2021195801200829987?l=nsh.nexiwave.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nsh.nexiwave.com/feeds/2021195801200829987/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://nsh.nexiwave.com/2009/11/few-open-source-speech-projects.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/2021195801200829987'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/2021195801200829987'/><link rel='alternate' type='text/html' href='http://nsh.nexiwave.com/2009/11/few-open-source-speech-projects.html' title='Few open source speech projects'/><author><name>Nickolay V. 
Shmyrev</name><uri>http://www.blogger.com/profile/11220369315272283124</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://4.bp.blogspot.com/-mLPNJlEdMV8/TbZcF3qUEqI/AAAAAAAAAKA/6PXGkO0cmAk/s220/me.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5075537609830514591.post-7290436293892505473</id><published>2009-11-12T12:38:00.001+03:00</published><updated>2009-11-12T12:39:04.646+03:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='sphinx4'/><category scheme='http://www.blogger.com/atom/ns#' term='srilm'/><title type='text'>Using SRILM server in sphinx4</title><content type='html'>Recently I've added support for the SRILM language model server to sphinx4, so it's possible to use much bigger models during the search while keeping the same memory requirements and, more importantly, during lattice rescoring. Lattice rescoring is still in progress, so here is how to use a network language model during search.&lt;br /&gt;&lt;br /&gt;SRILM has a number of advantages: for example, it implements a few interesting algorithms, and even for simple tasks like trigram language model creation it's way better than cmuclmtk. 
At least model pruning is supported.&lt;br /&gt;&lt;br /&gt;To start, first dump the language model vocabulary, since it's required by the linguist:&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;ngram -lm your.lm -write-vocab your.vocab&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;Then start the server with&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;ngram -server-port 5000 -lm your.lm&lt;/pre&gt;&lt;br /&gt;Configure the recognizer&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;&amp;lt;component name="rescoringModel"&lt;br /&gt;   type="edu.cmu.sphinx.linguist.language.ngram.NetworkLanguageModel"&amp;gt;&lt;br /&gt;   &amp;lt;property name="port" value="5000"/&amp;gt;&lt;br /&gt;   &amp;lt;property name="location" value="your.vocab"/&amp;gt;&lt;br /&gt;   &amp;lt;property name="logMath" value="logMath"/&amp;gt;&lt;br /&gt;&amp;lt;/component&amp;gt;&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;And start the lattice demo. You'll see the result soon.&lt;br /&gt;&lt;br /&gt;Adjust the cache according to the size of your model. It shouldn't be large for a simple search; typically the cache size is no more than 100000.&lt;br /&gt;&lt;br /&gt;Still, using a higher-order n-gram model is not reasonable for a typical search because of the large number of word n-grams that must be tracked. It's more efficient to use a trigram or even a bigram model first and then make a second recognition pass that rescores the result with the large language model. 
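&lt;br /&gt;&lt;br /&gt;A compact first-pass model can be derived from the big one with SRILM's own pruning. This is only a sketch: the threshold 1e-8 and the output name first-pass.lm are assumptions chosen for illustration, so tune them for your data.&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;# read your.lm at trigram order and prune it (threshold is an assumed value)&lt;br /&gt;ngram -order 3 -lm your.lm -prune 1e-8 -write-lm first-pass.lm&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;The pruned model would then be loaded by the first-pass recognizer, while the full model stays behind the server for the second pass.&lt;br /&gt;&lt;br /&gt;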
More details on rescoring in the next posts.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5075537609830514591-7290436293892505473?l=nsh.nexiwave.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nsh.nexiwave.com/feeds/7290436293892505473/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://nsh.nexiwave.com/2009/11/using-srilm-server-in-sphinx4.html#comment-form' title='4 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/7290436293892505473'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/7290436293892505473'/><link rel='alternate' type='text/html' href='http://nsh.nexiwave.com/2009/11/using-srilm-server-in-sphinx4.html' title='Using SRILM server in sphinx4'/><author><name>Nickolay V. Shmyrev</name><uri>http://www.blogger.com/profile/11220369315272283124</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://4.bp.blogspot.com/-mLPNJlEdMV8/TbZcF3qUEqI/AAAAAAAAAKA/6PXGkO0cmAk/s220/me.jpg'/></author><thr:total>4</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5075537609830514591.post-7515190165910937721</id><published>2009-11-07T03:43:00.000+03:00</published><updated>2009-11-07T03:43:55.979+03:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='TTS'/><title type='text'>Rhythm of British English in Festival</title><content type='html'>Interesting how ideas rise from time to time in seemingly unrelated places. 
Recently I read a nice &lt;a href="http://phonetic-blog.blogspot.com/2009/09/period-piece.html"&gt;post&lt;/a&gt; in &lt;a href="http://phonetic-blog.blogspot.com/"&gt;John Wells's blog &lt;/a&gt;about proper RP English rhythm, and now the issue has come up again on the gnuspeech mailing list, where Dr. Hill cited his work&lt;br /&gt;&lt;br /&gt;&lt;a href="http://pages.cpsc.ucalgary.ca/%7Ehill/papers/isochrony-in-english-speech.pdf"&gt;JASSEM, W., HILL, D.R. &amp;amp; WITTEN, I.H. (1984) Isochrony in English speech: its statistical validity and linguistic relevance. Pattern, Process and Function in Discourse Phonology (collection ed. Davydd Gibbon), Berlin: de Gruyter, 203-225 (J) &lt;/a&gt;&lt;br /&gt;&lt;br /&gt;I spent some time thinking about how this rhythm is handled in Festival and came to the conclusion that there is no such entity there. It's probably handled somehow by the CART for duration and intonation prediction, but not as a separate entity. Though many voices are supposed to be US English, I still think they can benefit from proper rhythm prediction. Try the example from the movie, "This is the house that Jack built", with the arctic voices. 
Check if Jack gets enough stress.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5075537609830514591-7515190165910937721?l=nsh.nexiwave.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nsh.nexiwave.com/feeds/7515190165910937721/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://nsh.nexiwave.com/2009/11/rhythm-of-british-english-in-festival.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/7515190165910937721'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/7515190165910937721'/><link rel='alternate' type='text/html' href='http://nsh.nexiwave.com/2009/11/rhythm-of-british-english-in-festival.html' title='Rhythm of British English in Festival'/><author><name>Nickolay V. Shmyrev</name><uri>http://www.blogger.com/profile/11220369315272283124</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://4.bp.blogspot.com/-mLPNJlEdMV8/TbZcF3qUEqI/AAAAAAAAAKA/6PXGkO0cmAk/s220/me.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5075537609830514591.post-912961562842283645</id><published>2009-10-18T17:58:00.007+04:00</published><updated>2009-10-31T23:04:58.464+03:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='TTS'/><title type='text'>Blizzard 2009 results available</title><content type='html'>It was pleasant to find out that results of the &lt;a href="http://festvox.org/blizzard/blizzard2009.html"&gt;Blizzard Challenge 2009&lt;/a&gt; are now available. 
Thanks a lot to the organizers and participants!&lt;br /&gt;&lt;br /&gt;Reading the articles took me half a day of trying to solve the usual Einstein-type puzzle of figuring out who gave the best results and what was changed. Unfortunately it takes too much time to read everything in detail. There is no summary of the methods/systems used this year, of the achievements since last year, or of explanations for the results. I could only start with the following:&lt;br /&gt;&lt;br /&gt;&lt;ol&gt;&lt;li&gt;iFlytek Speech Lab and IVO Software are still the best. Unit selection systems win.&lt;/li&gt;&lt;li&gt;DFKI, which I was a fan of, unfortunately can't jump to a commercial level even with unit selection. That probably means that unit selection is not the only key issue.&lt;/li&gt;&lt;li&gt;I like the progress muXac and Mike have been making over the years.&lt;/li&gt;&lt;li&gt;The ES3 task of building a voice from a small amount of speech is kind of senseless. Don't we want to use voice adaptation in this case?&lt;/li&gt;&lt;li&gt;Interesting that machine learning for join and target cost optimization is popular nowadays &lt;/li&gt;&lt;li&gt;Though there was a telephone TTS task, it seems to me that nobody did anything related to TTS over telephone lines. 
The differences shouldn't be large; the 8 kHz bandwidth is the only issue, or even an advantage, but even this point is not covered in any article, or at least I didn't notice it.&lt;br /&gt;&lt;/li&gt;&lt;/ol&gt;Short summary of the systems:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Aholab - unit selection, spent &lt;b&gt;one&lt;/b&gt; day on building the voice so nothing good to expect&lt;/li&gt;&lt;li&gt;WISTON - Mandarin prosody is a key feature, but the article doesn't describe the challenge&lt;/li&gt;&lt;li&gt;Cereproc - experiment with combining HTS and unit selection, bad results for an unknown reason, 4 man-days spent&lt;/li&gt;&lt;li&gt;CMU - article is not available, but you can try clustergen yourself in stock festival&lt;br /&gt;&lt;/li&gt;&lt;li&gt;CSTR - CSTR has started investigations on HTS methods. Good start, no results yet.&lt;/li&gt;&lt;li&gt;DFKI - spent a year on adding Turkish TTS and the Mary 4.0 implementation&lt;/li&gt;&lt;li&gt;Edinburgh/Idiap - interesting unsupervised entry, results are obviously lower&lt;/li&gt;&lt;li&gt;I2R - good TTS, unit selection&lt;/li&gt;&lt;li&gt;Ivona - unit selection with pitch modifications by an interestingly named algorithm, best English one together with iFlytek&lt;br /&gt;&lt;/li&gt;&lt;li&gt;CircumReality - unit selection with pitch modification by TD-PSOLA, best progress over the years&lt;br /&gt;&lt;/li&gt;&lt;li&gt;NICT - HTS, GV, MGE and a lot of math&lt;/li&gt;&lt;li&gt;NIT - HTS with STRAIGHT, best HTS here, best Mandarin as well&lt;br /&gt;&lt;/li&gt;&lt;li&gt;NTUT - Mandarin HTS, not so interesting&lt;/li&gt;&lt;li&gt;PKU - Another Mandarin HTS with STRAIGHT&lt;/li&gt;&lt;li&gt;Toshiba - Good unit selection system, interesting method for fuzzy combining of units.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;iFlytek - HMM-driven unit selection, best English one together with Ivona.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;VUB - unit selection with WPSOLA, average, though with an interesting link to the SPRAAK open source recognition toolkit, which is not completely open but has 
interesting description.&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;Still, the challenge itself is very interesting and I'm looking forward to the next challenge results.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5075537609830514591-912961562842283645?l=nsh.nexiwave.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nsh.nexiwave.com/feeds/912961562842283645/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://nsh.nexiwave.com/2009/10/blizzard-2009-results-available.html#comment-form' title='5 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/912961562842283645'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/912961562842283645'/><link rel='alternate' type='text/html' href='http://nsh.nexiwave.com/2009/10/blizzard-2009-results-available.html' title='Blizzard 2009 results available'/><author><name>Nickolay V. Shmyrev</name><uri>http://www.blogger.com/profile/11220369315272283124</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://4.bp.blogspot.com/-mLPNJlEdMV8/TbZcF3qUEqI/AAAAAAAAAKA/6PXGkO0cmAk/s220/me.jpg'/></author><thr:total>5</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5075537609830514591.post-7743300553689049346</id><published>2009-10-16T11:43:00.000+04:00</published><updated>2009-10-16T11:43:09.359+04:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='acoustic model training'/><title type='text'>Another cool bit if hardware for database training.</title><content type='html'>It's sometimes hard to quickly adopt the new opportunities the world provides. 
I'm now reading &lt;a href="http://www.amazon.com/Innovators-Dilemma-Revolutionary-Business-Essentials/dp/0060521996"&gt;Innovator's Dilemma&lt;/a&gt; by &lt;span class="h3color" style="color: black;"&gt;Clayton M. Christensen&lt;/span&gt;&lt;span style="color: black;"&gt;. Thanks to Ellias for the advice, it really seems like a good book.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="color: black;"&gt;The interesting thing is that the author starts with a description of the hard drive industry as the fastest-moving one, with innovations outpacing customer needs. And what do you think? The hard drive industry strikes back with &lt;a href="http://en.wikipedia.org/wiki/Solid-state_drive"&gt;SSD drives&lt;/a&gt;. Well, I had read that they exist but didn't understand their value for acoustic model training. Even without profiling it's clear they will be extremely useful.&amp;nbsp;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="color: black;"&gt;Say you have a medium-sized acoustic database of 60 hours, a few gigabytes in size. If you want to process it fast you need to use an 8-core machine. Here comes the bottleneck: imagine 8 processes reading feature vectors from a disk in an almost random way. No need to guess, the hard drive will be very busy trying to fetch all the data required. 
SSD could definitely help here, I really need to try it soon.&lt;/span&gt;&lt;br /&gt;&lt;span style="color: black;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5075537609830514591-7743300553689049346?l=nsh.nexiwave.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nsh.nexiwave.com/feeds/7743300553689049346/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://nsh.nexiwave.com/2009/10/another-cool-bit-if-hardware-for.html#comment-form' title='3 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/7743300553689049346'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/7743300553689049346'/><link rel='alternate' type='text/html' href='http://nsh.nexiwave.com/2009/10/another-cool-bit-if-hardware-for.html' title='Another cool bit if hardware for database training.'/><author><name>Nickolay V. 
Shmyrev</name><uri>http://www.blogger.com/profile/11220369315272283124</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://4.bp.blogspot.com/-mLPNJlEdMV8/TbZcF3qUEqI/AAAAAAAAAKA/6PXGkO0cmAk/s220/me.jpg'/></author><thr:total>3</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5075537609830514591.post-154603044869660309</id><published>2009-09-29T22:41:00.001+04:00</published><updated>2009-09-29T22:43:13.589+04:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='speech recognition'/><category scheme='http://www.blogger.com/atom/ns#' term='sphinx'/><title type='text'>CMU Sphinx Users and Developers Workshop 2010</title><content type='html'>I'm happy to announce&lt;br /&gt;&lt;br /&gt;&lt;b&gt;The First CMU Sphinx Workshop&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;20 March 2010, Dallas, TX, USA&lt;br /&gt;&lt;br /&gt;Event URL: &lt;a href="http://www.cs.cmu.edu/%7Esphinx/Sphinx2010"&gt;http://www.cs.cmu.edu/~sphinx/Sphinx2010&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;Papers are solicited for the CMU Sphinx Workshop for Users and Developers (CMU-SPUD 2010), to be held in Dallas, Texas as a satellite to ICASSP 2010.&lt;br /&gt;&lt;br /&gt;CMU Sphinx is one of the most popular open source speech recognition systems. It is currently used by researchers and developers in many locations world-wide, including universities, research institutions and industry. CMU Sphinx's liberal license terms have made it a significant member of the open source community and have provided a low-cost way for companies to build businesses around speech recognition.&lt;br /&gt;&lt;br /&gt;The first SPUD workshop aims at bringing together CMU Sphinx users to report on applications, developments and experiments conducted using the system. This workshop is intended to be an open forum that will allow different user communities to become better acquainted with each other and to share ideas. 
It is also an opportunity for the community to help define the future evolution of CMU Sphinx.&lt;br /&gt;&lt;br /&gt;We are planning a one-day workshop with a limited number of oral presentations, chosen for breadth and stimulation, held in an informal atmosphere that promotes discussion. We hope this workshop will expose participants to different perspectives and that this in turn will help foster new directions in research, suggest interesting variations on current approaches and lead to new applications.&lt;br /&gt;&lt;br /&gt;Papers describing relevant research and new concepts are solicited on, but not limited to, the following topics. Papers must describe work performed with CMU Sphinx:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Decoders: PocketSphinx, Sphinx-2, Sphinx-3, Sphinx-4&lt;/li&gt;&lt;li&gt;Tools: SphinxTrain, CMU/Cambridge SLM toolkit&lt;/li&gt;&lt;li&gt;Innovations / additions / modifications of the system&lt;/li&gt;&lt;li&gt;Speech recognition in various languages&lt;/li&gt;&lt;li&gt;Innovative uses, not limited to speech recognition&lt;/li&gt;&lt;li&gt;Commercial applications&lt;/li&gt;&lt;li&gt;Open source projects that incorporate Sphinx&lt;/li&gt;&lt;li&gt;Novel demonstrations&lt;/li&gt;&lt;/ul&gt;Manuscripts must be between 4 and 6 pages long, in standard ICASSP double-column format. 
Accepted papers will be published in the workshop proceedings.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Important Dates&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;Paper submission:                  30 November 2009&lt;br /&gt;Notification of paper acceptance:  15 January  2010&lt;br /&gt;Workshop:                          20 March    2010&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Organizers&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;Bhiksha Raj - Carnegie Mellon University&lt;br /&gt;Evandro Gouvêa - Mitsubishi Electric Research Labs&lt;br /&gt;Richard Stern - Carnegie Mellon University&lt;br /&gt;Alex Rudnicky - Carnegie Mellon University&lt;br /&gt;Rita Singh - Carnegie Mellon University&lt;br /&gt;David Huggins-Daines - Carnegie Mellon University&lt;br /&gt;Nickolay Shmyrev - Nexiwave&lt;br /&gt;Yannick Estève - Laboratoire d'Informatique de l'Université du Maine&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Contact&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;To contact the organizers, please send email to &lt;a href="mailto:sphinx+workshop@cs.cmu.edu"&gt;sphinx+workshop@cs.cmu.edu&lt;/a&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5075537609830514591-154603044869660309?l=nsh.nexiwave.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nsh.nexiwave.com/feeds/154603044869660309/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://nsh.nexiwave.com/2009/09/cmu-sphinx-users-and-developers.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/154603044869660309'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/154603044869660309'/><link rel='alternate' type='text/html' href='http://nsh.nexiwave.com/2009/09/cmu-sphinx-users-and-developers.html' title='CMU Sphinx Users and Developers Workshop 
2010'/><author><name>Nickolay V. Shmyrev</name><uri>http://www.blogger.com/profile/11220369315272283124</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://4.bp.blogspot.com/-mLPNJlEdMV8/TbZcF3qUEqI/AAAAAAAAAKA/6PXGkO0cmAk/s220/me.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5075537609830514591.post-4678743857720685557</id><published>2009-09-26T22:56:00.008+04:00</published><updated>2011-09-17T04:05:36.534+04:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='sphinx4'/><category scheme='http://www.blogger.com/atom/ns#' term='htk'/><title type='text'>Using HTK models in sphinx4</title><content type='html'>As of yesterday, the long-awaited cool patch by &lt;a href="http://www.loria.fr/%7Ecerisara/"&gt;Christophe Cerisara&lt;/a&gt;, with the help of the super fast &lt;a href="http://life-and-tech.kundas.net/"&gt;Yaniv Kunda&lt;/a&gt;, has landed in svn trunk. Now you can use HTK models directly from sphinx4. It's not easy though, and I spent a few hours today figuring out the required steps, so here is a little step-by-step howto:&lt;br /&gt;&lt;br /&gt;1. Update to sphinx4 trunk&lt;br /&gt;&lt;br /&gt;2. Download a small model, because binary loading is unfortunately not supported yet and it takes a lot of resources to load a model from a huge text file. Get a model from &lt;a href="http://www.inference.phy.cam.ac.uk/kv227/"&gt;Keith Vertanen&lt;/a&gt;: &lt;br /&gt;&lt;br /&gt;&lt;a href="http://www.inference.phy.cam.ac.uk/kv227/htk/htk_wsj_si84_2750_8.zip"&gt;http://www.inference.phy.cam.ac.uk/kv227/htk/htk_wsj_si84_2750_8.zip&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;3. Convert the model to text format with HTK HHEd&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;mkdir out&lt;br /&gt;touch empty&lt;br /&gt;HHEd -H hmmdefs -H macros -M out empty tiedlist&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;4. 
Replace the model in the Lattice demo configuration file:&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;&amp;lt;component name="wsj" type="edu.cmu.sphinx.linguist.acoustic.tiedstate.TiedStateAcousticModel"&amp;gt;&lt;br /&gt;&amp;lt;property name="loader" value="wsjLoader"/&amp;gt;&lt;br /&gt;&amp;lt;property name="unitManager" value="unitManager"/&amp;gt;&lt;br /&gt;&amp;lt;/component&amp;gt;&lt;br /&gt;&amp;lt;component name="wsjLoader" type="edu.cmu.sphinx.linguist.acoustic.tiedstate.HTKLoader"&amp;gt;&lt;br /&gt;&amp;lt;property name="logMath" value="logMath"/&amp;gt;&lt;br /&gt;&amp;lt;property name="modelDefinition" value="/home/shmyrev/sphinx4/wsj/out/hmmdefs"/&amp;gt;&lt;br /&gt;&amp;lt;property name="unitManager" value="unitManager"/&amp;gt;&lt;br /&gt;&amp;lt;/component&amp;gt;&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;Please note that the modelDefinition property points to the location of the newly created hmmdefs file.&lt;br /&gt;&lt;br /&gt;5. Replace the frontend configuration to load HTK features from a file. Unfortunately it's impossible to create HTK features with the sphinx4 frontend right now, but I hope this will be implemented soon. Some bits are already present, like the DCT-II transform in frontend.transform.DiscreteCosineTransform2; some are easy to set up, like proper filter coefficients; some are missing. 
So for now we'll recognize an MFC file instead.&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;&amp;lt;component name="epFrontEnd" type="edu.cmu.sphinx.frontend.FrontEnd"&amp;gt;&lt;br /&gt;&amp;lt;propertylist name="pipeline"&amp;gt;&lt;br /&gt;&amp;lt;item&amp;gt; streamHTKSource &amp;lt;/item&amp;gt;&lt;br /&gt;&amp;lt;/propertylist&amp;gt;&lt;br /&gt;&amp;lt;/component&amp;gt;&lt;br /&gt;&amp;lt;component name="streamHTKSource" type="edu.cmu.sphinx.frontend.util.StreamHTKCepstrum"&amp;gt;&lt;br /&gt;&amp;lt;property name="cepstrumLength" value="39"/&amp;gt;&lt;br /&gt;&amp;lt;/component&amp;gt;&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;and change the Java file accordingly:&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;StreamHTKCepstrum source = (StreamHTKCepstrum) cm.lookup ("streamHTKSource");&lt;br /&gt;InputStream stream = new FileInputStream(new File ("input.mfc"));&lt;br /&gt;source.setInputStream(stream);&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;6. Now let's extract the MFC features. Create a config file for HCopy&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;SOURCEFORMAT = WAV&lt;br /&gt;TARGETKIND = MFCC_D_A_Z_0&lt;br /&gt;TARGETRATE = 100000.0&lt;br /&gt;WINDOWSIZE = 250000.0&lt;br /&gt;USEHAMMING = T&lt;br /&gt;PREEMCOEF = 0.97&lt;br /&gt;NUMCHANS = 26&lt;br /&gt;CEPLIFTER = 22&lt;br /&gt;NUMCEPS = 12&lt;br /&gt;ENORMALISE = T&lt;br /&gt;ZMEANSOURCE = T&lt;br /&gt;USEPOWER = T&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;and run it&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;HCopy -C config 10001-90210-01803.wav input.mfc&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;Make sure input.mfc is located in the top-level sphinx4 folder, since that is where the demo will read it from.&lt;br /&gt;&lt;br /&gt;7. 
Now everything is ready&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;ant &amp;&amp; java -jar bin/LatticeDemo.jar&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;Check the result&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;I heard: once or a zero zero one nine oh to one oh say or oil days or a jury&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;It's not very precise, but still ok for such a small acoustic model and a limited language model.&lt;br /&gt;&lt;br /&gt;This is still a work in progress and a lot of things are still pending. The most important are reading binary HTK files, frontend adaptation, cleanup and unification. But I really look forward to the results, since it's a really promising approach. There are not so many BSD-licensed HTK decoders out there.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5075537609830514591-4678743857720685557?l=nsh.nexiwave.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nsh.nexiwave.com/feeds/4678743857720685557/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://nsh.nexiwave.com/2009/09/using-htk-models-in-sphinx4.html#comment-form' title='8 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/4678743857720685557'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/4678743857720685557'/><link rel='alternate' type='text/html' href='http://nsh.nexiwave.com/2009/09/using-htk-models-in-sphinx4.html' title='Using HTK models in sphinx4'/><author><name>Nickolay V. 
Shmyrev</name><uri>http://www.blogger.com/profile/11220369315272283124</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://4.bp.blogspot.com/-mLPNJlEdMV8/TbZcF3qUEqI/AAAAAAAAAKA/6PXGkO0cmAk/s220/me.jpg'/></author><thr:total>8</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5075537609830514591.post-587103885790355559</id><published>2009-09-16T20:27:00.004+04:00</published><updated>2009-09-16T21:01:22.100+04:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='speech recognition'/><category scheme='http://www.blogger.com/atom/ns#' term='experiments'/><title type='text'>Speech Recognition As Experimental Science</title><content type='html'>It's well known that there are two types of physics: theoretical and experimental. At school I always liked doing the latter: measuring the speed of a ball or a voltage, plotting graphs and so on. Unfortunately, later on I was mostly doing math or programming. Only recently, when I started to spend a lot of time on speech recognition, did I find out why I like it so much: it's also an experimental science.&lt;br /&gt;&lt;br /&gt;When you build a speech recognition system, your time is mostly spent on all these beautiful things: setting up the database training, running the learning process, tracking the results. You are trying to understand nature and find its laws; you want to find the best feature set, the phoneset, the beams and more. You have experimental material, and sometimes it turns out there are things you forgot to take into account. It's a really encouraging activity.&lt;br /&gt;&lt;br /&gt;Of course there are important drawbacks; issues like proper &lt;a href="http://en.wikipedia.org/wiki/Design_of_experiments"&gt;design of the experiments&lt;/a&gt; arise. 
Unfortunately this is not widely described in the literature, but speech recognition experiments are just one example of experiments, so all those issues apply to them. To list a few:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Reproducibility&lt;/li&gt;&lt;li&gt;Connection of the theory and the practice&lt;/li&gt;&lt;li&gt;Estimation of the results and their validity&lt;/li&gt;&lt;/ul&gt;For example, the last point is very important. Currently, when we run the database test, we just get a number. We try to rely on it without even estimating the deviation and other very important attributes of every scientific measurement. As a result we make unreliable decisions like &lt;a href="http://nshmyrev.blogspot.com/2009/09/initial-value-problem-for-mllt.html"&gt;I did with MLLT transform&lt;/a&gt;. I now think that we should be more careful about that.&lt;br /&gt;&lt;br /&gt;That's why I started with the aforementioned wikipedia page, trying to find a good book on experiment design, and of course it would be nice to find appropriate software for the experiment management workflow.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5075537609830514591-587103885790355559?l=nsh.nexiwave.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nsh.nexiwave.com/feeds/587103885790355559/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://nsh.nexiwave.com/2009/09/speech-recognition-as-experimental.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/587103885790355559'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/587103885790355559'/><link rel='alternate' type='text/html' href='http://nsh.nexiwave.com/2009/09/speech-recognition-as-experimental.html' title='Speech 
Recognition As Experimental Science'/><author><name>Nickolay V. Shmyrev</name><uri>http://www.blogger.com/profile/11220369315272283124</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://4.bp.blogspot.com/-mLPNJlEdMV8/TbZcF3qUEqI/AAAAAAAAAKA/6PXGkO0cmAk/s220/me.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5075537609830514591.post-6209378836903099320</id><published>2009-09-10T18:25:00.006+04:00</published><updated>2010-05-18T01:31:21.195+04:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='speech recognition'/><category scheme='http://www.blogger.com/atom/ns#' term='conferences'/><title type='text'>The First Glance On The Interspeech 2009 Papers</title><content type='html'>Interspeech 2009 in Brighton is over today. Unfortunately I wasn't able to participate for various reasons. Still, it was very interesting to review the list of sessions and abstracts and to read some of the articles &lt;a href="http://www.interspeech2009.org/conference/programme/sessionlist.php"&gt;available.&lt;/a&gt; The current activity in speech research is amazing, and the number of articles and groups is enormous: in total I counted 459 abstracts with grep. It was enjoyable to process them all. So far I have reduced the list to 50% of the original size, so I still need a few more passes to find the most interesting ones. A few random thoughts I've got:&lt;br /&gt;&lt;br /&gt;Sphinx is mentioned 2 times and HTK only once :), that's a win. Of course many researchers use HTK for experiments, so it's more a win for being more open.&lt;br /&gt;&lt;br /&gt;A lot of machine learning research. Quite a significant amount of it is dedicated to yet another target space representation/classifier/cost function adjustment. At first glance nothing interesting showed up here, unfortunately. 
Discriminative training is probably the most recent advance in ASR.&lt;br /&gt;&lt;br /&gt;Still an enormous amount of old-style phonetic research. Is vowel length a feature? How do Zulu people click? Sometimes it's interesting to read though.&lt;br /&gt;&lt;br /&gt;Almost all TTS is about HMM for speech synthesis. The quality of audio for TTS is a problem. I've recently read a good and very detailed &lt;a href="http://www.sp.nitech.ac.jp/%7Ebonanza/Paper/EMIME/zen_specom.pdf"&gt;review&lt;/a&gt; by Dr. Zen; even adherents of the approach know that a hybrid of HMM and unit selection is better.&lt;br /&gt;&lt;br /&gt;A surprisingly short section on new methods and paradigms, unfortunately.&lt;br /&gt;&lt;br /&gt;New trends include emotions, machine speech-to-speech translation, language acquisition.  Combination of visual and speech recognition is surprisingly common.&lt;br /&gt;&lt;br /&gt;No Russians at all. Well, not strange, Russian speech technology doesn't exist in fact.&lt;br /&gt;&lt;br /&gt;&lt;a href="http://www-i6.informatik.rwth-aachen.de/rwth-asr/"&gt;The RWTH Aachen University Open Source Speech Recognition System&lt;/a&gt; is terrific news. The source is available, downloaded and ready for investigation.&lt;br /&gt;&lt;br /&gt;"Improvements to the LIUM French ASR system based on CMU Sphinx: what helps to significantly reduce the word error rate", no link available yet unfortunately. It should be very interesting reading. The only problem that arises here is that someone should do the merge. 
The issue is that the source is available, but it's really very hard to integrate with a research-oriented system.&lt;br /&gt;&lt;br /&gt;I'm also waiting for the Blizzard 2009 results, which should have been presented but are still not available.&lt;br /&gt;&lt;br /&gt;&lt;a href="http://wami.csail.mit.edu/papers/QuizletInterspeech2009.pdf"&gt;A Self-Labeling Speech Corpus: Collecting Spoken Words with an Online Educational Game&lt;/a&gt; - we have wanted that for Voxforge for a long time.&lt;br /&gt;&lt;br /&gt;In the next few posts I'll probably cover some interesting topics in more detail.  If you were at the conference or saw something interesting, comments are appreciated.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5075537609830514591-6209378836903099320?l=nsh.nexiwave.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nsh.nexiwave.com/feeds/6209378836903099320/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://nsh.nexiwave.com/2009/09/first-glance-on-interspeech-2009-papers.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/6209378836903099320'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/6209378836903099320'/><link rel='alternate' type='text/html' href='http://nsh.nexiwave.com/2009/09/first-glance-on-interspeech-2009-papers.html' title='The First Glance On The Interspeech 2009 Papers'/><author><name>Nickolay V. 
Shmyrev</name><uri>http://www.blogger.com/profile/11220369315272283124</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://4.bp.blogspot.com/-mLPNJlEdMV8/TbZcF3qUEqI/AAAAAAAAAKA/6PXGkO0cmAk/s220/me.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5075537609830514591.post-3473362168815260184</id><published>2009-09-08T22:45:00.007+04:00</published><updated>2009-09-09T01:41:21.283+04:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='articles'/><title type='text'>Modern ASR practices review</title><content type='html'>I was never able to completely join the scientific world, most probably because engineering tasks are more attractive. Though I graduated as a mathematician, my merits aren't worth mentioning. For example the thing I never liked is writing, in particular writing a scientific article. That's the cornerstone of science now, but to me it seems a very dated practice. Most articles are never read, a huge percentage have errors, many are completely wrong or repeat other sources. Of course there are brilliant ones.&lt;br /&gt;&lt;br /&gt;From my point of view knowledge should probably be organized in a different way, something like a software project. The theory could be built up over the ages in a wiki style with all changes tracked, and could contain complementary information like technical notes, software implementations, test results, formalized proofs and so on. Of course software projects have their own issues like forks, bad maintenance and bugs, but they seem to be better organized.&lt;br /&gt;&lt;br /&gt;That's why I really like the projects that keep knowledge in a structure like wikipedia, &lt;a href="http://planetmath.org/"&gt;planetmath&lt;/a&gt; for example. Also reviews of the state of the art are of course invaluable. 
Today I spent some time processing my library and found again the wonderful review by Mark Gales:&lt;br /&gt;&lt;br /&gt;&lt;a href="http://svr-www.eng.cam.ac.uk/%7Emjfg/mjfg_NOW.pdf"&gt;The Application of Hidden Markov Models in Speech Recognition&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;I would really recommend this book as a basic introduction to modern speech recognition methods. Though written by an HTK author, it has little that is HTK-specific and really focuses on best practices in ASR systems.&lt;br /&gt;&lt;br /&gt;P.S. Is there any personal library management software, web-based, able to store and index PDF? I once installed &lt;a href="http://dspace.mit.edu/"&gt;Dspace &lt;/a&gt;at work, but it's so heavy and the UI is really outdated.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5075537609830514591-3473362168815260184?l=nsh.nexiwave.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nsh.nexiwave.com/feeds/3473362168815260184/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://nsh.nexiwave.com/2009/09/i-was-never-able-to-completely-join.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/3473362168815260184'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/3473362168815260184'/><link rel='alternate' type='text/html' href='http://nsh.nexiwave.com/2009/09/i-was-never-able-to-completely-join.html' title='Modern ASR practices review'/><author><name>Nickolay V. 
Shmyrev</name><uri>http://www.blogger.com/profile/11220369315272283124</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://4.bp.blogspot.com/-mLPNJlEdMV8/TbZcF3qUEqI/AAAAAAAAAKA/6PXGkO0cmAk/s220/me.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5075537609830514591.post-3830462878536435433</id><published>2009-09-06T20:15:00.005+04:00</published><updated>2009-09-06T20:36:23.070+04:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='sphinxtrain'/><category scheme='http://www.blogger.com/atom/ns#' term='acoustic model training'/><category scheme='http://www.blogger.com/atom/ns#' term='sphinx'/><title type='text'>Initial value problem for MLLT</title><content type='html'>I've recently discovered, with the help of &lt;a href="https://sourceforge.net/users/mchammer2007/"&gt;mchammer2007&lt;/a&gt;, a &lt;a href="https://sourceforge.net/forum/message.php?msg_id=7609100"&gt;problem&lt;/a&gt; with the estimation of the initial matrix for MLLT training. The MLLT or Maximum Likelihood Linear Transform was suggested by R. A. Gopinath, "Maximum Likelihood Modeling with Gaussian Distributions for Classification", in proceedings of ICASSP 1998 and is &lt;a href="http://www.speech.cs.cmu.edu/cmusphinx/moinmoin/LDAMLLT"&gt;implemented&lt;/a&gt; in Sphinxtrain.&lt;br /&gt;&lt;br /&gt;The idea is to train a matrix that transforms the feature space so that the covariance matrices look more like diagonal ones. The optimization is a quite simple gradient descent, but unfortunately it suffers from the initial value problem. That is, if you choose a proper initial value you could get much better results. 
So right now a random matrix is used:&lt;br /&gt;&lt;br /&gt;        if A is None:&lt;br /&gt;            # Initialize it with a random positive-definite matrix of&lt;br /&gt;            # the same shape as the covariances&lt;br /&gt;            s = self.cov[0].shape&lt;br /&gt;            d = -1&lt;br /&gt;            while d &lt; 0:&lt;br /&gt;                A = eye(s[0]) + 0.1 * random(s)&lt;br /&gt;                d = det(A)&lt;br /&gt;&lt;br /&gt;And depending on your luck you could get better or worse recognition results, sometimes even worse than the usual training without LDA/MLLT.&lt;br /&gt;&lt;br /&gt;        SENTENCE ERROR: 55.4% (72/130) WORD ERROR RATE: 17.5% (135/773) &lt;br /&gt;        SENTENCE ERROR: 51.5% (66/130) WORD ERROR RATE: 16.6% (128/773) &lt;br /&gt;        SENTENCE ERROR: 50.0% (65/130) WORD ERROR RATE: 15.5% (119/773) &lt;br /&gt;        SENTENCE ERROR: 56.2% (73/130) WORD ERROR RATE: 16.9% (130/773) &lt;br /&gt;        SENTENCE ERROR: 62.3% (80/130) WORD ERROR RATE: 22.3% (172/773) &lt;br /&gt;&lt;br /&gt;So the recipe for the training is the following: train several times and check the accuracy, choose the best MLLT matrix and use it in the final training runs.  If you have a large database, find the best MLLT for a subset of it and use it as the initial value for MLLT estimation.  There is no easier way until we find a better method for initial value estimation; a quick look through the articles didn't turn up any.&lt;br /&gt;&lt;br /&gt;From recent articles I also got quite a significant collection of LDA derivatives, discriminative ones, HLDA and so on. Some of them seem to be free from this initial value problem. It would be nice to get a proper review of this large topic.&lt;br /&gt;&lt;br /&gt;By the way, you can see in the chunk of code above that the comment is not quite correct. The positive-definiteness of the matrix should be checked differently, with the Sylvester criterion for example. 
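&lt;br /&gt;&lt;br /&gt;To make the difference between the two checks concrete, here is a minimal pure-Python sketch. The helper names det and is_positive_definite are only illustrative and are not part of Sphinxtrain. Since the random matrix is not symmetric, the sketch assumes that positive-definiteness means a positive quadratic form, which is the same as applying Sylvester's criterion to the symmetric part of the matrix.&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;
```python
def det(m):
    """Determinant by Laplace expansion along the first row (fine for small matrices)."""
    if len(m) == 1:
        return m[0][0]
    total = 0.0
    for j, value in enumerate(m[0]):
        # Minor: drop the first row and column j.
        minor = [row[:j] + row[j + 1:] for row in m[1:]]
        total += (-1) ** j * value * det(minor)
    return total


def is_positive_definite(m):
    """Sylvester's criterion applied to the symmetric part of m."""
    n = len(m)
    sym = [[(m[i][j] + m[j][i]) / 2.0 for j in range(n)] for i in range(n)]
    # All leading principal minors must be strictly positive.
    return all(det([row[:k] for row in sym[:k]]) > 0 for k in range(1, n + 1))


# Illustrative matrix, not taken from any training run.
a = [[1.0, 3.0], [0.0, 1.0]]
print(det(a) > 0)               # prints True: the determinant is 1
print(is_positive_definite(a))  # prints False: x = (1, -1) gives a negative quadratic form
```
&lt;/pre&gt;The example matrix has a positive determinant but is not positive-definite, so the two conditions really are different.&lt;br /&gt;&lt;br /&gt;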
Though I think that since the condition det(A) &gt; 0 seems to be enough for the feature space transform, the comment should simply be removed. But perhaps a positive-definite matrix is required for the optimization.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5075537609830514591-3830462878536435433?l=nsh.nexiwave.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nsh.nexiwave.com/feeds/3830462878536435433/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://nsh.nexiwave.com/2009/09/initial-value-problem-for-mllt.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/3830462878536435433'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/3830462878536435433'/><link rel='alternate' type='text/html' href='http://nsh.nexiwave.com/2009/09/initial-value-problem-for-mllt.html' title='Initial value problem for MLLT'/><author><name>Nickolay V. Shmyrev</name><uri>http://www.blogger.com/profile/11220369315272283124</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://4.bp.blogspot.com/-mLPNJlEdMV8/TbZcF3qUEqI/AAAAAAAAAKA/6PXGkO0cmAk/s220/me.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5075537609830514591.post-5339017845102775651</id><published>2009-09-02T03:47:00.003+04:00</published><updated>2009-09-02T04:02:00.884+04:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='adaptation'/><category scheme='http://www.blogger.com/atom/ns#' term='speech recognition'/><title type='text'>Adaptation Methods</title><content type='html'>It's really hard to collect information on practical application of speech recognition tools. 
For example the wonderful quote from Andrew Morris on htk-users about what to update during MAP adaptation:&lt;br /&gt;&lt;br /&gt;&lt;blockquote&gt;Exactly what it is best to update depends on how much training data you have, but in general it is important to update means and inadvisable to update variances. Only testing on held out test data can decide which is best, but if you are training on data from many speakers and then adapting to data from just one speaker, I expect updating just means should give best results, with variance adaptation reducing performance and transition probs or mix weights adaptation making little difference.&lt;/blockquote&gt;&lt;br /&gt;After few experiments I can only confirm this statement. You should never adapt the variances. So, the HOWTO in our &lt;a href="http://www.speech.cs.cmu.edu/cmusphinx/moinmoin/AcousticModelAdaptation"&gt;wiki&lt;/a&gt; is not so good as it could be. Another bit could be taken from this &lt;a href="http://www.cs.cmu.edu/%7Earchan/presentation/MAP.pdf"&gt;document&lt;/a&gt;, actually it's really better to combine MAP and MLLR this way and the best method for offline adaptation is:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Run bw to collect statistics&lt;/li&gt;&lt;li&gt;Estimate mllr transform&lt;/li&gt;&lt;li&gt;Update means with mllr&lt;/li&gt;&lt;li&gt;Run bw again with updated means&lt;/li&gt;&lt;li&gt;Apply MAP adaptation with fixed tau greater than 100 (try to select the best value). Unfortunately from my experience automatic tau selection is broken in map_adapt. 
This way you'll update the variances a bit, but only slightly.&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;No book could tell you that!&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5075537609830514591-5339017845102775651?l=nsh.nexiwave.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nsh.nexiwave.com/feeds/5339017845102775651/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://nsh.nexiwave.com/2009/09/adaptation-methods.html#comment-form' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/5339017845102775651'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/5339017845102775651'/><link rel='alternate' type='text/html' href='http://nsh.nexiwave.com/2009/09/adaptation-methods.html' title='Adaptation Methods'/><author><name>Nickolay V. Shmyrev</name><uri>http://www.blogger.com/profile/11220369315272283124</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://4.bp.blogspot.com/-mLPNJlEdMV8/TbZcF3qUEqI/AAAAAAAAAKA/6PXGkO0cmAk/s220/me.jpg'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5075537609830514591.post-6990012138444804392</id><published>2009-08-29T11:17:00.004+04:00</published><updated>2009-08-29T11:24:28.332+04:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='gpu'/><category scheme='http://www.blogger.com/atom/ns#' term='speech recognition'/><category scheme='http://www.blogger.com/atom/ns#' term='cuda'/><title type='text'>Time to buy the new video card</title><content type='html'>Everybody plays with training and recognition on a GPU now. 
&lt;a href="http://liuchuan.org/pub/cuHMM.pdf"&gt;200x&lt;/a&gt; improvement on NVidia CUDA is worth time and money. Moreover that the sample code is already &lt;a href="http://code.google.com/p/chmm/"&gt;available&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;Thanks to prym on #cmusphinx channel on irc for the link and of course huge thanks Chuan Liu for the article and a new project. It would be so great to see the similar patches for Sphinxtrain/sphinx4, it would be the killer feature.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5075537609830514591-6990012138444804392?l=nsh.nexiwave.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nsh.nexiwave.com/feeds/6990012138444804392/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://nsh.nexiwave.com/2009/08/time-to-by-new-video-card.html#comment-form' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/6990012138444804392'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/6990012138444804392'/><link rel='alternate' type='text/html' href='http://nsh.nexiwave.com/2009/08/time-to-by-new-video-card.html' title='Time to buy the new video card'/><author><name>Nickolay V. 
Shmyrev</name><uri>http://www.blogger.com/profile/11220369315272283124</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://4.bp.blogspot.com/-mLPNJlEdMV8/TbZcF3qUEqI/AAAAAAAAAKA/6PXGkO0cmAk/s220/me.jpg'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5075537609830514591.post-1912784087462097246</id><published>2009-08-20T01:58:00.007+04:00</published><updated>2009-08-20T03:02:07.730+04:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='speech recognition'/><title type='text'>How to improve accuracy</title><content type='html'>People very often ask "how to improve accuracy?". Since I've got three questions like this today I decided to write more or less  extensive description of the ways to solve this problem. Probably it will be a bit sketchy, but I hope it will be helpful. Corrections will be very appreciated as well.&lt;br /&gt;&lt;br /&gt;1) Well, first of all let me mention that the problem is complex. It requires an understanding of the Hidden Markov Models, beam search, language modelling and every other technology involved. I really recommend you to read the book on speech recognition first. This one is very good:&lt;br /&gt;&lt;br /&gt;&lt;a href="http://www.amazon.com/Spoken-Language-Processing-Algorithm-Development/dp/0130226165"&gt;http://www.amazon.com/Spoken-Language-Processing-Algorithm-Development/dp/0130226165&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;In theory it will be possible to build a speech recognition system without any extensive knowledge as listed above, but it's not the case now. We are working on easy to use system but we are still in the early beginning.&lt;br /&gt;&lt;br /&gt;Probably you don't have time to read and study all this. Then think if you have time to implement speech recognition system at all. 
At least learn the basic concepts.&lt;br /&gt;&lt;br /&gt;Please also learn how to program before you start. I think it's obvious you need to have software development experience. We can't teach you Java really.&lt;br /&gt;&lt;br /&gt;2) The next step is to set up the basic example of the system and estimate its accuracy. While the first part is quite obvious, the second is often ignored. Don't do that. It's critical to test accuracy under realistic conditions during the whole development process.&lt;br /&gt;&lt;br /&gt;Decide what kind of system you are going to implement - large vocabulary dictation, medium vocabulary names recognition or small vocabulary command and control or some other task. Probably you need an IVR system. For each system there is a demo already. Use it as a base. Please don't try to build dictation from a command and control demo, it's just not suitable.&lt;br /&gt;&lt;br /&gt;Now the important task, the estimation of the accuracy. Examples of how to do that can be found in the decoder sources, in the tutorial and in many more places. Collect a test database, recognize it and compute the exact accuracy estimate.&lt;br /&gt;&lt;br /&gt;It should be the following:&lt;br /&gt;&lt;br /&gt;Command and control 5% WER (word error rate)&lt;br /&gt;Medium vocabulary 15% WER&lt;br /&gt;Large vocabulary 30% WER&lt;br /&gt;Large vocabulary short utterances 50% WER&lt;br /&gt;&lt;br /&gt;If you have noisy audio or some accent, multiply this number by 2.&lt;br /&gt;&lt;br /&gt;4) Compare the actual accuracy with the expected value. If the accuracy is close to the expected one, proceed to the next step. If not, search for the bug. Most likely you've made a mistake in system setup. Check that the configuration is suitable for the task, check the speech quality in a sound editor, find out the spectrum range, check the sampling rate, accent, dictionary and other parameters. 
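&lt;br /&gt;&lt;br /&gt;As a reminder of what is being measured: WER is the number of word substitutions, deletions and insertions needed to turn the recognizer output into the reference transcript, divided by the number of reference words. The decoder packages have their own tools for this, so the following is only a minimal sketch of the computation; the function name and the sample sentences are made up for illustration, and a non-empty reference is assumed.&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;
```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    # previous[j] is the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # match or substitution
        previous = current
    return previous[-1] / float(len(ref))


# One substitution and one deletion against four reference words.
print(word_error_rate("open the window please", "open a window"))  # prints 0.5
```
&lt;/pre&gt;On a real test set, sum the errors and the reference lengths over all utterances before dividing instead of averaging per-utterance rates.&lt;br /&gt;&lt;br /&gt;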
If your WER is 90%, you made a mistake for sure.&lt;br /&gt;&lt;br /&gt;As an example of task-dependent training, consider the acoustic database. If you train a small vocabulary acoustic model, you need a word-based acoustic model with a word-based phoneset in the Sphinxtrain case. If you are training on a large vocabulary database, make sure your phoneset is not large and make sure you have selected the proper number of senones/mixtures.&lt;br /&gt;&lt;br /&gt;5) Once you've reached a baseline, it will take a lot to improve it. Think if it's enough for you and if you can build your application with such accuracy. It's unlikely you'll get significantly more. But if you are brave enough:&lt;br /&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Use MLLT/VTLN feature space adaptation&lt;/li&gt;&lt;li&gt;Use MLLR and other types of online speaker adaptation&lt;/li&gt;&lt;li&gt;Adapt the language model, use context-sensitive language models&lt;/li&gt;&lt;li&gt;Tune beams - try different values and experiment with them&lt;/li&gt;&lt;li&gt;Implement the rejection for OOV (out-of-vocabulary) words and other noise sounds&lt;/li&gt;&lt;li&gt;Implement noise cancellation.&lt;/li&gt;&lt;li&gt;Adapt acoustic models and dictionary to your speakers/their accent.&lt;/li&gt;&lt;li&gt;....&lt;/li&gt;&lt;/ul&gt;Out-of-vocabulary words are the most frequent issue here. Contrary to what users expect, most demos don't do any OOV filtering out of the box, while it's critical for applications. Unfortunately, though unlimited vocabulary systems exist, they are quite complex (though also possible to implement). Most systems have a limited vocabulary. That means that you need to implement OOV detection and/or confidence scoring in order to filter the garbage. 
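&lt;br /&gt;&lt;br /&gt;The filtering itself can be as simple as a threshold. A minimal sketch, assuming the recognizer gives you a confidence score between 0 and 1 for every word; the function name, the 0.7 threshold and the sample data are made up for illustration and are not a Sphinx API:&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;
```python
def filter_by_confidence(words, threshold=0.7):
    """Keep words whose confidence reaches the threshold, mark the rest as rejected."""
    accepted = []
    for word, confidence in words:
        if confidence >= threshold:
            accepted.append(word)
        else:
            accepted.append("[rejected]")
    return accepted


# Hypothetical recognizer output: (word, confidence) pairs.
result = [("call", 0.93), ("zyxkov", 0.21), ("now", 0.88)]
print(filter_by_confidence(result))  # prints ['call', '[rejected]', 'now']
```
&lt;/pre&gt;The threshold has to be tuned on your own test data, the same way as the beams.&lt;br /&gt;&lt;br /&gt;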
This is doable and described in demos too, for example in confidence demo.&lt;br /&gt;&lt;br /&gt;If you are short of ideas here, join the mailing list, we have a lot of features to implement.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5075537609830514591-1912784087462097246?l=nsh.nexiwave.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nsh.nexiwave.com/feeds/1912784087462097246/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://nsh.nexiwave.com/2009/08/how-to-improve-accuracy.html#comment-form' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/1912784087462097246'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/1912784087462097246'/><link rel='alternate' type='text/html' href='http://nsh.nexiwave.com/2009/08/how-to-improve-accuracy.html' title='How to improve accuracy'/><author><name>Nickolay V. Shmyrev</name><uri>http://www.blogger.com/profile/11220369315272283124</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://4.bp.blogspot.com/-mLPNJlEdMV8/TbZcF3qUEqI/AAAAAAAAAKA/6PXGkO0cmAk/s220/me.jpg'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5075537609830514591.post-3739540052699153840</id><published>2009-08-17T23:36:00.002+04:00</published><updated>2009-08-17T23:42:02.535+04:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='sphinx4'/><category scheme='http://www.blogger.com/atom/ns#' term='speech recognition'/><title type='text'>Sphinx4-1.0beta3 is released</title><content type='html'>The best speech recognition engine is on it's way to world's domination. We are happy to announce the new sphinx4 release. 
This is still a development version, so bug reports and testing are much appreciated.&lt;br /&gt;&lt;br /&gt;Packages &lt;br /&gt;&lt;br /&gt;&lt;a href="https://sourceforge.net/projects/cmusphinx/files/sphinx4/1.0%20beta3/" target="_new"&gt;https://sourceforge.net/projects/cmusphinx/files/sphinx4/1.0-beta3/&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;New Features and Improvements:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;BatchAGC frontend component&lt;/li&gt;&lt;li&gt;Complete transition to defaults in annotations &lt;/li&gt;&lt;li&gt;ConcatFeatureExtractor to cooperate with cepwin models  &lt;br /&gt;&lt;/li&gt;&lt;li&gt;End of stream signals are passed to the decoder for end of stream handling  &lt;/li&gt;&lt;li&gt;Timer API improvement&lt;/li&gt;&lt;li&gt;Threading policy is changed to TAS&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;  Bug fixes:&lt;br /&gt;&lt;blockquote&gt;&lt;/blockquote&gt;&lt;ul&gt;&lt;li&gt;Fixes reading UTF-8 from language model dump &lt;/li&gt;&lt;li&gt;Huge memory optimization of the lattice compression&lt;/li&gt;&lt;li&gt;More stable frontend work with DataStart and DataEnd and optional SpeechStart/SpeechEnd&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;  Thanks:&lt;br /&gt;&lt;br /&gt;    Yaniv Kunda, Michele Alessandrini, Holger Brandl, Timo Baumann, Evandro Gouvea&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5075537609830514591-3739540052699153840?l=nsh.nexiwave.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nsh.nexiwave.com/feeds/3739540052699153840/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://nsh.nexiwave.com/2009/08/sphinx4-10beta3-is-released.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/3739540052699153840'/><link rel='self' type='application/atom+xml' 
href='http://www.blogger.com/feeds/5075537609830514591/posts/default/3739540052699153840'/><link rel='alternate' type='text/html' href='http://nsh.nexiwave.com/2009/08/sphinx4-10beta3-is-released.html' title='Sphinx4-1.0beta3 is released'/><author><name>Nickolay V. Shmyrev</name><uri>http://www.blogger.com/profile/11220369315272283124</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://4.bp.blogspot.com/-mLPNJlEdMV8/TbZcF3qUEqI/AAAAAAAAAKA/6PXGkO0cmAk/s220/me.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5075537609830514591.post-1487259592070311347</id><published>2009-08-05T02:51:00.003+04:00</published><updated>2010-05-13T01:35:14.994+04:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='festival'/><category scheme='http://www.blogger.com/atom/ns#' term='TTS'/><title type='text'>Release of the Polish voice for Festival</title><content type='html'>A remarkable and long-awaited release happened recently. The &lt;a href="http://festvox.org/voices/polish/pjwstk_ks_multisyn_mbrola.tar.bz2"&gt;Polish multisyn voice&lt;/a&gt; for the Festival TTS system was made available. This is the best multisyn voice available nowadays both in terms of speech material (several hours, much more than any arctic database, around 500 Mb of audio) and label quality (it has manually corrected segment labels). It also uses some unique modifications of the synthesis method, like target f0 prediction for multisyn combined with a ToBI/APML-based intonation module. The scheme code also has some important modifications. I really encourage you to try this voice even if you don't understand Polish.  
I also look forward into the HTS voices&lt;br /&gt;&lt;br /&gt;Thanks a lot to &lt;a href="http://syntezamowy.pjwstk.edu.pl/"&gt;Krzystof&lt;/a&gt; for his hard work.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5075537609830514591-1487259592070311347?l=nsh.nexiwave.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nsh.nexiwave.com/feeds/1487259592070311347/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://nsh.nexiwave.com/2009/08/release-of-polish-voice-for-festival.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/1487259592070311347'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/1487259592070311347'/><link rel='alternate' type='text/html' href='http://nsh.nexiwave.com/2009/08/release-of-polish-voice-for-festival.html' title='Release of the Polish voice for Festival'/><author><name>Nickolay V. Shmyrev</name><uri>http://www.blogger.com/profile/11220369315272283124</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://4.bp.blogspot.com/-mLPNJlEdMV8/TbZcF3qUEqI/AAAAAAAAAKA/6PXGkO0cmAk/s220/me.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5075537609830514591.post-5549956563782311844</id><published>2009-07-30T03:01:00.007+04:00</published><updated>2010-05-13T01:34:50.958+04:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='g2p'/><title type='text'>Training language model with fragments</title><content type='html'>&lt;a href="http://www-i6.informatik.rwth-aachen.de/web/Software/g2p.html"&gt;Sequitur g2p&lt;/a&gt; by M. Bisani and H. Ney. 
is a cool package for letter to phone translation, quite accurate and, most importantly, open. But actually there are several hidden gems in this package :)&lt;br /&gt;&lt;br /&gt;One of them is the phone-oriented segmenter that splits words into chunks - graphones. A graphone is a joint object consisting of letters and the corresponding phones; words are built from graphones. Graphones are used in g2p internally, but they are also very useful, for example, in the construction of open vocabulary models.  The system as a whole is described here:&lt;br /&gt;&lt;br /&gt;&lt;a href="http://www.speech.sri.com/papers/icassp2008-std.ps.gz"&gt;Open Vocabulary Spoken Term Detection Using Graphone-Based Hybrid recognition System by M. Akbacak, D. Vergyri and A. Stolcke&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;and the details of the language model in the original article:&lt;br /&gt;&lt;br /&gt;Open Vocabulary Speech Recognition with Flat Hybrid Models by  Maximilian Bisani and Hermann Ney&lt;br /&gt;&lt;br /&gt;The interesting thing is that all the required components are already available, the issue is to find the correct options and build the system. So the quick recipe is:&lt;br /&gt;&lt;br /&gt;1. Get Sequitur G2p&lt;br /&gt;2. Patch it to support Python 2.5 (replace elementtree with xml.etree, since elementtree is deprecated now)&lt;br /&gt;3. 
Convert the cmudict lexicon to the xml-based Bliss format (I'm not sure what it is, I failed to find information about it on the web)&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;import sys&lt;br /&gt;from xml.sax.saxutils import escape&lt;br /&gt;&lt;br /&gt;# Read a cmudict-style lexicon: a word followed by its phones on each line&lt;br /&gt;lexicon = open(sys.argv[1], "r")&lt;br /&gt;&lt;br /&gt;print("&amp;lt;lexicon&amp;gt;")&lt;br /&gt;for line in lexicon:&lt;br /&gt;    toks = line.strip().split()&lt;br /&gt;    # Skip empty lines and entries without a pronunciation&lt;br /&gt;    if len(toks) &amp;lt; 2:&lt;br /&gt;        continue&lt;br /&gt;    word = toks[0]&lt;br /&gt;    phones = " ".join(toks[1:])&lt;br /&gt;    print("&amp;lt;orth&amp;gt;")&lt;br /&gt;    # Escape XML special characters in the word&lt;br /&gt;    print(escape(word))&lt;br /&gt;    print("&amp;lt;/orth&amp;gt;")&lt;br /&gt;    print("&amp;lt;pron&amp;gt;")&lt;br /&gt;    print(phones)&lt;br /&gt;    print("&amp;lt;/pron&amp;gt;")&lt;br /&gt;print("&amp;lt;/lexicon&amp;gt;")&lt;br /&gt;lexicon.close()&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;4. Train the segmenter model. The most complicated thing is to figure out the option to train a multigram model with several phones. The default one used in g2p consists of 1 phone and 1 letter, it's not suitable for an OOV language model.&lt;br /&gt;&lt;br /&gt;&lt;blockquote style="font-family: courier new;"&gt;g2p.py --model model-1 --ramp-up --train cmudict.0.7a.train --devel 5% --write-model model-2 -s 0,2,0,2&lt;/blockquote&gt;&lt;br /&gt;5. Ramp up the model to make it more precise&lt;br /&gt;6. Build the language model, here you need the dictionary in XML format. 
As the article above describes, the original lexicon should be around 10k, the subliminal training lexicon should be 50k or so.&lt;br /&gt;&lt;br /&gt;&lt;blockquote style="font-family: courier new;"&gt;makeOvModel.py --order=4 -l cmudict.xml --subliminal-lexicon=cmudict.xml.test -g model-2 --write-lexicon=res.lexicon --write-tokens=res.tokens&lt;/blockquote&gt;&lt;br /&gt;&lt;br /&gt;After that you get the tokens for the LM and, with additional options, even the counts for a language model you can train with SRILM. 
I haven't finished that last step yet, so this post should have a follow-up.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5075537609830514591-5549956563782311844?l=nsh.nexiwave.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nsh.nexiwave.com/feeds/5549956563782311844/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://nsh.nexiwave.com/2009/07/training-language-model-with-fragments.html#comment-form' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/5549956563782311844'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/5549956563782311844'/><link rel='alternate' type='text/html' href='http://nsh.nexiwave.com/2009/07/training-language-model-with-fragments.html' title='Training language model with fragments'/><author><name>Nickolay V. Shmyrev</name><uri>http://www.blogger.com/profile/11220369315272283124</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://4.bp.blogspot.com/-mLPNJlEdMV8/TbZcF3qUEqI/AAAAAAAAAKA/6PXGkO0cmAk/s220/me.jpg'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5075537609830514591.post-6342257574493868091</id><published>2009-07-23T01:33:00.006+04:00</published><updated>2010-05-18T01:32:11.275+04:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='conferences'/><title type='text'>I'm going to ClueCon</title><content type='html'>This August I'm going to the US again, to Chicago, for &lt;a href="http://cluecon.com/"&gt;ClueCon&lt;/a&gt;, where I'll give a talk titled "The use of open source speech recognition". 
Here is a short outline:&lt;br /&gt;&lt;blockquote&gt;&lt;br /&gt;The most complicated thing in modern ASR is to make user expectations agree with the actual capabilities of the technology. Although the technology itself is able to provide a number of potentially very useful features, they are not exactly what the average user expects.&lt;br /&gt;&lt;br /&gt;Many specialized tasks require a huge amount of customization; for example, speaker adaptation needs to be accurately embedded into the accounting system in order to let the recognizer improve its accuracy.&lt;br /&gt;&lt;br /&gt;Open source solutions could help here because of their much greater flexibility. But although many companies provide speech recognition services, only a few open source projects exist and most of them are purely academic. They often require a lot of tuning for the end user. Many parts of the complete system are just missing.&lt;br /&gt;&lt;br /&gt;Luckily the situation has been improving in recent years: the core components are getting a more or less stable release schedule and active support, including commercial support.&lt;br /&gt;&lt;br /&gt;The purpose of this talk is to cover the trends in the development of open source speech recognition in conjunction with telephony systems and to suggest ways it can reach the enterprise level.&lt;/blockquote&gt;&lt;br /&gt;&lt;br /&gt;I'll also visit Boston for two days.&lt;br /&gt;&lt;br /&gt;Update: Here is the &lt;a href="http://svn.berlios.de/svnroot/repos/festlang/trunk/multilts/reports/report-cluecon-2009/nshmyrev_cluecon.ppt"&gt;presentation&lt;/a&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5075537609830514591-6342257574493868091?l=nsh.nexiwave.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nsh.nexiwave.com/feeds/6342257574493868091/comments/default' title='Post Comments'/><link rel='replies' type='text/html' 
href='http://nsh.nexiwave.com/2009/07/im-going-to-cluecon.html#comment-form' title='3 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/6342257574493868091'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/6342257574493868091'/><link rel='alternate' type='text/html' href='http://nsh.nexiwave.com/2009/07/im-going-to-cluecon.html' title='I&apos;m going to ClueCon'/><author><name>Nickolay V. Shmyrev</name><uri>http://www.blogger.com/profile/11220369315272283124</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://4.bp.blogspot.com/-mLPNJlEdMV8/TbZcF3qUEqI/AAAAAAAAAKA/6PXGkO0cmAk/s220/me.jpg'/></author><thr:total>3</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5075537609830514591.post-8462009679480451673</id><published>2009-07-05T02:17:00.003+04:00</published><updated>2009-07-09T04:04:33.067+04:00</updated><title type='text'>Gran Canaria</title><content type='html'>&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://4.bp.blogspot.com/_p33_0koWXHA/SlUz97idt3I/AAAAAAAAABg/fanNzWFkDlk/s1600-h/a.jpg"&gt;&lt;img style="margin: 0pt 10px 10px 0pt; float: left; cursor: pointer; width: 400px; height: 300px;" src="http://4.bp.blogspot.com/_p33_0koWXHA/SlUz97idt3I/AAAAAAAAABg/fanNzWFkDlk/s400/a.jpg" alt="" id="BLOGGER_PHOTO_ID_5356244470874355570" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;I'm on Gran Canaria!&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5075537609830514591-8462009679480451673?l=nsh.nexiwave.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nsh.nexiwave.com/feeds/8462009679480451673/comments/default' title='Post Comments'/><link rel='replies' type='text/html' 
href='http://nsh.nexiwave.com/2009/07/gran-canaria.html#comment-form' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/8462009679480451673'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/8462009679480451673'/><link rel='alternate' type='text/html' href='http://nsh.nexiwave.com/2009/07/gran-canaria.html' title='Gran Canaria'/><author><name>Nickolay V. Shmyrev</name><uri>http://www.blogger.com/profile/11220369315272283124</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://4.bp.blogspot.com/-mLPNJlEdMV8/TbZcF3qUEqI/AAAAAAAAAKA/6PXGkO0cmAk/s220/me.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://4.bp.blogspot.com/_p33_0koWXHA/SlUz97idt3I/AAAAAAAAABg/fanNzWFkDlk/s72-c/a.jpg' height='72' width='72'/><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5075537609830514591.post-3258250944005718485</id><published>2009-06-30T00:35:00.004+04:00</published><updated>2009-06-30T00:44:23.095+04:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='zyxel'/><title type='text'>Left Zyxel</title><content type='html'>Recently I left Zyxel, where I worked for three years on a Linux-based home class router (CPE). It was a nice place where I met wonderful friends and learned a lot. It was also encouraging and interesting work.&lt;br /&gt;&lt;br /&gt;I still have a few ideas for the CPE market that are not completely supported in the current developments of various competitors. Things like a modern web-2.0 dynamic UI, better error reporting, overall performance optimization, testing, centralized management and so on promise a lot for a vendor that is able to handle such a surprisingly complicated product as a CPE.  
Cell phones are a much more active market than routers, although the class of home devices seems to be no less important than the cell phone one. For example, I'm amazed by the overall quality of the T-Mobile G1 and haven't seen anything comparable on the router market. Probably the US market is different though. The near future with 4G, extensibility and wideband access everywhere promises a lot.&lt;br /&gt;&lt;br /&gt;Let's hope one day there will be a team that could implement it. Also I hope to buy a router made by me very soon.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5075537609830514591-3258250944005718485?l=nsh.nexiwave.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nsh.nexiwave.com/feeds/3258250944005718485/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://nsh.nexiwave.com/2009/06/left-zyxel.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/3258250944005718485'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/3258250944005718485'/><link rel='alternate' type='text/html' href='http://nsh.nexiwave.com/2009/06/left-zyxel.html' title='Left Zyxel'/><author><name>Nickolay V. 
Shmyrev</name><uri>http://www.blogger.com/profile/11220369315272283124</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://4.bp.blogspot.com/-mLPNJlEdMV8/TbZcF3qUEqI/AAAAAAAAAKA/6PXGkO0cmAk/s220/me.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5075537609830514591.post-5652502322263763929</id><published>2009-06-12T04:53:00.006+04:00</published><updated>2010-05-13T01:38:07.627+04:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='java'/><category scheme='http://www.blogger.com/atom/ns#' term='sphinx4'/><title type='text'>Java bits in Sphinx4</title><content type='html'>I spent some time converting sphinx4 to Java5, mostly changing the loops to drop explicit iterators. I hope it will not just make the code cleaner but also give us a few bits of performance.&lt;br /&gt;&lt;br /&gt;I also tried to profile the sphinx4 lattice code. It seems that I broke it by changing the default value for keepAllStates to true. With all HMM states left in the tree it's very hard to traverse the tree to create a lattice. Unfortunately the &lt;a href="http://www.eclipse.org/tptp/"&gt;TPTP&lt;/a&gt; profiler in Eclipse turned out to be very slow, so I'm now looking for another profiler as well as for a way to solve this keepAllStates issue.&lt;br /&gt;&lt;br /&gt;That change had another drawback: now we do very unnatural work when we decide whether the stream is over. Currently the scorer returns null on every SpeechEnd signal. It should also return null on the DataEnd signal, and that null should be different from the first one, since in the case of long wav file transcription we should stop the recognition only after DataEnd and continue after SpeechEnd. Now we distinguish them by the presence of the data frames, which are kept due to the keepAllTokens setting. It's a very unnatural dependency which has drawbacks as well, at least it eats memory. But I haven't decided what to do with this not so perfect API yet. 
Most probably we'll need to introduce something like EOF in C to find out if the stream is over.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5075537609830514591-5652502322263763929?l=nsh.nexiwave.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nsh.nexiwave.com/feeds/5652502322263763929/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://nsh.nexiwave.com/2009/06/java-bits-in-sphinx4.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/5652502322263763929'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/5652502322263763929'/><link rel='alternate' type='text/html' href='http://nsh.nexiwave.com/2009/06/java-bits-in-sphinx4.html' title='Java bits in Sphinx4'/><author><name>Nickolay V. Shmyrev</name><uri>http://www.blogger.com/profile/11220369315272283124</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://4.bp.blogspot.com/-mLPNJlEdMV8/TbZcF3qUEqI/AAAAAAAAAKA/6PXGkO0cmAk/s220/me.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5075537609830514591.post-7958590363974496975</id><published>2009-05-31T23:49:00.000+04:00</published><updated>2009-06-01T00:11:36.574+04:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='sphinxtrain'/><category scheme='http://www.blogger.com/atom/ns#' term='speech recognition'/><title type='text'>Dither is considered harmful</title><content type='html'>MFCC features used in speech recognition are still a reasonable choice if you want to recognize generic speech. With tunings like frequency warping for VTLN and MLLT they can still deliver reasonable performance. 
Although there are many parameters to tune, like the upper and lower frequencies, the shape of the mel filters and so on, the default values mostly work fine. Still, I had to spend this week on one issue related to zero energy frames.&lt;br /&gt;&lt;br /&gt;Zero energy frames are quite common in telephony recorded speech. Due to noise cancellation or VAD speech compression, telephony recordings are full of frames with zero energy. The issue is that the calculation of MFCC features involves taking the log of energies, so you end up with the undefined value log 0. There are several ways to overcome this issue.&lt;br /&gt;&lt;br /&gt;The one used in HTK or SPTK, for example, is to assign some floored value to the log; usually it's quite a big value in the log domain, say 1e-5. This solution is actually quite bad, at least in its sphinx implementation. That's because it largely affects the CMN computation: the mean goes down and bad things happen. A silent frame can affect the result of the whole phrase.&lt;br /&gt;&lt;br /&gt;Another one is dither, when you apply random 1-bit noise to the sound as a whole and use this modified waveform for training. Such a change is usually enough to make the log take acceptable values around -1.&lt;br /&gt;&lt;br /&gt;There were &lt;a href="https://sourceforge.net/mailarchive/forum.php?thread_name=49ECBE4E.5090303%40verizon.net&amp;amp;forum_name=cmusphinx-devel"&gt;complaints&lt;/a&gt; about dither; the best known one is that it affects recognition scores, so results can differ from run to run. That's a bad thing, but not that bad when you start with a predefined seed. So I used to think that dither is fine. And by default it's applied both in training and in the decoder. 
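&lt;br /&gt;&lt;br /&gt;The two workarounds can be sketched in a few lines of Python. This is only a toy illustration under my own assumptions, not the sphinxbase code: the energy is a plain sum of squares, the floor of 1e-5 is applied to the energy before the log, and the dither adds -1, 0 or +1 to every sample from a seeded generator.&lt;br /&gt;&lt;pre&gt;
```python
import math
import random

LOG_FLOOR = 1e-5  # illustrative floor, applied to the energy (an assumption)

def energy(frame):
    """Sum of squared samples of one frame."""
    return sum(s * s for s in frame)

def floored_log_energy(frame, floor=LOG_FLOOR):
    """Replace too small energies by the floor before taking the log."""
    return math.log(max(energy(frame), floor))

def dither(frame, rng):
    """Add random noise of at most one unit to every sample (toy scheme)."""
    return [s + rng.choice((-1, 0, 1)) for s in frame]

silence = [0] * 256          # a zero energy frame
rng = random.Random(12345)   # predefined seed keeps runs repeatable

try:
    math.log(energy(silence))
except ValueError:
    print("log of zero energy is undefined")

print(floored_log_energy(silence))
print(math.log(energy(dither(silence, rng))))
```
&lt;/pre&gt;With a fixed seed the dithered value is the same on every run, which is why a predefined seed takes care of the reproducibility complaint.&lt;br /&gt;&lt;br /&gt;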
But recently, when I started testing the sphinxtrain tutorial, I came across a more important issue.&lt;br /&gt;&lt;br /&gt;See the results on the an4 database from run to run without any modifications:&lt;br /&gt;&lt;br /&gt;TOTAL Words: 773 Correct: 645 Errors: 139&lt;br /&gt; TOTAL Percent correct = 83.44% Error = 17.98% Accuracy = 82.02%&lt;br /&gt; TOTAL Insertions: 11 Deletions: 17 Substitutions: 111&lt;br /&gt; TOTAL Words: 773 Correct: 633 Errors: 149&lt;br /&gt; TOTAL Percent correct = 81.89% Error = 19.28% Accuracy = 80.72%&lt;br /&gt; TOTAL Insertions: 9 Deletions: 23 Substitutions: 117&lt;br /&gt; TOTAL Words: 773 Correct: 639 Errors: 142&lt;br /&gt; TOTAL Percent correct = 82.66% Error = 18.37% Accuracy = 81.63%&lt;br /&gt; TOTAL Insertions: 8 Deletions: 19 Substitutions: 115&lt;br /&gt; TOTAL Words: 773 Correct: 650 Errors: 133&lt;br /&gt; TOTAL Percent correct = 84.09% Error = 17.21% Accuracy = 82.79%&lt;br /&gt; TOTAL Insertions: 10 Deletions: 17 Substitutions: 106&lt;br /&gt; TOTAL Words: 773 Correct: 639 Errors: 142&lt;br /&gt; TOTAL Percent correct = 82.66% Error = 18.37% Accuracy = 81.63%&lt;br /&gt; TOTAL Insertions: 8 Deletions: 19 Substitutions: 115&lt;br /&gt;&lt;br /&gt; If you are lucky you can even get a WER of 15.95%. That's certainly unacceptable, and it still remains unclear why training is so sensitive to the applied dither. Clearly it makes any testing impossible. I checked these results on a medium vocabulary 50-hour database and they are still the same: the accuracy is very different from run to run. The interesting thing is that only training is affected that much. 
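&lt;br /&gt;&lt;br /&gt;For reference, the figures in these reports can be reproduced from the raw counts. The formulas below are inferred from the numbers themselves (they match the first run), so take them as a sketch and not as the actual code of the scoring script; the function name is made up.&lt;br /&gt;&lt;pre&gt;
```python
def score(words, insertions, deletions, substitutions):
    """Return correct, errors, percent correct, error rate and accuracy."""
    errors = insertions + deletions + substitutions
    correct = words - deletions - substitutions
    percent_correct = 100.0 * correct / words
    error_rate = 100.0 * errors / words
    # Accuracy is taken as 100% minus the error rate.
    return correct, errors, percent_correct, error_rate, 100.0 - error_rate

# Counts of the first an4 run from the list above.
correct, errors, pc, err, acc = score(773, 11, 17, 111)
print("Correct: %d Errors: %d" % (correct, errors))
print("Percent correct = %.2f%% Error = %.2f%% Accuracy = %.2f%%" % (pc, err, acc))
```
&lt;/pre&gt;For the first run this gives 645 correct words, 139 errors and 83.44% / 17.98% / 82.02%, the same as reported. Insertions count as errors but do not reduce the number of correct words, which is why the error rate and the percent correct do not add up to 100%.&lt;br /&gt;&lt;br /&gt;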
For testing you get only a very slight difference of 0.1%.&lt;br /&gt;&lt;br /&gt;So far my solutions are:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Disable dither in training&lt;/li&gt;&lt;li&gt;Apply a patch to drop frames with zero energy (this seems useless but it helps to be less nervous about warnings)&lt;/li&gt;&lt;li&gt;Decode with dither&lt;/li&gt;&lt;/ul&gt;I hope I'll be able to provide more information in the future about the reasons for this instability, but for now that's all I know.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5075537609830514591-7958590363974496975?l=nsh.nexiwave.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nsh.nexiwave.com/feeds/7958590363974496975/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://nsh.nexiwave.com/2009/05/dither-is-considered-harmful.html#comment-form' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/7958590363974496975'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/7958590363974496975'/><link rel='alternate' type='text/html' href='http://nsh.nexiwave.com/2009/05/dither-is-considered-harmful.html' title='Dither is considered harmful'/><author><name>Nickolay V. 
Shmyrev</name><uri>http://www.blogger.com/profile/11220369315272283124</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://4.bp.blogspot.com/-mLPNJlEdMV8/TbZcF3qUEqI/AAAAAAAAAKA/6PXGkO0cmAk/s220/me.jpg'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5075537609830514591.post-7608344971376922069</id><published>2009-05-24T16:01:00.000+04:00</published><updated>2009-05-24T16:44:07.465+04:00</updated><title type='text'>Text summarization low hanging fruit</title><content type='html'>Actually all the data required for quite precise text summarization is almost in place: one should just add support for WordNet from &lt;a href="http://www.nltk.org"&gt;nltk&lt;/a&gt; to the &lt;a href="http://libots.sourceforge.net/"&gt;Open Text Summarizer&lt;/a&gt;, calculate frequencies and present the highlighted sentences to the user. Or it's possible to do the same in Python with nltk itself.&lt;br /&gt;&lt;br /&gt;It would help in many cases, for example in mail processing. When you get 200 mails a day it's really hard to read them all. 
Or probably it's just time to unsubscribe from some mailing lists.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5075537609830514591-7608344971376922069?l=nsh.nexiwave.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nsh.nexiwave.com/feeds/7608344971376922069/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://nsh.nexiwave.com/2009/05/text-summarization-low-hanging-fruit.html#comment-form' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/7608344971376922069'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/7608344971376922069'/><link rel='alternate' type='text/html' href='http://nsh.nexiwave.com/2009/05/text-summarization-low-hanging-fruit.html' title='Text summarization low hanging fruit'/><author><name>Nickolay V. Shmyrev</name><uri>http://www.blogger.com/profile/11220369315272283124</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://4.bp.blogspot.com/-mLPNJlEdMV8/TbZcF3qUEqI/AAAAAAAAAKA/6PXGkO0cmAk/s220/me.jpg'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5075537609830514591.post-1193358768091304174</id><published>2009-05-23T14:00:00.000+04:00</published><updated>2009-05-23T14:01:40.957+04:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='voxforge'/><title type='text'>My vote for voxforge</title><content type='html'>I just voted for VoxForge in the category: "Most Likely to Change the Way You Do Everything". 
You might want to do the same :)&lt;br /&gt;&lt;p&gt;Go to &lt;a target="_blank" href="http://sourceforge.net/community/cca09/nominate/?project_name=VoxForge&amp;amp;project_url=http://voxforge.org/" onclick="r('\/r?url=http%3A%2F%2Fsourceforge.net%2Fcommunity%2Fcca09%2Fnominate%2F%3Fproject_name%3DVoxForge%26project_url%3Dhttp%3A%2F%2Fvoxforge.org%2F&amp;ids=9070000000993956667&amp;fs=inbox&amp;counter=1&amp;d=id46287779');"&gt;http://sourceforge.net/community/cca09/nominate/?project_name=VoxForge&amp;amp;project_url=http://voxforge.org/&lt;/a&gt;&lt;/p&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5075537609830514591-1193358768091304174?l=nsh.nexiwave.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nsh.nexiwave.com/feeds/1193358768091304174/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://nsh.nexiwave.com/2009/05/my-vote-for-voxforge.html#comment-form' title='4 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/1193358768091304174'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/1193358768091304174'/><link rel='alternate' type='text/html' href='http://nsh.nexiwave.com/2009/05/my-vote-for-voxforge.html' title='My vote for voxforge'/><author><name>Nickolay V. 
Shmyrev</name><uri>http://www.blogger.com/profile/11220369315272283124</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://4.bp.blogspot.com/-mLPNJlEdMV8/TbZcF3qUEqI/AAAAAAAAAKA/6PXGkO0cmAk/s220/me.jpg'/></author><thr:total>4</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5075537609830514591.post-2091630966274186686</id><published>2009-05-19T02:03:00.000+04:00</published><updated>2009-05-19T02:06:14.704+04:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='TTS'/><category scheme='http://www.blogger.com/atom/ns#' term='blizzard'/><title type='text'>Blizzard Challenge 2009</title><content type='html'>CSTR and others are pleased to announce that the listening tests for the Blizzard  Challenge 2009 are now running. The Blizzard Challenge is an annual  open speech synthesis evaluation in which participants build voices  using common data, and a large listening test is used to compare them.  Participants include some of the leading commercial and academic  research groups in the field.&lt;br /&gt;&lt;br /&gt;I would appreciate your help in getting as many listeners to  participate as possible, by forwarding this message on to other lists,  colleagues, students, and of course taking part yourself.&lt;br /&gt;&lt;br /&gt;The listening test should take 30-60 minutes to complete, and can be  done in stages if you wish. You do not need to be a native speaker of  the language in order to take part. 
There are 4 different start pages  for the listening test, as follows:&lt;br /&gt;&lt;br /&gt;English&lt;br /&gt;&lt;br /&gt;Volunteers:&lt;br /&gt;&lt;a href="http://groups.inf.ed.ac.uk/blizzard/blizzard2009/english/register-ER.html"&gt;http://groups.inf.ed.ac.uk/blizzard/blizzard2009/english/register-ER.html&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;Speech Experts:&lt;br /&gt;&lt;a href="http://groups.inf.ed.ac.uk/blizzard/blizzard2009/english/register-ES.html"&gt;http://groups.inf.ed.ac.uk/blizzard/blizzard2009/english/register-ES.html&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;Mandarin Chinese:&lt;br /&gt;&lt;br /&gt;Volunteers:&lt;br /&gt;&lt;a href="http://groups.inf.ed.ac.uk/blizzard/blizzard2009/mandarin/register-MR.html"&gt;http://groups.inf.ed.ac.uk/blizzard/blizzard2009/mandarin/register-MR.html&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;Speech Experts:&lt;br /&gt;&lt;a href="http://groups.inf.ed.ac.uk/blizzard/blizzard2009/mandarin/register-MS.html"&gt;http://groups.inf.ed.ac.uk/blizzard/blizzard2009/mandarin/register-MS.html&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;Whether you consider yourself a 'speech expert' is left to your own &lt;br /&gt;judgement.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5075537609830514591-2091630966274186686?l=nsh.nexiwave.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nsh.nexiwave.com/feeds/2091630966274186686/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://nsh.nexiwave.com/2009/05/blizzard-challenge-2009.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/2091630966274186686'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/2091630966274186686'/><link rel='alternate' type='text/html' 
href='http://nsh.nexiwave.com/2009/05/blizzard-challenge-2009.html' title='Blizzard Challenge 2009'/><author><name>Nickolay V. Shmyrev</name><uri>http://www.blogger.com/profile/11220369315272283124</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://4.bp.blogspot.com/-mLPNJlEdMV8/TbZcF3qUEqI/AAAAAAAAAKA/6PXGkO0cmAk/s220/me.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5075537609830514591.post-6311038147670231845</id><published>2009-05-17T02:22:00.000+04:00</published><updated>2009-05-22T00:54:47.758+04:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='sphinxtrain'/><category scheme='http://www.blogger.com/atom/ns#' term='speech recognition'/><title type='text'>Training the large database trick</title><content type='html'>Training on a large database requires a cluster. SphinxTrain supports training on Torque:PBS, for example; to do this you need to set the following configuration variable:&lt;br /&gt;&lt;br /&gt;$CFG_QUEUE_TYPE = "Queue::PBS";&lt;br /&gt;&lt;br /&gt;and set the number of parts to train. The issue is to guess the number of parts. 
I previously thought that accuracy depends on it, based on these results:&lt;br /&gt;&lt;br /&gt;1 part:&lt;br /&gt;&lt;br /&gt;TOTAL Words: 773 Correct: 660 Errors: 126&lt;br /&gt;TOTAL Percent correct = 85.38% Error = 16.30% Accuracy = 83.70%&lt;br /&gt;TOTAL Insertions: 13 Deletions: 9 Substitutions: 104&lt;br /&gt;&lt;br /&gt;3 parts:&lt;br /&gt;&lt;br /&gt;TOTAL Words: 773 Correct: 583 Errors: 262&lt;br /&gt;TOTAL Percent correct = 75.42% Error = 33.89% Accuracy = 66.11%&lt;br /&gt;TOTAL Insertions: 72 Deletions: 17 Substitutions: 173&lt;br /&gt;&lt;br /&gt;10 parts:&lt;br /&gt;&lt;br /&gt;TOTAL Words: 773 Correct: 633 Errors: 168&lt;br /&gt;TOTAL Percent correct = 81.89% Error = 21.73% Accuracy = 78.27%&lt;br /&gt;TOTAL Insertions: 28 Deletions: 10 Substitutions: 130&lt;br /&gt;&lt;br /&gt;20 parts:&lt;br /&gt;&lt;br /&gt;TOTAL Words: 773 Correct: 619 Errors: 181&lt;br /&gt;TOTAL Percent correct = 80.08% Error = 23.42% Accuracy = 76.58%&lt;br /&gt;TOTAL Insertions: 27 Deletions: 13 Substitutions: 141&lt;br /&gt;&lt;br /&gt;But it turned out that all of the above is not true. One potential source of problems was that the norm.pl script grabs all the subdirectories under the bwaccum one indiscriminately. So if there are some old bwaccum dirs left over (e.g. if you train on 20 parts first and then start again with 10, without deleting the directories in between), the norm script will screw up (thanks to David Huggins-Daines for pointing that out to me). In this particular test there was another problem: I forgot to update mdef after the model rebuild, and the old scripts didn't do that automatically. With multipart training the order of senones in mdef is different, and that's why there was a regression, though the set of senones is the same.&lt;br /&gt;&lt;br /&gt;So the testing and statements above are completely wrong - &lt;span style="font-style: italic;"&gt;accuracy doesn't depend on the number of parts used&lt;/span&gt;. As expected.  
This confirms the basic truth that a correct experimental setup is the most important thing in research.&lt;br /&gt;&lt;br /&gt;Now only one issue is left: the drop in accuracy from the old tutorial to the new one. But that is a completely different issue, discussed in my mails on cmusphinx-sdmeet now.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5075537609830514591-6311038147670231845?l=nsh.nexiwave.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nsh.nexiwave.com/feeds/6311038147670231845/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://nsh.nexiwave.com/2009/05/training-large-database-trick.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/6311038147670231845'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/6311038147670231845'/><link rel='alternate' type='text/html' href='http://nsh.nexiwave.com/2009/05/training-large-database-trick.html' title='Training the large database trick'/><author><name>Nickolay V. 
Shmyrev</name><uri>http://www.blogger.com/profile/11220369315272283124</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://4.bp.blogspot.com/-mLPNJlEdMV8/TbZcF3qUEqI/AAAAAAAAAKA/6PXGkO0cmAk/s220/me.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5075537609830514591.post-565256559128513160</id><published>2009-05-11T00:13:00.000+04:00</published><updated>2009-05-11T00:25:47.438+04:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='acoustic model training'/><title type='text'>Bad prompts issue</title><content type='html'>After quite a lot of training of the model on a small part of the database to test things, I came to the conclusion that the main issue is bad prompts. Indeed, the accuracy on the training set for 4 hours of data, with the language model trained on the same training prompts, is only 85%. Usually it should be around 93%. The issue here is that the real testing prompts are also bad and they should stay that way, otherwise we'll be limited to high quality speech only. I remember I tried forced alignment with the communicator model before, but it didn't improve much, just because of the testing set issue. Another try was to use a skip state, which was not fruitful either.&lt;br /&gt;&lt;br /&gt;So the plan for now is to choose the subset with forced alignment again and train the model to check whether the hypothesis is true and bad prompts in the acoustic database are indeed the main issue. 
It looks like we are walking in circles.&lt;br /&gt;&lt;br /&gt;I ended up reading the article titled &lt;a href="http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.28.2392"&gt;"Lightly supervised model training"&lt;/a&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5075537609830514591-565256559128513160?l=nsh.nexiwave.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nsh.nexiwave.com/feeds/565256559128513160/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://nsh.nexiwave.com/2009/05/bad-prompts-issue.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/565256559128513160'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/565256559128513160'/><link rel='alternate' type='text/html' href='http://nsh.nexiwave.com/2009/05/bad-prompts-issue.html' title='Bad prompts issue'/><author><name>Nickolay V. 
Shmyrev</name><uri>http://www.blogger.com/profile/11220369315272283124</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://4.bp.blogspot.com/-mLPNJlEdMV8/TbZcF3qUEqI/AAAAAAAAAKA/6PXGkO0cmAk/s220/me.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5075537609830514591.post-3287413410019913477</id><published>2009-05-03T02:15:00.001+04:00</published><updated>2010-05-13T01:37:32.116+04:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='speech recognition'/><title type='text'>Speech AdBlock</title><content type='html'>Inspired by &lt;a href="http://dingoskidneys.com/%7Edholth/"&gt;Daniel Holth's&lt;/a&gt; application to remove the word "twitter" from podcasts:&lt;br /&gt;&lt;br /&gt;&lt;a href="http://dingoskidneys.com/cgi-bin/hgwebdir.cgi/twitterkiller/"&gt;http://dingoskidneys.com/cgi-bin/hgwebdir.cgi/twitterkiller/&lt;br /&gt;&lt;/a&gt;&lt;br /&gt;I think it's a very good idea to implement a keyword filter to block advertising in podcasts. Though keyword spotting is not easily done with CMUsphinx right now, it should be a rather straightforward thing to implement. 
In the end it can be just a binary application that takes a list of keywords to block and filters an mp3 file, giving the user the same file with the advertising blocked.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5075537609830514591-3287413410019913477?l=nsh.nexiwave.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nsh.nexiwave.com/feeds/3287413410019913477/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://nsh.nexiwave.com/2009/05/speech-adblock.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/3287413410019913477'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/3287413410019913477'/><link rel='alternate' type='text/html' href='http://nsh.nexiwave.com/2009/05/speech-adblock.html' title='Speech AdBlock'/><author><name>Nickolay V. Shmyrev</name><uri>http://www.blogger.com/profile/11220369315272283124</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://4.bp.blogspot.com/-mLPNJlEdMV8/TbZcF3qUEqI/AAAAAAAAAKA/6PXGkO0cmAk/s220/me.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5075537609830514591.post-6472229210931098883</id><published>2009-05-01T23:17:00.000+04:00</published><updated>2009-05-03T02:15:21.549+04:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='sphinx4'/><title type='text'>Cepwin Features Training</title><content type='html'>Recently the option to bypass the delta and delta-delta feature extraction process and directly apply the LDA transform matrix to the cepstrum coefficients of sequential frames was added to sphinxtrain. 
To use it you need to adjust the training config and the decoder as well:&lt;br /&gt;&lt;br /&gt;&lt;li&gt;Set feature type to 1s_c &lt;span class="anchor" id="line-56"&gt;&lt;/span&gt;&lt;/li&gt;&lt;li&gt;Add $CFG_FEAT_WINDOW=3; to the config file &lt;span class="anchor" id="line-57"&gt;&lt;/span&gt;&lt;/li&gt;&lt;li&gt;Train with MLLT &lt;span class="anchor" id="line-58"&gt;&lt;/span&gt;&lt;/li&gt;&lt;li&gt;Apply the attached patch to sphinxbase &lt;a class="attachment" href="http://www.speech.cs.cmu.edu/cmusphinx/moinmoin/LDAMLLT?action=AttachFile&amp;amp;do=view&amp;amp;target=cepwin.diff" title="attachment:cepwin.diff"&gt;cepwin.diff&lt;/a&gt;. &lt;span class="anchor" id="line-59"&gt;&lt;/span&gt;&lt;/li&gt;&lt;li&gt;Decode &lt;/li&gt;&lt;br /&gt;You can use these models in sphinx4 now; the following config should do the job:&lt;br /&gt;&lt;br /&gt;&amp;lt;component name="featureExtraction" type="edu.cmu.sphinx.frontend.feature.ConcatFeatureExtractor"&amp;gt;&lt;br /&gt;   &amp;lt;property name="windowSize" value="3"/&amp;gt;&lt;br /&gt; &amp;lt;/component&amp;gt;&lt;br /&gt; &amp;lt;component name="lda" type="edu.cmu.sphinx.frontend.feature.LDA"&amp;gt;&lt;br /&gt;   &amp;lt;property name="loader" value="sphinx3Loader"/&amp;gt;&lt;br /&gt; &amp;lt;/component&amp;gt;&lt;br /&gt;&lt;br /&gt;I haven't found the optimal parameters yet, but it seems that something like cepwin=3 and a final dimension around 40 should work. 
I hope to get results on this soon.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5075537609830514591-6472229210931098883?l=nsh.nexiwave.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nsh.nexiwave.com/feeds/6472229210931098883/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://nsh.nexiwave.com/2009/05/recently-option-to-bypass-delta-and.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/6472229210931098883'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/6472229210931098883'/><link rel='alternate' type='text/html' href='http://nsh.nexiwave.com/2009/05/recently-option-to-bypass-delta-and.html' title='Cepwin Features Training'/><author><name>Nickolay V. Shmyrev</name><uri>http://www.blogger.com/profile/11220369315272283124</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://4.bp.blogspot.com/-mLPNJlEdMV8/TbZcF3qUEqI/AAAAAAAAAKA/6PXGkO0cmAk/s220/me.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5075537609830514591.post-6558219847372342622</id><published>2009-04-26T23:46:00.001+04:00</published><updated>2009-04-27T00:31:05.540+04:00</updated><title type='text'>Looking back on Free Software</title><content type='html'>I've read some books on business recently:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;&lt;a href="http://www.amazon.com/Portable-MBA-Project-Management/dp/0471268992"&gt;Project Management &lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://www.amazon.com/Ten-Day-MBA-Step-step-Mastering/dp/0688137881"&gt;10 Days MBA &lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a 
href="http://www.amazon.com/Portable-MBA-Marketing-Mba/dp/047154728X"&gt;Marketing &lt;/a&gt;&lt;/li&gt;&lt;/ul&gt;They sometimes repeat each other but actually have a few interesting moments. At least I started to look at all this from a slightly different point of view. Unfortunately this domain is covered differently by Free Software community people, who tend to be idealistic but promote their point of view actively. Words like "community" or "leadership" or "cool people" don't bring much in the end, and the most interesting thing is that such words are mostly spoken by corporate people.&lt;br /&gt;&lt;br /&gt;Anyhow, it would be nice to have a project with a clear mission and a set of reachable goals, like product plans, each with design documents, both technical and non-technical.  It would be nice to have a test set with 90% coverage and a build without warnings and also a tracking system for user requests. Things like a slick UI are also important. After all, it's easier to get this than to build an LVCSR with 95% accuracy I think :)&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5075537609830514591-6558219847372342622?l=nsh.nexiwave.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nsh.nexiwave.com/feeds/6558219847372342622/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://nsh.nexiwave.com/2009/04/looking-back-on-free-software.html#comment-form' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/6558219847372342622'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/6558219847372342622'/><link rel='alternate' type='text/html' href='http://nsh.nexiwave.com/2009/04/looking-back-on-free-software.html' title='Looking back on Free 
Software'/><author><name>Nickolay V. Shmyrev</name><uri>http://www.blogger.com/profile/11220369315272283124</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://4.bp.blogspot.com/-mLPNJlEdMV8/TbZcF3qUEqI/AAAAAAAAAKA/6PXGkO0cmAk/s220/me.jpg'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5075537609830514591.post-6150109469089348463</id><published>2009-04-15T01:52:00.001+04:00</published><updated>2009-04-15T01:54:35.393+04:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='frama-c eclipse git'/><title type='text'>Frama-C Eclipse plugin</title><content type='html'>I decided to finally go forward and publish the modifications to the frama-c Eclipse plugin that I'm making at work. Moreover, I decided to try git/github. Let's see how it goes. The project is here:&lt;br /&gt;&lt;br /&gt;&lt;a href="http://github.com/frama-c-eclipse/frama-c-eclipse/tree/master"&gt;http://github.com/frama-c-eclipse/frama-c-eclipse/tree/master&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;The future plans include&lt;br /&gt;&lt;ul&gt;&lt;li&gt;better graphics&lt;/li&gt;&lt;li&gt;more cleanup&lt;/li&gt;&lt;li&gt;off-the-shelf support for recent Frama-C versions&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5075537609830514591-6150109469089348463?l=nsh.nexiwave.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nsh.nexiwave.com/feeds/6150109469089348463/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://nsh.nexiwave.com/2009/04/frama-c-eclipse-plugin.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/6150109469089348463'/><link rel='self' type='application/atom+xml' 
href='http://www.blogger.com/feeds/5075537609830514591/posts/default/6150109469089348463'/><link rel='alternate' type='text/html' href='http://nsh.nexiwave.com/2009/04/frama-c-eclipse-plugin.html' title='Frama-C Eclipse plugin'/><author><name>Nickolay V. Shmyrev</name><uri>http://www.blogger.com/profile/11220369315272283124</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://4.bp.blogspot.com/-mLPNJlEdMV8/TbZcF3qUEqI/AAAAAAAAAKA/6PXGkO0cmAk/s220/me.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5075537609830514591.post-7336853261690634730</id><published>2009-04-09T04:40:00.000+04:00</published><updated>2009-04-09T04:57:53.219+04:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='sphinx4 configuration'/><title type='text'>Quest in configuration file</title><content type='html'>After almost a year of wondering I finally discovered what this means in sphinx4 config files:&lt;br /&gt;&lt;br /&gt;   &amp;lt;component name=&amp;quot;activeListManager&amp;quot;&lt;br /&gt;                  type=&amp;quot;edu.cmu.sphinx.decoder.search.SimpleActiveListManager&amp;quot;&amp;gt;&lt;br /&gt;        &amp;lt;propertylist name=&amp;quot;activeListFactories&amp;quot;&amp;gt;&lt;br /&gt;            &amp;lt;item&amp;gt;standardActiveListFactory&amp;lt;/item&amp;gt;&lt;br /&gt;            &amp;lt;item&amp;gt;wordActiveListFactory&amp;lt;/item&amp;gt;&lt;br /&gt;            &amp;lt;item&amp;gt;wordActiveListFactory&amp;lt;/item&amp;gt;&lt;br /&gt;            &amp;lt;item&amp;gt;standardActiveListFactory&amp;lt;/item&amp;gt;&lt;br /&gt;            &amp;lt;item&amp;gt;standardActiveListFactory&amp;lt;/item&amp;gt;&lt;br /&gt;            &amp;lt;item&amp;gt;standardActiveListFactory&amp;lt;/item&amp;gt;&lt;br /&gt;        &amp;lt;/propertylist&amp;gt;&lt;br /&gt;    &amp;lt;/component&amp;gt;&lt;br /&gt;&lt;br /&gt;Actually it's even described in the 
docs:&lt;br /&gt;&lt;br /&gt;The SimpleActiveListManager is of class edu.cmu.sphinx.decoder.search.SimpleActiveListManager.  Since the word-pruning search manager performs pruning on different search state types separately, we need a different active list for each state type. Therefore, you see different active list factories being listed in the SimpleActiveListManager, one for each type. So how do we know which active list factory is for which state type? It depends on the 'search order' as returned by the search graph (which in this case is generated by the  LexTreeLinguist). The search state order and active list factory used here are: &lt;p&gt; &lt;table border="1"&gt; &lt;tbody&gt;&lt;tr&gt;&lt;td&gt;&lt;b&gt;State Type&lt;/b&gt;&lt;/td&gt;&lt;td&gt;&lt;b&gt;ActiveListFactory&lt;/b&gt;&lt;/td&gt;&lt;/tr&gt; &lt;tr&gt;&lt;td&gt;LexTreeNonEmittingHMMState&lt;/td&gt;&lt;td&gt;standardActiveListFactory&lt;/td&gt;&lt;/tr&gt; &lt;tr&gt;&lt;td&gt;LexTreeWordState&lt;/td&gt;&lt;td&gt;wordActiveListFactory&lt;/td&gt;&lt;/tr&gt; &lt;tr&gt;&lt;td&gt;LexTreeEndWordState&lt;/td&gt;&lt;td&gt;wordActiveListFactory&lt;/td&gt;&lt;/tr&gt;  &lt;tr&gt;&lt;td&gt;LexTreeEndUnitState&lt;/td&gt;&lt;td&gt;standardActiveListFactory&lt;/td&gt;&lt;/tr&gt; &lt;tr&gt;&lt;td&gt;LexTreeUnitState&lt;/td&gt;&lt;td&gt;standardActiveListFactory&lt;/td&gt;&lt;/tr&gt; &lt;tr&gt;&lt;td&gt;LexTreeHMMState&lt;/td&gt;&lt;td&gt;standardActiveListFactory&lt;/td&gt;&lt;/tr&gt; &lt;/tbody&gt;&lt;/table&gt; &lt;/p&gt;  There are two types of active list factories used here, the standard and  the word. If you look at the 'frequently tuned properties' above, you will find that the word active list has a much smaller beam size than the standard active list. The beam size for the word active list is set by 'absoluteWordBeamWidth' and 'relativeWordBeamWidth', while the beam size for the standard active list is set by 'absoluteBeamWidth' and 'relativeBeamWidth'. 
The SimpleActiveListManager allows us to control the beam size of different types of states.&lt;br /&gt;&lt;br /&gt;It's hard to guess, isn't it? Well, I hope soon we'll be able to make configuration easier. The idea of annotated configuration came to my mind today. Together with the older idea of using task-oriented predefined configurations it could really save a lot of effort.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5075537609830514591-7336853261690634730?l=nsh.nexiwave.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nsh.nexiwave.com/feeds/7336853261690634730/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://nsh.nexiwave.com/2009/04/quest-in-configuration-file.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/7336853261690634730'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/7336853261690634730'/><link rel='alternate' type='text/html' href='http://nsh.nexiwave.com/2009/04/quest-in-configuration-file.html' title='Quest in configuration file'/><author><name>Nickolay V. 
Shmyrev</name><uri>http://www.blogger.com/profile/11220369315272283124</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://4.bp.blogspot.com/-mLPNJlEdMV8/TbZcF3qUEqI/AAAAAAAAAKA/6PXGkO0cmAk/s220/me.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5075537609830514591.post-4781402821845164045</id><published>2009-04-07T01:21:00.001+04:00</published><updated>2010-05-13T01:34:20.136+04:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='java'/><title type='text'></title><content type='html'>Today with the help of my chief I've found &lt;a href="http://findbugs.sourceforge.net/"&gt;FindBugs&lt;/a&gt;, a nice static analyzer that tries to find issues in Java code and reports them. It's a very useful tool since I've already fixed a few bad things in sphinx4 and in other projects. The number of false positives is acceptable. A similar tool for C, for example, is &lt;a href="http://www.splint.org/"&gt;splint&lt;/a&gt;, though Java tools as usual are much more useful. And there is an Eclipse plugin that helps to apply the tool with a single mouse click.&lt;br /&gt;&lt;br /&gt;This makes me think about what can be counted as a development platform. Although it's well known that scripting languages like Python speed up development, they totally lack tools like static analyzers, debuggers, profilers, documentation and testing frameworks and so on and so forth. There is some effort to create a common framework to quickly build development tools along with a DSL, but the result is not so advanced I suppose. Basically it seems today there is no choice which language to use for development, and in the light of this it seems very strange that GNOME development goes in the completely opposite direction, stepping into the domain of JavaScript and naive programming. 
I hope the desktop will not become a collection of bugs after that.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5075537609830514591-4781402821845164045?l=nsh.nexiwave.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nsh.nexiwave.com/feeds/4781402821845164045/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://nsh.nexiwave.com/2009/04/today-ive-found-findbugs-nice-static.html#comment-form' title='3 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/4781402821845164045'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/4781402821845164045'/><link rel='alternate' type='text/html' href='http://nsh.nexiwave.com/2009/04/today-ive-found-findbugs-nice-static.html' title=''/><author><name>Nickolay V. Shmyrev</name><uri>http://www.blogger.com/profile/11220369315272283124</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://4.bp.blogspot.com/-mLPNJlEdMV8/TbZcF3qUEqI/AAAAAAAAAKA/6PXGkO0cmAk/s220/me.jpg'/></author><thr:total>3</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5075537609830514591.post-436737869306243196</id><published>2009-03-28T00:26:00.001+03:00</published><updated>2010-05-13T01:35:49.725+04:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='sphinx'/><title type='text'>Sphinx4 migrated to git</title><content type='html'>This change started some time ago, but now it's mostly finished and announced. 
The tree can be found here:&lt;br /&gt;&lt;br /&gt;&lt;a href="https://sourceforge.net/scm/?type=git&amp;amp;group_id=257562"&gt;https://sourceforge.net/scm/?type=git&amp;amp;group_id=257562&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;The discussion is &lt;a href="https://sourceforge.net/mailarchive/forum.php?thread_name=11ADE37F-A071-43FC-A652-392B27A5C1F7%40talkhouse.com&amp;amp;forum_name=cmusphinx-devel"&gt;here&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;I'm glad to see progress happening, big thanks to everyone involved - Joe, Piter and others.&lt;br /&gt;&lt;br /&gt;About git itself I have mixed feelings. The advantages of DVCS aren't obvious to me and in the past I even gave up my participation in one of the projects after its migration to mercurial (it was http://linuxtv.org). The distributed nature increases complexity and confuses at least me. It's hard to understand where the latest changes are made, what the real state of things is and where change happens. Developers tend to add their changes to their own branches and little effort is made to create a common branch. Also among all DVCSs git is the worst in terms of usability. Sadly GNOME will also migrate to git in the near future.&lt;br /&gt;&lt;br /&gt;Every change has its black and white sides. There are many things I do like in the new sphinx4, such as the clear split of the tests one can run. Some things are hard to understand, like the Rakefile migration. I'm worried about Windows users: how will they build sphinx4 now? 
Anyhow, let's hope issues will be resolved and the new shiny release will appear very soon.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5075537609830514591-436737869306243196?l=nsh.nexiwave.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nsh.nexiwave.com/feeds/436737869306243196/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://nsh.nexiwave.com/2009/03/sphinx4-migrated-to-git.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/436737869306243196'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/436737869306243196'/><link rel='alternate' type='text/html' href='http://nsh.nexiwave.com/2009/03/sphinx4-migrated-to-git.html' title='Sphinx4 migrated to git'/><author><name>Nickolay V. Shmyrev</name><uri>http://www.blogger.com/profile/11220369315272283124</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://4.bp.blogspot.com/-mLPNJlEdMV8/TbZcF3qUEqI/AAAAAAAAAKA/6PXGkO0cmAk/s220/me.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5075537609830514591.post-2124508697774949894</id><published>2009-03-26T04:04:00.000+03:00</published><updated>2009-03-26T04:05:33.447+03:00</updated><title type='text'>Russian GNOME 2.26</title><content type='html'>Russian GNOME 2.26 is &lt;a href="http://l10n.gnome.org/releases/gnome-2-26/"&gt;100%&lt;/a&gt; translated. 
Congratulations to the team for their hard work.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5075537609830514591-2124508697774949894?l=nsh.nexiwave.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nsh.nexiwave.com/feeds/2124508697774949894/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://nsh.nexiwave.com/2009/03/russian-gnome-226.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/2124508697774949894'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/2124508697774949894'/><link rel='alternate' type='text/html' href='http://nsh.nexiwave.com/2009/03/russian-gnome-226.html' title='Russian GNOME 2.26'/><author><name>Nickolay V. Shmyrev</name><uri>http://www.blogger.com/profile/11220369315272283124</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://4.bp.blogspot.com/-mLPNJlEdMV8/TbZcF3qUEqI/AAAAAAAAAKA/6PXGkO0cmAk/s220/me.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5075537609830514591.post-9140458123207848635</id><published>2009-03-22T08:16:00.001+03:00</published><updated>2009-03-22T08:35:50.823+03:00</updated><title type='text'>GNOME Summer of code tasks</title><content type='html'>I spent some time today trying to invent some interesting tasks for GNOME summer of code 2009. 
My favorite list for now is:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Text summarizer in Epiphany&lt;/li&gt;&lt;li&gt;Improved spell check for GEdit&lt;/li&gt;&lt;li&gt;Doxygen support for gtk-doc&lt;/li&gt;&lt;li&gt;Desktop-wide services for activity registration&lt;/li&gt;&lt;li&gt;Automatic workstation mode detection and more AI tasks desktop can benefit from&lt;/li&gt;&lt;li&gt;Cleanup of the Evolution interface where sent and received mail are grouped together&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;The list is probably too boring, but one should note that summer is usually too short to implement something serious and students are not as experienced as one would like them to be. Some of the tasks were rejected already, though it's not a big deal. I just find it discouraging that the list of tasks proposed officially is even more tedious.&lt;br /&gt;&lt;br /&gt;The overview of this issue makes me think again about GNOME as a product on the market and the possible ways of its development. It seems that we are now at a point where the feature sets among competitors have stabilized and it's hard to invent something new in the market. This is the so-called &lt;a href="http://en.wikipedia.org/wiki/Product_life_cycle_management"&gt;mature product stage&lt;/a&gt;, where it's important to polish and lower costs. A big step is required to shift the product to a new level. Probably I need to investigate the research desktops that completely change the way users work with the system. For example I'd love to see better AI support everywhere like adaptive preferences, better stability and security with proper IPC and service-based architecture, self-awareness services, a modern programming language. 
I'm not sure I'm brave enough for that though.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5075537609830514591-9140458123207848635?l=nsh.nexiwave.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nsh.nexiwave.com/feeds/9140458123207848635/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://nsh.nexiwave.com/2009/03/gnome-summer-of-code-tasks.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/9140458123207848635'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/9140458123207848635'/><link rel='alternate' type='text/html' href='http://nsh.nexiwave.com/2009/03/gnome-summer-of-code-tasks.html' title='GNOME Summer of code tasks'/><author><name>Nickolay V. Shmyrev</name><uri>http://www.blogger.com/profile/11220369315272283124</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://4.bp.blogspot.com/-mLPNJlEdMV8/TbZcF3qUEqI/AAAAAAAAAKA/6PXGkO0cmAk/s220/me.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5075537609830514591.post-2045304500614255508</id><published>2009-03-18T00:40:00.001+03:00</published><updated>2009-03-18T00:50:33.631+03:00</updated><title type='text'>HTK 3.4.1 is released</title><content type='html'>&lt;p&gt;Amazing news really. The new features of the release include:&lt;/p&gt; &lt;p&gt;  1. The HTK Book has been extended to include tutorial sections on HDecode and discriminative training. An initial description of the theory and options for discriminative training has also been added.&lt;br /&gt;  2. HDecode has been extended to support decoding with trigram language models.&lt;br /&gt;  3. 
Lattice generation with HDecode has been improved to yield a greater lattice density.&lt;br /&gt;  4. HVite now supports model-marking of lattices.&lt;br /&gt;  5. Issues with HERest using single-pass retraining with HLDA and other input transforms have been resolved.&lt;br /&gt;  6. Many other smaller changes and bug fixes have been integrated.&lt;/p&gt;&lt;p&gt;The release is available on the &lt;a href="http://htk.eng.cam.ac.uk/"&gt;HTK website&lt;/a&gt;&lt;br /&gt;&lt;/p&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5075537609830514591-2045304500614255508?l=nsh.nexiwave.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nsh.nexiwave.com/feeds/2045304500614255508/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://nsh.nexiwave.com/2009/03/htk-341-is-released.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/2045304500614255508'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5075537609830514591/posts/default/2045304500614255508'/><link rel='alternate' type='text/html' href='http://nsh.nexiwave.com/2009/03/htk-341-is-released.html' title='HTK 3.4.1 is released'/><author><name>Nickolay V. Shmyrev</name><uri>http://www.blogger.com/profile/11220369315272283124</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://4.bp.blogspot.com/-mLPNJlEdMV8/TbZcF3qUEqI/AAAAAAAAAKA/6PXGkO0cmAk/s220/me.jpg'/></author><thr:total>0</thr:total></entry></feed>
