
Arabic and CMU Sphinx speech recognition



Hello,
 
I came across a post here from late 2005 asking about CMU Sphinx support for Arabic. 
 
I notice there is someone out there who apparently is working on it:
  http://sourceforge.net/forum/message.php?msg_id=3606483

(I have cc'd him or her on this message.)
 
But I don't see anything available for download yet (although I did not look very hard).  So I am writing this message to point out that some Arabic speech recognition applications may not require explicit Arabic support in Sphinx.

In my research group, someone once needed to create a speech recognizer for Chinese in a hurry.  She used an American English acoustic model (sound model) which was already prepared, and a pronunciation dictionary which spelled out Chinese words in terms of those American English sounds.  This kind of trick is probably obvious to some of you, but it surprised me when I heard about it, so I will go into it in more detail.  I expect that using English models lowered the accuracy of the speech recognition system, but not all speech recognition applications require the highest possible accuracy.
 
A hypothetical application which I think would not require the highest possible accuracy is allowing a person who cannot use a mouse or keyboard, but can speak normally, to browse the web using speech recognition to follow bookmarks and links.  For this I might want the following phrases recognized in Arabic: "Open Firefox", "Close Firefox", "Open Bookmark List", "Close Bookmark List", "Next Bookmark", "Open Bookmark", "Next link", "Open link", "Down", "Up", and "Back".  The speech recognition difficulty of this application is much lower than, e.g., that of dictating free-form documents, because there are only a few phrases and no two of them sound very much alike (at least in English).
 
A recognizer like Sphinx uses three types of language-dependent models:
 
(1) An acoustic model.  This represents (usually statistically) a range of possible audio realizations for the phones (individual sounds) of the language.  (There are lists of English and Arabic phones at http://www.phon.ucl.ac.uk/home/sampa/, although the phone set used by Sphinx might differ.)  The acoustic model is sometimes customized for the application, to better match the type of speech (e.g., the speaking style or the recording conditions).
 
(2) A pronunciation dictionary specifying how each word is pronounced in terms of the phones in the acoustic model.  
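To make this concrete, here is a sketch of what pronunciation dictionary entries look like.  Sphinx dictionaries are plain text, one word per line, with the word followed by its phone sequence; the phones below are in the ARPAbet-style set used by the CMU Pronouncing Dictionary (the exact pronunciations shown are illustrative and should be checked against the actual cmudict):

```
OPEN       OW P AH N
CLOSE      K L OW S
BOOKMARK   B UH K M AA R K
LINK       L IH NG K
BACK       B AE K
```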
 
(3) A language model or grammar, which models patterns of word usage.  This is normally customized for the application.  Every word in the language model must be in the pronunciation dictionary.
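For a small command-and-control application like the hypothetical one above, the "language model" can be a finite grammar rather than a statistical model.  Sphinx-4, for example, accepts grammars in JSGF format; a sketch for the browsing commands (English versions, for illustration) might look like:

```
#JSGF V1.0;
grammar browse;
public <command> = open firefox | close firefox |
                   open bookmark list | close bookmark list |
                   next bookmark | open bookmark |
                   next link | open link |
                   down | up | back;
```

Every word appearing in the grammar (open, firefox, bookmark, and so on) would need an entry in the pronunciation dictionary.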
 
In the hypothetical example, I could use one of the existing acoustic models and edit the pronunciation dictionary so that it approximated the pronunciation of the required Arabic words using American English phones (or the phones of whatever language the acoustic model was built for; perhaps one of the existing non-English acoustic models would be a better match).  Since the number of phrases is small, it might be convenient to put entire phrases into the dictionary as if they were single words; in that case only a dummy language model listing those phrase-words would be needed.
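As a sketch of the phrase-as-word trick: suppose (purely hypothetically; I do not speak Arabic, so a native speaker should choose the actual phrases, romanizations, and phone approximations) that the Arabic commands were romanized as below.  Each entire phrase becomes one dictionary "word", spelled out in American English phones:

```
IFTAH-FIREFOX    IH F T AA HH F AY ER F AA K S
IGHLIQ-FIREFOX   IH G L IY K F AY ER F AA K S
ILA-AL-ASFAL     IH L AA AH L AH S F AH L
```

The grammar would then simply list the phrase-words as alternatives, one per command.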
 
Regards,
David Gelbart
 
PS Some places you might find useful if you are looking for Arabic data
for building speech recognition systems are www.nemlar.org, www.elda.org,
www.ldc.upenn.edu, and (on Usenet or Google Groups) comp.speech.research. 
 
PPS I do not read mail at the account I am posting this from; if you want to contact me directly for some reason please use my work account which you can find at my ICSI home page.
 
 

 

