Re: A (too huge) Arabic word-list (with prefixes) for spell-checkers



On Tue, May 16, 2006, Mohammed Sameer wrote about "Re: A (too huge) Arabic word-list (with prefixes) for spell-checkers":
>...
> The data set contains words from the Holy Quran, The words in the Holy Quran are sometimes
> spelled in a different way due to the script used to write the Quran.
> 
> Those words are incorrect outside the Quran context.
>...

I have looked at Dan's example on http://ivrix.org.il/projects/arabic/,
and it seems that spell-checking a modern Arabic text (that he took from
Wikipedia) worked quite well. Could it be that while that word list is not
100% correct, it still contains a substantial amount of correct data, and,
say, 90% of the words it lists are spelled correctly, and most of the remaining
words can easily be fixed by an Arabic writer?

The reason I'm asking this is because, like I said, 90% of the work that
went into Hspell was building the lexicon. We spent a very large amount of
time sifting through texts, looking for spelling errors which are in fact
correct words, to add to the lexicon. This effort became harder and harder as
our lexicon grew, and I estimate that now it takes me 10 times the effort to
find a new word to add than it took me when I was adding the first 1000 words.
The reason we had to do this slow word-finding process was that it is illegal
to just open a Hebrew dictionary, and start copying the words one by one,
so we had to find other ways to come up with missing words (we obviously
couldn't just "recall" words from memory, and we had no free Hebrew lexicon).
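
To make that word-finding process concrete, here is a minimal sketch in
Python. It is not the tooling we actually used for Hspell; the function name
candidate_words and the toy English data are made up for illustration, and a
real tokenizer for Hebrew or Arabic text would need more care. The idea is
only this: run a text against the current lexicon, collect the words the
lexicon does not know, and review the most frequent ones by hand.

```python
import re
from collections import Counter


def candidate_words(text, lexicon):
    """Return (word, count) pairs for words missing from the lexicon,
    most frequent first. Each one is either a real spelling error or a
    correct word the lexicon still lacks; a person has to decide which."""
    words = re.findall(r"\w+", text.lower())
    unknown = Counter(w for w in words if w not in lexicon)
    return unknown.most_common()


# Toy data (hypothetical): a tiny lexicon and a short text.
lexicon = {"the", "cat", "sat", "on", "mat"}
text = "The cat sat on the mat. The dog sat on the rug, and the dog slept."

for word, count in candidate_words(text, lexicon):
    print(count, word)
```

As the lexicon grows, the candidate list for any given text gets shorter and
holds a larger share of genuine misspellings, which is why each new word
costs more effort to find than the previous one.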

With Tim Buckwalter's list, you have a much better start than we did: you
can actually go over his list, word by word, and remove, or better yet fix,
any misspelled word. It should be easier, I think, than to start from scratch.

Of course, you still need an inflection program in addition to the lexicon.
If you think that Tim Buckwalter's inflection program creates wrong
inflections, you can write a different one.


-- 
Nadav Har'El                        |      Tuesday, May 16 2006, 19 Iyyar 5766
nyh at math dot technion dot ac dot il             |-----------------------------------------
Phone +972-523-790466, ICQ 13349191 |Hindsight is always 20:20
http://nadav.harel.org.il           |