
Re: [OT?] Re: A (too huge) Arabic word-list (with prefixes) for spell-checkers



On Tue, May 16, 2006, Jonathan Ben Avraham wrote about "[OT?] Re: A (too huge) Arabic word-list (with prefixes) for spell-checkers":
> Hi Dan, Mohammed,
> What would it take to port Hspell to Arabic? That is, to make the code 
> adaptable to any of the Semitic languages? Would it be possible to have 
> one spellchecker for both Hebrew and Arabic?
> 
>  - yba

I'm CC'ing this to Arabeyes, because I don't think Mohammed reads the Ivrix
list.

Like everything(?), Hspell is basically composed of three parts:
 1. The inflection code: taking base words and a few hints for each,
    and producing a list of all legal inflections.
 2. The lexicon: the list of these base words and hints.
 3. The spell-checking code, taking the list of all legal word forms and
    marking words as correct or incorrect.
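
To make the split concrete, here is a minimal sketch of how the three parts
fit together. It is a toy, not Hspell's actual code: the lexicon entries, the
"pos" hint, and the English-like suffix rules are all invented placeholders,
and the function names are hypothetical.

```python
# Toy illustration only: the entries, the "pos" hint, and the suffix rules
# are invented English-like placeholders, not Hspell's data or rules.

# Part 2: the lexicon -- base words plus a few hints for each.
LEXICON = {
    "walk": {"pos": "verb"},
    "book": {"pos": "noun"},
}


# Part 1: the inflection code -- base word + hints -> all legal forms.
def inflect(base, hints):
    forms = {base}
    if hints["pos"] == "verb":
        forms.update({base + "s", base + "ed", base + "ing"})
    elif hints["pos"] == "noun":
        forms.add(base + "s")
    return forms


def build_word_list(lexicon):
    """Expand every lexicon entry into the full set of legal word forms."""
    words = set()
    for base, hints in lexicon.items():
        words |= inflect(base, hints)
    return words


# Part 3: the spell-checking code -- mark each word as correct or incorrect.
def check(text, word_list):
    return [(word, word in word_list) for word in text.split()]


WORDS = build_word_list(LEXICON)
RESULT = check("walked books bookz", WORDS)
```

Only inflect() and LEXICON are language-specific in this sketch; check() knows
nothing about the language, which is why that part is the easiest to reuse or
replace.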

Part #3 could easily be used for Arabic as well, but this is not a terribly big
win, because aspell, for example, can also replace this part of Hspell (e.g.,
see our dictionaries for aspell and OpenOffice).

Part #1 is only useful for Arabic as an idea, a technique that we proved
useful and very easy to implement (a prototype useful for about 50% of the
Hebrew language was done in a few days). I'll gladly explain more about this
idea to anyone interested in listening. Obviously, the specific inflection
rules we wrote will not apply directly to Arabic.

But part #2, the lexicon, is what we spent most of our work on - I'd estimate
that as much as 90% of the work that went into Hspell went into building the
lexicon: from scratch, with utmost attention given to correctness and
accuracy. Unfortunately, none of this work can be reused for an Arabic
spell-checker, which will have a completely different lexicon. We can, however,
if there's interest, outline some of the processes we used to find missing
words and decide on the correct spelling.

What Dan tried to do is to show that once a lexicon + inflections are
available, it's relatively easy to create an aspell-format dictionary and
get support for that language in OpenOffice, Mozilla, and GMail. Since
support for Arabic does not yet exist in these programs, we thought that
such progress would be very exciting for Arabic writers. Of course, if
the lexicon itself is worthless and filled with incorrect words, well,
nothing we ever do with it will be useful. In computer parlance, this is
known as "garbage in, garbage out"...
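
As a rough sketch of that last step: once all legal word forms have been
generated, exporting them can be as simple as writing a sorted, de-duplicated
list with one word per line. That a dictionary-building tool accepts such a
plain list is an assumption here (check the aspell documentation for the exact
input it expects), and write_word_list is a hypothetical helper name.

```python
import io


def write_word_list(words, out):
    """Write the unique words in sorted order, one word per line."""
    for word in sorted(set(words)):
        out.write(word + "\n")


# Example: duplicates are dropped and the output is sorted.
buf = io.StringIO()
write_word_list(["books", "book", "books"], buf)
```

Nothing in this step checks the words themselves, so any wrong entries in the
lexicon pass straight through into the dictionary.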

-- 
Nadav Har'El                        |      Tuesday, May 16 2006, 18 Iyyar 5766
nyh at math dot technion dot ac dot il             |-----------------------------------------
Phone +972-523-790466, ICQ 13349191 |I have a great signature, but it won't
http://nadav.harel.org.il           |fit at the end of this message -- Fermat