
Arabic spellchecker




Salam,

I've recently been looking into the issue of an Arabic spellchecker.
I've looked at duali and Myspell. Myspell is OOo's spellchecker, and
Aspell can use Myspell's dictionary.

Thanks to an old document written by elzubeir about Arabic grammar
rules [1], I was able to experiment with a small affix dictionary.

I think it's the general consensus that a Myspell/Aspell AFFIX
dictionary is the best way to go. We will ignore harakat for now. Ammar
and Nadim seem to agree on this.

An affix dictionary consists of 2 files, the AFFIX file, and the
DICTIONARY file.

The AFFIX file is a small file containing rules about how prefixes and
suffixes are to be added to a word. The DICTIONARY file is a word list
with flags on every word to indicate the compatible prefixes and
suffixes from the AFFIX file.
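
To make the split between the two files concrete, here is a minimal
Python sketch of how flagged dictionary entries could be expanded with
affix rules. It is a simplified model, not Myspell's actual file syntax
or matching algorithm: the flag letters, the rule table, the placeholder
transliterated words, and the names `expand` and `is_correct` are all
made up for illustration, and it assumes any prefix may be combined with
any suffix.

```python
# Simplified model of an affix dictionary (illustrative only; this is
# not Myspell's real file syntax or matching algorithm).

# Stand-in for the AFFIX file: flag -> (kind, text to add).
# Flags and affix strings are hypothetical placeholders.
AFFIX_RULES = {
    "A": ("prefix", "AL"),
    "B": ("suffix", "AT"),
}

# Stand-in for the DICTIONARY file: each word carries the flags of the
# affixes it is compatible with.
DICTIONARY = {
    "KTAB": "A",    # accepts prefix A only
    "MKTB": "AB",   # accepts prefix A and suffix B
}


def expand(word, flags):
    """Return every form accepted for one dictionary entry."""
    prefixes = [""]  # the empty string means "no prefix"
    suffixes = [""]  # the empty string means "no suffix"
    for flag in flags:
        kind, text = AFFIX_RULES[flag]
        if kind == "prefix":
            prefixes.append(text)
        else:
            suffixes.append(text)
    # Assumption: every prefix can be combined with every suffix.
    return {p + word + s for p in prefixes for s in suffixes}


def is_correct(candidate):
    """True if some dictionary entry expands to the candidate word."""
    return any(candidate in expand(w, f) for w, f in DICTIONARY.items())


print(sorted(expand("MKTB", DICTIONARY["MKTB"])))
print(is_correct("ALKTAB"), is_correct("KTABAT"))
```

The point of the design is that the word list stays short: one entry
plus a few flags stands in for every affixed form of that word.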

So, as you can see, creating an Arabic spellchecker is mostly a matter
of populating 2 files and plugging them into OOo, or using Myspell/Aspell
standalone.

The affix file needs to be populated by people with knowledge of Arabic
grammar, or people who have references. Not much knowledge is needed,
just knowledge of the general structure of regular verbs, irregular
verbs, nouns, etc.
What we need here is a formal categorization of all Arabic words (verbs,
nouns, and anything else) so we can express these categories in AFFIX
format. When all categories are defined, work on the dictionary can
begin. If certain categories need changes in *spell, we can submit bugs,
patches, or whatever is necessary. The only augmentation I can see being
necessary is INFIX, i.e. insertion of a character into the word, like
KTB -> KATB.
As a starting point, I suggest elzubeir's "Arabic Grammar Rules" [1].
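
The infix idea mentioned above (KTB -> KATB) can be sketched in Python
as a rule that inserts text at a fixed position inside the word. This
only illustrates the proposed augmentation; the rule shape and the name
`apply_infix` are hypothetical and do not come from Myspell or Aspell.

```python
def apply_infix(word, position, text):
    """Insert `text` into `word` before the character at `position`."""
    # Position 0 would be a prefix and len(word) a suffix, so an infix
    # must fall strictly inside the word.
    if not 0 < position < len(word):
        raise ValueError("an infix must go strictly inside the word")
    return word[:position] + text + word[position:]


print(apply_infix("KTB", 1, "A"))  # KATB
```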

Ammar, please advise on the AGR document and what it's missing, as this
is the next move. I tried to classify words such as "إستغفر, شاور"
(ESTAGHFAR, SHAWAR) with it, but didn't know where they fit.


The dictionary file is going to be a big file, and probably needs some
automation/scripting and human verification.

I wrote a small document describing the affix file and the dictionary
file here:
http://khalifa.ws/files/public/arabic-dictionary.txt

Any help with the AFFIX file or the DICTIONARY file is welcome. We need
grammar expertise. Any feedback at all :)

NOTE: This is a completely different approach from duali's, so I'm not
sure duali's cvs is a good place for it.

[1] Elzubeir's Arabic grammar rules
    http://cvs.arabeyes.org/viewcvs/projects/duali/doc/arabic-grammar

--
Salam,
Ahmad Khalifa