
Re: Arabic spellchecker



Abdalla Alothman wrote:
I read somewhere about someone who made a very clever move: He
[...]
I would rather spend a week or two developing such tools rather than
type large amounts of data by hand.

I think duali's dataset was collected that way, by crawling Arabic news sites like http://www.ahram.org.eg/

This is helpful if you want to collect a fat wordlist, one that
describes the verb 'كتب' (KTB) and all its derivatives in 15 entries.
See http://www.khalifa.ws/files/public/arabic-dictionary.txt

You can get such a dictionary in a few hours. Just parse duali's dataset
and generate a wordlist.

But such a dictionary would not be very efficient, given that a verb such as 'كتب' (KTB) can be written as a single entry with special flags.

This is where the difficulty lies: defining the AFFIX rules and
writing a *flagged* wordlist.
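
To make the fat-versus-flagged difference concrete, here is a rough
Python sketch of how one flagged entry could expand into several
surface forms. The flag letters 'S' and 'P', the rule table, and the
'stem/FLAGS' notation are all made up for illustration; they are not
our actual AFFIX rules, and real Arabic rules need conditions that
this toy ignores.

```python
# Toy affix table (hypothetical, for illustration only):
# each flag maps to a list of (kind, affix) pairs.
AFFIX_RULES = {
    "S": [("suffix", "ت"), ("suffix", "نا"), ("suffix", "وا")],
    "P": [("prefix", "ي"), ("prefix", "ت"), ("prefix", "ن")],
}


def expand(entry):
    """Expand a flagged entry such as 'كتب/SP' into its surface forms.

    The part before '/' is the stem; each character after '/' is a flag
    looked up in AFFIX_RULES. An entry without '/' expands to itself.
    """
    stem, _, flags = entry.partition("/")
    forms = [stem]
    for flag in flags:
        for kind, affix in AFFIX_RULES.get(flag, []):
            if kind == "prefix":
                forms.append(affix + stem)
            else:
                forms.append(stem + affix)
    return forms


if __name__ == "__main__":
    # One flagged line stands in for seven fat-wordlist lines.
    for form in expand("كتب/SP"):
        print(form)
```

The point is only the ratio: the flagged list stores one line per stem
plus a small shared rule table, while the fat list stores every form.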

Right now, we are in the 'Define Arabic as AFFIX rules' phase. Next we
will be in the 'populate the flagged dictionary list' phase.

If all else fails, or this takes too long, we will fall back to the fat
wordlist option, which would then require a small *PERL* script to parse
duali's space-delimited 'stems' file.
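
For what it's worth, the fallback script really would be small. Below
is a rough sketch in Python rather than Perl. I am assuming one entry
per line with whitespace-separated fields and the word in the first
column; that layout and the 'column' parameter are guesses, not taken
from the actual file, so adjust them once you look at the data.

```python
def stems_to_wordlist(lines, column=0):
    """Collect the unique words found in one column of space-delimited lines.

    'column' is an assumption about where the word sits in each line;
    blank lines and lines that are too short are skipped.
    """
    words = set()
    for line in lines:
        fields = line.split()
        if len(fields) > column:
            words.add(fields[column])
    return sorted(words)


if __name__ == "__main__":
    import sys

    # Usage: python stems_to_wordlist.py < stems > wordlist.txt
    for word in stems_to_wordlist(sys.stdin):
        print(word)
```

Reading from stdin keeps the file name out of the script, and sorting
the set gives a deduplicated list that is easy to diff between runs.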

This is just a simple idea that I have never tried. I hope it's helpful.

It would actually work :) but we're going for the better solution. We need this to be as efficient as possible, since it will (hopefully) be part of OO.o someday.

--
Salam,
Ahmad Khalifa