Re: Arabic spellchecker
- To: abdalla at pheye dot net, Development Discussions <developer at arabeyes dot org>
- Subject: Re: Arabic spellchecker
- From: Ahmad Khalifa <ahmad at khalifa dot ws>
- Date: Tue, 15 Nov 2005 19:55:13 +0200
- User-agent: Mozilla Thunderbird 1.0.6 (Windows/20050716)
Abdalla Alothman wrote:
I read somewhere about someone who did a very clever move: He
[...]
I would rather spend a week or two developing such tools rather than
type large amounts of data by hand.
I think duali's dataset was collected that way, by crawling Arabic news
sites like http://www.ahram.org.eg/
This is helpful if you want to collect a fat wordlist, one which
describes the verb 'كتب' (KTB) and all its derivatives in 15 entries.
See http://www.khalifa.ws/files/public/arabic-dictionary.txt
You can get such a dictionary in a few hours. Just parse duali's dataset
and generate a wordlist.
But such a dictionary would not be very efficient, given that a verb
such as 'كتب' (KTB) can be written as one entry with special flags.
This is where the difficulty lies: defining the AFFIX rules and
writing a *flagged* wordlist.
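To make the idea of one flagged entry concrete, here is a rough Python
sketch of how a single 'stem/FLAGS' line could expand into several
surface forms. The flag letters (Y, T, W), the rule table and the
expand() helper are made up for illustration; they are not the AFFIX
rules being defined for Arabic, and real rules would need conditions
and many more cases.

```python
# Hypothetical affix table: flag letter -> (kind, affix).
# These three rules are illustrative only, not the project's real rules.
AFFIX_RULES = {
    "Y": ("prefix", "ي"),   # كتب -> يكتب
    "T": ("suffix", "ت"),   # كتب -> كتبت
    "W": ("suffix", "وا"),  # كتب -> كتبوا
}


def expand(entry):
    """Expand a 'stem/FLAGS' entry into the list of forms it covers."""
    stem, _, flags = entry.partition("/")
    forms = [stem]  # the bare stem is always a valid word here
    for flag in flags:
        kind, affix = AFFIX_RULES[flag]
        if kind == "prefix":
            forms.append(affix + stem)
        else:
            forms.append(stem + affix)
    return forms


if __name__ == "__main__":
    # One flagged line stands in for four entries of a fat wordlist.
    for form in expand("كتب/YTW"):
        print(form)
```

The fat wordlist stores every one of those forms as its own line; the
flagged list stores only 'كتب/YTW' plus the shared rule table.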
Right now, we are in the 'Define Arabic as AFFIX rules' phase. Next we
will be in the 'populate the flagged dictionary list' phase.
If all else fails, or this takes too long, we will fall back to the fat
wordlist option, which would then require a small *PERL* script to parse
duali's space-delimited 'stems' file.
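The fallback script really would be small. Here is a sketch of it,
written in Python rather than Perl, and assuming (I have not checked
the actual layout of the 'stems' file) that each line holds
whitespace-separated fields with the word in the first one. The
build_wordlist() name and the column parameter are placeholders.

```python
def build_wordlist(lines, column=0):
    """Collect the unique words found in one column of space-delimited lines.

    Assumes the word of interest sits in field number `column` (0-based);
    lines that are blank or too short are skipped.
    """
    words = set()
    for line in lines:
        fields = line.split()  # splits on any run of whitespace
        if len(fields) > column:
            words.add(fields[column])
    return sorted(words)


if __name__ == "__main__":
    import sys

    # Usage: python wordlist.py < stems > wordlist.txt
    for word in build_wordlist(sys.stdin):
        print(word)
```

If the word turns out to live in another field, only the column
argument changes.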
This is just a simple idea that I have never tried. I hope it's helpful.
It would actually work :) but we're going for the better solution. We
need this to be as efficient as possible, since it will (hopefully) be
part of OO.o someday.
--
Salam,
Ahmad Khalifa