Re: Arabic spellchecker
- To: abdalla at pheye dot net, Development Discussions <developer at arabeyes dot org>
- Subject: Re: Arabic spellchecker
- From: Ahmad Khalifa <ahmad at khalifa dot ws>
- Date: Tue, 15 Nov 2005 21:20:20 +0200
- User-agent: Mozilla Thunderbird 1.0.6 (Windows/20050716)
Abdalla Alothman wrote:
Asalamu alaikum.
Salam,
I did something exactly the same way because it was feasible. ;)
I agree the approach is far from being organized.
You mean you already have such a wordlist ? I would be interested
in taking a look at it, if you don't mind. I would like to see how it
performs in OO.o.
This is where its difficulty lies: defining the AFFIX rules and
writing a *flagged* wordlist.
This is a real problem.
If:
رءى
is the root for:
أريناك
chances of finding a pragmatic way, or a decent pattern, could be slim. Not
to mention that the AFFIX rules would be useless, in my humble opinion (don't let me
put you down).
But consider AFFIX rules augmented with INFIX ?! :)
Not just PREfix, and SUFfix, but also INfix, which is insertion in the
middle by means of index. Of course, the INFIX approach would be costly to
adapt, as we'd have to submit patches to Aspell/Myspell and have INFIX
widely accepted.
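As a rough illustration of what "insertion in the middle by means of
index" could look like, here is a minimal Python sketch. The function
name apply_infix and the transliterated consonant skeletons are
assumptions made up for this example; nothing here is Aspell or
Myspell syntax, and neither tool is claimed to support such a rule.

```python
def apply_infix(stem: str, index: int, insert: str) -> str:
    """Insert `insert` into `stem` before position `index` (0-based).

    Hypothetical helper: an INFIX rule is modelled here as nothing more
    than an (index, string) pair applied to a root.
    """
    if not 0 <= index <= len(stem):
        raise ValueError("index lies outside the stem")
    return stem[:index] + insert + stem[index:]


# Transliterated written skeletons (short vowels are not spelled out).
kitaab = apply_infix("KTB", 2, "A")    # alif between T and B
maykana = apply_infix("MKN", 1, "Y")   # yaa between M and K

print(kitaab, maykana)
```

A real rule would also need a condition on the stem, the same way PREFIX
and SUFFIX rules carry one, so that the insertion applies only to roots
that actually follow the pattern.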
For fun, consider modern Arabic terms -- one that I can't forget was "maykanat"
(automating). The root is MKN (e.g., wallatheena inn makkannaahum fil ardh...).
Problem is that the yaa comes exactly in the middle of the root. Same goes for
kitaab, the alif comes in the middle of the root. If you could solve such cases,
I would be very much interested to see your work.
The way I see it, we have two options.
1- Add INFIX to the AFFIX rules. That way you can describe KETAB by
flagging the root KTB.
2- Add KETAB as an entry of its own beside KTB. That way you can combine
KETAB easily with the 'AL' prefix rule, PLUS you still get only one
entry for the 15 entries of KTB.
I am in favour of the second approach. It's faster to adapt, does not
cost much, and would make it easier to define rules for NOUNS.
Its only downside is that for most root verbs that can be derived to
nouns, you get 2 or 3 entries. 1 for the verb and its derivatives, 1 for
the noun KETAB, and one for the MAKTAB noun.
I think 3 entries per root beats 17 entries, no ?
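To make the second option concrete, here is a small Python sketch of a
flagged wordlist expanded by prefix and suffix rules only. The flag
letters, the three verb endings, and the table layout are assumptions
invented for the example; this is not the Aspell/Myspell affix file
format, and the endings are not a real conjugation table.

```python
# Hypothetical affix tables: flag letter -> affix strings.
PREFIXES = {"A": ["AL"]}               # the 'AL' prefix rule for nouns
SUFFIXES = {"V": ["T", "TM", "NA"]}    # a few illustrative verb endings

# Option 2: the nouns get their own entries beside the root verb.
WORDLIST = {
    "KTB": "V",       # verb root, expanded by the verb rules
    "KETAB": "A",     # noun, listed separately
    "MAKTAB": "A",    # noun, listed separately
}


def expand(word: str, flags: str) -> set:
    """Return the word plus every form its flags generate."""
    forms = {word}
    for flag in flags:
        forms |= {prefix + word for prefix in PREFIXES.get(flag, [])}
        forms |= {word + suffix for suffix in SUFFIXES.get(flag, [])}
    return forms


def accepted_forms(wordlist: dict) -> set:
    """Union of the expansions of every flagged entry."""
    accepted = set()
    for word, flags in wordlist.items():
        accepted |= expand(word, flags)
    return accepted


forms = accepted_forms(WORDLIST)
print(len(WORDLIST), "entries cover", len(forms), "surface forms")
```

The point of the sketch is only the bookkeeping: the wordlist stays at
three entries for the root, while the rules generate the remaining forms.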
Right now, Ammar is working on elzubeir's "Arabic Grammar Rules"
document,
http://cvs.arabeyes.org/viewcvs/projects/duali/doc/arabic-grammar
I think it's the key to developing all the AFFIX rules, as we need to
formally categorize ALL the Arabic language words to be able to write
the AFFIX rules.
When the document is finished, we can better estimate the need for INFIX.
Please let me know what you think of the two approaches above.
I wish you good luck insha-Allah.
Thank you.
--
Salam,
Ahmad Khalifa