Duali - Creating a Dictionary
- To: developer at arabeyes dot org
- Subject: Duali - Creating a Dictionary
- From: Mohammed Elzubeir <elzubeir at arabeyes dot org>
- Date: Sat, 24 Aug 2002 21:58:34 -0500
- User-agent: Mutt/1.3.28i
Salam,
I have posted on this subject before, but this post is more detailed
and hopefully explains the process better. What I need is for people
to test it and give me some feedback.
Arabic grammar is known to be rule-based. This is what the Duali
project (http://www.arabeyes.org/project.php?proj=Duali) aims
to exploit.
In order to produce a compact dictionary, we need to put those rules
to the test. Unfortunately, this has not been as easy as it sounds
in theory.
The majority of Arabic words are derived from 3-5 letter roots. Please
note that ALEF below covers all kinds and forms of that letter. So, let
us break the rules down in easy-to-understand terms:
I. 3 letter words:
Those are already in root form, so nothing needs to be done to them.
II. 4 letter words:
There are two main rules that we can apply to 4 letter words, which
are:
1. If the 3rd letter is ALEF, YEH or WAW
For example:
THEH, QAF, YEH, LAM [thaqeel] ==> THEH, QAF, LAM
2. If the 2nd letter is ALEF, WAW, TAH, DAL, or YEH
For example:
THEH, ALEF, QAF, BEH [thaqib] ==> THEH, QAF, BEH
III. 5 letter words:
There are several rules that govern 5 letter words, which are:
1. If the 3rd and 4th letters are both ALEF
2. If the 2nd letter is TEH or YEH
=AND=
if the 4th letter is ALEF
3. If the 2nd and 3rd letters are WAW and ALEF
4. If the 2nd letter is ALEF
=AND=
if the 4th letter is YEH or WAW
5. If the 4th letter is ALEF, YEH or WAW
6. If the 3rd letter is ALEF or YEH
IV. 6 letter words:
Although less common, they have 2 rules:
1. If the 2nd and 3rd letters are WAW and ALEF
=AND=
if the 5th letter is YEH
2. If the 3rd letter is ALEF and the 5th letter is YEH
If any of the above conditions is met, the assumption is that once we
drop the letters matched by that condition, what remains is the root.
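
For anyone who wants to experiment with the rules without reading the
script first, here is a rough Python sketch of them as a lookup table.
It is not the gendic.py code. The names RULES and candidate_root are
made up for illustration, and two points are assumptions: the rules are
tried in the order listed with the first match winning, and "WAW and
ALEF" is read as WAW in the earlier position and ALEF in the later one.
Letters are spelled with the names used in this message, and ALEF
stands for every form of alef.

```python
# Hypothetical sketch of the drop-letter rules; not the actual gendic.py code.
LONG = {"ALEF", "YEH", "WAW"}

# Each rule: (word length, {1-based position: letters allowed there}).
# Assumption: rules are tried in the listed order and the first match wins.
RULES = [
    (4, {3: LONG}),
    (4, {2: {"ALEF", "WAW", "TAH", "DAL", "YEH"}}),
    (5, {3: {"ALEF"}, 4: {"ALEF"}}),
    (5, {2: {"TEH", "YEH"}, 4: {"ALEF"}}),
    (5, {2: {"WAW"}, 3: {"ALEF"}}),
    (5, {2: {"ALEF"}, 4: {"YEH", "WAW"}}),
    (5, {4: LONG}),
    (5, {3: {"ALEF", "YEH"}}),
    (6, {2: {"WAW"}, 3: {"ALEF"}, 5: {"YEH"}}),
    (6, {3: {"ALEF"}, 5: {"YEH"}}),
]


def candidate_root(word):
    """Drop the letters matched by the first applicable rule.

    `word` is a sequence of letter names. Returns a tuple of the
    remaining letters, or None when no rule applies.
    """
    letters = tuple(word)
    if len(letters) == 3:
        return letters  # 3 letter words are taken as roots already
    for length, conditions in RULES:
        if len(letters) != length:
            continue
        if all(letters[pos - 1] in allowed
               for pos, allowed in conditions.items()):
            return tuple(letter
                         for pos, letter in enumerate(letters, start=1)
                         if pos not in conditions)
    return None


if __name__ == "__main__":
    samples = [
        ("QAF", "LAM", "MEEM"),          # [qalam]
        ("THEH", "QAF", "YEH", "LAM"),   # [thaqeel]
        ("THEH", "ALEF", "QAF", "BEH"),  # [thaqib]
        # Illustrative input, not one of the examples in this message:
        ("QAF", "ALEF", "MEEM", "WAW", "SEEN"),  # [qamoos]
    ]
    for sample in samples:
        print(", ".join(sample), "==>", candidate_root(sample))
```

Keeping the rules in a table makes it easy to reorder them or to switch
one off while testing, which matters because several of the 5 letter
rules can match the same word.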
Examples:
I. QAF, LAM, MEEM [qalam] ==> as is
II.
1. Correct:
THEH, QAF, YEH, LAM [thaqeel] ==> THEH, QAF, LAM
Incorrect:
ALEF, YEH, ALEF, MEEM [ayyam] ==> ALEF, YEH, MEEM
2. Correct:
THEH, ALEF, QAF, BEH [thaqib] ==> THEH, QAF, BEH
Incorrect:
ALEF, TAH, LAM, QAF [aTlaqa] ==> ALEF, LAM, QAF
After that, it gets more complex, and you may wish to check the
results yourself. I have no access to a dictionary in which to look up
the terms and verify whether they are in fact correct.
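
If someone does have a trusted list of roots, checking could be partly
automated. The sketch below is only an idea and is not part of Duali:
the function name review and both data sets are hypothetical, and the
sample data is limited to the examples in this message. Note that a
candidate missing from the list is only unverified, not proven wrong,
and a wrong candidate can still happen to coincide with a real root.

```python
# Hypothetical helper: split candidate roots into those found in a
# trusted root list and those that still need a manual look.

def review(candidates, known_roots):
    """Return (accepted, suspect) dictionaries.

    `candidates` maps a transliterated word to its candidate root,
    given as a tuple of letter names. `known_roots` is a set of such
    tuples taken from a reference you trust.
    """
    accepted, suspect = {}, {}
    for word, root in candidates.items():
        if root in known_roots:
            accepted[word] = root
        else:
            suspect[word] = root
    return accepted, suspect


# Sample data: only the roots this message marks as correct.
KNOWN_ROOTS = {
    ("QAF", "LAM", "MEEM"),
    ("THEH", "QAF", "LAM"),
    ("THEH", "QAF", "BEH"),
}

# Candidate roots as produced by the rules in the examples above.
CANDIDATES = {
    "qalam": ("QAF", "LAM", "MEEM"),
    "thaqeel": ("THEH", "QAF", "LAM"),
    "thaqib": ("THEH", "QAF", "BEH"),
    "ayyam": ("ALEF", "YEH", "MEEM"),
    "aTlaqa": ("ALEF", "LAM", "QAF"),
}

if __name__ == "__main__":
    accepted, suspect = review(CANDIDATES, KNOWN_ROOTS)
    print("accepted:", sorted(accepted))
    print("needs review:", sorted(suspect))
```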
NOTE:
-----
You can verify the results yourself by running the 'gendic.py' script.
For example:
$ ./gendic.py -f dict_wordlist -a | grep 'l=5:d=4'
to get the list of words that are of length 5 and derivative 4. The script
should run on any platform, but if you are under a *n*x system 'mlterm'
comes to the rescue ;)
Please do ignore the words that have an obvious prefix and/or suffix,
since I have yet to get satisfactory results from stemming.
To get the script, using the anoncvs account (if you don't have a cvs
account):
$ cvs -d:pserver:anoncvs at arabeyes dot org:/home/CVS login
('anoncvs' for password)
$ cvs -d:pserver:anoncvs at arabeyes dot org:/home/CVS co duali
$ cd duali/src/tools
$ ./gendic.py -f dict_wordlist -a > dict_duali
--
-------------------------------------------------------
| Mohammed Elzubeir | Visit us at: |
| | http://www.arabeyes.org/ |
| Arabeyes Project | Homepage: |
| Unix the 'right' way | http://fakkir.net/~elzubeir/|
-------------------------------------------------------
---
Was I helpful? Let others know:
http://svcs.affero.net/rm.php?r=elzubeir