
Duali Status



Salam,

This is kind of a status report on Duali.

The amount of time dedicated to Duali has been anything but consistent,
and so I will try to sum up the difficulties I have been facing in the
hopes that someone may be able to enlighten me ;)

I. Verifying assumptions

My goal is to have Duali rely on a number of assumptions about the Arabic
language, but I cannot verify whether those assumptions are accurate. I
simply don't have the Arabic grammar background (nor do I remember much
of what I learned).

For example, I assume that any 3-letter word has 14 derivatives. That
holds only if stripping the prefix and suffix from the word didn't
destroy the word itself.

These are some of the problems, and I am finding more as I refine the
dictionary. So in general, I really could use the help of someone who is
proficient in Arabic grammar, so that we can work on laying out certain
rules.

II. The dictionary

I was able to produce a dictionary of 3-letter root verb derivatives,
but noticed that a couple of terms shouldn't have been there. I used
the word list recently uploaded to CVS to generate the dictionary, which
produced 2217 words.

The format I want to follow for the dictionary would be something along
these lines:

TERM:LENGTH:DERIVATIVES

where LENGTH would take 4 bits and the other 12 bits would store the
possible derivative templates the term could fit. Each term length has a
different set of possible derivatives.

Of course, some words have no derivatives at all. For those, the
DERIVATIVES field will always be set to all 0's.
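
To make the format above concrete, here is a minimal Python sketch of
packing the 4-bit length and the 12 template bits into one 16-bit value.
This is only an illustration, not Duali's actual code: the function
names, the numbering of templates as bit positions 0..11, and writing
the DERIVATIVES field as a 12-character binary string are all my
assumptions.

```python
LENGTH_BITS = 4
TEMPLATE_BITS = 12
TEMPLATE_MASK = (1 << TEMPLATE_BITS) - 1  # low 12 bits: 0x0FFF


def pack_flags(length, templates):
    """Pack a term length (0..15) and template indices (0..11) into 16 bits.

    The bit layout (length in the high 4 bits) is an assumption.
    """
    if not 0 <= length < (1 << LENGTH_BITS):
        raise ValueError("length must fit in 4 bits")
    mask = 0
    for index in templates:
        if not 0 <= index < TEMPLATE_BITS:
            raise ValueError("template index must be in 0..11")
        mask |= 1 << index
    return (length << TEMPLATE_BITS) | mask


def unpack_flags(packed):
    """Split a 16-bit value back into (length, set of template indices)."""
    length = packed >> TEMPLATE_BITS
    mask = packed & TEMPLATE_MASK
    templates = {i for i in range(TEMPLATE_BITS) if mask & (1 << i)}
    return length, templates


def format_entry(term, templates=()):
    """Build a TERM:LENGTH:DERIVATIVES line; no templates gives all 0's."""
    packed = pack_flags(len(term), templates)
    length = packed >> TEMPLATE_BITS
    mask = packed & TEMPLATE_MASK
    return f"{term}:{length}:{mask:012b}"


def parse_entry(line):
    """Parse a TERM:LENGTH:DERIVATIVES line into (term, length, bitmask)."""
    term, length, bits = line.rstrip("\n").split(":")
    return term, int(length), int(bits, 2)
```

A word with no derivatives simply gets an empty template set, so its
last field comes out as twelve 0's.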

III. The process

Duali parses the text and grabs each word. It checks whether the word has
any prefix or suffix that can be stripped. Then, depending on the length
of the stripped word, it sees whether it fits any of the derivative
templates. If it does, the root of that derivative is looked up in the
dictionary. If the root is found, the word is correct. If not, Duali
compares the stripped word to the dictionary of words with no
derivatives. If that fails as well, it compares again, this time with the
full word (pre-analysis). If that too fails, the word is flagged as
misspelled.
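
The lookup order described above can be sketched in Python roughly as
follows. This is a rough outline under assumptions, not the real Duali
code: strip_affixes and candidate_roots are hypothetical helpers passed
in by the caller (the actual stripping and template rules are exactly
what I still need help with), and the two dictionaries are plain sets.

```python
def check_word(word, strip_affixes, candidate_roots, root_dict, plain_dict):
    """Return True if the word is accepted, False if it is flagged.

    strip_affixes(word)    -> word with prefix/suffix removed (hypothetical)
    candidate_roots(word)  -> roots for every derivative template the
                              stripped word fits (hypothetical; may be empty)
    root_dict              -> set of known roots
    plain_dict             -> set of words that have no derivatives
    """
    stripped = strip_affixes(word)

    # 1. If the stripped word fits a derivative template, look up its root.
    for root in candidate_roots(stripped):
        if root in root_dict:
            return True

    # 2. Otherwise try the stripped word among the no-derivative words.
    if stripped in plain_dict:
        return True

    # 3. Last resort: try the full, unstripped word.
    return word in plain_dict
```

Anything that falls through all three steps is reported as misspelled.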

So, if a word is tagged as incorrect and you know it is correct, what do
you do? We can add it to the dictionary, but the problem is, how accurate
will we be? Do we always assume that user-entered words have 0
derivatives?

In general, I found that without at least a proper dictionary in hand,
a lot of the simple, should-be-quick lookups are not possible.

Feedback is highly appreciated ;) If you are an Arabic expert, raise your
voice! I could really use some pointers.

later
-- 
-------------------------------------------------------
| Mohammed Elzubeir    | Visit us at:                 |
|                      |  http://www.arabeyes.org/    |
| Arabeyes Project     | Homepage:                    |
| Unix the 'right' way |  http://fakkir.net/~elzubeir/|
-------------------------------------------------------
---
Was I helpful? Let others know:
http://svcs.affero.net/rm.php?r=elzubeir
