Re: A (too huge) Arabic word-list (with prefixes) for spell-checkers



Hi Meor,

I am Dan, the one who released the Arabic wordlist. Nadav was explaining how one
can rebuild an Arabic spell-checker, following what we did for Hebrew.

> I've downloaded the file, and have some questions about the format.
> The data is stored in ascii (i think ) which uses latin character to
> represent arabic character. Please forgive my ignorance, but is there
> a standard mapping for this? Any reference?  

It was Tim Buckwalter who built the database (see http://www.qamus.org/).
What I released was a mere conversion, and I kept his files (almost) untouched.
He uses his own transliteration of Arabic to ASCII. You can see it at
http://www.qamus.org/transliteration.gif .
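
The transliteration is a one-to-one character table, so turning it into Unicode
takes only a few lines. The following is a minimal Python sketch, not part of
the released files. The symbol strings are transcribed from the commonly
published Buckwalter scheme for U+0621..U+063A and U+0640..U+0652, and the
names BW_TO_ARABIC and buckwalter_to_arabic are made up for illustration;
please check the table against the chart at the URL above before relying on it.

```python
# Buckwalter symbols for U+0621..U+063A (hamza through ghain), in code point order.
# Assumption: transcribed from the commonly published table, not from the released files.
BW_BLOCK_1 = "'|>&<}AbptvjHxd*rzs$SDTZEg"
# Symbols for U+0640..U+0652 (tatweel, feh through yeh, then the vowel marks).
BW_BLOCK_2 = "_fqklmnhwYyFNKaui~o"

# Build the lookup table by walking each Unicode range in step with its symbols.
BW_TO_ARABIC = {}
for offset, symbol in enumerate(BW_BLOCK_1):
    BW_TO_ARABIC[symbol] = chr(0x0621 + offset)
for offset, symbol in enumerate(BW_BLOCK_2):
    BW_TO_ARABIC[symbol] = chr(0x0640 + offset)


def buckwalter_to_arabic(text):
    """Map each Buckwalter symbol to its Arabic character; pass anything else through."""
    return "".join(BW_TO_ARABIC.get(ch, ch) for ch in text)
```

For example, buckwalter_to_arabic("ktAb") gives the four letters kaf, teh, alef,
beh. Characters outside the table (digits, spaces, punctuation) are left as they
are, which may or may not be what you want for the real files.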

> I do notice the there is a perl script that translate it to ISO format, but
> personally, I prefers UTF-8 encoding for the data. Most "modern" programs can
> handle UTF-8 .

I preferred working with an 8-bit encoding internally - it is more economical. If
you want to use UTF-8, you can run
	to_iso6.pl | iconv -f iso8859-6 -t utf8
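
If you would rather do the last step inside a script, here is a rough Python
equivalent of the iconv call, which also shows what "more economical" means in
practice. It is only a sketch: the function name iso6_to_utf8 and the sample
bytes are mine, and it assumes the input really is ISO-8859-6.

```python
def iso6_to_utf8(data):
    """Re-encode ISO-8859-6 bytes as UTF-8 bytes."""
    return data.decode("iso8859_6").encode("utf-8")


# Four Arabic letters (kaf, teh, alef, beh) as ISO-8859-6 bytes.
sample_iso = b"\xe3\xca\xc7\xc8"
sample_utf8 = iso6_to_utf8(sample_iso)

# Each Arabic letter is one byte in ISO-8859-6 and two bytes in UTF-8.
print(len(sample_iso), len(sample_utf8))  # prints: 4 8
```

So a list that consists almost entirely of Arabic letters roughly doubles in
size when stored as UTF-8.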

> One thing about the data the Mr Sameer mentioned, is the concern about
> Holy Quran words spelling. I think this one I can help a bit. I do

Indeed. Having words in antiquated spelling is bad. However, I don't see any way
to solve this problem without someone proficient in Arabic going over the 82,157
stems in the list and correcting them. It is possible!

> I don't know about aspell format, but I would like to add a few field
> into the generated data. Probably the final data will be stored in a
> database. The "must have" column for each words is it's root. If the
> data is stored in a database, then this column can be just the
> reference to the row which indicate it's root. So, when the users want
> to lookup those words, I can tell you straight away it's root. I think
> this is a big help.

I think that what you are describing is called a morphological analyzer, and such
software exists (http://www.nongnu.org/aramorph/), based on the same Buckwalter
database. I, on the other hand, am "thinking small", and trying to limit myself
to building a useful spell-checker.
 
> Why database? Well, I think it's the easiest for lookup. As you
> mentioned before, the generated dataset is huge, and some application
> have some problem loading it to memory. If we store it in database, we
> don't have to worry about that. Application just ask the database for
> it. Also, it is easy to generate a new word list from the database to
> suite other applications' need (aspell,  dict format etc). The
> advantage, as I mentioned before, we don't have to worry about storage
> and memory management, plus, we get the benefit of relational data,
> easier for cross referencing.

I disagree with you on that. The database already exists. It was written in
plaintext by Tim Buckwalter. I don't see how storing this database in SQL
would help; it is not as if there are millions of people trying to help
correct and extend the database :-(
 
> Or maybe what we really need is a seperate dictionary for the Quran?

Indeed, it would be useful to keep Quran words in a separate file, and not just
replace them with the modern spelling. Someone has to do it. Hopefully, someone
(maybe from this list) will.

-- 
Dan Kenigsberg        http://www.cs.technion.ac.il/~danken        ICQ 162180901