
Re: A (too huge) Arabic word-list (with prefixes) for spell-checkers



Dan,
Thanks for the clarification. Just curious, how big is the generated
wordlist file? I can't generate it yet because I'm working under XP,
so if it is not that big, I would rather download it instead.
Anyway, I think it might be a good idea to create a GUI that makes it
easier to go through all the stem words and the generated words one by
one for proofreading. It would show the stem word, the morphological
and grammatical categories applied to it, and the list of generated
words. I think that would be easier to proofread. Or how about having
someone compare it with Lane's Arabic lexicon (I don't have a copy of
this)? Would that cause legal issues later? We would just be
comparing, not copying.

Regards.


On 5/18/06, Dan Kenigsberg <danken at cs dot technion dot ac dot il> wrote:
Hi Meor,

It was me, Dan, who released the Arabic wordlist. Nadav was explaining how one
can rebuild an Arabic spell-checker, following what we did for Hebrew.

> I've downloaded the file, and have some questions about the format.
> The data is stored in ASCII (I think), which uses Latin characters to
> represent Arabic characters. Please forgive my ignorance, but is there
> a standard mapping for this? Any reference?

It was Tim Buckwalter who built the database (see http://www.qamus.org/).
What I released was a mere conversion, and I kept his files (almost) untouched.
He uses his own transliteration of Arabic into ASCII. You can see it at
http://www.qamus.org/transliteration.gif .
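
As a rough illustration of how such a transliteration maps back to Arabic
script, here is a minimal Python sketch. The table is a transcription of the
commonly published Buckwalter scheme, not something taken from the released
files, so please check it against the chart at the URL above before relying
on it. The names BW_TO_ARABIC and buckwalter_to_arabic are made up for this
example.

```python
# Buckwalter ASCII symbols listed in Unicode code-point order.
# ASSUMPTION: transcribed from the published scheme; verify against
# transliteration.gif before using it on real data.
_BLOCK_1 = "'|>&<}AbptvjHxd*rzs$SDTZEg"  # U+0621 .. U+063A
_BLOCK_2 = "_fqklmnhwYyFNKaui~o"         # U+0640 .. U+0652

BW_TO_ARABIC = {}
for offset, symbol in enumerate(_BLOCK_1):
    BW_TO_ARABIC[symbol] = chr(0x0621 + offset)
for offset, symbol in enumerate(_BLOCK_2):
    BW_TO_ARABIC[symbol] = chr(0x0640 + offset)
BW_TO_ARABIC["`"] = "\u0670"  # dagger alef
BW_TO_ARABIC["{"] = "\u0671"  # alef wasla


def buckwalter_to_arabic(text):
    """Map each transliteration symbol to Arabic; pass anything else through."""
    return "".join(BW_TO_ARABIC.get(ch, ch) for ch in text)
```

With this table, buckwalter_to_arabic("ktAb") yields the four letters kaf,
teh, alef, beh. Characters outside the table (spaces, digits) are left as
they are.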

> I did notice that there is a Perl script that translates it to ISO format, but
> personally, I prefer UTF-8 encoding for the data. Most "modern" programs can
> handle UTF-8.

I preferred working with an 8-bit encoding internally - it is more economical. If
you want to use UTF-8, you can do
        to_iso6.pl | iconv -f iso8859-6 -t utf8
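
If iconv is not at hand (under XP, for instance), the same re-encoding step
can be sketched in Python. This assumes the output of to_iso6.pl is plain
ISO-8859-6 bytes, as the pipe above implies; the function name is invented
for the example.

```python
def iso6_to_utf8(data: bytes) -> bytes:
    """Re-encode ISO-8859-6 (8-bit Arabic) bytes as UTF-8."""
    # "iso8859_6" is the name of Python's built-in codec for ISO-8859-6.
    return data.decode("iso8859_6").encode("utf-8")


if __name__ == "__main__":
    import sys
    # Read ISO-8859-6 from stdin, write UTF-8 to stdout, like the iconv step.
    sys.stdout.buffer.write(iso6_to_utf8(sys.stdin.buffer.read()))
```

Saved under some name of your choosing, the script would take the place of
the iconv part of the pipe.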

> One thing about the data that Mr Sameer mentioned is the concern about
> the spelling of Holy Quran words. I think I can help a bit with this one. I do

Indeed. Having words in antiquated spelling is bad. However, I don't see any way
to solve this problem without someone proficient in Arabic going over the 82,157
stems in the list and correcting them. It is possible!

> I don't know about the aspell format, but I would like to add a few fields
> to the generated data. Probably the final data will be stored in a
> database. The "must have" column for each word is its root. If the
> data is stored in a database, then this column can be just a
> reference to the row which holds its root. So, when users want
> to look up those words, I can tell them straight away what the root is. I think
> this is a big help.

I think that what you are describing is called a morphological analyzer, and such
software exists (http://www.nongnu.org/aramorph/), based on the same Buckwalter
database. I, on the other hand, am "thinking small", and trying to limit myself
to building a useful spell-checker.
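
To make the quoted proposal concrete, the "root as a reference to another
row" idea could look roughly like the following SQLite sketch in Python. The
table name, column names and the three sample rows (in Buckwalter
transliteration) are all made up for illustration; nothing like this exists
in the released files.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory table where each word may point at its root's row."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE words ("
        " id INTEGER PRIMARY KEY,"
        " form TEXT NOT NULL UNIQUE,"
        " root_id INTEGER REFERENCES words(id))"  # NULL for a root itself
    )
    # Hypothetical sample data: one root and two words derived from it.
    conn.execute("INSERT INTO words (id, form, root_id) VALUES (1, 'ktb', NULL)")
    conn.executemany(
        "INSERT INTO words (form, root_id) VALUES (?, ?)",
        [("kAtb", 1), ("mktwb", 1)],
    )
    return conn


def root_of(conn, form):
    """Return the root of a word, or None if the word or its root is unknown."""
    row = conn.execute(
        "SELECT r.form FROM words AS w"
        " JOIN words AS r ON r.id = w.root_id"
        " WHERE w.form = ?",
        (form,),
    ).fetchone()
    return row[0] if row else None
```

A lookup such as root_of(conn, "mktwb") follows the reference with a single
self-join. Whether this is worth the trouble for a spell-checker is a
separate question.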

> Why a database? Well, I think it's the easiest for lookup. As you
> mentioned before, the generated dataset is huge, and some applications
> have problems loading it into memory. If we store it in a database, we
> don't have to worry about that. The application just asks the database for
> it. Also, it is easy to generate a new word list from the database to
> suit other applications' needs (aspell, dict format, etc.). The
> advantage, as I mentioned before, is that we don't have to worry about storage
> and memory management; plus, we get the benefit of relational data, which is
> easier for cross-referencing.

I disagree with you on that. The database already exists. It was written in
plaintext by Tim Buckwalter. I don't see how storing this database in SQL
would help; it is not as if there are millions of people trying to help
correct and extend the database :-(

> Or maybe what we really need is a separate dictionary for the Quran?

Indeed, it would be useful to keep Quran words in a separate file, and not just
replace them with the modern spelling. Someone has to do it. Hopefully, someone
(maybe from this list) will.

--
Dan Kenigsberg        http://www.cs.technion.ac.il/~danken        ICQ 162180901


_______________________________________________ Developer mailing list Developer at arabeyes dot org http://lists.arabeyes.org/mailman/listinfo/developer