Re: A (too huge) Arabic word-list (with prefixes) for spell-checkers
- To: "Development Discussions" <developer at arabeyes dot org>
- Subject: Re: A (too huge) Arabic word-list (with prefixes) for spell-checkers
- From: "Meor Ridzuan Meor Yahaya" <meor dot ridzuan at gmail dot com>
- Date: Thu, 18 May 2006 15:16:05 +0800
Nadav,
I think the approach is good (based on my limited knowledge of
Arabic), although I'm not really qualified to judge since I'm not an
Arabic speaker.
I've downloaded the file and have some questions about the format.
The data seems to be stored in ASCII, using Latin characters to
represent Arabic characters. Please forgive my ignorance, but is there
a standard mapping for this? Any reference? I do notice there is
a Perl script that converts it to an ISO encoding, but personally I
prefer UTF-8 encoding for the data. Most "modern" programs can handle
UTF-8.
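For example, if the mapping turns out to be the Buckwalter
transliteration, a small script along these lines could convert the
list to UTF-8. This is only a sketch: the letter subset below is
partial and would need to be checked against the actual table used in
the word list.

# Sketch: convert a Buckwalter-style transliterated word list to UTF-8.
# Assumption: the file really uses the Buckwalter mapping; only a subset
# of the letters is shown here -- the full table must be verified.
import sys

BUCKWALTER = {
    "'": "\u0621",  # hamza
    "A": "\u0627",  # alef
    "b": "\u0628",  # beh
    "t": "\u062A",  # teh
    "j": "\u062C",  # jeem
    "H": "\u062D",  # hah
    "d": "\u062F",  # dal
    "r": "\u0631",  # reh
    "s": "\u0633",  # seen
    "$": "\u0634",  # sheen
    "E": "\u0639",  # ain
    "f": "\u0641",  # feh
    "q": "\u0642",  # qaf
    "k": "\u0643",  # kaf
    "l": "\u0644",  # lam
    "m": "\u0645",  # meem
    "n": "\u0646",  # noon
    "h": "\u0647",  # heh
    "w": "\u0648",  # waw
    "y": "\u064A",  # yeh
}

def to_utf8(word):
    """Map each transliteration character to its Arabic letter,
    leaving unknown characters (and punctuation) untouched."""
    return "".join(BUCKWALTER.get(ch, ch) for ch in word)

if __name__ == "__main__":
    for line in sys.stdin:
        sys.stdout.write(to_utf8(line))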
One thing about the data that Mr Sameer mentioned is the concern about
the spelling of words from the Holy Quran. I think I can help a bit
with this one. I have the complete Quran text, with all the marks, in
text format. Here is what I have in mind.
I don't know about the aspell format, but I would like to add a few
fields to the generated data. The final data will probably be stored
in a database. The "must have" column for each word is its root. If
the data is stored in a database, this column can simply be a
reference to the row holding the root. So, when users want to look up
a word, its root can be given straight away. I think this is a big
help.
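Just to sketch what I mean (the table and column names here are made
up for illustration, nothing is decided yet): every word row would
point to the row holding its root, so the lookup is a single join.

# Sketch of the schema I have in mind: every word row references its root.
import sqlite3

db = sqlite3.connect("arabic_words.db")
db.executescript("""
CREATE TABLE IF NOT EXISTS roots (
    id   INTEGER PRIMARY KEY,
    root TEXT UNIQUE NOT NULL          -- e.g. a three-letter root
);
CREATE TABLE IF NOT EXISTS words (
    id      INTEGER PRIMARY KEY,
    word    TEXT UNIQUE NOT NULL,      -- the inflected surface form
    root_id INTEGER NOT NULL REFERENCES roots(id)
);
""")

def root_of(word):
    """Return the root of the given word, or None if the word is unknown."""
    row = db.execute(
        "SELECT roots.root FROM words JOIN roots ON words.root_id = roots.id "
        "WHERE words.word = ?", (word,)).fetchone()
    return row[0] if row else None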
Why a database? Well, I think it's the easiest for lookups. As you
mentioned before, the generated dataset is huge, and some applications
have problems loading it into memory. If we store it in a database, we
don't have to worry about that; applications just ask the database for
what they need. It is also easy to generate a new word list from the
database to suit other applications' needs (aspell, dict format, etc.),
as sketched below. The advantages, as I mentioned before: we don't have
to worry about storage and memory management, and we get the benefit of
relational data, which makes cross-referencing easier.
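Generating a flat list for aspell or dict would then just be a matter
of a query, for example something like this (again only a sketch, with
the same made-up names as above):

# Sketch: dump every word as a plain UTF-8 word list, one word per line,
# which other tools (aspell, dict, ...) can then pick up.
import sqlite3

db = sqlite3.connect("arabic_words.db")
with open("wordlist.txt", "w", encoding="utf-8") as out:
    for (word,) in db.execute("SELECT word FROM words ORDER BY word"):
        out.write(word + "\n")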
As I said before, I don't have much experience with this, but it seems
like this dataset is the one I've been looking for. As for the Quran
words, once the dataset is in the database I can look them up and, if
necessary, add the modifications.
Or maybe what we really need is a separate dictionary for the Quran?
Regards.
On 5/17/06, Nadav Har'El <nyh at math dot technion dot ac dot il> wrote:
On Tue, May 16, 2006, Mohammed Sameer wrote about "Re: A (too huge) Arabic word-list (with prefixes) for spell-checkers":
>...
> The data set contains words from the Holy Quran. The words in the Holy Quran are sometimes
> spelled in a different way due to the script used to write the Quran.
>
> Those words are incorrect outside the Quran context.
>...
I have looked at Dan's example on http://ivrix.org.il/projects/arabic/,
and it seems that spell-checking a modern Arabic text (that he took from
Wikipedia) worked quite well. Could it be that while that word list is not
100% correct, it still contains a substantial amount of correct data, and,
say, 90% of the words it lists are spelled correctly, and most of the remaining
words can easily be fixed by an Arabic writer?
The reason I'm asking this is because, like I said, 90% of the work that
went into Hspell was building the lexicon. We spent a very large amount of
time sifting through texts, looking for spelling errors which are in fact
correct words, to add to the lexicon. This effort became harder and harder as
our lexicon grew, and I estimate that now it takes me 10 times the effort to
find a new word to add than it took me when I was adding the first 1000 words.
The reason we had to do this slow word-finding process was that it is illegal
to just open a Hebrew dictionary, and start copying the words one by one,
so we had to find other ways to come up with missing words (we obviously
couldn't just "recall" words from memory, and we had no free Hebrew lexicon).
With Tim Buckwalter's list, you have a much better start than we did: you
can actually go over his list, word by word, and remove, or better yet fix,
any misspelled word. It should be easier, I think, than starting from scratch.
Of course, you still need an inflection program in addition to the lexicon.
If you think that Tim Buckwalter's inflection program creates wrong
inflections, you can write a different one.
--
Nadav Har'El | Tuesday, May 16 2006, 19 Iyyar 5766
nyh at math dot technion dot ac dot il |-----------------------------------------
Phone +972-523-790466, ICQ 13349191 |Hindsight is always 20:20
http://nadav.harel.org.il |
_______________________________________________
Developer mailing list
Developer at arabeyes dot org
http://lists.arabeyes.org/mailman/listinfo/developer