
Re: [OT?] Re: A (too huge) Arabic word-list (with prefixes) for spell-checkers



On Tue, May 16, 2006 at 03:27:02PM +0300, Nadav Har'El wrote:
> On Tue, May 16, 2006, Jonathan Ben Avraham wrote about "[OT?] Re: A (too huge) Arabic word-list (with prefixes) for spell-checkers":
> > Hi Dan, Mohammed,
> > What would it take to port Hspell to Arabic? That is, to make the code 
> > adaptable to any of the Semitic languages? Would it be possible to have 
> > one spellchecker for both Hebrew and Arabic?
> > 
> >  - yba
> 
> I'm CC'ing this to Arabeyes, because I don't think Mohammed reads the Ivrix
> list.
> 
Thanks for CC'ing.
I sent the subscription email a few minutes ago, but it looks like majordomo
hasn't answered me yet ;-)

PS. You might like to have a look at these two short articles I wrote. They
summarize my 3-4 months of experience with the whole thing:

http://www.foolab.org/node/1482 <-- This is the important one, I guess ;-)
http://www.foolab.org/node/1439

> Like everything(?), Hspell is basically composed of three parts:
>  1. The inflection code: taking base words and a few hints for each,
>     and producing a list of all legal inflections.
>  2. The lexicon: the list of these base words and hints.
>  3. The spell-checking code, taking the list of all legal word forms and
>     marking words as correct or incorrect.
> 
> Part #3 is easily useful also for Arabic, but this is not a terribly big
> win, because aspell, for example, also replaces this part of Hspell (e.g.,
> see our dictionaries for aspell and OpenOffice).
>
> Part #1 is only useful for Arabic as an idea, a technique that we proved
> useful and very easy to implement (a prototype useful for about 50% of the
> Hebrew language was done in a few days). I'll gladly explain more about this
> idea to anyone interested in listening. Obviously, the specific inflection
> rules we wrote will not apply directly to Arabic.

I'd really love to know, if you have time.

I'm not sure how Hebrew words are formed, but in Arabic we have roots.
You then derive the various words from those roots by adding letters at the
beginning, at the end, or in the middle. Sometimes you change one or more
letters. How are the Hebrew inflections formed?
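
To make the root idea concrete, here is a minimal sketch of deriving words
from a root. Everything in it is an assumption for illustration only: the
Latin transliteration, the digit-slot pattern notation, and the tiny pattern
and prefix lists are hypothetical, not real lexicon data and not how any
existing tool does it.

```python
# Hypothetical sketch: derive word forms from a three-consonant root.
# In a pattern, the digits 1, 2 and 3 are slots for the root consonants;
# every other character is copied as-is. Transliteration is simplified.

def apply_pattern(root, pattern):
    """Fill the digit slots of `pattern` with the consonants of `root`."""
    out = []
    for ch in pattern:
        if ch.isdigit():
            out.append(root[int(ch) - 1])  # slot "1" -> first consonant
        else:
            out.append(ch)
    return "".join(out)


def derive(root, patterns, prefixes=("",), suffixes=("",)):
    """Return every prefix + stem + suffix combination for one root."""
    forms = set()
    for pattern in patterns:
        stem = apply_pattern(root, pattern)
        for pre in prefixes:
            for suf in suffixes:
                forms.add(pre + stem + suf)
    return forms


# Illustrative data only (assumed, heavily simplified).
ROOT = ("k", "t", "b")
PATTERNS = ["1a2a3a", "1aa2i3", "ma12uu3"]

if __name__ == "__main__":
    for form in sorted(derive(ROOT, PATTERNS, prefixes=("", "al"))):
        print(form)
```

Note that this blindly combines every prefix with every pattern, so it also
produces forms that are not real words. Deciding which combinations are
legal for which root is exactly the data set problem.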

I guess that implementing such an engine is not a big issue if you have the
data set you need. Unfortunately, I don't really like this approach, for 3 reasons:
1) The data set is the hardest part to get.
2) You can't really say that the theory is error-free before you have a working
implementation, something I was really too tired to implement.
3) I guess aspell is fine, which makes me wonder why you are working on Hspell.
I assume you have a strong argument, which I'd really like to know.

So do you usually take a word and strip it down until you get a base word, and
if you are able to get a base word, you flag it as correct?
I think this is the only way.
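
What I mean by stripping can be sketched roughly as below. This is only my
guess at the approach, not a description of what Hspell actually does; the
word list, the prefix and suffix lists, and the transliteration are all
hypothetical.

```python
# Hypothetical sketch: accept a word if removing one known prefix and one
# known suffix leaves a word from the base list. All data is illustrative.

BASE_WORDS = {"kitaab", "qalam"}
PREFIXES = ["", "al", "wa", "bi"]   # "" means "no prefix"
SUFFIXES = ["", "ii", "haa"]        # "" means "no suffix"


def is_known(word, base_words=BASE_WORDS):
    """Return True if some prefix/suffix split leaves a known base word."""
    for pre in PREFIXES:
        if not word.startswith(pre):
            continue
        rest = word[len(pre):]
        for suf in SUFFIXES:
            if suf and not rest.endswith(suf):
                continue
            stem = rest[:len(rest) - len(suf)] if suf else rest
            if stem in base_words:
                return True
    return False


if __name__ == "__main__":
    for w in ["alkitaab", "wakitaabii", "kitab", "alkitaabii"]:
        print(w, is_known(w))
```

The last test word shows the weakness: naive stripping accepts any prefix
together with any suffix, even combinations that should not occur together,
so a real checker would need per-word information about which affixes are
allowed.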

> But part #2, the lexicon, is what we spent most of our work on - I'd estimate
> that as much as 90% of the work that went into Hspell went into building the
> lexicon: from scratch, with utmost attention given to correctness and
> accuracy. Unfortunately, none of this work can be reused for an Arabic spell-
> checker, which will have a completely different lexicon. We can, however,
> if there's interest, outline some of the processes we used to find missing
> words and decide on the correct spelling.

I know. This is something I won't be able to build by myself, and I found only 1
person to help me, and he didn't have much time!
If you used an automated way, please give me some hints.

> What Dan tried to do is to show that once a lexicon + inflections is
> available, it's relatively easy to create an aspell-format dictionary and
> get support for that language in OpenOffice, Mozilla, and GMail. Since
> support for Arabic does not yet exist in these programs, we thought that
> such progress will be very exciting for Arabic writers. Of course, if
> the lexicon itself is worthless and filled with incorrect words, well,
> nothing we ever do with it will be useful. In computer parlance, this is
> known as "garbage in, garbage out"...

I don't think it's full of errors, but no one knows how many incorrect words
there are, given that you can't review all of the words and you don't know
anything about the process used to generate the lexicon.

Many many thanks,

-- 
GNU/Linux registered user #224950
Proud Egyptian GNU/Linux User Group <www.eglug.org> Member.
Life powered by Debian, Homepage: www.foolab.org
--
Don't send me any attachment in Micro$oft (.DOC, .PPT) format please
Read http://www.gnu.org/philosophy/no-word-attachments.html
Preferable attachments: .PDF, .HTML, .TXT
Thanx for adding this text to Your signature
