
Re: [OT?] Re: A (too huge) Arabic word-list (with prefixes) for spell-checkers



On Tue, May 16, 2006, Mohammed Sameer wrote about "Re: [OT?] Re: A (too huge) Arabic word-list (with prefixes) for spell-checkers":
> > Like everything(?), Hspell is basically composed of three parts:
> >  1. The inflection code: taking base words and a few hints for each,
> >     and producing a list of all legal inflections.
>...
> > Part #1 is only useful for Arabic as an idea, a technique that we proved
> > useful and very easy to implement (a protype useful for about 50% of the
> > Hebrew language was done in a few days). I'll gladly explain more about this
> > idea to anyone interested in listening. Obviously, the specific inflection
> > rules we wrote will not apply directly to Arabic.
> 
> I'd really love to know, If you have time.

Here is the idea, in brief. You can find a longer, but better written,
overview of Hspell's structure in
	http://ivrix.org.il/projects/spell-checker/doc/short.ps.gz

As I assume you don't know Hebrew, I'll try to give examples in Arabic,
but please forgive my lousy Arabic.

Our goal was to produce a complete list of all legal word forms in Hebrew -
not just base forms, but also nouns in plural and with possessives, verbs
in all tenses and all subjects (I, he, she, etc.), and so on. This word list
can be used by a very simple spell-checker code which just checks if each
input word is in the legal word list. aspell is an example of such code.
(NOTE: we also have to deal with prefix particles, such as "the", "in",
etc., but I'll ignore this issue in this mail; it's treated in the article).

In fact, making our word list useful for aspell was one of the original
goals in our choosing this approach for Hspell.

Naturally, we don't want to manually insert all the inflections for each
noun and verb. This is not only extremely slow and labor-intensive, it's
also a sure-fire way to make many errors because nobody can do such boring
repetitive work without making mistakes.

Instead, let's look at one base word, a noun, say "klb" (dog). We can
easily write an inflection program that, given the input line

	klb noun

generates the base word and its possessives:

	klb	(dog)
	klbi	(my dog)
	klbk	(your dog)
	...

However, the program cannot guess what the plural looks like, because Arabic
(and to a lesser degree, also Hebrew) has several forms of plural. In this
case, the correct plural is formed by putting an aliph before the end of the
word. Let's call this form of plural "aliph_end", and give the inflection
program the following input

	klb noun,aliph_end

and when it sees this it generates

	klb, klbi, klbk...
	klab	(dogs)
	klabk	(your dogs)
	...
	klbein	(two dogs)

We can similarly inflect other nouns which have the same pluralization rule,
say,

	bnt noun,aliph_end	(girl)

(Note that the different vowels in bnt and klb don't really matter to us if,
like Hspell, we aim for vowel-less spelling.)

Other nouns need to be pluralized differently, but in Hebrew 99% of the words
have just 4 types of basic pluralization rules, and I'm guessing that Arabic
also has a finite number of them (for the exceptional cases, you can add an
option like "plural=..." to manually specify the plural).

After two days of Perl programming, my noun inflector could probably inflect
correctly 50% of all Hebrew nouns (of course, I didn't have all of them in
my lexicon - I had something like 100 :-)).
It took more work, and more inflection hints, until we could confidently say
that we can correctly inflect every noun in the Hebrew language (when given
the appropriate hints, of course).

Note, by the way, how easy it is to add more words to the lexicon - sometimes
it's enough to just say something is a noun (and the code guesses the correct
hints), and sometimes you just need to add a short textual hint -
no need to change code, no need to use obscure regular expressions or numbers
or whatever.

While I was working on nouns, Dan was doing similar work for verbs. The
verb inflector is given a 3-letter root, the "binyanim" (forms of verb
inflection - the same concept also exists in Arabic), and sometimes more
hints, and inflects them correctly. Correct verb inflection requires a bit
more grammar knowledge, but this sort of knowledge is available in many
grammar books.
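
A hypothetical root-and-pattern sketch of the same kind (the pattern strings
here are invented stand-ins, not Dan's actual binyanim tables):

    # Hypothetical root-and-pattern verb generator (illustrative only).
    # '1', '2', '3' in a pattern are slots for the three root letters;
    # real patterns would be copied from a grammar book.

    PATTERNS = {
        "past_he":    "123",      # ktb -> ktb   ("he wrote", roughly)
        "past_i":     "123t",     # ktb -> ktbt  ("I wrote")
        "present_he": "y123",     # ktb -> yktb  ("he writes")
    }

    def conjugate(root):
        assert len(root) == 3
        return {name: "".join(root[int(c) - 1] if c in "123" else c
                              for c in pattern)
                for name, pattern in PATTERNS.items()}

    print(conjugate("ktb"))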

For a slightly longer (but more organized) account of the same story, also
in English, you can take a look at
	http://ivrix.org.il/projects/spell-checker/doc/short.ps.gz

> I'm not sure about the formation of Hebrew words, But in Arabic we have roots
> You then derive the various words from those words by adding letters to the,
> beginning, end or in the middle, Sometimes you change one letter or more, How
> are the Hebrew inflections formed ?

Very similarly.
But note that for most intents and purposes, the roots of Hebrew nouns do not
really matter, and you can start with the base form of the noun without caring
about its root (this also allows for nouns with foreign origin).
For verbs, starting with the roots is advised.

> set you need, Unfortunately I don't really like this approach for 3 reasons:
> 1) The data set is the hardest part to get.

Indeed. Which is why we worked incrementally, slowly but surely. Hspell 0.1,
released after two months of intensive work by Dan and myself, had 7,350
base words (after inflection, 125,022 words result). Three and a half years
later, Hspell 1.0 has 22,589 base words.
The great thing is that Hspell was already useful with a small lexicon,
and became more and more useful as the lexicon grew. You don't need to start
with a huge lexicon, and can work incrementally, over years.

> 2) You can't really say that the theory is error free before you have a working
> implementation, Something I was really tired to implement

My approach (explained above) is a generative ("synthetic") approach, which
generates the list of words. If it generates wrong words, I (as a Hebrew
speaker) will see them and know it (I can also refer to books and dictionaries
if I'm not sure). Of course, this means that you can only develop a spell-
checker for a language you know :-)

> 3) I guess aspell is fine, Which makes me wonder why you are working on hspell?
> I assume you have a strong argument which I'd really like to know.

If you read my above answers, you probably know by now: the Hspell project
is really about building a list of correct word forms - not about the simple
program (also called "hspell") which is run to do the actual spell-checking.
Our makefile generates both the "hspell" program and dictionary files
suitable for use in aspell. "Working on hspell" means, most of all, working
on Hspell's lexicon, something which obviously nobody in aspell will do
for us.

> So do you usually take a word, Strip it down until you get a base word ? and
> if you are able to get a base word then you flag it as correct ?
> I think this is the only way,

No, what you describe is the "analytic" approach - taking an input word,
analyzing (breaking it up) until we understand it, and if we do, we accept
it as legal. The main problem with this approach is that it requires Hebrew-
specific code at spell-check time (so it won't work in aspell, of course).
Its code also tends to be more entangled and harder to break up for several
people to work on concurrently (as Dan and I did).

As I said, our approach is a "synthetic" approach - we *synthesize* all
legal words as a first stage, and after we have this large full list,
we use it to spell-check input words.
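
In code, the spell-checking stage of the synthetic approach is nothing more
than a membership test against the pre-generated list (the file name here is
made up; aspell stores the same information much more compactly):

    # Spell-check stage of the synthetic approach: a plain set lookup
    # against the word list that the inflector generated earlier.
    with open("hebrew.wordlist", encoding="utf-8") as f:
        legal_words = set(line.strip() for line in f)

    def check(word):
        return word in legal_words

    print(check("klab"))   # True, if the inflector generated it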

> > But part #2, the lexicon, is what we spent most of our work on - I'd estimate
> > that as much as 90% of the work that went into Hspell went into building the
> > lexicon: from scratch, with utmost attention given to correctness and
>..
> I know, This is something I won't be able to build by myself and I found only 1
> person to help me, And he didn't have much time!!!
> If you've used an automated way, Please give me some hints ?

As I explained above, I made sure the lexicon format was very simple (no
obfuscated case numbers, no long XML, etc.) to make it easier to add new
words. However, this process was far from automatic... My estimate is that
Dan and I spent more than 6 man-months on building the lexicon over the past
four years - so if you do not have a passion for this type of thing, try
to find someone who does.

To find missing words, we used a lot of techniques, but the central technique
was spell-checking huge crawls of online Hebrew documents (newspapers, books,
etc.) and then manually sifting through the literally *tens of thousands*
of words reported as misspelled, looking for words that are actually correct
and are not really misspelled.
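
As a rough illustration of that sifting step, something along these lines
(the corpus path and tokenization are placeholders) lists the unknown words
by frequency, so the most promising candidates come up first:

    # Rough sketch of the "find missing words" step: spell-check a large
    # crawl and report unknown words by frequency for manual review.
    import re
    from collections import Counter

    with open("hebrew.wordlist", encoding="utf-8") as f:
        legal_words = set(line.strip() for line in f)

    unknown = Counter()
    with open("crawl.txt", encoding="utf-8") as f:
        for line in f:
            for word in re.findall(r"\w+", line):
                if word not in legal_words:
                    unknown[word] += 1

    # Mostly misspellings, but the real words hiding among them are
    # exactly what should go into the lexicon.
    for word, count in unknown.most_common(200):
        print(count, word)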

As I already said repeatedly, in your case it might be much easier to use
the word list you already have to look for missing words: even if it has
50% wrong words, it's much better than our crawl's spellcheck output, of
which perhaps 95% of the words were misspellings, and we had to find the
needle in the haystack.

-- 
Nadav Har'El                        |      Tuesday, May 16 2006, 19 Iyyar 5766
nyh at math dot technion dot ac dot il             |-----------------------------------------
Phone +972-523-790466, ICQ 13349191 |Seen on the back of a dump truck:
http://nadav.harel.org.il           |<---PASSING SIDE . . . . . SUICIDE--->