Re: Arabic spellchecker
- To: developer at arabeyes dot org
- Subject: Re: Arabic spellchecker
- From: Abdalla Alothman <abdalla at pheye dot net>
- Date: Tue, 15 Nov 2005 07:29:18 +0300
- Organization: Pheye Technologique, GT&C
- User-agent: KMail/1.8
Asalamu alaikum wa rahmatullaah
On Saturday 12 November 2005 00:18, Mohammed Sameer wrote:
> > So, as you can see, creating an arabic spellchecker is only a matter of
> > populating 2 files and plugging them into OOo, or using Myspell/Aspell
> > standalone.
> That's another problem, Creating the data files.
I read somewhere about someone who made a very clever move: he
collected user-entered Arabic search strings from Google. There
are ways to collect such words automatically, without actually
typing them by hand.
There are data structures (sets) that avoid duplicates.
With appropriate filtering techniques (I can share some), someone
could automatically generate an interesting word list from existing
search engines and Arabic web pages.
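As a rough illustration of the set idea, here is a minimal Python sketch. The regular expressions, the choice to strip short-vowel marks and tatweel, and the name extract_words are assumptions made for the example; they are only one possible filter, not the specific techniques mentioned above.

```python
import re

# Runs of Arabic letters: U+0621..U+063A and U+0641..U+064A.
ARABIC_WORD = re.compile(r"[\u0621-\u063A\u0641-\u064A]+")
# Assumption: short-vowel marks (U+064B..U+0652) and tatweel (U+0640)
# are stripped so that differently decorated spellings collapse together.
DIACRITICS = re.compile(r"[\u064B-\u0652\u0640]")

def extract_words(text, min_length=2):
    """Return the set of distinct Arabic words found in text."""
    cleaned = DIACRITICS.sub("", text)
    return {w for w in ARABIC_WORD.findall(cleaned) if len(w) >= min_length}

words = set()
for page in ["كتاب جديد", "كتاب قديم, hello 123"]:
    words |= extract_words(page)  # the set silently drops the repeated word
print(sorted(words))
```

Because the words live in a set, a word seen on a thousand pages is still stored once, and Latin text and digits never match the pattern in the first place.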
I would rather spend a week or two developing such tools than
type large amounts of data by hand.
I don't think it is that difficult. The programmer prepares an exclusion
list (words not to include, like min, ila, 'an, hum, etc.) and a minimal
basic word list. A search starts with one word from the basic list,
possibly chosen at random. Each result is a URL from the server that
points to an Arabic web page. The page and its links are parsed. If a word
is in neither the exclusion list nor the basic list, it is added to the list.
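The loop described above could be sketched roughly as follows. This is untested against any real search engine: search and fetch are hypothetical callables that the caller must supply, and harvest, max_pages and the sample data are names invented for the example.

```python
import random
import re
from collections import deque

ARABIC_WORD = re.compile(r"[\u0621-\u063A\u0641-\u064A]+")

def harvest(seed_words, exclusion, search, fetch, max_pages=50, rng=random):
    """Collect new words reachable from one search on a random seed word.

    search(word) -> list of URLs and fetch(url) -> (text, links) are
    hypothetical callables; no real search API is assumed here.
    """
    known = set(seed_words)
    found = set()
    queue = deque(search(rng.choice(sorted(known))))
    seen = set()
    while queue and len(seen) < max_pages:
        url = queue.popleft()
        if url in seen:
            continue  # never parse the same page twice
        seen.add(url)
        text, links = fetch(url)
        for word in ARABIC_WORD.findall(text):
            if word not in exclusion and word not in known:
                found.add(word)
        queue.extend(links)  # follow the page's links as well
    return found

# Fake in-memory "web" so the sketch runs without network access.
pages = {
    "u1": ("من كتاب جديد", ["u2"]),
    "u2": ("كتاب قديم", ["u1"]),
}
new_words = harvest(
    seed_words={"كتاب"},
    exclusion={"من"},
    search=lambda word: ["u1"],
    fetch=lambda url: pages[url],
)
print(sorted(new_words))
```

Passing search and fetch in as arguments keeps the word-filtering logic separate from whatever search engine or HTML parser is eventually used.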
Usually, to avoid adding misspelled words, an authentic source is used (e.g.,
newspapers, academic papers, online books, etc.)
Actually, the "wget" utility is a good tool for grabbing related web pages
(online books, articles, directories, etc.) if you don't want to write your
own tool: use the -r flag and pass it a set of URLs (probably from a file).
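For anyone scripting this, a small Python wrapper around wget might look like the sketch below. The -r, -l, -i and -P options are standard wget flags; the file name urls.txt, the output directory pages, the depth of 2 and the function names are assumptions chosen for the example.

```python
import subprocess

def build_wget_command(url_file, out_dir="pages", depth=2):
    """Build a wget call: recursive (-r), limited depth (-l),
    URLs read from a file (-i), output saved under out_dir (-P)."""
    return ["wget", "-r", "-l", str(depth), "-i", str(url_file), "-P", str(out_dir)]

def download(url_file, out_dir="pages", depth=2):
    """Run wget; raises CalledProcessError if wget exits with an error."""
    subprocess.run(build_wget_command(url_file, out_dir, depth), check=True)

command = build_wget_command("urls.txt")
print(" ".join(command))
```

Calling download("urls.txt") would then fetch every URL listed in the file, one per line, and follow links two levels deep.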
This is just a simple idea that I have never tried. I hope it's helpful.