Re: A (too huge) Arabic word-list (with prefixes) for spell-checkers



On Tue, May 16, 2006 at 11:28:18AM +0300, Dan Kenigsberg wrote:
> Hello,
> 
> I have converted Tim Buckwalter's database of Arabic - including all suffixes
> and prefixes - to the Aspell format. This makes it possible to spell-check
> Arabic in Aspell, Mozilla Thunderbird, and OpenOffice - if you can spare some
> 200Mb of RAM.
> 
> It is freely available under the GPL, on
> 	http://ivrix.org.il/projects/arabic .
> 

Hi Dan,

Note: I'm not criticizing, I'm just stating facts and trying to be constructive.

Originally, I worked on a fork of Duali, the Arabic spell checker by M. Elzubeir.
I then realized that aspell can do the job and that we do not need a dedicated Arabic spell checker.

I had a look at the Buckwalter data since it was the data set for both spell checkers.

There is something we hadn't noticed earlier, because we didn't have a working spell checker
implementation: the data set contains words from the Holy Quran. The words in the Holy Quran
are sometimes spelled differently because of the script used to write the Quran.

Those words are incorrect outside the Quran context.

Which leads to 3 points:
1) Those words are not correct.
2) The data files contain a small set of incorrect words (maybe this is a problem with
my implementation of the Buckwalter algorithm).
3) The affix data is huge and IMHO not easy to modify/extend, which means
that it'll be hard to strip those words.

This is why I decided to ignore the Buckwalter data and work on a new data set.

I know about a Google project to create a dictionary from the Buckwalter data, which
makes me wonder: why don't you cooperate with them?

PS. Why only "DICT ar EG ar" in dictionary.lst? ;-)

Best wishes,

-- 
GNU/Linux registered user #224950
Proud Egyptian GNU/Linux User Group <www.eglug.org> Member.
Life powered by Debian, Homepage: www.foolab.org
--
Don't send me any attachment in Micro$oft (.DOC, .PPT) format please
Read http://www.gnu.org/philosophy/no-word-attachments.html
Preferable attachments: .PDF, .HTML, .TXT
Thanx for adding this text to Your signature
