
Re: A (too huge) Arabic word-list (with prefixes) for spell-checkers



On Tue, May 16, 2006 at 01:18:15PM +0300, Dan Kenigsberg wrote:
> Mohammed,
> 
> Thank you for your criticism :-)
> 
> > Which leads to 3 points:
> > 1) Those words are not correct
> > 2) The data files contain a small set of incorrect words "Maybe this is a
> > problem with my implementation of the Buckwalter algorithm".
> > 3) The affix data is huge and IMHO not easy to modify/extend which means
> > that it'll be hard to strip those words.
> > 
> > This is why I decided to ignore the Buckwalter data and work on a new data
> > set.
> 
> Are you sure the best option was to ignore that data? How many incorrect word
> spellings are there? Would you please give me an example of an incorrect
> spelling of such a word, and the correct one?

I don't think it's the best option in general. But since I don't know how many
incorrect words there are, I decided to ignore it completely. I know that this
list is better than nothing, but I also don't know how one could extend the
data afterwards, or regenerate the affix rules he used, if I dump it to a plain
text file for spell checking (I had no time to investigate carefully).
Because of all that, I decided not to use it!

> I know that the affix data is huge, but please explain what has to be done. Do
> you mean that for some words the prefix+stem+suffix is wrong even if the stem
> is correct?

I can't really tell whether the problem is with the prefix, the stem or the
suffix, but I personally assume it's the combination of the 3 of them that is
wrong, even if each one is valid on its own. I wonder whether there's a way to
tell how the final word was generated. Do you have any idea how?
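
One way to find out how a word was generated is to try every
prefix/stem/suffix split and record which pairing is rejected. Below is a
minimal Python sketch, assuming three lexicons plus three pairwise
compatibility tables, loosely in the spirit of the Buckwalter data. All
entries, category names and the function name `segmentations` are made up for
illustration (in Latin transliteration) and are not taken from the real files.

```python
# Toy lexicons mapping a surface string to a category label.
# Every entry here is hypothetical, for illustration only.
PREFIXES = {"": "P0", "wa": "Pconj", "al": "Pdet"}
STEMS = {"kitab": "N", "katab": "V"}
SUFFIXES = {"": "S0", "an": "Snoun", "tu": "Sverb"}

# Allowed category pairs; a pair that is missing counts as incompatible.
PREFIX_STEM_OK = {("P0", "N"), ("Pconj", "N"), ("Pdet", "N"),
                  ("P0", "V"), ("Pconj", "V")}
STEM_SUFFIX_OK = {("N", "S0"), ("N", "Snoun"), ("V", "S0"), ("V", "Sverb")}
PREFIX_SUFFIX_OK = {("P0", "S0"), ("P0", "Snoun"), ("P0", "Sverb"),
                    ("Pconj", "S0"), ("Pconj", "Snoun"), ("Pconj", "Sverb"),
                    ("Pdet", "S0")}


def segmentations(word):
    """Return every (prefix, stem, suffix, failed_checks) split of `word`
    whose three pieces are each present in their own lexicon."""
    results = []
    n = len(word)
    for i in range(n + 1):          # end of the prefix
        for j in range(i, n + 1):   # end of the stem
            prefix, stem, suffix = word[:i], word[i:j], word[j:]
            if prefix in PREFIXES and stem in STEMS and suffix in SUFFIXES:
                p, s, x = PREFIXES[prefix], STEMS[stem], SUFFIXES[suffix]
                failed = []
                if (p, s) not in PREFIX_STEM_OK:
                    failed.append("prefix+stem")
                if (s, x) not in STEM_SUFFIX_OK:
                    failed.append("stem+suffix")
                if (p, x) not in PREFIX_SUFFIX_OK:
                    failed.append("prefix+suffix")
                results.append((prefix, stem, suffix, failed))
    return results


for w in ("alkitab", "alkitaban", "alkatab"):
    print(w, segmentations(w))
```

An empty list of failed checks means the split passes all three tables. A
non-empty one names the pairing that should not have been generated, which is
the case where each piece is valid on its own but the combination is not.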


> > I know about a Google project to create a dictionary from the Buckwalter data,
> > which makes me wonder: why don't you cooperate with them?
> I wouldn't mind. Maybe now one of them approaches me.

I emailed them and I think you've got their email by now since I was CC'ed!

> > PS. Why "DICT ar EG ar" only in dictionary.lst ? ;-)
> I was just trying to be minimalistic here, not to offend non-Egyptians...

Since I'm from EG, it's fine with me :-D

-- 
GNU/Linux registered user #224950
Proud Egyptian GNU/Linux User Group <www.eglug.org> Member.
Life powered by Debian, Homepage: www.foolab.org
--
Don't send me any attachment in Micro$oft (.DOC, .PPT) format please
Read http://www.gnu.org/philosophy/no-word-attachments.html
Preferable attachments: .PDF, .HTML, .TXT
Thanx for adding this text to Your signature
