On Tue, May 16, 2006 at 01:18:15PM +0300, Dan Kenigsberg wrote:
> Mohammed,
>
> Thank you for your criticism :-)
>
> > Which leads to 3 points:
> > 1) Those words are not correct
> > 2) The data files contain a small set of incorrect words (maybe this is a
> > problem with my implementation of the Buckwalter algorithm).
> > 3) The affix data is huge and IMHO not easy to modify/extend, which means
> > that it'll be hard to strip those words.
> >
> > This is why I decided to ignore the Buckwalter data and work on a new data
> > set.
>
> Are you sure the best option was to ignore that data? How many incorrect word
> spellings are there? Would you please give me an example of an incorrect
> spelling of such a word, and the correct one?

I don't think it's the best option in general. But since I don't know how
many incorrect words there are, I decided to ignore it completely. I know
that this list is better than nothing, but I also don't know how one could
extend the data afterwards, or regenerate the affix rules he used (I had no
time to investigate carefully), if I dump it to a plain text file for spell
checking. Because of all that, I decided not to use it!

> I know that the affix data is huge, but please explain what has to be done. Do
> you mean that for some words the prefix+stem+suffix is wrong even if the stem
> is correct?

I can't really tell whether the problem is with the prefix, stem or suffix,
but I personally assume it's the combination of the three of them, even if
each one is valid on its own. I wonder whether there's a way to tell how the
final word was generated. Do you have an idea how?

> > I know about a google project to create a dictionary from the Buckwalter data
> > which makes me wonder, Why don't you cooperate with them ?
> I wouldn't mind. Maybe now one of them approaches me.

I emailed them, and I think you've got their email by now since I was CC'ed!

> > PS. Why "DICT ar EG ar" only in dictionary.lst ? ;-)
> I was just trying to be minimalistic here, not to offend non-Egyptians...

Since I'm from EG, it's fine with me :-D

-- 
GNU/Linux registered user #224950
Proud Egyptian GNU/Linux User Group <www.eglug.org> Member.
Life powered by Debian, Homepage: www.foolab.org
--
Don't send me any attachment in Micro$oft (.DOC, .PPT) format please
Read http://www.gnu.org/philosophy/no-word-attachments.html
Preferable attachments: .PDF, .HTML, .TXT
Thanx for adding this text to Your signature
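
On the question of how to tell how a final word was generated: one possible approach, sketched here under assumptions, is to try every prefix + stem + suffix split of the word and report which pairwise compatibility check (prefix-stem, prefix-suffix, stem-suffix) each split passes or fails. The lexicons, category names and the trace() function below are invented toy data for illustration only; they are not taken from the Buckwalter files, and real data would have to be loaded in their place.

```python
# Toy lexicons: surface form -> category name.
# All entries and category names are hypothetical, for illustration only.
PREFIXES = {"": "P0", "al": "Pdef"}
STEMS = {"kitAb": "N", "katab": "V"}
SUFFIXES = {"": "S0", "At": "SplN", "uw": "SplV"}

# Toy compatibility tables: which category pairs may combine.
PREFIX_STEM = {("P0", "N"), ("P0", "V"), ("Pdef", "N")}
PREFIX_SUFFIX = {("P0", "S0"), ("P0", "SplN"), ("P0", "SplV"),
                 ("Pdef", "S0"), ("Pdef", "SplN")}
STEM_SUFFIX = {("N", "S0"), ("N", "SplN"), ("V", "S0"), ("V", "SplV")}


def trace(word):
    """Return every (prefix, stem, suffix, failed_checks) split of word.

    A split is listed when all three pieces exist in their lexicon;
    failed_checks names the pairwise compatibility checks it violates,
    so an empty list means the combination is allowed.
    """
    results = []
    n = len(word)
    for i in range(n + 1):            # end of prefix
        for j in range(i, n + 1):     # end of stem
            prefix, stem, suffix = word[:i], word[i:j], word[j:]
            if prefix in PREFIXES and stem in STEMS and suffix in SUFFIXES:
                a, b, c = PREFIXES[prefix], STEMS[stem], SUFFIXES[suffix]
                checks = (
                    ("prefix-stem", (a, b) in PREFIX_STEM),
                    ("prefix-suffix", (a, c) in PREFIX_SUFFIX),
                    ("stem-suffix", (b, c) in STEM_SUFFIX),
                )
                failed = [name for name, ok in checks if not ok]
                results.append((prefix, stem, suffix, failed))
    return results


if __name__ == "__main__":
    for w in ("alkitAbAt", "alkatab"):
        print(w, trace(w))
```

Run over a list of suspect words, a trace like this would show whether each piece is valid on its own and which combination check is the one that goes wrong.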