
Re: Volunteers for verifying the quran data

To: General Arabization Discussion <general at arabeyes dot org>
Subject: Re: Volunteers for verifying the quran data
From: Gregg Reynolds <gar at arabink dot com>
Date: Wed, 29 Jun 2005 11:33:23 -0500
Cc: "Bernard S. Greenberg at Basis" <bsg2004 at basistech dot com>, Tom Patterson <pattersont at summa dot com>, Zina Saadi <ZinaS at basistech dot com>
User-agent: Mozilla Thunderbird 1.0.2 (Windows/20050317)

Mete Kural wrote:

From: Gregg Reynolds <gar at arabink dot com>

Now, IMO a difficult design question is whether some true morphemes should in fact be encoded. Obvious examples: definite article, other particles like laa, sawfa, sa-, direct object suffixes -hu, -ha, etc. Unicode will never countenance something like that, but that doesn't mean we shouldn't. Such design decisions should be made strictly on a costs/benefits basis, IMO.

I'd like to restate my opinion here that such morphemic encoding is
better done at the markup level. So basically encode the characters
on the basis of a graphemic encoding using Unicode and then further
encode the morphemes on the markup level using an appropriate XML
schema.
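
To make the markup-level approach concrete, here is a minimal sketch in Python. It is only an illustration: the element names `w` and `m` and the `type` attribute are hypothetical and are not taken from OSIS, TEI, or any other published schema. The characters themselves stay ordinary Unicode; the morpheme boundaries and labels live entirely in the markup.

```python
import xml.etree.ElementTree as ET


def annotate_word(morphemes):
    """Build a <w> element with one <m> child per morpheme.

    `morphemes` is a list of (type, text) pairs. The element and
    attribute names are hypothetical, chosen only for this sketch.
    """
    word = ET.Element("w")
    for mtype, text in morphemes:
        m = ET.SubElement(word, "m", {"type": mtype})
        m.text = text
    return word


def surface_text(word):
    """Recover the plain Unicode string by concatenating morpheme text."""
    return "".join(m.text for m in word.iter("m"))


# "al-kitaab" (the book): definite article + noun stem,
# written with escapes so the codepoints are unambiguous.
AL = "\u0627\u0644"                  # alef, lam
KITAAB = "\u0643\u062a\u0627\u0628"  # kaf, teh, alef, beh

word = annotate_word([("definite-article", AL), ("stem", KITAAB)])
xml_string = ET.tostring(word, encoding="unicode")
```

Stripping the markup with `surface_text(word)` gives back the unannotated graphemic text, so plain-text tools still work on the character layer, while morpheme-aware tools read the `type` attributes.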

Understood. No argument from me on that point. Well, I might dispute "better"; and we can probably have a discussion about just what is and isn't a morpheme codepoint. As for Unicode, it would be great if they would do the right thing; I just happen to think the design principles of Unicode are inhospitable to some notions of character semantics that would be very beneficial for Arabic. So I just don't think Unicode will ever encode some of the things I'd like to see encoded. Doesn't mean Unicode isn't useful.

I guess what I'm suggesting is an intellectual exercise in encoding design. Do the cost/benefit analysis for any given codepoint; then e.g. encoding <negative-particle-laa> doesn't look so bad. The more information you can pack into the encoding, the less money you have to spend on higher-level software, and the more you can do with non-specialized software like grep. I'm not saying at this point that we *should* encode such morphemes; only that it is worth evaluating in neutral, quantifiable terms.
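
A small Python sketch of the grep argument, under loud assumptions: the codepoint U+E000 below is an arbitrary Private Use Area value standing in for a hypothetical <negative-particle-laa>, not anything Unicode defines or anyone in this thread has proposed. The sample text is two words, the particle laa followed by salaam, chosen because salaam happens to contain the letters lam + alef.

```python
# Graphemic spelling of the particle "laa": lam + alef.
LAM_ALEF = "\u0644\u0627"

# Hypothetical morpheme codepoint for <negative-particle-laa>.
# U+E000 is in the Private Use Area; the assignment is an assumption
# made only for this sketch.
NEG_LAA = "\uE000"

# "salaam": seen, lam, alef, meem. It contains lam + alef but is not
# the negative particle.
SALAAM = "\u0633\u0644\u0627\u0645"

# The same two-word text under the two encodings.
graphemic_text = " ".join([LAM_ALEF, SALAAM])
morphemic_text = " ".join([NEG_LAA, SALAAM])


def count_hits(text, pattern):
    """Count non-overlapping substring matches, as a plain grep would."""
    return text.count(pattern)


# Searching the graphemic text for lam + alef also hits inside "salaam".
graphemic_hits = count_hits(graphemic_text, LAM_ALEF)

# Searching the morphemic text for the dedicated codepoint hits only
# the particle.
morphemic_hits = count_hits(morphemic_text, NEG_LAA)
```

With the graphemic encoding, a plain substring search cannot tell the particle from the same letters inside another word; with the dedicated codepoint it can. The cost side of the analysis (fonts, input methods, conversion to and from standard Unicode) is not modelled here.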

Please take a look at what OSIS (www.bibletechnologies.com) has
done. They have already done a lot of this kind of morpheme-based
encoding at the markup level.

Thanks. I believe the Text Encoding Initiative has a bunch of stuff like that too.

-g
