Re: Volunteers for verifying the quran data



Mete Kural wrote:
From: Gregg Reynolds <gar at arabink dot com>
Now, IMO a difficult design
question is whether some true morphemes should in fact be encoded.
Obvious examples: definite article, other particles like laa,
sawfa, sa-, direct object suffixes -hu, -ha, etc. Unicode will
never countenance something like that, but that doesn't mean we
shouldn't. Such design decisions should be made strictly on a costs/benefits basis, IMO.


I'd like to restate my opinion here that such morphemic encoding is
better done at the markup level. So basically encode the characters
on the basis of a graphemic encoding using Unicode and then further
encode the morphemes on the markup level using an appropriate XML
schema.
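
A minimal sketch of what that markup-level layering might look like, assuming a made-up XML vocabulary: the <w> and <m> elements and their "type" and "gloss" attributes are hypothetical and not taken from OSIS, TEI, or any real schema. The characters stay plain Unicode graphemes and the morpheme boundaries live only in the markup.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup for one word, sa-yaktubu-hu ("he will write it").
# Element and attribute names are illustrative assumptions only.
SAMPLE = (
    "<w>"
    '<m type="particle" gloss="future">\u0633</m>'           # sa-
    '<m type="stem">\u064A\u0643\u062A\u0628</m>'            # yaktubu
    '<m type="suffix" gloss="3ms-object">\u0647</m>'         # -hu
    "</w>"
)

def morphemes(xml_text):
    """Return (type, text) pairs for every <m> element, in document order."""
    root = ET.fromstring(xml_text)
    return [(m.get("type"), m.text) for m in root.iter("m")]

def surface(xml_text):
    """Drop the markup and return the plain graphemic string."""
    root = ET.fromstring(xml_text)
    return "".join(m.text for m in root.iter("m"))

if __name__ == "__main__":
    for kind, text in morphemes(SAMPLE):
        print(kind, text)
    print(surface(SAMPLE))
```

The trade-off in this sketch is that the morpheme information is only available to software that parses the markup; a tool that sees just the output of surface() has lost it.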

Understood. No argument from me on that point. Well, I might dispute "better"; and we can probably have a discussion about just what is and isn't a morpheme codepoint. As for Unicode, it would be great if they would do the right thing; I just happen to think the design principles of Unicode are inhospitable to some notions of character semantics that would be very beneficial for Arabic. So I just don't think Unicode will ever encode some of the things I'd like to see encoded. Doesn't mean Unicode isn't useful.


I guess what I'm suggesting is an intellectual exercise in encoding design. Do the cost/benefit analysis for any given codepoint; then e.g. encoding <negative-particle-laa> doesn't look so bad. The more information you can pack into the encoding, the less money you have to spend on higher-level software, and the more you can do with non-specialized software like grep. I'm not saying at this point that we *should* encode such morphemes; only that it is worth evaluating in neutral, quantifiable terms.
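
To make the grep point concrete, here is a small sketch under loud assumptions: U+E000 is a private-use codepoint picked arbitrarily to stand in for <negative-particle-laa>; nothing in Unicode assigns it that meaning. The sample phrase is "laa salaam", where the letters lam + alef occur once as the particle and once inside the word salaam.

```python
LAM, ALEF, SEEN, MEEM = "\u0644", "\u0627", "\u0633", "\u0645"

# Hypothetical: a private-use codepoint standing in for <negative-particle-laa>.
NEG_LAA = "\uE000"

SALAAM = SEEN + LAM + ALEF + MEEM

# The same phrase in the two encodings.
graphemic = LAM + ALEF + " " + SALAAM   # particle spelled out as letters
morphemic = NEG_LAA + " " + SALAAM      # particle as a single codepoint

def count_hits(text, needle):
    """Plain substring count, i.e. roughly what a grep-style search sees."""
    return text.count(needle)

def to_graphemic(text):
    """Expand the hypothetical morpheme codepoint back into letters for display."""
    return text.replace(NEG_LAA, LAM + ALEF)

if __name__ == "__main__":
    # Searching the letters also matches the lam + alef inside salaam.
    print(count_hits(graphemic, LAM + ALEF))
    # Searching the dedicated codepoint matches only the particle.
    print(count_hits(morphemic, NEG_LAA))
    print(to_graphemic(morphemic) == graphemic)
```

The benefit side is the unambiguous search; the cost side shows up in to_graphemic(), since every display or interchange path would need some such mapping layer.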

Please take a look at what OSIS (www.bibletechnologies.com)
has done. They have already done a lot of this kind of morpheme-based
encoding at the markup level.

Thanks. I believe the Text Encoding Initiative has a bunch of stuff like that too.

-g