Re: Volunteers for verifying the quran data
- To: General Arabization Discussion <general at arabeyes dot org>
- Subject: Re: Volunteers for verifying the quran data
- From: Gregg Reynolds <gar at arabink dot com>
- Date: Wed, 29 Jun 2005 11:33:23 -0500
- Cc: "Bernard S. Greenberg at Basis" <bsg2004 at basistech dot com>, Tom Patterson <pattersont at summa dot com>, Zina Saadi <ZinaS at basistech dot com>
- User-agent: Mozilla Thunderbird 1.0.2 (Windows/20050317)
Mete Kural wrote:
From: Gregg Reynolds <gar at arabink dot com>
Now, IMO a difficult design
question is whether some true morphemes should in fact be encoded.
Obvious examples: definite article, other particles like laa,
sawfa, sa-, direct object suffixes -hu, -ha, etc. Unicode will
never countenance something like that, but that doesn't mean we
shouldn't. Such design decisions should be made strictly on a
costs/benefits basis, IMO.
I'd like to restate my opinion here that such morphemic encoding is
better done at the markup level. So basically encode the characters
on the basis of a graphemic encoding using Unicode and then further
encode the morphemes on the markup level using an appropriate XML
Understood. No argument from me on that point. Well, I might dispute
"better"; and we can probably have a discussion about just what is and
isn't a morpheme codepoint. As for Unicode, it would be great if they
would do the right thing; I just happen to think the design principles
of Unicode are inhospitable to some notions of character semantics that
would be very beneficial for Arabic. So I just don't think Unicode will
ever encode some of the things I'd like to see encoded. Doesn't mean
Unicode isn't useful.
I guess what I'm suggesting is an intellectual exercise in encoding
design. Do the cost/benefit analysis for any given codepoint; then e.g.
encoding <negative-particle-laa> doesn't look so bad. The more
information you can pack into the encoding, the less money you have to
spend on higher-level software, and the more you can do with
non-specialized software like grep. I'm not saying at this point that
we *should* encode such morphemes; only that it is worth evaluating in
neutral, quantifiable terms.
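To make the grep point concrete, here is a minimal sketch, not a proposal for an actual encoding. It assumes a Private Use Area codepoint (U+E000, chosen arbitrarily) standing in for a hypothetical <negative-particle-laa> character, and compares a plain substring search against the same toy text spelled graphemically. The sample words and the name NEG_LAA are illustrative assumptions only.

```python
# Hypothetical: a Private Use Area codepoint standing in for a dedicated
# <negative-particle-laa> character. No such Unicode character exists.
NEG_LAA = "\uE000"

# Graphemic spelling of laa: lam + alef.
LAM_ALEF = "\u0644\u0627"

# The word "salaam" (siin, lam, alef, miim) contains lam + alef as ordinary
# letters, not as the negative particle.
SALAAM = "\u0633\u0644\u0627\u0645"

# Two toy texts: the particle laa followed by the word salaam.
graphemic = LAM_ALEF + " " + SALAAM   # particle spelled with letters
morphemic = NEG_LAA + " " + SALAAM    # particle as its own codepoint


def count_hits(text, needle):
    """Count plain substring matches, roughly what `grep -o` would report."""
    return text.count(needle)


if __name__ == "__main__":
    # Graphemic text: the search also matches the letters inside "salaam".
    print(count_hits(graphemic, LAM_ALEF))
    # Morphemic text: only the particle itself matches.
    print(count_hits(morphemic, NEG_LAA))
```

In this toy case a word-boundary pattern would rescue the graphemic search, but that stops working for particles that attach to the following word, such as sa-, which is where the cost/benefit question gets interesting.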
Please take a look at what OSIS (www.bibletechnologies.com)
has done. They have already done a lot of this kind of morpheme-based
encoding at the markup level.
Thanks. I believe the Text Encoding Initiative has a bunch of stuff
like that too.
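For comparison, the markup-level alternative discussed in this thread can be sketched as follows. The element and attribute names (seg, w, m, type) and the type values are hypothetical, invented for this illustration; they are not taken from the OSIS or TEI schemas. The characters stay in ordinary graphemic Unicode and the morpheme information lives entirely in the markup.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: <w> wraps a word, <m> wraps a morpheme.
# First word: the negative particle laa.
# Second word: future prefix sa- attached to the verb ya'lam.
SAMPLE = """<seg>
  <w><m type="neg-particle">\u0644\u0627</m></w>
  <w><m type="future-prefix">\u0633</m><m type="verb">\u064a\u0639\u0644\u0645</m></w>
</seg>"""


def morphemes_of_type(xml_text, morph_type):
    """Return the text of every <m> element whose type attribute matches."""
    root = ET.fromstring(xml_text)
    return [m.text for m in root.iter("m") if m.get("type") == morph_type]


def plain_text(xml_text):
    """Strip the markup and recover the graphemic text, one word per <w>."""
    root = ET.fromstring(xml_text)
    return " ".join("".join(w.itertext()) for w in root.iter("w"))


if __name__ == "__main__":
    print(morphemes_of_type(SAMPLE, "neg-particle"))
    print(plain_text(SAMPLE))
```

The trade-off is visible here: the query is unambiguous, but it needs an XML-aware tool rather than plain grep.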