Re: Volunteers for verifying the quran data
- To: "General Arabization Discussion" <general at arabeyes dot org>
- Subject: Re: Volunteers for verifying the quran data
- From: "Thomas Milo" <t dot milo at chello dot nl>
- Date: Wed, 29 Jun 2005 09:30:53 +0200
- Cc: "Bernard S. Greenberg at Basis" <bsg2004 at basistech dot com>, Tom Patterson <pattersont at summa dot com>, Zina Saadi <ZinaS at basistech dot com>
I agree with Mete. This concept of encoding root morphemes separately from
other Arabic letters, if ported to Indo-European languages (much more
similar to Semitic than one would think at first glance), would mean that
the Root Consonants and the Vowels representing the "Ablaut Stufe" inside a
verb would have to be encoded distinctly, so that <EA> in rEAd [imperfect
tense] is encoded differently from EA in rEAd [perfect tense], while - to
push this over the top - the root elements <RD> in "read" would be encoded
differently from the root elements <RD> in "red".
Today, Unicode only encodes the visual aspect, even if that means losing the
morphological opposition rEAd:rEAd; on the other hand, it does express the
opposition rEAd:rEd even though the two are phonologically identical.
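To make the point concrete, here is a minimal sketch in Python (the helper name code_points is made up for illustration). It only compares code point sequences, which is all that plain text carries; the tense labels live in the comments, not in the encoding.

```python
def code_points(word):
    """Return the Unicode code points of a plain-text word as U+XXXX strings."""
    return [f"U+{ord(ch):04X}" for ch in word]

imperfect = "read"   # as in "I read every day"
perfect = "read"     # as in "I have read it"
colour = "red"       # sounds like the perfect-tense form

# The morphological opposition read:read is invisible in the encoding.
same_spelling = code_points(imperfect) == code_points(perfect)

# The purely graphic opposition read:red is preserved.
different_spelling = code_points(perfect) != code_points(colour)

print(code_points(imperfect))
print(code_points(colour))
print(same_spelling, different_spelling)
```

Any distinction between the two forms of "read" would have to come from a layer outside the plain text.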
Encoding morphemes instead of graphemes involves a level of linguistic
sophistication that falls way outside the scope of the Unicode Standard. The
latter is designed to encode plain text units of script, not units of
language. The best way forward is to straighten out Unicode regarding Arabic
so that at least it becomes consistent with its own professed principles.
That is why I advocate the concept of grapheme. On top of a set of
well-picked graphemic Unicode points for Arabic one should be able to design
sophisticated linguistic encoding.
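As a rough illustration of such layering (a sketch under my own assumptions, not a proposal anyone on this list has made), the plain text can stay purely graphemic while a separate annotation records which positions hold root consonants. The names AnnotatedWord, root_positions and search_by_root are hypothetical, and the positions are entered by hand here rather than derived by any analyzer.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AnnotatedWord:
    text: str              # plain-text layer: ordinary Unicode code points
    root_positions: tuple  # linguistic layer: indices of the root consonants

    def root(self):
        """Collect the root consonants named by the annotation layer."""
        return "".join(self.text[i] for i in self.root_positions)

def search_by_root(words, root):
    """Return the plain text of every word whose annotated root matches."""
    return [w.text for w in words if w.root() == root]

# Escapes are used so the example survives editors without Arabic support.
KITAAB = "\u0643\u062A\u0627\u0628"         # kaf, ta, alif, ba
MAKTUUB = "\u0645\u0643\u062A\u0648\u0628"  # meem, kaf, ta, waw, ba
RASUUL = "\u0631\u0633\u0648\u0644"         # ra, seen, waw, lam

lexicon = [
    AnnotatedWord(KITAAB, (0, 1, 3)),   # root k-t-b
    AnnotatedWord(MAKTUUB, (1, 2, 4)),  # root k-t-b
    AnnotatedWord(RASUUL, (0, 1, 3)),   # root r-s-l
]

hits = search_by_root(lexicon, "\u0643\u062A\u0628")  # search for k-t-b
print(hits)
```

The text layer is untouched by the annotation, so the same strings remain valid ordinary Unicode for software that knows nothing about roots.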
Even then, Arabic script does not fully cover the Arabic language from a
linguistic perspective. A (or maybe /the/) striking example is the inserted
vowel between the /n/ of tanween and any initial cluster of consonants,
e.g., /muHammadu-ni r-rasuulu/: it has no orthographic expression (I found
it described as kasra, bound to a small nuun in an Ottoman handbook, but I
never attested it in a manuscript). Such observations can serve to argue
that however deep the relation between language and the writing system is,
they remain separate systems with their own internal logic. Unicode can
serve to encode the Arabic writing system, not Arabic grammar.
Mete Kural wrote:
> Hi Gregg,
> Interesting stuff. Post it when you get a chance.
> Though I think whatever needs to be done should be done under the
> Unicode umbrella if possible. And that's the part that takes time.
> Since if it is not under Unicode, it is not really a standard and
> non-standard encoding schemes we have a dime a dozen. Also I have the
> bias for trying to enhance the current Arabic 06xx code block rather
> than a separate Arabic codeblock. If the 06xx Arabic code block is
> enhanced to enable clean abstract character/grapheme based encoding,
> that I much prefer, and I think this is not impossible, but it would
> take some time to get it through the Unicode Consortium. I think the
> difference between the new Arabic codeblock approach and improving
> the 06xx block approach is that with the separate codeblock, it would
> "force" people to encode using the best practices, but with the
> improved 06xx block approach, it will "allow" people to encode using
> the best practices as well as allowing people to encode using bad
> partially glyph based encoding practices because of the support for
> the legacy Arabic codepoints in the current 06xx Arabic block. But
> the obvious advantage to doing the 06xx block approach is that
> basically the same codeblock in which 99% of digital Arabic material
> today is created would be utilized rather than a second codeblock
> that most people will not even hear of. In addition it is probably
> far more likely for the Unicode Consortium to finally adopt the
> enhancements to the 06xx code block (as long as backwards
> compatibility is preserved) rather than allow a second Arabic block
> to be added in Plane 2. So that's the reason for my bias for using a
> single codeblock 06xx for Arabic.
> ---------- Original Message ----------------------------------
> From: Gregg Reynolds <gar at arabink dot com>
> Reply-To: General Arabization Discussion <general at arabeyes dot org>
> Date: Tue, 28 Jun 2005 17:19:47 -0500
>> Mete Kural wrote:
>>> Tom had some ideas on creating a whole new code chart from scratch
>>> for proper Arabic that is ideal for scholarly research in Plane 2
>>> of Unicode (so it would still be under the umbrella of
>>> Unicode). Then there would be a well defined conversion scheme
>>> between Unicode 06xx block and this new codeblock. He can perhaps
>>> give us some insight after he's back from his conference.
>>> But that would be a long term project and I wouldn't expect anything
>>> solid within the next five years.
>> Surprise! Already did that. ;)
>> I haven't looked at it for 2 or 3 years, but I've been mulling over
>> this sort of thing off and on for about 10 years. In the end I did
>> come up with an encoding design I think you might find interesting,
>> after about 3 years, and Lord only knows how many tries.
>> Unfortunately, it's been sitting on my hard drive for about 2 years
>> in TeX (actually Omega) waiting to be rewritten in XML, so I can't
>> really submit anything at
>> the moment for you to look at. I'll hack at it over the next few
>> weeks and post it to the web eventually; maybe you'll find something
>> useful. (It's kind of exciting to know that I'm not the only person
>> in the world who has daydreamed about this!)
>> I'll give you a few brief ideas of what I tried to capture:
>> 1) The most important principle is: move intelligence out of
>> and into the encoding;
>> 2) On the basis of that, include codepoints for both radical and
>> non-radical consonants. E.g., "m" is non-radical meem, "M" is
>> radical meem. Before you dismiss this as silly overkill, consider
>> all the things it allows us to do with standard software, such as
>> searching by root, sorting in the traditional manner, etc.
>> 3) Include codepoints for "occluded" or implicit radicals. So for
>> example, I use u as damma, but ˙ to mean a damma with an implicit waw
>> radical, as in "lam yaq˙l". So I can still search by root, even when
>> the radicals are only implicit.
>> 4) Various other stuff like tanween, tamweem, idgham, semantic v.
>> phonotactic shadda, etc.
>> 5) Explicit treatment of hamza as a full-fledged character that
>> happens to need a kursi
>> 6) Use Latin-1 as the abstract encoding, which means I can write
>> Arabic precisely (including Quranic with all marks!) in any Latin-1
>> enabled editor
>> 7) Jettison the supremely stupid legacy (read: Unicode) notion that
>> written Arabic is "inherently" bidirectional
> Mete Kural
> Touchtone Corporation