
Re: Volunteers for verifying the quran data

Hi Gregg,

Interesting stuff. Post it when you get a chance.

Though I think whatever needs to be done should be done under the Unicode umbrella if possible, and that's the part that takes time. If it is not under Unicode, it is not really a standard, and non-standard encoding schemes are a dime a dozen. I am also biased toward enhancing the current Arabic 06xx code block rather than creating a separate Arabic code block. If the 06xx block can be enhanced to enable clean abstract character/grapheme-based encoding, I much prefer that. I think this is not impossible, but it would take some time to get it through the Unicode Consortium.

I think the difference between the two approaches is this: a separate code block would "force" people to encode using best practices, whereas an improved 06xx block would "allow" people to use best practices while still permitting bad, partially glyph-based encoding practices, because the legacy Arabic codepoints in the current 06xx block would remain supported. But the obvious advantage of the 06xx approach is that it uses the same code block in which 99% of today's digital Arabic material is created, rather than a second code block that most people will never even hear of. In addition, the Unicode Consortium is probably far more likely to adopt enhancements to the 06xx block (as long as backwards compatibility is preserved) than to allow a second Arabic block in plane 2. So that's the reason for my bias toward using the single 06xx code block for Arabic.


---------- Original Message ----------------------------------
From: Gregg Reynolds <gar at arabink dot com>
Reply-To: General Arabization Discussion <general at arabeyes dot org>
Date:  Tue, 28 Jun 2005 17:19:47 -0500

>Mete Kural wrote:
>> Tom had some ideas on creating a whole new code chart from scratch
>> for proper Arabic that is ideal for scholarly research in Level 2
>> plane of Unicode (so it would still be under the umbrella of
>> Unicode). Then there would be a well defined conversion scheme
>> between Unicode 06xx block and this new codeblock. He can perhaps
>> give us some insight after he's back from his conference.
>> But that would be a long term project and I wouldn't expect anything
>> solid within the next five years.
>Surprise!  Already did that.  ;)
>I haven't looked at it for 2 or 3 years, but I've been mulling over this
>sort of thing off and on for about 10 years.  In the end I did come up
>with an encoding design I think you might find interesting, after about
>3 years, and Lord only knows how many tries.  Unfortunately, it's been
>sitting on my hard drive for about 2 years in TeX (actually Omega)
>waiting to be rewritten in XML, so I can't really submit anything at the
>moment for you to look at.  I'll hack at it over the next few weeks and
>post it to the web eventually; maybe you'll find something useful.
>(It's kind of exciting to know that I'm not the only person in the world
>who has daydreamed about this!)
>I'll give you a few brief ideas of what I tried to capture:
>1)  The most important principle is: move intelligence out of software
>and into the encoding;
>2)  On the basis of that, include codepoints for both radical and
>non-radical consonants.  E.g.,  "m" is non-radical meem, "M" is radical
>meem.  Before you dismiss this as silly overkill, consider all the
>things it allows us to do with standard software, such as searching by
>root, sorting in the traditional manner, etc.
>3)  Include codepoints for "occluded" or implicit radicals.  So for
>example, I use u as damma, but ˙ to mean a damma with an implicit waw
>radical, as in "lam yaq˙l".  So I can still search by root, even where
>the radicals are only implicit.
>4)  Various other stuff like tanween, tamweem, idgham, semantic v.
>phonotactic shadda, etc.
>5)  Explicit treatment of hamza as a full-fledged character that happens
>to need a kursi
>6)  Use Latin-1 as the abstract encoding, which means I can write Arabic
>precisely (including Quranic with all marks!) in any Latin-1 enabled editor
>7)  Jettison the supremely stupid legacy (read: Unicode) notion that
>written Arabic is "inherently" bidirectional
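
To make point 2 of the quoted list concrete, here is a minimal sketch of how root search might look with ordinary string operations. It assumes a hypothetical convention generalized from the "m"/"M" example: every radical consonant is an uppercase Latin letter, and everything else (non-radical consonants, vowels) is lowercase. The transliterations and the function names root_of and search_by_root are invented for illustration and are not Gregg's actual design; implicit radicals (point 3) and shadda (point 4) are not handled.

```python
def root_of(word: str) -> str:
    """Return the radicals of a word, assuming radicals are the uppercase letters."""
    return "".join(ch for ch in word if ch.isupper())


def search_by_root(words, root):
    """Return the words whose radicals, in order, equal the given root."""
    return [w for w in words if root_of(w) == root]


# Hypothetical transliterations: uppercase = radical, lowercase = everything else.
words = ["KaTaBa", "maKTaBa", "KiTaaB", "DaRaSa", "maDRaSa"]

ktb_words = search_by_root(words, "KTB")
drs_words = search_by_root(words, "DRS")
```

Note that the leading "m" in "maKTaBa" and "maDRaSa" is ignored because it is written as non-radical meem, so no morphological analysis is needed in software; the information is carried by the encoding itself, which is the idea behind point 1.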

Mete Kural
Touchtone Corporation