
Re: Volunteers for verifying the quran data



Mete Kural wrote:
Tom had some ideas on creating a whole new code chart from scratch
for proper Arabic that is ideal for scholarly research in Level 2
plane of Unicode (so it would still be under the umbrella of
Unicode). Then there would be a well defined conversion scheme
between Unicode 06xx block and this new codeblock. He can perhaps
give us some insight after he's back from his conference.

But that would be a long term project and I wouldn't expect anything
solid within the next five years.

Surprise!  Already did that.  ;)

I haven't looked at it for 2 or 3 years, but I've been mulling over this sort of thing off and on for about 10 years. In the end, after about 3 years and Lord only knows how many tries, I did come up with an encoding design I think you might find interesting. Unfortunately, it's been sitting on my hard drive for about 2 years in TeX (actually Omega) waiting to be rewritten in XML, so I can't really submit anything at the moment for you to look at. I'll hack at it over the next few weeks and post it to the web eventually; maybe you'll find something useful. (It's kind of exciting to know that I'm not the only person in the world who has daydreamed about this!)

I'll give you a few brief ideas of what I tried to capture:

1) The most important principle is: move intelligence out of software and into the encoding;

2) On the basis of that, include codepoints for both radical and non-radical consonants. E.g., "m" is non-radical meem, "M" is radical meem. Before you dismiss this as silly overkill, consider all the things it allows us to do with standard software, such as searching by root, sorting in the traditional manner, etc.

3) Include codepoints for "occluded" or implicit radicals. So for example, I use u as damma, but ú to mean a damma with an implicit waw radical, as in "lam yaqúl". So I can still search by root, even where the radicals are only implicit.

4) Various other stuff like tanween, tamweem, idgham, semantic v. phonotactic shadda, etc.

5) Explicit treatment of hamza as a full-fledged character that happens to need a kursi

6) Use Latin-1 as the abstract encoding, which means I can write Arabic precisely (including Quranic with all marks!) in any Latin-1-enabled editor

7) Jettison the supremely stupid legacy (read: Unicode) notion that written Arabic is "inherently" bidirectional

etc.
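
To make points 2 and 3 a bit more concrete, here is a rough Python sketch of how root search might fall out of such an encoding with nothing but ordinary string handling. It is only an illustration under assumptions: the rule "uppercase letter = radical consonant" follows the "m"/"M" example above, the single entry mapping "ú" to an implicit waw follows the "lam yaqúl" example, and the spellings "yaQúL", "maKTaB" and "KaTaBa" as well as the names IMPLICIT_RADICALS, extract_root and search_by_root are made up for this example rather than taken from the actual design.

```python
# Hypothetical table: vowel codepoints that carry an implicit radical.
# Only the one example from the post is included; a real design would
# presumably define more.
IMPLICIT_RADICALS = {"ú": "W"}  # damma with an implicit waw radical


def extract_root(word: str) -> str:
    """Collect the radicals of a word written in the assumed scheme."""
    root = []
    for ch in word:
        if ch in IMPLICIT_RADICALS:
            # Implicit radical recovered from the marked vowel.
            root.append(IMPLICIT_RADICALS[ch])
        elif ch.isalpha() and ch.isupper():
            # Uppercase letters are assumed to be radical consonants.
            root.append(ch)
    return "".join(root)


def search_by_root(words, root):
    """Return the words whose extracted root equals the requested root."""
    return [w for w in words if extract_root(w) == root]


if __name__ == "__main__":
    lexicon = ["maKTaB", "KaTaBa", "yaQúL"]
    print(extract_root("yaQúL"))
    print(search_by_root(lexicon, "KTB"))
```

The point of the sketch is that no morphological analyser is needed: the information about which letters are radicals lives in the text itself, so a plain filter is enough.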

-gregg