
Re: Quran data and issues in encoding the Quran in unicode



Hello Meor,

Thank you for your response. I understand that you want to design the font and encode the text so that the Arabic Quran can be displayed as it appears in print using the currently available versions of Gnome and Uniscribe. That is a worthy objective. I do have some suggestions, though, to make this endeavour as useful as possible.

>Ok, maybe someone can list out, which word / code they better encoded
>in a different way so that I can work on it (if i can ).

Please find my suggested encodings and explanations below.

>About the small alef, personally I would like to encode it using a
>tatweel + superscipt alef  for medial position, and a space +
>superscript alef for isolated position. The reason being is that the
>sequence will work on most, if not all existing font. You might argue
>that we don't need a tatweel for medial position, but without it, you
>will encounter another problem under windows. The same goes for small
>noon and yeh, which i thnk beter encode it with a tatweel. For small
>waw, I agree with Mr Milo.

First of all, I would suggest that you not steer the project toward accommodating the many Arabic fonts available today that do not implement the Unicode Arabic spec adequately. More than 90% of the Arabic fonts out there ignore considerable sections of the Unicode Arabic spec. Only a handful of fonts come close to rendering the Quran correctly (that is, rendering what can be encoded of the Quran with the current Unicode spec, leaving aside the Quranic characters that are still missing). Note that I say "come close". I wouldn't be surprised if only a handful of fonts in the world today actually render correctly what can be encoded of the Quran with the current Unicode spec. In fact the only four that I know of are Microsoft's Arabic Typesetting, your Arabeyes.org Meor font, and SIL's Scheherazade (even these two have problems with small alef, I think) and Lateef fonts (http://scripts.sil.org/cms/scripts/page.php?site_id=nrsi&item_id=ArabicFonts). There may be other solutions that have not yet reached the market or that I haven't heard of. So I think it is unnecessary to bend the encoding of the text so that the small alef renders correctly in fonts that aren't suitable for rendering the Quran anyway. The loss of consistency in the encoding is not worth it. You already face the challenge of accommodating the Gnome and Uniscribe rendering engines; trying to accommodate incomplete fonts on top of that would leave you with a lower-quality encoding. This is why I would recommend that you not use a tatweel before the small alef.

>For the sequential tanween, I've tried to use two fatha's as you
>suggested, but I encounter numerous problem to solve under windows
>(not tested under linux). For simlpe character it is ok, but when it
>comes to ligature, it was a nightmare to solve it. Finally, i decided
>to use existing code that will take the least effort on my side.  For
>sequential fathatan, I encode it as fathatan + subscript meem. For
>dammatan, I use superscript meem. Why you  may ask? First, the
>sequence is legal under uniscribe. Second, it  would not change the
>shaping of the character before and after it. And third, the sequence
>does not have any meaning as far as I know in the quran, so it would
>not led to any confusion.

I think this is sensible. In the end, neither two consecutive fathas nor fathatan followed by a subscript meem is standardized by Unicode as the encoding of the sequential fathatan, so choosing either one sounds fine to me. In fact, this problem probably needs to be addressed in the Unicode spec.
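To make the candidate spellings concrete, here is a small Python sketch that lists the code points involved. It assumes that "subscript meem" and "superscript meem" mean U+06ED and U+06E2 respectively; that mapping is my reading of the description, not something confirmed in this thread, and the names CANDIDATES and describe are only illustrative.

```python
import unicodedata

# Candidate encodings discussed in this thread; none is standardized by Unicode.
# Assumption: "subscript meem" = U+06ED and "superscript meem" = U+06E2.
CANDIDATES = {
    "sequential fathatan as two fathas": "\u064E\u064E",
    "sequential fathatan as fathatan + low meem": "\u064B\u06ED",
    "sequential dammatan as dammatan + high meem": "\u064C\u06E2",
}

def describe(seq):
    """Return 'U+XXXX NAME' for every code point in seq."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in seq]

for label, seq in CANDIDATES.items():
    print(label, "->", describe(seq))
```

All of the characters above are combining marks, so none of them should affect the joining of the base letters around them.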

>For hamza, the main problem, as I mention in the document, that I
>can't tell is there any hamza added to the original rasm or not. The
>best, I think is to use one code for the original hamza in the rasm,
>and another hamza for the added one. The logical choice for now is the
>arabic hamza and superscript hamza. As you mention, since unicode's
>property of hamza is different from what's in the quran, we run into
>other problems if we were to do that. That's why I encode it the way
>it is.

That depends on what you mean by the original rasm. In the very original rasm there were no hamzas of any kind. The Quran manuscripts from the first two Islamic centuries do not have any hamzas. The hamza is a later addition and, like the tanween and the vowel marks, it has no impact on the base rasm. Traditionally the hamza has never caused a change in the base rasm; any change it causes to the base rasm is an invention of the 20th century. Additionally, the hamzas in contemporary Quran printings can be divided into two categories:

1) Standalone hamzas that do not have a chair, such as the hamza in aadam, the hamza in ya'aadamu, and the hamza in saabi'iin. For now these should be encoded with 0621. Eventually a proposal for a new chairless hamza should be made to Unicode.

2) Hamzas that have a chair such as alef, ya, or waw and are always attached to one base letter. Most hamzas are like this. Almost every time a word starts with an alef, one of these hamzas sits on top of the alef. These should be encoded either with 0654 or with the corresponding composite codepoints (alef with hamza above, waw with hamza above, etc.).
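As a rough illustration of the two categories, the Python sketch below uses the standard unicodedata module to show that a chair letter followed by 0654 is canonically equivalent to the composite codepoint, while 0621 is a standalone letter with no decomposition. The names CHAIRS and compose are hypothetical helpers introduced only for this example.

```python
import unicodedata

HAMZA = "\u0621"        # ARABIC LETTER HAMZA: category 1, standalone
HAMZA_ABOVE = "\u0654"  # ARABIC HAMZA ABOVE: category 2, combining mark

# Chair letter -> composite letter with hamza above.
CHAIRS = {
    "\u0627": "\u0623",  # alef -> alef with hamza above
    "\u0648": "\u0624",  # waw  -> waw with hamza above
    "\u064A": "\u0626",  # yeh  -> yeh with hamza above
}

def compose(chair):
    """Compose chair + U+0654 into its canonical (NFC) form."""
    return unicodedata.normalize("NFC", chair + HAMZA_ABOVE)

for chair, composite in CHAIRS.items():
    # Both spellings of a chaired hamza normalize to the same code point.
    print(unicodedata.name(chair), "+ HAMZA ABOVE ->",
          unicodedata.name(compose(chair)), compose(chair) == composite)

# U+0621 is not a combining mark and does not decompose.
print(unicodedata.combining(HAMZA), unicodedata.normalize("NFD", HAMZA) == HAMZA)
```

Because the two spellings of a chaired hamza are canonically equivalent, either choice in category 2 can be unified by normalizing the text before comparison; no such equivalence exists between 0621 and 0654.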

>As mentioned earlier, my main objective for now is to make the
>document works under windows, and linux, and the workaround does work
>for all of the problems so far. Since there is no proposal submitted
>to Unicode yet (about the hamza, sequential tanween etc) and MS is not
>going to release a new OS until next year, I think the wordaround will
>be useful for quite sometime. When all of the issues are resolved, it
>is  quite trivial to change the document to conform to unicode
>standard.

Since the hamzas in aadamu, ya'aadamu, li-aadama and al-aakhirati all belong to the first category, they should all be encoded as 0621. If the hamzas in aadamu, li-aadama and al-aakhirati are encoded as 0621 and the one in ya'aadamu is encoded as tatweel plus 0654, then the graphemic integrity of the text (semantic consistency in the choice of codepoints) is compromised unnecessarily. With OpenType as it stands today, there should be a way to implement the hamza in ya'aadamu with the 0621 codepoint, just as there is a way to implement the hamza in li-aadama and al-aakhirati with 0621. If this can be implemented, that is, if your objective of displaying correctly with the current versions of Gnome and Uniscribe can still be met, why compromise the consistency of the encoding? I would recommend that you not use a tatweel and 0654, but rather use 0621 for the hamza in ya'aadamu and other similar hamzas in the Quran (such as the hamza in saabi'iin of verse 2:62).

>As for searching text is concern, I'm working on creating a database
>that will be able to do that (probably sqlite, for standalone and
>mysql. It is easy to transfer the data between the 2). It will have
>the original encoded word, current spelling of the word, and the root
>word for it. Users will then have the option to search based on their
>needs, be it root word, current spelling or the actual text ( which is
>unlikely). Since I'm not an arabic speaker, I will need some help in
>filling in the database. I know someone already working on root words
>of the quran ( there is a book for it, which I don't have access to,
>and the is www.openburhan.com , which not complete yet ) This, I think
>will bring greate benefit to all muslim, especially for research. Any
>volunteer?

Yes, these features would be indispensable for a student of the Quran. The semantic consistency of the encoded text is also very important for consistent search results and for consistent research in the Quran. Searching for aadam and not getting the ya'aadamu instances in the results, because the 0654 hamza above was used in the encoding rather than 0621, is no good. Achieving consistent search through complicated search algorithms and conversions is going to be just as difficult as designing a font that renders consistently encoded text. So why not choose the second option, which yields a cleanly and consistently encoded text?
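The search problem can be sketched in a few lines of Python. The strings below are simplified, unvowelled stand-ins for the words discussed, not the actual Quranic spellings, and fold_workaround is a hypothetical conversion that covers only this one workaround sequence.

```python
TATWEEL = "\u0640"
HAMZA = "\u0621"
HAMZA_ABOVE = "\u0654"

# Simplified stand-ins (assumption): vowel marks and other signs are omitted.
QUERY = HAMZA + "\u0627\u062F\u0645"  # hamza, alef, dal, meem
CONSISTENT = "\u064A" + HAMZA + "\u0627\u062F\u0645"
WORKAROUND = "\u064A" + TATWEEL + HAMZA_ABOVE + "\u0627\u062F\u0645"

def fold_workaround(text):
    """Hypothetical fold: map tatweel + hamza above back to U+0621."""
    return text.replace(TATWEEL + HAMZA_ABOVE, HAMZA)

print(QUERY in CONSISTENT)                   # plain substring search succeeds
print(QUERY in WORKAROUND)                   # same search misses the workaround
print(QUERY in fold_workaround(WORKAROUND))  # succeeds only after conversion
```

A fold like this would have to be written and maintained for every workaround sequence used in the text, whereas consistently encoded text can be searched directly.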

Kind regards,
Mete

--
Mete Kural
Touchtone Corporation
714-755-2810
--