Re: Quran data and issues in encoding the Quran in unicode

To: <general at arabeyes dot org>, Meor Ridzuan Meor Yahaya <meor dot ridzuan at gmail dot com>
Subject: Re: Quran data and issues in encoding the Quran in unicode
From: "Mete Kural" <metek at touchtonecorp dot com>
Date: Mon, 20 Jun 2005 19:15:28 -0700

Hello Meor,

I have some ideas. Please find my comments inline.

>Let me made clear a few points. First, my goal with the project. What
>I would like to have is some thing like this. We should have the quran
>encoded in full with the ability to extract it in 3 forms: the
>original Uthmani script (without the marks, dot is ok to encure the
>character), the current visual representation (according to madinah
>mushaf), and the current spelling to the words, which is more suitable
>for searching (esp for meaning). I think I manage to achive the 2nd
>goal with the workaround found in the file.
>
>Second, I'm not an arabic speaker, thus not very well versed in arabic
>grammar etc (I did took a course in elementary arabic though, and
>quite well versed in Quranic tajweed rules) So , it will be a bit
>difficult for me to accomplish the 3rd goal. This is where I really
>need help from others. That's why I post this issues to arabeyes, and
>hope that the encoding issues can be resolve asap. I don't really have
>a strong opinion on how it should have been done, it does not really
>matter to me as long as the standard is define.

I think the 3rd form (words in current spelling) should be dropped. Respelling the words of the Qur'an in modern form was a common practice in the past (thousands of manuscripts from throughout the 2nd millennium show it), and it corrupted the Qur'an text with many extra alefs. People would add alefs to make the words match the modern spelling, for example adding an alef between teh and beh in the word kitaab. It took extensive research for the scholars to remove most of these alefs (and a handful of other letters) for the famous 1924 Egyptian edition of the Qur'an, printed under the sponsorship of King Fu'ad of Egypt.

Rather than providing the Qur'an in this 3rd form, I think a clever conversion algorithm could return the modern spelling when the user requests it for a word found in the Qur'an. The same algorithm could be part of the search function, so that a user who types a word in modern spelling still finds it. So a conversion algorithm, with table lookups for the exceptional cases where automatic conversion is not possible, would be preferable to a parallel text database. A parallel text database would also double the maintenance work, since you would have to maintain and correct possible mistakes in both the main text and the modern-spelled text. Preparing the conversion algorithm and the necessary lookup tables is going to take a considerable amount of work, but I think it should be doable insha'Allah.
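
To make the idea a bit more concrete, here is a rough Python sketch of such a conversion with a lookup table. Everything in it is an assumption for illustration only: the names (to_modern, search, EXCEPTIONS), the single rule that a small alef (U+0670) becomes a full alef, and the one table entry. A real converter would need many more rules and a much larger table prepared by someone who knows the rasm well.

```python
import unicodedata

SUPERSCRIPT_ALEF = "\u0670"  # small alef written above the line
ALEF = "\u0627"
ALEF_WASLA = "\u0671"

# Hypothetical lookup table for words that no simple rule converts.
# Keys are bare Quranic spellings (all marks stripped), values are
# the modern spellings.
EXCEPTIONS = {
    # as-salah: waw in the mushaf spelling, alef in the modern spelling
    "\u0627\u0644\u0635\u0644\u0648\u0629": "\u0627\u0644\u0635\u0644\u0627\u0629",
}


def strip_marks(text):
    """Drop every nonspacing mark (harakat, tanween, shadda, small alef)."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")


def to_modern(word):
    """Return an approximate modern spelling of one Quranic word."""
    word = word.replace(ALEF_WASLA, ALEF)
    bare = strip_marks(word)
    if bare in EXCEPTIONS:
        return EXCEPTIONS[bare]
    # Assumed rule: a small alef marks a long "a" that modern spelling
    # writes as a full alef, as in kitaab.
    return strip_marks(word.replace(SUPERSCRIPT_ALEF, ALEF))


def search(words, modern_query):
    """Return the indexes of words whose modern form equals the query."""
    return [i for i, w in enumerate(words) if to_modern(w) == modern_query]
```

With this sketch, to_modern("\u0643\u0650\u062A\u064E\u0670\u0628") (kitaab as written in the mushaf, with a small alef) gives "\u0643\u062A\u0627\u0628" (kitaab with a full alef), so a search typed in modern spelling matches without any second copy of the text being stored.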

>To accomplish the 3rd goal, we need a rasm scholar. I think in my
>town, we don't have any. There might be a few in my country, but not
>sure. Some might argue the need of this,but I personally think that it
>would be the best if we can do this, so that the quran will be
>preserved in it's original state  (in digital form, or course)

I'm assuming you meant the 1st goal here (original Othmani text). The 1924 Egyptian printing of the Qur'an actually comes pretty close to the original Othmani text, if you remove all the fathas, dammas, kasras, fathatans, dammatans, kasratans, small alefs, and all the other marks that are not base letters. The 1984 Saudi printing (QuranComplex), which is the most popular printing today, reproduces the 1924 Egyptian text. The only differences are that the calligrapher Othman Taha wrote the Saudi one by hand, copying from the Egyptian printing, whereas the Egyptian edition was set in metal type, and that some of the annotation marks were changed. Neither the base letters nor the diacritical marks such as fathas and kasras were changed.

So I think that, rather than maintaining another parallel text database for the original Othmani text, it is again possible to do this with a conversion algorithm. Actually the algorithm here is simple enough that one shouldn't even call it an algorithm: simply remove all the fathas, dammas, kasras, etc., and you end up with text that is very similar to the original Othmani text. Beyond these marks, the original Othmani text still differs from the current Egyptian and Saudi printings in some alefs and a handful of other cases, so for instance more alefs would need to be removed. But this is a sensitive issue, and it is not likely that everybody will agree on which alefs should be removed and which should stay, since the ancient manuscripts also differ among themselves. You could stop at removing the extra marks automatically, on the fly, when the user asks to view the text in original form, without maintaining a separate parallel text database. Going further and removing more alefs to match the original form is something that should be researched.
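
As a rough illustration of how small this mark-removal step is, here is a Python sketch. It rests on an assumption of mine: that the Unicode general categories Mn (nonspacing marks: fatha, damma, kasra, the tanweens, shadda, sukun, small alef, the small high annotation signs) and Lm (tatweel, small waw, small yeh) are a fair proxy for "marks that are not base letters". The function name to_bare is made up for the example. Letters keep their dots, since those are part of the letter codepoints themselves, but combining hamza marks would be dropped too, which may or may not be what is wanted.

```python
import unicodedata

# Assumption: these two general categories cover the marks to be removed.
#   Mn = nonspacing marks (harakat, tanween, shadda, small alef, ...)
#   Lm = modifier letters (tatweel, small waw, small yeh)
REMOVABLE_CATEGORIES = {"Mn", "Lm"}


def to_bare(text):
    """Return text with all marks removed, leaving only base letters."""
    return "".join(
        ch for ch in text
        if unicodedata.category(ch) not in REMOVABLE_CATEGORIES
    )
```

For example, to_bare("\u0643\u0650\u062A\u064E\u0670\u0628") (kitaab with kasra, fatha and small alef) returns "\u0643\u062A\u0628", the three base letters kaf, teh, beh. Verse-end signs and similar symbols fall in other categories and are left alone by this sketch.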

>On top of that, there is one more goal, that would be great if we can
>accomplish also. That is, the different reading style of the quran.
>The files from the site is based on the madinah mushaf, which is based
>on Hafs reading. I think the Madinah Quran complex print at least 2
>edition, which is the Hafs and Warsh. A friend of mine, who is from
>Mali, said that people from his country use Qaloon style. I really
>would like to have a copy of each to see the diffrerent style.  Anyone
>would like to send me a copy (other than the Hafs style) are most
>welcome.

To maintain a Warsh Qur'an you pretty much can't avoid creating a separate parallel text database, since there are so many variations in the fathas, dammas, kasras, etc. I suggest that we work on the above first; Warsh could be looked into after that.

>Final note, I will release a totally new font which I created from
>scratch (digitally). The font was created from the actual mushaf, and
>I think this will pleased most people. The font will not have any
>truetype hint at all, but the quality rendered under linux is awesome
>(thanks to freetype autohinter). Under windows, it renders ok, but not
>that great. One of the issues that prevented me from realeasing it
>earlier was the license. ATM, I don't see any advantage of releasing
>it under GPL, since there is no source code involve here. The font is
>the source. So, I've decided to release it Free for non commercial use
>(use and distribution). Hope you will enjoy it.

Do you have plans to add truetype hinting to this font in the future?

Kind regards,
Mete

>Regards.
>
>
>
>
>On 6/21/05, Mete Kural <metek at touchtonecorp dot com> wrote:
>> >1. Mohamed Zakariya told me sequential tanween was practised in the North
>> >African tradition.
>> 
>> Hmm.. interesting information. Do you know since what century it has been used? By the way when you say North Africa are you also referring to Egypt or do you mean that it's a Maghribi tradition?
>> 
>> 
>> >2. I concur that a clean encoding should leave tanween as such intact
>> >(whether it is encoded as fatha-fatha of fathatan etc would be immaterial)
>> >so that the phonetic variants need a separate code point.
>> 
>> I agree with you that the best way to handle this is to keep the tanween intact and use a special codepoint that comes after the tanween codepoint in order to trigger the variant sequential tanween glyph.
>> 
>> Regards,
>> Mete
>> 
>> >
>> >Regards,
>> >
>> >t
>> >
>> >
>> >
>> >Mete Kural wrote:
>> >> Salaam Abdulhaq,
>> >>
>> >> On the point of so-called "sequential tanween" I am a little bit
>> >> undecided. First of all, it is perfectly clear that what we call a
>> >> sequential fathatan, sequential dammatan, and sequential kasratan in
>> >> fact without doubt is a fathatan, a dammatan and a kasratan
>> >> respectively. As far as I know (please confirm this those who know)
>> >> the 1924 Egyptian printing of the Quran was the first to use variant
>> >> glyphs for fathatan, dammatan and kasratan for the cases when the
>> >> noon is not pronounced. So these sequential tanweens have not been a
>> >> part of Arabic until the 20th century. Regardless since most of the
>> >> Qurans printed in the world today employ these sequential tanweens
>> >> Unicode has to accomodate for them somehow and these sequential
>> >> tanweens need to be listed in the Arabic code page. But the thing is
>> >> that since these sequential tanweens are essentially no more than
>> >> just tanweens, I would be inclined to encode them as regular
>> >> tanweens, fathatan, dammatan, or kasratan, in order to preserve the
>> >> graphemic integrity of the text. To trigger the sequential behaviour,
>> >> a special codepoint could be added that would be placed right after
>> >> the respective tanween codepoint. Well this would be the ideal state
>> >> of things.
>> >>
>> >> Although if three seperate codepoints were added for sequential
>> >> fathatan, sequential dammatan and sequential kasratan, this would not
>> >> be totally inconsistent with how the Arabic codepage has been
>> >> evolving since to me the Unicode Arabic codepage as a whole is a
>> >> hybrid of grapheme (character) based encoding and some glyph based
>> >> encoding, which is ugly. So I guess adding the three seperate
>> >> sequential tanween codepoints would not be inconsistent with the
>> >> current ugly state of things in the Arabic codepage, but I would
>> >> prefer a cleaner method such as a special codepoint that triggers
>> >> sequential behaviour.
>> >>
>> >> Eventually the Arabic codepage needs to evolve to at least allow a
>> >> clean encoding of the Quran, although it seems like even if that
>> >> happens, the uglier method of encoding will always be available to
>> >> whoever chooses to anyways.
>> >>
>> >> Kind regards,
>> >> Mete
>> >>
>> >> ---------- Original Message ----------------------------------
>> >> From: Abdulhaq Lynch <al-arabeyes at alinsyria dot fsnet dot co dot uk>
>> >> Reply-To: Development Discussions <developer at arabeyes dot org>
>> >> Date:  Mon, 20 Jun 2005 22:49:29 +0100
>> >>
>> >>> I don't agree with some basic points about all this. As I understand
>> >>> it Unicode wants to move from what has become a glyph-based coding
>> >>> over to a semantic-based encoding, and allow the font technology to
>> >>> worry about the glyphs. Fine.
>> >>>
>> >>> However, Thomas et al. seem to be determined to pursue the opposite
>> >>> in terms of tanween and tajweed related marks. Tanween is
>> >>> semantically totally different to a fatha or one fatha followed by
>> >>> another. It is a semantic character of its own and deserves codes of
>> >>> its own. Iqlaab, ikhfaa, madd etc are tajweed marks that govern the
>> >>> pronounciation of the arabic and each carries a full semantic load.
>> >>> It should be possible to encode these semantically loaded objects in
>> >>> any textual representation of quran. If the Unicode consortium is
>> >>> not interested in encoding one of the most common books in the world
>> >>> then a further code standard must be developed on top of the unicode
>> >>> one. Hacks like placing two characters next to each other to
>> >>> 'inspire' the font renderer to display a third semantically
>> >>> different character just don't cut it.
>> >>>
>> >>> If Unicode really is about semantics and not glyphs, then let's have
>> >>> that then please and give us a code-point per semantic load. If
>> >>> instead we have to hack around with glued-together glyphs to try and
>> >>> indicate missing meaning, then we should look elsewhere.
>> >>>
>> >>> If anyone is interested (on the safe assumption that Unicode is not
>> >>> interested in that) then perhaps we could discuss such a code
>> >>> extension here.
>> >>>
>> >>> Abdulhaq
>> >>>
>> >>> On Friday 17 June 2005 09:29, Thomas Milo wrote:
>> >>>> Hi Mete, Meor,
>> >>>>
>> >>>> Just a quick reaction: U+0640 TATWEEL does not represent a
>> >>>> character (or rather, grapheme) but a unit of typography (i.e., a
>> >>>> glyph). It should never have been part of the Arabic code block in
>> >>>> the first place. If you prefer the sequence like
>> >>>> Fatha-SmalAlifAbove (in regular Unicode) to print with a upporting
>> >>>> Tatweel, consider building a substitution in your OTF.
>> >>>>
>> >>>> A second point is the use of tanween followed by SmalMeemAbove and
>> >>>> SmalMeemBelow. This is non-standard use of the small meems, plain
>> >>>> and simple. The fact that the obvious encoding with sequential
>> >>>> single harakat and single harakat+small meem is not supported
>> >>>> correctly by Microsofts Uniscribe does not justify the use of
>> >>>> illegal encoding. It would be better to report a bug to MS
>> >>>> typography or, if you don't like to be the prisoner of third
>> >>>> party's prorietary solutions, develop your own OTF parser (what we
>> >>>> do).
>> >>>>
>> >>>> BTW, Mete did not mention Decotype's Naskh as a font that handles
>> >>>> Qur'anic Arabic, because it is not yet published. In this project
>> >>>> we consider the Uncode points SmalMeemAbove and SmalMeemBelow a
>> >>>> mistake: they are contextual variants of SmallMeem (were - a
>> >>>> single! -kasra pulls it below the script line). Therefore we treat
>> >>>> them as identical. However, for compatibility's sake we could add a
>> >>>> few front end substitutions to convert your private encoding to our
>> >>>> (private?) encoding (which I believe you could have done to bypass
>> >>>> the MS Unicsribe constraints)
>> >>>>
>> >>>> Regards,
>> >>>>
>> >>>> t
>> >>>>
>> >>>> Mete Kural wrote:
>> >>>>> Hello Meor,
>> >>>>>
>> >>>>> Please find my suggested encodings and explanations below.
>> >>>>>
>> >>>>>> About the small alef, personally I would like to encode it using a
>> >>>>>> tatweel + superscipt alef  for medial position, and a space +
>> >>>>>> superscript alef for isolated position. The reason being is that
>> >>>>>> the sequence will work on most, if not all existing font. You
>> >>>>>> might argue that we don't need a tatweel for medial position, but
>> >>>>>> without it, you will encounter another problem under windows. The
>> >>>>>> same goes for small noon and yeh, which i thnk beter encode it
>> >>>>>> with a tatweel. For small waw, I agree with Mr Milo.
>> >>>>>
>> >>>>> First of all, I would suggest to you not to steer the project in a
>> >>>>> way to accomodate the variety of Arabic fonts that are available
>> >>>>> today which do not implement the Unicode Arabic spec adequately.
>> >>>>> More than 90% of the Arabic fonts out there ignore implementing
>> >>>>> considerable sections of the Unicode Arabic specs. Only a handful
>> >>>>> of fonts come close to rendering the Quran correctly (at least
>> >>>>> rendering what can be encoded of the Quran with the current
>> >>>>> Unicode spec, excepting the missing needed Quranic characters).
>> >>>>> Take notice that I say come close to rendering the Quran
>> >>>>> correctly. I wouldn't be surprised if there are only a handful
>> >>>>> fonts in the world today that in fact do render correctly what can
>> >>>>> be encoded of the Quran with the current Unicode spec. In fact the
>> >>>>> only four that I know are Microsoft's Arabic Typesetting, your
>> >>>>> Arabeyes.org Meor font, and SIL's Scheherazade (even these two
>> >>>>> have problems with small alef I think) and Lateef fonts
>> >>>>
>> >>>>
>> >>>>> (http://scripts.sil.org/cms/scripts/page.php?site_id=nrsi&item_id=ArabicFonts).
>> >>>>
>> >>>>> There may be other solutions that are not yet delivered to the
>> >>>>> market or I haven't heard of. So I think trying to accomodate the
>> >>>>> encoding of the text to render the small alef correctly with other
>> >>>>> fonts that aren't suitable for rendering the Quran is unnecessary.
>> >>>>> The compromise made on the consistency of the encoding is not
>> >>>>> worth it to try to accomodate these unsuitable fonts. You already
>> >>>>> have a challenge to accomodate for the Gnome and Uniscribe
>> >>>>> rendering engines; trying to accomodate for some incomplete fonts
>> >>>>> in addition to that would leave you with a not so desired encoding
>> >>>>> quality. This is why I would recommend you not to use a tatweel
>> >>>>> before the small alef.
>> >>>
>> >>>
>> >>
>> >> --
>> >> Mete Kural
>> >> Touchtone Corporation
>> >> 714-755-2810
>> >
>> >
>> 
>> --
>> Mete Kural
>> Touchtone Corporation
>> 714-755-2810
>> --
>> 
>> 
>> _______________________________________________
>> Developer mailing list
>> Developer at arabeyes dot org
>> http://lists.arabeyes.org/mailman/listinfo/developer
>>
>

--
Mete Kural
Touchtone Corporation
714-755-2810
--