
Re: Quran data and issues in encoding the Quran in unicode



Thanks to all for your input. I think I should clarify a few things.
The text that I've created is for viewing purposes only, so to me
the most important thing is making it display correctly. As you can
see, the PDF document, which was created under Windows using my font,
does render the words correctly (it is not a scanned image). My
reference platform for now is Windows XP, since it is the most widely
used here. I've seen a discussion at GNOME saying that they also
sometimes deviate a bit from the OpenType spec to make the behaviour
similar to Uniscribe. I've tested my font and text under Ubuntu 5.04
(which I think includes most of the latest stuff), and it seems that
most of the workarounds work. Only one bug (I think) appears in GNOME:
incorrect placement of marks for certain ligatures under certain
circumstances. It is a bit weird that it happens. I've submitted a
bug report to them, and someone is already looking into the problem.

For the sequential tanween, I tried using two fathas as you
suggested, but I encountered numerous problems under Windows
(not tested under Linux). For simple characters it is OK, but when it
comes to ligatures, it was a nightmare to solve. Finally, I decided
to use existing codes that take the least effort on my side. For
sequential fathatan, I encode it as fathatan + subscript meem. For
dammatan, I use superscript meem. Why, you may ask? First, the
sequence is legal under Uniscribe. Second, it does not change the
shaping of the characters before and after it. And third, the sequence
does not have any meaning in the Quran as far as I know, so it will
not lead to any confusion.
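
A minimal sketch of the workaround sequences described above. The code
points are assumptions: "subscript meem" is taken to be U+06ED (ARABIC
SMALL LOW MEEM) and "superscript meem" U+06E2 (ARABIC SMALL HIGH MEEM
ISOLATED FORM), and the dammatan case is assumed to be dammatan +
superscript meem. The helper name is hypothetical.

```python
# Assumed code points; the marks are named in the text, the code points are not.
FATHATAN = "\u064B"         # ARABIC FATHATAN
DAMMATAN = "\u064C"         # ARABIC DAMMATAN
SMALL_LOW_MEEM = "\u06ED"   # assumed "subscript meem"
SMALL_HIGH_MEEM = "\u06E2"  # assumed "superscript meem"

# Workaround sequences standing in for sequential tanween.
SEQUENTIAL_TANWEEN = {
    "fathatan": FATHATAN + SMALL_LOW_MEEM,
    "dammatan": DAMMATAN + SMALL_HIGH_MEEM,
}


def with_sequential_tanween(base: str, kind: str) -> str:
    """Append the workaround sequence for `kind` to a base letter."""
    if kind not in SEQUENTIAL_TANWEEN:
        raise ValueError(f"unsupported tanween kind: {kind!r}")
    return base + SEQUENTIAL_TANWEEN[kind]
```

Both marks in each sequence are combining marks, so they attach to the
preceding letter without inserting a new base character between
neighbouring letters.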

About the small alef, personally I would like to encode it as
tatweel + superscript alef for the medial position, and space +
superscript alef for the isolated position. The reason is that the
sequence will work with most, if not all, existing fonts. You might
argue that we don't need a tatweel for the medial position, but
without it you will encounter another problem under Windows. The same
goes for small noon and yeh, which I think are better encoded with a
tatweel. For small waw, I agree with Mr Milo.
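
As an illustration only, the two small-alef sequences above can be
written out as follows. U+0640 (ARABIC TATWEEL) and U+0670 (ARABIC
LETTER SUPERSCRIPT ALEF) are the standard code points; the function
name and the "medial"/"isolated" labels are hypothetical.

```python
TATWEEL = "\u0640"           # ARABIC TATWEEL
SUPERSCRIPT_ALEF = "\u0670"  # ARABIC LETTER SUPERSCRIPT ALEF
SPACE = "\u0020"


def small_alef(position: str) -> str:
    """Return the carrier + superscript alef sequence for a position."""
    if position == "medial":
        # Tatweel keeps the neighbouring letters joined and carries the mark.
        return TATWEEL + SUPERSCRIPT_ALEF
    if position == "isolated":
        # A plain space acts as the carrier when nothing joins.
        return SPACE + SUPERSCRIPT_ALEF
    raise ValueError(f"unknown position: {position!r}")
```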

For hamza, the main problem, as I mention in the document, is that I
can't tell whether a given hamza was added to the original rasm or
not. The best approach, I think, is to use one code for the original
hamza in the rasm and another for the added one. The logical choice
for now is the Arabic hamza and the superscript hamza. As you mention,
since Unicode's properties for hamza differ from how it behaves in the
Quran, we would run into other problems if we were to do that. That's
why I encode it the way it is.

As mentioned earlier, my main objective for now is to make the
document work under Windows and Linux, and the workarounds do work
for all of the problems so far. Since there is no proposal submitted
to Unicode yet (about the hamza, sequential tanween etc.) and MS is
not going to release a new OS until next year, I think the workarounds
will be useful for quite some time. When all of the issues are
resolved, it will be quite trivial to change the document to conform
to the Unicode standard.
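
A rough sketch of what such a later conversion could look like: a
plain search-and-replace driven by a mapping table. The final target
encoding is not known yet, so the table below is only a placeholder
that maps the assumed workaround sequences (fathatan + U+06ED,
dammatan + U+06E2) to the doubled-vowel form suggested earlier in this
thread. The function and table names are hypothetical.

```python
def migrate(text: str, mapping: dict[str, str]) -> str:
    """Replace each workaround sequence with its target sequence."""
    # Longest keys first, so a short key cannot clobber part of a longer one.
    for old in sorted(mapping, key=len, reverse=True):
        text = text.replace(old, mapping[old])
    return text


# Placeholder table: assumed workaround sequences -> doubled short vowels.
PLACEHOLDER_MAPPING = {
    "\u064B\u06ED": "\u064E\u064E",  # fathatan + small low meem -> fatha fatha
    "\u064C\u06E2": "\u064F\u064F",  # dammatan + small high meem -> damma damma
}
```

This only works cleanly because the workaround sequences have no other
meaning in the text; if a replacement could itself match another key,
a single-pass approach would be needed instead.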

As far as searching the text is concerned, I'm working on creating a
database that will be able to do that (probably SQLite for standalone
use, and MySQL; it is easy to transfer the data between the two). It
will hold the original encoded word, the current spelling of the word,
and its root word. Users will then have the option to search based on
their needs, be it by root word, current spelling or the actual text
(which is unlikely). Since I'm not an Arabic speaker, I will need some
help in filling in the database. I know someone is already working on
the root words of the Quran (there is a book for it, which I don't
have access to, and there is www.openburhan.com, which is not complete
yet). This, I think, will bring great benefit to all Muslims,
especially for research. Any volunteers?
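
One possible shape for such a word table, sketched with Python's
built-in sqlite3 module. The table name, column names, and the
sura/ayah/position columns are all assumptions made for illustration;
only the three text fields (encoded word, current spelling, root)
come from the description above.

```python
import sqlite3

SEARCHABLE = ("encoded", "modern", "root")


def create_db(path: str = ":memory:") -> sqlite3.Connection:
    """Create the (hypothetical) words table and its lookup indexes."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS words (
               id       INTEGER PRIMARY KEY,
               sura     INTEGER NOT NULL,
               ayah     INTEGER NOT NULL,
               position INTEGER NOT NULL,  -- word index within the ayah
               encoded  TEXT NOT NULL,     -- original encoded word
               modern   TEXT,              -- current spelling
               root     TEXT               -- root word
           )"""
    )
    conn.execute("CREATE INDEX IF NOT EXISTS idx_words_root ON words(root)")
    conn.execute("CREATE INDEX IF NOT EXISTS idx_words_modern ON words(modern)")
    return conn


def search(conn: sqlite3.Connection, field: str, value: str) -> list:
    """Return (sura, ayah, position) for every exact match on `field`."""
    if field not in SEARCHABLE:
        raise ValueError(f"cannot search on {field!r}")
    # `field` is checked against a fixed whitelist; `value` is bound safely.
    cur = conn.execute(
        f"SELECT sura, ayah, position FROM words WHERE {field} = ? "
        "ORDER BY sura, ayah, position",
        (value,),
    )
    return cur.fetchall()
```

The same schema uses only plain INTEGER and TEXT columns, which should
keep a later transfer to MySQL straightforward.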

OK, maybe someone can list which words / codes would be better encoded
in a different way, so that I can work on them (if I can).

Regards.


On 6/17/05, Mete Kural <metek at touchtonecorp dot com> wrote:
> Hello again Meor,
> 
> Another addition I have to make is in regards to ya-aadamu in 2:33. The hamza in ya-aadamu of 2:33 (and the 3-4 other instances of ya-aadamu found in the Quran) should be 0621 (for now, until the new chairless hamza is added to Unicode), not {tatweel 0640} + {hamza above 0654}. This is the exact same aadam as found two verses above in 2:31, where it says "wa allama aadama al-asmaa..". In 2:31, when aadam is written, 0621 is used for the hamza of aadam, whereas in 2:33 aadam is simply prefixed with ya-. This should not cause the hamza of aadam to change from 0621 to 0654 on top of a 0640 tatweel. The hamza should still be 0621; otherwise we are breaking the graphemic integrity of the text, and we don't want to do that. In fact the next verse, 2:34, has "li-aadama", and it is still the same 0621 hamza that you have used there. If aadamu by itself uses 0621 and li-aadama uses 0621, then ya-aadamu should use 0621 too.
> 
> Kind regards,
> Mete
> 
> ---------- Original Message ----------------------------------
> From: "Mete Kural" <metek at touchtonecorp dot com>
> Reply-To: metek at touchtonecorp dot com,Development Discussions <developer at arabeyes dot org>
> Date:  Thu, 16 Jun 2005 09:01:31 -0700
> 
> >Hello Meor,
> >
> >Back a long time ago there was a big discussion regarding submitting a proposal to Unicode. Thomas Milo was involved as well. You can find it in the archives. I don't want to start another big discussion but just want to comment on your issues document found at http://www.pakistanopensource.org/projects/quran/files/issues-01.pdf and make suggestions on what I think is a good way to solve the situation. Everything I say here was pretty much mentioned in the emails you can find in the archives.
> >
> >1) Sequential Tanween: The best and most practical way that was formulated for this is to use two fathas consecutively (with no other Unicode characters in between) for sequential fathatan, two kasras consecutively for sequential kasratan and two dammas consecutively for sequential dammatan and design your font in such a way that it will substitute the corresponding glyph whenever these character sequences are encountered.
> >
> >2) Small Letters: You mention the problems in regards to the positioning of small alef, wow, and seen and whether they disconnect adjacent connecting characters. Unicode specifications may be unclear on this detail. My suggestion is to just design your font in a smart way such that it will do the necessary positioning based on the context of the small letter and that it won't disconnect adjacent characters since this is never the case in the contemporary Qur'an printings. This issue should not require any change to Unicode spec, but perhaps a request could be made to more clearly define the properties of small alef, small wow, and small seen. For instance the note "actually a vowel sign, despite the name" in the code chart for 0670 is misleading.
> >
> >3) Hamza: The good old hamza... Unfortunately in Unicode the chairless hamza 0621 is defined as a character that disconnects adjacent connecting characters. And we can no longer fix this situation by re-defining 0621 because it breaks backwards compatibility with Farsi and possibly some other languages that use hamza as a disconnecting character. As we know no hamza ever breaks the connection between adjacent connecting characters in the Quran. So character 0621 as it is defined today should not be used for encoding the Quran. So the solution is to propose a new chairless hamza character that is defined not to break the connection. Until such a character is added to Unicode just use 0621 for now as if it is the new proposed chairless hamza that does not break connections.
> >
> >4) Ligature lam alef wasla, lam hamza alef: This issue does not require any change being made to Unicode other than the change already proposed above in number 3. These are both font issues. The unique positioning of hamza in the lam hamza alef in Sura 2 verse 4 should be accomplished by smart font technology. Please take note that the codepoint used for the hamza in this word, bi-l-aakhirati, should be the new non-disconnecting chairless hamza codepoint proposed above, not 0654 hamza above. For now, you can use 0621 and mass replace later when the new character is added to Unicode.
> >
> >Kind regards,
> >Mete
> >
> >---------- Original Message ----------------------------------
> >From: Meor Ridzuan Meor Yahaya <meor dot ridzuan at gmail dot com>
> >Reply-To: Meor Ridzuan Meor Yahaya <meor dot ridzuan at gmail dot com>,Development Discussions <developer at arabeyes dot org>
> >Date:  Thu, 16 Jun 2005 11:30:32 +0800
> >
> >>First of all, I would like to announce that I've created new Quran
> >>Unicode data, complete with diacritic marks according to the Madinah
> >>Mushaf. The file has not been verified yet, so volunteers are
> >>welcome.
> >>
> >>Second, I remember that some time ago there was an initiative to
> >>submit a proposal to Unicode. What has happened to that initiative?
> >>I've compiled my own list of issues regarding encoding and displaying
> >>the Quran in Unicode. I would appreciate comments and feedback on the
> >>document from fellow Arabeyes members, especially those who are
> >>experts in Quranic rasm.
> >>
> >>The above said documents can be downloaded from
> >>http://www.pakistanopensource.org/projects/quran/ .
> >>
> >>Regards.
> >>
> >>
> >
> >--
> >Mete Kural
> >Touchtone Corporation
> >714-755-2810
> >--
> >
> >
> 
> --
> Mete Kural
> Touchtone Corporation
> 714-755-2810
> --
> 
> 