
Re: Questions about yeh, hamzah on yeh, alef maksura and dotless ba



Mohammed Yousif wrote:
> If you mean encoding the dots separately but _still_ maintaining one
> codepoint for every Arabic letter then I identify this as being one
> way to handle it although I would say that it is an overkill and can
> cause confusion.
> For example, a user might type a Qaf and One Dot and then think that
> he typed a Feh and use it as such, or might type a Heh and Two
> Dots and then think that he typed a Teh, which is not the case.

First: This is not what I mean. I mean a break down of qaf and feh into
components. Without dots there is no difference between the loop that
shapes qaf or feh in non-final position. This ambiguous character is what I
call an ARCHIGRAPHEME (analogous to ARCHIPHONEME in Prague School phonology
http://mails.fju.edu.tw/~phono/prague.htm).  When dot patterns are added,
the ambiguity is resolved. The sum of the components equals the conventional
character.
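
To make the decomposition concrete, here is a minimal Python sketch. The
labels LOOP and PLATE and the dot-pattern strings are hypothetical names
chosen for illustration only (they are not existing Unicode code points),
and the table ignores final forms, where qaf and feh do differ in shape.

```python
# Hypothetical decomposition table (illustration only, not a Unicode proposal):
# conventional letter -> (archigrapheme, dot pattern).
DECOMPOSE = {
    "\u0641": ("LOOP", "one above"),     # FEH
    "\u0642": ("LOOP", "two above"),     # QAF
    "\u0628": ("PLATE", "one below"),    # BEH
    "\u062A": ("PLATE", "two above"),    # TEH
    "\u062B": ("PLATE", "three above"),  # THEH
}
COMPOSE = {pair: letter for letter, pair in DECOMPOSE.items()}


def decompose(text):
    """Split each known letter into (archigrapheme, dots); pass others through."""
    return [DECOMPOSE.get(ch, (ch, None)) for ch in text]


def compose(pairs):
    """Rebuild the conventional string: the sum of the components."""
    return "".join(COMPOSE.get(pair, pair[0]) for pair in pairs)


def skeleton(text):
    """Drop the dot patterns and keep only the ambiguous archigraphemes."""
    return [arch for arch, _dots in decompose(text)]
```

In this sketch skeleton() returns the same result for qaf and feh, while
compose(decompose(s)) gives back the conventional string unchanged.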

> But if you mean encoding a Theh for example as Plate + Three Dots and
> encoding a Beh using the same codepoint Plate + One Dot, then I
> disagree completely.

Yet, this is exactly what I mean. It brings the encoding closer to how older
texts and ancient mus-hafs in particular are encoded. I once attested an
instance where the "plate" (or BEH archigrapheme) is marked with both two
stripes (=dots) above and three stripes (=dots) above. As a result, the
phrase /fa athaabahumu l-laahu/ could also be read as /fa ataahumu l-laahu/
(since only one archigrapheme was - doubly - marked with a total of five
stripes, two in red, three in black). Both readings turned out to be
attested in the /mucjam al-qira'aati l-qur'aaniyyä/ published in Kuwait in
1986. Without separate "plate" and "dot pattern" such observations could not
be encoded for accurate printing, exchanging between scholars, or searching.
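
As a rough sketch of what such an encoding buys you: if one archigrapheme
may carry more than one dot pattern, software can list the readings it
licenses. The data model and all names below are my own assumptions for
illustration, not an existing format.

```python
from itertools import product

# Hypothetical composition table: (archigrapheme, dot pattern) -> letter.
COMPOSE = {
    ("PLATE", "one below"): "\u0628",    # BEH
    ("PLATE", "two above"): "\u062A",    # TEH
    ("PLATE", "three above"): "\u062B",  # THEH
}


def readings(elements):
    """Return every conventional spelling licensed by a decomposed sequence.

    Each element is (archigrapheme, [dot patterns]); more than one pattern
    on the same archigrapheme means the manuscript marks alternative readings.
    """
    choices = [[COMPOSE[(arch, dots)] for dots in patterns]
               for arch, patterns in elements]
    return ["".join(combo) for combo in product(*choices)]


# One plate marked with both two and three strokes above.
doubly_marked = [("PLATE", ["two above", "three above"])]
print(readings(doubly_marked))
```

A conventional encoding would have to pick either TEH or THEH for that
plate and lose the other reading.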

BTW, nobody is expected to be typing all these details. It is meant for the
scholarly encoding of Qur'an and other manuscripts. What I propose should
happen on the level of digital text representation, as an alternative or
supplement to conventional encoding. Whenever possible, decomposed
characters should be treated as the equivalents of their precomposed
counterparts. In terms of user interface, I designed a simple conversion
mechanism from composed to decomposed encoding. From there, simple
backspacing already suffices to remove dots.

>>>> The word /stay'asuw/ in Q12:80 is rather a spanner in the works:
>>>> its existence implies that there can be no rule that the sequence
>>>> Yeh, Hamza can be trusted to be Yeh+Hamza_above/below.
>>>
>>> The well established Qur'an sciences can be employed to know if the
>>> hamza is above/below or standalone.  Namely, the Rasm science: it
>>> clearly disambiguates this type of situation and identifies the
>>> variations that can exist with other types of Masahef "Maghribi
>>> Mushaf...etc".
>>
>> If it's not a simple straightforward rule, it cannot be expected to
>> be built into a font. So our earlier idea of assuming that the
>> string (any) YEH followed by hamza could be substituted by a single
>> - ligature! - YEH+HAMZA (above or below according to Qur'anic rules
>> and locale) turns out to be false.
>>
>
> Now I understand what you mean. You want to be able to use one Hamza
> codepoint for both the standalone Hamza and the HamzaAbove/Below mark.

Well, no. When this automatic combination was first suggested in this list,
I considered it an intriguing idea. On second thought, I concluded that it
is a non-starter and sent my example /stay'asuw/ above. However, if such
an amphibious hamza (see below) were encoded as U+0621, a combining
mechanism could still work with (farsi/maqsura) YEH and U+0654 HAMZA ABOVE.
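
To illustrate why a blanket substitution is a non-starter, here is a small
Python sketch. The function naive_combine is hypothetical and stands in for
the kind of rule a font might apply; the test word is the unvocalised
sequence Sheen, Yeh, Hamza rather than the Qur'anic example, to keep it
short.

```python
YEH = "\u064A"        # ARABIC LETTER YEH
HAMZA = "\u0621"      # ARABIC LETTER HAMZA (standalone)
YEH_HAMZA = "\u0626"  # ARABIC LETTER YEH WITH HAMZA ABOVE


def naive_combine(text):
    """Hypothetical font-style rule: always fuse YEH + HAMZA into one letter."""
    return text.replace(YEH + HAMZA, YEH_HAMZA)


shay = "\u0634\u064A\u0621"  # Sheen, Yeh, Hamza: three separate letters
fused = naive_combine(shay)
print(len(shay), len(fused))
```

The rule silently turns a three-letter word into a two-letter one, which is
exactly the kind of damage a context-free ligature would do.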

> Well, not only Hamza possesses  this behavior but also more Arabic
> letters.
>
> To give an example, the Hamza situation here is exactly like the Seen
> situation.
>
> Just to be clear:
>  Hamza U+0621                HamzaAbove U+0654
>  Seen    U+0633                SeenAbove    U+06DC

I agree - I never contested that.

> Hamza can come standalone and Seen can come standalone, in that case
> the effect is adding one more letter to the word (Hamza or Seen) and
> the Hamza or Seen becomes a part of the spelling of the word.
> The word باء (Beh,Alef,Hamza) for example is three letters long, the
> Hamza is counted because it is a separate letter here that doesn't
> affect the different letter Alef.
> And the word شيء (Sheen,Yeh,Hamza). Three letters, the Hamza is
> a separate letter from the Yeh and has no effect on the Yeh. And the
> word is spelled using the three letters' names (Sheen,Yeh,Hamza).

Absolutely right. We must maintain standalone U+0621 and superscript hamza
U+0654.
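
For anyone who wants to check the letter/mark distinction against the
Unicode character database, Python's standard unicodedata module is enough.
The helper names describe and letter_count are mine; counting only general
category L* as "letters" is a simplifying assumption.

```python
import unicodedata


def describe(ch):
    """Return (code point, name, general category, combining class)."""
    return (f"U+{ord(ch):04X}", unicodedata.name(ch),
            unicodedata.category(ch), unicodedata.combining(ch))


def letter_count(word):
    """Count only base letters (general category L*), skipping marks."""
    return sum(unicodedata.category(ch).startswith("L") for ch in word)


# Standalone Hamza, Hamza Above, Seen, Small High Seen.
for ch in "\u0621\u0654\u0633\u06DC":
    print(describe(ch))

print(letter_count("\u0634\u064A\u0621"))  # Sheen, Yeh, standalone Hamza
print(letter_count("\u0649\u0654"))        # Alef Maksura + Hamza Above
```

The standalone characters come out as letters (Lo) and the superscript ones
as nonspacing marks (Mn), so only the former add to the letter count.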

> Also, the word مسيطرون (Meem,Seen,Yeh,Tah,Reh,Waw,Noon) is seven
> letters.
> The Seen here is counted because it's a separate letter. It doesn't
> affect any other letter in the word.
>
> But Hamza and Seen can also come "attached" to other letters acting
> as a mark. In that case the effect isn't adding one more letter to
> the word but only affecting the way one might think about the letter
> which they are attached to. In this case (let me call them HamzaAbove
> and SeenAbove), their meaning can be thought of as an alert to the
> reader that "Beware! the letter to which HamzaAbove/SeenAbove is
> attached wasn't really that letter, it was a Hamza/Seen that has been
> replaced and you should pronounce it using the Hamza/Seen sound, not
> the sound of the underlying letter".

In other words, they count as superscript corrections. This is well-known by
Arabic linguists. From here I am deleting the additional examples - you made
your point.

> Falls under the same domain are more letters:
>  - Yeh/SmallYeh and YehAbove (YehAbove can be seen above Alef in Warsh
>     Mushaf, Maghribi style. Here the Alef acts as a seat for the Yeh).
>  - Alef/SmallAlef and AlefAbove (AlefAbove can be seen above Yeh,Waw).
>
>
> Anyway, I remember you proposed a workaround for using one codepoint
> for both SmallAlef and SmallAlefAbove. You can also use the same
> workaround for using one codepoint for both Hamza and
> HamzaAbove/Below.

For small alef - yes, but not for hamza: I believe we are agreed on hamza.
You simply misunderstood my comment about /stay'asuw/.

> The workaround depends on the fact that modern Masahef are fully
> marked. The idea is that if Alef/SmallAlef comes after a letter, that
> letter is certainly marked with a haraka or something. But if
> SmallAlefAbove comes after a letter,  there will be no marks between
> the SmallAlefAbove and that letter because the SmallAlefAbove is
> "attached" to that letter which acts as a seat and of course the
> marks for that letter come after the SmallAlefAbove mark.

This is not a workaround, but efficient and accurate use of existing Unicode
points. In my analysis, there is only one small alef. It is attached to the
previous rasm element. If a fatha is placed between the rasm element and the
small alef, an offset to the left occurs, if necessary forcing its own
horizontal spacing. The moment the fatha is removed, small alef retakes its
normal position. In this way the Osmanli and modern Arabic mus-hafs can be
encoded with maximal compatibility. For instance:

Cairo mus-haf       هَـٰذَا
Osmanli mus-haf:     هٰذَا

Incidentally, the amphibious behaviour of small alef can already be found in
Maghribi mus-hafs. But spacing of small alef between letters or placing them
on tatweels appears for the first time in the typeset Cairo Mushaf. Since
then it occurs exclusively in fully vowelled Qur'an texts such as the Medina
editions.
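
A minimal Python sketch of how the Cairo and Osmanli spellings could be
treated as equivalent under this analysis. The code point sequences are my
assumption of how the two spellings are typically stored, and
fold_small_alef is a hypothetical helper, not an existing normalization.

```python
import re

FATHA = "\u064E"       # ARABIC FATHA
TATWEEL = "\u0640"     # ARABIC TATWEEL
SMALL_ALEF = "\u0670"  # ARABIC LETTER SUPERSCRIPT ALEF

# Assumed sequences: heh fatha tatweel small-alef thal fatha alef (Cairo)
# and heh small-alef thal fatha alef (Osmanli).
CAIRO = "\u0647\u064E\u0640\u0670\u0630\u064E\u0627"
OSMANLI = "\u0647\u0670\u0630\u064E\u0627"


def fold_small_alef(text):
    """Drop any fatha/tatweel run that directly precedes a small alef."""
    return re.sub(f"[{FATHA}{TATWEEL}]+(?={SMALL_ALEF})", "", text)


print(fold_small_alef(CAIRO) == OSMANLI)
```

Folding is only meant for comparison and searching; the stored text keeps
whichever spelling the edition uses.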

> However, I highly reject this type of workaround because:
>  - they make no distinction between the concept of a Standalone
>     letter that has nothing to do with the other letters in the word
>     and the concept of a combining mark associated with another
>     letter.
>  - They depend on the text being _accurately_ fully marked
>     which is not the case in most existing texts. And as a
>     consequence, it would be impossible for the reader or software to
>     know the meaning of the given character (wouldn't be able to tell
>     the difference between a letter and a vowel mark) and as such
>     would make searching and other text processing tasks very hard
>     and inaccurate.

This is where everybody experiences the worst problems when trying to encode
Arabic with Unicode.

IMHO there is a category missing between STANDALONE letter and COMBINING
mark. What's missing is the category of Arabic AMPHIBIOUS characters.
Amphi ("between") bious ("two") should be taken in the literal sense: hamza,
small yeh, small waw, and possibly a few more miniatures, follow
discontinuous letters (reh, waw, etc) and final letters on the base line with
their own spacing, but between two connected letters (with or without
tatweel), they become "amphibious": they are positioned between the
surrounding connecting letters (with lam-alef as an extreme case!), not
above them, carrying their own vowel or madda when necessary.
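
The positioning rule can be sketched in a few lines of Python. The list of
non-connecting letters is partial, the labels "baseline" and "between" are
hypothetical, and treating a word-initial miniature as "baseline" is my own
assumption, since that case is not discussed here.

```python
# Letters that do not connect to the following letter (partial, assumed list):
# alef, dal, thal, reh, zain, waw.
NON_CONNECTING = set("\u0627\u062F\u0630\u0631\u0632\u0648")


def miniature_position(prev_letter, next_letter):
    """Classify where a miniature such as hamza sits.

    "baseline": spaced on the base line, after a discontinuous letter
                or at the end of a word.
    "between":  placed between two connected letters.
    """
    if prev_letter is None or next_letter is None:
        return "baseline"  # word edge (the initial case is an assumption)
    if prev_letter in NON_CONNECTING:
        return "baseline"  # nothing to sit between: previous letter breaks
    return "between"       # amphibious placement between connected letters


print(miniature_position("\u0631", "\u0633"))  # after reh
print(miniature_position("\u064A", "\u0633"))  # between yeh and seen
```

A real implementation would take the joining behaviour from the Unicode
data files rather than from a hand-written set.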

Your proposed independent small alef could be encoded separately as such an
amphibious Arabic character. So far I believe you would agree. Where we
differ is that I claim that this particular behaviour of small alef is -
without exception - triggered by a preceding fatha, as I described above. So
I propose to add this amphibious behaviour to the existing code point for
small alef instead, in order to maintain full compatibility with modern
conventional spelling (and Osmanli spelling of Arabic).

>> I admire Meor's efficiency in creating a first workable Qur'an using
>> Unicode and OpenType components. But there are still a couple of
>> open ends that are not his fault, but that are the consequence of
>> font technological limitations.
>>
>
> I have given up using an OpenType font in a custom Qur'an application.
> This is because I'm not forced to use OpenType fonts or any other
> font for that matter, since in a custom app I have control over how
> text is being drawn.

I couldn't agree more - I am doing exactly the same. But I hope to feed back
my experience to the Unicode consortium - whether they like it or not :-)
After all, Unicode is the only way towards interchangeability and
searchability on the internet.

Let's keep on pioneering!

Best regards,

t