Re: Questions about yeh, hamzah on yeh, alef maksura and dotless ba



Meor Ridzuan Meor Yahaya wrote:
> Greg,
> If what you said about dotless yeh is true, then I think there is no
> need for 0649 to represent it. Even though most of us agree here that
> Unicode does have problems with Arabic, I think we should at least
> comply as much as possible with the description of the character
> itself. Let's forget about the naming for a while. 0649, even though
> described as dual-joining, if you look further in other documents, they
> do specify that the initial and medial forms of the character are not
> used in Arabic, but are used for some other languages (Uighur etc.).

Hi

Unicode is confused and inconsistent here.  There are several points.
First is that Unicode is supposed to encode script, not language.
Hence, they have no business saying which characters should/must be used
to encode text in a given language.  All Unicode says is, here is a
codepoint, here are its semantics (e.g. joining class, case, etc).
Correct spelling is entirely outside the scope of Unicode.  Second, it
may be that 0649 is used *alone* as (word-) initial or medial form in
some written languages.  It most definitely *is* used in initial and
medial form in Arabic, but only in combination with a hamza (or maybe
also e.g. small alif).  (Note btw that Unicode's ini/med/etc.
terminology is confusing.  The initial *form* of 0649 occurs (with hamza)
frequently in Arabic, but only in medial *context*, not as a
word-initial character.  Better to say "the post-joining form of 0649
occurs in medial context" or the like.  But this is all a matter for a
spell-checker, not a character encoding.)

In other words, when Unicode says things like "ini and med forms of 0649
are not used in Arabic" they are offering an incorrect specification of
a higher level protocol (i.e. orthography) that is completely outside
the scope of Unicode.

To put it another way, 0626 should be considered a composition of 0649
and hamza.
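
Note that Unicode's own normalization data does not see it this way: as
far as I can tell it ties 0626 to 064A (yeh), not to 0649.  A minimal
Python sketch to check this, using only the standard unicodedata module
(the helper name codepoints is my own, purely for illustration):

```python
import unicodedata

YEH_HAMZA = "\u0626"     # ARABIC LETTER YEH WITH HAMZA ABOVE
ALEF_MAKSURA = "\u0649"  # ARABIC LETTER ALEF MAKSURA
YEH = "\u064A"           # ARABIC LETTER YEH
HAMZA_ABOVE = "\u0654"   # ARABIC HAMZA ABOVE (combining mark)


def codepoints(s):
    """Render a string as space-separated hex code points (hypothetical helper)."""
    return " ".join(f"{ord(c):04X}" for c in s)


# Canonical decomposition of 0626: yeh + hamza above, not alef maksura + hamza.
decomposed = unicodedata.normalize("NFD", YEH_HAMZA)
print(codepoints(decomposed))  # 064A 0654

# Composition goes the same way: 064A + 0654 folds into 0626 ...
yeh_composed = unicodedata.normalize("NFC", YEH + HAMZA_ABOVE)
print(codepoints(yeh_composed))  # 0626

# ... but 0649 + 0654 has no precomposed form and is left as it is.
maksura_seq = ALEF_MAKSURA + HAMZA_ABOVE
maksura_composed = unicodedata.normalize("NFC", maksura_seq)
print(codepoints(maksura_composed))  # 0649 0654
```

So text encoded as 0649 + hamza above will not compare equal to text
encoded with 0626, even after normalization, unless the software adds its
own mapping.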

> I had the same impression as you did last time about 0649, but after
> a discussion with another person outside Arabeyes, I think he
> is right. 0649 is not an option.

I'm not sure I understand the argument against 0649.  Is there an
argument other than what you mentioned above, i.e. Unicode's claim that
it is not used in ini/med form in Arabic?

> However, I do disagree with his
> suggestion to use dotless ba for it, because I think it is not a ba.

I agree with you; but note that "dotless ba" is not a ba either; it has
no semantics, it only denotes a shape.  But since it does not denote the
final/isolated form of ya, I don't think it should be used.

> 
> Basically, you suggest encoding dotless yeh with hamza above/below
> as 0649 + hamza above/below, even though 0626 is OK. And for dotless
> yeh/ba initial/medial with small alef, encode 0649 + small alef, am I
> right? So, does anyone else have another opinion/suggestion on this?
> 
Yes, that is my preference.

> However, there is one more thing. You did not answer my first question
> regarding yeh final (64A). Seems like in the Madinah mushaf, all yeh
> final appear to be without the dots.

I was hoping somebody else would answer.  ;)  It looks to me like that
is true, but I haven't inspected the entire text.

> So, the question would be, how to
> differentiate between the two, 0649 and 064A, especially in
> the final form? Is there any difference between the two other than the
> appearance?
> 
> I think the idea from Mr Yousif was to use 064A throughout. The dots
> will appear depending on the context. In final forms, no dots. In
> medial/initial forms, if it comes with hamza above/below or small
> alef, then no dots; otherwise there should be dots. What do you think?

Mmmm, I would advise against that.  It would confuse the semantics.
Sometimes a final dotless ya means (dotted) ya, which is a first-class
letter in Arabic, and sometimes it means alef maksura, which is not.  So
I don't think you want to use the same codepoint for both.
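
Just so we are talking about the same thing, here is how I read the
proposed rule, as a minimal Python sketch.  The function names, and the
crude test for final position (the next base character is not an Arabic
letter), are my own assumptions, not anything defined by Unicode or by
Mr Yousif; a real shaper would use the joining classes.

```python
import unicodedata

YEH = "\u064A"
# Marks that, under the quoted proposal, suppress the dots on a non-final yeh:
# hamza above, hamza below, superscript (small) alef.
DOT_SUPPRESSORS = {"\u0654", "\u0655", "\u0670"}


def is_mark(ch):
    """True for nonspacing marks (harakat, combining hamza, etc.)."""
    return unicodedata.category(ch) == "Mn"


def is_arabic_letter(ch):
    """Crude stand-in for 'a letter that yeh can join to' (an assumption)."""
    return "\u0600" <= ch <= "\u06FF" and unicodedata.category(ch) == "Lo"


def yeh_has_dots(text, i):
    """Apply the quoted rule to the yeh at text[i]; True means draw the dots."""
    if text[i] != YEH:
        raise ValueError("text[i] is not U+064A")
    # Collect the combining marks attached to this yeh.
    j = i + 1
    marks = set()
    while j < len(text) and is_mark(text[j]):
        marks.add(text[j])
        j += 1
    # Final/isolated position: nothing joinable follows.
    if j >= len(text) or not is_arabic_letter(text[j]):
        return False
    # Initial/medial position: dots unless a suppressing mark is present.
    return not (marks & DOT_SUPPRESSORS)


print(yeh_has_dots("\u0641\u064A", 1))              # final yeh in "fi": False
print(yeh_has_dots("\u0628\u064A\u062A", 1))        # medial yeh: True
print(yeh_has_dots("\u0633\u064A\u0654\u0644", 1))  # medial yeh + hamza above: False
```

Note what the rule cannot do: it has no way to tell a final ya from an
alef maksura, since both are the same codepoint going in.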
> 
> Personally, I really think that we need a clear definition for at
> least alef maksura. 0649 should either exist with a good reason, or
> be deleted/not used. Unicode can leave the code there for
> compatibility reasons, but recommend against its use.

I'm ok with 0649 once they fixed it to be dual-join, with all four
forms, and with the caveat that it does not necessarily mean alef maksura.

In terms of written Arabic - regardless of how other written languages
may use it - 0649 is semantically distinct from the other characters.
It denotes only a graphical form, not a semantic category.  In other
words, it is purely graphotactic - it assists the reader in deciphering
text, but has no *direct* phonological or other significance.  When it
acts as the seat of hamza or small alif, it is purely graphical.  When
it acts as alef maksura, it has *indirect* phonological significance -
it means the preceding (usually implied) fatha should not be lengthened.
(The same is true of alif and of waw used as seats of hamza.)

My view is that 0649 models this sufficiently.

> 
> On a side note, Unicode also has Farsi yeh. At first, I thought it was
> strictly for the Persian language. But their document does mention
> Arabic, the language. The characteristic of Farsi yeh is that in
> initial/medial forms it appears with dots, otherwise with no dots. More
> like what appears in the Madinah Mushaf. However, I think it should be
> kept for the Persian language only.

I wasn't even aware of this.  :)  It's exactly what is needed, not only
for Quranic text, but for ordinary print - in Egypt especially it is
common for printed text to omit the two dots of a final ya, as in the
word "fi" (في = فى).  With this (misnamed) codepoint you get the desired
graphical representation while retaining the semantics of ya.  But you
don't want to use it everywhere - the dotless ya of ila (إلى) and the
dotless ya of fi (فى) are not the same semantically.
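
To make the distinction concrete, here is a small Python sketch of the
two spellings as I would encode them.  The choice of 06CC for fi and 0649
for ila is my suggestion from above, not a Unicode requirement, and the
variable names are just for illustration.

```python
import unicodedata

# "fi": feh + Farsi yeh. The final form is drawn without dots, but the
# character is still a ya.
FI = "\u0641\u06CC"
# "ila": alef with hamza below + lam + alef maksura.
ILA = "\u0625\u0644\u0649"

# The two words end in glyphs that look alike but are different characters.
for word in (FI, ILA):
    last = word[-1]
    print(f"U+{ord(last):04X} {unicodedata.name(last)}")
# U+06CC ARABIC LETTER FARSI YEH
# U+0649 ARABIC LETTER ALEF MAKSURA
```

A search or spell-checker can then tell the two apart even though the
printed shapes are the same.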

Hope that helps,

gregg