
Small Alef (was Re: Standalone Superscript Alef (Item 8))



On Friday 25 June 2004 20:55, Mete Kural wrote:
>
> After reading your response I can now clearly
> understand the cause of the communication gap between
> us. Your proposal does not take into account the
> concept of graphemes vs. allographs. 

No, I think you are still confusing Small Alef with Superscript Alef. I can
understand that, because the distinction is a confusing one, so I think it
would be best to consult some Qur'an scholars.

> For that reason while we are proposing that a single code point for
> superscript/dagger/small alef is appropriate for all
> instances of superscript/dagger/small alef because
> there is really only one superscript/dagger/small alef
> "grapheme", 

No. Superscript Alef is a vowel sign and is encoded as an NSM (nonspacing
mark) in Unicode. It is therefore ONLY suitable for attaching to base
characters, not for use as a standalone character, and it is also very
different from the Arabic alphabetical letter Alef.

Small Alef is essentially an Alef. It cannot be considered a vowel sign,
cannot be encoded as an NSM in Unicode, and cannot be attached to any base
character. It MUST be used as a standalone character to which other NSMs can
be attached.

The difference is obvious: these are TWO very different graphemes. One of them
is a vowel sign and the other is an Arabic letter. How can one ever consider
them the same grapheme?

> you are proposing two codes for 
> superscript/dagger/small alef because of there are two
> superscript/dagger/small alef "allographs". This is
> more of a philosophical problem between us in regards
> to encoding theory which is not easily solvable within
> one mailing list thread. 

It's not a philosophical problem (and cannot be) between any two persons.
I'm not telling you my opinion about something; I'm telling you facts and
rules of the Arabic language. This is really not something to argue about,
and if you must, then it still has to be discussed and approved by the
organizations responsible for standardizing the Arabic language.
It's very simple: Superscript Alef is a vowel sign and Small Alef is an Arabic
letter. I know it's confusing because of the name of the vowel sign, but that
is why the Unicode Standard carries, right below the name, the note
"a vowel sign, despite the name".
Whoever added this was wise enough to clear up the confusion.

> You also have some 
> justification for your proposal by saying that what
> you are proposing is consistent with the rest of the
> Unicode Arabic block. I agree with you here, yes it
> may be consistent with the rest of the Unicode Arabic
> block, but the Unicode Arabic block is not based on a
> purely graphemic encoding scheme either. "The fact
> that the code is bad is no excuse to make it worse."
>

Even if it were bad, adding a new character won't make it worse; it will
make it consistent.
"A complete Arabic block that is bad is MUCH BETTER than an incomplete
 Arabic block that is bad"

And even if it did make it worse, it would still be no worse than accepting:
ARABIC LETTER REH WITH HAMZA ABOVE
http://www.unicode.org/alloc/Pipeline.html (accepted in 2004-Feb-04)


> >   As you can notice there is no SMALL HIGH WAW
> > because the damma looks
> >   exactly like a SMALL HIGH WAW, so there is no need
> > for another character
> >   for SMALL HIGH WAW and instead damma is used.
> >   They share the same look  property and even
> > pronouncation but their name is
> >   different because one is used as a vowel, and the
> > other to denote a missing
> >   WAW.
>
> The above also points to a misunderstanding of
> graphemes vs. allographs. 

I think you have some confusion regarding graphemes and allographs.
"The fact that two graphemes have similar-looking glyphs doesn't mean that
 they are allographs"

As a test, let me ask you: why do you think they are allographs?
(I think the answer would be "Because their glyphs are similar", and that
 proves my point.)

You should understand that a vowel sign and an Arabic letter cannot in any way
be considered allographs.

For example:
 + A Jeem and a kasra cannot be considered allographs; they are distinct
    graphemes.
 + A Waw and a damma cannot be considered allographs; they are distinct
    graphemes.
 + An Alef and a "Superscript Alef - a vowel sign despite its name" cannot be
    considered allographs; they are distinct graphemes.
 + A "Small Alef - used to replace a missing Alef" and a "Superscript Alef -
    a vowel sign despite its name" cannot be considered allographs; they are
    distinct graphemes.

> If you are going to use two 
> seperate codepoints to encode superscript/dagger/small
> alef, one for its usage on top of alef maksura and
> another for its usage in words like haadha, dhaalika
> and bura'aa'u,

The one on top of the Alef Maksura is a vowel sign.
The other one is an Arabic letter. (Notice that you couldn't say "on top of"
here; that is because Arabic letters CANNOT be placed on top of other Arabic
letters.)

Any attempt to encode them using the same codepoint is not only illogical but
also a misspelling.

> I would tell you that you should use 
> the same codepoint in haadha, dhaalika and bura'aa'u.

I can't recognize the words "haadha" or "dhaalika". Please use Arabic letters,
or at least transliterate the words in a meaningful manner, so that I can
understand what you mean.

I don't want to move away from the main point I'm raising, but here is
another one for the record.
How can you use the same codepoint in these two words:
علىٰ
and
برءاؤا
(with the first Alef in the second word meaning the Small Alef)

Superscript Alef is a vowel sign and is defined in Unicode as an NSM that is
attached to base characters.
Thus, if you use the already existing codepoint here, it fails horribly:
برءٰؤا
To compare them:
برءاؤا
برءٰؤا
See, they are spelled differently: one has an Alef and the other has a vowel
sign on top of the hamza, which is completely wrong.
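
The difference between the two spellings can be made visible by listing their
code points. The following is a minimal sketch using Python's standard
unicodedata module. The escape sequences are an assumed transcription of the
two strings compared above, with the regular Alef (U+0627) standing in for the
Small Alef as in the example, and describe() is a hypothetical helper name.

```python
import unicodedata

# Assumed transcription of the two spellings compared above.
# U+0627 (regular Alef) stands in for the Small Alef, as in the example.
WITH_LETTER = "\u0628\u0631\u0621\u0627\u0624\u0627"
WITH_VOWEL_SIGN = "\u0628\u0631\u0621\u0670\u0624\u0627"

def describe(text):
    """Return (code point, general category) for each character."""
    return [(f"U+{ord(ch):04X}", unicodedata.category(ch)) for ch in text]

for label, word in (("letter", WITH_LETTER), ("vowel sign", WITH_VOWEL_SIGN)):
    print(label, describe(word))
```

The fourth entry is a letter (category Lo) in the first spelling and a
nonspacing mark (category Mn) in the second, so the two strings cannot compare
equal.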



> Do not use a different codepoint for bura'aa'u just
> because it appears lower than the one used in haadha
> and dhaalika. The difference between dhaalika and
> bura'aa'u is only at the allograph level, it is really 
> the same grapheme. Otherwise you make an existing
> problem even worse.
>

Ah you mean
ذالك
and
برءاؤا
You are confused here: both of them are Small Alef, not the vowel sign, and
there is no difference between them at all, even at the allograph level.
They are at the same Y-axis position; neither is lower than the other.


But I think you meant words like
علىٰ
and
برءاؤا

First, the expected rendering behavior:
One of them doesn't have to be lower than the other (actually, in some masahef
they are placed at the same Y-axis position). The vertical position depends on
the height of the base character the superscript alef sits on and on any
other symbols already on top of that base character.
But one of them MUST be on top of the previous character, and the other MUST
NOT be on top of the previous character and MUST have its own spacing.
Second, the expected meaning:
One of them is a vowel sign and the other is the Arabic letter Alef.

I think that this is more than sufficient justification for encoding them as
different characters, but I will give you an example where it's clear that
using the same code point for them is crude.

Assume that I'm developing a Qur'an application and want to implement a good
search algorithm for it. There is a problem: the user is expected to type the
word without the various Qur'anic symbols and even without harakat and vowel
signs.

There are two solutions:
 1. Add a separate text for searching which is encoded without Qur'anic
     symbols and vowel signs. (This is a bad solution)
 2. Use the concept of Normalization where you do various tasks including:
    a. strip vowel signs (This includes the removal of all "superscript alefs")
    b. add missing letters by replacing the small letters by regular ones
       (This includes replacing all "Small Alefs" by Alef)
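
A minimal sketch of solution (2), assuming the two characters have distinct
code points. U+0670 is the existing Superscript Alef; SMALL_ALEF below is a
HYPOTHETICAL private-use placeholder (U+E000) for the proposed letter, not a
real assignment. Treating "vowel signs" as every character of category Mn is
also a simplifying assumption, and the function names are illustrative only.

```python
import unicodedata

ALEF = "\u0627"
SUPERSCRIPT_ALEF = "\u0670"  # existing nonspacing mark (category Mn)
SMALL_ALEF = "\uE000"        # HYPOTHETICAL private-use stand-in, not a real code point

def strip_vowel_signs(text):
    """Step (a): remove every nonspacing mark, including Superscript Alef."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")

def restore_small_letters(text):
    """Step (b): replace the (hypothetical) Small Alef by a regular Alef."""
    return text.replace(SMALL_ALEF, ALEF)

def normalize(text):
    """Apply step (a), then step (b)."""
    return restore_small_letters(strip_vowel_signs(text))

# Ain, Lam, Alef Maksura, then the vowel sign
ALA = "\u0639\u0644\u0649" + SUPERSCRIPT_ALEF
# Beh, Reh, Hamza, Small Alef, Waw with Hamza above, Alef
BURAAU = "\u0628\u0631\u0621" + SMALL_ALEF + "\u0624\u0627"

for word in (ALA, BURAAU):
    print([f"U+{ord(ch):04X}" for ch in normalize(word)])
```

With distinct code points the two steps do not interfere with each other, so
the order in which they are applied does not matter.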

Let's assume we go with solution (2), which is the natural one.
Let's assume we apply the algorithm to the two words (which are
encoded using the same codepoint, as you suggest):
علىٰ  - برءٰؤا
Applying a,b in the order "a then b"
 Applying a, the words become على  - برءؤا
 Applying b, the words become على  - برءؤا
        (Notice that applying 'b' did nothing)
and the result normalized words are    على  - برءؤا
But the correct normalized words are    على  - برءاؤا
The word برءؤا is misspelled here as you see
and thus a search for the word برءاؤا fails miserably although it exists in
the Qur'an.

Let's say we will apply a,b in the order "b then a"
 Applying b, the words become علىا  - برءاؤا
 Applying a, the words become علىا  - برءاؤا
        (Notice that applying 'a' did nothing)
and the result normalized words are    علىا  - برءاؤا
But the correct normalized words are   على  - برءاؤا
The word على is misspelled here as you see
and thus a search for the word على using "Whole Words Matching" fails
miserably although it exists in the Qur'an as a whole word
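
The two orderings above can be replayed mechanically. This is only a sketch
under the assumption being argued against, namely that one code point (U+0670
here) serves both roles. The names SHARED, run, a_then_b and b_then_a are
illustrative, and step (a) is reduced to stripping just that one mark.

```python
SHARED = "\u0670"  # assumption: one code point for both the vowel sign and Small Alef
ALEF = "\u0627"

def strip_vowel_signs(text):
    """Step (a), reduced to removing the shared mark."""
    return text.replace(SHARED, "")

def restore_small_letters(text):
    """Step (b): turn every 'small alef' into a regular Alef."""
    return text.replace(SHARED, ALEF)

ALA = "\u0639\u0644\u0649" + SHARED                 # shared mark used as vowel sign
BURAAU = "\u0628\u0631\u0621" + SHARED + "\u0624\u0627"  # shared mark used as letter
EXPECTED = ("\u0639\u0644\u0649", "\u0628\u0631\u0621\u0627\u0624\u0627")

def run(steps):
    """Apply the steps in order to both words and return the results."""
    results = []
    for word in (ALA, BURAAU):
        for step in steps:
            word = step(word)
        results.append(word)
    return tuple(results)

a_then_b = run([strip_vowel_signs, restore_small_letters])
b_then_a = run([restore_small_letters, strip_vowel_signs])

for name, got in (("a then b", a_then_b), ("b then a", b_then_a)):
    print(name, [g == e for g, e in zip(got, EXPECTED)])
```

Each ordering gets exactly one of the two words right, because with a single
code point the steps cannot tell the vowel sign from the letter.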

I think that this should make it very clear why they should be encoded as
different characters, even if you think they are only allographs. (By the way,
they may have the same glyph in some masahef, but they are still different
characters and different graphemes.)

> So in conclusion I would like to tell you that we will
> not endorse a proposal to add a new additional
> superscript/dagger/small alef codepoint in a joint
> proposal. If you wish to propose this please do it in
> a seperate individual proposal.
>
> Kind Regards,
> Mete
>

Mete, I don't know why, but I'm getting the feeling that you are against it,
period. I think you agree that we want the best idea to make its way into the
proposal. Expressions like "we will not endorse a proposal to..." are really
not appropriate. All of this needs to be discussed, and the best idea should
win, rather than sticking to one idea and insisting on it without considering
other factors.

I would ask you to please at least read the point about Normalization of
the Qur'an text in this post carefully and comment on it.

-- 
Mohammed Yousif
Egypt