Re: Tanween variants and Unicode



Mete Kural wrote:
Hello Nadim,

I think I didn't communicate myself efficiently. I am not proposing
that we should use a <tanween+modifier> sequence for tanween with
small meem and assimilated tanween just to save the hassle of
proposing six extra new codepoints to Unicode (although it would
truly be quite a hassle to try to propose six new codepoints). It is
because using a <tanween+modifier> sequence preserves the text's
graphemic integrity better and results in a cleaner encoding. A
fathatan is a fathatan, regardless of whether its pronunciation
changes slightly. An assimilated fathatan or a fathatan with small
meem is still a fathatan, in fact it is just as much fathatan as any
other fathatan. For hundreds of years all of these fathatans were
written the same exact way. In more recent times scribes have decided
to write these two kinds of fathatans slightly differently to cue the
un-educated reciter to pronounce correctly. For that reason the
logical way to encode this is the <fathatan+modifier> sequence in
order to preserve the fathatan codepoint. Using a separate codepoint
will break this graphemic integrity.

Again, I respectfully but strongly disagree. What you've described is not graphemic integrity but morphological integrity.


"Fathatan" is not even a distinctly recognizable concept in Arabic, anyway. The mark to which this Unicode concept refers is not merely "two fathas", it is a single fatha, plus a tanween mark that takes the shape of a fatha in this context. Nor does it simply mean "fatha munawwana"; it means *distinctly enunciated nuun* of fatha munawwana. A mark of assimilated tanween - horizontally tiled vowel marks - does not merely mean fatha munawwana either, it means fatha munawwana in which the nuun of tanween is assimilated to the following consonant. This is no different from an accented vowel in French receiving a phonetic modification. Similarly for the other tanween variants - they all have distinct meanings. To put it another way, these are not morphological but phonological indicators, just like the other signs in the written language.

So IMHO it is quite misleading to state that e.g. "an assimilated fathatan or a fathatan with a small meem is still a fathatan". Phonologically and graphically this is clearly untrue (or half-true at best); morphologically it is true insofar as the underlying /n/ of tanween signals indefiniteness. But Unicode, correctly IMO, does not encode morphemes.

Note, btw, that the way the scribes do it *already* encodes tanween modification, explicitly. If they had wanted to indicate idgham by adding a distinct <idgham> mark to the distinct-tanween mark, they surely could have done so. But they didn't. Why don't we follow their lead? There is no need to add a modifier mark to an explicit tanween, e.g. <fatha><tanween><idgham> or the like. That is certainly one way to do the design, but I don't see a good reason to favor it over e.g. <fatha><tanween-mudgham>. The only justification I can think of is that the former design would allow reuse of <idgham> on other consonants. But that is easily handled with <absolute-idgham> or whatever one decides to call it.

The problem is simply that Unicode got it wrong by encoding the -atan codepoints as distinct units (at least for written Arabic). The solution is to point out that, while this may make sense for some languages that use the Arabic script, it makes no sense at all for the Arabic language. Therefore, additional codepoints should be adopted that allow for proper Arabic writing, namely the various <tanween> elements.
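
To make the "distinct units" point concrete, here is a minimal Python sketch using the standard unicodedata module. It only reports what the Unicode character database says about the three existing tanween codepoints; the names TANWEEN and describe are illustrative choices, not part of any proposal. Each mark is a single codepoint with no canonical decomposition, so nothing in the encoding separates the vowel from the tanween:

```python
import unicodedata

# The three tanween marks as Unicode currently encodes them: one codepoint each.
TANWEEN = {
    "fathatan": "\u064B",
    "dammatan": "\u064C",
    "kasratan": "\u064D",
}

def describe(ch):
    """Return (character name, canonical decomposition) for one codepoint."""
    return unicodedata.name(ch), unicodedata.decomposition(ch)

for label, ch in TANWEEN.items():
    name, decomp = describe(ch)
    # An empty decomposition string means the character is atomic.
    print(f"U+{ord(ch):04X} {name}: decomposition={decomp!r}")

# BEH carrying fathatan is therefore two codepoints, not <beh><fatha><tanween>.
sample = "\u0628\u064B"
names = [unicodedata.name(c) for c in sample]
print(names)
```

Under a design with separate <tanween> elements, the same text would instead be a three-element sequence such as <beh><fatha><tanween>; that sequence is hypothetical and has no codepoints assigned here.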

In Unicode Arabic there are several instances where certain codepoints break this kind of graphemic integrity. Some of these were

I guess I may not understand just what you mean by "graphemic integrity". FWIW, I don't believe it's accurate to say that Unicode encodes graphemes; if you run that past the Ken Whistlers and Mark Davises of the world I think they will dispute it.


added because that was the way it was in legacy Arabic codeblocks
that were prepared a long time ago by corporations that wanted to
localize their software into Arabic the cheapest and quickest way.
Not much scholarly advice was sought. Your argument is that we can
compromise from the graphemic integrity yet another time in order to
allow legacy font technologies to render these tanween variants. My
opinion is that it is better not to introduce yet another blunder
into Unicode Arabic in order to support the legacy.

This I don't follow at all.

Respectfully,

gregg