Re: Arabic Unicode fonts

To: <general at arabeyes dot org>
Subject: Re: Arabic Unicode fonts
From: "David Starner" <dstarner98 at aasaa dot ofe dot org>
Date: Tue, 14 Aug 2001 06:26:08 +0100
From: "Nadim Shaikli" <shaikli at yahoo dot com>
>  1. All Arabic fonts _must_ include forms-B (or equivalent) for them to
>     be properly called Arabic fonts (ie. ISO8859-6) since without those

No can do. An ISO8859-6 font has a fixed set of 190 glyphs, 94 ASCII and 96
Arabic. It can't include Forms-B. If you want Forms-B, you're going to have
to use a Unicode font, not an ISO8859-6 font.

>     extensions (again, they should not be optional) my shaping library
>     or application won't find those glyphs.  Moreover, those glyphs have
>     to be standardized so that the same library (or code) could be
>     utilized in many applications (it needs to look for the same encoding
>     irrespective of the application).  I tend to think of shaping as a
>     library - develop once, use everywhere.

True - for most Arabic fonts in currently used Unix font formats,
appropriate characters in the Forms-B blocks should be included. For an
OpenType system, Freetype and the font will do the shaping for you, and
whatever ligating the font supports.

>  2. It would have been much nicer to treat all the Arabic glyphs as
>     characters

I've appended the response from unicode at unicode dot org below.

> > Any font format that wants to have full Unicode support is going to have
> > to have a table of glyphs without corresponding character codes.
>
> I didn't see any mention of this table anywhere on unicode ?  Plus
> shouldn't that table (if it doesn't have corresponding character codes)
> be formalized somehow so that a developed library would work across
> all fonts/tables ?

It's a system/font-type specific thing. An OpenType font system gets handed
characters and it finds the appropriate glyphs.

> > Zero width spaces, zero width non-joiner and zero width joiner
> > characters should let you decide how you want it to look; it's going
> > to take some work to either get everyone familiar with how they work
> > (I believe Roozbeh said that ZWJ and ZWNJ are standard on Persian
> > keyboards) or get a nice UI to hide the ugly details.
>
> I have a feeling the document size will be 2x-4x the size with all these
> control and hint characters - but yah I know disk space is cheap... :-)

Going from ISO-8859-6 to Unicode is going to double the space (you could use
SCSU (http://www.unicode.org/unicode/reports/tr6) to stop that, but no one
actually uses SCSU in practice.) From what I'm told, ZWJ and ZWNJ are used
very rarely in Arabic text.
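
To put a number on the doubling, here is a minimal sketch, assuming Python 3 and its built-in "iso8859_6" codec. The sample word is only an illustration; any text made up of letters that exist in ISO-8859-6 behaves the same way.

```python
# Compare the stored size of one Arabic word in ISO-8859-6 and in Unicode.
word = "\u0633\u0644\u0627\u0645"  # SEEN, LAM, ALEF, MEEM

legacy = word.encode("iso8859_6")  # one byte per Arabic letter
utf8 = word.encode("utf-8")        # two bytes per letter in 0600-06FF
utf16 = word.encode("utf-16-be")   # two bytes per letter, no byte order mark

print(len(legacy), len(utf8), len(utf16))  # prints: 4 8 8
```

ASCII characters (spaces, digits, markup) stay at one byte in UTF-8, so a mixed document grows by somewhat less than a factor of two.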

> Unfortunately, I'm still unconvinced - but don't
> think there is much I can do about that.  You see I also keep thinking
> of memory usage (visual order - save order - bidi display, etc) and
> having to keep track of all these contexts.

The solution for most people is to just turn it over to a rendering engine
like Pango or whatever Qt will use and not worry about it.

> > OpenType is a new font format designed by Microsoft and Adobe to be
> > the "ultimate" font format. One of OpenType's capabilities is to
> > convert characters internally to glyphs in a font-dependent way.
>
> Is there an OpenSource equivalent to OpenType being worked on (I'm
> guessing since micro$oft is involved it'll be proprietary, no) ?

OpenType may be under the control of Microsoft and Adobe, but the
specifications are publicly available from them. The big problem is creating
the fonts under Un*x, but that's not exclusive to OpenType - there is no
Free, working scalable font editor for Un*x that I've seen.

From Philipp Reichmuth:

Hi David,

sorry for the long reply, it's a bit huge, but i hope it helps :-) and
BTW by "you" I'm mainly addressing your contact person :-)

>> David Starner wrote:
>>> Arabic Presentation Form A and B shouldn't be used in files; use
>>> characters in the 0600-06FF block and the application should take
>>> the responsibility for using glyphs from Presentation Forms A & B
>>> if necessary.

>>Well, it _always_ will be necessary and that's my point (it's not even
>>almost always, it's "always" :-).  0600-06FF presents a flavor of the
>>entire Arabic alphabet (each letter is represented in _a_ particular
>>form - initial, medial, final and isolated), it also includes all the
>>Arabic numbers and punctuation, but 0600-06FF, by all means, is not
>>complete since it doesn't include all the various character permutations
>>(forms).

Unicode is, however, not concerned primarily with what characters look
like. Unicode does not encode glyphs or visual appearances, it encodes
characters.

The Arabic script has a very high degree of variation in appearance of
the individual letters. This is also highly dependent on the style or
script in which the particular text is written; Nasta'liq, Naskhi,
Shekaste and Hijazi script styles, for example, have wholly different
ligature sets which are by no means completely covered in Unicode.
However, they don't _need_ to be; if, say, Shekaste has a ligature for
the letters XYZ, it is completely the rendering system's (i.e. mainly
the font's) issue to display this in pretty Shekaste form. Unicode
encodes the underlying characters, that is, XYZ.

>>Why do it this way :-D ?  Is there some hidden advantage that I'm not
>>thinking of (beside saving font space) ?

Yes:

- Searching and comparison are easier because if you want to search or
compare the letters XYZ, that way you only have to look for the
letters, possibly sans vowelization symbols. If one uses presentation
forms for encoding, you have to search/compare X + YZ, XY + Z, X + Y +
Z and XYZ separately which is a real pain, extremely complicated to
implement and prone to errors of all sorts.

- Vowelization of ligatures is impossible. On a three-letter
combination, in theory, there can be vowelization signs, recitation signs
and all sorts of diacriticals on each of the three letters. It is
technically impossible to place different vowels on different
consonants in a Unicode Arabic presentation form, however. If you use,
say, OpenType, it does the vowel placement for you even
in ligatures.

- If you want to use a script style that does not contain all of (or
contains more than) the ligatures from Unicode Presentation Forms, you have
a problem. The ligatures from Unicode are based on Naskhi. If you want
to write a text in Naskhi and later reformat it in Shekaste where the
ligatures are completely different, the program has to go to real
pains, replacing the characters everywhere _in the entire file_. This
process is much more complicated than having the system-built-in
rendering engine render a paragraph.

- What is described as "reverting to the visual re-mapping every time
this file is opened" is not something the programmer has to care
about. This is done by the operating system and by the font. The speed
and memory cost is next to irrelevant on modern computers, and Unicode
is not supported on older platforms anyway.
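
The searching point can be made concrete with a minimal sketch, assuming Python 3 and its standard unicodedata module. The helper name fold() is made up for this illustration. It relies on the fact that the Presentation Forms characters carry compatibility decompositions back to the plain letters, so NFKC normalization undoes them.

```python
import unicodedata

def fold(text):
    """Map Arabic presentation forms back to plain 0600-06FF letters."""
    return unicodedata.normalize("NFKC", text)

SEEN, LAM, ALEF, MEEM = "\u0633", "\u0644", "\u0627", "\u0645"

# The same word stored two ways.
stored_plain = SEEN + LAM + ALEF + MEEM
# U+FEB3 SEEN initial form, U+FEFC LAM-ALEF ligature final form,
# U+FEE1 MEEM isolated form.
stored_forms = "\uFEB3\uFEFC\uFEE1"

needle = LAM + ALEF
print(needle in stored_plain)        # True
print(needle in stored_forms)        # False: the letters hide in a ligature
print(needle in fold(stored_forms))  # True once the forms are folded away
```

A search tool that has to cope with presentation forms in stored text needs a folding step like this on every comparison; text stored as plain letters needs none.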

>>(currently all that visual conversion would be lost, right ?)

No. They're not needed at all for _storage_, they are needed for
_display_. The next time you open it, they're back again.

>>and is stored on disk
>>using only 0600-06FF encodings.  Why not preserve all these conversions so
>>that if someone wanted to read my 15MB :-) file they wouldn't have to wait
>>for any more conversions to take place (it's a waste of time and processor
>>throughput) ?

No. Conversion is probably done on a per-paragraph basis by the
rendering engine which does, in practice, not take that long. If you
want a comparison, try opening a Word document with Arabic text in
Simplified Arabic and in DecoType Naskh fonts, you're not going to
notice the difference at all, probably, regardless of all the nice
output done in DecoType Naskh.

The document, in fact, will probably not even be much larger. For
example, the maximum ligature length is three (not counting, for
example, the ALLAH ligature). It's improbable, however, that your
document that way shrinks by a factor of three, since not every
character is in a ligature, and a word of, say, four letters still has
to consist of at least two ligatures. So let's agree that you can
shrink your data by a factor of two if the text does not contain
vowels. Now, all the Arabic ligatures are in the FBxx-FExx range
(Presentation Forms A and B), which means that each one takes three
bytes in UTF-8, which is what most applications use; as opposed to the
Arabic characters, which get encoded to two bytes each and are
comparatively short. So the size of your
document is about the same, and you lose sorting, searching,
comparison, and freedom of font choice. Not really an advantage, I'd
say.
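
The byte arithmetic can be checked with a minimal sketch, assuming Python 3. The helper utf8_sizes() is a made-up name; it compares the UTF-8 size of one presentation-form character with the size of the plain letters it stands for.

```python
import unicodedata

def utf8_sizes(form):
    """Return (bytes for the presentation form, bytes for its plain letters)."""
    letters = unicodedata.normalize("NFKC", form)
    return len(form.encode("utf-8")), len(letters.encode("utf-8"))

print(utf8_sizes("\uFEB3"))  # SEEN initial form:   (3, 2)
print(utf8_sizes("\uFEFB"))  # LAM-ALEF ligature:   (3, 4)
print(utf8_sizes("\uFDF2"))  # ALLAH ligature:      (3, 8)
```

A single shaped letter costs one byte more than the plain letter, a two-letter ligature saves one byte, and only the long ligatures save much, so over a whole document the two roughly cancel out.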

>>You see what I'm saying ?  With that in mind, I was thinking
>>that Form-B is an integral part of any unicode "Arabic" font since it
>>needs to be known (and used) by everyone (well, the converter has to
>>get these glyphs from somewhere, right ?).

Yes. However, a font in a fairly modern format like OpenType has more glyphs than
it supports characters anyway. And it does not need to be used by
everyone, that's just like forcing everyone to write their texts in
Naskhi. Take, for example, NOON + KHA + MEEM in Naskhi and Nasta'liq:
pretty different. :-)

>>It just seems odd to go this way - it's certainly cleaner to include all
>>the characters and their various permutations and give the user the
>>ability to decide what he wants to type and how he wants it to look;

That's done in OpenType anyway: you can choose if you want to use a
ligature. However, if you just support Unicode presentation forms, the
user does not get any option beyond Unicode presentation forms at all,
which is quite a limitation in the processing of, say, Persian poetry.

>>ensuring that what he
>>typed would be saved in exact-mode (what-you-see-is-what-you-store --
>>WYSIWYS :-)

Except that this is not what Unicode is about: Unicode is about
what-you-store-is-what-you-mean. What you see is the font's business.
But I'm repeating myself :-)

>>Granted that the application would still have to do this conversion (or
>>shaping), but it's only done once -- upon creation.  Moreover, this
>>conversion library would be universal given universal fonts and encodings
>>(no optional anything).

This optional anything is what the freedom of choosing a style for
your document is about. "Universal" Arabic fonts are impossible: the
variation of the script is so vast that it is simply impossible to
include all the varieties, ligatures, letter presentation forms and so
on in a single font. In fact, the optional anything is quite
necessary; in Unicode, you have, for example, a ligature for ALLAH,
but none for LI-LLAH, and since you want the name of God to look the
same way, you'll either have to write God as ALIPH+LAM+LAM+HA (which
does not look nice if you don't have optional ligatures in your font)
or use an extra ligature for LI-LLAH; i.e. if you want pretty output,
you need optional, non-Unicode ligatures one way or the other.

Say, you have the Qur'an in computer-readable form and you want to
build a computer-generated index on it, like "all words derived from
the root JEEM-YAH-HAMZA". If you store all the presentation forms, you
can simply forget it because the extraction engine will have to know
all about which ligatures contain which letters in which order. If you
just store letter by letter, you can just look for the letters. Pretty
display is done by the font and by the rendering engine, the user does
not have to care about it (but he can, if he uses OpenType [for
example], still control the output): he gets pretty display, and the
software works much more easily.
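
A very naive sketch of such a letter-by-letter lookup, assuming Python 3. The helper names, the three sample words and the root K-T-B are chosen only for illustration. Checking that the root letters occur in order is a crude stand-in for real morphological analysis and will produce false hits on a real corpus.

```python
import unicodedata

def strip_marks(word):
    """Drop vowel signs and other combining marks, keeping the letters."""
    return "".join(ch for ch in word if unicodedata.combining(ch) == 0)

def contains_root(word, root):
    """True if the root letters occur in order among the word's letters."""
    letters = iter(strip_marks(word))
    return all(r in letters for r in root)

KITAB = "\u0643\u0650\u062A\u064E\u0627\u0628"               # vowelled
MAKTUB = "\u0645\u064E\u0643\u0652\u062A\u064F\u0648\u0628"  # vowelled
QALAM = "\u0642\u064E\u0644\u064E\u0645"                     # vowelled
ROOT_KTB = "\u0643\u062A\u0628"                              # KAF, TEH, BEH

hits = [w for w in (KITAB, MAKTUB, QALAM) if contains_root(w, ROOT_KTB)]
print(len(hits))  # prints: 2 (the first two words)
```

None of this needs to know which ligatures a font happens to draw, which is the point: the stored text is just letters and marks.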

Hope that helps :-)
 Philipp                            mailto:uzsv2k at uni-bonn dot de

From Marco Cimarosti:

Peter Constable wrote:
>Philipp Reichmuth wrote:
>>David Starner wrote:
>>>ensuring that what he
>>>typed would be saved in exact-mode (what-you-see-is-what-you-store --
>>>WYSIWYS :-)
>>
>>Except that this is not what Unicode is about: Unicode is about
>>what-you-store-is-what-you-mean.
>
>I agree (to this and pretty much everything else in Philipp's response).
>If you want what-you-store-is-what-you-see, use PDF.

I too agree with Philipp, but I must note that he mostly explained why it is
not wise to encode Arabic *ligatures*.

But I think that David's question was more about encoding the contextual
form of *single* Arabic letters. After all, it is easy to see Arabic
contextual forms as a thing very similar to European case variants.

So a devil's advocate may ask: if the Arabic shaping forms of Kaaf have been
unified in the same code point, then why haven't Latin uppercase and
lowercase K been unified as well? And, conversely, if Latin case variants
have been assigned to different code points, why not Arabic shape variants?

The pros and cons of the two problems are relatively similar: disunifying
Latin case variants makes search and sort slightly more complicated;
unifying them makes search and sort simpler but complicates the display
process, and requires the introduction of "zero width uppercase" and "zero
width lowercase" controls.

Similarly: disunifying Arabic shape variants makes search and sort slightly
more complicated; unifying them makes search and sort simpler but
complicates the display process, and requires the introduction of a "zero
width joiner" and "zero width non joiner" controls.

Of course, I think I know the short answer: both the Latin and Arabic part
of Unicode descend from ISO-8859, a pre-existing standard, which encoded the
two scripts this way.

However, this may be an unsatisfying answer, especially outside of
standardization circles, so someone may come up with more philosophical
answers.

Now I go back to acting as an angel's advocate, and try to give two possible
justifications for Unicode:

1) While the difference between upper and lower case is very clear, how to
count Arabic shape variants is not as clear. Traditionally, "dual linking"
letters are considered to have four shapes (initial, medial, final,
isolate), while "right linking" letters have two (final, isolate). However,
there is another way of counting which ignores the tiny differences (on the
right side of letter) that differentiate initial from medial and final from
isolate forms. With this method, most "dual linking" letters have two shapes
(non-final, final) and "right linking" have a single shape. Which of
these systems should be the basis of a hypothetical shape encoding?

2) In the majority of cases, the choice of Arabic shapes is determined by
simple language-independent rules based only on the two neighboring
characters. The rules are simple enough to be incorporated in a software
component to handle rendering. The exceptions to these rules are rare enough
that it makes sense to handle them with an escape mechanism (the ZWJ and ZWNJ
controls). On the other hand, choosing whether to use a capital or a small
letter derives from complicated grammatical rules. These rules change
considerably from language to language, and are also influenced by stylistic
choices. In practice, the only capitalization rule that can be automated is
the capital letter at the beginning of a sentence. However, it is not so
easy to automatically determine the beginning of sentences!
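
To show how small the neighbor-based rule in point 2 is, here is a minimal sketch, assuming Python 3. The two joining tables are deliberately tiny and hypothetical (a real shaper takes joining types from the Unicode data files), and vowel marks, which a real shaper skips over, are not handled.

```python
ZWNJ, ZWJ = "\u200C", "\u200D"

# Hypothetical, incomplete tables, just enough for the examples below.
DUAL = set("\u0628\u0633\u0643\u0644\u0645\u0646\u0647\u064A")  # beh, seen, ...
RIGHT = set("\u0627\u062F\u0631\u0648")                         # alef, dal, reh, waw

def joining_type(ch):
    if ch in DUAL:
        return "D"   # joins on both sides
    if ch in RIGHT:
        return "R"   # joins only to the preceding letter
    if ch == ZWJ:
        return "C"   # forces a join
    return "U"       # ZWNJ, spaces and anything unlisted: no join

def shape_forms(text):
    """Return the contextual form name for each Arabic letter in text."""
    forms = []
    for i, ch in enumerate(text):
        kind = joining_type(ch)
        if kind not in ("D", "R"):
            continue
        prev_kind = joining_type(text[i - 1]) if i > 0 else "U"
        next_kind = joining_type(text[i + 1]) if i + 1 < len(text) else "U"
        joins_prev = prev_kind in ("D", "C")
        joins_next = kind == "D" and next_kind in ("D", "R", "C")
        if joins_prev and joins_next:
            forms.append("medial")
        elif joins_prev:
            forms.append("final")
        elif joins_next:
            forms.append("initial")
        else:
            forms.append("isolated")
    return forms

print(shape_forms("\u0633\u0644\u0627\u0645"))  # initial, medial, final, isolated
print(shape_forms("\u0628\u0628"))              # initial, final
print(shape_forms("\u0628" + ZWNJ + "\u0628"))  # isolated, isolated
```

The last line is the escape mechanism at work: the stored letters are unchanged, and a single ZWNJ between them overrides the default joining.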

_ Marco

From: Philipp Reichmuth

Oh, there's quite a bit of a functional difference between Arabic shapes
and Latin cases. Arabic shapes operate mainly on the script level;
they are different appearances of the same letter within the flow of
script. There are some exceptions from this rule, such as the use of
the final form of the letter HEH to denote Hijri dates or sometimes
the use of joined/non-joined variants to denote either parts from the
middle of a word or composite words in non-Arabic languages, but these
are comparatively rare and, in most cases, tied to the simple factor
of whether or not a word border is present at the respective point.

Latin cases, on the other hand, operate on the language level. In
German, nouns get capitalized to distinguish them from verbs; in most
languages written in Latin script, proper names get capitalized to
distinguish them as such; in some writing systems (e.g. Tibetan
transcription) capital letters get employed in the middle of words to
denote specific phonetic features and so on. All these features denote
quite language-inherent properties.

 Philipp                            mailto:uzsv2k at uni-bonn dot de

From: Roozbeh Pournader

On Mon, 13 Aug 2001, Philipp Reichmuth wrote:

> Arabic shapes operate mainly on the script level; they are different
> appearances of the same letter within the flow of script. There are
> some exceptions from this rule, such as the use of the final form of
> the letter HEH to denote Hijri dates or sometimes the use of
> joined/non-joined variants to denote either parts from the middle of a
> word or composite words in non-Arabic languages, but these are
> comparatively rare and, in most cases, tied to the simple factor of
> whether or not a word border is present at the respective point.

Just a note about the usage of non-joiner (not much related to the topic,
only correcting you):

The use of non-joiner in Persian (using the Arabic script) is not rare. It
is obligatory in some common words, and the usage is increasing (some
words that were previously written without the non-joiner, are being
written more and more using it).

I can provide examples if anyone's interested.

roozbeh

--
David Starner - dstarner98 at aasaa dot ofe dot org
"The pig -- belongs -- to _all_ mankind!" - Invader Zim