
Re: Arabic Unicode fonts



I've forwarded part of your mail to the Unicode mailing list; I hope you
don't object.

From: "Nadim Shaikli"
> On Thu, 9 Aug 2001 07:44:14 +0100
>  David Starner wrote:
>>
> Well, it _always_ will be necessary and that's my point (it's not even almost
> always, it's "always" :-).  0600-06FF presents a flavor of the entire Arabic
> alphabet (each letter is represented in _a_ particular form - initial,
> medial, final and isolated), it also includes all the Arabic numbers and
> punctuation,

Conceptually, each letter in 0600-06FF is formless. The glyph shown in the
standard is merely an example; it isn't, and can't be, definitive in a
shaped script.

> but 0600-06FF, by all means, is not complete since it doesn't include all the
> various character permutations (forms).  With that said, let me rephrase what
> you've noted (sorry, if I'm being dense); the idea here is to use 0600-06FF
> and simply plop characters down (irrespective of form) upon which time the
> application (or underlying library) would go about transforming the
> characters into their appropriate visual glyph (based on location, etc),
> right ?

Yes.
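
To make that concrete, here is a minimal sketch in Python of what such a
shaping pass might look like. It is not any particular library's algorithm.
The names build_forms, joins_forward, joins_backward and shape are made up
for illustration. It assumes the joining behavior of a letter can be inferred
from which presentation forms exist for it (a real shaper would use the
joining types published with the standard), and it ignores diacritics,
lam-alef ligatures, and ZWJ/ZWNJ.

```python
import unicodedata

FORM_TAGS = ("<isolated>", "<final>", "<initial>", "<medial>")

def build_forms():
    """Map each base letter to {form name: presentation-form character}."""
    forms = {}
    # Arabic Presentation Forms-A and Forms-B.
    for start, end in ((0xFB50, 0xFDFF), (0xFE70, 0xFEFF)):
        for cp in range(start, end + 1):
            parts = unicodedata.decomposition(chr(cp)).split()
            # Keep single-letter forms such as "<initial> 0628".
            # Ligatures decompose to several code points and are skipped.
            if len(parts) == 2 and parts[0] in FORM_TAGS:
                base = chr(int(parts[1], 16))
                forms.setdefault(base, {})[parts[0].strip("<>")] = chr(cp)
    return forms

FORMS = build_forms()

def joins_forward(ch):
    # Assumption: a letter connects to the following one if it has an initial form.
    return "initial" in FORMS.get(ch, {})

def joins_backward(ch):
    # Assumption: a letter connects to the preceding one if it has a final form.
    return "final" in FORMS.get(ch, {})

def shape(text):
    """Replace formless letters (logical order) with positional forms."""
    out = []
    for i, ch in enumerate(text):
        prev = text[i - 1] if i > 0 else ""
        nxt = text[i + 1] if i + 1 < len(text) else ""
        to_prev = joins_backward(ch) and joins_forward(prev)
        to_next = joins_forward(ch) and joins_backward(nxt)
        if to_prev and to_next:
            form = "medial"
        elif to_prev:
            form = "final"
        elif to_next:
            form = "initial"
        else:
            form = "isolated"
        # Characters without a matching form pass through unchanged.
        out.append(FORMS.get(ch, {}).get(form, ch))
    return "".join(out)

# beh, alef, beh typed as plain 0600-06FF letters
word = "\u0628\u0627\u0628"
print([hex(ord(c)) for c in shape(word)])
```

The stored text stays in 0600-06FF; only the output of shape() would be
handed to a font that has the Forms-A & B glyphs.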

> OK, here are a couple more questions :-)
>
> Why do it this way :-D ?  Is there some hidden advantage that I'm not
> thinking of (besides saving font space) ?

Because Unicode is a character standard, not a glyph standard. A system
that lets you use any Unicode script is going to have to do much more
complex shaping for the Indic scripts, so supporting Arabic shaping
shouldn't be a problem. It makes it possible to search for part of a word
without getting the forms all right. It also corresponds more closely to
what's on the keyboard, doesn't it?
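
A small Python sketch of the search point, under the assumption that the
text being searched was stored as presentation forms. A plain substring
search for the formless letter misses, while folding both sides back to
0600-06FF with NFKC normalization finds it. The helper names fold and
contains are made up for this example.

```python
import unicodedata

# The same word stored two ways.
shaped = "\ufe91\ufe8e\ufe8f"   # beh-initial, alef-final, beh-isolated
plain = "\u0628\u0627\u0628"    # beh, alef, beh in 0600-06FF

def fold(text):
    """Map presentation forms back to their formless letters."""
    return unicodedata.normalize("NFKC", text)

def contains(haystack, needle):
    """Substring search that ignores positional forms."""
    return fold(needle) in fold(haystack)

print("\u0628" in shaped)           # naive search on the shaped text
print(contains(shaped, "\u0628"))   # search after folding
```

With formless storage the fold step is simply unnecessary, which is the
advantage being described.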

> Why not preserve all these conversions so
> that if someone wanted to read my 15MB :-) file they wouldn't have to wait
> for any more conversions to take place (it's a waste of time and processor
> throughput) ?

Are time and processor throughput really much of an issue? I'll see how fast
Roman Czyborra's Perl script is at turning the characters in 0600-06FF into
the forms in Forms-A & B, but I strongly suspect the time will be negligible
compared to everything else.

> With that in mind, I was thinking
> that Form-B is an integral part of any unicode "Arabic" font since it needs
> to be known (and used) by everyone (well, the converter has to have these
> glyphs from somewhere, right ?).

Any font format that wants to have full Unicode support is going to have to
have a table of glyphs without corresponding character codes. It's necessary
for Indic scripts, which don't have a corresponding Presentation Forms
section.

> It just seems odd to go this way - it's certainly cleaner to include all the
> characters and their various permutations and give the user the ability to
> decide what he wants to type and how he wants it to look;

Zero width space, zero width non-joiner and zero width joiner characters
should let you decide how you want it to look; it's going to take some work
either to get everyone familiar with how they work (I believe Roozbeh said
that ZWJ and ZWNJ are standard on Persian keyboards) or to get a nice UI to
hide the ugly details.
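
As a rough illustration of how those two characters steer the look, here is
a toy Python sketch that shapes strings made up only of the letter BEH, ZWJ
(U+200D) and ZWNJ (U+200C). The function shape_beh and its one-letter table
are hypothetical and exist only for this example; a real shaper handles
every letter and more rules.

```python
ZWNJ = "\u200c"  # zero width non-joiner: breaks a connection
ZWJ = "\u200d"   # zero width joiner: forces a connection
BEH = "\u0628"

# Forms-B glyphs for BEH.
BEH_FORMS = {
    "isolated": "\ufe8f",
    "final": "\ufe90",
    "initial": "\ufe91",
    "medial": "\ufe92",
}

def shape_beh(text):
    """Shape a string consisting only of BEH, ZWJ and ZWNJ."""
    out = []
    for i, ch in enumerate(text):
        if ch != BEH:
            out.append(ch)  # joiners are kept in the output, they draw nothing
            continue
        prev = text[i - 1] if i > 0 else ""
        nxt = text[i + 1] if i + 1 < len(text) else ""
        # BEH joins on both sides; ZWJ behaves like a joining neighbor,
        # ZWNJ (or the edge of the text) does not.
        to_prev = prev in (BEH, ZWJ)
        to_next = nxt in (BEH, ZWJ)
        if to_prev and to_next:
            form = "medial"
        elif to_prev:
            form = "final"
        elif to_next:
            form = "initial"
        else:
            form = "isolated"
        out.append(BEH_FORMS[form])
    return "".join(out)

for sample in (BEH + BEH, BEH + ZWNJ + BEH, BEH + ZWJ):
    print([hex(ord(c)) for c in shape_beh(sample)])
```

So the user still chooses the form, just by typing a joiner rather than by
picking a presentation-form code point.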

> If this were to happen, it would give any application, given the
> right set of fonts, the ability to display Arabic characters, no ?  The
> person would be able to display (or read) a document, but wouldn't be able
> to modify it unless he had Bidi support and shaping.

If you can't support something as simple as Arabic shaping, then there's a
lot of stuff in the Unicode standard you aren't supporting. They probably
found it more important to encode Arabic "right" than to try and make it as
simple as possible.

> > Under Unix, OpenType is supported by FreeType 2. Since OpenType fonts are
> > currently almost impossible to make under Unix, what about BDF fonts?
> > Arabic Presentation Forms A & B is made for stuff like BDF fonts, and an
> > argument can be made that every Arabic BDF font should include them.
>
> Didn't follow - sorry (I'm new to all this Unicode stuff).  How does OpenType
> relate to Unicode (or does it) and you imply that OpenType does conversions
> itself (which encodings is it using ? which standard is it adhering to ?
> it sounds like a library to me).

OpenType is a new font format designed by Microsoft and Adobe to be the
"ultimate" font format. One of OpenType's capabilities is to convert
characters internally to glyphs in a font-dependent way. It makes it
possible to make an Indic font, or a cursive font, where you have to worry
about what the next character is, so you can draw the current one to connect
to it. A lot of people are of the opinion that once OpenType fonts start to
really be created, most of the standard conversions, including Arabic
shaping, will be automatically included by the tools.

> > You had another question about how you were going to encode that many
> > glyphs in an 8-bit font. UTF-8 is irrelevant here. If you have
> > XFree86 4.0, it includes fixed fonts encoded in ISO10646-1, which is the
> > encoding for Unicode fonts under X.
>
> What if I'm not using XFree86 - what if I'm on a solaris SUN system and want
> to augment an application to support Arabic :-)

XFree86 is mostly irrelevant here. The easiest way is to build the
application on GTK 2.0 or Qt 3.0 (?), as they should take care of most of
the display issues for you. This is a rather vague question, and I'm not
quite sure that I'm the person to answer it.

> Is there an issue regarding
> sharing these documents across different hardware platforms (that's my
> biggest fear and concern).

No. Different applications may have different problems, but Unicode and
UTF-8 themselves are well defined and platform independent. (UTF-16 and
UTF-32 have byte-order issues with complex solutions, but Unix people don't
use UTF-16 and only use UTF-32 internally.)
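
A short Python sketch of the byte-order point, using the letter BEH
(U+0628) as an arbitrary example. UTF-8 gives one byte sequence everywhere,
while UTF-16 gives two possible orders and needs a byte order mark (or an
agreement out of band) to tell them apart.

```python
beh = "\u0628"

utf8 = beh.encode("utf-8")          # same bytes on every platform
utf16_be = beh.encode("utf-16-be")  # big-endian, no byte order mark
utf16_le = beh.encode("utf-16-le")  # little-endian, no byte order mark
utf16_bom = beh.encode("utf-16")    # byte order mark, then native order

print(utf8.hex())
print(utf16_be.hex(), utf16_le.hex())
print(utf16_bom.hex())

# The mark lets a reader recover the right order on any machine.
assert utf16_bom.decode("utf-16") == beh
```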

> > When you use UTF-8 (Unix's normal encoding for Unicode), U+FE70 will be
> > encoded as 0xE08080, but it uses the U+FE70 to display under X.
>
> I think there is a thread on "font encoding", so please paste this there :-)
>
> U+FE70 -> 0xE08080 how ?  Why not 0xEF8080 or 0xE08081 ? What are the rules ?

The example came from that thread. It's probably easier to point you to
http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8 , which explains what
UTF-8 is and what the rules are.
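
For a quick check of the rules, here is a minimal Python sketch of the
UTF-8 bit layout; utf8_encode is a made-up name, and the function skips
validation such as rejecting surrogates. Running it on U+FE70 gives the
three bytes EF B9 B0, so the 0xE08080 in the quoted example does not match;
E0 80 80 would be an overlong form, which a conforming decoder rejects.

```python
def utf8_encode(cp):
    """Encode one code point by hand, following the UTF-8 bit layout."""
    if cp < 0x80:                       # 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:                      # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp < 0x10000:                    # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])

encoded = utf8_encode(0xFE70)
print(encoded.hex())

# Cross-check against Python's built-in encoder.
assert encoded == "\ufe70".encode("utf-8")

# E0 80 80 is an overlong sequence and is not valid UTF-8.
try:
    b"\xe0\x80\x80".decode("utf-8")
    overlong_accepted = True
except UnicodeDecodeError:
    overlong_accepted = False
print(overlong_accepted)
```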
--
David Starner - dstarner98 at aasaa dot ofe dot org
"The pig -- belongs -- to _all_ mankind!" - Invader Zim