
Re: Arabic Unicode fonts



On Tue, 14 Aug 2001 06:26:08 +0100
 David Starner wrote:
>
> From: "Nadim Shaikli"
> >  1. All Arabic fonts _must_ include forms-B (or equivalent) for
> >     them to be properly called Arabic fonts (ie. ISO8859-6) since
> >     without those
> 
> No can do. An ISO8859-6 font has a fixed set of 190 glyphs, 94 ASCII
> and 96 Arabic. It can't include Forms-B. If you want Forms-B, you're
> going to have to use a Unicode font, not an ISO8859-6 font.
>

Well that simply means that ISO8859-6 stand-alone is useless.

> >     extensions (again, they should not be optional) my shaping
> >     library or application won't find those glyphs.  Moreover,
> >     those glyphs have to be standardized so that the same
> >     library (or code) could be utilized in many applications (it
> >     needs to look for the same encoding irrespective of the
> >     application).  I tend to think of shaping as a library -
> >     develop once, use everywhere.
> 
> True - for most Arabic fonts in currently used Unix font formats,
> appropriate characters in the Forms-B blocks should be included.
> For an OpenType system, Freetype and the font will do the shaping
> for you, and whatever ligating the font supports.

OK - so we agree that the various glyphs have to be included for a
font to be usable; now the question is how one accomplishes this
(sans OpenType/FreeType).
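
To make that concrete, here's a rough sketch (purely illustrative, in
Python) of the kind of base-letter to Forms-B mapping that the font
and the shaping library have to agree on.  The code points are the
standard ones from the Arabic Presentation Forms-B block
(U+FE70 - U+FEFF); only a handful of letters are shown, not the full
alphabet.

# Illustrative subset: Arabic base letters (U+06xx) mapped to their
# Unicode Presentation Forms-B code points (U+FE70..U+FEFF).
# Tuple order: (isolated, final, initial, medial).  Letters that only
# join to the preceding letter (alef, dal, waw, ...) have no initial
# or medial form and are marked None.
FORMS_B = {
    '\u0627': ('\uFE8D', '\uFE8E', None,     None    ),  # alef
    '\u0628': ('\uFE8F', '\uFE90', '\uFE91', '\uFE92'),  # beh
    '\u062F': ('\uFEA9', '\uFEAA', None,     None    ),  # dal
    '\u0633': ('\uFEB1', '\uFEB2', '\uFEB3', '\uFEB4'),  # seen
    '\u0644': ('\uFEDD', '\uFEDE', '\uFEDF', '\uFEE0'),  # lam
    '\u0645': ('\uFEE1', '\uFEE2', '\uFEE3', '\uFEE4'),  # meem
    '\u0646': ('\uFEE5', '\uFEE6', '\uFEE7', '\uFEE8'),  # noon
    '\u0648': ('\uFEED', '\uFEEE', None,     None    ),  # waw
}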

> >  2. It would have been much nicer to treat all the Arabic glyphs
> >     as characters
> 
> I've appended the response from unicode at unicode dot org below.

<plug>

Thanks - I invite all the unicode'ers to check out our archives :-)

    http://www.arabeyes.org/cgi-bin/mailman/listinfo/general

</plug>

> > > Any font format that wants to have full Unicode support is
> > > going to have to have a table of glyphs without corresponding
> > > character codes.
> >
> > I didn't see any mention of this table anywhere on the Unicode site?
> > Plus, shouldn't that table (if it doesn't have corresponding
> > character codes) be formalized somehow so that a developed
> > library would work across all fonts/tables ?
> 
> It's a system/font-type specific thing. An OpenType font system
> gets handed characters and it finds the appropriate glyphs.

Without getting into the specifics of OpenType and how it functions
(my reading indicates that it will require a font input file anyway),
let's agree that OpenType is not a solution I can go and download
__today__ for my Arabization development effort -- as such, I
continue to look for a means to generate these fonts/tables/glyphs,
and I'm trying to understand the "standard" way these things are
supposed to fit together (specifically the tables).

In any case, any font that I'd be able to use _today_ will have to
have a lookup table with the appropriate glyphs (if I'm understanding
you correctly) -- the question now is whether there is a standard
that specifies how these tables are supposed to be built.  Certainly,
for all applications to use the same underlying library, the
encodings/addresses/etc. of the table entries have to be consistent
if I'm to be able to swap "standard" fonts.
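
As a very rough way of seeing the problem, here's a small sketch
(illustrative only, in Python) that checks how much of the Forms-B
block a given font actually encodes.  It assumes nothing more than a
set of code points extracted from the font beforehand with whatever
tool is at hand, so no particular font API is implied.

# Illustrative sketch: given the set of code points a font encodes
# (extracted beforehand with whatever font tool is available), report
# its coverage of the Arabic Presentation Forms-B block.  The range is
# a rough approximation; it includes a few reserved slots.
def forms_b_coverage(encoded_codepoints):
    block = set(range(0xFE70, 0xFF00))
    present = block & set(encoded_codepoints)
    return len(present), sorted(block - set(encoded_codepoints))

# Hypothetical example: a font that encodes only base Arabic letters.
covered, missing = forms_b_coverage({0x0628, 0x0644, 0x0645})
print("Forms-B code points covered:", covered)   # prints 0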

> From Phillip Reichmuth:
>
[snip snip]
>
> The Arabic script has a very high degree of variation in appearance of
> the individual letters. This is also highly dependent on the style or
> script in which the particular text is written; Nasta'liq, Naskhi,
> Shekaste and Hijazi script styles, for example, have wholly different
> ligature sets which are not at all completely covered in Unicode.
> However, they don't _need_ to; if, say, Shekaste has a ligature for
> the letters XYZ, it is completely the rendering system's (i.e. mainly
> the font's) issue to display this in pretty Shekaste form. Unicode
> encodes the underlying characters, that is, XYZ.

Ligatures (defined as "a written character consisting of two or more
letters or characters joined together") don't concern me at this
point.  They are the exception to the rule, and one could argue that
99% of the Linux Arabic population could live without them for the
time being.  I want to get the basics working (my
questions/comments/musings disregard ligatures wholeheartedly).

> >>Why do it this way :-D ?  Is there some hidden advantage that I'm
> >>not thinking of (besides saving font space) ?
> 
> Yes:
> 
> - Searching and comparison are easier because if you want to search
> or compare the letters XYZ, that way you only have to look for the
> letters, possibly sans vowelization symbols. If one uses presentation
> forms for encoding, you have to search/compare X + YZ, XY + Z, X + Y +
> Z and XYZ separately which is a real pain, extremely complicated to
> implement and prone to errors of all sorts.

It's not that complicated if you ignore ligatures (I've noted a
simple algorithm on arabeyes' mailing-list in case anyone wants a
follow-up).
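
For what it's worth, here is a minimal sketch of that kind of simple
algorithm (in Python, not the exact code posted to the list, and with
only a subset of the alphabet classified): decide for each letter
whether it joins to its neighbours and pick the
isolated/initial/medial/final form accordingly.  Ligatures, including
the mandatory lam-alef, are left out on purpose.

# Minimal contextual-shaping sketch; ligatures deliberately ignored.
# Text is assumed to be in logical order.  DUAL letters join on both
# sides; RIGHT letters join only to the preceding letter.  Only a
# subset of the alphabet is listed here.
DUAL  = set('\u0628\u062A\u062B\u062C\u062D\u062E\u0633\u0634'
            '\u0635\u0636\u0637\u0638\u0639\u063A\u0641\u0642'
            '\u0643\u0644\u0645\u0646\u0647\u064A')
RIGHT = set('\u0627\u062F\u0630\u0631\u0632\u0648')

def joins_next(ch):              # can ch connect to the letter after it?
    return ch in DUAL

def joins_prev(ch):              # can ch connect to the letter before it?
    return ch in DUAL or ch in RIGHT

def pick_form(prev, ch, nxt):
    """Return 'isolated', 'initial', 'medial' or 'final' for ch."""
    before = prev is not None and joins_next(prev) and joins_prev(ch)
    after  = nxt  is not None and joins_next(ch)   and joins_prev(nxt)
    if before and after:
        return 'medial'
    if before:
        return 'final'
    if after:
        return 'initial'
    return 'isolated'

def shape(text):
    """Tag every character with the contextual form it should take."""
    return [(ch, pick_form(text[i - 1] if i else None,
                           ch,
                           text[i + 1] if i + 1 < len(text) else None))
            for i, ch in enumerate(text)]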

> - If you want to use a script style that does not contain all (or
> contains more) the ligatures from Unicode Presentation Forms, you have
> a problem. The ligatures from Unicode are based on Naskhi. If you want
> to write a text in Naskhi and later reformat it in Shekaste where the
> ligatures are completely different, the program has to go to real
> pains, replacing the characters everywhere _in the entire file_. This
> process is much more complicated than having the system-built-in
> rendering engine render a paragraph.

My point exactly - all fonts have to be consistent in order for an
application to work universally.  This is a central problem we're
seeing with "most" Arabic code/applications -- each uses its own set
of fonts, without which the application is useless, and that is
unacceptable.

There are a myriad of so-called ISO8859-6 fonts out there which
utilize the control-character space to add their glyphs -- and each
of these fonts uses its own encoding for those glyphs (note the
inconsistency, even in something they are not supposed to be doing
in the first place!).

Mark Leisher's arabic24-1.4 notes the following:

 "# The current version of MUTT allocates U+E600 to U+E6FF for
  # Arabic extensions.  The current characters are the contextual
  # forms that are not encoded in Unicode plus anything else Arabic
  # related that is not in Unicode."

Is MUTT considered _the_ standard, or is it just one of the many
options out there (I'm gathering it's the latter)?  OK, so Unicode
doesn't deal with glyphs - who is defining the standardization of
the glyphs then (again, I'm not talking about ligatures)?
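
To show what that inconsistency costs in practice, a shaping library
that wants to survive today's mix of encodings ends up carrying a
per-font remap table, something like the rough sketch below.  The
font names and the 0xE6xx value are invented placeholders standing in
for a MUTT-style private-use allocation (U+E600 - U+E6FF), not MUTT's
actual assignments.

# Illustration of the workaround forced by inconsistent encodings:
# every font needs its own remap from a logical contextual form to
# whatever code point that font happens to use.  The Forms-B value is
# standard Unicode; the font names and the 0xE6xx slot are invented
# placeholders, not real assignments.
PER_FONT_GLYPH_MAP = {
    'some-forms-b-font': {            # hypothetical font name
        ('\u0628', 'final'): 0xFE90,  # beh, final form (Forms-B)
    },
    'some-pua-font': {                # hypothetical font name
        ('\u0628', 'final'): 0xE612,  # made-up private-use slot
    },
}

def glyph_for(font_name, base_char, form):
    """Look up the font-specific code point for a contextual form."""
    return PER_FONT_GLYPH_MAP[font_name][(base_char, form)]

A real standard for the tables would make this per-font layer
unnecessary - which is exactly the point.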

> >>It just seems odd to go this way - it's certainly cleaner to include
> >>all the characters and their various permutations and give the
> >>user the ability to decide what he wants to type and how he wants
> >>it to look;
> 
> That's done in OpenType anyway: you can choose if you want to use a
> ligature. However, if you just support Unicode presentation forms, the
> user does not get any option beyond Unicode presentation forms at all,
> which is quite a limitation in the processing of, say, Persian poetry.

Unfortunately, I haven't found any Arabic OpenType fonts, and I don't
think it would be a good idea for me to sit and wait - thus my
questions.

> From Marco Cimarosti:
>
> >>Except that this is not what Unicode is about: Unicode is about
> >>what-you-store-is-what-you-mean.
> >
> >I agree (to this and pretty much everything else in Philipp's
> >response). If you want what-you-store-is-what-you-see, use PDF.
> 
> I too agree with Philipp, but I must note that he mostly explained
> why it is not wise to encode Arabic *ligatures*.
> 
> But I think the question was more about encoding the contextual
> form of *single* Arabic letters.  After all, it is easy to see
> Arabic contextual forms as a thing very similar to European case
> variants.
> 
> So a devil's advocate may ask: if the Arabic shaping forms of Kaaf
> have been unified in the same code point, then why Latin uppercase
> and lowercase K haven't been unified as well? And, conversely, if
> Latin case variants have been assigned to different code points, why
> not Arabic shape variants?
> 
> The pros and cons of the two problems are relatively similar:
> disunifying Latin case variants makes search and sort slightly
> more complicated; unifying them makes search and sort simpler but
> complicates the display process, and requires the introduction of
> "zero width uppercase" and "zero width lowercase" controls.
> 
> Similarly: disunifying Arabic shape variants makes search and sort
> slightly more complicated; unifying them makes search and sort
> simpler but  complicates the display process, and requires the
> introduction of a "zero width joiner" and "zero width non joiner"
> controls.

Thank you Marco -- that's exactly what I was getting at.  Not for
the sake of argument, but to understand and progress on a foundational
element.
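
For readers who haven't bumped into them, the two controls Marco
mentions are U+200D ZERO WIDTH JOINER and U+200C ZERO WIDTH
NON-JOINER.  A tiny sketch (same illustrative Python as above, with
its own small letter subset) of how a shaper folds them into the
neighbour context when choosing a contextual form:

# Sketch of how a shaper treats the two joining controls.  ZWJ acts
# like a joining letter on the side it sits on (forces a join); ZWNJ
# acts like a non-joining one (breaks the join).
ZWJ, ZWNJ = '\u200D', '\u200C'
DUAL_JOINING = set('\u0628\u0644\u0645\u0646\u064A')  # beh, lam, meem, noon, yeh (subset)

def neighbour_allows_join(ch):
    """Would this neighbour let the current letter connect to it?"""
    if ch is None or ch == ZWNJ:
        return False               # edge of text or an explicit break
    if ch == ZWJ:
        return True                # explicit join request
    return ch in DUAL_JOINING      # otherwise use the joining class

# Example: beh followed by ZWNJ stays in its isolated form; beh
# followed by ZWJ takes its initial form even though nothing visible
# follows it.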

> Of course, I think I know the short answer: both the Latin and Arabic
> part of Unicode descend from ISO-8859, a pre-existing standard, which
> encoded the two scripts this way.
> 
> However, this may be an unsatisfying answer, especially out of
> standardization circles, so someone may come up with more
> philosophical answers.

Yah, it certainly is - the good thing going for Arabic is the lack of
legacy code and documents.  99.9% of the Arabic documents out there
are Micro$oft, which leaves us (non-standardization) linux/unix folk
thinking of more straightforward ways of doing things - thus the
questions.

Thanks for all the help.

 - Nadim

