
Re: Arabic Unicode fonts



On Sat, 11 Aug 2001 07:44:17 +0100
 David Starner wrote:
> 
> I've forwarded part of your mail to the Unicode mailing list; I hope
> you don't object.

As long as the email addresses are scrambled or removed (to avoid
spam) - feel free to do so now and in the future...

> Nadim Shaikli wrote:
>
> > but 0600-06FF, by all means, is not complete since it doesn't include
> > all the various character permutations (forms).  With that said, let
> > me rephrase what you've noted (sorry, if I'm being dense); the idea
> > here is to use 0600-06FF and simply plop characters down (irrespective
> > of form) upon which time the application (or underlying library) would
> > go about transforming the characters into their appropriate visual
> > glyph (based on location, etc), right ?
> 
> Yes.

For those who are following this (and I hope everyone is - it's important),
here's a visual representation of what's described above:

  http://homepage.ntlworld.com/rishida/scripts/egarabic.htm

Note the "Characters in memory" section towards the bottom.
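
To make the idea concrete, here's a rough sketch in Python of what such a
shaping step might look like.  It is purely illustrative and not how any
real shaping library works.  It assumes the glyphs are the ones encoded in
Presentation Forms-B, and it reads the form tables out of Python's
unicodedata module; the names FORMS, FORM_TAGS and shape() are made up for
this example.

```python
import unicodedata

FORM_TAGS = ("<isolated>", "<final>", "<initial>", "<medial>")

# Map each base letter (0600-06FF) to its Presentation Forms-B glyphs,
# using the compatibility decompositions in the Unicode character
# database (e.g. U+FE91 decomposes to "<initial> 0628").
FORMS = {}
for cp in range(0xFE70, 0xFEFD):
    parts = unicodedata.decomposition(chr(cp)).split()
    # Keep single-letter forms only; ligatures decompose to two letters.
    if len(parts) == 2 and parts[0] in FORM_TAGS:
        base = chr(int(parts[1], 16))
        FORMS.setdefault(base, {})[parts[0].strip("<>")] = chr(cp)


def shape(text):
    """Pick a contextual glyph for each letter of a logical-order string."""
    out = []
    for i, ch in enumerate(text):
        forms = FORMS.get(ch)
        if forms is None:  # not a letter we have forms for: pass through
            out.append(ch)
            continue
        prev_forms = FORMS.get(text[i - 1], {}) if i > 0 else {}
        next_forms = FORMS.get(text[i + 1], {}) if i + 1 < len(text) else {}
        # A letter connects forward only if it has an initial form.
        joins_prev = "initial" in prev_forms and "final" in forms
        joins_next = "initial" in forms and "final" in next_forms
        if joins_prev and joins_next:
            key = "medial"
        elif joins_prev:
            key = "final"
        elif joins_next:
            key = "initial"
        else:
            key = "isolated"
        out.append(forms.get(key, ch))
    return "".join(out)
```

For example, shape("\u0628\u0627\u0628") (beh-alef-beh) picks the initial
beh, the final alef and the isolated beh.  The sketch ignores marks such as
FATHA sitting between letters, the lam-alef ligatures, and ZWJ/ZWNJ, all of
which a real implementation would have to handle.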

> > Why do it this way :-D ?  Is there some hidden advantage that I'm
> > not thinking of (besides saving font space) ?
> 
> Because Unicode is a character standard, instead of a glyph standard.
> For a system that lets you use any Unicode script, you're going to
> have to do much more complex shaping for the Indic scripts, so
> supporting Arabic shaping shouldn't be a problem. It makes it possible
> to search for part of a word, without getting the forms all right. It
> corresponds closer to what's on the keyboard, doesn't it?

I'm not advocating removal of the shaping requirement (it'll be there
no matter what is/was agreed upon).

My points are the following,

 1. All Arabic fonts _must_ include Forms-B (or equivalent) for them to
    be properly called Arabic fonts (i.e. ISO8859-6), since without those
    extensions (again, they should not be optional) my shaping library
    or application won't find those glyphs.  Moreover, those glyphs have
    to be standardized so that the same library (or code) can be
    utilized in many applications (it needs to look for the same encoding
    irrespective of the application).  I tend to think of shaping as a
    library - develop once, use everywhere.

 2. It would have been much nicer to treat all the Arabic glyphs as
    characters.  Whether they are on the keyboard or not is immaterial:
    English treats upper and lower case letters as two characters (A & a),
    and Chinese/Japanese/Korean/etc treat various symbols as characters with
    no regard to the keyboard mapping - see my point ?  If that were done,
    one would be able to save/restore/buffer/etc files in their visual
    form.

    How do you deal with searching, you say ?  OK, here's what I was thinking.
    Let's assume each character has a maximum of 4 glyphs (which I think is
    true); I would then encode as follows (again, this is a mere example),

    Letter   Base encoding (hex)   Glyph selection (1-byte encoding)
    --------------------------------------------------------------------
     meem    0x00                  bits 1,0 determine which visual glyph
     noon    0x04                  bits 1,0 determine which visual glyph
     alef    0x08                  bits 1,0 determine which visual glyph
     beh     0x0C                  bits 1,0 determine which visual glyph
     dal     0x10                  bits 1,0 determine which visual glyph
     jeem    0x14                  bits 1,0 determine which visual glyph

    I just picked Arabic letters randomly (in no particular order).  With
    this, if I'm searching for "alef-jeem-noon" my search would look like
    this in binary format --> 0000_10xx 0001_01xx 0000_01xx
                                alef      jeem      noon

    The 'x' is a don't-care for my search (i.e. masked out).  This will
    match all the appropriate letters along with their various permutations,
    which is exactly what I want.  The marks ('FATHA', 'DAMMA', etc) would
    have to be ignored in the text being searched if they aren't included
    in the string to match.

    Are there any pitfalls with this approach ?  Certainly shaping would
    have to occur, but it would only need to take place once (upon creation),
    and again only if the saved text includes "FATHA", "DAMMA" (I don't know
    what you call those things in English), since the shaping library would
    have to account for those "characters".  Since usage of "tanween" (that's
    what they are called in Arabic) is very rare, shaping would almost never
    have to occur on saved documents.
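
Here's a minimal sketch in Python of the masked search described in point 2,
just to show the idea is mechanical.  The byte values are the made-up ones
from my example table (not any real encoding), the helper names encode(),
strip_forms() and find() are hypothetical, and the handling of FATHA/DAMMA
is left out entirely.

```python
# Hypothetical base codes from the example table; bits 1,0 pick the glyph.
BASE = {"meem": 0x00, "noon": 0x04, "alef": 0x08,
        "beh": 0x0C, "dal": 0x10, "jeem": 0x14}
FORM_BITS = 0x03  # the two "don't care" bits


def encode(letter, form=0):
    """Return the one-byte code for a letter in visual form 0..3."""
    if not 0 <= form <= 3:
        raise ValueError("form must be 0..3")
    return BASE[letter] | form


def strip_forms(data):
    """Clear the two glyph-selector bits of every byte."""
    return bytes(b & ~FORM_BITS & 0xFF for b in data)


def find(text, pattern):
    """Index of the first match ignoring glyph form, or -1 if none."""
    return strip_forms(text).find(strip_forms(pattern))


# Stored text: beh-alef-jeem-noon, each in some arbitrary visual form.
text = bytes([encode("beh", 1), encode("alef", 3),
              encode("jeem", 2), encode("noon", 1)])
# Search string: alef-jeem-noon, forms irrelevant.
pattern = bytes(encode(name) for name in ("alef", "jeem", "noon"))
print(find(text, pattern))  # prints 1
```

Masking both sides before comparing is the same thing as treating the low
two bits as 'x' in the binary pattern above.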

> > With that in mind, I was thinking that Form-B is an integral part of
> > any unicode "Arabic" font since it needs to be known (and used) by
> > everyone (well, the converter has to have these glyph from somewhere,
> > right ?).
> 
> Any font format that wants to have full Unicode support is going to have
> to have a table of glyphs without corresponding character codes.

I didn't see any mention of this table anywhere in Unicode.  Plus,
shouldn't that table (if it doesn't have corresponding character codes)
be formalized somehow so that a developed library would work across
all fonts/tables ?

> > It just seems odd to go this way - it's certainly cleaner to include
> > all the characters and their various permutations and give the user
> > the ability to decide what he wants to type and how he wants it to look
> 
> Zero width spaces, zero width non-joiner and zero width joiner characters
> should let you decide how you want it to look; it's going to take some
> work to either get everyone familiar with how they work (I believe Roozbeh
> said that ZWJ and ZWNJ are standard on Persian keyboards) or get a nice UI
> to hide the ugly details.

I have a feeling the document size will be 2x-4x the size with all these
control and hint characters - but yah I know disk space is cheap... :-)

> > If this were to happen, it would give any application, given the
> > right set of fonts, the ability to display Arabic characters, no ?
> > The person would be able to display (or read) a document, but
> > wouldn't be able to modify it unless he had Bidi support and shaping.
> 
> If you can't support something as simple as Arabic shaping, then there's
> a lot of stuff in the Unicode standard you aren't supporting. They
> probably found it more important to encode Arabic "right" than to try
> and make it as simple as possible.

It's not a question of being able to support this or not (anything is
possible); I just started learning about this and some of the stuff
just doesn't make much logical sense, so I wanted to broach the
subject and go through the same train of thought as whoever developed
Arabic in Unicode.  Unfortunately, I'm still unconvinced - but I don't
think there is much I can do about that.  You see, I also keep thinking
of memory usage (visual order - save order - bidi display, etc) and
having to keep track of all these contexts.

Am I alone on this ?  If I am, I'll readily conform and move on :-)

> OpenType is a new font format designed by Microsoft and Apple to be
> the "ultimate" font format. One of OpenType's capabilities is to
> convert characters internally to glyphs in a font dependent way.

Is there an open source equivalent to OpenType being worked on (I'm
guessing since micro$oft is involved it'll be proprietary, no) ?

 - Nadim

