
Difference between encodings and glyphs



Salaam,
Again, my apologies if anything here might seem silly, but the projects
list has an important mistake that shouldn't be there for people who are
supposed to coordinate Arabization and know what they are talking about.

So I would like to explain the difference between glyphs and character
encodings just in order to make it clear so we don't look like amateurs.

A character encoding is the way characters are represented internally,
as numbers: ISO-8859-*, CP-12??, the Unicode encodings, etc.

For example, let's suppose we are using an encoding called AE where s is
in position 50, l in position 55 and m in position 60.
The string slm (salaam) would be
char str[] = {50, 55, 60, 0}; /* the 0 marks the end of the string in C */
i.e. str[0]==50, str[1]==55, etc.
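
To make the AE example concrete, here is a minimal sketch in C. AE is
the made-up encoding from this mail, and the names ae_code_for and
ae_encode are my own inventions for illustration, not part of any real
library. It only knows the three letters s, l and m.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical "AE" encoding: only the three codes given in the text. */
enum { AE_S = 50, AE_L = 55, AE_M = 60 };

/* Return the AE code for a transliterated letter, or 0 if AE (as
   sketched here) has no code for it. */
static unsigned char ae_code_for(char letter)
{
    switch (letter) {
    case 's': return AE_S;
    case 'l': return AE_L;
    case 'm': return AE_M;
    default:  return 0;
    }
}

/* Encode a transliterated word into a 0-terminated AE string.
   Returns the number of codes written (terminator excluded), or -1 if
   a letter is unknown or the output buffer is too small. */
static int ae_encode(const char *word, unsigned char *out, size_t out_size)
{
    size_t n;

    assert(word != NULL && out != NULL);
    if (out_size == 0)
        return -1;
    for (n = 0; word[n] != '\0'; n++) {
        unsigned char code = ae_code_for(word[n]);
        if (code == 0 || n + 1 >= out_size)
            return -1;
        out[n] = code;
    }
    out[n] = 0; /* end-of-string mark */
    return (int)n;
}
```

Calling ae_encode("slm", buf, sizeof buf) with a 4-byte buffer fills it
with 50, 55, 60, 0, which is exactly the array shown above.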

The glyph is the drawing that appears on a screen.
For example, if we decide we associate the drawing "s" to the position
50 in the encoding AE, "l" to the position 55 and "m" to the position
60, the string slm would appear on the screen as "slm".
But if we decide we associate the drawing "a" to the position 50 in the
encoding AE, "c" to the position 55 and "e" to the position 60, the
string slm would appear on the screen as "ace" BUT WOULD STILL MEAN slm
TO THE COMPUTER!!!
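
The same point as a small C sketch: one encoded string, two "fonts".
The tables font_plain and font_odd and the function render are
hypothetical names of mine, and a glyph is reduced to a single char
standing in for the drawing. The encoded bytes never change, only what
gets drawn.

```c
#include <assert.h>
#include <stddef.h>

/* Two hypothetical fonts for the AE encoding: each maps a code to a
   drawing (here one char stands in for the drawing). */
static const char font_plain[256] = { [50] = 's', [55] = 'l', [60] = 'm' };
static const char font_odd[256]   = { [50] = 'a', [55] = 'c', [60] = 'e' };

/* "Draw" a 0-terminated AE string with the given font into out.
   Codes with no drawing in the font are shown as '?'.
   Returns the number of glyphs drawn, or -1 if out is too small. */
static int render(const unsigned char *str, const char font[256],
                  char *out, size_t out_size)
{
    size_t i;

    assert(str != NULL && font != NULL && out != NULL);
    if (out_size == 0)
        return -1;
    for (i = 0; str[i] != 0; i++) {
        if (i + 1 >= out_size)
            return -1;
        out[i] = font[str[i]] ? font[str[i]] : '?';
    }
    out[i] = '\0';
    return (int)i;
}
```

Rendering {50, 55, 60, 0} with font_plain gives "slm" and with font_odd
gives "ace", while the array itself still holds 50, 55, 60.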

A font file is, in general, a file where these drawings are stored at
given positions.
Where it's easy to match the encoding to the drawing's position (Latin,
Hebrew, Russian, etc.), the drawings, better known as glyphs, are stored
at the same position their encoding indicates, thus making the mapping
trivial. For example, the Latin glyph A will be stored at position 65 in
a font file, as ISO-8859-1 and most other encodings indicate.
In Arabic (except if you use my position-agnostic glyphs, "Square
Arabic" ;)), we have shaping, which means the same code will not
necessarily correspond to the same glyph, and we have to go through more
complex processing. For example, we could decide that AE's s would be
associated with the initial-position glyph stored at element 50 in the
glyph file, the mid-position glyph at element 51, the end-position glyph
at element 52 and the isolated glyph at element 53, and the software
would choose the appropriate glyph to display according to the context.
For that reason, it is not important where glyphs are stored in a file
as long as the software knows where they are, but it is important to use
standard encodings, because files, string manipulation libraries and
other data processing code rely on the encoding.
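
A rough sketch of that context-based choice in C, under loud
assumptions: I extend the layout given for s to l and m (four glyphs
starting at the letter's own code), I pretend every AE letter joins on
both sides (real Arabic has letters that don't), and any other code is
passed through as a word break. The names ae_is_letter and ae_shape are
made up for this mail.

```c
#include <assert.h>
#include <stddef.h>

/* Offsets of the four contextual forms, following the layout described
   for s: initial +0, mid +1, end +2, isolated +3. */
enum { FORM_INITIAL = 0, FORM_MEDIAL = 1, FORM_FINAL = 2, FORM_ISOLATED = 3 };

/* Assumption: the AE letters are exactly the codes 50 (s), 55 (l), 60 (m). */
static int ae_is_letter(unsigned char code)
{
    return code == 50 || code == 55 || code == 60;
}

/* Turn a 0-terminated AE string into a 0-terminated list of glyph
   positions, choosing each form from the neighbouring codes.
   Returns the number of glyphs, or -1 if the output is too small. */
static int ae_shape(const unsigned char *str, unsigned char *glyphs,
                    size_t size)
{
    size_t i;

    assert(str != NULL && glyphs != NULL);
    if (size == 0)
        return -1;
    for (i = 0; str[i] != 0; i++) {
        int joins_prev, joins_next, form;

        if (i + 1 >= size)
            return -1;
        if (!ae_is_letter(str[i])) {
            glyphs[i] = str[i]; /* not a letter: no shaping */
            continue;
        }
        joins_prev = i > 0 && ae_is_letter(str[i - 1]);
        joins_next = ae_is_letter(str[i + 1]); /* terminator 0 is no letter */
        if (joins_prev && joins_next)
            form = FORM_MEDIAL;
        else if (joins_prev)
            form = FORM_FINAL;
        else if (joins_next)
            form = FORM_INITIAL;
        else
            form = FORM_ISOLATED;
        glyphs[i] = (unsigned char)(str[i] + form);
    }
    glyphs[i] = 0;
    return (int)i;
}
```

With this layout, the string {50, 55, 60, 0} is drawn with glyphs 50,
56 and 62, while a lone {50, 0} is drawn with glyph 53. The stored
string is the same kind of data in both cases; only the glyph choice
depends on context.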

We might wonder why we don't simply make an Arabic encoding with a
trivial code-to-glyph mapping (or use the trivial presentation forms in
the upper part of Unicode's tables). Because that would make the problem
harder: it would move it into every layer of data manipulation, where
the same letter would be treated as different codes depending on its
position or context. A trivial comparison of a mid-word s and an
end-of-word s, for example, would become complex work.
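
To illustrate with the same hypothetical layout (s stored as 50..53
depending on position, and, by my own extension, l as 55..58 and m as
60..63): if text were stored as glyph codes, every comparison would
first have to undo the shaping. The helper names below are invented for
this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Map a positional glyph code back to its letter, assuming each letter
   owns four consecutive codes starting at 50, 55 and 60.
   Codes outside those ranges are returned unchanged. */
static unsigned char glyph_base_letter(unsigned char glyph)
{
    static const unsigned char letters[] = {50, 55, 60};
    size_t i;

    for (i = 0; i < sizeof letters / sizeof letters[0]; i++)
        if (glyph >= letters[i] && glyph <= letters[i] + 3)
            return letters[i];
    return glyph;
}

/* In a glyph-based encoding, "is this the same letter?" needs a lookup. */
static int same_letter(unsigned char a, unsigned char b)
{
    return glyph_base_letter(a) == glyph_base_letter(b);
}

/* Small self-check of the point being made. */
static void check_examples(void)
{
    unsigned char mid_s = 51, end_s = 52, mid_l = 56;

    assert(mid_s != end_s);             /* plain == says "different" */
    assert(same_letter(mid_s, end_s));  /* the lookup says "same s"  */
    assert(!same_letter(mid_s, mid_l)); /* s and l stay different    */
}
```

With a real character encoding, none of this is needed: both s's are
stored as 50 and a plain == is enough.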

There, I think I'm done. Going back to the projects list: UTF-8 and the
other Unicode encodings are not the answer to everything. Arabic needs
an 8-bit encoding, at least for converting all the huge legacy code that
uses ISO-8859-1, and ISO-8859-6 is a more standard and appropriate
answer for that than CP-1256, which is proprietary and doesn't respect
terminal control codes. My only reservation about ISO-8859-6 is that it
lacks codes for the letters needed to transcribe popular chants, prose,
etc. G, P and V could be our next challenge: getting the ISO folks to
integrate them :)

Salaam,
Chahine