
Unicode Bidi Proposal



Salam Roozbeh,

I'll let other people do a more thorough introduction, as I am a bit
lazy for that. I'll just summarize things by saying we're working on the
Arabization of software, and Unixes in particular (see
http://www.arabeyes.org).

Anyway, we have been discussing the Unicode bidirectional processing
specification lately, and I wrote a draft with the help of my friends on
this list. Nadim more or less pointed out that you could be interested
in the matter as an expert in Unicode ;) and I agree with him, since
I've seen you active in many lists about Unicode matters.

Here's a copy of the draft; feel free to massacre it, destroy it, tear
it into little pieces. We want it to be solid because we are going to
base our work on it.

Best regards,
Chahine

___________________________________________

Bidi standardization

Introduction

Arabic text processing faces two technical problems today. The first is
contextual formatting, the process of shaping glyphs according to their
position in the word. This can easily be resolved in systems that do not
require conventional calligraphy by simply using position-independent
glyph shapes that are readable by anyone who reads Arabic. The second is
directional layout: since Arabic script is written from right to left,
insertion of Latin-based text or numbers needs to be handled
transparently.

Many proposals have been made and implemented. Today, the most widely
accepted is the Unicode Consortium's bidirectional specification, which
comes together with many not fully compliant, and therefore not fully
compatible, implementations from different vendors and developers.
The Unicode specification raises important issues, though, and although
many documents and pieces of software are written in its spirit,
technical Arabization is still limited enough that it is not too late to
change standards. We will describe the issues here and make proposals
about possible ways to solve them. A technical draft will then be
proposed as the future standard.

1) Maximum transferability

Since many pieces of software are unaware of the peculiarities of
directional layout, the presence of idiosyncrasies in text storage (text
internally stored backward, in the sense of the visual order, and
embedded directional commands) greatly reduces the transferability of
documents. The easiest way to keep a text readable by virtually any
software is then to store it in the simplest possible way: devoid of
hidden commands, and in visual order.

The Unicode recommendation violates this principle of maximum
transferability: the Unicode Consortium suggests the use of an internal,
logical order, which differs from the visual order in that it contains
strings in backward order as well as directional commands.

The logical/visual difference in memory storage actually assumes that
mixed Latin-Arabic text is written more often than non-mixed text, and
that more of it will be written than any kind of text will be viewed. In
reality, both assumptions are wrong. First, most texts are essentially
Latin with little or no Arabic inserted, or vice versa. Second, most
texts are mostly viewed, and only exceptionally written or edited. When
following the Unicode recommendation, every time a user views an Arabic
text, that text must be analysed again. This includes, for example, web
pages, which are almost never rewritten. A significant gain could be
achieved by storing Arabic texts in visual order, which would also make
them readable, given the right fonts, by most existing software.

2) Minimal programming overhead

The Unicode Consortium's specification requires at least a double scan
of the text every time it is viewed. When a character is added or
inserted, complex information about the string has to be kept in memory
in order to avoid recomputing the visual order.

It also requires word processing capabilities. To illustrate this, let's
consider a text console. The Unicode Consortium's specification doesn't
mean much there. Terminals are character oriented: each character is
more or less independent of the others. How could we control ordering?
For example, should we consider that a word stops at the end of the
screen line, or should we consider the following lines to be connected?
In the latter case, how could we gather the information that lies
outside the visible area but is relevant to displaying the visible part?
Wouldn't this add complexity to applications that already have the
entire information and would have to correct the already processed
output? All these issues are eliminated by matching the visual
representation on a character basis.

3) The "block cursor" mode

In the widely used editing mode called insert, characters are inserted
at the cursor location as they are typed. In that mode the cursor moves
forward each time, just like in overwrite mode. The goal is to enter a
reverse-direction string by typing it in its natural order. For example,
in right-to-left mode, we would like to type the word "hello" after we
have typed MALS (salam, the Arabic greeting). The trick here is an
insertion mode in which the cursor doesn't move anymore.
So let's suppose we have typed MALS (salam: seen, laam, alef, meem) in
Arabic (the console is set to right-to-left direction for this example)
and the cursor is represented by * :
     * MALS
now let's suppose we are writing hello, we block the cursor and we get
keystroke by keystroke:
    h* MALS
   he* MALS
  hel* MALS
 hell* MALS
hello* MALS

then we can type "End" to move the cursor to the end of the sentence.
*hello MALS

In this case the visual and logical orders are the same, without the
problems we can easily observe when trying to edit mixed bidi text with
most current Unicode-style implementations (lost cursor, jumping cursor,
jumping selections, plus all the implementation trouble) and without the
computational inefficiencies.

4) Transparent input

While blocking/unblocking the cursor should be possible at will, so the
user can handle rare cases, an automated switch to and from the "block
cursor" (BC) mode should also be available, so that input becomes fully
transparent to the user. This way, a user could enter a text in a
natural flow, i.e. ordered in the way that comes naturally to mind. One
problem, though, is that the possibilities of mixed text ordering are
too complex to be predicted at input time without knowledge of the
portion of text that is not yet entered, especially when it comes to
blanks and punctuation.
We are left with a wide choice of heuristics for guessing mixed-entry
ordering. Though it is not necessary to have the same algorithm for all
bidi input processes as long as the storage remains the same, it is
desirable that the user's input habits are not disturbed when changing
software.

As an example, we suggest two different approaches which have the merit
of remaining simple in terms of implementation while bringing a
reasonable dose of comfort for the user.
In the first approach, we only switch to BC mode automatically for
numbers when writing from right to left (assuming numbers are entered in
spoken or Modern Standard Arabic order, i.e. from left to right - except
for tens and units, though we won't consider them here), and the user
has to activate BC mode manually whenever the need to enter
reverse-orientation text arises in any other situation.
In the second approach, one more dose of comfort is given to the user by
allowing the automatic switch to happen in sequences of reverse-oriented
characters mixed with spaces.

FIXME: the best approach is one that makes the automatic switch on all
characters with predictable behavior. What is the set of those
characters? Alphanumerics and spaces at least. Anything else?
A systematic survey should be done to determine the probability of
character ``branches'' when entering a bidirectional flow. A system of
trial and error with different algorithms should help figure out which
one is the most comfortable.

5) When a Unicode-style bidi processor is needed

Even though electronic legacy texts in Arabic are relatively
insignificant in number compared to English texts, for example (which is
what allows us today to suggest a change in methods), there are enough
of them to make it worthwhile to have a Unicode-style bidi processor
integrated into some format converters or viewer applications. The
transition between Unicode-style bidi processed texts and unprocessed
ones must indeed come at no cost. Every newly edited text should be
stored in visual order for the reasons mentioned above. Bidi support
would then be limited to reading legacy text with backward-compatible
applications and converters.

6) When a Unicode-style bidi processor seems to be better

In the case of Latin-based languages (computer languages, whether meta
languages like HTML or traditional ones like Java/C/C++) with Arabic
output, the amount of right-to-left and left-to-right text can sometimes
reach proportions significant enough in both directions to make
automated input as described above virtually impossible, and a
left-to-right viewport direction while developing is imposed. Thus a
bidi input method as described above would only enter a text in the
right direction if the output is meant to be sent to a left-to-right
oriented viewport.

For example, the C instruction
printf("MALS");
or the HTML fragment
<title>MALS</title>
will store MALS in that order and would therefore display it backward
(SLAM) on a right-to-left, or Unicode-style bidi, viewport. This could
be a source of heavy discomfort, since in the cases mentioned above it
would force either the Arabic or the Latin text to be viewed backward
while developing in order to have it right at runtime.
On the other hand, Unicode-style bidi processors would require that the
viewport not be a basic right-to-left console, rendering all the
recommendations above useless.

The solution here lies in two complementing answers.

- 6.1: the creation of Arabic based languages

The first solution mainly targets new developments. The creation of
Arabic-based languages able to interface with legacy code would
eliminate heavy use of bidirectional texts. It would come with another
major benefit, which we only mention briefly as it is out of the scope
of this paper: better access to development tools for both kids and
adults who are native speakers of languages written in Arabic-based
scripts.

For meta languages, XML, which is based on Unicode characters, is a step
in the right direction, as we can base new meta languages on non-Latin
characters that would be correctly interpreted by existing
Unicode-conformant software.

- 6.2: the creation of macros, automatic translators and editors

The second solution targets the reuse of existing languages. It is
therefore likely to be the most used in the short and medium term, and
should thus take priority over the first solution in terms of
implementation.

In the case of traditional computer languages, the idea here is to
create macros in the style of the gettext library. It should thus be
possible to determine the direction of the text dynamically or at
compile time, knowing what orientation is assumed. For example, the C
instruction would then look like this:
printf(LTR("MALS"));

In the case of meta languages, considering the example of an HTML page,
we could either use a WYSIWYG editor or a compiler that would reorder a
hand-edited text.