Revised Bidi proposal



I am posting the revised bidi proposal to the mailing list,
expecting criticism :)

Salam,
Chahine

________________________________________________________________________

Bidi standardization

Introduction

Arabic text processing faces two technical problems today. The first
is contextual formatting, the process of shaping glyphs according to
their position in the word. In systems that do not require conventional
calligraphy, this is easily resolved by using position-independent
glyph shapes that are readable by anyone who reads Arabic. The second
is directional layout: since Arabic script is written from right to
left, insertions of Latin-based text or numbers need to be handled
transparently.
Many proposals have been made and implemented. Today, the most widely
accepted is the Unicode Consortium's bidirectional specification, which
comes with many implementations, from different vendors and developers,
that are not fully compliant and therefore not fully compatible.
The Unicode specification nevertheless raises important issues, and
although many documents and pieces of software are written in its
spirit, technical arabization is still limited enough that it is not
too late to change standards. We describe these issues here and propose
possible ways to solve them. A technical draft will then be proposed as
the future standard.

1) Maximum transferability

Since many pieces of software are unaware of the peculiarities of
directional layout, the presence of idiosyncrasies in text storage (text
internally stored backward, in the sense of the visual order, and
embedded directional commands) greatly reduces transferability of
documents. The easiest way to keep a text readable by virtually any
software would then be to store it in the simplest way, devoid of
hidden commands and in visual order.

Thus, the Unicode recommendation violates the principle of maximum
transferability. The Unicode Consortium suggests the use of an internal,
logical order, different from the visual order in that it includes
strings in backward order and directional commands.

The logical/visual difference in memory storage actually assumes that
mixed Latin-Arabic text is more common than unmixed text, and that text
is written more often than it is viewed. In reality, both assumptions
are wrong. First, most texts are either essentially Latin, with few or
no Arabic insertions, or vice versa. Second, most texts are mainly
viewed, and only exceptionally written or edited by comparison. Under
the Unicode recommendation, an Arabic text must be analysed again every
time a user views it. This includes, for example, web pages, which are
viewed far more often than they are written. Storing Arabic texts in
visual order would bring a significant gain, and could make them
readable by most existing software given the right fonts.

2) Minimal programming overhead

The Unicode Consortium's specification requires at least a double scan
of the text every time it is viewed. When a character is added or
inserted, complex information about the string has to be kept in memory
in order to avoid recomputing the visual order.

It also requires word processing capabilities. To illustrate this, let's
consider a text console, where the Unicode Consortium's specification
doesn't mean much. Terminals are character oriented: each character is
more or less independent of the others. How could we control ordering?
For example, should we consider that a word stops at the end of the
screen line, or should we consider the following lines to be connected?
In the latter case, how could we obtain the text that lies outside the
visible area but is relevant to displaying the visible part? Wouldn't
this add complexity to applications that have the entire text and would
have to correct what has already been processed? All these issues would
be eliminated by matching the visual representation on a character
basis.

3) The "block cursor" mode

In the widely used editing mode called insert mode, typed characters
are inserted at the cursor location. In that mode the cursor moves
forward each time, just like in overwrite mode. The goal is to enter a
reverse-direction string without typing it backward, i.e. by typing it
in its natural order. For example, in right-to-left mode, we would like
to type the word "hello" after we have typed MALS (salam, Arabic
greeting). The trick here is an insertion mode where the cursor no
longer moves.
So let's suppose we have typed MALS (salam: seen, laam, alef, meem) in
Arabic (the console direction is set to right-to-left for this
example), and the cursor is represented by * :
     * MALS
Now let's suppose we want to write hello. We block the cursor and get,
keystroke by keystroke:
    h* MALS
   he* MALS
  hel* MALS
 hell* MALS
hello* MALS

Then we can press "End" to move the cursor to the end of the sentence:
*hello MALS

In this case visual and logical orders are the same. This avoids the
problems that are easy to observe when editing mixed bidi text with
most current Unicode-like implementations (lost cursor, jumping cursor,
jumping selections, plus all the implementation trouble), as well as
the computational inefficiencies.
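
The keystroke sequence above can be sketched in C as follows. This is
only a minimal illustration under our own assumptions, not a reference
implementation: the names rtl_row, row_type, row_end and ROW_CAP are
hypothetical, the row is kept as it appears on screen (leftmost
character first), and uppercase letters stand in for Arabic letters as
in the example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define ROW_CAP 80  /* hypothetical maximum row length */

/* One row of a right-to-left console, kept as it appears on screen:
 * row[0] is the leftmost character. */
typedef struct {
    char   row[ROW_CAP + 1];
    size_t len;      /* number of characters in the row         */
    size_t cur;      /* the cursor sits just left of row[cur]   */
    int    blocked;  /* nonzero while "block cursor" mode is on */
} rtl_row;

void row_init(rtl_row *r)
{
    assert(r != NULL);
    memset(r, 0, sizeof *r);
}

/* Insert one typed character at the cursor. Returns 0, or -1 if full. */
int row_type(rtl_row *r, char c)
{
    assert(r != NULL);
    if (r->len >= ROW_CAP)
        return -1;
    /* Open a gap at the cursor; the +1 also moves the terminating NUL. */
    memmove(r->row + r->cur + 1, r->row + r->cur, r->len - r->cur + 1);
    r->row[r->cur] = c;
    r->len++;
    /* Normal right-to-left typing leaves the new character to the right
     * of the cursor: the index is unchanged and the cursor drifts left
     * on screen. In blocked mode the new character goes to the left of
     * the cursor: the index follows it and the cursor keeps its screen
     * position. */
    if (r->blocked)
        r->cur++;
    return 0;
}

/* Distance of the cursor from the right edge (the row is right-aligned). */
size_t row_cursor_from_right(const rtl_row *r)
{
    assert(r != NULL);
    return r->len - r->cur;
}

/* The "End" key: in a right-to-left row the end is the left edge. */
void row_end(rtl_row *r)
{
    assert(r != NULL);
    r->cur = 0;
}
```

Typing S, L, A, M and a space with blocked off gives the row " MALS".
Setting blocked and typing h, e, l, l, o gives "hello MALS" while
row_cursor_from_right() keeps the same value, and row_end() then puts
the cursor at the left edge.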

4) Transparent input

While the user should remain able to block and unblock the cursor at
will to handle rare cases, an automated switch to and from the "block
cursor" (BC) mode should also be allowed, so that input is entirely
transparent to the user. This way, a user could enter a text in a
natural flow, i.e. in the order that comes naturally to mind. Even
though the bidi input algorithm is not important per se as long as the
text is stored in a conventional way, it is desirable that users not
have to change their input habits every time the software changes.
Thus, as an example, we suggest an approach which has the merit of
remaining simple to implement while bringing a reasonable degree of
comfort to the user.
This approach consists in switching the BC mode on automatically only
for numbers and Latin text, on an isolated word basis, when writing
from right to left (assuming numbers are entered in spoken or Modern
Standard Arabic order, i.e. from left to right, except for tens and
units, which we won't consider here). In all other situations, the user
has to activate the BC mode manually when reverse-direction text needs
to be entered. This means, for example, that if two consecutive Latin
words are inserted in an Arabic text, each separate word will read
correctly, but the first word will appear to the right of the second
unless the BC mode is activated manually.

This is the easiest possible approach to automated input, because other
issues that require more complex treatment appear when two or more
consecutive reverse-order words are entered. This is described in §5.2.
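
A minimal sketch of this automatic switch, under stated assumptions:
the function names is_ltr_key and type_rtl_auto are hypothetical,
lowercase ASCII letters and digits play the role of Latin text and
numbers, and uppercase letters stand in for Arabic letters. The row is
again kept as it appears on screen, leftmost character first.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Assumption for this sketch: lowercase ASCII letters and digits are the
 * left-to-right keys; everything else (the uppercase stand-ins for
 * Arabic letters, the space, punctuation) is right-to-left. */
int is_ltr_key(char c)
{
    return (c >= 'a' && c <= 'z') || (c >= '0' && c <= '9');
}

/* Feed `keys` (in typing order) into an empty right-to-left row and
 * write the resulting screen row into `row` (capacity `cap` bytes).
 * Returns 0 on success, -1 if the row does not fit. */
int type_rtl_auto(const char *keys, char *row, size_t cap)
{
    size_t len = 0, cur = 0, word_start = 0;
    int blocked = 0;

    assert(keys != NULL && row != NULL);
    if (cap == 0)
        return -1;
    row[0] = '\0';

    for (; *keys != '\0'; ++keys) {
        int want = is_ltr_key(*keys);

        if (len + 2 > cap)          /* new character plus NUL must fit */
            return -1;
        if (want && !blocked)       /* entering BC mode: remember where */
            word_start = cur;
        if (!want && blocked)       /* leaving BC mode: jump to the left */
            cur = word_start;       /* edge of the word just typed       */
        blocked = want;

        /* Open a gap at the cursor; the +1 also moves the NUL. */
        memmove(row + cur + 1, row + cur, len - cur + 1);
        row[cur] = *keys;
        len++;
        if (blocked)                /* BC mode: cursor stays on screen  */
            cur++;
    }
    return 0;
}
```

With the keys "SLAM hello" the row reads "hello MALS", and "SLAM 42"
gives "42 MALS". With "SLAM hello world" the row reads
"world hello MALS": each word is readable, but the first Latin word
ends up to the right of the second, which is the limitation described
above.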

5) When a Unicode-style bidi processor is needed

- 5.1: Legacy

Even though electronic legacy texts in Arabic are relatively few
compared to English texts, for example (which is what allows us to
suggest a change in methods today), there are enough of them to make it
worthwhile to integrate a Unicode-style bidi processor into some format
converters or viewer applications. The transition from Unicode-style
bidi processed text to unprocessed text must indeed come at no cost.
Every newly edited text should be stored in visual order for the
reasons mentioned above. Bidi support would then be limited to reading
legacy text with backward-compatible applications and converters.

- 5.2: ``Complex text'' is incompatible with visual storage

What we define here as ``complex text'' (as opposed to ``simple text'')
is a text in a given direction into which two or more consecutive words
of the reverse direction are inserted, numbers not included. The problem
here arises with line breaks: if stored in visual order, the first words
of a sequence of reverse text will appear from bottom to top.
With such texts, the Unicode approach is better, even though a slight
change to the standard would make the Unicode recommendation compatible
with the visual order approach for simple texts.
The visual order approach assumes the text will be displayed on a device
that has a predetermined direction. For example, the device can be a
terminal that is considered to display text from right to left. Even
though this allows two possible storages for the same text, we can
safely assume that a ``simple text'' is stored for a device oriented in
the same direction as the text. In other words, an Arabic text will most
probably be meant for a right-to-left device, while a Latin text will be
meant for a left-to-right device (and in the few cases where those
assumptions are wrong, we can either revert to ``complex text''
treatment or apply a ``mirroring'' after detection by trivial text
analysis algorithms). The benefit of this approach is that the treatment
of most Arabic texts would become a mere matter of inverting the X
coordinate, which would make every existing English software compatible
with Arabic ``simple text''.
It would be interesting to make ``simple texts'' a subset of ``complex
texts'', i.e. to make them processable by Unicode-like complex
algorithms as well. For this, we could introduce the assumption of
display direction into the logical storage of texts that are meant to be
processed by those high-level algorithms. Since those texts already
contain idiosyncrasies and are displayed in an order different from the
logical one, it doesn't matter whether the words are stored visually or
logically on a word-by-word basis, or whether new commands are created.
Thus, a control code could be assigned to optionally force the assumed
display direction, and each word would then be stored visually,
independently of the word order inside the sentence.
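
The inversion of the X coordinate mentioned above for ``simple text''
can be sketched as follows. This is an illustration only: mirror_x and
render_rtl are hypothetical names, one byte is assumed per character,
and the text is assumed to be stored in the order in which a
right-to-left device would lay it out (first stored character at the
right edge).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Map a left-to-right column to the matching right-to-left column. */
size_t mirror_x(size_t x, size_t width)
{
    assert(x < width);
    return width - 1 - x;
}

/* Lay out `stored` on a right-to-left row of `width` columns.
 * `row` must have room for width + 1 bytes. The result is the row as
 * seen on screen, leftmost column first, padded with spaces.
 * Returns 0 on success, -1 if the text is wider than the row. */
int render_rtl(const char *stored, size_t width, char *row)
{
    size_t n, i;

    assert(stored != NULL && row != NULL);
    n = strlen(stored);
    if (n > width)
        return -1;
    memset(row, ' ', width);
    row[width] = '\0';
    for (i = 0; i < n; ++i)
        row[mirror_x(i, width)] = stored[i];   /* the only bidi step */
    return 0;
}
```

For example, the stored text "SLAM olleh" rendered on 12 columns gives
the screen row "  hello MALS". Software written for left-to-right
output only has to replace x by mirror_x(x, width) when placing each
character.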

6) When a Unicode-style bidi processor seems to be better

In the case of Latin-based languages (computer languages, whether meta
languages like HTML or traditional ones like Java/C/C++) with Arabic
output, the amount of right-to-left and left-to-right text can sometimes
be significant enough in both directions to make automated input or an
assumed display direction, as described above, virtually impossible, and
a left-to-right viewport direction is imposed while developing. Thus a
bidi input method as described above would only enter text in the right
direction if the output is meant to be sent to a left-to-right oriented
viewport.

For example, the C instruction
printf("MALS");
or the HTML fragment
<title>MALS</title>
will store MALS in that order and would therefore display it backward
(SLAM) on a right-to-left or Unicode-style bidi viewport. This could be
a source of heavy discomfort, since either the Arabic or the Latin text
would have to be viewed backward while developing in order to have it
right at runtime in the cases mentioned above.
On the other hand, Unicode-style bidi processors would impose that the
viewport have complex word processing abilities, rendering all the
recommendations above useless and throwing away the benefit of all
existing English software.

The solution here lies in two complementary answers.

- 6.1: the creation of Arabic based languages

The first solution mainly targets new developments. The creation of
Arabic-based languages able to interface with legacy code would
eliminate heavy use of bidirectional text. It would come with another
major benefit, which we mention only briefly as it is outside the scope
of this paper: better access to development tools for both children and
adults who are native speakers of languages that use Arabic-based
scripts.

For meta languages, XML, which is based on Unicode characters, is a step
in the right direction, as we can base new meta languages on characters
that are not Latin and that would be correctly interpreted by existing
Unicode-conformant software.

- 6.2: the creation of macros, automatic translators and editors

The second solution targets the reuse of existing languages. It is
therefore likely to be the most used in the short and medium term, and
should thus take priority over the first solution in terms of
implementation.

In the case of traditional computer languages, the idea here is to
create macros in the style of the gettext library. It should thus be
possible to determine the direction of the text dynamically or at
compile time, knowing what orientation is assumed. For example, the C
instruction would then look like this:
printf(LTR("MALS"));
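
One possible reading of such a macro, resolved at run time, is sketched
below. Everything here is an assumption made for illustration: the
names text_dir, bidi_output_dir, bidi_for_output and RTL are
hypothetical, gettext itself provides nothing of the kind, and the
byte-wise reversal is only valid for single-byte encodings and for
strings that contain a single direction.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef enum { DIR_LTR, DIR_RTL } text_dir;

/* Hypothetical run-time setting: direction of the output viewport. */
text_dir bidi_output_dir = DIR_RTL;

#define BIDI_BUF_CAP 256  /* hypothetical limit of this sketch */

/* Return `s` in the order expected by the output viewport.
 * `written_for` is the viewport direction the literal was typed for.
 * The result may point to a static buffer that is overwritten by the
 * next call, so the function is not reentrant. Strings that do not fit
 * in the buffer are returned unchanged in this sketch. */
const char *bidi_for_output(const char *s, text_dir written_for)
{
    static char buf[BIDI_BUF_CAP];
    size_t n, i;

    assert(s != NULL);
    n = strlen(s);
    if (written_for == bidi_output_dir || n >= BIDI_BUF_CAP)
        return s;
    for (i = 0; i < n; ++i)         /* plain byte-wise reversal */
        buf[i] = s[n - 1 - i];
    buf[n] = '\0';
    return buf;
}

/* The literal is written as it looks in a left-to-right editor. */
#define LTR(s) bidi_for_output((s), DIR_LTR)
/* The literal is written as it looks in a right-to-left editor. */
#define RTL(s) bidi_for_output((s), DIR_RTL)
```

With bidi_output_dir set to DIR_RTL, LTR("MALS") yields the bytes
"SLAM", which a right-to-left viewport displays as MALS. With
DIR_LTR the literal is passed through untouched. A compile-time
variant would select between two macro definitions with the
preprocessor instead of testing a variable.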

In the case of meta languages, considering the example of an HTML page,
we could either use a WYSIWYG editor or use a compiler that would
reorder a hand-edited text.