
Re: Internal number storage



Hi,

Roozbeh Pournader wrote:

> If you want my word, forget about 8859-6 if you want consistency. There is
> no guarantee that you can get Arabic-Indic or European numbers (whichever
> you want) in a different environment.

ISO-8859-6 is no more or less valid than any existing Unicode encoding when it
comes to "consistency", Roozbeh. In all those encodings, you have the
possibility of storing both kinds of digits in a "unified" or a "separate"
model. ISO-8859-6 should definitely not be forgotten: it is THE only correct
8-bit standard for Arabic so far, and though I do not believe it to be perfect,
inconsistency is not one of its flaws.

>
> European numbers if the user wants them, and Arabic numbers in the other
> case.

What Unicode calls European numbers is what the whole world calls Arabic
numbers, those shapes having been invented in Arab Spain (Al-Andalus). This
is not a semantic battle; the point is to avoid confusion in terminology.
So, to avoid that confusion, I will refer to the digits as written in the
Western parts of the Arab World (the Maghreb) and in Europe as the Arabic
digits, and to the ones used in the Eastern parts of the Arab World as Hindi
digits.

The approach you are suggesting is a technical headache (see below), and it
goes against common practice (which for a good part is, yes, Microsoft related,
but not only: Sakhr also does it, as do a few pieces of software here and
there).

> And if you are using a charset who has unified those, you can
> either:
>
> 1) Convert them intelligently (however you define that).

Yes, this part should be defined; see below.

>
>
> 2) Convert them to 0030..0039 (which helps you from getting into a can of
> worms, but makes the users complain).

This is valid only for legacy cases, as most of the current charsets allow both
unified and separate storage.
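
To make option 2 concrete, here is a minimal Python sketch of folding the Hindi digits (U+0660..U+0669) into 0030..0039. The name `to_unified` is only illustrative, and the sketch assumes the text is already decoded to Unicode; it is not taken from any particular implementation.

```python
# Illustrative only: fold Arabic-Indic ("Hindi") digits into U+0030..U+0039.
ASCII_DIGITS = "0123456789"
HINDI_DIGITS = "".join(chr(0x0660 + i) for i in range(10))

# One-to-one translation table: U+0660 -> "0", ..., U+0669 -> "9".
TO_ASCII = str.maketrans(HINDI_DIGITS, ASCII_DIGITS)


def to_unified(text: str) -> str:
    """Return text with every Hindi digit replaced by its 0030..0039 equivalent."""
    return text.translate(TO_ASCII)
```

Anything that is not a digit passes through untouched, so the conversion itself is trivial; the complaints come from users who then see a shape they did not type.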

> > What does the unicode (or any other) standard say about which encodings
> > should be stored (if they don't care and don't state anything, who does) ?
>
> W3C cares, and Unicode cares. Both recommend storing the form that the
> user sees on the screen.

I might have missed something here, so could you please give pointers to the
specific parts that "recommend storing the form that the user sees on the
screen"? Because if that's the case, they'd be inconsistent in some of their
specs. For example, Unicode's UAX#9 avoids taking sides, and assumes that both
the "separate" and "unified" approaches can be used, which is the reason why
it gives possible treatments for both cases. The same goes for the W3C, which
in some of its specs gives different solutions depending on the implementor's
approach.

Now on the technical side, using the "separate" approach limits the
universality of Arabic texts. There are many Arabs in North Africa who are
simply unable to read Hindi digits. Forcing those digits upon them by storing
them either limits the text's readability or pushes the transformation
operation into the source code, where a simple font change would have done it
(granted, if it's just about fonts, both approaches can be fixed that way, but
the "separate" approach would involve a numeric redundancy in the font in order
to translate). The same goes the other way: it might be less comfortable for an
Arab from the East to read Arabic digits.
Another problem with storing numbers in a "separate" model is that by doing
so, you simply throw away all legacy code that deals with text<->number data.
Put in a few words, it means the herculean task of reinventing the wheel and
changing such low-level stuff as glibc...
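
As a rough illustration of the legacy-code point, the sketch below imitates an atoi-style parser that only recognises bytes 0x30..0x39. It is a simplified stand-in written for this message (no whitespace or sign handling), not glibc's actual code, and `legacy_atoi` is a made-up name.

```python
def legacy_atoi(raw: bytes) -> int:
    """Simplified atoi-style parse: read leading bytes 0x30..0x39, stop at anything else."""
    value = 0
    for byte in raw:
        if not 0x30 <= byte <= 0x39:
            break  # the first non-ASCII-digit byte ends the number
        value = value * 10 + (byte - 0x30)
    return value


# Same number, stored in the two models and encoded as UTF-8.
unified = "2024".encode("utf-8")
separate = "\u0662\u0660\u0662\u0664".encode("utf-8")

unified_value = legacy_atoi(unified)    # parses as expected
separate_value = legacy_atoi(separate)  # stops at the first byte, yields 0
```

With the unified storage the old routine keeps working; with the separate storage every such routine has to be found and taught about the second set of digits.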

The only trade-off in the unified model is context-dependent digit display.
This can be done with a bidi-style (or shaping-style) approach at a high level.
Not all implementations are perfect, indeed, but that is due more to the lack
of a standard for this very particular case than to the impossibility of
implementing one. In other words, MS and other software developers were right
in their unifying approach; what we might want to do is rather to establish a
standard around this issue, especially context-dependent digit display.
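
To show what I mean by a bidi-style approach, here is a small Python sketch of one possible rule: digits are stored as 0030..0039 and shown as Hindi digits only when the last strong character before them is an Arabic letter. The rule and the name `render_digits` are assumptions for the sake of illustration; this is exactly the part that no standard defines yet.

```python
import unicodedata

ASCII_DIGITS = "0123456789"
# Display-time table: "0" -> U+0660, ..., "9" -> U+0669.
TO_HINDI = str.maketrans(ASCII_DIGITS, "".join(chr(0x0660 + i) for i in range(10)))


def render_digits(stored: str) -> str:
    """Pick the digit shape from context; the stored text itself is never changed."""
    out = []
    arabic_context = False  # assumed default: Arabic (Western) digits
    for ch in stored:
        bidi = unicodedata.bidirectional(ch)
        if bidi == "AL":      # strong Arabic letter
            arabic_context = True
        elif bidi == "L":     # strong left-to-right letter
            arabic_context = False
        if arabic_context and ch in ASCII_DIGITS:
            out.append(ch.translate(TO_HINDI))
        else:
            out.append(ch)
    return "".join(out)
```

A user preference (always Arabic digits, always Hindi digits, or contextual) could sit on top of this without touching the stored data, which is the whole point of the unified model.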

Salaam,
Chahine