
Unicode and Bidi vs Shaping



As promised, attached is our proposal to the Unicode Technical Committee
on the interaction between Bidi and shaping, which we sent last night for
consideration at the UTC meeting due to start on March 4. Also
attached is a discussion paper from Mark Davis, the president of the Unicode
Consortium and the author of the Unicode Bidirectional Algorithm.

Please send any comments to me, Behdad, and/or this mailing list.

Also, please don't circulate the documents outside the Arabeyes mailing lists.

roozbeh

Attachment: bidi5.pdf
Description: Adobe PDF document

Title: Open BIDI Issue #5

L2/03-064

Unicode BIDI Issue #5

2003-01-22, MED

The key issue for the Unicode BIDI committee before the next UTC meeting is to discuss and come to consensus on item #5: whether (logically) shaping gets applied before or after BIDI directional reordering. In most cases, this doesn't matter, but it can affect the result. The following describes the possible differences in appearance, and outlines options for the committee to decide among.


We will first set up a simple test case. Suppose that we have the following string of Arabic characters in memory, as characters 1, 2, 3, and 4.

Position  Code  Name  Direction  Character
1         062C  JEEM  L          ج
2         0639  AIN   L          ع
3         0644  LAM   R          ل
4         0645  MEEM  R          م

We will override the first two characters to be LTR. So that we can show both paragraph directions, the next two will be embedded, but with the normal RTL direction. One can use embedding codes to get this effect in plain text, or markup in HTML.

This is reproduced below, although the effect in the last three rows will depend on the browser's BIDI support of these characters and/or HTML styles.

Codes                              Left-Right Paragraph    Right-Left Paragraph

LRM/RLM                            ‎‭جع‬‮لم‬                    ‏‭جع‬‮لم‬
LRO JEEM AIN PDF
RLO LAM MEEM PDF

<p dir="ltr"/"rtl">                ‭جع‬‮لم‬                    ‭جع‬‮لم‬
LRO JEEM AIN PDF
RLO LAM MEEM PDF

<p dir="ltr"/"rtl">                جعلم                    جعلم
<bdo dir="ltr"> JEEM AIN </bdo>
<bdo dir="rtl"> LAM MEEM </bdo>
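
The plain-text variant of the test case can be assembled programmatically. The following is a minimal sketch, assuming the LRM/RLM row of the table; the function name `build_test_string` and the `"ltr"`/`"rtl"` argument values are illustrative and not part of the discussion paper.

```python
import unicodedata

# Directional formatting characters used in the table.
LRM, RLM = "\u200e", "\u200f"                  # paragraph direction marks
LRO, RLO, PDF = "\u202d", "\u202e", "\u202c"   # overrides and their terminator

# The four test letters, in logical (memory) order.
JEEM, AIN, LAM, MEEM = "\u062c", "\u0639", "\u0644", "\u0645"


def build_test_string(paragraph_direction):
    """Return the test string for an 'ltr' or 'rtl' paragraph (hypothetical helper)."""
    mark = LRM if paragraph_direction == "ltr" else RLM
    return mark + LRO + JEEM + AIN + PDF + RLO + LAM + MEEM + PDF


if __name__ == "__main__":
    # List the code points so the invisible characters can be inspected.
    for ch in build_test_string("ltr"):
        print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
```

Printing the code points is useful here because most of the characters in the string are invisible when rendered.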

The resulting display order will be one of the following, depending on the paragraph direction.

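Left-Right Paragraph (display order 1 2 4 3, listed left to right)

Position  Code  Name  Character
1         062C  JEEM  ج
2         0639  AIN   ع
4         0645  MEEM  م
3         0644  LAM   ل

Right-Left Paragraph (display order 4 3 1 2, listed left to right)

Position  Code  Name  Character
4         0645  MEEM  م
3         0644  LAM   ل
1         062C  JEEM  ج
2         0639  AIN   ع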
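
The two display orders can be reproduced with a small run-based model. This is only a sketch under a simplifying assumption: the text consists of sibling directional runs that are laid out in the paragraph direction, with the characters of a right-to-left run reversed. It does not compute embedding levels as the full Unicode Bidirectional Algorithm does, and the names `display_order` and `RUNS` are illustrative.

```python
def display_order(runs, paragraph_direction):
    """Return logical indices in left-to-right display order.

    runs: list of (direction, [logical indices]) in logical order,
    where direction is "ltr" or "rtl".
    """
    visual_runs = []
    for direction, indices in runs:
        # Characters of an RTL run appear reversed when read left to right.
        if direction == "ltr":
            visual_runs.append(list(indices))
        else:
            visual_runs.append(list(reversed(indices)))
    # In an RTL paragraph the runs themselves are laid out right to left.
    if paragraph_direction == "rtl":
        visual_runs.reverse()
    return [i for run in visual_runs for i in run]


# Characters 1-2 are overridden to LTR, characters 3-4 keep RTL.
RUNS = [("ltr", [1, 2]), ("rtl", [3, 4])]

if __name__ == "__main__":
    print(display_order(RUNS, "ltr"))  # [1, 2, 4, 3]
    print(display_order(RUNS, "rtl"))  # [4, 3, 1, 2]
```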

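There are a number of possible shaping results, depending on what happens within runs and what happens across runs. In the tables below, the suffixes -I, -M, and -F denote the initial, medial, and final forms of a letter. The four most likely candidates are: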

A. If we shape, then apply BIDI, we get the following visual result:

Left-Right Paragraph     1        2        4        3
                         JEEM-I   AIN-M    MEEM-F   LAM-M

Right-Left Paragraph     4        3        1        2
                         MEEM-F   LAM-M    JEEM-I   AIN-M
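
Option (A) can be sketched as two steps: assign joining forms in logical order, then permute the shaped letters into display order. The sketch assumes all four letters are dual-joining, takes the display orders from the earlier table as given data, and uses illustrative names (`join_forms`, `reorder`, `DISPLAY_ORDER`).

```python
def join_forms(reading_order):
    """Label letters first-to-last as -I, -M..., -F; a lone letter stays unlabeled.

    Assumes every letter is dual-joining, as JEEM, AIN, LAM, and MEEM are.
    """
    last = len(reading_order) - 1
    if last == 0:
        return list(reading_order)
    return [
        f"{name}-{'I' if i == 0 else 'F' if i == last else 'M'}"
        for i, name in enumerate(reading_order)
    ]


def reorder(items, order):
    """Permute items (logical order) by 1-based logical indices in display order."""
    return [items[i - 1] for i in order]


LOGICAL = ["JEEM", "AIN", "LAM", "MEEM"]
# Display orders as given in the document, left to right.
DISPLAY_ORDER = {"ltr": [1, 2, 4, 3], "rtl": [4, 3, 1, 2]}

if __name__ == "__main__":
    shaped = join_forms(LOGICAL)  # shaping happens first, in memory order
    print(reorder(shaped, DISPLAY_ORDER["ltr"]))
    # ['JEEM-I', 'AIN-M', 'MEEM-F', 'LAM-M']
    print(reorder(shaped, DISPLAY_ORDER["rtl"]))
    # ['MEEM-F', 'LAM-M', 'JEEM-I', 'AIN-M']
```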

B. If we shape simply according to the resulting display order (after BIDI), we get the following:

Left-Right Paragraph     1        2        4        3
                         JEEM-F   AIN-M    MEEM-M   LAM-I

Right-Left Paragraph     4        3        1        2
                         MEEM-F   LAM-M    JEEM-M   AIN-I
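
Option (B) can be sketched by treating the whole displayed line as one Arabic word read from right to left: the rightmost letter gets the initial form and the leftmost the final form. As before, this assumes dual-joining letters, and the function names are illustrative.

```python
def join_forms(reading_order):
    """Label letters first-to-last as -I, -M..., -F (dual-joining letters assumed)."""
    last = len(reading_order) - 1
    if last == 0:
        return list(reading_order)
    return [
        f"{name}-{'I' if i == 0 else 'F' if i == last else 'M'}"
        for i, name in enumerate(reading_order)
    ]


def shape_in_display_order(visual):
    """Shape a left-to-right display sequence as if it were read right to left."""
    reading = visual[::-1]            # rightmost glyph is read first
    return join_forms(reading)[::-1]  # back to left-to-right display order


# Display sequences from the document, left to right.
LTR_PARAGRAPH = ["JEEM", "AIN", "MEEM", "LAM"]
RTL_PARAGRAPH = ["MEEM", "LAM", "JEEM", "AIN"]

if __name__ == "__main__":
    print(shape_in_display_order(LTR_PARAGRAPH))
    # ['JEEM-F', 'AIN-M', 'MEEM-M', 'LAM-I']
    print(shape_in_display_order(RTL_PARAGRAPH))
    # ['MEEM-F', 'LAM-M', 'JEEM-M', 'AIN-I']
```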

C. If we shape simply according to the resulting display order (after BIDI), but don't shape across direction-run boundaries, we get the following:

Left-Right Paragraph     1        2        4        3
                         JEEM-F   AIN-I    MEEM-F   LAM-I

Right-Left Paragraph     4        3        1        2
                         MEEM-F   LAM-I    JEEM-F   AIN-I
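
Option (C) differs from (B) only in where joining stops: each directional run is shaped on its own, so the shaper never looks across a run boundary. A minimal sketch, assuming dual-joining letters and display-ordered runs supplied by the caller (the names are illustrative):

```python
def join_forms(reading_order):
    """Label letters first-to-last as -I, -M..., -F (dual-joining letters assumed)."""
    last = len(reading_order) - 1
    if last == 0:
        return list(reading_order)
    return [
        f"{name}-{'I' if i == 0 else 'F' if i == last else 'M'}"
        for i, name in enumerate(reading_order)
    ]


def shape_within_runs(visual_runs):
    """Shape each display-ordered run independently, reading it right to left."""
    shaped = []
    for run in visual_runs:
        shaped.extend(join_forms(run[::-1])[::-1])
    return shaped


# Runs in left-to-right display order, as in the document's tables.
LTR_PARAGRAPH = [["JEEM", "AIN"], ["MEEM", "LAM"]]
RTL_PARAGRAPH = [["MEEM", "LAM"], ["JEEM", "AIN"]]

if __name__ == "__main__":
    print(shape_within_runs(LTR_PARAGRAPH))
    # ['JEEM-F', 'AIN-I', 'MEEM-F', 'LAM-I']
    print(shape_within_runs(RTL_PARAGRAPH))
    # ['MEEM-F', 'LAM-I', 'JEEM-F', 'AIN-I']
```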

D. If we simply don't shape characters with overridden direction, we get the following:

Left-Right Paragraph     1          2         4        3
                         JEEM (ج)   AIN (ع)   MEEM-F   LAM-I

Right-Left Paragraph     4          3         1        2
                         MEEM-F     LAM-I     JEEM (ج)  AIN (ع)
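
Option (D) still needs the runs to be known, but skips shaping for any run whose direction was overridden away from the letters' normal direction. The sketch below marks such runs with a boolean flag; that flag, like the function names, is an assumption made for illustration, and the letters are again assumed to be dual-joining.

```python
def join_forms(reading_order):
    """Label letters first-to-last as -I, -M..., -F (dual-joining letters assumed)."""
    last = len(reading_order) - 1
    if last == 0:
        return list(reading_order)
    return [
        f"{name}-{'I' if i == 0 else 'F' if i == last else 'M'}"
        for i, name in enumerate(reading_order)
    ]


def shape_unless_overridden(visual_runs):
    """visual_runs: list of (names in display order, direction_was_overridden)."""
    shaped = []
    for names, overridden in visual_runs:
        if overridden:
            shaped.extend(names)  # leave the nominal (unshaped) letters
        else:
            shaped.extend(join_forms(names[::-1])[::-1])
    return shaped


# Runs in left-to-right display order; JEEM AIN is the LTR-overridden run.
LTR_PARAGRAPH = [(["JEEM", "AIN"], True), (["MEEM", "LAM"], False)]
RTL_PARAGRAPH = [(["MEEM", "LAM"], False), (["JEEM", "AIN"], True)]

if __name__ == "__main__":
    print(shape_unless_overridden(LTR_PARAGRAPH))
    # ['JEEM', 'AIN', 'MEEM-F', 'LAM-I']
    print(shape_unless_overridden(RTL_PARAGRAPH))
    # ['MEEM-F', 'LAM-I', 'JEEM', 'AIN']
```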

I think the argument for (A) is that in practice it will be quite unusual to override the direction of Arabic letters, and it may not matter that the forms look odd. And (A) may be simpler to implement, since line breaks can be decided before applying the BIDI algorithm.

For (B) or (C), one could argue that the end result is less weird, and that in practice the BIDI algorithm must be applied anyway to the entire paragraph; so at that point one knows what the ordering is anyway. (C) may be simpler to implement, since one never needs to look outside of directional boundaries for shaping. (D) probably is no simpler to implement, since you still have to determine the runs before you decide whether or not to shape.

Note: it appears that both IE and NN use (C). Please try out other products to see what they do.

We could also, of course, have an approach Z:

Z. The results of shaping directionally-overridden characters are undefined, and could be any of the above.

The BIDI committee should discuss the ramifications of these approaches, hopefully developing a consensus before the next UTC meeting.