Re: shaping arabic

To: developer at arabeyes dot org
Subject: Re: shaping arabic
From: Otakar Smrz <smrz at ckl dot ms dot mff dot cuni dot cz>
Date: Fri, 10 Oct 2003 15:57:42 +0200 (CEST)
Cc: roman at czyborra dot com

> > I also found a few characters reversed with your routine so this past
> > weekend I combined the functionality of the two routines (shape_arabic and
> > arabjoin) and created a third.  The routine also does the following:
> > 
> > SOURCE:
> > <Arabic1> <Latin1> <Arabic2> <Latin2> <Arabic3>
> > 
> > RESULT:
> > <3cibarA> <Latin2> <2cibarA> <Latin1> <1cibarA>
> 
> I'm not a fan of arabjoin and I think it is your source of problems.
> Dump it and use fribidi instead.

Hello, Chris and Nadim,

I have no unsolvable problems with arabjoin. I do not know fribidi, but
will have a look at Arabeyes.

I would like to stress the striking 'perl-way' implementation of the 
shaping algorithm, which in the code at http://czyborra.com/ reads

    @uchar = # UTF-8 character chunks
	/([\x00-\x7F]|[\xC0-\xFF][\x80-\xBF]+)/g;

    # We walk through the line of text and do contextual analysis:
    for ($i = $[; $i <= $#uchar; $i = $j)
    {
	for ($b=$uchar[$j=$i]; $transparent{$c=$uchar[++$j]};){};

	# The following assignment is the heart of the algorithm.
	# It reduces the Arabic joining algorithm described on
	# pages 6-24 to 6-26 of the Arabic character block description
	# in the Unicode 2.0 Standard to four lines of Perl:

	$uchar[$i] =  $a && $final{$c} && $medial{$b} 
	||  $final{$c} && $initial{$b}
	||  $a && $final{$b}
	||  $isolated{$b}
	||  $b;
	$a = $initial{$b} && $final{$c};
    }

[to avoid 'undefined' warnings, you might use something like 
	for ($b=$uchar[$j=$i]; $transparent{$c=$uchar[++$j]||''};){};
 in the code above]
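
To try the loop above in isolation, the following is a minimal sketch
rather than the arabjoin code itself. It assumes character strings in Perl's
internal representation instead of UTF-8 byte chunks, and a hypothetical
three-letter form table (ALEF, BEH, LAM) plus one transparent mark (FATHA)
written with \x{...} escapes. The name shape_sketch and the table layout are
illustrative only; arabjoin builds its tables from the Unicode data and also
handles ligatures, which are left out here.

```perl
use strict;
use warnings;

# Hypothetical miniature form table: isolated, final, initial, medial.
# ALEF joins only to the preceding letter, so it has no initial/medial form.
my %forms = (
    "\x{0627}" => ["\x{FE8D}", "\x{FE8E}"],                          # ALEF
    "\x{0628}" => ["\x{FE8F}", "\x{FE90}", "\x{FE91}", "\x{FE92}"],  # BEH
    "\x{0644}" => ["\x{FEDD}", "\x{FEDE}", "\x{FEDF}", "\x{FEE0}"],  # LAM
);
my (%isolated, %final, %initial, %medial);
for my $ch (keys %forms) {
    my ($iso, $fin, $ini, $med) = @{ $forms{$ch} };
    $isolated{$ch} = $iso;
    $final{$ch}    = $fin;
    $initial{$ch}  = $ini if defined $ini;
    $medial{$ch}   = $med if defined $med;
}
my %transparent = ("\x{064E}" => 1);    # FATHA does not interrupt joining

# Takes a character string in logical order, returns it with the
# presentation forms substituted; unknown characters pass through.
sub shape_sketch {
    my @uchar  = split //, shift;
    my $joined = '';    # true if the previous letter joins onto this one
    my $j;
    for (my $i = 0; $i <= $#uchar; $i = $j) {
        my $cur = $uchar[$i];
        # Skip transparent marks to find the next joining candidate.
        $j = $i + 1;
        $j++ while $j <= $#uchar && $transparent{ $uchar[$j] };
        my $next = $j <= $#uchar ? $uchar[$j] : '';

        # Same decision chain as in the loop quoted above.
        $uchar[$i] = $joined && $final{$next} && $medial{$cur}
                  || $final{$next} && $initial{$cur}
                  || $joined && $final{$cur}
                  || $isolated{$cur}
                  || $cur;
        $joined = $initial{$cur} && $final{$next};
    }
    return join '', @uchar;
}

# BEH ALEF BEH in logical order
print join(' ', map { sprintf('U+%04X', ord $_) }
                split //, shape_sketch("\x{0628}\x{0627}\x{0628}")), "\n";
# prints: U+FE91 U+FE8E U+FE8F
```

The explicit bounds check on $j plays the same role as the ||'' guard
suggested above: the lookahead never reads past the end of the array.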

The rest of the script either loads the Unicode data or deals with
ligatures; the latter may be omitted except for the compulsory lam+alif ones.

One problem you might face is that the data in the file are encoded in
UTF-8, so you will need to perform conversions like

	use Encode;

	$internal_perl_representation =  decode 'utf8', $arabjoin_data; 

	# or if taking arabjoin.pl as is

	$expected_by_arabjoin = encode 'utf8', $in_perl_internal_utf8;

if your Arabic strings are held in Perl's internal representation.
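
As a small illustration of that round trip, here is a sketch that assumes
nothing beyond the core Encode module. The variable names are made up for
the example; the two bytes are the UTF-8 encoding of U+0628 ARABIC LETTER
BEH.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# UTF-8 bytes of U+0628 ARABIC LETTER BEH, as they would sit in a UTF-8 file.
my $utf8_bytes = "\xD8\xA8";

# Bytes -> Perl's internal character string.
my $chars = decode('utf8', $utf8_bytes);

# Internal character string -> UTF-8 bytes, as arabjoin.pl expects them.
my $bytes = encode('utf8', $chars);

printf "characters: %d, bytes: %d, code point: U+%04X\n",
    length $chars, length $bytes, ord $chars;
# prints: characters: 1, bytes: 2, code point: U+0628
```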

If there are new implementations of the shaping algorithm, I think they
should be evaluated against this arabjoin.pl script/algorithm by Roman
Czyborra, e.g. using the Benchmark module in the developers' tests.
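
Such a comparison might look roughly like the sketch below. The two
subroutines are hypothetical stand-ins (they merely reverse a string in two
different ways) and would be replaced by wrappers around arabjoin and the new
implementation; only cmpthese from the core Benchmark module is real here.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Hypothetical stand-ins for the two routines under comparison.
sub shape_reference { my $s = shift; return scalar reverse $s }
sub shape_candidate { my $s = shift; return join '', reverse split //, $s }

my $sample = "abc" x 100;

# Check that both routines agree before timing them.
die "outputs differ\n"
    unless shape_reference($sample) eq shape_candidate($sample);

# A negative count means: run each sub for at least that many CPU seconds.
cmpthese(-1, {
    reference => sub { shape_reference($sample) },
    candidate => sub { shape_candidate($sample) },
});
```

Checking that the outputs agree before timing keeps the benchmark from
rewarding a faster but incorrect routine.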

If anything in arabjoin needs improvement, it is the lack of a clear
programming interface and of 'file-encoding-independent' storage of the
Unicode data, which could simply use string interpolation and the \x{...}
construct or similar. Also worth improving are the optionality/scope of the
non-compulsory ligatures and the script's efficiency. The shaping itself is
solved excellently.

So, if there are any modules to appear at CPAN, I would like them to 
address the above issues, so that they are really reusable.


Thanks,

Otakar Smrz
