Re: shaping arabic
- To: developer at arabeyes dot org
- Subject: Re: shaping arabic
- From: Otakar Smrz <smrz at ckl dot ms dot mff dot cuni dot cz>
- Date: Fri, 10 Oct 2003 15:57:42 +0200 (CEST)
- Cc: roman at czyborra dot com
> > I also found a few characters reversed with your routine so this past
> > weekend I combined the functionality of the two routines (shape_arabic and
> > arabjoin) and created a third. The routine also does the following:
> >
> > SOURCE:
> > <Arabic1> <Latin1> <Arabic2> <Latin2> <Arabic3>
> >
> > RESULT:
> > <3cibarA> <Latin2> <2cibarA> <Latin1> <1cibarA>
>
> I'm not a fan of arabjoin and I think it is your source of problems.
> Dump it and use fribidi instead.
Hello, Chris and Nadim,
I have no unsolvable problems with arabjoin. I do not know fribidi, but
will have a look at Arabeyes.
I would like to stress the striking 'perl-way' implementation of the
shaping algorithm, which in the code at http://czyborra.com/ reads
    @uchar = # UTF-8 character chunks
        /([\x00-\x7F]|[\xC0-\xFF][\x80-\xBF]+)/g;
    # We walk through the line of text and do contextual analysis:
    for ($i = $[; $i <= $#uchar; $i = $j)
    {
        for ($b=$uchar[$j=$i]; $transparent{$c=$uchar[++$j]};){};
        # The following assignment is the heart of the algorithm.
        # It reduces the Arabic joining algorithm described on
        # pages 6-24 to 6-26 of the Arabic character block description
        # in the Unicode 2.0 Standard to four lines of Perl:
        $uchar[$i] = $a && $final{$c} && $medial{$b}
            || $final{$c} && $initial{$b}
            || $a && $final{$b}
            || $isolated{$b}
            || $b;
        $a = $initial{$b} && $final{$c};
    }
[To avoid 'undefined' warnings, you might use something like
    for ($b=$uchar[$j=$i]; $transparent{$c=$uchar[++$j]||''};){};
in the code above.]
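
To make the loop above easier to try out, here is a minimal sketch of the same decision chain written under 'use strict' on decoded characters rather than UTF-8 chunks. It is not taken from arabjoin.pl: the sub name shape, the variable names, and the tiny hand-written %forms table (four letters only) are my assumptions for illustration, and the output stays in logical order, with no reversal and no ligatures.

```perl
use strict;
use warnings;

# Hypothetical mini table: base letter => [isolated, final, initial, medial].
# Alef has no initial or medial form, so it never joins to the following letter.
my %forms = (
    "\x{0627}" => ["\x{FE8D}", "\x{FE8E}"],                          # alef
    "\x{0628}" => ["\x{FE8F}", "\x{FE90}", "\x{FE91}", "\x{FE92}"],  # beh
    "\x{062A}" => ["\x{FE95}", "\x{FE96}", "\x{FE97}", "\x{FE98}"],  # teh
    "\x{0643}" => ["\x{FED9}", "\x{FEDA}", "\x{FEDB}", "\x{FEDC}"],  # kaf
);
my (%isolated, %final, %initial, %medial);
for my $ch (keys %forms) {
    my ($iso, $fin, $ini, $med) = @{ $forms{$ch} };
    $isolated{$ch} = $iso;
    $final{$ch}    = $fin;
    $initial{$ch}  = $ini if defined $ini;
    $medial{$ch}   = $med if defined $med;
}

# Vowel marks U+064B..U+0652 are skipped when looking for the next letter.
my %transparent;
$transparent{ chr $_ } = 1 for 0x064B .. 0x0652;

sub shape {
    my ($text) = @_;
    my @uchar = split //, $text;
    my $joined;                      # true if the previous letter joins to this one
    my $i = 0;
    while ($i < @uchar) {
        my $cur = $uchar[$i];
        my $j   = $i + 1;
        $j++ while $j < @uchar && $transparent{ $uchar[$j] };
        my $next = $j < @uchar ? $uchar[$j] : '';   # '' avoids undef warnings
        # Same four-way choice as in the quoted code: medial, initial, final, isolated.
        $uchar[$i] = $joined && $final{$next} && $medial{$cur}
                  || $final{$next} && $initial{$cur}
                  || $joined && $final{$cur}
                  || $isolated{$cur}
                  || $cur;
        $joined = $initial{$cur} && $final{$next};
        $i = $j;
    }
    return join '', @uchar;
}

# kaf, teh, alef, beh: print the resulting code points in hex
my $shaped = shape("\x{0643}\x{062A}\x{0627}\x{0628}");
print join(' ', map { sprintf('%04X', ord $_) } split //, $shaped), "\n";
```

For that input the sketch picks initial kaf, medial teh, final alef and isolated beh, because alef does not join to the letter after it.
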
The rest of the script is either getting the Unicode data, or dealing with
ligatures, which may be omitted except for the compulsory lam+alif ones.
The problem you might face is that the data in the file are in utf8, and
that you will need to perform conversions like
    use Encode;
    $internal_perl_representation = decode 'utf8', $arabjoin_data;
    # or if taking arabjoin.pl as is
    $expected_by_arabjoin = encode 'utf8', $in_perl_internal_utf8;
while keeping your Arabic strings in Perl's internal representation.
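
A small round trip may make the difference between the two representations concrete. This is only a sketch with variable names of my own choosing; it uses the core Encode module and a single letter (beh, U+0628) as sample data.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# UTF-8 octets for U+0628 (beh), as they would appear in the data file
my $bytes = "\xD8\xA8";

# Decoding turns the two octets into one Perl character
my $chars = decode('UTF-8', $bytes);

# Encoding goes back to the octets that arabjoin.pl expects as is
my $again = encode('UTF-8', $chars);

printf "%d octets, %d character, U+%04X\n",
    length $bytes, length $chars, ord $chars;
```

The hash lookups in the shaping loop only match if the keys and the text are in the same representation, which is why the conversion has to happen on one side or the other.
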
If there are new implementations of the shaping algorithm, I think they
should be evaluated against this arabjoin.pl script/algorithm by Roman
Czyborra, using e.g. the Benchmark module for the developer's tests.
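
Such a comparison could look roughly like the sketch below. The two subs are stand-ins I made up so that the example runs on its own (they just uppercase ASCII text); in a real test they would call arabjoin's algorithm and the new routine. Only timethese and cmpthese come from the core Benchmark module.

```perl
use strict;
use warnings;
use Benchmark qw(timethese cmpthese);

# Stand-ins only: replace the bodies with the reference and the new shaper.
sub shape_reference { my ($s) = @_; (my $t = $s) =~ tr/a-z/A-Z/; return $t }
sub shape_candidate { my ($s) = @_; return uc $s }

my $sample = 'placeholder text ' x 50;

# Check agreement first: a faster routine with different output proves nothing.
die "outputs differ\n"
    unless shape_reference($sample) eq shape_candidate($sample);

# Run each routine 2000 times without per-run output, then print a comparison table.
my $results = timethese(2000, {
    reference => sub { shape_reference($sample) },
    candidate => sub { shape_candidate($sample) },
}, 'none');
cmpthese($results);
```

With so few iterations the timings are only indicative; a larger count, or a negative count to run for a number of CPU seconds, gives steadier figures.
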
If anything needs improvement in arabjoin, it is a clear programming
interface and 'file-encoding-independent' storage of the Unicode data,
simply using string interpolation and the \x{...} construct or so. Also
to be improved are the optionality/scope of the non-compulsory ligatures
and the script's efficiency. The shaping itself is solved excellently.
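
As an illustration of what I mean by encoding-independent storage, here is a sketch of the compulsory lam+alif data written with \x{...} escapes only, so the source file is plain ASCII whatever editor or locale is used. The hash layout and the helper lam_alef_ligature are hypothetical, not arabjoin's interface.

```perl
use strict;
use warnings;

# lam (U+0644) followed by an alef variant => [isolated ligature, final ligature]
my %lam_alef = (
    "\x{0644}\x{0622}" => ["\x{FEF5}", "\x{FEF6}"],  # alef with madda above
    "\x{0644}\x{0623}" => ["\x{FEF7}", "\x{FEF8}"],  # alef with hamza above
    "\x{0644}\x{0625}" => ["\x{FEF9}", "\x{FEFA}"],  # alef with hamza below
    "\x{0644}\x{0627}" => ["\x{FEFB}", "\x{FEFC}"],  # plain alef
);

# Hypothetical helper: return the ligature for a lam+alef pair,
# final form if the preceding letter joins to it, else isolated.
# Anything that is not a known pair is returned unchanged.
sub lam_alef_ligature {
    my ($pair, $joined_before) = @_;
    my $forms = $lam_alef{$pair} or return $pair;
    return $forms->[ $joined_before ? 1 : 0 ];
}

printf "U+%04X\n", ord lam_alef_ligature("\x{0644}\x{0627}", 0);
```

A table like this needs no decode step at load time, since the strings are already characters in Perl's internal representation.
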
So, if there are any modules to appear at CPAN, I would like them to
address the above issues, so that they are really reusable.
Thanks,
Otakar Smrz