
Re: Aralex Status



On Monday 24 November 2003 08:45, you wrote:
> On Sun, Nov 23, 2003 at 11:44:18PM +0200, Mohammed Yousif wrote:
> > Salam,
> >  Since you are using C++ (which is a good thing) I noticed that a Unicode
> >  string class would really be appreciated, but since I couldn't
> > find any standalone class I decided to write one first.
> >  I'm attaching the work-in-progress DString and DChar classes.
> >  They are not complete, but they work. I won't finish them up front;
> >  I will add only the features we want to use as we go.
> >  Another issue is (guess what?) regular expressions!!
> >  It seems to me that there are two workarounds for it:
>
> Cool -- I'll have a look at them and start working seriously on it over
> this Eid break -- right now I am coming down with a flu so I'll keep my
> answer short (about to leave work to go back home and get some rest).
>

 It seems like everyone has caught the flu :-)
 I hope you get well soon.

> >   + The first is to simply pass on raw utf8 encoded bytes and
> > match/search them against utf8 encoded bytes.
> >       To give you an example, consider this listing
> > ############################
> >   def stripDiacritics(self, ustr):
> >     "Strip diacritics from word and returns clean unicode string"
> >     return re.sub(ur'[%s%s%s%s%s%s%s%s%s]' % (FATHATAN, DAMMATAN,
> > TATWEEL, KASRATAN, FATHA, DAMMA, KASRA, SUKUN, SHADDA),
> >                   '', ustr)
> > ############################
> >       The workaround says that we should use utf8 encoded bytes for ustr
> >       (which is natural anyway) and the tricky part is to use utf8
> > encoded bytes _also_ for the constants FATHATAN...etc that the matching
> > is done against.
> >       That way it should work.
>
> I am not sure I like this solution. I prefer to keep the library/class
> that deals with any manipulation completely encoding agnostic. That's
> why aralex is fully Unicode. Also, last I checked PCRE didn't do too
> well with Arabic UTF-8.
>

 That's the job of DString. You should be able to pass it text in a wide
 range of encodings (not implemented yet), but DString stores it
 internally as UTF-8. So there shouldn't be any problem with Duali
 using UTF-8 as its single internal encoding while still
 being completely encoding agnostic :-)
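
 As a rough illustration of that idea, here is a minimal Python sketch
 (Python only to match the listing quoted above). Utf8Text is a
 hypothetical stand-in, not the real DString interface, and the method
 names are made up. It assumes the matching is done with an alternation
 of whole UTF-8 sequences, because a character class in a byte-level
 pattern would match single bytes and could split a multi-byte character.

```python
import re

# Arabic diacritics plus tatweel, as Unicode code points.
FATHATAN, DAMMATAN, KASRATAN = "\u064b", "\u064c", "\u064d"
FATHA, DAMMA, KASRA = "\u064e", "\u064f", "\u0650"
SHADDA, SUKUN, TATWEEL = "\u0651", "\u0652", "\u0640"
MARKS = (FATHATAN, DAMMATAN, TATWEEL, KASRATAN,
         FATHA, DAMMA, KASRA, SUKUN, SHADDA)

# The constants are encoded to UTF-8 too, so bytes are matched against bytes.
# Alternation keeps each two-byte sequence together as one unit.
MARKS_UTF8 = re.compile(
    b"|".join(re.escape(mark.encode("utf-8")) for mark in MARKS)
)


class Utf8Text:
    """Hypothetical stand-in for DString: any input encoding, UTF-8 inside."""

    def __init__(self, raw, encoding="utf-8"):
        # Decode from the caller's encoding, then normalise to UTF-8 bytes.
        self.data = raw.decode(encoding).encode("utf-8")

    def strip_diacritics(self):
        # Search and replace directly on the internal UTF-8 bytes.
        return Utf8Text(MARKS_UTF8.sub(b"", self.data))

    def to_bytes(self, encoding="utf-8"):
        # Hand the text back in whatever encoding the caller asks for.
        return self.data.decode("utf-8").encode(encoding)
```

 A caller holding ISO-8859-6 bytes would write
 Utf8Text(raw, "iso-8859-6").strip_diacritics().to_bytes("iso-8859-6")
 and never see the internal UTF-8 form.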

> >   + The second is to use a transliteration table for anything we pass to
> >       the regular expression library.
> >       It works like that:
> >         * convert the utf8 string to a Latin1 string using the
> > transliteration table.
> >         * convert the utf8 pattern string the same way, using the
> > transliteration table.
> >         * take the result and reverse the process using, again, the
> >           transliteration table.
>
> I think I can live with something like that as a temporary solution. I
> don't like it, but I think it is probably the best way to go. I can't
> believe I didn't think of that ;) Now I feel very stupid ;)
>
 You shouldn't ;-)

 It's just not natural. We might as well implement it as part of
 DString so that it is handled automatically (i.e. DString takes care
 of the transliteration, so it's transparent to anyone who uses
 regular expressions with Arabic).
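
 The three steps quoted above could be wrapped up roughly like this.
 It is only a sketch in Python: the table, its letter choices and the
 function names are all hypothetical, only a handful of characters are
 listed, and it assumes the input contains no Latin letters of its own
 (otherwise the reverse mapping would be ambiguous).

```python
import re

# Hypothetical transliteration table: Arabic character -> Latin-1 stand-in.
TABLE = {
    "\u0643": "k", "\u062a": "t", "\u0628": "b",   # kaf, teh, beh
    "\u064e": "a", "\u064f": "u", "\u0650": "i",   # fatha, damma, kasra
    "\u0651": "~", "\u0652": "o",                  # shadda, sukun
}
REVERSE = {latin: arabic for arabic, latin in TABLE.items()}


def to_latin(text):
    # Characters outside the table (e.g. regex syntax) pass through unchanged.
    return "".join(TABLE.get(ch, ch) for ch in text)


def to_arabic(text):
    return "".join(REVERSE.get(ch, ch) for ch in text)


def sub_translit(pattern, repl, text):
    """Transliterate pattern, replacement and subject, substitute, map back."""
    result = re.sub(to_latin(pattern), to_latin(repl), to_latin(text))
    return to_arabic(result)
```

 The caller passes Arabic in and gets Arabic out, e.g.
 sub_translit("[\u064e\u064f\u0650\u0651\u0652]", "", word), while the
 regular expression engine only ever sees Latin-1 characters.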

> >  That way we can say goodbye to this problem and move on to
> >  another one.
> >  I would go for the first one if it works well, but if it doesn't I would
> >  go for the second.
>
> I think we will go with the second solution (unless you can convince me
> otherwise) -- I can be flexible ;)
>
> >  The last issue is that I discovered that the GNU C Library has support
> >  for regular expressions so we may be able to not depend on pcre
> >  after all.
> >   http://www.gnu.org/manual/glibc-2.2.5/html_node/Regular-Expressions.htm
>
> Glibc's regex implementation is rather simplistic last I checked. But I
> wasn't looking for a latin-based regex engine at the time. I'll have
> another look at it.
>
> P.S. I prefer to keep this discussion on the 'developer' list -- you
> never know who else might have ideas or jump in to the discussion.
> Besides, people can know we're working on something. In the future,
> unless an email is explicitly marked 'private' I'll reply back on a
> public list.
>
> Regards

-- 
Mohammed Yousif
We _will_ restore OUR Jerusalem.