Re: Aralex Status
- To: Development Discussions <developer at arabeyes dot org>
- Subject: Re: Aralex Status
- From: Mohammed Yousif <mhdyousif at gmx dot net>
- Date: Mon, 24 Nov 2003 16:31:36 +0200
- User-agent: KMail/1.5.1
On Monday 24 November 2003 08:45, you wrote:
> On Sun, Nov 23, 2003 at 11:44:18PM +0200, Mohammed Yousif wrote:
> > Salam,
> > Since you are using C++ (which is a good thing) I noticed that a Unicode
> > string class would really be appreciated, but since I wasn't able
> > to find any standalone class I decided to write one first.
> > I'm attaching the work-in-progress DString and DChar classes.
> > They are not complete, but they work. I won't finish them up front;
> > I will add only the features we want to use as we go.
> > Another issue is (guess what?) regular expressions!!
> > It seems to me that there are two workarounds for it:
>
> Cool -- I'll have a look at them and start working seriously on it over
> this Eid break -- right now I am coming down with a flu so I'll keep my
> answer short (about to leave work to go back home and get some rest).
>
It seems like everyone has the flu :-)
I hope you get well soon.
> > + The first is to simply pass on raw utf8 encoded bytes and
> > match/search them against utf8 encoded bytes.
> > To give you an example, consider this listing
> > ############################
> > def stripDiacritics(self, ustr):
> >     "Strip diacritics from word and returns clean unicode string"
> >     return re.sub(ur'[%s%s%s%s%s%s%s%s%s]' % (FATHATAN, DAMMATAN,
> >                   TATWEEL, KASRATAN, FATHA, DAMMA, KASRA, SUKUN, SHADDA),
> >                   '', ustr)
> > ############################
> > The workaround says that we should use utf8 encoded bytes for ustr
> > (which is natural anyway) and the tricky part is to use utf8
> > encoded bytes _also_ for the constants FATHATAN...etc that the matching
> > is done against.
> > That way it should work.
>
> I am not sure I like this solution. I prefer to keep the library/class
> that deals with any manipulation completely encoding agnostic. That's
> why aralex is fully Unicode. Also, last I checked PCRE didn't do too
> well with Arabic UTF-8.
>
That's the job of DString: you should be able to pass it text in a wide
range of encodings (not implemented yet), and DString stores it
internally as UTF-8. So there shouldn't be any problem using UTF-8 as
the unified encoding used internally by Duali while still
being completely encoding agnostic :-)
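
To make the first workaround concrete, here is a minimal Python 3 sketch. It
is an illustration only, not code from aralex or Duali. The constant names
follow the listing above and their values are the standard Unicode code
points; MARKS, MARKS_UTF8 and strip_diacritics_utf8 are names made up for
this example. One thing to watch: each of these marks is two bytes in UTF-8,
so the pattern has to be an alternation of whole byte sequences rather than
a [...] class, which would match single bytes.

```python
import re

# Standard Unicode code points for the marks named in the listing above.
TATWEEL = "\u0640"
FATHATAN = "\u064b"
DAMMATAN = "\u064c"
KASRATAN = "\u064d"
FATHA = "\u064e"
DAMMA = "\u064f"
KASRA = "\u0650"
SHADDA = "\u0651"
SUKUN = "\u0652"

# Hypothetical helper names, used only in this sketch.
MARKS = (FATHATAN, DAMMATAN, TATWEEL, KASRATAN, FATHA,
         DAMMA, KASRA, SUKUN, SHADDA)

# Byte-level pattern: an alternation of complete UTF-8 sequences.
# A byte character class would strip individual bytes and corrupt the text.
MARKS_UTF8 = re.compile(
    b"|".join(re.escape(mark.encode("utf-8")) for mark in MARKS)
)


def strip_diacritics_utf8(raw):
    """Remove the marks above from a UTF-8 encoded byte string."""
    return MARKS_UTF8.sub(b"", raw)
```

Both the subject and the constants are UTF-8 bytes here, which is exactly
the "tricky part" mentioned above. Because a UTF-8 lead byte never appears
as a continuation byte, the alternation cannot match across a character
boundary.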
> > + The second is to use a transliteration table for anything we pass to
> > the regular expression library.
> > It works like this:
> > * convert the utf8 string to a Latin1 string using the
> > transliteration table.
> > * convert the utf8 pattern string the same way, using the
> > transliteration table.
> > * run the match, then take the result and reverse the process using,
> > again, the transliteration table.
>
> I think I can live with something like that as a temporary solution. I
> don't like it, but I think it is probably the best way to go. I can't
> believe I didn't think of that ;) Now I feel very stupid ;)
>
You shouldn't ;-)
It's just not a natural approach. We might as well implement it as part
of DString so that it is handled automatically (i.e. DString takes care
of the transliteration, so it's transparent to anyone who uses
regular expressions with Arabic).
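
A rough Python 3 sketch of the transliteration round trip, under loud
assumptions: the table is hypothetical and simply maps U+0621..U+0652 onto
the Latin-1 range 0xC0..0xF1, and the input is assumed to contain nothing
else outside ASCII. The names TO_LATIN1, FROM_LATIN1, to_latin1, from_latin1
and sub_transliterated are invented for this example; a real DString would
do the equivalent in C++.

```python
import re

# Hypothetical table: Arabic HAMZA..SUKUN (U+0621..U+0652, 50 code points)
# mapped one-to-one onto Latin-1 0xC0..0xF1. The mapping keeps code point
# order, so ranges inside a character class stay valid after conversion.
FIRST, LAST = 0x0621, 0x0652
BASE = 0xC0

TO_LATIN1 = {cp: BASE + (cp - FIRST) for cp in range(FIRST, LAST + 1)}
FROM_LATIN1 = {latin: cp for cp, latin in TO_LATIN1.items()}


def to_latin1(text):
    """Transliterate, then encode so the regex engine sees single bytes."""
    # Raises UnicodeEncodeError if the text has characters the table
    # does not cover (an assumption of this sketch, not a design decision).
    return text.translate(TO_LATIN1).encode("latin-1")


def from_latin1(raw):
    """Reverse the process: decode the bytes and map back to Arabic."""
    return raw.decode("latin-1").translate(FROM_LATIN1)


def sub_transliterated(pattern, repl, text):
    """re.sub on transliterated data, hiding the round trip from the caller."""
    result = re.sub(to_latin1(pattern), to_latin1(repl), to_latin1(text))
    return from_latin1(result)
```

The caller writes the pattern in Arabic, for example
sub_transliterated("[\u064b-\u0652\u0640]", "", word), and never sees the
Latin-1 form. The obvious weakness is that genuine Latin-1 text in the
0xC0..0xF1 range would be mistaken for Arabic on the way back, which is one
reason to treat this as a temporary solution.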
> > That way we can say goodbye to this problem and move on to
> > another one.
> > I would go for the first one if it works well, but if it doesn't I would
> > go for the second.
>
> I think we will go with the second solution (unless you can convince me
> otherwise) -- I can be flexible ;)
>
> > The last issue is that I discovered that the GNU C Library has support
> > for regular expressions, so we may not need to depend on PCRE
> > after all.
> > http://www.gnu.org/manual/glibc-2.2.5/html_node/Regular-Expressions.htm
>
> Glibc's regex implementation is rather simplistic last I checked. But I
> wasn't looking for a latin-based regex engine at the time. I'll have
> another look at it.
>
> P.S. I prefer to keep this discussion on the 'developer' list -- you
> never know who else might have ideas or jump in to the discussion.
> Besides, people can know we're working on something. In the future,
> unless an email is explicitly marked 'private' I'll reply back on a
> public list.
>
> Regards
--
Mohammed Yousif
We _will_ restore OUR Jerusalem.