Re: Regular Expressions
- To: General Arabization Discussion <general at arabeyes dot org>
- Subject: Re: Regular Expressions
- From: Nadim Shaikli <shaikli at yahoo dot com>
- Date: Sat, 20 Aug 2005 09:35:05 -0700 (PDT)
--- Abdalla Alothman <abdalla at pheye dot net> wrote:
> On Sunday 10 July 2005 16:21, Gregg Reynolds wrote:
>
> > Standard regexes use the metacharacter "." to mean "match any
> > single character". So a search pattern like "k.b" will match
> > ktb, krb, etc. but also kab, kub, k<sukuun>b, etc.
> >
> > Which is fine; but in Arabic we may want to ignore "stackers" (fatha,
> > shadda, etc.). So we need another metacharacter that means "match any
> > non-stacking character". Suppose we use ":" with this meaning. Then
> > the search pattern "k:b" will match ktb, krb, etc., but *not* kab, kub,
> > k<sukuun>b, etc.
>
> Alternatively, you can put all those marks in a C++ string, and call
> the find_first_of() member to filter out those characters. With C++
> there are many other ways to do it using the STL algorithms
> (transform, replace, etc. you can add predicates that fit your needs.)
A couple of comments in passing:
a. Perl has pretty good Unicode support; how does it handle all of this?
It should be pretty simple to test (I'll try to do this when I get
a chance) for reference, just to know the status of things now.
(I'd guess that composers are simply treated like normal characters,
so a '.' search would include harakat.)
b. This kind of relates to 'diff'ing files as well. What I'm getting at
is that there will be instances when harakat (or composing characters in
general) ought to be ignored and simply passed over as though
they don't exist at all. I would think this would be best handled via
an environment variable (export IGNORE_COMPOSERS=1 or similar), so that
when you enable this variable the composers are ignored in all apps
where searching/regex is used. This seems like a simple, feasible idea
to me that should be _very_ simple to implement; the only issue is
how to make something like this a standard that other applications
know about and follow.
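To make the two points above a bit more concrete, here is a rough
sketch in Python (picked only for brevity, not because anyone in the
thread uses it). It assumes "composers" means Unicode nonspacing marks
(general category Mn), which covers fatha, shadda, sukuun and friends.
The helper names strip_composers and search are made up for
illustration, and IGNORE_COMPOSERS is just the name proposed above; I
don't know of any existing tool that honors it.

```python
import os
import re
import unicodedata


def strip_composers(text: str) -> str:
    """Drop nonspacing combining marks (category Mn), e.g. harakat."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")


def search(pattern: str, text: str, env=None):
    """Regex search that skips composers when IGNORE_COMPOSERS=1 is set.

    'env' defaults to the process environment; passing a dict makes the
    behaviour easy to try out. Only the searched text is stripped here,
    the pattern is used as given (an assumption of this sketch).
    """
    env = os.environ if env is None else env
    if env.get("IGNORE_COMPOSERS") == "1":
        text = strip_composers(text)
    return re.search(pattern, text)


# kaf, ta, ba without any marks
KATABA_PLAIN = "\u0643\u062A\u0628"
# the same letters, each followed by a fatha (U+064E)
KATABA_VOWELLED = "\u0643\u064E\u062A\u064E\u0628\u064E"
# kaf, fatha, ba: the "kab" case from the quoted example
KAF_FATHA_BA = "\u0643\u064E\u0628"

# Point (a): a plain '.' treats the fatha like any other character.
dot_matches_haraka = re.search("\u0643.\u0628", KAF_FATHA_BA) is not None

# Point (b): the unvowelled pattern only finds the vowelled text once
# the variable is switched on.
found_default = search(KATABA_PLAIN, KATABA_VOWELLED, env={})
found_ignoring = search(KATABA_PLAIN, KATABA_VOWELLED,
                        env={"IGNORE_COMPOSERS": "1"})
```

With the variable unset the last search finds nothing, and with it
set to 1 the vowelled text matches. Stripping before the search is the
simplest route, but match offsets then refer to the stripped text, so
a real tool would need to map them back to the original.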
Salam.
- Nadim