
Regular Expressions (Was Re: Arabization, techniques and problems)

To: general at arabeyes dot org
Subject: Regular Expressions (Was Re: Arabization, techniques and problems)
From: Abdalla Alothman <abdalla at pheye dot net>
Date: Sat, 20 Aug 2005 00:26:49 +0300
Organization: Pheye Technologique, GT&C
User-agent: KMail/1.8

Salam (peace), Gregg.

I saw your Emails, but I was about to travel with the family.

On Sunday 10 July 2005 16:21, Gregg Reynolds wrote:

> Here's a brief example of what I mean.  Standard regexes use the 
> metacharacter "." to mean "match any single character".  So a search 
> pattern like "k.b" will match ktb, krb, etc. but also kab, kub, 
> k<sukuun>b, etc.
>
> Which is fine; but in Arabic we may want to ignore "stackers" (fatha, 
> shadda, etc.).  So we need another metacharacter that means "match any 
> non-stacking character".  Suppose we use ":" with this meaning.  Then 
> the search pattern "k:b" will match ktb, krb, etc., but *not* kab, kub, 
> k<sukuun>b, etc.
> 
> If you start by asking "what kinds of searches might an Arabic speaker 
> want to do" and then think about how regexes could make such searches 
> natural and easy, you can come up with a lot of ideas.
> 
> -gregg

What  you are  probably referring  to is  what I  always define  in my
source code as "rule1."

Basically, you first construct a character class that contains all
the diacritical marks (DMs):

[dm1dm2dm3dm4dm5]

Then you need to tell the regexp engine how many times one of those
characters may appear. This is best done with a quantifier, so it
becomes:

[dm1dm2dm3dm4dm5]{0,2}

{0,2}: Appears zero to two times.

It is rare for a letter to carry more than two DMs, but if strange
things can happen in your data, replace the 2 with a 3.

Now, put all that in a group.

([dm1dm2dm3dm4dm5]{0,2})
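
As a rough sketch of how such a group might be used (in Python's re
module, which is only one implementation among many): the code below
assumes that dm1..dm5 stand for the Arabic marks U+064B through U+0652
(fathatan through sukun), and skeleton_pattern() is a hypothetical
helper, not something from the original source code. It places the
group after every letter of a bare skeleton so that vowelled and
unvowelled spellings both match.

```python
import re

# Assumption: the marks of interest are U+064B..U+0652
# (fathatan, dammatan, kasratan, fatha, damma, kasra, shadda, sukun).
DIACRITICS = "\u064B\u064C\u064D\u064E\u064F\u0650\u0651\u0652"

# "rule1": zero to two diacritical marks, captured in a group.
RULE1 = "([" + DIACRITICS + "]{0,2})"


def skeleton_pattern(letters):
    """Hypothetical helper: allow 0-2 marks after every letter."""
    return re.compile("".join(re.escape(ch) + RULE1 for ch in letters))


ktb = skeleton_pattern("\u0643\u062A\u0628")  # kaf, ta, ba

bare = "\u0643\u062A\u0628"                              # no marks at all
vowelled = "\u0643\u064E\u062A\u064E\u0628\u064E"        # fatha on each letter
doubled = "\u0643\u064E\u062A\u0651\u064E\u0628\u064E"   # shadda + fatha on ta
other = "\u0643\u0631\u0628"                             # kaf, ra, ba

for text in (bare, vowelled, doubled, other):
    print(text, bool(ktb.fullmatch(text)))
```

The first three strings match and the last one does not. The word
with shadda plus fatha on one letter is the case that needs the upper
bound of 2.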

And that's it. If you need help using it, please don't hesitate to
ask. Please note that I only read my non-business Emails on weekends
(Wednesdays through Fridays, here in Kuwait). I usually delete all
mailing list messages if I have been traveling for a long time (for
obvious reasons).

Alternatively, you can put all those marks in a C++ string and call
the find_first_of() member to locate those characters so they can be
filtered out. With C++ there are many other ways to do it using the
STL algorithms (transform, replace, etc.), and you can add predicates
that fit your needs.
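
The same filtering idea, sketched in Python rather than C++ to keep
it short: first_mark_index() is a hypothetical stand-in for what
find_first_of() does (report where the first mark sits), and
strip_marks() removes the marks so that plain string comparison or
substring search can be used instead of a regexp. The set of marks
(U+064B through U+0652) is an assumption; adjust it to your data.

```python
# Assumption: the marks to filter are U+064B..U+0652.
DIACRITICS = "\u064B\u064C\u064D\u064E\u064F\u0650\u0651\u0652"


def first_mark_index(text, marks=DIACRITICS):
    """Index of the first character of `text` found in `marks`, or -1.

    Roughly what find_first_of() reports in C++ (which uses npos
    instead of -1).
    """
    for i, ch in enumerate(text):
        if ch in marks:
            return i
    return -1


def strip_marks(text, marks=DIACRITICS):
    """Return `text` with every character listed in `marks` removed."""
    return text.translate({ord(m): None for m in marks})


bare = "\u0643\u062A\u0628"                        # kaf, ta, ba
vowelled = "\u0643\u064E\u062A\u064E\u0628\u064E"  # same letters with fatha

print(first_mark_index(vowelled))       # position of the first fatha
print(strip_marks(vowelled) == bare)    # marks removed, letters kept
```

Once both the query and the data are stripped this way, an ordinary
equality test or substring search does the work, with no regexp
engine involved.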

BTW, although regexps are nice, they are not always portable from one
implementation to another. I once choked on the regexp above when it
was tried with MySQL's implementation, and other implementations
produce unexpected and unacceptable results. Avoiding regexps in favor
of simple filtering techniques like the one above is safer and more
efficient. Regexps also consume a lot of computational power if you
are searching huge amounts of data.

In summary, if you construct a perfect regular expression for a
certain implementation, don't count on it working with another
implementation. I do have some regexps that work with PostgreSQL but
fail on MySQL and other implementations (e.g., Boost regex). In
addition, most regex implementations are poor with Unicode characters.
In PostgreSQL, for example, word boundaries do not work with Arabic.
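
To illustrate the kind of surprise meant here, the sketch below uses
Python's re module, not PostgreSQL, so it says nothing about how
PostgreSQL itself behaves. In Python, \w does not cover the Arabic
combining marks, so \b finds a "word boundary" between a letter and
its diacritic. The set of marks (U+064B through U+0652) is an
assumption.

```python
import re

# Assumption: the marks of interest are U+064B..U+0652.
DIACRITICS = "\u064B\u064C\u064D\u064E\u064F\u0650\u0651\u0652"

vowelled = "\u0643\u064E\u062A\u064E\u0628\u064E"  # kaf, ta, ba with fatha

# With the marks in place, each letter is reported as its own "word",
# because every fatha is a non-word character for this engine.
words_raw = re.findall(r"\b\w+\b", vowelled)

# After removing the marks, the three letters form a single word again.
stripped = vowelled.translate({ord(m): None for m in DIACRITICS})
words_stripped = re.findall(r"\b\w+\b", stripped)

print(len(words_raw), len(words_stripped))
```

Stripping the marks before matching is one way to sidestep the
problem in this engine; whether the same trick is enough elsewhere
has to be tested per implementation.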

I hope this will be of some help.

Wishing you and your family peace and good health.

Salam (peace),
Abdalla S. Alothman

Follow-Ups:
- Re: Regular Expressions
  - From: Nadim Shaikli
