Regular Expressions (Was Re: Arabization, techniques and problems)
- To: general at arabeyes dot org
- Subject: Regular Expressions (Was Re: Arabization, techniques and problems)
- From: Abdalla Alothman <abdalla at pheye dot net>
- Date: Sat, 20 Aug 2005 00:26:49 +0300
- Organization: Pheye Technologique, GT&C
- User-agent: KMail/1.8
Salam (peace), Gregg.
I saw your Emails, but I was about to travel with the family.
On Sunday 10 July 2005 16:21, Gregg Reynolds wrote:
> Here's a brief example of what I mean. Standard regexes use the
> metacharacter "." to mean "match any single character". So a search
> pattern like "k.b" will match ktb, krb, etc. but also kab, kub,
> k<sukuun>b, etc.
>
> Which is fine; but in Arabic we may want to ignore "stackers" (fatha,
> shadda, etc.). So we need another metacharacter that means "match any
> non-stacking character". Suppose we use ":" with this meaning. Then
> the search pattern "k:b" will match ktb, krb, etc., but *not* kab, kub,
> k<sukuun>b, etc.
>
> If you start by asking "what kinds of searches might an Arabic speaker
> want to do" and then think about how regexes could make such searches
> natural and easy, you can come up with a lot of ideas.
>
> -gregg
What you are probably referring to is what I always define in my
source code as "rule1."
Basically, you first construct a character class that contains all
the diacritical marks:
[dm1dm2dm3dm4dm5]
Then you need to tell the regexp engine that one of those characters
may appear a given number of times. This is best done with a
quantifier, so it becomes:
[dm1dm2dm3dm4dm5]{0,2}
{0,2}: the class may match zero to two times.
It is rare for a letter to carry more than two DMs, but if your text
may contain more, replace 2 with 3.
Now, put all that in a group:
([dm1dm2dm3dm4dm5]{0,2})
And that's it. If you need help using it, please don't hesitate to
ask. Please note that I only read my non-business Emails on weekends
(Wednesdays through Fridays, here in Kuwait). I usually delete all
mailing list messages if I have been traveling for a long time (for
obvious reasons).
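As a rough illustration of how the group might be used, here is a
minimal C++ sketch built on std::wregex. It is only a sketch under
stated assumptions: the mark list (the eight harakat U+064B through
U+0652), the name kRule1Marks, and the helpers diacritic_tolerant()
and matches() are all hypothetical, and the search word is assumed to
contain no regex metacharacters. As with any regexp, the behavior may
differ between implementations.

```cpp
#include <regex>
#include <string>

// Assumption: the diacritical marks are the Arabic harakat
// U+064B..U+0652 (fathatan through sukun). Extend as needed.
static const std::wstring kRule1Marks =
    L"\u064B\u064C\u064D\u064E\u064F\u0650\u0651\u0652";

// Hypothetical helper: append the optional group ([marks]{0,2}) after
// every letter of `word`, so the resulting pattern accepts the word
// with or without diacritics on each letter.
std::wregex diacritic_tolerant(const std::wstring& word) {
    const std::wstring rule1 = L"([" + kRule1Marks + L"]{0,2})";
    std::wstring pattern;
    for (wchar_t ch : word) {
        pattern += ch;     // base letter (assumed not a metacharacter)
        pattern += rule1;  // zero to two marks may follow it
    }
    return std::wregex(pattern);
}

// True if `text`, taken as a whole, is `word` plus optional marks.
bool matches(const std::wstring& word, const std::wstring& text) {
    return std::regex_match(text, diacritic_tolerant(word));
}
```

Calling matches() with the bare letters kaf, ta, ba as the word is
meant to accept the same three letters whether or not each one is
followed by up to two marks, and to reject text with a different
letter or with more than two marks on one letter.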
Alternatively, you can put all those marks in a C++ string and call
the find_first_of() member to locate those characters so they can be
filtered out. C++ offers many other ways to do it with the STL
algorithms (transform, replace, etc.), to which you can add predicates
that fit your needs.
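The filtering approach could look roughly like the following. This is
a sketch, not a definitive implementation: the mark list (U+064B
through U+0652) and the names kFilterMarks, strip_with_find_first_of()
and strip_with_remove_if() are assumptions made for illustration, and
wide strings are used on the assumption that wchar_t can hold each
Arabic code point.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// Assumption: the marks to filter are the harakat U+064B..U+0652.
static const std::wstring kFilterMarks =
    L"\u064B\u064C\u064D\u064E\u064F\u0650\u0651\u0652";

// Locate each mark with find_first_of() and erase it in place.
std::wstring strip_with_find_first_of(std::wstring text) {
    std::size_t pos = 0;
    while ((pos = text.find_first_of(kFilterMarks, pos)) !=
           std::wstring::npos) {
        text.erase(pos, 1);  // pos now points at the next character
    }
    return text;
}

// Same result with an STL algorithm and a predicate.
std::wstring strip_with_remove_if(std::wstring text) {
    auto is_mark = [](wchar_t c) {
        return kFilterMarks.find(c) != std::wstring::npos;
    };
    text.erase(std::remove_if(text.begin(), text.end(), is_mark),
               text.end());
    return text;
}
```

Stripping the marks from both the search term and the searched text
before a plain comparison avoids the regexp engine entirely.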
BTW, although regexps are nice, they are not always portable from one
implementation to another. I choked once when the regexp above was
tried with MySQL's implementation. Other implementations produce
unacceptable and unexpected results (avoiding regexps in favor of
simple filtering techniques like the above is safer and more
efficient). Regexps also consume considerable computational power if
you are searching huge amounts of data.
In summary, if you construct a perfect regular expression for a
certain implementation, don't count on it working with another
implementation. I do have some regexps that work with PostgreSQL but
fail on MySQL and other implementations (e.g., Boost regex). In
addition, most regex implementations are poor with Unicode characters.
In PostgreSQL, for example, word boundaries do not work with Arabic.
I hope this is of some help.
Wishing you and your family peace and good health.
Salam (peace),
Abdalla S. Alothman