
tanween and regex



Abdulhaq Lynch wrote:


I like this idea; however, as you've probably guessed by now, I don't like mixing what is pure text, in the sense that it changes the meaning of the words, with what indicates pronunciation. Therefore I would modify this so that IDGHAAM, IKHFAA and IQLAAB (TANWEEN) are indicated by subsequent codepoints:


TANWEEN/IQLAAB = <vowel><small nuun><iqlaab> (was using small meem)
IDGHAAM = <vowel><small nuun><idghaam> (was using shadda on subsequent letter)
IKHFAA = <vowel><small nuun><ikhfaa>  (was sequential blahblah)

and arguably, because it is redundant, I would add

IDHHAAR = <vowel><small nuun><idhhaar>

Likewise I would change the nuun with iqlaab, ikhfaa etc from

NUUN + IQLAAB was = <nuun><small meem>

to

NUUN + IQLAAB = <nuun><sukuun><iqlaab>

etc.

This has great benefits in terms of searching in that the tajweed codes can be treated as whitespace and all vowels and sukuuns are easily identified.

Hi,

Interesting design option I hadn't thought of. But I see a couple of objections:

1. It makes rendering more complex and expensive. With e.g.
<vowel><tanween><iqlaab>
the rendering engine must always check the character following <tanween> before it can decide what to do.


2. It doesn't reflect the graphic structure of the written text. The iqlaab mark never accompanies either a tanween mark or a sukuun, does it? IMO <tanween> should always generate a written mark; it's up to the rendering engine to figure out which one.

As for searching, it's true that if you can only search for one character at a time, then searching for e.g. all indefinite nouns works better with your proposal. But I recommend thinking of search functionality in terms of regular expressions, which allow character classes. So it's easy to search for any character that is a member of the class {<tanween>, <iqlaab>, <idghaam>}.

In fact, I'd go further. Rethinking encoding design from the ground up also frees us to rethink basic text processing conventions. For search, this means rethinking regular expression syntax. Standard regex syntax supports standard character classes like [[:digit:]] and [[:alpha:]]. Obviously the standard classes were designed for a particular script. With Arabic, we need other classes, like [[:haraka:]], [[:radical:]], etc., and in particular [[:tanween:]] to denote the set listed above.
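
A minimal sketch of how such named classes could be layered on top of an existing regex engine, here Python's re module. The class names, the ARABIC_CLASSES table and the expand_classes helper are all hypothetical. The members are the existing Unicode tanween and short-vowel marks, used as stand-ins for whatever a redesigned encoding would define.

```python
import re

# Hypothetical class table. The code points are the current Unicode marks,
# used only as stand-ins for the classes discussed above.
ARABIC_CLASSES = {
    "tanween": "\u064B\u064C\u064D",  # FATHATAN, DAMMATAN, KASRATAN
    "haraka": "\u064E\u064F\u0650",   # FATHA, DAMMA, KASRA
}

def expand_classes(pattern):
    """Replace each [:name:] with the members of the named class.

    Unknown names are left untouched, so the result is only a valid
    Python pattern when every name used is present in ARABIC_CLASSES.
    """
    return re.sub(
        r"\[:(\w+):\]",
        lambda m: ARABIC_CLASSES.get(m.group(1), m.group(0)),
        pattern,
    )

# Find words that end in any tanween mark.
indefinite = re.compile(r"\S+" + expand_classes("[[:tanween:]]"))
```

The expansion is a plain textual preprocessing step, so it needs no changes inside the regex engine itself.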

We can go yet further. Standard regex syntax uses the metacharacter '.' (period) to denote "any character". Well, that's useful; but in Arabic we have two fundamental classes of character: base chars and stackers (for lack of better terminology). So we can define two more metacharacters, for example ':' = any base char, and '~' = any stacker (vowel, sukuun, etc.). Then the regex:
k:b
matches any string of three base chars starting with k and ending with b, e.g. ktb, klb, ksb, etc. If we add a switch like --ignore-stackers, then k:b would match the same consonants even with vowels, e.g. kataba, kitaab, etc. The regex k~tb would match katb but not ktb.
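
The two metacharacters and the --ignore-stackers switch can be sketched as a small translator that emits an ordinary Python regex. This is an illustration only: the translate function and its ignore_stackers parameter are hypothetical, and the two code point ranges are an assumed approximation of "base char" and "stacker" in Unicode as it is now.

```python
import re

# Assumed approximations of the two classes in current Unicode.
BASE = "[\u0621-\u064A]"     # Arabic letters (the range also includes tatweel)
STACKER = "[\u064B-\u0652]"  # fathatan through sukun

def translate(pattern, ignore_stackers=False):
    """Translate ':' (any base char) and '~' (any stacker) to a Python regex.

    With ignore_stackers=True, any run of stackers is allowed after
    each base char, mimicking the proposed --ignore-stackers switch.
    """
    skip = STACKER + "*" if ignore_stackers else ""
    parts = []
    for ch in pattern:
        if ch == ":":
            parts.append(BASE + skip)
        elif ch == "~":
            parts.append(STACKER)
        else:
            parts.append(re.escape(ch) + skip)
    return "".join(parts)

K, T, B, L, FATHA = "\u0643", "\u062A", "\u0628", "\u0644", "\u064E"
kataba = K + FATHA + T + FATHA + B + FATHA

# k:b matches bare ktb and klb, and kataba only when stackers are ignored.
plain = re.compile(translate(K + ":" + B))
loose = re.compile(translate(K + ":" + B, ignore_stackers=True))
```

Only the pattern syntax changes here; the matching itself is still done by the stock engine, which is why this kind of experiment does not require a new encoding.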


There are lots of other interesting possibilities for the design of regex syntax for Arabic. (Note that we can do this even with Unicode as it is now.) There are also lots of open source regex engines. I looked at making this kind of mod a few years ago, but I'm afraid this kind of hacking is a little over my head at the moment (it's been a good five years since I've coded). I'll bet there is somebody on this list who could code up some experimental regex implementations relatively quickly.

-g