
Re: Arabic 16-bit encodings



>From: Gregg Reynolds <unicode at arabink dot com>
>(By the way, there's the real contrary to plaintext: character codes 
>that denote grammatical semantics rather than just graphemic semantics.)

Gregg and I have been discussing this issue of encoding grammatical semantics. For the benefit of the Unicode list I'll just state my opinions in this regard.

I think it makes a lot of sense for the character encoding of Arabic to be graphemic (abstract-character based, or script based). It is also true that encoding grammatical semantics has a lot of benefits for text analysis. So how are we going to tackle this problem? Is the solution to invent a new character encoding model that encodes letters by their grammatical semantic values? I think the better approach is to continue encoding based on graphemes and to capture the grammatical semantics with markup, i.e. an XML schema.

To me this makes a lot of sense because the letters are the same letters whether or not grammatical semantics is captured, so it is sensible to encode them with the same codepoints in both grammatically-aware and non-grammatically-aware texts. Graphemic encoding also reflects the natural way Arabic text is written. When an Arabic speaker handwrites a text, he does not consciously think about the grammatical function of each letter; he is not thinking, for instance, that the alef he is writing is graphotactic rather than phonological. If he wants to analyze the grammatical semantics of a word, he can look at the letters after he has written them and identify their grammatical functions. IMHO, the best analogy for that in computer technology is to capture the grammatical semantics with markup, and to continue encoding the letters themselves graphemically.
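
Just as a rough sketch of what I have in mind (the element names, attribute names, and role values below are purely illustrative, not an existing or proposed schema), the text itself stays in ordinary Unicode codepoints and the grammatical layer rides on top as markup. For example, the final alef of katabu ("they wrote") is silent, purely orthographic, while the other letters are pronounced:

    <word form="كتبوا">                            <!-- katabu, "they wrote" -->
      <letter cp="U+0643" role="phonological"/>   <!-- kaf -->
      <letter cp="U+062A" role="phonological"/>   <!-- teh -->
      <letter cp="U+0628" role="phonological"/>   <!-- beh -->
      <letter cp="U+0648" role="phonological"/>   <!-- waw -->
      <letter cp="U+0627" role="graphotactic"/>   <!-- final silent alef -->
    </word>

Strip the markup away and you are left with plain graphemically encoded text; keep it and a text-analysis tool has the grammatical layer available without any change to the underlying character encoding.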

Kind regards,
Mete

--
Mete Kural
Touchtone Corporation
714-755-2810
--