Oibane,
First of all, thanks for your comments and suggestions.
I think it should be made clear that my work is mainly about encoding
the Quran. For the first stage, I think I've accomplished my task,
which is to encode the Quran correctly based on visual appearance,
complying as much as I can with the Unicode standard. I do need some
workarounds where Unicode support is lacking.
Now, what I would like to accomplish is to make the text more useful
for other people studying the Quran. For this, searching is crucial,
and to get accurate search results the underlying text must be encoded
correctly. This is where a good solution is still missing. For example,
I think most Arabic users would use 64A for yeh, and some would
sometimes use 649. What I understood from my research is that,
traditionally, the final yeh does not come with the dots. The dotted
form was used mainly by non-Arabic speakers and was later adopted by
Arabic speakers as well. I think most Arabic speakers would key in 64A
when searching, am I right? (I am not an Arabic speaker, BTW.) If the
text were encoded as Farsi yeh or any other code, the search would miss
the word. So this is more of a practical problem that I'm trying to
solve.
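To illustrate the search problem, here is a minimal Python sketch. It
assumes that search should treat U+0649, U+064A and U+06CC as the same
letter, which is only one possible policy; the names YEH_FOLD, fold_yeh
and find_word are made up for this example. The stored text is left
untouched, and only the comparison is done on folded copies.

```python
# Sketch only: fold the yeh-like code points to U+064A for searching.
# The choice of U+064A as the common target is an assumption.
YEH_FOLD = str.maketrans({
    "\u0649": "\u064A",  # ARABIC LETTER ALEF MAKSURA -> YEH
    "\u06CC": "\u064A",  # ARABIC LETTER FARSI YEH    -> YEH
})


def fold_yeh(text: str) -> str:
    """Return a copy of text with all yeh variants mapped to U+064A."""
    return text.translate(YEH_FOLD)


def find_word(query: str, text: str) -> int:
    """Index of query in text, ignoring which yeh code point was used.

    The folding is one code point to one code point, so the index is
    also valid in the original, unfolded text. Returns -1 if not found.
    """
    return fold_yeh(text).find(fold_yeh(query))


# Text encoded with Farsi yeh, query typed with 64A.
stored = "\u0641\u06CC"   # feh + farsi yeh
typed = "\u0641\u064A"    # feh + yeh
plain_hit = stored.find(typed)         # naive search on raw code points
folded_hit = find_word(typed, stored)  # search on folded copies
```

Note that this folding also makes alef maksura match yeh, so it finds
more than a strict search would.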
FYI, I can implement any solution without any problem. If I need to
encode all occurrences of yeh (dotted or dotless) with one code point, I
can do it within a few hours, together with the required font. So I'm
not too concerned about implementation, because it can be done quite
easily. The only thing that worries me is alef maksura, the dotless
final form of yeh which represents alef. This is because one cannot
easily determine which one is alef maksura and which one is yeh just by
looking at it. I think most people here are suggesting that I not
worry about it and just treat alef maksura like a normal dotless yeh.
Regards.
On 12/29/05, Oibane <pflm52td at w6 dot dion dot ne dot jp> wrote:
Hello, there.
Let's remember that the desirable yeh is absent from Unicode now.
Thus, the problem is not which of *them* to choose, i.e. U+0649
("alif maqsura"), U+064A ("yeh"), or U+06CC ("farsi yeh"), but which
*behavior* to choose from what they have (or would "attribute" be the
better word? I guess you know what I mean), in order to add the
appropriate modification to the Unicode standard.
In this line, a Farsi yeh-like one looked most promising. At least, it
should be modified to lose the dots when accompanied by hamza
over/under or small alif, right?
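The rule can be written down as a small Python sketch. It only detects
where the dots would be dropped; the actual drawing would be up to the
font or rendering engine. The function name and the choice to scan the
whole run of combining marks after the yeh are assumptions for
illustration, not part of any standard.

```python
import unicodedata

FARSI_YEH = "\u06CC"
# Marks that, under the suggested rule, make the yeh lose its dots.
DOT_DROPPING_MARKS = {
    "\u0654",  # ARABIC HAMZA ABOVE
    "\u0655",  # ARABIC HAMZA BELOW
    "\u0670",  # ARABIC LETTER SUPERSCRIPT ALEF (the small alif)
}


def yeh_should_be_dotless(text: str, i: int) -> bool:
    """True if text[i] is Farsi yeh carrying one of the marks above."""
    if text[i] != FARSI_YEH:
        return False
    # Walk over the combining marks that directly follow the yeh.
    j = i + 1
    while j < len(text) and unicodedata.combining(text[j]) != 0:
        if text[j] in DOT_DROPPING_MARKS:
            return True
        j += 1
    return False
```

For example, yeh_should_be_dotless("\u06CC\u0654", 0) is True, while a
Farsi yeh followed directly by another letter gives False.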
There still remains some ambiguity, though. First, I admit it is partly
due to my lack of knowledge of classical Arabic. Now:
1. Thomas Milo proposed a "less ambitious approach", where a single
"yeh" is used throughout, and it drops its dots in final
position under a Qur'anic locale. If this is to be adopted, I cannot
understand the following point: today's usage sometimes requires both
the dotted and the undotted yeh in final position, so one yeh does not
suffice. Is this solution limited to meeting Meor's need?
2. On the above "farsi yeh loses its dots with hamza/small alif" suggestion.
It looks natural for today's texts, where the orthography is, I believe,
well established. Now what about classical materials
(not limited to the Qur'aan)? In the relevant era, is it always the case
that hamza or small alif is actually written with dotless yeh, while
initial/medial yeh representing the consonant "y" has two dots? Are there
no occurrences of "unwritten hamza" with dotless yeh? (If I remember
correctly, hamza was invented later by borrowing the head of `ain.)
Are Tom's dot codepoints necessary in this case, too?
What I can suggest to Meor is to be at ease for the time being and not to
hurry. The final solution is yet to come, since there is no codepoint for
dots and no unified true yeh in the current Unicode standard.
If you keep your policy clear and consistent, it will be easy to filter
the text later. (And I think yours so far is good in practice. Any policy
is OK, but yours, visual correspondence, is straightforward to understand.)
And to your first question: is there any clear criterion for whether a
final yeh is dotted or not? I don't think you have received a
clear-cut answer yet. For contemporary writing, which allows final dotted
yeh, no. You have to understand each word. For the Qur'aan, yes, your
guess is correct.
Now, since I'm far from being an expert, I propose another bold solution:
(I suppose it must have been considered in the early days of
Unicode, but I don't know. I know it can never be merged into Unicode.)
Let the totality of the encoding elements be the ini/mid/fin/isol forms
of letters. They form the true graphemes.
I dare not call them representation forms in this definition.
Today, "letters" are considered elemental, and shaped forms are
used behind the user interface. What I propose is to consider letters to
be virtual, or transparent. They get bound to keys, and texts are
encoded with actual shape elements. If bare letters are included, the
text is illegal.
This then forces yeh-hamza to be dotless, since there's no such thing as
"dotted-yeh-hamza". At the key binding level, overlap of dotless yeh +
hamza and dotted yeh + hamza is allowed. They are encoded identically.
Joiner and non-joiner are also virtual and merely key binding
candidates, etc.
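To give a rough feel for such a shape-only encoding, here is a small
Python sketch. It borrows the existing positional characters in the
Arabic Presentation Forms blocks as stand-ins for the shape elements,
which is only an assumption for illustration and not the proposal
itself. The function names, and the test for a "bare letter" (any
letter in U+0600..U+06FF), are likewise made up for this example.

```python
import unicodedata

POSITIONS = {"<isolated>", "<final>", "<initial>", "<medial>"}


def describe_form(ch: str):
    """Return (position, base letter) for a single-letter positional
    form, or None if ch is not such a form."""
    parts = unicodedata.decomposition(ch).split()
    if len(parts) == 2 and parts[0] in POSITIONS:
        return parts[0].strip("<>"), chr(int(parts[1], 16))
    return None


def uses_only_shapes(text: str) -> bool:
    """True if text contains no bare letter from the basic Arabic block.

    Under the scheme sketched above, a bare (virtual) letter in stored
    text would make the text illegal.
    """
    for ch in text:
        is_bare_letter = (
            "\u0600" <= ch <= "\u06FF"
            and unicodedata.category(ch) == "Lo"
        )
        if is_bare_letter:
            return False
    return True
```

For example, describe_form("\uFEF2") gives ("final", "\u064A"), and
uses_only_shapes("\uFEF3\u064A") is False because the second character
is a bare yeh.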
By the way, does anyone know this? I know "alif saghiirah" means the
small alif. Is there a direct translation of "dagger alif" in Arabic?
I once guessed that, since it resembles the dagger sign that indicates
a footnote, the term "dagger alif" was coined by a European orientalist...
Thank you all. Good day.
"Oibane"
pflm52td at wsitta.dion.ne.jp
# sitta is 6