Re: Volunteers for verifying the quran data
- To: General Arabization Discussion <general at arabeyes dot org>
- Subject: Re: Volunteers for verifying the quran data
- From: Gregg Reynolds <gar at arabink dot com>
- Date: Sat, 09 Jul 2005 08:44:23 -0500
- User-agent: Mozilla Thunderbird 1.0.2 (Windows/20050317)
Meor Ridzuan Meor Yahaya wrote:
Dear all,
...
You can find everything at
http://www.pakistanopensource.org/projects/quran/ . The main data is
Hi Meor,
I took a look at your data. Excellent work.
However, I have a few suggestions.
First, there are really two verifications going on. One is to verify
the underlying data; the other is to verify that your font design works
correctly with the data.
As for the data, I see you have hacked Unicode a bit in order to get
elements that are not yet encoded. That is fine in itself; the problem is
that you take already defined Unicode codepoints and give them different
semantics. This will cause problems in the future. It would be better to
use Private Use Area (PUA) codepoints.
Example: idghaam tanween. Sura 111, Ayah 5. You encode the idghaam
mark as <dammatan><small low meem>, and your font is designed to render
this as tanween idghaam. This is problematic because your display logic
conflicts with the defined Unicode semantics. It means your text data
is dependent on your font. Better to encode this as e.g.
<damma><tanween idghaam>, where <tanween idghaam> is assigned a PUA
codepoint (or possibly a currently unused point in the Arabic code
block). That way, the text is font-independent, and when Unicode gets
around to encoding it officially, it will be easy to change your text to
match.
In the same Ayah, you encode kasra + iqlaab as <kasratan><small low
meem>. Same problem: the text is dependent on the display logic of a
single font, and conflicts with Unicode semantics. In addition, <small
low meem> as used in your text now has dual semantics: in one place it
means idghaam, in another it means iqlaab. This one I would encode
<kasra><iqlaab>.
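
As a rough sketch, the two re-encodings suggested above could be applied
with a few lines of Python. The PUA values U+E000 and U+E001 are
placeholders chosen purely for illustration, not an agreed assignment, and
the sketch assumes the two marks are stored as adjacent codepoints in
exactly the order described above.

```python
# Standard Unicode codepoints referred to above.
DAMMATAN = "\u064C"        # ARABIC DAMMATAN
KASRATAN = "\u064D"        # ARABIC KASRATAN
DAMMA = "\u064F"           # ARABIC DAMMA
KASRA = "\u0650"           # ARABIC KASRA
SMALL_LOW_MEEM = "\u06ED"  # ARABIC SMALL LOW MEEM

# Hypothetical PUA assignments (assumption: the project picks its own values).
TANWEEN_IDGHAAM = "\uE000"
IQLAAB = "\uE001"

# Font-dependent sequence -> font-independent sequence.
REPLACEMENTS = {
    DAMMATAN + SMALL_LOW_MEEM: DAMMA + TANWEEN_IDGHAAM,
    KASRATAN + SMALL_LOW_MEEM: KASRA + IQLAAB,
}


def reencode(text: str) -> str:
    """Replace each font-dependent sequence with its PUA-based equivalent."""
    for old, new in REPLACEMENTS.items():
        text = text.replace(old, new)
    return text
```

After such a pass, each invented mark has exactly one codepoint and one
meaning, so a later switch to official Unicode codepoints becomes a simple
one-to-one substitution.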
As for verifying the data, I strongly encourage you *not* to verify it
visually! If you do that, you're only verifying that the *display* of
the text using your font matches the graphical structure of the Quranic
text. But that doesn't mean that the underlying textual data is
correct! Since your font does customized rendering that sometimes
conflicts with Unicode semantics, I cannot infer from a graphical
rendition of the text that the underlying coding structure is accurate.
What you really want to verify is that the *data* accurately represents
the printed Quranic text. So you need a way of looking at the data
without customized rendering. You need to go through the text and
verify that 1) all Unicode codepoints are used with their
Unicode-defined semantics; 2) any invented codepoints are assigned in the
PUA or to currently unused Unicode codepoints; and 3) each invented
codepoint is used in only one way, with one well-defined meaning.
I recommend you make a list of the invented codepoints you have used,
showing codepoint and semantics, and make it available to help reviewers.
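
A minimal sketch of how such a list could be checked against the data,
again in Python. The `INVENTED` table and its two entries are hypothetical
examples, not the project's actual assignments. The script can only flag
private-use or unassigned codepoints that are missing from the table;
whether a standard codepoint is used with its Unicode-defined semantics
still has to be judged by a human reviewer.

```python
import unicodedata
from collections import Counter

# Hypothetical list of invented codepoints: codepoint -> intended semantics.
INVENTED = {
    0xE000: "tanween idghaam",
    0xE001: "iqlaab",
}


def audit(text):
    """Return (codepoint, count, status) for every distinct character."""
    report = []
    for ch, count in sorted(Counter(text).items()):
        cp = ord(ch)
        category = unicodedata.category(ch)
        if cp in INVENTED:
            status = "invented: " + INVENTED[cp]
        elif category in ("Co", "Cn"):
            # Private-use or unassigned, but not declared in the table.
            status = "UNDECLARED private-use or unassigned codepoint"
        else:
            status = unicodedata.name(ch, "<unnamed>")
        report.append((f"U+{cp:04X}", count, status))
    return report
```

Printing the rows of `audit(open(path, encoding="utf-8").read())` gives
reviewers a per-codepoint inventory of the whole file; `path` here stands
for wherever the data file happens to live.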
Since I'm stuck with Windows, I use BabelPad
(http://www.babelstone.co.uk/Software/BabelPad.html). The nice thing
about it is that it displays the Unicode number and name of the
character under the cursor. So when I use it to look at your text file
with your font, I can see both the font rendition and the Unicode
values of the data. I expect an editor with similar capabilities is
available for Linux, but I don't know of one.
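
Failing a suitable editor, a platform-independent way to look at the data
without any font rendering is to dump the codepoint and Unicode name of
each character. This is a small sketch using only Python's standard
`unicodedata` module; the function name `describe` is made up for the
example.

```python
import unicodedata


def describe(text):
    """Return one 'U+XXXX  NAME' line per character, with no font involved."""
    lines = []
    for ch in text:
        # Private-use, unassigned, and control characters have no name.
        name = unicodedata.name(ch, "<no name: private use, unassigned, or control>")
        lines.append(f"U+{ord(ch):04X}  {name}")
    return lines


for line in describe("\u064C\u06ED"):
    print(line)
# U+064C  ARABIC DAMMATAN
# U+06ED  ARABIC SMALL LOW MEEM
```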
I hope that helps. Keep up the good work.
Sincerely,
gregg