
Re: Volunteers for verifying the quran data



Meor Ridzuan Meor Yahaya wrote:
Dear all,
...

You can find everything at http://www.pakistanopensource.org/projects/quran/ . The main data is

Hi Meor,

I took a look at your data.  Excellent work.

However, I have a few suggestions.

First, there are really two verifications going on. One is to verify the underlying data; the other is to verify that your font design works correctly with the data.

As for the data, I see you have hacked Unicode a bit in order to represent elements that are not yet encoded. That is fine in itself, but the problem is that you take already-defined Unicode codepoints and give them different semantics. This will be a problem in the future. Better to use PUA (Private Use Area) codepoints.

Example: idghaam tanween. Sura 111, Ayah 5. You encode the idghaam mark as <dammatan><small low meem>, and your font is designed to render this as tanween idghaam. This is problematic because your display logic conflicts with the defined Unicode semantics. It means your text data is dependent on your font. Better to encode this as e.g. <damma><tanween idghaam>, where <tanween idghaam> is assigned a PUA codepoint (or possibly a currently unused codepoint in the Arabic code block). That way, the text is font-independent, and when Unicode gets around to encoding it officially, it will be easy to change your text to match.

In the same Ayah, you encode kasra + iqlaab as <kasratan><small low meem>. Same problem: the text is dependent on the display logic of a single font, and conflicts with Unicode semantics. In addition, <small low meem> as used in your text now has dual semantics: in one place it means idghaam, in another it means iqlaab. This one I would encode <kasra><iqlaab>.
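
To make the suggestion concrete, here is a rough Python sketch of such a re-encoding. The two PUA values (U+E000 and U+E001) are placeholders I made up for illustration, not assignments taken from your data, and the sketch assumes the two sequences above are the only places where <small low meem> carries these meanings.

```python
# Standard Unicode codepoints mentioned above.
DAMMATAN = "\u064C"        # ARABIC DAMMATAN
KASRATAN = "\u064D"        # ARABIC KASRATAN
DAMMA = "\u064F"           # ARABIC DAMMA
KASRA = "\u0650"           # ARABIC KASRA
SMALL_LOW_MEEM = "\u06ED"  # ARABIC SMALL LOW MEEM

# Hypothetical PUA assignments; the real values are a project decision.
TANWEEN_IDGHAAM = "\uE000"
IQLAAB = "\uE001"

# Font-dependent sequence -> font-independent sequence.
REENCODE = {
    DAMMATAN + SMALL_LOW_MEEM: DAMMA + TANWEEN_IDGHAAM,
    KASRATAN + SMALL_LOW_MEEM: KASRA + IQLAAB,
}

def reencode(text):
    """Replace each overloaded sequence with its single-meaning equivalent."""
    for old, new in REENCODE.items():
        text = text.replace(old, new)
    return text
```

Once Unicode encodes these marks officially, moving the text over would only need a second table of the same shape.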

As for verifying the data, I strongly encourage you *not* to verify it visually! If you do that, you're only verifying that the *display* of the text using your font matches the graphical structure of the Quranic text. But that doesn't mean that the underlying textual data is correct! Since your font does customized rendering that sometimes conflicts with Unicode semantics, I cannot infer from a graphical rendition of the text that the underlying coding structure is accurate.

What you really want to verify is that the *data* accurately represents the printed Quranic text. So you need a way of looking at the data without customised rendering. You need to go through the text and verify that 1) all Unicode codepoints are used with their Unicode-defined semantics; 2) any invented codepoints are assigned in the PUA or to currently unused Unicode codepoints; and 3) each invented codepoint is used in only one way, with one well-defined meaning.
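
Part of that check can be mechanised. The following Python sketch is only a starting point under some assumptions: it relies on the standard `unicodedata` module, so what counts as "unassigned" depends on the Unicode version bundled with the Python interpreter, and the names `classify` and `audit` are my own. It sorts every character into standard, private-use, or unassigned, and records which character precedes each invented codepoint so a reviewer can spot one used in more than one way. It cannot tell whether a standard codepoint is used with its defined semantics; that part still needs a human reader.

```python
import unicodedata
from collections import Counter, defaultdict

def classify(ch):
    """Return 'private-use', 'unassigned', or 'standard' for one character."""
    cat = unicodedata.category(ch)
    if cat == "Co":
        return "private-use"
    if cat == "Cn":
        return "unassigned"
    return "standard"

def audit(text):
    """Count characters per class and record what precedes each invented one."""
    counts = Counter()
    contexts = defaultdict(set)  # invented char -> set of preceding chars
    prev = None
    for ch in text:
        kind = classify(ch)
        counts[kind] += 1
        if kind != "standard":
            contexts[ch].add(prev)
        prev = ch
    return counts, contexts
```

An invented codepoint whose context set contains unexpected neighbours is a candidate for closer inspection against the printed text.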

I recommend you make a list of the invented codepoints you have used, showing codepoint and semantics, and make it available to help reviewers.
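
Such a list could be kept as a small table next to the data and checked against it. In this sketch the two entries are invented examples (the real list has to come from you), and `inventory` is a hypothetical helper name. It prints the documented codepoints and flags any private-use or unassigned codepoint that appears in the text but is missing from the table.

```python
import unicodedata

# Hypothetical entries for illustration; not taken from the actual data.
INVENTED = {
    0xE000: "tanween idghaam",
    0xE001: "iqlaab",
}

def inventory(text, table=INVENTED):
    """Return list lines for documented codepoints, then undocumented ones found in text."""
    lines = [f"U+{cp:04X}  {meaning}" for cp, meaning in sorted(table.items())]
    # Private-use (Co) and unassigned (Cn) characters count as invented.
    found = {ord(ch) for ch in text if unicodedata.category(ch) in ("Co", "Cn")}
    for cp in sorted(found - set(table)):
        lines.append(f"U+{cp:04X}  UNDOCUMENTED")
    return lines
```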

Since I'm stuck with Windows, I use BabelPad (http://www.babelstone.co.uk/Software/BabelPad.html). The nice thing about this is that it displays the Unicode number and name of the character under the cursor. So when I use it to look at your text file using your font, I can see both the font rendition and the Unicode values of the data. I expect an editor with similar capabilities is available for Linux, but I don't know of one.
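
Failing such an editor, a few lines of Python give a font-independent view of the data on any platform. This is a minimal sketch using the standard `unicodedata` module; `dump` is a name I chose, and characters without a name in Python's Unicode tables (PUA codepoints, for example) are shown as "<no name>".

```python
import unicodedata

def dump(text):
    """Return one 'U+XXXX  NAME' line per character, ignoring any font."""
    lines = []
    for ch in text:
        name = unicodedata.name(ch, "<no name>")
        lines.append(f"U+{ord(ch):04X}  {name}")
    return lines
```

For example, `print("\n".join(dump(line)))` for each line of the text file shows exactly which codepoints are stored, regardless of how a particular font draws them.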

I hope that helps.  Keep up the good work.

Sincerely,

gregg