
Re: Emacs Lisp Quran parser (was: TashkeelHandler: A QT C++ Class)



Asalamu alaikum wa rahmatullaah.

On Thursday 01 September 2005 02:12, Thamer Mahmoud wrote:
 
> I may be unfamiliar with string manipulation, but usually, when I see
> duplicate contents I assume that it's there for efficiency
> purposes.

Insha-Allah I will discuss that later in this message.

> Just wondering, how does your algorithm compare in 
> performance when used for searching the whole Quran? (assume common
> words or multiple aya search results ..etc)

It takes around 1.7 to 1.78 seconds (on a 1.13 GHz Pentium III mobile CPU) to
search the whole Quran with complex words entered as a search string. The text
can be downloaded from www.sultan.org/quran.zip. It is encoded in CP-1256, so
you have to convert it to UTF-8: open it with KWord and save it as Plain Text,
then select UTF-8 (it will take ages to load).

<SAMPLE OUTPUT>
~/Projects/arabic/toys/tashkeelhandler #-> tdriver

Enter Search String: مدرارا
Found match:
ألم يروا كم أهلكنا من قبلهم من قرن مكناهم في الأرض ما لم نمكن لكم وأرسلنا السماء عليهم مدرارا وجعلنا الأنهار تجري من تحتهم فأهلكناهم بذنوبهم وأنشأنا من بعدهم قرنا آخرين
Found match:
ويا قوم استغفروا ربكم ثم توبوا إليه يرسل السماء عليكم مدرارا ويزدكم قوة إلى قوتكم ولا تتولوا مجرمين
Found match:
يرسل السماء عليكم مدرارا
Finished in 1.71 seconds.

~/Projects/arabic/toys/tashkeelhandler #-> l quran-utf.txt
-rw-r--r--  1 abdalla users 1294298 2005-09-01 03:31 quran-utf.txt
~/Projects/arabic/toys/tashkeelhandler #->
<OUTPUT DONE>

Comments:
1. The searched content and the search results contain tashkeel. The string
entered by the user has no tashkeel.

2. The last aya is in surat NuuH, the one above it is in surat Huud, the
first one I forgot :) Maybe surat Al-A'raaf, Al-An'aam or Huud again.

A complex word is a word that has from one to three marks on one or more of
its letters. An example is the word MIDRARA in Surat NuuH (#71). Those
words were very tricky (for me) to trap with regular expressions... but
that was in 2003.
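
To make the idea of matching an unmarked search string against marked text
concrete, here is a minimal sketch. It is not the TashkeelHandler code and it
does not use regular expressions: it simply skips marks while comparing code
points. The function names and the set of code points treated as marks
(U+064B to U+0652, plus U+0670) are my assumptions; a real Quran text may
contain more marks.

```cpp
#include <cstddef>
#include <string>

// Assumption: "marks" are the harakat U+064B..U+0652 plus the superscript
// alef U+0670. Extend this set for other tajweed marks as needed.
bool is_tashkeel(char32_t c)
{
    return (c >= 0x064B && c <= 0x0652) || c == 0x0670;
}

// Returns true if `needle` (typed without marks) occurs in `text`,
// ignoring any marks that sit between the letters of `text`.
bool contains_ignoring_tashkeel(const std::u32string& text,
                                const std::u32string& needle)
{
    if (needle.empty()) return true;
    for (std::size_t start = 0; start < text.size(); ++start) {
        if (is_tashkeel(text[start])) continue;  // a match starts on a letter
        std::size_t t = start, n = 0;
        while (t < text.size() && n < needle.size()) {
            if (is_tashkeel(text[t])) { ++t; continue; }  // skip a mark
            if (text[t] != needle[n]) break;              // letters differ
            ++t; ++n;
        }
        if (n == needle.size()) return true;
    }
    return false;
}
```

The sketch works on UTF-32 strings so that each letter or mark is one
element; with Qt you would do the same walk over a QString.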

Timing the search is accomplished by a stopwatch class that was probably
written by Danny Kalev for the Informit C++ Guide (a very nice resource).
It's very short and easy to use. I will enclose it so you can time the
process on your computer. To use it:

1. #include "stopwatch.h" in tdriver.cc

2. In the first example file, tdriver.cc, open a block around the while
loop and create the instance as the first statement inside that block. The
time is printed when the instance is destroyed at the end of the block.

{
  Stopwatch s;
  while(...)
  {
    ....
  } //end while
} // end timer
//rest of file...

3. You might want to get rid of the counter; it will display meaningless
counts, or let it count the found instances by incrementing when a match
is found.

4. Then: g++ tdriver.cc tashkeelhandler.cc stopwatch.cc -o tdriver -lqt
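
The scoping and the match counter described in the steps above can be
sketched as follows. This is only an illustration: ScopedStopwatch repeats
the idea of the enclosed Stopwatch in one piece, and count_matches is a
hypothetical stand-in for the loop in tdriver.cc that uses a plain substring
search instead of the TashkeelHandler regex.

```cpp
#include <cstddef>
#include <ctime>
#include <iostream>
#include <string>
#include <vector>

// Same idea as the enclosed Stopwatch: report when the object is destroyed.
class ScopedStopwatch
{
public:
    ScopedStopwatch() : start(std::clock()) {}
    ~ScopedStopwatch()
    {
        std::clock_t total = std::clock() - start;
        std::cout << "Finished in " << double(total) / CLOCKS_PER_SEC
                  << " seconds." << std::endl;
    }
private:
    std::clock_t start;
};

// Hypothetical stand-in for the tdriver loop: counts the lines that
// contain `needle` and times only the loop itself.
std::size_t count_matches(const std::vector<std::string>& lines,
                          const std::string& needle)
{
    std::size_t found = 0;
    {
        ScopedStopwatch s;  // timing starts here
        for (std::size_t i = 0; i < lines.size(); ++i) {
            if (lines[i].find(needle) != std::string::npos)
                ++found;    // count found instances, as in step 3
        }
    }                       // destructor runs here and prints the time
    return found;
}
```

Note that clock() reports processor time used by the program, not
wall-clock time, so time spent waiting for user input is mostly not counted.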

To answer your questions:

1. I wouldn't search large content residing in a plain (flat) file.

2. There are two issues to consider: efficiency vs. storage. Which one
you favor is a matter of taste. With duplicate content you store the content
twice plus -- when it comes to an Arabic Quran XML file -- the tags. With
regular expressions, the file is reduced to probably 50% of that size because
the marked content is searched directly. Time is also lost with duplicated
content due to its size (you load the file in memory, the parser needs to do
its work, and so on).

3. Regular expression matching is greedy by default, so it may consume time
when the content is large. Searching the Quran (a small "large content") in
a database with a regex takes about as long as blinking as fast as you can.
The same is not true if the search content is Fi Thilaal Al-Quran, the
famous tafseer by Sayyid Qutub, or Altafseer Al-Kabeer by Al-Fakhr Al-Razi,
may their souls rest in peace.

4. Regular expressions also vary from one implementation to another.
The regular expression constructed with the class I posted will not
work with MySQL. It would, however, work with PostgreSQL (with some
changes, e.g., escapes should be doubled, once for the compiler and
once for the server). Likewise, some regex compositions that would
work with PostgreSQL will not work with QT. So there are disadvantages
to regular expressions.
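
As a small illustration of the doubled escapes mentioned in point 4, here is
a sketch of a helper that doubles every backslash in a pattern before it is
placed in an SQL string literal. The function name is hypothetical, and it
assumes the server treats the backslash as an escape character inside string
literals; it is not a substitute for proper quoting of user input.

```cpp
#include <string>

// Doubles every backslash so that a regex survives one extra layer of
// string-literal unescaping on the server side.
std::string double_backslashes(const std::string& pattern)
{
    std::string out;
    out.reserve(pattern.size() * 2);
    for (std::string::size_type i = 0; i < pattern.size(); ++i) {
        if (pattern[i] == '\\')
            out += '\\';   // emit one extra backslash
        out += pattern[i];
    }
    return out;
}
```

The other doubling, the one for the compiler, happens when you type the
pattern as a C++ string literal in the first place.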

> To create the search text, I have previously used the following Emacs
> Lisp regexp code to strip all harakat and tajweed marks:
> 
> (setq AyaData (replace-regexp-in-string "[^ﻱ-ﺀ ]" "" AyaData))

I assume, wallahu a'lam, that this character class might not work with
QT, MySQL, and PostgreSQL. It's really interesting; I didn't know
Emacs supported that character range!

Which database are you using? If it is PostgreSQL, try searching for a word
in the Quran using a regular expression that has that character class.

select something something from ht where aya ~* 'w[ya2hamza]o[ya2hamza]..';

In MySQL: select .... where aya RLIKE 'regexed-string';

where regexed-string would be a string that has the character class above.

Can you get any results?
 
> To use, just evaluate or put in your .emacs file (then restart
> emacs). Visit the file quran.ar.xml and run the function.

I will insha-Allah.

I do have a tool that generates SQL, XML, and other files from the Quran,
but it's too large to post. I can probably share it with you and others
when I have a place to put the files, bi-ithnillaah.

Wishing you and your family peace and good health.

Salam,
Abdalla Alothman
GSM: +965-662-2595
/////////////////////
// FILE: stopwatch.h
#ifndef STOPWATCH_H
#define STOPWATCH_H

#include <ctime>

// Prints the processor time used between construction and destruction.
class Stopwatch
{
   public:
            Stopwatch() : start( std::clock() ) {}
            ~Stopwatch();

   private:
             std::clock_t start;
};

#endif ////////////////// end of stopwatch.h

/////////////////////
//FILE: stopwatch.cc
#include "stopwatch.h"
#include <iostream>
using namespace std;

Stopwatch::~Stopwatch()
{
   clock_t total = clock() - start;
   // cout << "Total ticks: " << total << endl;
   cout << "Finished in " << double(total)/ CLOCKS_PER_SEC << " seconds." << endl;
}