
Emacs Lisp Quran parser (was: TashkeelHandler: A QT C++ Class)



Abdalla Alothman writes:
 > 
 > With #2, there would be no need to strip the diacritical marks from any
 > diacritically marked text, which eliminates the necessity to provide a
 > duplicated content to be the searching content while the marked text
 > becomes only the visible content.
 > 

I may be unfamiliar with string manipulation, but when I see
duplicated content I usually assume it is there for efficiency.
Just wondering: how does your algorithm perform when searching the
whole Quran (assuming common words or searches that return many
ayat)?

To create the search text, I have previously used the following Emacs
Lisp regexp code to strip all harakat and tajweed marks:

(setq AyaData (replace-regexp-in-string "[^\u0621-\u064A ]" "" AyaData))

The results look fine to me, but I might be missing a letter or
two ;-)
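
For illustration, here is a minimal sketch of that stripping step
applied to a short sample string. The helper name my-strip-harakat is
hypothetical (it is not part of the script below), and the code points
are written as \u escapes so the range reads unambiguously as U+0621
to U+064A. Note that anything outside that range, such as U+0671 (alef
wasla), is removed as well, which may or may not be what you want.

```elisp
(defun my-strip-harakat (text)
  "Return TEXT with everything except U+0621..U+064A and spaces removed.
Hypothetical helper, for illustration only."
  (replace-regexp-in-string "[^\u0621-\u064A ]" "" text))

;; Sample input: two fully vowelled words, written as escapes.
(my-strip-harakat
 "\u0628\u0650\u0633\u0652\u0645\u0650 \u0627\u0644\u0644\u0651\u064E\u0647\u0650")
;; => "\u0628\u0633\u0645 \u0627\u0644\u0644\u0647"
```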

In any case, the above line was part of a small Emacs Lisp script that
parses the Arabeyes Quran project XML data and automatically writes
SQL statements to a new buffer.

I created this to learn Lisp and to fill an SQL database for personal
use, but someone else may find it useful. You can modify the
printf-like format string near the end of the function for different
output styles.
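
As a rough sketch of what such a change could look like, the format
string could emit tab-separated values instead of SQL. The function
name my-quran-format-tsv and the sample values are hypothetical. The
point is only that format fills each %s in argument order.

```elisp
(defun my-quran-format-tsv (sura-num sura-name aya-num
                            search-text quran-text last-aya)
  "Return one tab-separated output line for an aya.
Hypothetical example of an alternative output style."
  (format "%s\t%s\t%s\t%s\t%s\t%s\n"
          sura-num sura-name aya-num search-text quran-text last-aya))

;; Placeholder values, not real data from quran.ar.xml.
(my-quran-format-tsv "1" "Al-Fatiha" "1" "plain" "marked" 7)
;; => "1\tAl-Fatiha\t1\tplain\tmarked\t7\n"
```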

To use it, evaluate the code below or put it in your .emacs file (then
restart Emacs). Visit the file quran.ar.xml and run
M-x quran-extract-data.

Salaam,
Thamer Mahmoud


;; Function: quran-extract-data
;; Author: Thamer Mahmoud <thamer at newkuwait dot org>
;; Keywords: Arabic, Quran, Arabeyes
;; License: GPL
;; Copyright (C) 2005 Thamer Mahmoud

;; Code

;; Header and footer content
(defvar quran-header-text "BEGIN;\n"
  "String inserted at the start of the Quran output buffer.")
(defvar quran-footer-text "COMMIT;\n"
  "String inserted at the end of the Quran output buffer.")

(defun quran-extract-data ()
  "Convert Arabeyes Quran XML data in the current buffer to formatted text.
The result is written to a new buffer.  Also generate the last
aya number for each sura, and a searchtext if none is found.
Modify the `format' call below for different output
styles (printf-like)."
  (interactive)

  (let ((tempbuf (get-buffer-create "Quran-temp")) 
	SuraNum SuraName AyaNum AyaData AyaDataEx 
	AyaLastNum sura-point)

    ;; Insert header
    (with-current-buffer tempbuf
      (insert quran-header-text))

    ;; Iterate on every sura
    (save-excursion
      (while (re-search-forward "<sura.id=\"\\([0-9]*\\)\".name=\"\\(.*\\)\">"
				(point-max) t)
	(setq SuraNum (match-string 1))
	(setq SuraName (match-string 2))

	;; Get next sura limit and last aya number	    
	(save-excursion
	  (search-forward "sura")
	  (setq sura-point (point)))
	(save-excursion
	  (setq AyaLastNum 0)
	  (while (search-forward "aya id" sura-point t)
	    (setq AyaLastNum (1+ AyaLastNum))))

	;; Go over all the ayat
	(while (re-search-forward "<aya.id=\"\\([0-9]*\\)\">" sura-point t)
	  (setq AyaNum (match-string 1))
	  (message "Now processing Aya: %s from Sura: %s" AyaNum SuraNum)
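	  ;; Limit both searches to the current aya, so that a missing
	  ;; searchtext is not picked up from a later aya.
	  (let ((aya-end (save-excursion
			   (or (search-forward "</aya>" sura-point t)
			       sura-point))))
	    (if (re-search-forward "<searchtext>\\(.*\\)\n?</searchtext>" aya-end t)
		(setq AyaData (match-string 1))
	      (setq AyaData nil))
	    (re-search-forward "<qurantext>\\(.*\\)\n?</qurantext>" aya-end t)
	    (setq AyaDataEx (match-string 1)))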

	  ;; If no searchtext found, automatically create one from the
	  ;; quran text.  The search data is created by stripping all
	  ;; harakat and keeping only spaces and characters in the
	  ;; range [U+0621->U+064A]
	  (if (null AyaData)
	      (progn 
		(setq AyaData AyaDataEx)

		;; FIXIT: This lacks efficiency big time! Replace
		;; characters in buffer instead of a single string.
		(setq AyaData (replace-regexp-in-string "[^\u0621-\u064A ]" "" AyaData))))

	  ;; Insert formatted data in the new buffer
	  (with-current-buffer tempbuf

	    ;; For example: this call inserts SQL commands in the
	    ;; new buffer
	    (insert
	     (format
	      "INSERT INTO ht VALUES(%s, '%s', %s, '%s', '%s', %s);\n"
	      SuraNum SuraName AyaNum AyaData AyaDataEx AyaLastNum))))))

    ;; Insert footer
    (with-current-buffer tempbuf
      (insert quran-footer-text))

    ;; Display the result buffer
    (switch-to-buffer tempbuf)))

;; End