
TashkeelHandler: A QT C++ Class



Asalamu alaikum.

I am sending a small Qt-based C++ class that handles Arabic diacritical
marks (tashkeel). It demonstrates the following:

1. Removing all diacritical marks from an Arabic string.
2. Searching diacritically marked text with regular expressions.
3. Proof that Qt is not just a GUI library.

In #2, we follow a simple algorithm:

* break the text into its characters.

* add the regular expression, rule1, after every character.

* join the new characters into a new string.
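
The three steps above can be sketched without Qt, as a rough illustration only. The names buildPattern and kRule are hypothetical and are not part of the class below, and the sketch assumes the marks fall in the range U+064B to U+0652 and that the word is already decoded to UTF-32:

```cpp
#include <string>

// Hypothetical stand-in for rule1: up to two diacritical marks.
// Assumption: the marks are the code points U+064B..U+0652.
const std::u32string kRule = U"[\u064B-\u0652]{0,2}";

// Break the word into characters, add the rule after every
// character, and join the pieces into one pattern string.
std::u32string buildPattern(const std::u32string& word)
{
    std::u32string pattern;
    for (char32_t c : word)
    {
        pattern += c;      // the original character
        pattern += kRule;  // optional marks that may follow it
    }
    return pattern;
}
```

The result is only the pattern text; it still has to be handed to a regex engine that understands Unicode, as QRegExp does in the class below.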

With #2, there is no need to strip the diacritical marks from marked text.
This removes the need to keep a duplicate, unmarked copy of the content
just for searching while the marked text serves only as the visible
content.
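
To illustrate searching the marked text directly, here is a minimal sketch that does the same job without a regex engine. It is not how the class below works: it swaps the regular expression for a hand-written scan, it skips any number of marks after a letter (the class allows at most two), and the names isTashkeelMark and findIgnoringTashkeel are made up for this example. It assumes the marks are U+064B to U+0652 and that both strings are UTF-32:

```cpp
#include <cstddef>
#include <string>

// Assumption: the diacritical marks are the code points U+064B..U+0652.
bool isTashkeelMark(char32_t c)
{
    return c >= 0x064B && c <= 0x0652;
}

// Find an unmarked needle inside marked text. After every matched
// letter, any marks that follow it in the text are skipped.
// Returns the index of the first matched letter, or npos.
std::size_t findIgnoringTashkeel(const std::u32string& text,
                                 const std::u32string& needle)
{
    for (std::size_t start = 0; start < text.size(); ++start)
    {
        std::size_t t = start;  // position in the marked text
        std::size_t n = 0;      // position in the unmarked needle
        while (n < needle.size() && t < text.size() && text[t] == needle[n])
        {
            ++t;
            ++n;
            while (t < text.size() && isTashkeelMark(text[t]))
                ++t;            // skip marks attached to this letter
        }
        if (n == needle.size())
            return start;
    }
    return std::u32string::npos;
}
```

Only the marked text is stored; the unmarked search string is matched against it in place.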

A driver is included to test the class. To use it, your console must support
bidirectional text. If you are using a recent version of KDE, you can instruct
Konsole (the KDE console application) to use bidi in the settings dialog.

The driver performs the following:

1. After a search string is entered, a new TashkeelHandler is instantiated.
2. The TashkeelHandler instance constructs the new regex.
3. A search is made to see if the regex is in the current line.
4. If a match is found, the line containing the match is displayed and then
   saved into the file: searchresults.txt.
5. The instance then removes the marks from the match and saves it in the
   same file.
6. The searched content is the ayat of surat alqamar. This file is attached;
   don't forget to save it in the same directory as the sources.
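
Step 5, removing the marks from a matched line, can be sketched in plain C++ as follows. This is an illustration only: stripTashkeel is a hypothetical name, the class itself uses QString::remove with a QRegExp, and the sketch assumes the marks are U+064B to U+0652 and that the line is already decoded to UTF-32:

```cpp
#include <algorithm>
#include <string>

// Return a copy of the line with every diacritical mark removed.
// Assumption: the marks are the code points U+064B..U+0652.
std::u32string stripTashkeel(std::u32string line)
{
    line.erase(std::remove_if(line.begin(), line.end(),
                              [](char32_t c)
                              { return c >= 0x064B && c <= 0x0652; }),
               line.end());
    return line;
}
```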

SAMPLE RUN (You can enter more than one word like: kathabat thamuud):

~/Projects/arabic/toys/tashkeelhandler #-> tdriver

Enter Search String: دسر
Found match in aya: 13
وحملناه على ذات ألواح ودسر
~/Projects/arabic/toys/tashkeelhandler #->      


You can use KWord to view the differences between both lines.

To compile: 

g++ tashkeelhandler.cc tdriver.cc -o tdriver -lqt

Of course, your Qt library shouldn't be outdated.

Sorry if this message is too long.

Salam,
Abdalla Alothman

/////////////////////////////////////////////////////////////////////////
// FILE: tashkeelhandler.h
// Interface for class TashkeelHandler
// 1. Strip Tashkeel from a string
// 2. Construct Regular Expressions to match diacritically marked content
// By Abdalla Alothman - abdalla at pheye dot net - June 2003
// Updated January 12 2004
// Updated April 07 2004
/////////////////////////////////////////////////////////////////////////

// Names containing double underscores are reserved for the implementation.
#ifndef TASHKEELHANDLER_H
#define TASHKEELHANDLER_H

#include <qstring.h>
#include <qstringlist.h>
#include <qregexp.h>
#include <iostream>

namespace trule1
{
  class TashkeelHandler
  {
    public:
      TashkeelHandler();
      ~TashkeelHandler() { delete rule1; }
      void removeTashkeel(QString&, QString&);
      QRegExp constructRegex(QString&);
      void findInString(QString&, QRegExp&);

    private:
      QString singleWordConstructor(QString&);
      QString multipleWordConstructor(QString&);
      
      QString tashkeelstr;
      QRegExp *rule1;
      //QRegExp rule1(QString::fromUtf8("([ًٌٍَُِّْ]*)"));
  };
}
#endif

/////////////////////////////////////////////////////////////////////////
// FILE: tashkeelhandler.cc
// Implementation for class TashkeelHandler
// 1. constructRegex: Build a regular expression
//   [a] PRIVATE: singleWordConstructor: Builds regex for single words
//   [b] PRIVATE: multipleWordConstructor: Builds regex for multiple
//       words
// 2. removeTashkeel: Removes tashkeel from a string.
//
// By Abdalla Alothman - abdalla at pheye dot net - June 2003
/////////////////////////////////////////////////////////////////////////
#include "tashkeelhandler.h"
namespace trule1
{
  TashkeelHandler::TashkeelHandler()
  {
    tashkeelstr = QString::fromUtf8("([ًٌٍَُِّْ]){0,2}");
    rule1 = new QRegExp(tashkeelstr);
  }

  void TashkeelHandler::removeTashkeel(QString &in, QString &r1)
  {
    // Use the default tashkeel rule when the caller passes an empty
    // pattern; otherwise remove only what the caller's pattern matches.
    // The pattern is used as-is, without a round trip through UTF-8.
    QRegExp r2(r1.isEmpty() ? tashkeelstr : r1);
    in.remove(r2);
  }

  QRegExp TashkeelHandler::constructRegex(QString &in)
  {
    
    if( in.contains( QRegExp("(\\s)") ) )
    {
      in = multipleWordConstructor(in);
    }
    
    else
    {
      in = singleWordConstructor(in);
    }
    return in;
  }

  QString TashkeelHandler::multipleWordConstructor(QString &in)
  {
    // First: Check if input string contains any tashkeel
    if( in.contains( *rule1 ) )
    {
      QString d1 = rule1->pattern();
      removeTashkeel(in, d1);
    }
    
    QStringList inList = QStringList::split(" ", QString::fromUtf8(in));
    QStringList outList;
    QString inTemp("");

    for(QStringList::iterator i = inList.begin(); i != inList.end(); ++i)
    {
      inTemp = *i;
    
      QStringList tList = QStringList::split("", inTemp);
      inTemp = tList.join(tashkeelstr) + tashkeelstr;
      outList << inTemp;
    }
    inTemp = outList.join(" ");
    inTemp.prepend("(^|\\s)");
    inTemp.prepend(tashkeelstr);
    inTemp.append("($|\\s)");
    return inTemp;
  }

  QString TashkeelHandler::singleWordConstructor(QString& in)
  {
    QStringList inList = QStringList::split("", QString::fromUtf8(in));
    in = inList.join(tashkeelstr);
    return in;
  }
}

///////////////////////////////////////////////////////////////////
// FILE: tdriver.cc
// Test drives class TashkeelHandler
// By Abdalla Alothman - abdalla at pheye dot net - August 31, 2005
//
// compile with: g++ tashkeelhandler.cc tdriver.cc -o tdriver -lqt
///////////////////////////////////////////////////////////////////

#include "tashkeelhandler.h"
#include <qfile.h>
#include <iostream>
#include <string>
#include <memory>
#include <qtextstream.h>
// The class is declared in namespace trule1, so either use a
// using-directive or qualify the instance(s).
using namespace trule1;
using namespace std;

int main()
{
  cout << "Enter Search String: ";
  string input;
  getline(cin, input);
  QString i2(input);
  QString line("");
  QFile f1("054-alqamar-utf.txt"); // THIS FILE IS AN ATTACHMENT
  QFile f2("searchresults.txt");
  QTextStream tstream2(&f2);
  tstream2.setEncoding(QTextStream::UnicodeUTF8);
  QRegExp r;
  if(f1.open(IO_ReadOnly) && f2.open(IO_WriteOnly))
  {
    auto_ptr<TashkeelHandler> h1(new TashkeelHandler);
    // this will construct a regex
    r = h1->constructRegex(i2);
    QTextStream tstream(&f1);
    int i = -1; // to show aya number, but don't count the title and the Basmala
    
    while( !tstream.eof() )
    {
      line = tstream.readLine();

      // search the input line using the constructed regex
      if(line.contains(r))
      {
        cout << "Found match in aya: "
             << i
             << endl
             << line.utf8()
             << endl;
        tstream2 << "With Tashkeel: " << endl << line << endl;
        QString a("");
        // Remove all tashkeel
        // "a" is an empty tashkeel string. If you only want to remove
        // the dhamma, set a to the dhamma and pass it; nothing will be
        // removed except the dhamma.
        // Do not abuse!!
        h1->removeTashkeel(line, a);
        tstream2 << "Tashkeel removed: "
                 << endl
                 << line
                 << endl;
      }
      i++;
      continue;
    }
  }
  return 0;
}
///////////////////////////////////////////////////////////////////
// ATTACHMENT: 054-alqamar-utf.txt (this banner is not part of the file)
///////////////////////////////////////////////////////////////////
*سورة القمر - 54 - 55*
بِسْمِ اللّهِ الرَّحْمنِ الرَّحِيمِ
اقْتَرَبَتِ السَّاعَةُ وَانشَقَّ الْقَمَرُ
وَإِن يَرَوْا آيَةً يُعْرِضُوا وَيَقُولُوا سِحْرٌ مُّسْتَمِرٌّ
وَكَذَّبُوا وَاتَّبَعُوا أَهْوَاءهُمْ وَكُلُّ أَمْرٍ مُّسْتَقِرٌّ
وَلَقَدْ جَاءهُم مِّنَ الْأَنبَاء مَا فِيهِ مُزْدَجَرٌ
حِكْمَةٌ بَالِغَةٌ فَمَا تُغْنِ النُّذُرُ
فَتَوَلَّ عَنْهُمْ يَوْمَ يَدْعُ الدَّاعِ إِلَى شَيْءٍ نُّكُرٍ
خُشَّعًا أَبْصَارُهُمْ يَخْرُجُونَ مِنَ الْأَجْدَاثِ كَأَنَّهُمْ جَرَادٌ مُّنتَشِرٌ
مُّهْطِعِينَ إِلَى الدَّاعِ يَقُولُ الْكَافِرُونَ هَذَا يَوْمٌ عَسِرٌ
كَذَّبَتْ قَبْلَهُمْ قَوْمُ نُوحٍ فَكَذَّبُوا عَبْدَنَا وَقَالُوا مَجْنُونٌ وَازْدُجِرَ
فَدَعَا رَبَّهُ أَنِّي مَغْلُوبٌ فَانتَصِرْ
فَفَتَحْنَا أَبْوَابَ السَّمَاء بِمَاء مُّنْهَمِرٍ
وَفَجَّرْنَا الْأَرْضَ عُيُونًا فَالْتَقَى الْمَاء عَلَى أَمْرٍ قَدْ قُدِرَ
وَحَمَلْنَاهُ عَلَى ذَاتِ أَلْوَاحٍ وَدُسُرٍ
تَجْرِي بِأَعْيُنِنَا جَزَاء لِّمَن كَانَ كُفِرَ
وَلَقَد تَّرَكْنَاهَا آيَةً فَهَلْ مِن مُّدَّكِرٍ
فَكَيْفَ كَانَ عَذَابِي وَنُذُرِ
وَلَقَدْ يَسَّرْنَا الْقُرْآنَ لِلذِّكْرِ فَهَلْ مِن مُّدَّكِرٍ
كَذَّبَتْ عَادٌ فَكَيْفَ كَانَ عَذَابِي وَنُذُرِ
إِنَّا أَرْسَلْنَا عَلَيْهِمْ رِيحًا صَرْصَرًا فِي يَوْمِ نَحْسٍ مُّسْتَمِرٍّ
تَنزِعُ النَّاسَ كَأَنَّهُمْ أَعْجَازُ نَخْلٍ مُّنقَعِرٍ
فَكَيْفَ كَانَ عَذَابِي وَنُذُرِ
وَلَقَدْ يَسَّرْنَا الْقُرْآنَ لِلذِّكْرِ فَهَلْ مِن مُّدَّكِرٍ
كَذَّبَتْ ثَمُودُ بِالنُّذُرِ
فَقَالُوا أَبَشَرًا مِّنَّا وَاحِدًا نَّتَّبِعُهُ إِنَّا إِذًا لَّفِي ضَلَالٍ وَسُعُرٍ
أَؤُلْقِيَ الذِّكْرُ عَلَيْهِ مِن بَيْنِنَا بَلْ هُوَ كَذَّابٌ أَشِرٌ
سَيَعْلَمُونَ غَدًا مَّنِ الْكَذَّابُ الْأَشِرُ
إِنَّا مُرْسِلُو النَّاقَةِ فِتْنَةً لَّهُمْ فَارْتَقِبْهُمْ وَاصْطَبِرْ
وَنَبِّئْهُمْ أَنَّ الْمَاء قِسْمَةٌ بَيْنَهُمْ كُلُّ شِرْبٍ مُّحْتَضَرٌ
فَنَادَوْا صَاحِبَهُمْ فَتَعَاطَى فَعَقَرَ
فَكَيْفَ كَانَ عَذَابِي وَنُذُرِ
إِنَّا أَرْسَلْنَا عَلَيْهِمْ صَيْحَةً وَاحِدَةً فَكَانُوا كَهَشِيمِ الْمُحْتَظِرِ
وَلَقَدْ يَسَّرْنَا الْقُرْآنَ لِلذِّكْرِ فَهَلْ مِن مُّدَّكِرٍ
كَذَّبَتْ قَوْمُ لُوطٍ بِالنُّذُرِ
إِنَّا أَرْسَلْنَا عَلَيْهِمْ حَاصِبًا إِلَّا آلَ لُوطٍ نَّجَّيْنَاهُم بِسَحَرٍ
نِعْمَةً مِّنْ عِندِنَا كَذَلِكَ نَجْزِي مَن شَكَرَ
وَلَقَدْ أَنذَرَهُم بَطْشَتَنَا فَتَمَارَوْا بِالنُّذُرِ
وَلَقَدْ رَاوَدُوهُ عَن ضَيْفِهِ فَطَمَسْنَا أَعْيُنَهُمْ فَذُوقُوا عَذَابِي وَنُذُرِ
وَلَقَدْ صَبَّحَهُم بُكْرَةً عَذَابٌ مُّسْتَقِرٌّ
فَذُوقُوا عَذَابِي وَنُذُرِ
وَلَقَدْ يَسَّرْنَا الْقُرْآنَ لِلذِّكْرِ فَهَلْ مِن مُّدَّكِرٍ
وَلَقَدْ جَاء آلَ فِرْعَوْنَ النُّذُرُ
كَذَّبُوا بِآيَاتِنَا كُلِّهَا فَأَخَذْنَاهُمْ أَخْذَ عَزِيزٍ مُّقْتَدِرٍ
أَكُفَّارُكُمْ خَيْرٌ مِّنْ أُوْلَئِكُمْ أَمْ لَكُم بَرَاءةٌ فِي الزُّبُرِ
أَمْ يَقُولُونَ نَحْنُ جَمِيعٌ مُّنتَصِرٌ
سَيُهْزَمُ الْجَمْعُ وَيُوَلُّونَ الدُّبُرَ
بَلِ السَّاعَةُ مَوْعِدُهُمْ وَالسَّاعَةُ أَدْهَى وَأَمَرُّ
إِنَّ الْمُجْرِمِينَ فِي ضَلَالٍ وَسُعُرٍ
يَوْمَ يُسْحَبُونَ فِي النَّارِ عَلَى وُجُوهِهِمْ ذُوقُوا مَسَّ سَقَرَ
إِنَّا كُلَّ شَيْءٍ خَلَقْنَاهُ بِقَدَرٍ
وَمَا أَمْرُنَا إِلَّا وَاحِدَةٌ كَلَمْحٍ بِالْبَصَرِ
وَلَقَدْ أَهْلَكْنَا أَشْيَاعَكُمْ فَهَلْ مِن مُّدَّكِرٍ
وَكُلُّ شَيْءٍ فَعَلُوهُ فِي الزُّبُرِ
وَكُلُّ صَغِيرٍ وَكَبِيرٍ مُسْتَطَرٌ
إِنَّ الْمُتَّقِينَ فِي جَنَّاتٍ وَنَهَرٍ
فِي مَقْعَدِ صِدْقٍ عِندَ مَلِيكٍ مُّقْتَدِرٍ