
Re: Normalization



On Mon, Nov 11, 2002 at 12:11:03AM -0800, sara mraish wrote:
> Salam,
> 
> My question is how can I normalize teh marbuta to replace the heh? Also, how
> can I normalize the alif-maksura to yeh? So that I can use it in the
> information retrieval when searching for a string. Arabic words must be
> normalized before the text is ready for indexing, keyword searches, or text
> manipulation.

Salam,

You simply replace the characters ;) If you give more details, perhaps
someone can help you, but as it stands I cannot tell exactly what you are
asking (and I doubt others can).

Assuming that you have a text file you want to normalize, you would need to:
  - Remove punctuation
  - Remove diacritics
  - Remove non-Arabic letters
  - Replace any ALEF carrying a HAMZA or MADDA (U+0623, U+0625, U+0622) with
    a plain ALEF (U+0627)
  - Replace any YEH followed by a standalone HAMZA (U+0621) with a YEH with a
    HAMZA on top (U+0626)
  - Replace any ALEF_MAKSURA (U+0649) with a YEH (U+064A)
  - Replace any TEH_MARBUTA (U+0629) with a HEH (U+0647)
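
As a rough illustration of the steps above, here is a minimal sketch in
Python. It is only one way to do it, and it makes a few assumptions that are
mine rather than part of the list: the function name normalize_arabic is
made up, "diacritics" is taken to mean U+064B through U+0652, the tatweel
(U+0640) is dropped along with them, "Arabic letters" is taken to mean
U+0621-U+063A and U+0641-U+064A, and anything else becomes a space so that
words separated only by punctuation do not get glued together.

```python
import re

# Assumption: "diacritics" means the harakat U+064B..U+0652; the tatweel
# (U+0640) is also dropped here so it does not split a word later on.
DIACRITICS = re.compile(r"[\u064B-\u0652\u0640]")

# Assumption: "Arabic letters" means U+0621..U+063A and U+0641..U+064A.
# Everything else that is not whitespace (punctuation, digits, Latin, ...)
# is matched by this pattern.
NON_ARABIC = re.compile(r"[^\u0621-\u063A\u0641-\u064A\s]")

# Single-character replacements from the list above.
CHAR_MAP = str.maketrans({
    "\u0622": "\u0627",  # ALEF WITH MADDA ABOVE -> ALEF
    "\u0623": "\u0627",  # ALEF WITH HAMZA ABOVE -> ALEF
    "\u0625": "\u0627",  # ALEF WITH HAMZA BELOW -> ALEF
    "\u0649": "\u064A",  # ALEF MAKSURA -> YEH
    "\u0629": "\u0647",  # TEH MARBUTA -> HEH
})


def normalize_arabic(text):
    """Apply the normalization steps to a string (hypothetical helper)."""
    # Remove diacritics (and tatweel) without leaving a gap.
    text = DIACRITICS.sub("", text)
    # Replace punctuation and non-Arabic characters with a space.
    text = NON_ARABIC.sub(" ", text)
    # Collapse the runs of whitespace left behind.
    text = " ".join(text.split())
    # YEH followed by a standalone HAMZA -> YEH WITH HAMZA ABOVE.
    text = text.replace("\u064A\u0621", "\u0626")
    # ALEF variants, ALEF MAKSURA and TEH MARBUTA in a single pass.
    return text.translate(CHAR_MAP)
```

Whatever you do to the text at indexing time has to be done to the search
string as well, otherwise the two will not match. The YEH + HAMZA step runs
before ALEF_MAKSURA is turned into YEH, following the order of the list; if
you want ALEF_MAKSURA + HAMZA folded the same way, swap those two steps.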
  
How you do that depends, of course, on whatever language you choose. But
without more _details_, there is only so much I can help with.

  
P.S. When replying to an email from a digest, please remove the unrelated
     mail from the reply. This would save a lot of people bandwidth
     downloading their mail.
     
later
-- 
-------------------------------------------------------
| Mohammed Elzubeir    | Visit us at:                 |
|                      |  http://www.arabeyes.org/    |
| Arabeyes Project     | Homepage:                    |
| Unix the 'right' way |  http://fakkir.net/~elzubeir/|
-------------------------------------------------------
