
RE: Developer digest, Vol 1 #263 - 1 msg



Thanks a lot,

So, to answer my questions, you mentioned we can normalize an alef_maksura
to a yeh and a teh_marbuta to a heh?

Online I see teh marbuta written as a teh marbuta, with its dots intact. It
is a heh only if the word actually ends with a heh, especially when the word
is masculine. I am just having a hard time convincing myself that we can
remove the dots from the teh marbuta, because if the word is feminine it
should have the two dots on top of the heh to mark it as feminine. So you are
saying that we can remove the dots from the teh marbuta and make it a heh for
indexing and use in IR applications for the Arabic language?


Thanks and sorry if I am not clear

Sara


> My question is how can I normalize teh marbuta to replace the heh? Also, how
> can I normalize the alif-maksura to yeh? So that I can use it in the
> information retrieval when searching for a string. Arabic words must be
> normalized before the text is ready for indexing, keyword searches, or text
> manipulation.

Salam,

You simply replace the characters ;) If you would give more details then
perhaps someone can help you, but as it is, I cannot understand what you are
asking (and I doubt others can).

Assuming that you have a text file you want to normalize, you would need to:
  - Remove punctuation
  - Remove diacritics
  - Remove non-Arabic letters
  - Replace any ALEF carrying a HAMZA or MADDA with a plain ALEF (U+0627)
  - Replace any YEH followed by a standalone HAMZA with a YEH with a HAMZA on
    top (U+0626)
  - Replace any ALEF_MAKSURA with a YEH (U+064A)
  - Replace any TEH_MARBUTA with a HEH (U+0647)

How you do that is of course with whatever language you choose. But without
more information in _details_, there is only so much I can help with.
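
As a rough illustration only, the steps listed above could be sketched in
Python along these lines. The function name normalize_arabic, the exact
diacritic range (U+064B to U+0652, plus U+0670 and the tatweel U+0640), and
the choice to treat "Arabic letters" as U+0621 to U+064A are all assumptions
made for this sketch, not anything fixed by the list itself:

```python
import re

# Assumed set of marks to drop: fathatan..sukun, superscript alef, tatweel.
DIACRITICS = re.compile("[\u064B-\u0652\u0670\u0640]")
# Assumed definition of "non-Arabic": anything outside U+0621-U+064A that is
# not whitespace. This also removes punctuation and digits.
NON_ARABIC = re.compile("[^\u0621-\u064A\\s]")
# ALEF with MADDA above, HAMZA above, or HAMZA below.
ALEF_VARIANTS = re.compile("[\u0622\u0623\u0625]")


def normalize_arabic(text: str) -> str:
    """Hypothetical helper applying the normalization steps in order."""
    text = DIACRITICS.sub("", text)            # remove diacritics
    text = NON_ARABIC.sub(" ", text)           # remove punctuation, non-Arabic
    text = ALEF_VARIANTS.sub("\u0627", text)   # ALEF variants -> plain ALEF
    text = text.replace("\u064A\u0621", "\u0626")  # YEH + HAMZA -> YEH w/ HAMZA
    text = text.replace("\u0649", "\u064A")    # ALEF_MAKSURA -> YEH
    text = text.replace("\u0629", "\u0647")    # TEH_MARBUTA -> HEH
    return " ".join(text.split())              # collapse leftover whitespace


if __name__ == "__main__":
    print(normalize_arabic("\u0625\u0644\u0649 \u0645\u062F\u0631\u0633\u0629!"))
```

Removed punctuation and non-Arabic characters are replaced by a space rather
than deleted outright, so that neighbouring words do not get glued together.
The same function would have to be applied to both the indexed text and the
search query for the matching to work.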

P.S. When replying to an email from a digest, please remove the unrelated
     mail from the reply. This would save a lot of people bandwidth
     downloading their mail.

later
--