Re: Arabic posts



--- Nadim Shaikli <shaikli at yahoo dot com> wrote:
> I think our mailing-list archives are broken when it comes to Arabic posts.
> I came across this by sheer accident (while looking into Youcef's question
> about Jan 2003 - I fixed that ezmlm problem, so Jan 2003 should be OK).
> 
> After more looking into the Arabic posts, I've come to this conclusion,
> 
>  1. The tar files given to me (even the old (pre-.uk)) are NOT raw data
>     and have been saved in the improper encoding (emacs ?).  For instance
>     look at the 'ae-lists-20030430.tar.bz2' files (you can preview it on
>     arabeyes ~nadim/maillists/doc.mbox and search for 'Nov 2002' - you'll
>     see all sorts of '=D8=A7=D9=84=D8=B3=D9=84=D8=A7=D9=85' which are the
>     encodings, but they are discrete ASCII characters).  I'm sure that can
>     be fixed via a script or something, I'm just not sure how to go about
>     it at the moment.  Mohammed, could you please look into this and fix
>     it (since you might be fresh about these manipulations from your recent
>     duali work) - it should have been in raw format throughout, so I highly
>     suggest you do some sample checks on your .emacs setup as well (I can
>     help if you like to flush out your emacs setup if that is indeed a
>     problem).

OK, I'm on crack - I figured out what needs to be done -- those pesky Quoted
Printables were the culprits (I wasted a lot of time on this too).  Mohammed,
you can ignore me on this one.
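
For anyone who hits the same thing: the '=D8=A7...' runs are Quoted-Printable
escapes of UTF-8 bytes, so the data is intact and only needs decoding.  Below
is a minimal sketch in Python (not the script actually used on the archives;
the helper name decode_qp_utf8 is made up for illustration, and it assumes
the underlying bytes are UTF-8).

```python
import quopri


def decode_qp_utf8(encoded: str) -> str:
    """Decode a Quoted-Printable ASCII string whose payload is UTF-8 text.

    Hypothetical helper: assumes the escaped bytes are UTF-8, which is
    true for the sample quoted above but not for every message.
    """
    # quopri works on bytes, so convert the ASCII escape text first.
    raw = quopri.decodestring(encoded.encode("ascii"))
    return raw.decode("utf-8")


# The sample sequence quoted from the archive tarball.
sample = "=D8=A7=D9=84=D8=B3=D9=84=D8=A7=D9=85"
decoded = decode_qp_utf8(sample)
print(decoded)
```

The sample decodes to a single six-letter Arabic word.  A CP-1256 message
would need a different codec in the final decode step.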

>  2. I have improper mhonarc settings, the NEW arabic posts sitting on our
>     live 'LIST.mbox' files are all fine, the way they are being processed
>     is incorrect and I will look into fixing those (they are not related
>     to #1 above).  This is a much more minor problem than #1 since the
>     original raw file is not touched and it's just a matter of figuring out
>     the proper configuration setup.

I have figured everything out except for one thing (which I mailed the author
about, and I do NOT expect a resolution).  If we get a UTF-8 email (or CP-1256)
it will get archived correctly (once my pending local fixes are in), but if we
reply to those emails and cite their posts, their encoded text will be all
jumbled-up and incorrect if we use a different encoding from their original
post (which is very likely).  Case in point: I get an Arabic UTF-8 post, I
reply to the sender including their post, and my reply is in English, so
yahoo (or other) will note that the encoding on the message is us-ascii.
When mhonarc sees it, it will treat the entire message as us-ascii (due to
the header), which results in jumbled cites.  As noted, I've asked the author
to simply not modify things if they are cited and/or not noted as encoded,
but I doubt I'll get anything meaningful back (it's not a clear solution).
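
To make the failure concrete, here is a rough sketch in Python of the kind
of fallback that would avoid the jumbled cites.  This is NOT how mhonarc
works and not something it offers as far as I know; decode_body and its
fallback order (declared charset, then UTF-8, then CP-1256) are assumptions
made purely for illustration.

```python
def decode_body(raw: bytes, declared_charset: str = "us-ascii") -> str:
    """Decode a message body, distrusting a too-narrow charset header.

    Hypothetical helper: tries the declared charset first, then UTF-8,
    then CP-1256 with replacement characters as a last resort.
    """
    try:
        # Works when the header is accurate (e.g. a pure-English reply).
        return raw.decode(declared_charset)
    except (UnicodeDecodeError, LookupError):
        # Header was wrong or names an unknown charset; keep going.
        pass
    try:
        # Cited Arabic text from a UTF-8 post lands here.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Last guess; never raises because bad bytes are replaced.
        return raw.decode("cp1256", errors="replace")


# An English reply that cites a UTF-8 Arabic line but is labeled us-ascii.
cited = "> \u0627\u0644\u0633\u0644\u0627\u0645\nThanks".encode("utf-8")
text = decode_body(cited, "us-ascii")
```

The guess can still be wrong when the cited text is CP-1256 bytes that
happen to be valid UTF-8, which is part of why this is not a clean solution.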

> Once #1 & #2 are fixed, I will regenerate the mailing-list archives.

Sorry for the false alarm.

We still need to set up a meeting and talk about the 1U recent posts.

Salam.

 - Nadim

