Re: Arabic posts
- To: Core Arabeyes Team <core at arabeyes dot org>
- Subject: Re: Arabic posts
- From: Nadim Shaikli <shaikli at yahoo dot com>
- Date: Tue, 1 Jul 2003 02:03:32 -0700 (PDT)
--- Nadim Shaikli <shaikli at yahoo dot com> wrote:
> I think our mailing-list archives are broken when it comes to Arabic posts.
> I came across this by sheer accident (while looking into Youcef's question
> about Jan 2003 - I fixed that ezmlm problem, so Jan 2003 should be OK).
>
> After more looking into the Arabic posts, I've come to this conclusion,
>
> 1. The tar files given to me (even the old (pre-.uk)) are NOT raw data
> and have been saved in the improper encoding (emacs ?). For instance
> look at the 'ae-lists-20030430.tar.bz2' files (you can preview it on
> arabeyes ~nadim/maillists/doc.mbox and search for 'Nov 2002' - you'll
> see all sorts of '=D8=A7=D9=84=D8=B3=D9=84=D8=A7=D9=85' which are the
> encodings, but they are discrete ASCII characters). I'm sure that can
> be fixed via a script or something, I'm just not sure how to go about
> it at the moment. Mohammed, could you please look into this and fix
> it (since you might be fresh about these manipulations from your recent
> duali work) - it should have been in raw format throughout, so I highly
> suggest you do some sample checks on your .emacs setup as well (I can
> help if you like to flush out your emacs setup if that is indeed a
> problem).
OK, I'm on crack - I figured out what needs to be done -- those pesky
Quoted-Printable encodings were the culprits (I wasted a lot of time on this
too). Mohammed, you can ignore me on this one.
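
For anyone who hits the same thing, here is a minimal sketch of turning those '=D8=A7...' runs back into readable text. It assumes the body is Quoted-Printable and that the underlying charset is UTF-8 (which matches the sample quoted above); the helper name decode_qp is made up for illustration and is not part of ezmlm or mhonarc.

```python
import quopri


def decode_qp(raw: bytes, charset: str = "utf-8") -> str:
    """Decode a Quoted-Printable byte string into text.

    The charset default is an assumption; a real message states it in
    its Content-Type header.
    """
    # quopri turns each '=XX' escape back into the single byte 0xXX.
    decoded_bytes = quopri.decodestring(raw)
    # Bytes that are invalid in the chosen charset become U+FFFD
    # instead of raising an exception.
    return decoded_bytes.decode(charset, errors="replace")


# The sample string from the archive, as discrete ASCII characters.
sample = b"=D8=A7=D9=84=D8=B3=D9=84=D8=A7=D9=85"
print(decode_qp(sample))
```

The point is that the mbox data is still raw and intact; the '=XX' sequences are only the transfer encoding, so nothing needs to be repaired in the tar files themselves.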
> 2. I have improper mhonarc settings, the NEW Arabic posts sitting on our
> live 'LIST.mbox' files are all fine, the way they are being processed
> is incorrect and I will look into fixing those (they are not related
> to #1 above). This is a much more minor problem than #1 since the
> original raw file is not touched and it's just a matter of figuring out
> the proper configuration setup.
I have figured everything out except for one thing (which I mailed the author
about, and I do NOT expect a resolution). If we get a UTF-8 email (or CP-1256)
it will get archived correctly (once my pending local fixes are in), but if we
reply to those emails and cite their posts, the cited text will be all
jumbled-up and incorrect if the reply uses a different encoding from the
original post (which is very likely). Case in point: I get an Arabic UTF-8
post, I reply to the sender including their post, and my reply is in English,
so yahoo (or other) will note that the encoding on the message is us-ascii.
When mhonarc sees it, it will treat the entire message as us-ascii (due to
the header), which results in jumbled cites. As noted, I've asked the author
to simply not modify things if they are cited and/or not noted as encoded,
but I doubt I'll get anything meaningful back (it's not a clear solution).
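
To make the failure concrete, here is a rough sketch of the kind of fallback I have in mind. It is NOT how mhonarc works today and uses no mhonarc code; decode_body and its fallback list are hypothetical, and the idea is simply to try the declared charset first and, when the bytes do not fit it, try the encodings we actually see on the lists.

```python
def decode_body(raw: bytes, declared: str = "us-ascii",
                fallbacks: tuple = ("utf-8", "cp1256")) -> str:
    """Decode a message body, tolerating a wrong charset header.

    'declared' is what the header claims; 'fallbacks' is an assumed
    list of encodings to try when the bytes do not fit the claim.
    """
    for charset in (declared, *fallbacks):
        try:
            return raw.decode(charset)
        except (UnicodeDecodeError, LookupError):
            # Either the bytes are invalid in this charset or the
            # charset name is unknown; move on to the next candidate.
            continue
    # Last resort: keep the ASCII parts and mark everything else.
    return raw.decode("ascii", errors="replace")


# An English reply citing a UTF-8 Arabic line, labelled us-ascii.
cited = "\u0627\u0644\u0633\u0644\u0627\u0645".encode("utf-8")
reply = b"> " + cited + b"\nThanks\n"
print(decode_body(reply, declared="us-ascii"))
```

This is only a guess at recovery: a CP-1256 body can occasionally also be valid UTF-8, so the order of the fallbacks matters and the result is not guaranteed to be right in every case.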
> Once #1 & #2 are fixed, I will regenerate the mailing-list archives.
Sorry for the false alarm.
We still need to set up a meeting and talk about the 1U recent posts.
Salam.
- Nadim