While I'm not an Arabeyes contributor, I found the April 2007 thread
about Siragi OCR very interesting.  I have some comments, in three
areas: (1) OCRopus / Tesseract, (2) Summer of Code and (3) Hebrew.

OCRopus / Tesseract

I was surprised that people were not more strongly in favor of joining the Siragi project with the OCRopus or Tesseract projects. While I don't have in-depth knowledge of any of these projects, joining seems like a good idea to me, for these reasons:

1) Joining OCRopus or Tesseract could make it easier to collaborate with people who are working on open source OCR for other Arabic-script languages such as Persian and Urdu. Faisal Shafait at IUPR (http://www.iupr.org/), who is one of the leaders of the OCRopus project, wants to add Urdu support to OCRopus:

2) Joining OCRopus or Tesseract could create more opportunities to get technical advice from experts like the people at IUPR who are involved in OCRopus.  I don't think it's necessary for people to know Arabic in order to give good advice.  I think many issues that are specific to Arabic could be quickly explained to people who don't know Arabic by referring them to a resource like the excellent tutorial on Arabic text processing by Nizar Habash

3) I expect a lot of code in an OCR system can be re-used across different languages, even if the languages don't use the same characters for writing.  I expect a lot of the code in an OCR system is tricky stuff involving difficult algorithms, so I think it's great to be able to re-use existing high-performance code, and I fear that an effort to build a high-accuracy system from scratch might get bogged down.  I don't think the open-source Hebrew OCR system mentioned here in April proves otherwise, because it is not high-accuracy (there are a lot of errors in the "300DPI Line-Art.scan" example at http://hocr.berlios.de/examples.html), and also because Hebrew is presumably an easier language to OCR than Arabic because the characters don't join and there is much less shaping.

4) Since OCRopus and Tesseract are sponsored by Google, joining with
these projects may improve Siragi's chances of Summer of Code funding
in the future.

Summer of Code mentors

For next year's Summer of Code proposal, it might be a good idea to look for a mentor or co-mentor from outside of Arabeyes who already has a lot of experience with OCR technology or related technology (unless you already have this experience inside Arabeyes).  Voxforge.org contacted a university researcher about being a mentor for one of their Summer of Code ideas, and I think the idea makes a lot of sense.

IUPR might be a good place to find someone.  You could also search around on the Internet to see who has published OCR papers, or has participated in conferences like the DRR and ICDAR conferences mentioned at

Regarding Arabic in particular, I just took a quick look at the ICDAR 2007 web site and I see there are three people from Tunisia and one from Algeria on the program committee.  A quick Google search (alimi OR amara OR kacem OR sellami icdar) shows they have published papers at ICDAR in the past on OCR and handwriting recognition, including work focused on Arabic.  

Also, the mailing list of the ACL SIG on Semitic Languages (http://www.semitic.tk/) might be a good place to find potential mentors, especially for later-stage modeling techniques such as n-grams.  (By later-stage I mean it is later than the image processing.  By n-grams I mean models of the probabilities of sequences of words or characters.)   For example, Kareem Darwish, who I see from the SIG's web site has participated in at least one event sponsored by the SIG, did a PhD thesis on improving later-stage modeling in Arabic OCR.
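To make the n-gram idea a bit more concrete, here is a minimal sketch of a character-level bigram model used to choose between two candidate readings of a word.  It is only an illustration under my own assumptions: the tiny word list, the "^" and "$" boundary markers, the add-one smoothing and the function names are all made up for this example and are not taken from Siragi, OCRopus or Tesseract.

```python
from collections import Counter
import math


def train_char_bigrams(words):
    """Count character bigrams, with hypothetical '^' and '$' word-boundary markers."""
    pair_counts = Counter()
    context_counts = Counter()
    vocab = set()
    for word in words:
        padded = "^" + word + "$"
        vocab.update(padded)
        for left, right in zip(padded, padded[1:]):
            pair_counts[(left, right)] += 1
            context_counts[left] += 1
    return pair_counts, context_counts, vocab


def log_prob(word, model):
    """Log-probability of a word under the bigram model, with add-one smoothing."""
    pair_counts, context_counts, vocab = model
    padded = "^" + word + "$"
    vocab_size = len(vocab)
    total = 0.0
    for left, right in zip(padded, padded[1:]):
        numerator = pair_counts[(left, right)] + 1
        denominator = context_counts[left] + vocab_size
        total += math.log(numerator / denominator)
    return total


# Toy training data (an assumption; a real model needs a large text corpus).
corpus = ["كتاب", "كتب", "كاتب", "مكتب", "مكتبة"]
model = train_char_bigrams(corpus)

# Two candidate readings that differ by one visually similar letter.
candidates = ["كتاب", "كثاب"]
best = max(candidates, key=lambda w: log_prob(w, model))
```

In a real system the image-processing stage would supply its own scores for each candidate, and the language model score would be combined with them rather than used alone.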

Also, maybe the recent ISCAL event on Arabic computing (http://www.iscal.org.sa/) had something about OCR?  I see that OCR is
listed in the English-language list of potential topics for that event, but I can't read the rest of that site since I don't read Arabic.


Hebrew

As was already mentioned in the thread, there is very little shaping in Hebrew. It only happens for five letters and only at the ends of words.
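The five letters in question are kaf, mem, nun, pe and tsadi, each of which has a separate final form in Unicode.  As a rough sketch of how small this shaping problem is, the code below folds final forms into their base letters and restores a final form at the end of a word.  The function names are hypothetical, and the rule is a simplification: I believe some loanwords and abbreviations keep the non-final form at the end of a word, so a real system could not apply it blindly.

```python
# Map each Hebrew final-form letter to its base (non-final) letter.
FINAL_TO_BASE = {
    "\u05DA": "\u05DB",  # final kaf   -> kaf
    "\u05DD": "\u05DE",  # final mem   -> mem
    "\u05DF": "\u05E0",  # final nun   -> nun
    "\u05E3": "\u05E4",  # final pe    -> pe
    "\u05E5": "\u05E6",  # final tsadi -> tsadi
}
BASE_TO_FINAL = {base: final for final, base in FINAL_TO_BASE.items()}


def to_base_forms(word):
    """Replace every final-form letter with its base letter."""
    return "".join(FINAL_TO_BASE.get(ch, ch) for ch in word)


def apply_final_form(word):
    """Use the final form for the last letter of a word, if it has one."""
    if not word:
        return word
    return word[:-1] + BASE_TO_FINAL.get(word[-1], word[-1])


shalom = "\u05E9\u05DC\u05D5\u05DD"   # shin, lamed, vav, final mem
folded = to_base_forms(shalom)        # ends in ordinary mem
restored = apply_final_form(folded)   # final mem again
```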

However, there are some similarities between the two languages, besides RTL, that might be relevant to open source OCR.

First, the two languages omit vowel information from writing in similar ways.  If you'd like to feed OCR output into a speech synthesizer so that a blind person can use it, I guess you will need to recover the vowel information.  Maybe there is some potential here for code or algorithms to be re-used between the two languages?  (By the way, Hebrew with added vowel marks is sometimes called "pointed Hebrew" or "dotted Hebrew" and the usual written Hebrew is sometimes
called "unpointed Hebrew" or "undotted Hebrew".)
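Going in the easy direction, from pointed to unpointed text, is something the two languages can already share code for, because in Unicode both the Hebrew points and the Arabic short-vowel marks are combining marks.  The sketch below removes them, which is one way to turn a pointed corpus into (unpointed, pointed) training pairs for the harder task of recovering the vowels.  The function name is my own, and dropping every character in the "Mn" category is an assumption that may be too aggressive for some texts.

```python
import unicodedata


def strip_points(text):
    """Drop nonspacing combining marks (Unicode category Mn).

    This covers Hebrew points and Arabic short-vowel marks, but it also
    removes any other nonspacing mark, which may not always be wanted.
    """
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")


# Pointed Hebrew "shalom": shin + shin dot + qamats, lamed, vav + holam, final mem.
pointed_hebrew = "\u05E9\u05C1\u05B8\u05DC\u05D5\u05B9\u05DD"
# Vowelled Arabic "kataba": kaf, teh and beh, each followed by a fatha.
pointed_arabic = "\u0643\u064E\u062A\u064E\u0628\u064E"

unpointed_hebrew = strip_points(pointed_hebrew)
unpointed_arabic = strip_points(pointed_arabic)

# One (input, target) pair per word for a vowel-restoration model.
training_pairs = [
    (unpointed_hebrew, pointed_hebrew),
    (unpointed_arabic, pointed_arabic),
]
```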

Second, the two languages have a number of similarities in grammar and morphology (e.g., the use of three-character roots).  If you want to use word-level n-grams, I think it will work well (if there's enough training data) to use word-level n-grams of the same type that people use for English, but it may work even better if you use specialized n-gram algorithms that reflect Arabic morphology.  I've seen some papers on Arabic speech recognition which do this.  And if it is helpful for Arabic it may be helpful for Hebrew too.  Here are the papers I've seen:

Ghaoui et al.:
or http://www.isca-speech.org/archive/interspeech_2005/i05_1281.html

Kirchhoff et al. (the link is to a PDF of her 2006 journal paper in
Computer Speech and Language):

David Gelbart
