
Re: start of SIRAGI project



Hi Behdad,

On Thursday 7 April 2005 19:33, Behdad Esfahbod wrote:
> I don't want to be a bastard, but I really think things like tiff
> handling do not belong to an OCR project.  

The TIFF-handling files are in SIRAGI simply to read image files ;-) In the 
same way, in the GOCR source directory you will find pnm.c, pcx.c and tga.c, 
which read PNM, PCX and TGA image files. 

Why include an excerpt of the libtiff files instead of simply telling 
developers to use libtiff? First, to simplify development, since all the 
files are in the same directory. Second, I have made some modifications to 
the Makefile to adapt libtiff to Unix and Windows. That said, I am not against 
removing these files from CVS if the other contributors agree.

Why work with TIFF and not another format? Simply because all scanner 
drivers generate TIFF files, but none generates PNM files! Another reason: 
in the OCR domain the best format is black-and-white TIFF with Group 4 
compression. It is efficient in storage space, it is lossless, and it is 
widely supported by software. 

> In fact, I believe
> an Arabic OCR application is out of place by definition too, the
> same for an Arabic editor, an Arabic spell-checker, etc.

- First:
This is an old discussion: why develop both KDE and GNOME? Why write 
GNU/Linux when BSD exists? And so on. IMHO free software offers many 
solutions to the same problem because there are many ideas and many flavours 
of the same functionality. I am not saying that we should reinvent the wheel. 
I think that if a new project brings new ideas, a new design or a new 
approach, it should be done.

- Second:
I have already tried GOCR, read its documentation and looked at its source 
code. Its design is too specific to Latin characters, and I cannot see how to 
adapt it to Arabic without rewriting everything. Here are some examples:

* Consider line detection: the algorithm assumes that characters are 
written from left to right. To address this issue you would have to 
rewrite the entire horizontal segmentation. That is what I have done in 
SIRAGI.

* Consider what GOCR's author calls "cluster detection", the tool that 
detects the characters in a line. Because Arabic letters are joined within a 
word, applying this algorithm to Arabic text gives you words, not characters. 
An OCR should recognize characters, not words, since there are only 28 
characters but an infinite number of words :-)

* Concerning the heart of GOCR, the OCR engines: they are not general, but 
specifically designed for Latin characters. There are no neural networks, no 
classification based on general pixel comparison, and no vectorization. So no 
line of code can be adapted from these engines  :-)
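
To illustrate the first two points, here is a minimal sketch in Python. It is 
not code from GOCR or SIRAGI. The function names and the toy images are made 
up for this mail, and I am assuming that "cluster detection" behaves roughly 
like grouping connected black pixels, which is a simplification. The line 
detection shown is a plain horizontal projection profile, chosen because it 
does not depend on the writing direction.

```python
from collections import deque


def parse(rows):
    """Turn rows of '#' (ink) and '.' (background) into a 0/1 bitmap."""
    return [[1 if c == "#" else 0 for c in row] for row in rows]


def find_text_lines(image):
    """Return (top, bottom) row ranges that contain ink.

    A row belongs to a text line if it has at least one black pixel,
    so the result is the same for left-to-right and right-to-left text.
    """
    lines, start = [], None
    for y, row in enumerate(image):
        has_ink = any(row)
        if has_ink and start is None:
            start = y
        elif not has_ink and start is not None:
            lines.append((start, y - 1))
            start = None
    if start is not None:
        lines.append((start, len(image) - 1))
    return lines


def connected_components(image):
    """Return bounding boxes (left, top, right, bottom) of 8-connected ink."""
    height = len(image)
    width = len(image[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for y in range(height):
        for x in range(width):
            if not image[y][x] or seen[y][x]:
                continue
            seen[y][x] = True
            queue = deque([(x, y)])
            left = right = x
            top = bottom = y
            while queue:
                cx, cy = queue.popleft()
                left, right = min(left, cx), max(right, cx)
                top, bottom = min(top, cy), max(bottom, cy)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        nx, ny = cx + dx, cy + dy
                        if (0 <= nx < width and 0 <= ny < height
                                and image[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((nx, ny))
            boxes.append((left, top, right, bottom))
    return boxes


# Three isolated glyphs, as in printed Latin text.
SEPARATE = parse(["##..##..##",
                  "##..##..##",
                  "##..##..##"])

# The same glyphs joined by a baseline stroke, as in an Arabic word.
JOINED = parse(["##..##..##",
                "##..##..##",
                "##########"])

# A toy page with two text lines separated by blank rows.
PAGE = parse(["..........",
              "##..##..##",
              "##########",
              "..........",
              "..........",
              "#..#......",
              ".........."])

if __name__ == "__main__":
    print(len(connected_components(SEPARATE)))  # 3 boxes: one per glyph
    print(len(connected_components(JOINED)))    # 1 box: the whole word
    print(find_text_lines(PAGE))
```

On the joined image the grouping returns a single box for the whole word, 
which is why a further step is needed to split an Arabic word into 
characters.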

Conclusion: I think SIRAGI-OCR really is a necessity. We have no 
alternative but to write new software from scratch, with a more general 
design that handles Arabic text. Later, we can easily adapt it to 
recognize Latin characters!

Anyway, thanks Behdad for warning us against reinventing the wheel, but I 
sincerely think we are not falling into this trap.

Best regards

Tarik