Re: start of SIRAGI project
- To: Development Discussions <developer at arabeyes dot org>
- Subject: Re: start of SIRAGI project
- From: Tarik FDIL <tfdil at sagma dot ma>
- Date: Thu, 7 Apr 2005 21:35:38 +0000
Hi Behdad,
On Thursday 7 April 2005 19:33, Behdad Esfahbod wrote:
> I don't want to be a bastard, but I really think things like tiff
> handling do not belong to an OCR project.
The files dealing with TIFF are in SIRAGI simply to read image files ;-) In
the same way, the GOCR source directory contains pnm.c, pcx.c and tga.c to
read PNM, PCX and TGA image files.
Why include an excerpt of the libtiff files instead of simply telling
developers to use libtiff? First, to simplify development, since all the
files are in the same directory. Second, I have made some modifications to
the Makefile to adapt libtiff to Unix and Windows. That said, I'm not against
removing these files from CVS if other contributors agree.
Why work with TIFF and not another format? Simply because all scanner
drivers generate TIFF files but none generates PNM files! Another reason:
in the OCR domain the best format is B&W TIFF Group 4. It is efficient in
storage space, it loses no pixels, and it is well known by all software.
> In fact, I believe
> an Arabic OCR application is out of place by definition too, the
> same for an Arabic editor, an Arabic spell-checker, etc.
- First:
This is an old discussion: why code both KDE and GNOME? Why write GNU/Linux
when BSD exists? And so on. IMHO free software offers many solutions to the
same problem, since there are many ideas and many flavours of the same
functionality. I'm not saying that we should reinvent the wheel; I think that
if a new project brings new ideas, a new design or a new approach, it should
be done.
- Second:
I have already tried GOCR, read its documentation and looked at its source
code. Its design is too specific to Latin characters; I can't see how to
adapt it to Arabic without rewriting everything. Here are some examples:
* Take line detection: the algorithm assumes that characters are written
from left to right. To address this issue you have to rewrite the entire
horizontal segmentation. That's what I have done in SIRAGI.
* Take what GOCR's author calls "cluster detection". This is the tool that
detects the characters in a line. If you apply this algorithm to Arabic text
you will get words, not characters. An OCR should recognize characters, not
words, since there are only 28 characters but an infinite number of words :-)
* Take the heart of GOCR, the OCR engines. They are not general, but
specifically designed for Latin characters. There are no neural networks, no
classification using general pixel comparison, and no vectorization. So no
line of code can be adapted from these engines :-)
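
To illustrate the first two points, here is a minimal Python sketch. It is
neither GOCR's nor SIRAGI's actual code; the function names and the toy
bitmaps are hypothetical and exist only for this example. It finds text lines
from row occupancy alone, which does not depend on the writing direction, and
it counts connected clusters of ink pixels. It assumes a non-empty
rectangular bitmap where 1 means ink. When the glyphs are joined by a
baseline stroke, as in connected Arabic script, the whole group comes out as
a single cluster.

```python
from collections import deque


def find_text_lines(bitmap):
    """Return (top, bottom) row ranges that contain at least one ink pixel."""
    lines, start = [], None
    for y, row in enumerate(bitmap):
        has_ink = any(row)
        if has_ink and start is None:
            start = y                      # a new text line begins
        elif not has_ink and start is not None:
            lines.append((start, y - 1))   # a blank row closes the line
            start = None
    if start is not None:
        lines.append((start, len(bitmap) - 1))
    return lines


def count_clusters(bitmap):
    """Count 8-connected groups of ink pixels with a breadth-first fill."""
    height, width = len(bitmap), len(bitmap[0])
    seen = [[False] * width for _ in range(height)]
    clusters = 0
    for y in range(height):
        for x in range(width):
            if not bitmap[y][x] or seen[y][x]:
                continue
            clusters += 1
            seen[y][x] = True
            queue = deque([(y, x)])
            while queue:
                cy, cx = queue.popleft()
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = cy + dy, cx + dx
                        if (0 <= ny < height and 0 <= nx < width
                                and bitmap[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
    return clusters


def parse(rows):
    """Turn rows of '#' (ink) and '.' (background) into a 0/1 bitmap."""
    return [[1 if c == "#" else 0 for c in row] for row in rows]


# Three separate strokes, as in printed Latin text.
ISOLATED = parse(["#..#..#",
                  "#..#..#",
                  "#..#..#"])

# The same strokes joined by a baseline, as in connected script.
JOINED = parse(["#..#..#",
                "#..#..#",
                "#######"])

# A toy page with two text lines separated by blank rows.
PAGE = parse([".......",
              "#..#..#",
              "#######",
              ".......",
              ".......",
              "#.#....",
              "......."])

if __name__ == "__main__":
    print(find_text_lines(PAGE))
    print(count_clusters(ISOLATED), count_clusters(JOINED))
```

On the toy data, the isolated strokes give three clusters and the joined
strokes give one, so splitting connected script into characters needs a
further segmentation step beyond cluster detection.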
Conclusion: I think SIRAGI-OCR really is a necessity. We have no alternative
but to write new software from scratch, with a more general design that can
handle Arabic text. Later, we can easily adapt it to recognize Latin
characters!
Anyway, thanks Behdad for warning us against reinventing the wheel, but I
sincerely think we are not falling into this trap.
Best regards
Tarik