Posted to users@spamassassin.apache.org by Matthias Keller <li...@matthias-keller.ch> on 2006/08/09 10:00:19 UTC

Poor gocr results on some pics?

Hi

I'm having trouble getting good results from gocr on some of the pics 
that came in.
Strangely, Chris from the FuzzyOCR plugin was able to scan them correctly, 
but we haven't found out why there is such a big difference.

I'm using gocr-0.40-3 on SuSE 10.1 and netpbm-10.26.12-5.4 (for giftopnm)

Here's the pic in question as original gif (I joined the parts to make 
it easier for gocr):
http://www.matthias-keller.ch/ocrmail.gif
and converted to pnm:
http://www.matthias-keller.ch/ocrmail.pnm

And here's what   gocr -i ocrmail.pnm   spits out in my case:
http://www.matthias-keller.ch/ocrmail.gocr

Chris was able to recognize WAY more... I suspect it has something to do 
with the background color. I found similar gifs with colourful 
backgrounds; some of them I was able to OCR, and others, like this one, 
not at all.

Do you have any hints as to why I get such poor results on that one?
What can I try? Perhaps some params? I already tried playing with the 
grey level and dust_size params of gocr but didn't get anywhere close to 
Chris's results.
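
For anyone wanting to repeat the experiment, here is a minimal Python sketch of the pipeline described above: convert the GIF with giftopnm, then run gocr with explicit grey level and dust size settings. The helper names (build_gocr_command, run_ocr) are made up for illustration, and it assumes that -l (grey level) and -d (dust size) are the gocr options meant here, so check the gocr man page for your version before relying on it.

```python
import subprocess


def build_gocr_command(pnm_path, grey_level=None, dust_size=None):
    """Build the argument list for gocr (hypothetical helper).

    Assumes -l sets the grey level threshold and -d the dust size;
    leaving both as None lets gocr use its own defaults.
    """
    cmd = ["gocr"]
    if grey_level is not None:
        cmd += ["-l", str(grey_level)]
    if dust_size is not None:
        cmd += ["-d", str(dust_size)]
    cmd += ["-i", pnm_path]
    return cmd


def run_ocr(gif_path, pnm_path, **options):
    """Convert a GIF to PNM with giftopnm, then OCR it with gocr."""
    # giftopnm writes the converted image to stdout.
    with open(pnm_path, "wb") as out:
        subprocess.run(["giftopnm", gif_path], stdout=out, check=True)
    result = subprocess.run(
        build_gocr_command(pnm_path, **options),
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

Calling run_ocr("ocrmail.gif", "ocrmail.pnm", grey_level=180) for a handful of grey levels and comparing the output by eye is probably the quickest way to see whether the parameters matter at all.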

Thanks!

Matt

Re: Poor gocr results on some pics?

Posted by decoder <de...@own-hero.net>.

Loren Wilton wrote:
>> Here's the pic in question as original gif (I joined the parts to
>>  make it easier for gocr):
>> http://www.matthias-keller.ch/ocrmail.gif and converted to pnm:
>> http://www.matthias-keller.ch/ocrmail.pnm
>>
>> And here's what   gocr -i ocrmail.pnm   spits out in my case:
>> http://www.matthias-keller.ch/ocrmail.gocr
>
> The only thing your scan got decently was the sans-serif font.  All
>  of the serif font stuff and the italic sans-serif fonts stuff
> turned to garbage.
>
> I'm not quite sure why this should be.  That looks like pretty
> clean text that should be pretty recognizable.  The contrast could
> be a problem, but that 100% accuracy on the one line indicates that
> it probably isn't.  There should be an option to one of the
> programs to do a b/w transform on this. That may help.
>
> I'd look to see if the ocr program has any options on the kinds of
> fonts it recognizes.
>
> Ok.  A little playing around in Photoshop.  That is all
> anti-aliased fonts. It looks real good in the gif.  If you convert
> it to jpg, or I suspect any other lossy compression at standard
> compression rates, the results are unusable; there just aren't
> enough pixels.
>
> If you keep all of the pixels (doing this on Windows I went
> gif->bmp to import it to Photoshop) you have better luck.
>
> However, if you attempt to threshold to b/w at the default 50%
> threshold level the results are unusable.  If you threshold at
> around 170-190 (out of 255), or roughly 67-75%, then you get much
> better results.
>
> If you can't control the threshold level, you can try taking the
> contrast up.  I set the contrast to 100% and then thresholded.  The
>  results weren't quite as good, but they were numbers that don't
> require experimentation.  A contrast around 90% might have been
> better, but I didn't try that yet.
>
> Loren
>

We have now traced the problem to his gocr version. I ran my gocr over
his pnm file and got very good results compared to his, in fact the
same results I got on his gif. So something is probably wrong with his
gocr version, because you don't need special arguments (we are using
the same ones).

Chris


Re: Poor gocr results on some pics?

Posted by jdow <jd...@earthlink.net>.
From: "Loren Wilton" <lw...@earthlink.net>

>> Here's the pic in question as original gif (I joined the parts to make it 
>> easier for gocr):
>> http://www.matthias-keller.ch/ocrmail.gif
>> and converted to pnm:
>> http://www.matthias-keller.ch/ocrmail.pnm
>>
>> And here's what   gocr -i ocrmail.pnm   spits out in my case:
>> http://www.matthias-keller.ch/ocrmail.gocr
> 
> The only thing your scan got decently was the sans-serif font.  All of the 
> serif font stuff and the italic sans-serif fonts stuff turned to garbage.
> 
> I'm not quite sure why this should be.  That looks like pretty clean text 
> that should be pretty recognizable.  The contrast could be a problem, but 
> that 100% accuracy on the one line indicates that it probably isn't.  There 
> should be an option to one of the programs to do a b/w transform on this. 
> That may help.
> 
> I'd look to see if the ocr program has any options on the kinds of fonts it 
> recognizes.
> 
> Ok.  A little playing around in Photoshop.  That is all anti-aliased fonts. 
> It looks real good in the gif.  If you convert it to jpg, or I suspect any 
> other lossy compression at standard compression rates, the results are 
> unusable; there just aren't enough pixels.
> 
> If you keep all of the pixels (doing this on Windows I went gif->bmp to 
> import it to Photoshop) you have better luck.
> 
> However, if you attempt to threshold to b/w at the default 50% threshold 
> level the results are unusable.  If you threshold at around 170-190 (out of 
> 255), or roughly 67-75%, then you get much better results.
> 
> If you can't control the threshold level, you can try taking the contrast 
> up.  I set the contrast to 100% and then thresholded.  The results weren't 
> quite as good, but they were numbers that don't require experimentation.  A 
> contrast around 90% might have been better, but I didn't try that yet.

Based on my video work, I suggested to Loren treating the background
color as the "blue screen" color and superimposing the blue-screened text
over a white background. Then enhance the contrast. That gives the image
better OCR potential.
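
A rough Python sketch of that keying idea, for illustration only. The key colour, the tolerance and the function names are all assumptions; the real background colour of the image would have to be sampled first.

```python
def key_out_background(rgb_rows, key_color, tolerance=30):
    """Replace pixels close to key_color with white, keep the rest."""
    def is_background(pixel):
        # A pixel is background if every channel is within tolerance.
        return all(abs(c - k) <= tolerance for c, k in zip(pixel, key_color))

    white = (255, 255, 255)
    return [[white if is_background(p) else p for p in row] for row in rgb_rows]


def to_gray(rgb_rows):
    """Convert RGB pixels to 8-bit grey using the common luma weights."""
    return [[round(0.299 * r + 0.587 * g + 0.114 * b) for r, g, b in row]
            for row in rgb_rows]


# Hypothetical colours: a yellowish background pixel and a dark blue text pixel.
pixels = [[(200, 180, 90), (60, 60, 120)]]
keyed = key_out_background(pixels, key_color=(205, 185, 95))
print(to_gray(keyed))
```

After keying, the background sits at pure white, so a contrast boost or a threshold only has to separate the text from white rather than from a coloured backdrop.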

{^_^}

Re: Poor gocr results on some pics?

Posted by Loren Wilton <lw...@earthlink.net>.
> Here's the pic in question as original gif (I joined the parts to make it 
> easier for gocr):
> http://www.matthias-keller.ch/ocrmail.gif
> and converted to pnm:
> http://www.matthias-keller.ch/ocrmail.pnm
>
> And here's what   gocr -i ocrmail.pnm   spits out in my case:
> http://www.matthias-keller.ch/ocrmail.gocr

The only thing your scan got decently was the sans-serif font.  All of the 
serif font stuff and the italic sans-serif fonts stuff turned to garbage.

I'm not quite sure why this should be.  That looks like pretty clean text 
that should be pretty recognizable.  The contrast could be a problem, but 
that 100% accuracy on the one line indicates that it probably isn't.  There 
should be an option to one of the programs to do a b/w transform on this. 
That may help.

I'd look to see if the ocr program has any options on the kinds of fonts it 
recognizes.

Ok.  A little playing around in Photoshop.  That is all anti-aliased fonts. 
It looks real good in the gif.  If you convert it to jpg, or I suspect any 
other lossy compression at standard compression rates, the results are 
unusable; there just aren't enough pixels.

If you keep all of the pixels (doing this on Windows I went gif->bmp to 
import it to Photoshop) you have better luck.

However, if you attempt to threshold to b/w at the default 50% threshold 
level the results are unusable.  If you threshold at around 170-190 (out of 
255), or roughly 67-75%, then you get much better results.
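
To make the threshold comparison concrete, here is a small Python sketch. It assumes dark text on a lighter background and uses made-up grey values for one row crossing an anti-aliased stroke; threshold_to_bw is a hypothetical helper, not something from gocr or Photoshop.

```python
def threshold_to_bw(gray_rows, level):
    """Map 8-bit grey values to 1 (ink) or 0 (paper).

    Assumes dark text on a lighter background: anything darker than
    `level` counts as ink.
    """
    return [[1 if value < level else 0 for value in row] for row in gray_rows]


# One row across a stroke: background, soft edge, core, soft edge, background.
row = [230, 150, 40, 150, 230]
print(threshold_to_bw([row], 128))  # 50% threshold  -> [[0, 0, 1, 0, 0]]
print(threshold_to_bw([row], 180))  # 170-190 range  -> [[0, 1, 1, 1, 0]]
```

At 50% the soft anti-aliased edges fall on the paper side and the stroke thins to a single pixel; at 180 they count as ink and the stroke keeps its width.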

If you can't control the threshold level, you can try taking the contrast 
up.  I set the contrast to 100% and then thresholded.  The results weren't 
quite as good, but they were numbers that don't require experimentation.  A 
contrast around 90% might have been better, but I didn't try that yet.
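
A rough sketch of the contrast-first route, again in Python. This is a plain linear stretch around mid-grey, which is an assumption and certainly not Photoshop's actual contrast algorithm; boost_contrast and its gain parameter are hypothetical.

```python
def boost_contrast(gray_rows, gain):
    """Stretch 8-bit grey values away from mid-grey (128), clamped to 0..255."""
    def stretch(value):
        return max(0, min(255, round(128 + (value - 128) * gain)))

    return [[stretch(v) for v in row] for row in gray_rows]


# Same made-up row as a stroke with soft anti-aliased edges.
row = [230, 150, 40, 150, 230]
print(boost_contrast([row], 4.0))  # -> [[255, 216, 0, 216, 255]]
```

In this toy example the soft edge value 150 moves to 216, so a later 50% threshold still drops it, which may be part of why this route was not quite as good as a hand-picked threshold.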

        Loren