Posted to users@spamassassin.apache.org by sa...@excite.com on 2005/03/02 19:35:36 UTC

Suggestion: OCR

I've just run some tests with gocr (a command-line OCR program for Linux), and it proves to be safe: if it fails to detect text, you get a nonsense collection of symbols. It can handle PNM (and some other formats) directly, but not GIFs or JPEGs. It assumes the text is darker than the background, so some preprocessing is needed (I don't know how to invert the colors with Linux tools, but it's a matter of googling). What I managed to detect is

[quote]
click here to get removed

all other enquiries send to:
we_master@emarketingdeals,com
[/quote]

The real picture is in the attachment. Two OCR errors were made. The command I used was
giftopnm tG0rzUDQO.gif | pnmtojpeg --quality=100 | djpeg -pnm -grayscale | gocr -

That gives you the quote on the standard output.
Sometimes you have to split an animated GIF into a sequence of PNGs. Then it looks like this

gif2png tG0rzUDQO.gif
[.png, .p01, .p02, ... are generated]
pngtopnm tG0rzUDQO.png | pnmtojpeg --quality=100 | djpeg -pnm -grayscale | gocr -
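
To cover every frame rather than only the first, the step above could be looped. This is only a sketch: the function name ocr_frames is made up for illustration, and the frame naming (.png, .p01, .p02, ...) is taken from the note above rather than checked against every gif2png version.

```shell
# Hypothetical helper: OCR every frame that gif2png produced for one GIF.
# Frame naming (.png, .p01, .p02, ...) follows the description in this post.
ocr_frames() {
    base=${1%.gif}                      # tG0rzUDQO.gif -> tG0rzUDQO
    for frame in "$base".png "$base".p[0-9][0-9]; do
        [ -f "$frame" ] || continue     # skip a pattern that matched nothing
        echo "== $frame =="             # separator so frames can be told apart
        pngtopnm "$frame" | pnmtojpeg --quality=100 | djpeg -pnm -grayscale | gocr -
    done
}
```

It would be called after the split, e.g. gif2png tG0rzUDQO.gif followed by ocr_frames tG0rzUDQO.gif.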

For JPEGs the procedure is simpler: only the last two steps are needed. Processing time is negligible (at least on my machine).
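
Put together, the per-format steps could be wrapped roughly as follows. This is a sketch under assumptions, not a tested recipe: the function name ocr_image and the INVERT switch are made up for illustration, the JPEG branch assumes that only the djpeg and gocr steps are needed, and pnminvert (part of netpbm) is my guess at a tool for the color inversion mentioned earlier.

```shell
# Hypothetical helper: OCR one image with the tools named in this thread.
# Set INVERT=1 for light text on a dark background; pnminvert (netpbm) is
# an assumption, it is not one of the tools tried in this post.
ocr_image() {
    case "$1" in
        *.gif)        decode="giftopnm" ;;
        *.png)        decode="pngtopnm" ;;
        *.jpg|*.jpeg) decode="" ;;
        *)            echo "unsupported image type: $1" >&2; return 1 ;;
    esac
    {
        if [ -n "$decode" ]; then
            # GIF/PNG: convert to PNM, re-encode as JPEG, decode to grayscale PNM.
            "$decode" "$1" | pnmtojpeg --quality=100 | djpeg -pnm -grayscale
        else
            # JPEG: djpeg reads the file directly, so the first two steps are skipped.
            djpeg -pnm -grayscale "$1"
        fi
    } | {
        # Optionally swap light and dark before OCR; otherwise pass through.
        if [ "${INVERT:-0}" = 1 ]; then pnminvert; else cat; fi
    } | gocr -
}
```

For example, ocr_image tG0rzUDQO.gif prints the recognized text on standard output, and INVERT=1 ocr_image tG0rzUDQO.gif does the same after inverting the colors.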

The question is whether it would be feasible to convert all pictures into text with gocr and then run the usual SA tests on the combined email. If so, could someone implement that?
Thanks a lot.
Sasha.

_______________________________________________
Join Excite! - http://www.excite.com
The most personalized portal on the Web!

Re: Suggestion: OCR

Posted by Matt Kettler <mk...@evi-inc.com>.
At 01:35 PM 3/2/2005,  wrote:
>I've just made tests with gocr (an OCR command-line linux software) and it
>proves to be safe, i.e. if it fails to detect a text, you see some
>nonsense collection of symbols.

That part is definitely NOT safe in the context of SpamAssassin... To SA,
nonsense looks a lot like bugs in spam mailers, and very little like
legitimate email.

If nothing else, consider the tripwire rules, which look for letter 
combinations that don't exist in normal English...





RE: Suggestion: OCR

Posted by Greg Allen <ga...@netrox.net>.
User to user post... (I am not a developer)

I can see where this may be something to consider 10 or 20 years from now
when we all have supercomputers in our pockets. :-)

But until then...

I would concentrate on implementing the latest SpamAssassin, 3.0.2.

It is a bit of work to get it working correctly. There are a lot of
plug-ins, but they can be tested with spamassassin -D --lint etc. until you
get them all running correctly.
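
A minimal sketch of how that check could be narrowed down to one plug-in at a time. The command spamassassin -D --lint is the one named above; the wrapper name sa_lint_grep is made up for illustration, and which keywords actually show up in the debug output depends on your installation.

```shell
# Hypothetical helper: run SpamAssassin's lint check in debug mode and keep
# only the output lines that mention a given keyword (case-insensitive).
sa_lint_grep() {
    # Debug output is written to stderr, so merge it into stdout first.
    spamassassin -D --lint 2>&1 | grep -i -- "$1"
}
```

For example, sa_lint_grep pyzor shows only the lines that mention Pyzor.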

But once you get all of the supporting programs installed (dcc, pyzor, db,
dns, uri, etc.; this upgrade is not for wusses), this new version
kicks major spam ass..assin.

The new URIBL function is the greatest thing since RBL, DCC and Pyzor. It
detects spam websites linked in the email.

It doesn't matter if they post an image in the email with no text: if the
image has an IP or web address that is reported as spam, they are likely to
trip the spam points. I just took care of the same issue with the upgrade to
3.0.2.

Good luck!


