Posted to users@spamassassin.apache.org by Jeff Chan <je...@surbl.org> on 2006/10/27 15:29:52 UTC

ImageInfo vs FuzzyOCR performance?

Does anyone have any recent feedback about the performance of
ImageInfo versus FuzzyOCR about detecting stock image spams (or
any others)?  Does FuzzyOCR catch significantly more spams than
ImageInfo?

Cheers,

Jeff C.
-- 
Jeff Chan
mailto:jeffc@surbl.org
http://www.surbl.org/

RE: ImageInfo vs FuzzyOCR performance?

Posted by Rob McEwen <ro...@PowerViewSystems.com>.

Jeff Chan wrote:
> Does anyone have any recent feedback about the performance of
> ImageInfo versus FuzzyOCR about detecting stock image spams (or
> any others)?  Does FuzzyOCR catch significantly more spams than
> ImageInfo?

One of the things that ImageInfo does to avoid FPs is assign a higher
score to image-only spam where the ratio of screen-space/amount-of-text is
high. But notice how more of this type of spam has gibberish text at the
bottom lately? That throws the formula off and produces a VERY small
ImageInfo score. I know the spammers might have been doing this to get
around Bayes... but I suspect they were really trying to get around
ImageInfo, because this change-up seemed to happen soon after ImageInfo
was introduced.

Nevertheless, I've found that manually readjusting those ratios has helped
to catch more spam. (And I'm reluctant to mention this in the first place
because if they are adjusted at the SARE site, then the spammers will only
readjust accordingly!)
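
A rough sketch of the ratio idea described above, in Python. This is not
ImageInfo's actual code: the function names, the tier thresholds, and the
scores are all made up for illustration. It only shows why padding a
message with gibberish text drives a pixel-area-to-text ratio (and any
score keyed to it) down.

```python
def pixel_area_to_text_ratio(image_sizes, body_text):
    """Total image pixel area divided by the number of non-whitespace
    characters in the text part (at least 1, to avoid division by zero)."""
    area = sum(width * height for width, height in image_sizes)
    chars = sum(1 for c in body_text if not c.isspace())
    return area / max(chars, 1)


def ratio_score(ratio, tiers=((50000, 3.0), (10000, 2.0), (2000, 1.0))):
    """Map a ratio to a score. The tiers are hypothetical values, not the
    ones shipped with any real ruleset; they must be sorted from the
    highest minimum ratio to the lowest."""
    for min_ratio, score in tiers:
        if ratio >= min_ratio:
            return score
    return 0.0


# One 600x400 image with almost no text: high ratio, nonzero score.
sparse = pixel_area_to_text_ratio([(600, 400)], "x" * 20)

# Same image, padded with 200 words of filler text: ratio collapses.
padded = pixel_area_to_text_ratio([(600, 400)], "x" * 20 + " gibberish" * 200)

print(sparse, ratio_score(sparse))
print(padded, ratio_score(padded))
```

Lowering the minimum ratios in the tiers is the kind of manual
readjustment meant above; how far to go depends on your own mail mix.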

Rob McEwen
PowerView Systems

Re: ImageInfo vs FuzzyOCR performance?

Posted by Jorge Valdes <jv...@intercom.com.sv>.

Jeff Chan wrote:
> Does anyone have any recent feedback about the performance of
> ImageInfo versus FuzzyOCR about detecting stock image spams (or
> any others)?  Does FuzzyOCR catch significantly more spams than
> ImageInfo?
>
> Cheers,
>
> Jeff C.
>   
I may be biased, as I help with FuzzyOcr development, but I do use both.
ImageInfo is fine and will get you part of the way there, but FuzzyOcr
hits more often. Scanning ~8K msgs/day, FuzzyOcr hits ~1600 times
and ImageInfo hits < 150 times on average. On my system, here are the
top 10 rule hits from yesterday:

 SPAM Results:
       3936 Message(s) 49.83%
     19.399 Average Score
 
       3343 Time(s)    7.50%   84.93% Hit Rule: BAYES_99
       3068 Time(s)    6.88%   77.95% Hit Rule: HTML_MESSAGE
       1655 Time(s)    3.71%   42.05% Hit Rule: FUZZY_OCR
       1527 Time(s)    3.42%   38.80% Hit Rule: SARE_GIF_ATTACH
       1411 Time(s)    3.16%   35.85% Hit Rule: URIBL_BLACK
       1274 Time(s)    2.86%   32.37% Hit Rule: URIBL_BLACK_OVERLAP
       1271 Time(s)    2.85%   32.29% Hit Rule: MIME_HTML_ONLY
       1215 Time(s)    2.72%   30.87% Hit Rule: URIBL_JP_SURBL
       1187 Time(s)    2.66%   30.16% Hit Rule: RCVD_IN_BL_SPAMCOP_NET
       1184 Time(s)    2.66%   30.08% Hit Rule: SARE_GIF_STOX
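
For anyone reading the table: the last percentage column appears to be
each rule's hit count divided by the 3936 spam messages. That is inferred
from the numbers, not stated in the report itself. A small Python check
under that assumption (the variable and function names are illustrative):

```python
SPAM_MESSAGES = 3936  # "Message(s)" count from the report above

# A few hit counts copied from the table.
hits = {"BAYES_99": 3343, "FUZZY_OCR": 1655, "SARE_GIF_STOX": 1184}


def hit_rate(count, total=SPAM_MESSAGES):
    """Percentage of spam messages that hit a rule, rounded to 2 places."""
    return round(100.0 * count / total, 2)


rates = {rule: hit_rate(count) for rule, count in hits.items()}
for rule, rate in rates.items():
    print(f"{rule}: {rate}%")
```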


Jorge Valdes

Re: ImageInfo vs FuzzyOCR performance?

Posted by Kenneth Porter <sh...@sewingwitch.com>.

--On Friday, October 27, 2006 6:29 AM -0700 Jeff Chan <je...@surbl.org> 
wrote:

> Does anyone have any recent feedback about the performance of
> ImageInfo versus FuzzyOCR about detecting stock image spams (or
> any others)?  Does FuzzyOCR catch significantly more spams than
> ImageInfo?

The last I checked, ImageInfo simply reads some header info from the image. 
It's pretty lightweight, probably more so than any Perl-based regex in SA. 
FuzzyOCR is much more compute-intensive, since it has to perform image 
processing (through gocr, as well as conversions necessary to get the input 
into the format that gocr expects).
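
To give a sense of how cheap a header-only check is, here is a minimal
Python sketch that pulls the width and height out of a GIF header. It is
not ImageInfo's code (the function name is made up); it relies only on
the GIF layout, where a 6-byte signature is followed by the logical
screen width and height as little-endian 16-bit integers.

```python
import struct


def gif_dimensions(data: bytes):
    """Return (width, height) from a GIF header, or None if the data
    does not start with a GIF87a/GIF89a signature."""
    if len(data) < 10 or data[:6] not in (b"GIF87a", b"GIF89a"):
        return None
    # Bytes 6-9: logical screen width and height, little-endian uint16.
    width, height = struct.unpack("<HH", data[6:10])
    return width, height


# Build a fake 10-byte header for a 600x400 image and read it back.
header = b"GIF89a" + struct.pack("<HH", 600, 400)
print(gif_dimensions(header))
print(gif_dimensions(b"not a gif"))
```

Only the first ten bytes of the attachment are examined here, with no
decoding of pixel data, which is why this kind of test costs so little
compared with running OCR.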