Posted to users@spamassassin.apache.org by Andrew Bruce <ab...@hope-st.ath.cx> on 2009/04/29 06:05:17 UTC

FuzzyOCR only runs when specifying spamassassin -D


I've been looking at some of the spam emails I've received lately with
images attached and noticed that FuzzyOCR wasn't running against them. 

The same seems to be true when I take these messages and run them with: 

spamassassin -t < img-email.eml 

However if I run them through as follows, I get FuzzyOCR showing up in the
results: 

spamassassin -t -D < img-email.eml 

I also get substantially different AWL results between the two (although I
guess that may be part of the debug procedure). 

Does anyone know why this might be happening? I seem to recall
experiencing this before, but can't remember what I did to fix it. 

spamassassin -t: 

Content analysis details: (22.2 points, 5.0 required)

 pts rule name              description
---- ---------------------- --------------------------------------------------
 1.2 RCVD_IN_PBL            RBL: Received via a relay in Spamhaus PBL
                            [68.186.154.187 listed in zen.spamhaus.org]
 3.0 RCVD_IN_XBL            RBL: Received via a relay in Spamhaus XBL
 0.9 RCVD_IN_SORBS_DUL      RBL: SORBS: sent directly from dynamic IP address
                            [68.186.154.187 listed in dnsbl.sorbs.net]
 3.5 BAYES_99               BODY: Bayesian spam probability is 99 to 100%
                            [score: 1.0000]
 1.0 FH_HELO_EQ_CHARTER     Helo is d-d-d-d charter.com
 4.3 HELO_DYNAMIC_HCC       Relay HELO'd using suspicious hostname (HCC)
 4.4 HELO_DYNAMIC_IPADDR2   Relay HELO'd using suspicious hostname (IP addr 2)
 0.0 FH_HELO_EQ_D_D_D_D     Helo is d-d-d-d
 2.0 RCVD_IN_BL_SPAMCOP_NET RBL: Received via a relay in bl.spamcop.net
                            [Blocked - see ]
 0.0 HTML_MESSAGE           BODY: HTML included in message
 0.1 RDNS_DYNAMIC           Delivered to trusted network by host with
                            dynamic-looking rDNS
 1.8 AWL                    AWL: From: address is in the auto white-list

spamassassin -t -D: 

Content analysis details: (25.7 points, 5.0 required)

 pts rule name              description
---- ---------------------- --------------------------------------------------
 3.0 RCVD_IN_XBL            RBL: Received via a relay in Spamhaus XBL
                            [68.186.154.187 listed in zen.spamhaus.org]
 1.2 RCVD_IN_PBL            RBL: Received via a relay in Spamhaus PBL
 0.9 RCVD_IN_SORBS_DUL      RBL: SORBS: sent directly from dynamic IP address
                            [68.186.154.187 listed in dnsbl.sorbs.net]
 3.5 BAYES_99               BODY: Bayesian spam probability is 99 to 100%
                            [score: 1.0000]
 1.0 FH_HELO_EQ_CHARTER     Helo is d-d-d-d charter.com
 4.3 HELO_DYNAMIC_HCC       Relay HELO'd using suspicious hostname (HCC)
 4.4 HELO_DYNAMIC_IPADDR2   Relay HELO'd using suspicious hostname (IP addr 2)
 0.0 FH_HELO_EQ_D_D_D_D     Helo is d-d-d-d
 2.0 RCVD_IN_BL_SPAMCOP_NET RBL: Received via a relay in bl.spamcop.net
                            [Blocked - see ]
 0.0 HTML_MESSAGE           BODY: HTML included in message
 0.1 RDNS_DYNAMIC           Delivered to trusted network by host with
                            dynamic-looking rDNS
  10 FUZZY_OCR_KNOWN_HASH   BODY:
-5.2 AWL                    AWL: From: address is in the auto white-list

Re: FuzzyOCR only runs when specifying spamassassin -D

Posted by Matt Kettler <mk...@verizon.net>.
Andrew Bruce wrote:
>
> I've been looking at some of the spam emails I've received lately with
> images attached and noticed that FuzzyOCR wasn't running against them.
>
>  
>
> The same seems to be true when I take these messages and run them with:
>
> spamassassin -t < img-email.eml
>
>  
>
> However if I run them through as follows, I get FuzzyOCR showing up in
> the results:
>
> spamassassin -t -D < img-email.eml
>
Well, the rule that tripped was FUZZY_OCR_KNOWN_HASH. I'm no FuzzyOCR
expert, but I'm guessing that's related to it storing the hashes of
images attached to previous spam in a SQL database. So, in that case, it
would have fired the second time regardless of -D being enabled. It's
just firing because it has already seen the image once before and
cataloged it as belonging to spam.

Glancing at FuzzyOCR's code for the first time, I think this is related
to the focr_enable_image_hashing option.
>
>  
>
> I also get substantially different AWL results between the two
> (although I guess that maybe part of the debug procedure).
>
-D does not change the AWL.

The AWL score change is a function of two things:

1) scanning the message multiple times. Every time you process it, the
AWL will change, because every scanned message gets factored into the
AWL's historical average score.

2) FuzzyOCR firing, which raised the pre-AWL score and in turn drives
down the AWL score. (Remember, the AWL score is based on the difference
between this message and the past average.) Adding +10 to the pre-AWL
score (which FuzzyOCR did) should change the AWL score by -5.0,
assuming the default AWL factor of 0.5.

You saw a total swing of -7.0 (from 1.8 to -5.2), so it looks like the
first run raised the average by 4.0, in turn affecting the AWL score by
-2.0, and then FuzzyOCR caused another -5.0 change in the AWL.

In both cases the AWL still "thought" the message was spam, but in the
second case it noted it had a much higher spam score than the previous
spam, so it brought it back down a bit to split the difference. That's
what the AWL does.
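
The arithmetic above can be sketched in a few lines of Python. This is
only an illustration, assuming the simplified model described in this
post (adjustment = (past average - current score) * factor). The
function name and the value used for the past average are hypothetical;
the 20.4 and 30.4 figures are the sums of the non-AWL rule scores in the
two reports, taking the FuzzyOCR rule as +10.

```python
def awl_adjustment(mean, score, factor=0.5):
    """Points the AWL adds to a message: it pulls the score toward the
    sender's historical mean by `factor` (simplified model, an assumption)."""
    return (mean - score) * factor


# Hypothetical past average for this sender; chosen for illustration only.
hypothetical_mean = 24.0

# Pre-AWL scores: sum of the other rules without and with the +10 rule.
before = awl_adjustment(hypothetical_mean, 20.4)
after = awl_adjustment(hypothetical_mean, 30.4)

# Raising the pre-AWL score by 10 moves the AWL adjustment by about -5.0,
# whatever the past average happens to be.
change_from_rule = after - before
print(round(change_from_rule, 1))

# The reports show the AWL going from 1.8 to -5.2, a swing of -7.0.
# The remainder is the part attributed to the stored average changing.
total_swing = -5.2 - 1.8
residual = total_swing - change_from_rule
print(round(residual, 1))
```

Because the past average cancels out of the difference, the -5.0 figure
does not depend on the hypothetical value chosen for it.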

See also:
http://wiki.apache.org/spamassassin/AwlWrongWay
http://wiki.apache.org/spamassassin/AutoWhitelist



>  
>


Re: FuzzyOCR only runs when specifying spamassassin -D

Posted by René Berber <r....@computer.org>.
Andrew Bruce wrote:

> I've been looking at some of the spam emails I've received lately with
> images attached and noticed that FuzzyOCR wasn't running against them.
> 
[snip]
> However if I run them through as follows, I get FuzzyOCR showing up in
> the results:
> 
> spamassassin -t -D < img-email.eml
> 
[snip]
> Does anyone know why this might be happening?  I seem to recall
> experiencing this before, but can't remember what I did to fix it.

That's the way FuzzyOCR works: if a message already has scored above a
configurable threshold it doesn't scan it, if you run in debug mode the
threshold is ignored.
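
As a rough sketch of that behaviour (not FuzzyOCR's actual code, which
is a Perl plugin), the decision could look like the following. The
function name, parameter names, and the threshold value of 10.0 are all
hypothetical and chosen only to illustrate the logic described above.

```python
def should_run_ocr(score_so_far, skip_threshold, debug=False):
    """Decide whether to scan images, following the description above.

    In debug mode the threshold is ignored and the scan always runs.
    Otherwise the scan is skipped once the message has already scored
    at or above the threshold.
    """
    if debug:
        return True
    return score_so_far < skip_threshold


# The message already scored 20.4 from the other rules, so with an
# assumed threshold of 10.0 the scan is skipped unless debug is on.
print(should_run_ocr(20.4, skip_threshold=10.0))
print(should_run_ocr(20.4, skip_threshold=10.0, debug=True))
```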
-- 
René Berber