Posted to dev@spamassassin.apache.org by Karsten Bräckelmann <gu...@rudersport.de> on 2010/04/18 00:50:32 UTC

Re: Is this a bug? Content analysis details: (6.4 points, 6.5 required)

On Sat, 2010-04-17 at 17:18 -0400, Kevin A. McGrail wrote:
> Comparing the content analysis to the #'s returned, we have a math
> discrepancy....
> 
> X-Spam-Status: Yes, hits=6.5 required=6.5 tests= [...]
> X-Spam-Score: 6.5 (******) [...]

These are non-default. Are they generated and inserted by SA?

> Note the 6.4 below:
> 
> Content analysis details:   (6.4 points, 6.5 required)

Rounding issues? This one added by SA itself, unlike the above?

IIRC, there is *no* rounding (up) by SA, which caused confusion on the
users list before. So, did something else round up the value for the
other headers above? Hmm, *and* decided about the spam status "yes"...?


> -0.5 RCVD_IN_RPBLGOOD       RBL: Good machine, passed greylist, high amounts of
>                             Ham
>                             [64.18.1.27 listed in good.wookie.rptn.ca]
>  1.1 URIBL_GREY             Contains an URL listed in the URIBL greylist
>                             [URIs: constantcontact.com]
>  1.0 EXTRA_MPART_TYPE       Header has extraneous Content-type:...type= entry
>  2.4 ONLINE_PHARMACY        BODY: Online Pharmacy
>  1.2 TVD_VISIT_PHARMA       BODY: TVD_VISIT_PHARMA
>  0.0 HTML_MESSAGE           BODY: HTML included in message
>  2.3 ADVANCE_FEE_2          Appears to be advance fee fraud (Nigerian 419)
>  0.0 T_LOTS_OF_MONEY        Huge... sums of money
> -1.0 KAM_RPTR_PASSED        Passed Mail Relay Reverse DNS Test

Please get a snapshot copy of all your scores *now*, before any
sa-update run, so we can compute the accurate value with more than one
digit after the decimal point.


-- 
char *t="\10pse\0r\0dtu\0.@ghno\x4e\xc8\x79\xf4\xab\x51\x8a\x10\xf4\xf4\xc4";
main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;i<l;i++){ i%8? c<<=1:
(c=*++x); c&128 && (s+=h); if (!(h>>=1)||!t[s+h]){ putchar(t[s]);h=m;s=0; }}}


Re: Is this a bug? Content analysis details: (6.4 points, 6.5 required)

Posted by "Kevin A. McGrail" <KM...@PCCC.com>.
> Regarding code duplication... See, the patch didn't even land in trunk
> yet, but we got inconsistent comments already. *ducks* ;)
>    
I deserved that. I even wrote that the patch should be unified in my 
notes for version 1 of the patch.

So for the record, code forks are evil and lead to inconsistencies very 
quickly.  Mea culpa.

Regards,
KAM

Re: Is this a bug? Content analysis details: (6.4 points, 6.5 required)

Posted by Karsten Bräckelmann <gu...@rudersport.de>.
On Wed, 2010-04-21 at 17:45 -0400, Kevin A. McGrail wrote:
> > So the message in question actually was NOT spam, and falsely reported
> > to be by spamd.
> 
> Exactly.  But "falsely" reported by spamd may be a bit harsh.  It's a 
> confusing rounding error, though, that had me very confused.

I didn't mean to be harsh. But to get the facts straight, and stress the
point that according to the analysis the message actually was marginally
below the spam threshold. Or in other words, $is_spam == 0. Hence,
falsely. :)

> What we are doing is using spamd with the -R option which gives the 
> score and threshold as the first line.  We are using the first line of 
> data to test and see if score >= threshold.
> 
> If we do use -E, the error level is accurately reporting the is_spam 
> status but the inconsistency on scores is still something I consider a bug.

No arguing here. At the very least, this is inconsistent behavior.
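
For callers that only need the verdict, checking the exit code avoids the rounded numbers entirely. Here is a minimal sketch in Python, assuming (as I read the spamc man page) that -E sets the exit code to 1 for spam and 0 otherwise, including on processing failure. The helper name is made up for illustration.

```python
import subprocess


def is_spam_by_exit_code(command: list[str], message: bytes) -> bool:
    """Feed the message to the filter on stdin and treat exit code 1 as spam.

    Hypothetical helper: 'command' is expected to be something like
    ["spamc", "-E"]. Any other exit code counts as "not spam" here.
    """
    result = subprocess.run(command, input=message, stdout=subprocess.DEVNULL)
    return result.returncode == 1
```

A call would look like is_spam_by_exit_code(["spamc", "-E"], raw_message). Since a processing failure also yields 0 under that assumption, this cannot tell "ham" apart from "spamd unreachable".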

> >> However, SpamD/C uses rounding to the 10th for the output of the first
> >> line but then utilizes PerMsgStatus.pm for the report, etc.
> >>      
> > Hmm, the patch is code duplication. And that in a case that caused
> > confusion before. Would be nice if spamd could use PMS functions, rather
> > than duplicating the code. No, I did not look closely at the surrounding
> > code, just a quick look at the patch. :)
> >
> > And there's a bug in your patch, more precisely the comments, added to
> > both PMS and spamd. The comment right before the final return *should*
> > read, with the relevant "not" already added:
> >
> > +  # if the email is NOT spam and $score = $rscore, return the $rscore - 0.1
> > +  #   effectively flooring the value to the closest tenth
> >    
> Thanks.  I fixed this in my routine and left it in the PMS.  Another 
> reason why code duplication is bad.

Please don't do that. The PMS comment actually is part of your patch,
attachment 4750.

Regarding code duplication... See, the patch didn't even land in trunk
yet, but we got inconsistent comments already. *ducks* ;)

> OK, I believe this routine belongs in Util.pm and will submit a new 
> patch that unifies the code.

Unifying sounds really good.

> https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6419


Re: Is this a bug? Content analysis details: (6.4 points, 6.5 required)

Posted by "Kevin A. McGrail" <KM...@PCCC.com>.
> So the message in question actually was NOT spam, and falsely reported
> to be by spamd.
>    

Exactly.  But "falsely" reported by spamd may be a bit harsh.  It's a 
confusing rounding error, though, that had me very confused.

What we are doing is using spamd with the -R option which gives the 
score and threshold as the first line.  We are using the first line of 
data to test and see if score >= threshold.
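
To make the pitfall concrete, here is a rough Python sketch of that kind of first-line test. The function names are invented for illustration, and the exact score of roughly 6.4894 is the figure mentioned elsewhere in this thread.

```python
from decimal import Decimal
from typing import Tuple


def parse_first_line(line: str) -> Tuple[Decimal, Decimal]:
    """Split a 'score/threshold' line such as '6.5/6.5' into two Decimals."""
    score_text, threshold_text = line.strip().split("/", 1)
    return Decimal(score_text), Decimal(threshold_text)


def looks_like_spam(line: str) -> bool:
    """Naive check on the displayed numbers only (illustrative helper)."""
    score, threshold = parse_first_line(line)
    return score >= threshold


# The rounded first line passes the test ...
print(looks_like_spam("6.5/6.5"))            # True
# ... although the unrounded score is below the threshold.
print(Decimal("6.4894") >= Decimal("6.5"))   # False
```

The comparison itself is fine; the problem is that it runs on a number that was already rounded up before it was printed.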

If we do use -E, the error level is accurately reporting the is_spam 
status but the inconsistency on scores is still something I consider a bug.

>> However, SpamD/C uses rounding to the 10th for the output of the first
>> line but then utilizes PerMsgStatus.pm for the report, etc.
>>      
> Hmm, the patch is code duplication. And that in a case that caused
> confusion before. Would be nice if spamd could use PMS functions, rather
> than duplicating the code. No, I did not look closely at the surrounding
> code, just a quick look at the patch. :)
>
> And there's a bug in your patch, more precisely the comments, added to
> both PMS and spamd. The comment right before the final return *should*
> read, with the relevant "not" already added:
>
> +  # if the email is NOT spam and $score = $rscore, return the $rscore - 0.1
> +  #   effectively flooring the value to the closest tenth
>
>    
Thanks.  I fixed this in my routine and left it in the PMS.  Another 
reason why code duplication is bad.

OK, I believe this routine belongs in Util.pm and will submit a new 
patch that unifies the code.

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6419

regards,
KAM

Re: Is this a bug? Content analysis details: (6.4 points, 6.5 required)

Posted by Karsten Bräckelmann <gu...@rudersport.de>.
On Wed, 2010-04-21 at 12:03 -0400, Kevin A. McGrail wrote:
> From my understanding of the code and the wiki 
> http://wiki.apache.org/spamassassin/RoundingIssues, the score for an 
> email that is 6.4500000000 to 6.4999999999 with a threshold of 6.5 as 
> SpamAssassin reports should be score=6.4, required=6.5.  The old code 
> used to round to nearest 10th giving the message that X-Spam-Status: No, 
> score=6.5, required=6.5 was confusing so a special routine for rounding 
> when near the required score was implemented.

Ah, a gray area I didn't really know about so far. :)  OK, so in case it is
NOT spam, but a rounded score would suggest this nonetheless, just don't
round and floor instead -- less precise number, but no confusion because
the result matches the visible numbers.
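
As I understand it, the rule boils down to something like the following Python sketch. This is not the PerMsgStatus.pm code, just an illustration under that reading; the function name and the use of Decimal are my own choices.

```python
from decimal import Decimal, ROUND_FLOOR, ROUND_HALF_UP

TENTH = Decimal("0.1")


def display_score(score: str, required: str) -> Decimal:
    """Return the score to display, with one digit after the decimal point.

    Illustrative helper: round to the nearest tenth as usual, but if the
    message is below the threshold and the rounded value would reach it,
    floor to the tenth instead so the visible numbers match the verdict.
    """
    s, r = Decimal(score), Decimal(required)
    rounded = s.quantize(TENTH, rounding=ROUND_HALF_UP)
    if s < r and rounded >= r:
        return s.quantize(TENTH, rounding=ROUND_FLOOR)
    return rounded


print(display_score("6.4894", "6.5"))  # 6.4 (not spam, so floored)
print(display_score("6.5", "6.5"))     # 6.5 (spam, shown as is)
print(display_score("6.32", "6.5"))    # 6.3 (plain rounding, no conflict)
```

With a threshold that sits on a tenth, flooring gives the same result as the "$rscore - 0.1" wording in the patch comment.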


> The PerMsgStatus.pm implements this special rounding so that it is 
> reported as score=6.4.

So the message in question actually was NOT spam, and falsely reported
to be by spamd.

> However, SpamD/C uses rounding to the 10th for the output of the first 
> line but then utilizes PerMsgStatus.pm for the report, etc.

Hmm, the patch is code duplication. And that in a case that caused
confusion before. Would be nice if spamd could use PMS functions, rather
than duplicating the code. No, I did not look closely at the surrounding
code, just a quick look at the patch. :)

And there's a bug in your patch, more precisely the comments, added to
both PMS and spamd. The comment right before the final return *should*
read, with the relevant "not" already added:

+  # if the email is NOT spam and $score = $rscore, return the $rscore - 0.1 
+  #   effectively flooring the value to the closest tenth


> Hence the lack of consistency with one part of the output saying 
> score=6.5 and another saying score=6.4.  They should have both relayed 
> the score as 6.4 for clarity as decided previously.
> 
> My patch implements logic with the same rounding in both places.


Re: Is this a bug? Content analysis details: (6.4 points, 6.5 required)

Posted by "Kevin A. McGrail" <KM...@PCCC.com>.
Correcting my own previous post...

From my understanding of the code and the wiki 
http://wiki.apache.org/spamassassin/RoundingIssues, an email scoring 
6.4500000000 to 6.4999999999 with a threshold of 6.5 should be reported 
by SpamAssassin as score=6.4, required=6.5.  The old code used to round 
to the nearest 10th, giving the message "X-Spam-Status: No, score=6.5, 
required=6.5", which was confusing, so a special routine for rounding 
when near the required score was implemented.

The PerMsgStatus.pm implements this special rounding so that it is 
reported as score=6.4.

However, SpamD/C uses rounding to the 10th for the output of the first 
line but then utilizes PerMsgStatus.pm for the report, etc.

Hence the lack of consistency with one part of the output saying 
score=6.5 and another saying score=6.4.  They should have both relayed 
the score as 6.4 for clarity as decided previously.

My patch implements logic with the same rounding in both places.

Regards,
KAM


Re: Is this a bug? Content analysis details: (6.4 points, 6.5 required)

Posted by "Kevin A. McGrail" <KM...@PCCC.com>.
> No, very unlikely you get 6.5 exactly. In particular, because there are
> two rules scoring 0.0, which already is after per-score rounding for the
> report. Hence me asking for the exact scores.
>    

You are correct. Without rounding, it scores 6.4894 or something like 
that.  And the exact adding up is explained well in the rounding wiki 
referenced below.

> A good alternative to spamc for such testing is spamassassin with an
> explicit --cf="use_auto_whitelist 0" option. :)
>    
Thanks!  That is a good trick.  I often have to clear AWL scores while 
doing research.

>> OK, so long and short, this is bug 2607 rearing its head in spamd.
>>
>> I've found the cause of the issue and the fix.  Will open a bug in
>> bugzilla and address this issue further there.
>>      
> Without reading the patch, which score and result is actually correct?
> And what is the exact score?
>    

From my understanding of the code and the wiki 
http://wiki.apache.org/spamassassin/RoundingIssues, an email scoring 
6.45 to 6.4999999999 with a threshold of 6.5 should be reported by 
SpamAssassin as score=6.4, required=6.5.  The old code used to round to 
the nearest 10th, giving the message "X-Spam-Status: No, score=6.5, 
required=6.5", which was confusing, so a special routine for rounding 
when near the required score was implemented.

The PerMsgStatus.pm implements this special rounding.

SpamD/C uses rounding to the 10th for the output of the first line but 
then utilizes PerMsgStatus.pm for the report, etc.

Hence the lack of consistency with one part of the output saying 
score=6.5 and another saying score=6.4.  They should have both said 
score=6.4.

My patch implements logic with the same rounding in both places.

Regards,
KAM



Re: Is this a bug? Content analysis details: (6.4 points, 6.5 required)

Posted by Karsten Bräckelmann <gu...@rudersport.de>.
On Mon, 2010-04-19 at 16:20 -0400, Kevin A. McGrail wrote:
> > Rounding issues? This one added by SA itself, unlike the above?
> > IIRC, there is *no* rounding (up) by SA, which caused confusion on the
> > users list before. So, did something else round up the value for the
> > other headers above? Hmm, *and* decided about the spam status "yes"...?
> 
> If you manually add up the rules below, you get 6.5. So looking for 
> rounding issues was my next step.  In short, here's the bug:

No, very unlikely you get 6.5 exactly. In particular, because there are
two rules scoring 0.0, which already is after per-score rounding for the
report. Hence me asking for the exact scores.
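
A quick Python illustration of why adding up the report lines proves little. The displayed values are the ones from the report quoted in this thread; the "exact" values are HYPOTHETICAL, invented only so that each one rounds to its displayed value. The real scores were not posted.

```python
from decimal import Decimal

# rule name: (score as displayed in the report, HYPOTHETICAL exact score)
rules = {
    "RCVD_IN_RPBLGOOD": ("-0.5", "-0.5"),
    "URIBL_GREY":       ("1.1", "1.093"),
    "EXTRA_MPART_TYPE": ("1.0", "1.0"),
    "ONLINE_PHARMACY":  ("2.4", "2.396"),
    "TVD_VISIT_PHARMA": ("1.2", "1.2"),
    "HTML_MESSAGE":     ("0.0", "0.001"),
    "ADVANCE_FEE_2":    ("2.3", "2.298"),
    "T_LOTS_OF_MONEY":  ("0.0", "0.001"),
    "KAM_RPTR_PASSED":  ("-1.0", "-1.0"),
}

displayed_total = sum(Decimal(shown) for shown, _ in rules.values())
exact_total = sum(Decimal(exact) for _, exact in rules.values())

print(displayed_total)  # 6.5
print(exact_total)      # 6.489
```

So the per-rule rounding alone can move the hand-computed sum across the threshold, which is why only the unrounded scores settle the question.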


> spamc called with parameter -R is printing the report with
> 6.5/6.5

> Content analysis details:   (6.4 points, 6.5 required)
[...]
> -0.0 AWL                    AWL: From: address is in the auto white-list

A good alternative to spamc for such testing is spamassassin with an
explicit --cf="use_auto_whitelist 0" option. :)


> OK, so long and short, this is bug 2607 rearing its head in spamd.
> 
> I've found the cause of the issue and the fix.  Will open a bug in
> bugzilla and address this issue further there.

Without reading the patch, which score and result is actually correct?
And what is the exact score?



Re: Is this a bug? Content analysis details: (6.4 points, 6.5 required)

Posted by "Kevin A. McGrail" <KM...@PCCC.com>.
PREFACE: I've found a bug.  Perhaps more an inconsistency but a bug 
nonetheless.

> These are non-default. Are they generated and inserted by SA?
>    

You are right.  These headers are generated from calls to spamd.  
However, I can recreate the issue though it did take me quite a while!

> Rounding issues? This one added by SA itself, unlike the above?
> IIRC, there is *no* rounding (up) by SA, which caused confusion on the
> users list before. So, did something else round up the value for the
> other headers above? Hmm, *and* decided about the spam status "yes"...?
>    
If you manually add up the rules below, you get 6.5. So looking for 
rounding issues was my next step.  In short, here's the bug:

spamc called with parameter -R prints the report in a format that is documented as:

    "The first line of the output is the message score and the threshold, in this format: score/threshold"

However, the first line does NOT agree with the content analysis details. Here's the output from /usr/local/bin/spamc -R -z -d localhost -u root < /tmp/3:

6.5/6.5

...(STUFF REMOVED)

Content analysis details:   (6.4 points, 6.5 required)

 1.1 URIBL_GREY             Contains an URL listed in the URIBL greylist
                            [URIs: constantcontact.com]
 1.0 EXTRA_MPART_TYPE       Header has extraneous Content-type:...type= entry
-0.5 TEMP                   TEMP
 2.4 ONLINE_PHARMACY        BODY: Online Pharmacy
 1.2 TVD_VISIT_PHARMA       BODY: TVD_VISIT_PHARMA
 0.0 HTML_MESSAGE           BODY: HTML included in message
 2.3 ADVANCE_FEE_2          Appears to be advance fee fraud (Nigerian 419)
 0.0 T_LOTS_OF_MONEY        Huge... sums of money
-1.0 KAM_RPTR_PASSED        Passed Mail Relay Reverse DNS Test
-0.0 AWL                    AWL: From: address is in the auto white-list


I am using spamc -R and using the first line of output to determine whether the message is ham or spam.

OK, so long and short, this is bug 2607 rearing its head in spamd.

I've found the cause of the issue and the fix.  Will open a bug in bugzilla and address this issue further there.

Regards,
KAM