Posted to users@spamassassin.apache.org by Jason Marshall <ma...@spots.ab.ca> on 2006/04/03 23:19:28 UTC

Is Spamassassin failing math?

X-Spam-Status: No, score=2.7 required=5.0 tests=BAYES_60,SARE_MLB_Stock1,
     TW_AQ autolearn=no version=3.1.0
X-Spam-Report:
     *  1.7 SARE_MLB_Stock1 BODY: SARE_MLB_Stock1
     *  0.1 TW_AQ BODY: Odd Letter Triples with AQ
     *  1.0 BAYES_60 BODY: Bayesian spam probability is 60 to 80%
     *      [score: 0.6809]

To me, that looks more like 2.8 not 2.7 points!  Is this just my site? 
Sorry if someone already brought this up long ago...

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
| Jason Marshall, marshalj@spots.ab.ca. Spots InterConnect, Inc. Calgary, AB |
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

Re: Is Spamassassin failing math?

Posted by Jason Marshall <ma...@spots.ab.ca>.

> http://wiki.apache.org/spamassassin/RoundingIssues

Thanks Daryl, I didn't realize the scores were actually accurate to 3 
decimal places!  Makes sense now, thanks!
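
A minimal sketch of how this can happen, assuming each rule score is stored
with three decimal places, each report line shows its score rounded to one
place, and the total is computed from the unrounded scores. The three-decimal
values below are hypothetical, chosen only so that they display as 1.7, 0.1
and 1.0; the real scores for these rules are not given in this thread.

```python
from decimal import Decimal, ROUND_HALF_UP

# Hypothetical 3-decimal rule scores (assumed for illustration only).
scores = {
    "SARE_MLB_Stock1": Decimal("1.666"),
    "TW_AQ": Decimal("0.050"),
    "BAYES_60": Decimal("1.000"),
}

def one_place(value: Decimal) -> Decimal:
    """Round a score to one decimal place, as shown on a report line."""
    return value.quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)

# What a reader sees on each report line.
shown = {name: one_place(score) for name, score in scores.items()}

# Adding the displayed values vs. adding the stored values.
sum_of_shown = sum(shown.values(), Decimal("0"))
total = sum(scores.values(), Decimal("0"))

print(sum_of_shown)      # 2.8  (what you get adding the report lines by hand)
print(total)             # 2.716
print(one_place(total))  # 2.7  (what the header would show)
```

The report lines add up to 2.8 by hand, but the stored scores add up to 2.716,
which displays as 2.7.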

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
| Jason Marshall, marshalj@spots.ab.ca. Spots InterConnect, Inc. Calgary, AB |
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

Re: Is Spamassassin failing math?

Posted by Matt Kettler <mk...@evi-inc.com>.

Daryl C. W. O'Shea wrote:
> Jason Marshall wrote:
>> X-Spam-Status: No, score=2.7 required=5.0 tests=BAYES_60,SARE_MLB_Stock1,
>>     TW_AQ autolearn=no version=3.1.0
>> X-Spam-Report:
>>     *  1.7 SARE_MLB_Stock1 BODY: SARE_MLB_Stock1
>>     *  0.1 TW_AQ BODY: Odd Letter Triples with AQ
>>     *  1.0 BAYES_60 BODY: Bayesian spam probability is 60 to 80%
>>     *      [score: 0.6809]
>>
>> To me, that looks more like 2.8 not 2.7 points!  Is this just my site?
>> Sorry if someone already brought this up long ago...
> 
> http://wiki.apache.org/spamassassin/RoundingIssues
> 

There are bugs in that page.

The second-half example is from the old rounding behavior, not the more recent
truncation behavior. Modern SA versions would display 8.5 as the score, not 8.6.
We need an example where the error swings in the other direction...

(The top half has been adapted to reflect SA's current behavior, but the bottom
half has not)
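
To illustrate the difference between the two behaviors, here is a small sketch
using a hypothetical total of 8.56 (a made-up value, not one taken from the
wiki page). It is plain arithmetic, not SpamAssassin's actual code.

```python
from decimal import Decimal, ROUND_DOWN, ROUND_HALF_UP

ONE_PLACE = Decimal("0.1")

def rounded(score: Decimal) -> Decimal:
    """Round to one decimal place (the older behavior described above)."""
    return score.quantize(ONE_PLACE, rounding=ROUND_HALF_UP)

def truncated(score: Decimal) -> Decimal:
    """Drop everything after the first decimal place (truncation)."""
    return score.quantize(ONE_PLACE, rounding=ROUND_DOWN)

total = Decimal("8.56")  # hypothetical total score

print(rounded(total))    # 8.6
print(truncated(total))  # 8.5
```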

Re: Is Spamassassin failing math?

Posted by "Daryl C. W. O'Shea" <sp...@dostech.ca>.

Jason Marshall wrote:
> X-Spam-Status: No, score=2.7 required=5.0 tests=BAYES_60,SARE_MLB_Stock1,
>     TW_AQ autolearn=no version=3.1.0
> X-Spam-Report:
>     *  1.7 SARE_MLB_Stock1 BODY: SARE_MLB_Stock1
>     *  0.1 TW_AQ BODY: Odd Letter Triples with AQ
>     *  1.0 BAYES_60 BODY: Bayesian spam probability is 60 to 80%
>     *      [score: 0.6809]
> 
> To me, that looks more like 2.8 not 2.7 points!  Is this just my site? 
> Sorry if someone already brought this up long ago...

http://wiki.apache.org/spamassassin/RoundingIssues

Re: Is Spamassassin failing math?

Posted by Thomas Hochstein <ml...@ancalagon.inka.de>.

Jason Marshall wrote:

> I'm sure I'm not the first one to suggest this, but why NOT always display 
> the numbers in their entirety?  I can't think of any reason why a user 
> would say "please give me less accuracy and a lot more confusion in return 
> for fewer digits to parse".

But I can - and I would. :) I can read numbers with just one decimal
place much better, and I'm not interested at all in "more accuracy".

Why (and when) do I need the scores? When I want to see why mail is
tagged spam or not, and how relevant each rule was for that decision.
It's not important if a rule scores 1.223 or 1.2 - it's not even
important if it scores 1.1 or 1.3. But it *does* matter if it scores
0.2 or 2.5.

So that accuracy is just unnecessary, and it *would* make it harder
- at least for me - to "get" the scores with just one quick look.

Regards,
-thh

Re: Is Spamassassin failing math?

Posted by Jason Marshall <ma...@spots.ab.ca>.

> As stated above: "That's rather ugly and creates a cluttered report".

And as I stated below, I disagree.

> Yes, and it's unnecessary. Life is full of round-number issues.

This one would have been pretty avoidable.

> Learning to accept rounding is unavoidable. People like rounded numbers because they are
> fast and easy to read. Nearly every number you see in life is rounded. You've
> just never checked the background math before.

If I had a nickel for every one of my users who actually read the report 
added to the scanned mail, I'd have about a buck fifty.  As a geek, I like 
real numbers that add up to exactly what they say they'll add up to.

> Do you call the highway dept and complain they can't measure? Do you 
> complain to the auto-maker that your odometer should only show 10km 
> increments?

No, but maybe I should!  *8-)  Or, get my shovel out and "fix" the 
"problem"...

> Why should SA be so different? Why do you expect the numbers to add 
> exactly down to the last decimal place?

That's just goofy.  Okay...  Because SA runs inside a computer, and the 
computer is better at adding up numbers for real than making 
approximations.  Because when you're looking at something a computer did, 
you expect the numbers to actually add up.  Because when you see something 
that appears to be accurate to one tenth, you expect it to actually be 
accurate to one tenth.  Because the effort of adding up all those numbers 
accurately to three decimal places has already been done; why throw away 
the accuracy if you went to the trouble of computing it in the first 
place?  Because no computer user, no matter how unseasoned, is going to be 
shocked to see numbers accurate to three decimal places that actually add 
up to the right answer.  If my dumbest user sees "scored 4.999 out of 
5.000" he'll say "gee, that was close".  If he sees "scored 4.9 out of 
5.0", but all the numbers under it add up to 5.0, he's going to say "gee, 
that's dumb" and pick up the phone to tell me how dumb it is.

> That would be reasonable, however you'd have to re-code the perceptron to
> generate scores that way.

Fair enough.

> That said, I still think the shorter report is more readable and elegant.

I disagree.

Anyway, I've been using this stuff for years and never noticed this 
before, so it's clearly not that big a deal.  I feel better for venting (a 
little), and apologize for wasting everyone's bits.

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
| Jason Marshall, marshalj@spots.ab.ca. Spots InterConnect, Inc. Calgary, AB |
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

Re: Is Spamassassin failing math?

Posted by Matt Kettler <mk...@evi-inc.com>.

Jason Marshall wrote:
>> But SA rounds that rule score to 1.7 to save display space. Most SA rules
>> actually have scores with 3 decimal places (e.g. 1.268).
> 
>> The "real" answer would be to always display 3-decimal place scores,
>> but that's rather ugly and creates a cluttered report. However,
>> you'd always be 100% accurate.
> 
> I'm sure I'm not the first one to suggest this, but why NOT always
> display the numbers in their entirety?  

As stated above: "That's rather ugly and creates a cluttered report".

> 
> Would it really make things more cluttered to add two digits to each number in the report? 

Yes, and it's unnecessary. Life is full of round-number issues. Learning to
accept rounding is unavoidable. People like rounded numbers because they are
fast and easy to read. Nearly every number you see in life is rounded. You've
just never checked the background math before.

Take signposts along roads. Have you never checked those signs telling you how
far away a city is against your odometer? Sometimes you'll pass one sign saying
70km, then another saying 60km, but your odometer will show less than (or more
than) 10km between the two signs. Do you call the highway dept and complain
they can't measure? Do you complain to the auto-maker that your odometer should
only show 10km increments?

No, because we all intuitively know that the numbers are rounded. We all know
that a measurement of 10.5 really means "something near, but not exactly, 10.5".

Why should SA be so different? Why do you expect the numbers to add exactly down
to the last decimal place?

> Or how about making the actual scores accurate to two decimal places, and display those in their entirety -- meeting both sides of the argument 1/2 way?  *8-) 

That would be reasonable, however you'd have to re-code the perceptron to
generate scores that way.

That said, I still think the shorter report is more readable and elegant.

Re: Is Spamassassin failing math?

Posted by Jason Marshall <ma...@spots.ab.ca>.

> But SA rounds that rule score to 1.7 to save display space. Most SA rules
> actually have scores with 3 decimal places (e.g. 1.268).

> The "real" answer would be to always display 3-decimal place scores, but 
> that's rather ugly and creates a cluttered report. However, you'd 
> always be 100% accurate.

I'm sure I'm not the first one to suggest this, but why NOT always display 
the numbers in their entirety?  I can't think of any reason why a user 
would say "please give me less accuracy and a lot more confusion in return 
for fewer digits to parse".

Would it really make things more cluttered to add two digits to each 
number in the report?

Or how about making the actual scores accurate to two decimal places, and 
display those in their entirety -- meeting both sides of the argument 1/2 
way?  *8-)

I could live with:

X-Spam-Status: No, score=-2.50 required=5.00 tests=AWL,BAYES_00 
autolearn=ham version=3.1.0
X-Spam-Report:
     * -2.60 BAYES_00 BODY: Bayesian spam probability is 0 to 1%
     *       [score: 0.0000]
     *  0.10 AWL AWL: From: address is in the auto white-list

In fact, I'll bet the spamassassin developers could make that change and 
I'd never even notice!

Just a thought...
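
The arithmetic behind that proposal, as a sketch: if scores were stored with
two decimal places and printed in full, the report lines would always add up
to the displayed total. The numbers are the ones from the example report
above; the helper name report_total is made up for illustration and is not
part of SpamAssassin.

```python
from decimal import Decimal

def report_total(rule_scores):
    """Sum scores that are stored and displayed at the same precision."""
    return sum(rule_scores, Decimal("0.00"))

# Scores from the example report, kept at two decimal places.
rule_scores = [Decimal("-2.60"), Decimal("0.10")]  # BAYES_00, AWL

total = report_total(rule_scores)
print(f"score={total}")  # score=-2.50
```

Nothing is rounded between storage and display, so nothing can disagree.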

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
| Jason Marshall, marshalj@spots.ab.ca. Spots InterConnect, Inc. Calgary, AB |
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-