Posted to users@spamassassin.apache.org by Leigh Sharpe <ls...@pacificwireless.com.au> on 2006/07/21 03:40:19 UTC

Bayes_00 on spam

Hi all,
 Bayes seems to be missing quite a lot of spam. I'm getting these
results quite often:
 
TOP SPAM RULES FIRED
----------------------------------------------------------------------
RANK    RULE NAME                       COUNT  %OFMAIL %OFSPAM  %OFHAM
----------------------------------------------------------------------
   1    HTML_MESSAGE                     2127    68.95   75.29   62.00
   2    URIBL_BLACK                      1822    39.09   64.50   11.22
   3    URIBL_SC_SURBL                   1664    31.07   58.90    0.54
   4    URIBL_OB_SURBL                   1654    31.61   58.55    2.06
   5    BAYES_00                         1471    65.28   52.07   79.77
   6    URIBL_SBL                        1360    29.81   48.14    9.70
   7    URIBL_WS_SURBL                    922    17.37   32.64    0.62
   8    AWL                               911    42.81   32.25   54.39
   9    URIBL_AB_SURBL                    746    13.89   26.41    0.16
  10    BAYES_99                          707    13.09   25.03    0.00

 
To me, it looks like Bayes_00 is hitting far too much spam.
 
I have fed a large amount of mail into Bayes:
 
[root@mail ~]# sa-learn --dump magic
0.000          0          3          0  non-token data: bayes db version
0.000          0       6468          0  non-token data: nspam
0.000          0       6471          0  non-token data: nham
0.000          0     160969          0  non-token data: ntokens
0.000          0 1150774613          0  non-token data: oldest atime
0.000          0 1153439019          0  non-token data: newest atime
0.000          0 1153436831          0  non-token data: last journal sync atime
0.000          0 1153426735          0  non-token data: last expiry atime
0.000          0    1382400          0  non-token data: last expire atime delta
0.000          0      97882          0  non-token data: last expire reduction count
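
For readers who want to sanity-check a dump like the one above, here is a minimal sketch in Python. It assumes the column layout shown (value in the third column, label after "non-token data:") and that the atime fields are Unix epoch seconds. The helper names parse_magic and utc_date are made up for this illustration and are not part of SpamAssassin.

```python
from datetime import datetime, timezone

# Lines copied from the dump above, with the wrapped labels rejoined.
DUMP = """\
0.000          0          3          0  non-token data: bayes db version
0.000          0       6468          0  non-token data: nspam
0.000          0       6471          0  non-token data: nham
0.000          0     160969          0  non-token data: ntokens
0.000          0 1150774613          0  non-token data: oldest atime
0.000          0 1153439019          0  non-token data: newest atime
0.000          0    1382400          0  non-token data: last expire atime delta
"""


def parse_magic(text):
    """Map each 'non-token data' label to its integer value (third column)."""
    values = {}
    for line in text.splitlines():
        if "non-token data:" not in line:
            continue
        columns, label = line.split("non-token data:", 1)
        values[label.strip()] = int(columns.split()[2])
    return values


def utc_date(epoch_seconds):
    """Render an epoch timestamp as an ISO date in UTC."""
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc).date().isoformat()


magic = parse_magic(DUMP)
summary = {
    "nspam": magic["nspam"],
    "nham": magic["nham"],
    "oldest": utc_date(magic["oldest atime"]),
    "newest": utc_date(magic["newest atime"]),
    "expire_delta_days": magic["last expire atime delta"] / 86400,
}
print(summary)
```

With the figures above this reports 6468 spam and 6471 ham messages learned, token access times spanning 2006-06-20 to 2006-07-20 (UTC), and a last expire atime delta of 16 days.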

And I'm quite certain that it was fed correctly. 
All of the misses I have checked have hit Bayes_00.
 
Any ideas why this is happening? I have toyed with the idea of lowering
the bayes_00 score. Anyone care to enlighten me on whether this would be
a bad idea and why?
 
 
Regards,
             Leigh
 
Leigh Sharpe
Network Systems Engineer
Pacific Wireless
Ph +61 3 9584 8966
Mob 0408 009 502
email lsharpe@pacificwireless.com.au
web www.pacificwireless.com.au
 


Re: Bayes_00 on spam

Posted by jdow <jd...@earthlink.net>.
Leigh, I am VERY sure your Bayes has been severely mistrained if BAYES_00
hits on more spam than BAYES_99. There is no way this inversion should
be taking place. For me BAYES_99 hits almost 85% of all spam and about
0.04% of messages later declared to be ham (although I can't remember
ever noticing it make that mistake). BAYES_00 hits 78% of all ham and
0.07% of spam; actually it is closer to 0.1%, because I recently had a
couple of messages that I regarded as spam slip completely under its
radar. These figures come from 78677 messages logged over a little more
than the last month, of which 58000 are spam and nearly 21000 are ham.

So basically I cannot believe your Bayes was properly trained. (I note
I have hand trained mine on under 2000 messages each for ham and spam.)
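
The inversion described here can be checked mechanically from a "TOP SPAM RULES FIRED" report. The following is a rough sketch, assuming the six-column layout used in this thread (rank, rule name, count, %OFMAIL, %OFSPAM, %OFHAM). The function names are hypothetical, and the sample rows are copied from two reports posted in this thread.

```python
# Rows copied from Leigh's report.
LEIGH = """\
   5    BAYES_00                         1471    65.28   52.07   79.77
  10    BAYES_99                          707    13.09   25.03    0.00
"""

# Rows copied from Gary's report, for contrast.
GARY = """\
    2    BAYES_99                        35269    56.74   68.84    5.17
   19    BAYES_00                         7445    23.53   14.53   61.87
"""


def parse_rule_stats(text):
    """Parse report rows into {rule_name: {count, pct_mail, pct_spam, pct_ham}}."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        # Skip headers, separators, and anything that is not a data row.
        if len(parts) != 6 or not parts[0].isdigit():
            continue
        _rank, name, count, pct_mail, pct_spam, pct_ham = parts
        stats[name] = {
            "count": int(count),
            "pct_mail": float(pct_mail),
            "pct_spam": float(pct_spam),
            "pct_ham": float(pct_ham),
        }
    return stats


def bayes_inverted(stats):
    """True when BAYES_00 fires on a larger share of spam than BAYES_99."""
    return stats["BAYES_00"]["pct_spam"] > stats["BAYES_99"]["pct_spam"]


print("Leigh inverted:", bayes_inverted(parse_rule_stats(LEIGH)))
print("Gary inverted:", bayes_inverted(parse_rule_stats(GARY)))
```

On these rows the check is true for Leigh's report (52.07% against 25.03%) and false for Gary's (14.53% against 68.84%).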

{^_^}



Re: Bayes_00 on spam

Posted by "Gary D. Margiotta" <ga...@tbe.net>.
> Hi all,
> Bayes seems to be missing quite  a lot of spam. I'm getting these
> results quite often:
>

<snip>

Email:    63252  Autolearn: 26740  AvgScore:  14.53  AvgScanTime:  1.69 sec
Spam:     51232  Autolearn: 23252  AvgScore:  21.08  AvgScanTime:  1.68 sec
Ham:      12020  Autolearn:  3488  AvgScore: -13.40  AvgScanTime:  1.72 sec


TOP SPAM RULES FIRED
----------------------------------------------------------------------
RANK    RULE NAME                       COUNT  %OFMAIL %OFSPAM  %OFHAM
----------------------------------------------------------------------
    1    HTML_MESSAGE                    36720    70.25   71.67   64.18
    2    BAYES_99                        35269    56.74   68.84    5.17
    3    URIBL_SBL                       32502    54.28   63.44   15.22
    4    URIBL_JP_SURBL                  31805    50.70   62.08    2.20
    5    URIBL_SC_SURBL                  27524    43.83   53.72    1.65
    6    URIBL_OB_SURBL                  22908    36.27   44.71    0.29
    7    RCVD_IN_BL_SPAMCOP_NET          22082    35.55   43.10    3.35
    8    URIBL_AB_SURBL                  21789    34.63   42.53    0.96
    9    AWL                             19280    43.57   37.63   68.89
   10    RCVD_IN_XBL                     17122    27.09   33.42    0.12
   11    FORGED_RCVD_HELO                15386    28.34   30.03   21.12
   12    RCVD_IN_SORBS_DUL               13501    21.49   26.35    0.74
   13    RCVD_IN_NJABL_DUL               10934    17.37   21.34    0.43
   14    BODY_GAPPY_TEXT                 10888    22.04   21.25   25.40
   15    URIBL_WS_SURBL                  10615    16.80   20.72    0.08
   16    NO_REAL_NAME                     8883    22.63   17.34   45.18
   17    MIME_HTML_ONLY                   8226    16.09   16.06   16.21
   18    MSGID_FROM_MTA_ID                7667    13.04   14.97    4.83
   19    BAYES_00                         7445    23.53   14.53   61.87
   20    SUBJ_SPAMWORD                    7012    11.56   13.69    2.49
----------------------------------------------------------------------


>
> To me, it looks like Bayes_00 is hitting far too much spam.

<snip>

~ $ sa-learn --dump magic
0.000          0          3          0  non-token data: bayes db version
0.000          0    2110713          0  non-token data: nspam
0.000          0     156758          0  non-token data: nham
0.000          0    1608693          0  non-token data: ntokens
0.000          0 1153323145          0  non-token data: oldest atime
0.000          0 1153446556          0  non-token data: newest atime
0.000          0 1153446557          0  non-token data: last journal sync atime
0.000          0 1153367234          0  non-token data: last expiry atime
0.000          0      43200          0  non-token data: last expire atime delta
0.000          0    1204872          0  non-token data: last expire reduction count

>
> I have fed a large amount of mail into Bayes:
>
>
> And I'm quite certain that it was fed correctly.
> All of the misses I have checked have hit Bayes_00.
>
> Any ideas why this is happening? I have toyed with the idea of lowering
> the bayes_00 score. Anyone care to enlighten me on whether this would be
> a bad idea and why?
>


Methinks you don't have enough mail trained in Bayes. Take a look at my
hit counts above, then see how many spam and ham messages (nspam and
nham) my Bayes database has learned.

If more training doesn't correct the scoring, you could lower the score
for BAYES_00, but mine's untouched.
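
As a rough illustration of the "more training" route, the sketch below builds the sa-learn command lines for feeding hand-sorted mail and then re-checking the counts. The directory paths and the function name are placeholders invented for this example; only the --spam, --ham, and --dump magic options are taken from sa-learn itself. Nothing is executed here, the commands are only printed.

```python
import shlex


def build_training_commands(spam_dir, ham_dir):
    """Return argv lists: learn spam, learn ham, then dump the counters."""
    return [
        ["sa-learn", "--spam", spam_dir],
        ["sa-learn", "--ham", ham_dir],
        ["sa-learn", "--dump", "magic"],
    ]


if __name__ == "__main__":
    # Placeholder paths; point these at folders of hand-verified mail.
    commands = build_training_commands("/path/to/confirmed-spam",
                                       "/path/to/confirmed-ham")
    for argv in commands:
        print(shlex.join(argv))
```

Comparing nspam and nham before and after such a run shows whether the messages were actually learned, since already-learned messages are not counted twice.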


-Gary