Posted to users@spamassassin.apache.org by Leigh Sharpe <ls...@pacificwireless.com.au> on 2006/07/21 03:40:19 UTC
Bayes_00 on spam
Hi all,
Bayes seems to be missing quite a lot of spam. I'm getting these
results quite often:
TOP SPAM RULES FIRED
----------------------------------------------------------------------
RANK RULE NAME COUNT %OFMAIL %OFSPAM %OFHAM
----------------------------------------------------------------------
1 HTML_MESSAGE 2127 68.95 75.29 62.00
2 URIBL_BLACK 1822 39.09 64.50 11.22
3 URIBL_SC_SURBL 1664 31.07 58.90 0.54
4 URIBL_OB_SURBL 1654 31.61 58.55 2.06
5 BAYES_00 1471 65.28 52.07 79.77
6 URIBL_SBL 1360 29.81 48.14 9.70
7 URIBL_WS_SURBL 922 17.37 32.64 0.62
8 AWL 911 42.81 32.25 54.39
9 URIBL_AB_SURBL 746 13.89 26.41 0.16
10 BAYES_99 707 13.09 25.03 0.00
To me, it looks like Bayes_00 is hitting far too much spam.
I have fed a large amount of mail into Bayes:
[root@mail ~]# sa-learn --dump magic
0.000 0 3 0 non-token data: bayes db version
0.000 0 6468 0 non-token data: nspam
0.000 0 6471 0 non-token data: nham
0.000 0 160969 0 non-token data: ntokens
0.000 0 1150774613 0 non-token data: oldest atime
0.000 0 1153439019 0 non-token data: newest atime
0.000 0 1153436831 0 non-token data: last journal sync atime
0.000 0 1153426735 0 non-token data: last expiry atime
0.000 0 1382400 0 non-token data: last expire atime delta
0.000 0 97882 0 non-token data: last expire reduction count
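For anyone who wants to read these counters at a glance, here is a minimal Python sketch that parses the dump above. It assumes the third column holds the value and that the atime fields are Unix epoch seconds; the names parse_magic and to_utc are made up for this illustration and are not part of SpamAssassin.

```python
import re
from datetime import datetime, timezone

# Output of `sa-learn --dump magic` as posted above, with wrapped lines rejoined.
DUMP = """\
0.000 0 3 0 non-token data: bayes db version
0.000 0 6468 0 non-token data: nspam
0.000 0 6471 0 non-token data: nham
0.000 0 160969 0 non-token data: ntokens
0.000 0 1150774613 0 non-token data: oldest atime
0.000 0 1153439019 0 non-token data: newest atime
0.000 0 1153436831 0 non-token data: last journal sync atime
0.000 0 1153426735 0 non-token data: last expiry atime
0.000 0 1382400 0 non-token data: last expire atime delta
0.000 0 97882 0 non-token data: last expire reduction count
"""

# Assumption: four whitespace-separated columns, value in the third one.
LINE = re.compile(r"^\s*\S+\s+\S+\s+(\d+)\s+\S+\s+non-token data:\s*(.+?)\s*$")


def parse_magic(text):
    """Map each 'non-token data' label to its integer value."""
    values = {}
    for line in text.splitlines():
        match = LINE.match(line)
        if match:
            values[match.group(2)] = int(match.group(1))
    return values


def to_utc(epoch_seconds):
    """Convert epoch seconds (assumed) to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)


magic = parse_magic(DUMP)
# Span between the oldest and newest token access times, in days.
window_days = (magic["newest atime"] - magic["oldest atime"]) / 86400

print("spam learned:", magic["nspam"], "ham learned:", magic["nham"])
print("tokens:", magic["ntokens"])
print("oldest atime:", to_utc(magic["oldest atime"]).isoformat())
print("newest atime:", to_utc(magic["newest atime"]).isoformat())
print("atime window (days):", round(window_days, 1))
```

The counts only say how much was learned, not whether each message was learned into the right class, so this does not confirm or rule out mistraining.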
And I'm quite certain that it was fed correctly.
All of the misses I have checked have hit Bayes_00.
Any ideas why this is happening? I have toyed with the idea of lowering
the bayes_00 score. Anyone care to enlighten me on whether this would be
a bad idea and why?
Regards,
Leigh
Leigh Sharpe
Network Systems Engineer
Pacific Wireless
Ph +61 3 9584 8966
Mob 0408 009 502
email lsharpe@pacificwireless.com.au
web www.pacificwireless.com.au
Re: Bayes_00 on spam
Posted by jdow <jd...@earthlink.net>.
Leigh, I am VERY sure your Bayes has been severely mistrained if BAYES_00
hits more spam than BAYES_99. This inversion should never happen. For me
BAYES_99 hits almost 85% of all spam and about 0.04% of messages later
declared to be ham (although I can't remember ever noticing it make that
mistake). BAYES_00 hits 78% of all ham and 0.07% of spam. Actually it is
closer to 0.1%, because I recently had a couple of messages that I regarded
as spam slip completely under its radar. This is on a corpus of 78677
messages from the last month or so; of those, about 58000 are spam and
nearly 21000 are ham.
So basically I cannot believe your Bayes was properly trained. (Note that
I hand-trained mine on under 2000 messages each for ham and spam.)
{^_^}
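The inversion described in this reply can be checked mechanically. The following is a rough Python sketch, not anything shipped with SpamAssassin: it reads rows in the RANK / RULE NAME / COUNT / %OFMAIL / %OFSPAM / %OFHAM layout from the first post and reports whether BAYES_00 fires on a larger share of spam than BAYES_99. The helper names parse_rows and bayes_inverted are hypothetical.

```python
# Two rows copied from the "TOP SPAM RULES FIRED" table in the first post.
LEIGH_ROWS = """\
5 BAYES_00 1471 65.28 52.07 79.77
10 BAYES_99 707 13.09 25.03 0.00
"""


def parse_rows(text):
    """Parse 'RANK RULE COUNT %OFMAIL %OFSPAM %OFHAM' rows into a dict by rule."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 6:
            continue  # skip headers, separators, and blank lines
        _rank, name, count, pct_mail, pct_spam, pct_ham = parts
        stats[name] = {
            "count": int(count),
            "pct_mail": float(pct_mail),
            "pct_spam": float(pct_spam),
            "pct_ham": float(pct_ham),
        }
    return stats


def bayes_inverted(stats):
    """True when BAYES_00 hits a larger share of spam than BAYES_99 does."""
    return stats["BAYES_00"]["pct_spam"] > stats["BAYES_99"]["pct_spam"]


leigh = parse_rows(LEIGH_ROWS)

# Percentages quoted in this reply, entered by hand for comparison.
jdow = {
    "BAYES_00": {"pct_spam": 0.07, "pct_ham": 78.0},
    "BAYES_99": {"pct_spam": 85.0, "pct_ham": 0.04},
}

print("Leigh inverted:", bayes_inverted(leigh))
print("jdow inverted:", bayes_inverted(jdow))
```

A True result only flags the symptom. It says nothing about which training messages caused it.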
Re: Bayes_00 on spam
Posted by "Gary D. Margiotta" <ga...@tbe.net>.
> Hi all,
> Bayes seems to be missing quite a lot of spam. I'm getting these
> results quite often:
>
<snip>
Email: 63252 Autolearn: 26740 AvgScore: 14.53 AvgScanTime: 1.69 sec
Spam: 51232 Autolearn: 23252 AvgScore: 21.08 AvgScanTime: 1.68 sec
Ham: 12020 Autolearn: 3488 AvgScore: -13.40 AvgScanTime: 1.72 sec
TOP SPAM RULES FIRED
----------------------------------------------------------------------
RANK RULE NAME COUNT %OFMAIL %OFSPAM %OFHAM
----------------------------------------------------------------------
1 HTML_MESSAGE 36720 70.25 71.67 64.18
2 BAYES_99 35269 56.74 68.84 5.17
3 URIBL_SBL 32502 54.28 63.44 15.22
4 URIBL_JP_SURBL 31805 50.70 62.08 2.20
5 URIBL_SC_SURBL 27524 43.83 53.72 1.65
6 URIBL_OB_SURBL 22908 36.27 44.71 0.29
7 RCVD_IN_BL_SPAMCOP_NET 22082 35.55 43.10 3.35
8 URIBL_AB_SURBL 21789 34.63 42.53 0.96
9 AWL 19280 43.57 37.63 68.89
10 RCVD_IN_XBL 17122 27.09 33.42 0.12
11 FORGED_RCVD_HELO 15386 28.34 30.03 21.12
12 RCVD_IN_SORBS_DUL 13501 21.49 26.35 0.74
13 RCVD_IN_NJABL_DUL 10934 17.37 21.34 0.43
14 BODY_GAPPY_TEXT 10888 22.04 21.25 25.40
15 URIBL_WS_SURBL 10615 16.80 20.72 0.08
16 NO_REAL_NAME 8883 22.63 17.34 45.18
17 MIME_HTML_ONLY 8226 16.09 16.06 16.21
18 MSGID_FROM_MTA_ID 7667 13.04 14.97 4.83
19 BAYES_00 7445 23.53 14.53 61.87
20 SUBJ_SPAMWORD 7012 11.56 13.69 2.49
----------------------------------------------------------------------
>
> To me, it looks like Bayes_00 is hitting far too much spam.
<snip>
~ $ sa-learn --dump magic
0.000 0 3 0 non-token data: bayes db version
0.000 0 2110713 0 non-token data: nspam
0.000 0 156758 0 non-token data: nham
0.000 0 1608693 0 non-token data: ntokens
0.000 0 1153323145 0 non-token data: oldest atime
0.000 0 1153446556 0 non-token data: newest atime
0.000 0 1153446557 0 non-token data: last journal sync atime
0.000 0 1153367234 0 non-token data: last expiry atime
0.000 0 43200 0 non-token data: last expire atime delta
0.000 0 1204872 0 non-token data: last expire reduction count
>
> I have fed a large amount of mail into Bayes:
>
>
> And I'm quite certain that it was fed correctly.
> All of the misses I have checked have hit Bayes_00.
>
> Any ideas why this is happening? I have toyed with the idea of lowering
> the bayes_00 score. Anyone care to enlighten me on whether this would be
> a bad idea and why?
>
Methinks you don't have enough mail trained in Bayes. Take a look at my
numbers for hit count, then see how many spam and ham messages (nspam and
nham) I have in my Bayes database.
If more training doesn't correct the scoring, you could lower the score
for BAYES_00, but mine's untouched.
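To see why the BAYES_00 score matters for the misses, here is a toy Python sketch of additive scoring. It assumes a message is tagged as spam when the sum of the scores of the rules it hits reaches the required score (5.0 is assumed here as the usual default). Every per-rule score below is a made-up illustrative number, not the value shipped with any SpamAssassin release, and total_score and is_spam are hypothetical helpers.

```python
REQUIRED_SCORE = 5.0  # assumed threshold for tagging a message as spam

# HYPOTHETICAL scores, chosen only to illustrate the arithmetic.
SCORES = {
    "URIBL_BLACK": 3.0,
    "URIBL_SC_SURBL": 2.5,
    "BAYES_00": -2.6,
}


def total_score(rules_hit, scores):
    """Sum the scores of every rule the message hit."""
    return sum(scores[rule] for rule in rules_hit)


def is_spam(rules_hit, scores, required=REQUIRED_SCORE):
    """A message is tagged when its total reaches the required score."""
    return total_score(rules_hit, scores) >= required


# A spam that hits two URIBL rules but is wrongly given BAYES_00.
missed = ["URIBL_BLACK", "URIBL_SC_SURBL", "BAYES_00"]

# Same message with the BAYES_00 score weakened toward zero.
weakened = dict(SCORES, BAYES_00=-0.1)

print("original scores tagged:", is_spam(missed, SCORES))
print("weakened BAYES_00 tagged:", is_spam(missed, weakened))
```

With these made-up numbers the negative BAYES_00 score is what pulls the total under the threshold. Weakening it catches this message, but it also removes the same credit from every legitimate message that hits BAYES_00, which is why fixing the training comes first.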
-Gary