Posted to users@spamassassin.apache.org by Leigh Sharpe <ls...@pacificwireless.com.au> on 2006/06/30 01:45:07 UTC

Training Bayes properly

So it looks like I have to reset my Bayes and re-train it. I want to do
it properly this time. I will be making sure I personally review every
message that our users put into the spam folder first, to make sure they
haven't put spam into the wrong folder. However, I have a couple of
questions:
 
1) Am I better off to feed it a few emails a day, or wait until I get a
few hundred, then feed them all to sa-learn at once? Is there really a
difference?
2) How many spams should I feed it? I've heard in some places that 200
is OK, I've heard elsewhere that 10000 or more are needed.
3) Just how 'balanced' should its diet be? Should I use the same
quantity of ham as spam, or can I get away with less ham than spam?
 
 
Regards,
             Leigh
 
Leigh Sharpe
Network Systems Engineer
Pacific Wireless
Ph +61 3 9584 8966
Mob 0408 009 502
email lsharpe@pacificwireless.com.au
web www.pacificwireless.com.au

Re: Training Bayes properly

Posted by jdow <jd...@earthlink.net>.

200 is OK. 2000 is enough. Over the years from 2.43 forward my entire
spam and ham corpus contents amount to under 2000 each and Bayes is
running remarkably smoothly for me. I am "tempted" to enable automatic
learning to see what will happen. I'll take a snapshot of my Bayes
first, though. (The "get a round tuit" aspect involved is that I have
a strong aversion to fixing what isn't broken. {^_-})
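
A snapshot of the kind mentioned above could be taken with sa-learn's
--backup option, which writes the database as text to standard output;
--restore reads such a file back. This is only a minimal sketch, assuming
sa-learn is on the PATH and is run as the user that owns the Bayes
database. The function name, the file naming scheme and the SA_LEARN
override are made up for illustration.

```shell
#!/bin/sh
# snapshot_bayes DIR: dump the Bayes database to a dated text file in DIR
# and print the name of the file that was written.
# SA_LEARN is a hypothetical override for testing; it defaults to sa-learn.
snapshot_bayes() {
    dir=${1:-.}
    out="$dir/bayes-backup-$(date +%Y%m%d).txt"
    "${SA_LEARN:-sa-learn}" --backup > "$out" && echo "$out"
}

# Usage (not run here):
#   snapshot_bayes /var/backups
#   sa-learn --restore /var/backups/bayes-backup-YYYYMMDD.txt
```

Taking the snapshot before enabling auto-learning means the hand-trained
database can be put back if the experiment goes badly.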

{^_^}

Re: Training Bayes properly

Posted by jdow <jd...@earthlink.net>.

 sa-learn --dump magic
0.000          0          3          0  non-token data: bayes db version
0.000          0       2709          0  non-token data: nspam
0.000          0       3366          0  non-token data: nham
0.000          0     184836          0  non-token data: ntokens
0.000          0 1075078447          0  non-token data: oldest atime
0.000          0 1151877819          0  non-token data: newest atime
0.000          0 1151877980          0  non-token data: last journal sync atime
0.000          0 1120128830          0  non-token data: last expiry atime
0.000          0     691200          0  non-token data: last expire atime delta
0.000          0      41004          0  non-token data: last expire reduction count
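
The nspam and nham lines are the ones that matter for the question asked.
A small sketch that pulls them out of the dump and compares them with the
200-message minimum quoted elsewhere in this thread; the function name
bayes_counts is only illustrative, and the sample input is copied from the
dump above.

```shell
#!/bin/sh
# bayes_counts: read `sa-learn --dump magic` output on stdin and report
# how many spam and ham messages have been learned.
# The count is the third whitespace-separated field on each line.
bayes_counts() {
    awk '
        /non-token data: nspam/ { nspam = $3 }
        /non-token data: nham/  { nham = $3 }
        END {
            printf "nspam=%d nham=%d\n", nspam, nham
            # 200 of each is the minimum mentioned in this thread
            if (nspam >= 200 && nham >= 200) print "bayes: active"
            else print "bayes: needs more training"
        }
    '
}

# Sample input taken from the dump shown above
bayes_counts <<'EOF'
0.000          0       2709          0  non-token data: nspam
0.000          0       3366          0  non-token data: nham
EOF
```

On a live system the same function would be fed with
`sa-learn --dump magic | bayes_counts`.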

{^_^}    Joanne
----- Original Message ----- 
From: "Will Nordmeyer" <wi...@willspc.net>

>I have a similar little question...
> 
> How'd these stats get generated:
> 
>> After 6 months I'm at
>> 
>> 0.000          0    1258041          0  non-token data: nspam
>> 0.000          0     996687          0  non-token data: nham
> 
> I'd like to know what mine is at... I've got sa-stats (and a modified
> version) that I run periodically.
> 
> --Will 
> 

RE: Training Bayes properly

Posted by Will Nordmeyer <wi...@willspc.net>.

I have a similar little question...

How'd these stats get generated:

> After 6 months I'm at
> 
> 0.000          0    1258041          0  non-token data: nspam
> 0.000          0     996687          0  non-token data: nham

I'd like to know what mine is at... I've got sa-stats (and a modified
version) that I run periodically.

--Will 


Re: Training Bayes properly

Posted by jdow <jd...@earthlink.net>.

From: "Stefan Jakobs" <st...@rus.uni-stuttgart.de>

> Am Freitag, 30. Juni 2006 02:09 schrieb Rick Macdougall:
>> Hi,
> 
> Hello,
> 
>> And my hit rates are
>>
>> For HAM
>> RANK    RULE NAME    COUNT %OFRULES %OFMAIL %OFSPAM  %OFHAM
>>     1    BAYES_00     22819    24.15   54.61    1.65   96.70
>>
>> And SPAM
>> RANK    RULE NAME    COUNT %OFRULES %OFMAIL %OFSPAM  %OFHAM
>>   4      BAYES_99     10419     4.64   24.93   57.28    0.05
> 
> I've got just a little question: How can I generate the hit rates statistics?

http://www.rulesemporium.com/programs/sa-stats.txt

{^_^}

Re: Training Bayes properly

Posted by Stefan Jakobs <st...@rus.uni-stuttgart.de>.

Am Freitag, 30. Juni 2006 02:09 schrieb Rick Macdougall:
> Hi,

Hello,

> And my hit rates are
>
> For HAM
> RANK    RULE NAME    COUNT %OFRULES %OFMAIL %OFSPAM  %OFHAM
>     1    BAYES_00     22819    24.15   54.61    1.65   96.70
>
> And SPAM
> RANK    RULE NAME    COUNT %OFRULES %OFMAIL %OFSPAM  %OFHAM
>   4      BAYES_99     10419     4.64   24.93   57.28    0.05

I've got just a little question: How can I generate the hit rates statistics?

Bye
Stefan

Re: Training Bayes properly

Posted by jdow <jd...@earthlink.net>.

From: "Rick Macdougall" <ri...@ummm-beer.com>
> jdow wrote:

> I don't know if it's a good example of YMMV, I think both of our bayes 
> are operating at respectable levels given the data they have to deal 
> with. I may wish I could get better results but I really don't think 
> it's possible in the environment I run it.

The YMMV comes in with regard to the training process needs. With
a large user base you must use a different strategy than I can use.
And since I do not have an smtp engine running I can't do some of
the other tricks you might do to eliminate gobs of spam before it
ever reaches the SpamAssassin levels. I am also probably running
far more rules than you are. I run over a megabyte of rules here,
since I can afford the ram and time that consumes.

I was not referring directly to the BAYES_00 score as much as your
overall configuration strategy being the YMMV item. Individual Bayes,
individual rules, and hand training are probably not in the least
suitable for your environment. They are for mine. No single package
SpamAssassin setup is likely to ever be optimum for all needs.

{^_^}

Re: Training Bayes properly

Posted by Rick Macdougall <ri...@ummm-beer.com>.

jdow wrote:
> From: "Rick Macdougall" <ri...@ummm-beer.com>
> EEEEK! I bet you are running system wide Bayes for a very non-homogeneous
> collection of people. I've appended my figures (not the best I have
> seen but very good) below yours. Your BAYES_00 is better than mine
> only if you do not consider the figure I consider most significant,
> the ratio of %OFHAM/%OFSPAM. Your BAYES_99 is worse than mine either
> in absolute terms or via the %OFSPAM/%OFHAM ratio.
> 
>> For HAM
>> RANK    RULE NAME    COUNT %OFRULES %OFMAIL %OFSPAM  %OFHAM
>>    1    BAYES_00     22819    24.15   54.61    1.65   96.70
>     1    BAYES_00     47047    11.65   57.35    0.05   78.57

>> And SPAM
>> RANK    RULE NAME    COUNT %OFRULES %OFMAIL %OFSPAM  %OFHAM
>>  4      BAYES_99     10419     4.64   24.93   57.28    0.05
>   1      BAYES_99     18898     4.42   23.04   85.29    0.04
> 
>> The 1.65% of spam hitting bayes_00 is spam slipping through that I
>> learn later as spam.
> 
> The slip through on BAYES_00 hints you can do better. The scoring
> makes me think you need to feed escaped spams back through to learn
> them as spam more often, if possible.
> 
>> It's been stable now for the last 5 months with about 100K emails a day.
> 
> Whereas I do not use automatic anything and process on the order of
> 2500 per day. This is a fine YMMV example, isn't it?

Yah, it's system wide for about 50K users with both French and English 
users (probably about 70-30 for the French users) and I mostly get 
English spam.

I'll take a bayes_00 FP of 1.65% because it's almost never enough to 
mark it back down to HAM (hits all the uri and razor and ixhash rules 
usually) and 0.05% of ham on bayes_99 I think is outstanding considering 
the amount of mail and different mish mash of users we have.  I've 
actually never seen a FP bayes_99 so I think the 0.05 FP is a bit of a 
misleading percentage and I score bayes_99 at 4.5.

2 weeks ago the percentages were even worse and I expect that next week 
they should be even better.  I have not seen 1 spam slip through in the 
last two days, whereas I usually see 3 or 4 a day slip through (in my 
personal mail box, can't speak for others as I don't peek).

I don't know if it's a good example of YMMV, I think both of our bayes 
are operating at respectable levels given the data they have to deal 
with. I may wish I could get better results but I really don't think 
it's possible in the environment I run it.

Regards,

Rick

Re: Training Bayes properly

Posted by jdow <jd...@earthlink.net>.

From: "Rick Macdougall" <ri...@ummm-beer.com>
> Nigel Frankcom wrote:
>> On Fri, 30 Jun 2006 09:45:07 +1000, "Leigh Sharpe"
>> 
>>> So it looks like I have to reset my Bayes and re-train it. I want to do
>>> it properly this time. I will be making sure I personally review every
>>> message that our users put into the spam folder first, to make sure they
>>> haven't put spam into the wrong folder. However, I have a couple of
>>> questions:
>>>
>>> 1) Am I better off to feed it a few emails a day, or wait until I get a
>>> few hundred, then feed them all to sa-learn at once? Is there really a
>>> difference?
>>> 2) How many spams should I feed it? I've heard in some places that 200
>>> is OK, I've heard elsewhere that 10000 or more are needed.
>>> 3) Just how 'balanced' should its diet be? Should I use the same
>>> quantity of ham as spam, or can I get away with less ham than spam?
>>>
>>>
>> 
>> The minimum corpus is recommended as 200 spam and 200 ham, then add in
>> on an as received basis. My initial corpus was around 500 of each and
>> my bayes has remained stable for several years. The numbers should be
>> about equal though in my experience they don't have to be exact.
>> Though if you do 200 ham and 2000 spam you will skew the scoring in
>> bayes.
>> 
>> Here as FPs or FNs are reported they are trained in accordingly.
>> 
>> I don't use the auto train feature, I've personally found that to be
>> problematic.
> 
> Hi,
> 
> I use auto-train plus feed all my personal spam to bayes (I get 100 - 
> 400 spams a day in my personal email account because I've had the same 
> address since 1995 and I get postmaster, dns, hostmaster, abuse etc).
> 
> After 6 months I'm at
> 
> 0.000          0    1258041          0  non-token data: nspam
> 0.000          0     996687          0  non-token data: nham
> 
> And my hit rates are

EEEEK! I bet you are running system wide Bayes for a very non-homogeneous
collection of people. I've appended my figures (not the best I have
seen but very good) below yours. Your BAYES_00 is better than mine
only if you do not consider the figure I consider most significant,
the ratio of %OFHAM/%OFSPAM. Your BAYES_99 is worse than mine either
in absolute terms or via the %OFSPAM/%OFHAM ratio.

> For HAM
> RANK    RULE NAME    COUNT %OFRULES %OFMAIL %OFSPAM  %OFHAM
>    1    BAYES_00     22819    24.15   54.61    1.65   96.70
     1    BAYES_00     47047    11.65   57.35    0.05   78.57



> And SPAM
> RANK    RULE NAME    COUNT %OFRULES %OFMAIL %OFSPAM  %OFHAM
>  4      BAYES_99     10419     4.64   24.93   57.28    0.05
   1      BAYES_99     18898     4.42   23.04   85.29    0.04

> The 1.65% of spam hitting bayes_00 is spam slipping through that I
> learn later as spam.

The slip through on BAYES_00 hints you can do better. The scoring
makes me think you need to feed escaped spams back through to learn
them as spam more often, if possible.

> It's been stable now for the last 5 months with about 100K emails a day.

Whereas I do not use automatic anything and process on the order of
2500 per day. This is a fine YMMV example, isn't it?
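
The two ratios referred to above can be worked out directly from the
quoted figures. A small sketch using awk for the division; the helper
name ratio is only for illustration, and the inputs are the percentages
from the two pairs of table rows.

```shell
#!/bin/sh
# ratio LABEL NUMERATOR DENOMINATOR: print LABEL followed by
# NUMERATOR / DENOMINATOR rounded to a whole number.
ratio() {
    awk -v label="$1" -v a="$2" -v b="$3" \
        'BEGIN { printf "%s %.0f\n", label, a / b }'
}

# BAYES_00: %OFHAM / %OFSPAM (higher is better)
ratio "Rick BAYES_00" 96.70 1.65
ratio "jdow BAYES_00" 78.57 0.05

# BAYES_99: %OFSPAM / %OFHAM (higher is better)
ratio "Rick BAYES_99" 57.28 0.05
ratio "jdow BAYES_99" 85.29 0.04
```

With these inputs the BAYES_00 ratio comes out at roughly 59 for Rick
against roughly 1571 for jdow, and the BAYES_99 ratio at roughly 1146
against roughly 2132, which is the sense in which a lower raw %OFHAM on
BAYES_00 can still be the stronger result.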

{^_^}

Re: Training Bayes properly

Posted by Rick Macdougall <ri...@ummm-beer.com>.

Nigel Frankcom wrote:
> On Fri, 30 Jun 2006 09:45:07 +1000, "Leigh Sharpe"
> <ls...@pacificwireless.com.au> wrote:
> 
>> So it looks like I have to reset my Bayes and re-train it. I want to do
>> it properly this time. I will be making sure I personally review every
>> message that our users put into the spam folder first, to make sure they
>> haven't put spam into the wrong folder. However, I have a couple of
>> questions:
>>
>> 1) Am I better off to feed it a few emails a day, or wait until I get a
>> few hundred, then feed them all to sa-learn at once? Is there really a
>> difference?
>> 2) How many spams should I feed it? I've heard in some places that 200
>> is OK, I've heard elsewhere that 10000 or more are needed.
>> 3) Just how 'balanced' should its diet be? Should I use the same
>> quantity of ham as spam, or can I get away with less ham than spam?
>>
>>
> 
> The minimum corpus is recommended as 200 spam and 200 ham, then add in
> on an as received basis. My initial corpus was around 500 of each and
> my bayes has remained stable for several years. The numbers should be
> about equal though in my experience they don't have to be exact.
> Though if you do 200 ham and 2000 spam you will skew the scoring in
> bayes.
> 
> Here as FPs or FNs are reported they are trained in accordingly.
> 
> I don't use the auto train feature, I've personally found that to be
> problematic.

Hi,

I use auto-train plus feed all my personal spam to bayes (I get 100 - 
400 spams a day in my personal email account because I've had the same 
address since 1995 and I get postmaster, dns, hostmaster, abuse etc).

After 6 months I'm at

0.000          0    1258041          0  non-token data: nspam
0.000          0     996687          0  non-token data: nham

And my hit rates are

For HAM
RANK    RULE NAME    COUNT %OFRULES %OFMAIL %OFSPAM  %OFHAM
    1    BAYES_00     22819    24.15   54.61    1.65   96.70

And SPAM
RANK    RULE NAME    COUNT %OFRULES %OFMAIL %OFSPAM  %OFHAM
  4      BAYES_99     10419     4.64   24.93   57.28    0.05

The 1.65% of spam hitting bayes_00 is spam slipping through that I
learn later as spam.

It's been stable now for the last 5 months with about 100K emails a day.

Regards,

Rick

Re: Training Bayes properly

Posted by Nigel Frankcom <ni...@blue-canoe.net>.

On Fri, 30 Jun 2006 09:45:07 +1000, "Leigh Sharpe"
<ls...@pacificwireless.com.au> wrote:

>So it looks like I have to reset my Bayes and re-train it. I want to do
>it properly this time. I will be making sure I personally review every
>message that our users put into the spam folder first, to make sure they
>haven't put spam into the wrong folder. However, I have a couple of
>questions:
> 
>1) Am I better off to feed it a few emails a day, or wait until I get a
>few hundred, then feed them all to sa-learn at once? Is there really a
>difference?
>2) How many spams should I feed it? I've heard in some places that 200
>is OK, I've heard elsewhere that 10000 or more are needed.
>3) Just how 'balanced' should its diet be? Should I use the same
>quantity of ham as spam, or can I get away with less ham than spam?
> 
> 
>Regards,
>             Leigh
> 
>Leigh Sharpe
>Network Systems Engineer
>Pacific Wireless
>Ph +61 3 9584 8966
>Mob 0408 009 502
>email lsharpe@pacificwireless.com.au
>web www.pacificwireless.com.au
> 

The minimum corpus is recommended as 200 spam and 200 ham, then add in
on an as received basis. My initial corpus was around 500 of each and
my bayes has remained stable for several years. The numbers should be
about equal though in my experience they don't have to be exact.
Though if you do 200 ham and 2000 spam you will skew the scoring in
bayes.

Here as FPs or FNs are reported they are trained in accordingly.

I don't use the auto train feature, I've personally found that to be
problematic.
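
The 200-of-each minimum described above can be checked before any
training is done. A sketch only, assuming each folder holds one message
per file (maildir style) and that find supports -maxdepth, as GNU and BSD
find do. The function names and the SA_LEARN override are illustrative;
sa-learn's --ham and --spam options do accept a directory.

```shell
#!/bin/sh
# count_msgs DIR: number of regular files directly inside DIR
# (one message per file is assumed).
count_msgs() {
    find "$1" -maxdepth 1 -type f | wc -l | tr -d ' '
}

# check_corpus HAM_DIR SPAM_DIR: print both counts and succeed only if
# each folder holds at least 200 messages.
check_corpus() {
    ham=$(count_msgs "$1")
    spam=$(count_msgs "$2")
    echo "ham=$ham spam=$spam"
    [ "$ham" -ge 200 ] && [ "$spam" -ge 200 ]
}

# train_corpus HAM_DIR SPAM_DIR: learn both folders if the check passes.
# SA_LEARN is a hypothetical override; it defaults to the real sa-learn.
train_corpus() {
    check_corpus "$1" "$2" || {
        echo "corpus too small, not training" >&2
        return 1
    }
    "${SA_LEARN:-sa-learn}" --ham "$1" && "${SA_LEARN:-sa-learn}" --spam "$2"
}

# Usage (not run here):
#   train_corpus ~/corpus/ham ~/corpus/spam
```

Printing the two counts side by side also makes a lopsided corpus, such
as the 200 ham against 2000 spam case mentioned above, easy to spot
before it is learned.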

HTH

Nigel

Re: Training Bayes properly

Posted by Anthony Peacock <a....@chime.ucl.ac.uk>.

Hi,

Loren Wilton wrote:
> You are into the land of opinions here, so you will get different answers.
>  

<SNIP>

> Once you have the basic stuff I personally prefer to leave auto-learning 
> turned off and only hand Bayes hams and spams that might be 
> misclassified, or ones where the bayes score isn't high enough in the 
> appropriate direction.  Others may want to do things differently.
>  
> Personally I'd say that you REALLY should turn off auto-learning at the 
> start, until you have got Bayes a good start in life by hand.  Once you 
> have it working and you are happy with it you may want to turn 
> auto-learning back on, or may not.  If you do turn it back on, you 
> probably want to set bayes-ham-threshold (or whatever the name really 
> is) to around -.1 rather than the default value.

I entirely agree about turning auto-learning off until you are happy 
that Bayes is working pretty well for you.  If you do turn on 
auto-learning it is vital that you adjust the thresholds.  These are my 
values:

bayes_auto_learn_threshold_nonspam -0.1
bayes_auto_learn_threshold_spam 12.0


This has worked really well for me, with a site-wide Bayes database, 
which I train manually on misclassified mail.  I also occasionally learn a handful 
of hams to keep them up to date.








-- 
Anthony Peacock
CHIME, Royal Free & University College Medical School
WWW:    http://www.chime.ucl.ac.uk/~rmhiajp/
"If you have an apple and I have  an apple and we  exchange apples
then you and I will still each have  one apple. But  if you have an
idea and I have an idea and we exchange these ideas, then each of us
will have two ideas." -- George Bernard Shaw

Re: Training Bayes properly

Posted by Loren Wilton <lw...@earthlink.net>.

You are into the land of opinions here, so you will get different answers.

1. The 200 ham and 200 spam is a hard minimum. You can change this. But don't.
So you MUST give Bayes at least 200 each of ham and spam before it will start doing anything. What you give it for ham should hopefully be fairly representative of the ham your site really gets. Likewise the spam should be moderately representative of average spam.

Once you have the basic stuff I personally prefer to leave auto-learning turned off and only hand Bayes hams and spams that might be misclassified, or ones where the bayes score isn't high enough in the appropriate direction. Others may want to do things differently.

Personally I'd say that you REALLY should turn off auto-learning at the start, until you have got Bayes a good start in life by hand. Once you have it working and you are happy with it you may want to turn auto-learning back on, or may not. If you do turn it back on, you probably want to set bayes-ham-threshold (or whatever the name really is) to around -.1 rather than the default value.
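
The setting being reached for here appears to be
bayes_auto_learn_threshold_nonspam, the name used in Anthony Peacock's
reply in this thread. The two phases described above could be written
down as config fragments along these lines. This is a sketch rather than
a recommended configuration: the helper name bayes_fragment and the phase
names are invented for illustration, and the threshold values are the
ones from Anthony's message.

```shell
#!/bin/sh
# bayes_fragment PHASE: print a SpamAssassin config fragment.
#   initial : hand training only, auto-learning switched off
#   steady  : auto-learning on, with a stricter ham threshold
bayes_fragment() {
    case "$1" in
        initial)
            printf '%s\n' 'use_bayes 1' 'bayes_auto_learn 0'
            ;;
        steady)
            printf '%s\n' 'use_bayes 1' 'bayes_auto_learn 1' \
                'bayes_auto_learn_threshold_nonspam -0.1' \
                'bayes_auto_learn_threshold_spam 12.0'
            ;;
        *)
            echo "usage: bayes_fragment initial|steady" >&2
            return 1
            ;;
    esac
}

# Show the fragment for the hand-training phase
bayes_fragment initial
```

The printed lines would go into the site's local.cf or a user's
user_prefs, whichever the installation uses for Bayes settings.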

Loren