Posted to users@spamassassin.apache.org by Kailash Vyas <ka...@gmail.com> on 2006/07/31 12:26:03 UTC

spamassasin learn

Hi all,

I am running SpamAssassin version 3.1.4.
How do I make SpamAssassin learn? I have been reading about sa-learn,
which I am supposed to run on a spam folder. But why should I run it
on a spam folder? I would assume those messages are already in the
SpamAssassin database, since SpamAssassin has already marked them as
spam and autolearn is on. How do I make SpamAssassin learn junk
messages that are not marked as spam?

Also, if I have collected all my spam in a junk folder and run sa-learn
on it, will that apply globally and mark such mail as spam for all
users? I ran sa-learn on a junk folder that I had collected from my
Thunderbird junk folder. There are around 200 messages in this folder,
but after running it I got this message:
Learned tokens from 0 message(s) (0 message(s) examined)

I ran the sa-learn --dump magic command to find out how much
SpamAssassin has learned, and it shows this output:

0.000          0          3          0  non-token data: bayes db version
0.000          0          0          0  non-token data: nspam
0.000          0          0          0  non-token data: nham
0.000          0          0          0  non-token data: ntokens
0.000          0          0          0  non-token data: oldest atime
0.000          0          0          0  non-token data: newest atime
0.000          0          0          0  non-token data: last journal sync atime
0.000          0          0          0  non-token data: last expiry atime
0.000          0          0          0  non-token data: last expire atime delta
0.000          0          0          0  non-token data: last expire reduction co
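
For reading this output in a script, here is a minimal Python sketch. It assumes the value of interest is the third numeric column, as the "bayes db version" line above suggests (value 3), and the function name parse_dump_magic is made up for illustration.

```python
import re

def parse_dump_magic(text):
    """Map each 'non-token data' label to the integer in the third column."""
    values = {}
    for line in text.splitlines():
        # Four numeric columns, then the label after "non-token data:".
        match = re.match(
            r"\s*[\d.]+\s+\d+\s+(\d+)\s+\d+\s+non-token data:\s*(.+)", line
        )
        if match:
            values[match.group(2).strip()] = int(match.group(1))
    return values

# Sample lines copied from the output quoted above.
sample = """\
0.000          0          3          0  non-token data: bayes db version
0.000          0          0          0  non-token data: nspam
0.000          0          0          0  non-token data: nham
0.000          0          0          0  non-token data: ntokens
"""

stats = parse_dump_magic(sample)
print("spam learned:", stats["nspam"])
print("ham learned:", stats["nham"])
print("tokens stored:", stats["ntokens"])
```

With nspam, nham and ntokens all at zero, as here, nothing has been learned into this database yet.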

Please advise,


Thanks,
Kailash

Re: Re: spamassasin learn

Posted by Nigel Frankcom <ni...@blue-canoe.net>.
Hi Kailash,

A lot depends on how you have things set up. Here I run a baseline of
global settings with user options available, using MTS Professional
mailserver (which has a lot of built-in SA interoperability) on Win32
and SA on CentOS with MySQL.

Firstly, you can't put ham & spam in the same dir and then let
sa-learn loose on it. They *must* be in separate directories. For
example I use...

>sa-learn --ham -u <SA USERNAME> /downloads/ham && sa-learn --spam -u <SA USERNAME> /downloads/spam

The above uses individual messages my users report back to me as false
positives or false negatives. A colleague uses mbox format mail files
to achieve the same results.
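
As a rough illustration of the two cases (one message per file versus a single mbox file), the Python sketch below only builds the sa-learn argument list and does not run it. The helper name build_sa_learn_command and the example username are hypothetical; the rule "a regular file is an mbox, a directory holds individual messages" is an assumption that may not fit every setup. The --ham, --spam, -u and --mbox options are the ones sa-learn documents.

```python
import os

def build_sa_learn_command(path, kind, username=None):
    """Return the sa-learn argument list for one ham or spam source.

    kind is "ham" or "spam". A regular file is assumed to be an mbox
    and gets --mbox; anything else is passed as a message directory.
    """
    if kind not in ("ham", "spam"):
        raise ValueError("kind must be 'ham' or 'spam'")
    command = ["sa-learn", "--" + kind]
    if username:
        command += ["-u", username]
    if os.path.isfile(path):
        command.append("--mbox")  # assumption: a single file means mbox format
    command.append(path)
    return command

# Paths taken from the command above; "someuser" is a placeholder.
print(build_sa_learn_command("/downloads/ham", "ham", "someuser"))
print(build_sa_learn_command("/downloads/spam", "spam", "someuser"))
```

The resulting list could be handed to subprocess.run once the paths and username have been checked by hand.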

You also must use a fairly balanced amount of ham/spam. An analogy
that was recently used on the list was about a policeman who only ever
saw knives in fights, one day he meets a chef; by the policeman's
experience ALL knives are bad, therefore the chef is a bad person.

The same applies to Bayes: if you train it only with spam, it has no
ham to compare against.
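
A toy Python sketch of that point follows. It is a deliberately simplified per-token estimate and not the formula SpamAssassin actually uses; the messages and the function name token_spam_probability are invented for illustration.

```python
from collections import Counter

def token_spam_probability(token, spam_counts, ham_counts, n_spam, n_ham):
    """Estimate P(spam | token) from per-class message frequencies.

    Simplified on purpose: no smoothing, equal priors.
    """
    spam_rate = spam_counts[token] / n_spam if n_spam else 0.0
    ham_rate = ham_counts[token] / n_ham if n_ham else 0.0
    if spam_rate + ham_rate == 0:
        return 0.5  # token never seen: no evidence either way
    return spam_rate / (spam_rate + ham_rate)

# Each message is reduced to its set of tokens.
spam_msgs = [{"meeting", "pills"}, {"pills", "cheap"}]
ham_msgs = [{"meeting", "agenda"}, {"meeting", "lunch"}]

spam_counts = Counter(t for m in spam_msgs for t in m)
ham_counts = Counter(t for m in ham_msgs for t in m)
no_ham = Counter()

# Trained on spam only: every token ever seen looks like pure spam.
spam_only = token_spam_probability("meeting", spam_counts, no_ham, len(spam_msgs), 0)
# Trained on both: "meeting" turns out to be more typical of ham.
balanced = token_spam_probability(
    "meeting", spam_counts, ham_counts, len(spam_msgs), len(ham_msgs)
)
print(spam_only, balanced)
```

In the spam-only case the ordinary word "meeting" scores as certain spam, which is the chef-with-a-knife problem from the analogy.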

As to how it helps, you will find there is spam that is not on the
RBL/URI lists, at which point Bayes will catch it (in an ideal
world). Bayes is an important part of SA IMO and it's well worth the
effort of getting the initial corpus of 200 ham & spam right. It can
save you an awful lot of bother later down the line.

Some of the other list members may be able to give you the more
technical reasons why, all I can say is that it's worked very well
here for a number of years; also for a number of my colleagues.

Once you have Bayes set up, it shouldn't need changing bar adding in
FNs & FPs as they arrive. My Bayes db is 4 or 5 years old now and
I've read of older ones.

While a great deal of SA is automated, it does require some manual
intervention. And, it kills an awful lot of spam.

When I first started using SA I was highly skeptical, now I wouldn't
be without it. I'd recommend you persevere, add in your corpus and run
it up for a few weeks to see how you go.

What OS are you running this on and are you using the MySQL backend
for it?

HTH

Kind regards

Nigel





On Mon, 31 Jul 2006 12:19:37 +0100, "Kailash Vyas"
<ka...@gmail.com> wrote:

>Thanks Nigel for your help.  As you say it has caught spam using the
>additional tests. will it not be marked as spam everytime in that
>case. how would it help me to make it learn from spam already marked
>as spam by spam assasin. Is there a way where I can train spamassasin
>by running sa-learn on mail folder which have both ham as well as spam
>messages. Also running sa-learn on a mail folder containing only spam
>is the bayes database going to apply the spam rules learnt globally on
>mailboxes.
>
>Thanks,
>Kailash

Re: spamassasin learn

Posted by Kailash Vyas <ka...@gmail.com>.
Thanks Nigel for your help. As you say, it has caught spam using the
additional tests. Will such mail not be marked as spam every time in
that case? How would it help me to make it learn from spam already
marked as spam by SpamAssassin? Is there a way to train SpamAssassin
by running sa-learn on a mail folder that has both ham and spam
messages? Also, if I run sa-learn on a mail folder containing only
spam, is the Bayes database going to apply what it learnt globally to
all mailboxes?

Thanks,
Kailash


On 7/31/06, Nigel Frankcom <ni...@blue-canoe.net> wrote:
> Hi,
>
> The sa-learn instruction trains the bayes database; without it bayes
> will not tag any messages. You need to do the training with at least
> 200 spam and 200 ham. Be very careful that the messages in each are
> correct, so no spam in the ham folder.
>
> There are options to learn from the mbox format see
> http://spamassassin.apache.org/full/3.1.x/dist/doc/sa-learn.html for
> full instructions and options.
>
> Any spam SA has already caught will have been caught through the
> additional tests SA runs.
>
> Bayes is a powerful tool and not one to discard lightly (imo)
>
> HTH
>
> Nigel

Re: spamassasin learn

Posted by Nigel Frankcom <ni...@blue-canoe.net>.
Hi,

The sa-learn instruction trains the bayes database; without it bayes
will not tag any messages. You need to do the training with at least
200 spam and 200 ham. Be very careful that the messages in each are
correct, so no spam in the ham folder.
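
One way to check the corpus size before training is sketched below in Python. It is only a sketch: the names count_messages and corpus_ready are made up, a directory is assumed to hold one message per file, and a regular file is assumed to be an mbox. It counts messages and does nothing to verify that each one is in the right folder.

```python
import mailbox
import os

MIN_MESSAGES = 200  # the per-class minimum mentioned above

def count_messages(path):
    """Count files in a message directory, or messages in an mbox file."""
    if os.path.isdir(path):
        return sum(1 for entry in os.scandir(path) if entry.is_file())
    box = mailbox.mbox(path, create=False)
    try:
        return len(box)
    finally:
        box.close()

def corpus_ready(ham_path, spam_path, minimum=MIN_MESSAGES):
    """Return (ready, ham_count, spam_count) for the two training sources."""
    ham = count_messages(ham_path)
    spam = count_messages(spam_path)
    return (ham >= minimum and spam >= minimum), ham, spam
```

Calling corpus_ready on the ham and spam locations returns the two counts along with a yes/no answer, so it is easy to see which side is short.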

There are options to learn from the mbox format; see
http://spamassassin.apache.org/full/3.1.x/dist/doc/sa-learn.html for
full instructions and options.

Any spam SA has already caught will have been caught through the
additional tests SA runs.

Bayes is a powerful tool and not one to discard lightly (imo)

HTH

Nigel



On Mon, 31 Jul 2006 11:26:03 +0100, "Kailash Vyas"
<ka...@gmail.com> wrote:

>Hi all,
>
>I am running SpamAssassin version 3.1.4.
>How do I make spamassasin learn. I have been reading about sa-learn
>where I am supposed to run sa-learn on a spam folder. But why should I
> run it on spam folder as i would assume that it should already be in
>spamassasin database as spamassasin has already marked it as spam and
>autolearn is on. how do I make spamassasin learn junk messages which
>are not marked as spam.
>
>Also if I have collected all spam in a junk folder and run sa-learn on
>this will it affect globally and show it as spam for all the users. I
>ran sa-learn on a junk folder which I had collected from my
>thunderbird junk folder. I got this message though after running it
>even though there are around 200 messages in this folder.
>Learned tokens from 0 message(s) (0 message(s) examined)
>
>I ran  sa-learn --dump magic command to find out how much spamassasin
>has learned and it shows this output
>
>0.000          0          3          0  non-token data: bayes db version
>0.000          0          0          0  non-token data: nspam
>0.000          0          0          0  non-token data: nham
>0.000          0          0          0  non-token data: ntokens
>0.000          0          0          0  non-token data: oldest atime
>0.000          0          0          0  non-token data: newest atime
>0.000          0          0          0  non-token data: last journal sync atime
>0.000          0          0          0  non-token data: last expiry atime
>0.000          0          0          0  non-token data: last expire atime delta
>0.000          0          0          0  non-token data: last expire reduction co
>
>Please advise,
>
>
>Thanks,
>Kailash