Posted to users@spamassassin.apache.org by Michael <mi...@michi.su> on 2015/03/27 16:16:13 UTC

How to automatically train each users Bayes?

Hi,

I would like to automatically train each user's Bayes database in the
following way:

Do the following once a day for each user:
1.) sa-learn -u username --ham ../maildir/cur
2.) sa-learn -u username --spam ../maildir/.Spam/cur

The idea is to train the Bayes for each user without the need to take  
care of learning Spam/Ham on their own.
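
A minimal sketch of how the two daily steps could be wrapped in a loop.
Only the two sa-learn invocations come from the steps above; the
/home/<user>/maildir layout, the MAIL_ROOT and SA_LEARN variables and the
function names are assumptions for illustration.

```shell
#!/bin/sh
# Hypothetical settings: adjust to the real mail layout.
SA_LEARN="${SA_LEARN:-sa-learn}"   # command to run; override for a dry run
MAIL_ROOT="${MAIL_ROOT:-/home}"    # assumed parent directory of the user homes

# Train one user's Bayes database from the inbox (ham) and .Spam (spam).
train_user() {
    user=$1
    maildir="$MAIL_ROOT/$user/maildir"
    if [ -d "$maildir/cur" ]; then
        "$SA_LEARN" -u "$user" --ham "$maildir/cur"
    fi
    if [ -d "$maildir/.Spam/cur" ]; then
        "$SA_LEARN" -u "$user" --spam "$maildir/.Spam/cur"
    fi
}

# Walk every directory below MAIL_ROOT that contains a maildir.
train_all_users() {
    for home in "$MAIL_ROOT"/*/; do
        [ -d "${home}maildir" ] || continue
        train_user "$(basename "$home")"
    done
}
```

Calling train_all_users from a daily cron job would cover the "once a day"
part. Running with SA_LEARN=echo in the environment prints the commands
instead of executing them.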

The reason for taking the "cur" folder instead of the "new" folder is  
that I assume that the contents of these folders have already been  
verified for false-positives/negatives by the user.

A problem that could occur is when the user always deletes all mails  
in .Spam/cur. Then the Bayes is only trained with Ham, but never Spam.  
Or isn't that a problem?

What do you think about this strategy?

Thanks,
Michael

Re: How to automatically train each users Bayes?

Posted by Michael <mi...@michi.su>.


On 27.03.2015 16:21, Reindl Harald wrote:
> 
> 
> Am 27.03.2015 um 16:16 schrieb Michael:
>> I would like automatically learn each users Bayes database in the
>> following way:
>>
>> Do the following once a day for each user:
>> 1.) sa-learn -u username --ham ../maildir/cur
>> 2.) sa-learn -u username --spam ../maildir/.Spam/cur
>>
>> The idea is to train the Bayes for each user without the need to take
>> care of learning Spam/Ham on their own.
>>
>> The reason for taking the "cur" folder instead of the "new" folder is
>> that I assume that the contents of these folders have already been
>> verified for false-positives/negatives by the user.
>>
>> A problem that could occur is when the user always deletes all mails in
>> .Spam/cur. Then the Bayes is only trained with Ham, but never Spam. Or
>> isn't that a problem?
>>
>> What do you think about this strategy?
> 
> nothing good because in that case you can just stay at autolearning
> which is on by default after a bayes has at least 200 ham and 200 spam
> samples to get enabled at all
> 

You are probably right. Autolearning is already working for all users
because I always train new users with a preselected ham/spam folder.

Re: How to automatically train each users Bayes?

Posted by Reindl Harald <h....@thelounge.net>.


Am 27.03.2015 um 16:16 schrieb Michael:
> I would like automatically learn each users Bayes database in the
> following way:
>
> Do the following once a day for each user:
> 1.) sa-learn -u username --ham ../maildir/cur
> 2.) sa-learn -u username --spam ../maildir/.Spam/cur
>
> The idea is to train the Bayes for each user without the need to take
> care of learning Spam/Ham on their own.
>
> The reason for taking the "cur" folder instead of the "new" folder is
> that I assume that the contents of these folders have already been
> verified for false-positives/negatives by the user.
>
> A problem that could occur is when the user always deletes all mails in
> .Spam/cur. Then the Bayes is only trained with Ham, but never Spam. Or
> isn't that a problem?
>
> What do you think about this strategy?

Nothing good, because in that case you can just stay with autolearning,
which is on by default. Bayes itself is only used once it has learned at
least 200 ham and 200 spam samples.

Re: How to automatically train each users Bayes?

Posted by James Michael Keller <jm...@houseofzen.org>.

Here is what I'm using to do the same globally, based on each user's mail;
it could be tweaked to work per user. This happens to be a family-only
server, so I'm generally doing the spam/ham review for each user as
needed:

root@omega:/usr/local/bin# more sa-learn-systemwide
#!/bin/sh
#
# sa-learn-systemwide
#
# Run sa-learn against user Maildir folders for ham / spam token learning
#
#

LOGFILE="/var/log/sa-learn-run.log"

SALEARNBIN="/usr/bin/sa-learn"
SAUSERNAME="Debian-exim"
SADBPATH="/var/spool/exim4/.spamassassin/bayes"
SAFOLDERS="/etc/spamassassin/sa-learn-folders.conf"
MAILTO="root@localhost"


#
# Execute sa-learn token database expire of old tokens
#
TIMESTAMP=`date`
echo $TIMESTAMP "sa-learn: Starting token expiration ..." >> $LOGFILE
# Redirect stdout first, then stderr, so that both end up in the log
$SALEARNBIN --force-expire --username=$SAUSERNAME --dbpath=$SADBPATH \
    >> $LOGFILE 2>&1

#
# Execute sa-learn against configured folders
#
TIMESTAMP=`date`
echo $TIMESTAMP "sa-learn: Starting Learning ..." >> $LOGFILE
$SALEARNBIN --no-sync --username=$SAUSERNAME --dbpath=$SADBPATH \
    --folders=$SAFOLDERS >> $LOGFILE 2>&1

#
# Execute sa-learn sync
#
TIMESTAMP=`date`
echo $TIMESTAMP "sa-learn: Starting token journal sync ..." >> $LOGFILE
$SALEARNBIN --sync --username=$SAUSERNAME --dbpath=$SADBPATH \
    >> $LOGFILE 2>&1

#
# Execute chown
#
TIMESTAMP=`date`
echo $TIMESTAMP "sa-learn: Fixing file permissions ..." >> $LOGFILE
chown -c Debian-exim:Debian-exim $SADBPATH* >> $LOGFILE 2>&1


#
# Execute sa-learn stats dump
#
TIMESTAMP=`date`
echo $TIMESTAMP "sa-learn: Starting stats dump ..." >> $LOGFILE
$SALEARNBIN --dump magic --progress --username=$SAUSERNAME \
    --dbpath=$SADBPATH >> $LOGFILE


root@omega:/usr/local/bin# more /etc/spamassassin/sa-learn-folders.conf
spam:dir:/home/*/Maildir/.SPAM.Spam-Missed/{cur,new}
spam:dir:/home/*/Maildir/.SPAM.Spam-Mail/{cur,new}
ham:dir:/home/*/Maildir/.SPAM.Spam-Ham/{cur,new}
ham:dir:/home/*/Maildir/{cur,new}
ham:dir:/home/*/Maildir/.Sent/{cur,new}
root@omega:/usr/local/bin#

Log snip:

Mon Mar 30 09:00:01 EDT 2015 sa-learn: Starting token expiration ...
bayes: synced databases from journal in 0 seconds: 304 unique entries (605 total entries)
Mon Mar 30 09:00:06 EDT 2015 sa-learn: Starting Learning ...
Learned tokens from 24 message(s) (6971 message(s) examined)
Mon Mar 30 09:06:11 EDT 2015 sa-learn: Starting token journal sync ...
Mon Mar 30 09:06:14 EDT 2015 sa-learn: Fixing file permissions ...
Mon Mar 30 09:06:14 EDT 2015 sa-learn: Starting stats dump ...
0.000          0          3          0  non-token data: bayes db version
0.000          0      84238          0  non-token data: nspam
0.000          0     379365          0  non-token data: nham
0.000          0     142093          0  non-token data: ntokens
0.000          0 1427425402          0  non-token data: oldest atime
0.000          0 1427720336          0  non-token data: newest atime
0.000          0 1427720773          0  non-token data: last journal sync atime
0.000          0 1427720406          0  non-token data: last expiry atime
0.000          0     228435          0  non-token data: last expire atime delta
0.000          0          0          0  non-token data: last expire reduction count

There are obvious issues if users leave spam sitting in their inbox, but if
they move it to the spam folder it will get relearned correctly. In this
case I trust the users and their well-behaved mail clients, so I also feed
the sent mail in as ham.

Spam older than 14 days gets deleted from the spam folder.
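
A rough sketch of how such a cleanup could look. The post does not show the
actual cleanup job, so the function name and the example path in the comment
are assumptions for illustration.

```shell
#!/bin/sh
# Delete regular files older than a given number of days from the listed
# directories. Note that find's "-mtime +14" matches files whose age,
# counted in whole days, is greater than 14.
purge_old_spam() {
    days=$1
    shift
    for dir in "$@"; do
        [ -d "$dir" ] || continue   # skip folders that do not exist
        find "$dir" -type f -mtime +"$days" -exec rm -f {} +
    done
}

# Hypothetical call for one user's spam folder:
#   purge_old_spam 14 /home/someuser/Maildir/.SPAM.Spam-Mail/cur
```

Running this after the learning step means a message is learned at least
once before it is removed.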


-James

On 3/27/2015 2:09 PM, RW wrote:
> On Fri, 27 Mar 2015 15:16:13 +0000
> Michael wrote:
>
>> Hi,
>>
>> I would like automatically learn each users Bayes database in the
>> following way:
>>
>> Do the following once a day for each user:
>> 1.) sa-learn -u username --ham ../maildir/cur
>> 2.) sa-learn -u username --spam ../maildir/.Spam/cur
>>
>> The idea is to train the Bayes for each user without the need to
>> take care of learning Spam/Ham on their own.
>>
>> The reason for taking the "cur" folder instead of the "new" folder
>> is that I assume that the contents of these folders have already
>> been verified for false-positives/negatives by the user.
> "cur" doesn't imply that the mail has been read; for that you
> need to check the seen flag in the filename, an S somewhere after the
> colon.
>
>
>> A problem that could occur is when the user always deletes all mails
>> in .Spam/cur. Then the Bayes is only trained with Ham, but never
>> Spam. Or isn't that a problem?
> Not if you tell them - then it's their fault if it doesn't work.
> Alternately you could have a separate train-spam folder and empty it
> after training.
>
> You could also supplement spam training by autolearning only spam, e.g.
> I have:
>
> bayes_auto_learn 1
> bayes_auto_learn_on_error 1
> bayes_auto_learn_threshold_nonspam -2000.0
>
> Personally I've never seen a spam miss-trained as a ham with the
> default threshold, and sensible rule scores.
>
> I think where some people go wrong is that they don't specify
> aggressive custom scores correctly. With autolearning it's better to
> keep conservative scores in the non-Bayes scoresets e.g.
>
> score SOME_RULE  2 2 8 8
>
> not
>
> score SOME_RULE  8
>
> There's no difference in classification, but the latter is more like to
> cause miss-training on FPs.
>
>

Re: How to automatically train each users Bayes?

Posted by Alex Regan <my...@gmail.com>.

Hi,

>> Yes, that's true. But if I'm right, new mails stay in "new" until the
>> appropriate folder in the IMAP client has been opened, right? I just
>> assume, if the use has some false negatives in the folder, he will
>> either immediately delete it or just move it into the Spam folder.
>
> People can have mail clients running unattended in the background,
> often on multiple devices, so you can't assume it's been seen by a
> human.

Does anyone have any suggestions on how to enable Exchange users to 
submit samples for analysis they consider to be spam? With the latest 
Exchange, they've disabled IMAP on public folders.

We have one setup where we forward the mail to their internal Exchange 
system. We used to have spam and ham folders where users would place 
mail for us to review then train bayes, but we haven't been able to do 
it for a while because of this lack of IMAP issue.

Thanks,
Alex

Re: How to automatically train each users Bayes?

Posted by RW <rw...@googlemail.com>.

On Fri, 27 Mar 2015 20:03:18 +0100
Michael wrote:

> On 27.03.2015 19:09, RW wrote:
> > On Fri, 27 Mar 2015 15:16:13 +0000

> > "cur" doesn't imply that the mail has been read; for that you
> > need to check the seen flag in the filename, an S somewhere after
> > the colon.
> 
> Yes, that's true. But if I'm right, new mails stay in "new" until the
> appropriate folder in the IMAP client has been opened, right? I just
> assume, if the use has some false negatives in the folder, he will
> either immediately delete it or just move it into the Spam folder.

People can have mail clients running unattended in the background,
often on multiple devices, so you can't assume it's been seen by a
human.

> > You could also supplement spam training by autolearning only spam,
> > e.g. I have:
> > 
> > bayes_auto_learn 1
> > bayes_auto_learn_on_error 1
> > bayes_auto_learn_threshold_nonspam -2000.0
> 
> But that learns spam only if its score is above 12.0. And learns no
> nonspam.

That's why I suggested using it to "*supplement* spam training". When it
works, autotraining does have the advantage of happening in real-time.

> And then maybe the default config which auto learns spam and
> ham is already the best...

The default doesn't learn ham well; I'd only do that as a last resort.

> My setup is already configured retrain when the user moves mail from
> Inbox to Spam or from Spam to another folder.

This is a really poor way of training Bayes because it trains on SA
misclassifications rather than Bayes misclassifications. It's a poor
way of training spam and very much worse at training ham.  

On Fri, 27 Mar 2015 20:14:03 +0100
Matus UHLAR - fantomas wrote:

> >On 27.03.2015 19:54, Matus UHLAR - fantomas wrote:
> >> the easiest way is to train on false positives and false negatives.
> >> dovecot imapd has plugin to train when mail is moved from/to spam.
> 
> On 27.03.15 20:10, Michael wrote:
> >My concerns are the following:
> >Sometimes new kind of spam is appearing. This new kind often gets low
> >scores so that they are just 0.1 to 0.5 points above the limit. And
> >the auto learner gets no hit.
> >If the same spam then comes from another sending server, the score is
> >just a little bit below the border so that I'm getting a
> >false-negative. If the previous spam would have already been
> >learned, the second mail would have been scored as spam.
> 
> I don't get this. 

By the sound of it the OP is already using the dovecot plugin or
equivalent.

The first spam wasn't autolearned and was correctly identified as
spam. In this case the plugin doesn't provide a way of training it,
even if it has BAYES_00, because it's already in the spam folder.
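
One way to find such messages is to scan the spam folder for mail whose
X-Spam-Status header lists a low BAYES_nn rule. This is only a sketch: it
assumes SpamAssassin's X-Spam-Status header with the list of hit tests is
present, and the function name and the choice of BAYES_00 to BAYES_40 as
"low" are assumptions for illustration.

```shell
#!/bin/sh
# Print files in a directory whose X-Spam-Status header (including folded
# continuation lines) names BAYES_00, BAYES_05, BAYES_20 or BAYES_40.
list_low_bayes() {
    dir=$1
    for f in "$dir"/*; do
        [ -f "$f" ] || continue
        if awk '
            /^$/ { exit }                                  # end of headers
            /^[^ \t]/ { in_status = /^X-Spam-Status:/ }    # a new header starts
            in_status && /BAYES_(00|05|20|40)/ { found = 1; exit }
            END { exit !found }
        ' "$f"; then
            printf '%s\n' "$f"
        fi
    done
}
```

Run against a user's spam folder, the listed files are spam that Bayes
scored as ham-like, which makes them candidates for manual spam training.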

People keep recommending the plugin, but IMO it's a poor choice for
SpamAssassin.

Re: How to automatically train each users Bayes?

Posted by @lbutlr, kr...@kreme.com.

On 27 Mar 2015, at 13:03 , Michael <mi...@michi.su> wrote:
> Yes, that's true. But if I'm right, new mails stay in "new" until the
> appropriate folder in the IMAP client has been opened, right?

No. As soon as a client accesses the folder, the mail is moved to cur.

This has nothing to do with a user opening the folder.

-- 
Where am I going and why am I in this handbasket?

Re: How to automatically train each users Bayes?

Posted by Michael <mi...@michi.su>.

On 27.03.2015 19:09, RW wrote:
> On Fri, 27 Mar 2015 15:16:13 +0000
> Michael wrote:
> 
>> Hi,
>>
>> I would like automatically learn each users Bayes database in the  
>> following way:
>>
>> Do the following once a day for each user:
>> 1.) sa-learn -u username --ham ../maildir/cur
>> 2.) sa-learn -u username --spam ../maildir/.Spam/cur
>>
>> The idea is to train the Bayes for each user without the need to
>> take care of learning Spam/Ham on their own.
>>
>> The reason for taking the "cur" folder instead of the "new" folder
>> is that I assume that the contents of these folders have already
>> been verified for false-positives/negatives by the user.
> 
> "cur" doesn't imply that the mail has been read; for that you
> need to check the seen flag in the filename, an S somewhere after the
> colon.

Yes, that's true. But if I'm right, new mails stay in "new" until the
appropriate folder in the IMAP client has been opened, right? I just
assume that if the user has some false negatives in the folder, he will
either immediately delete them or just move them into the Spam folder.

> 
> 
>> A problem that could occur is when the user always deletes all mails  
>> in .Spam/cur. Then the Bayes is only trained with Ham, but never
>> Spam. Or isn't that a problem?
> 
> Not if you tell them - then it's their fault if it doesn't work.
> Alternately you could have a separate train-spam folder and empty it
> after training.

I think it's easier for the users if they just leave spam in the Spam
folder for at least one day. Most of them will not move spam into a
learn folder.

> 
> You could also supplement spam training by autolearning only spam, e.g.
> I have:
> 
> bayes_auto_learn 1
> bayes_auto_learn_on_error 1
> bayes_auto_learn_threshold_nonspam -2000.0

But that learns spam only if its score is above 12.0, and it learns no
nonspam. Then maybe the default config, which autolearns both spam and
ham, is already the best...
My setup is already configured to retrain when the user moves mail from
Inbox to Spam or from Spam to another folder.

> 
> Personally I've never seen a spam miss-trained as a ham with the
> default threshold, and sensible rule scores.
> 
> I think where some people go wrong is that they don't specify
> aggressive custom scores correctly. With autolearning it's better to
> keep conservative scores in the non-Bayes scoresets e.g.
> 
> score SOME_RULE  2 2 8 8
> 
> not
> 
> score SOME_RULE  8
> 
> There's no difference in classification, but the latter is more like to
> cause miss-training on FPs. 
>

Re: How to automatically train each users Bayes?

Posted by RW <rw...@googlemail.com>.

On Fri, 27 Mar 2015 15:16:13 +0000
Michael wrote:

> Hi,
> 
> I would like automatically learn each users Bayes database in the  
> following way:
> 
> Do the following once a day for each user:
> 1.) sa-learn -u username --ham ../maildir/cur
> 2.) sa-learn -u username --spam ../maildir/.Spam/cur
> 
> The idea is to train the Bayes for each user without the need to
> take care of learning Spam/Ham on their own.
> 
> The reason for taking the "cur" folder instead of the "new" folder
> is that I assume that the contents of these folders have already
> been verified for false-positives/negatives by the user.

"cur" doesn't imply that the mail has been read; for that you
need to check the seen flag in the filename, an S somewhere after the
colon.
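
A minimal sketch of that check. In Maildir, the flags follow ":2," at the
end of the filename, and "S" among them marks a message as seen. The
function name is an assumption for illustration.

```shell
#!/bin/sh
# Print the messages in a Maildir "cur" directory that carry the Seen flag,
# i.e. whose filename has an "S" somewhere after ":2,".
list_seen() {
    dir=$1
    for f in "$dir"/*; do
        [ -f "$f" ] || continue
        case ${f##*/} in
            *:2,*S*) printf '%s\n' "$f" ;;
        esac
    done
}
```

The resulting list, rather than the whole "cur" directory, could then be
handed to sa-learn. As noted further down the thread, the flag may have been
set by an unattended mail client, so it is a hint and not proof that a human
looked at the message.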

> A problem that could occur is when the user always deletes all mails  
> in .Spam/cur. Then the Bayes is only trained with Ham, but never
> Spam. Or isn't that a problem?

Not if you tell them - then it's their fault if it doesn't work.
Alternately you could have a separate train-spam folder and empty it
after training.
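
A sketch of the train-then-empty idea. The folder name, the SA_LEARN
variable and the function name are assumptions for illustration. Messages
are first moved to a staging directory so that mail arriving during the run
is not deleted unlearned.

```shell
#!/bin/sh
SA_LEARN="${SA_LEARN:-sa-learn}"   # command to run; override for testing

# Learn everything in a dedicated training folder as spam, then remove it.
# Usage: train_and_empty <username> <path to the train-spam folder>
train_and_empty() {
    user=$1
    folder=$2
    stage=$(mktemp -d) || return 1
    # Snapshot the current messages by moving them out of the Maildir.
    for sub in cur new; do
        [ -d "$folder/$sub" ] || continue
        find "$folder/$sub" -type f -exec mv {} "$stage/" \;
    done
    # Delete the snapshot only if learning succeeded.
    if "$SA_LEARN" -u "$user" --spam "$stage"; then
        rm -rf "$stage"
    else
        echo "learning failed; messages kept in $stage" >&2
        return 1
    fi
}
```

A hypothetical call would be train_and_empty username
../maildir/.train-spam, run from the same daily job as the other training.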

You could also supplement spam training by autolearning only spam, e.g.
I have:

bayes_auto_learn 1
bayes_auto_learn_on_error 1
bayes_auto_learn_threshold_nonspam -2000.0

Personally I've never seen a spam mistrained as ham with the
default threshold and sensible rule scores.

I think where some people go wrong is that they don't specify
aggressive custom scores correctly. With autolearning it's better to
keep conservative scores in the non-Bayes scoresets e.g.

score SOME_RULE  2 2 8 8

not

score SOME_RULE  8

There's no difference in classification when Bayes is in use, but the
latter is more likely to cause mistraining on FPs, because autolearning
makes its decision using the non-Bayes scoresets.

Re: How to automatically train each users Bayes?

Posted by Matus UHLAR - fantomas <uh...@fantomas.sk>.

>On 27.03.2015 19:54, Matus UHLAR - fantomas wrote:
>> the easiest way is to train on false positives and false negatives.
>> dovecot imapd has plugin to train when mail is moved from/to spam.

On 27.03.15 20:10, Michael wrote:
>My concerns are the following:
>Sometimes new kind of spam is appearing. This new kind often gets low
>scores so that they are just 0.1 to 0.5 points above the limit. And the
>auto learner gets no hit.
>If the same spam then comes from another sending server, the score is
>just a little bit below the border so that I'm getting a false-negative.
>If the previous spam would have already been learned, the second mail
>would have been scored as spam.

I don't get this. Or should I add that it's of course good to continue with
autolearning, but _also_ allow manual learning?

-- 
Matus UHLAR - fantomas, uhlar@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
Micro$oft random number generator: 0, 0, 0, 4.33e+67, 0, 0, 0...

Re: How to automatically train each users Bayes?

Posted by Michael <mi...@michi.su>.

On 27.03.2015 19:54, Matus UHLAR - fantomas wrote:
> On 27.03.15 15:16, Michael wrote:
>> I would like automatically learn each users Bayes database in the
>> following way:
>>
>> Do the following once a day for each user:
>> 1.) sa-learn -u username --ham ../maildir/cur
>> 2.) sa-learn -u username --spam ../maildir/.Spam/cur
> 
>> What do you think about this strategy?
> 
> the easiest way is to train on false positives and false negatives.
> dovecot imapd has plugin to train when mail is moved from/to spam.

My concerns are the following:
Sometimes a new kind of spam appears. This new kind often gets low
scores, just 0.1 to 0.5 points above the limit, so the autolearner does
not pick it up.
If the same spam then comes from another sending server, the score is
just a little below the threshold, so I get a false negative.
If the previous spam had already been learned, the second mail
would have been scored as spam.

> 
> you use something other, you should create pair of special folders for
> users
> to train both ham and spam.
> 
>

Re: How to automatically train each users Bayes?

Posted by Matus UHLAR - fantomas <uh...@fantomas.sk>.

On 27.03.15 15:16, Michael wrote:
>I would like automatically learn each users Bayes database in the 
>following way:
>
>Do the following once a day for each user:
>1.) sa-learn -u username --ham ../maildir/cur
>2.) sa-learn -u username --spam ../maildir/.Spam/cur

>What do you think about this strategy?

the easiest way is to train on false positives and false negatives.
dovecot imapd has plugin to train when mail is moved from/to spam.

if you use something else, you should create a pair of special folders for
users to train both ham and spam.


-- 
Matus UHLAR - fantomas, uhlar@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
"To Boot or not to Boot, that's the question." [WD1270 Caviar]