Posted to users@spamassassin.apache.org by Rolf Loudon <ro...@ses.tas.gov.au> on 2007/11/20 03:37:25 UTC

bayes_99 matching since sa-update

hi

I use sa-update with channels saupdates.openprotect.com and  
updates.spamassassin.org.

After the latest run today I am getting matches against BAYES_99  
(which adds 3.5) to many messages, where they previously triggered  
virtually no rules at all.

This is causing many false positives, to the extent that I've had to  
set the score to zero to avoid them.
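
For anyone needing to do the same, a single line in local.cf zeroes a
rule (the path in the comment is the usual Debian location, an
assumption about your layout):

    # in /etc/spamassassin/local.cf -- disable BAYES_99 locally
    score BAYES_99 0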

Anyone else seeing this? Better, have the rule or rules that are  
causing this been identified (and fixed)?

Else, if the bayes db has been damaged by something, how do I remove  
whatever is persuading it about the high probability this rule  
indicates?

thanks

rolf.



Re: retraining bayes question (Was: bayes_99 matching since sa-update)

Posted by Loren Wilton <lw...@earthlink.net>.
I don't personally use auto-learning, but I can't see anything wrong with 
the sort of things you have done in the configuration.  The number of spam 
and ham learned is at least partially controlled by the auto-learn 
thresholds.  I think these are bayes_auto_learn_ham and 
bayes_auto_learn_spam or something close to that.
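
For the record, the option names in SpamAssassin 3.x are
bayes_auto_learn_threshold_nonspam and bayes_auto_learn_threshold_spam.
A minimal local.cf sketch showing the stock defaults, which you could
tighten or loosen:

    bayes_auto_learn 1
    # auto-learn as ham only when a message scores below this (default 0.1)
    bayes_auto_learn_threshold_nonspam 0.1
    # auto-learn as spam only when a message scores above this (default 12.0)
    bayes_auto_learn_threshold_spam 12.0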

It may be that your normal ham and spam scores are such that you aren't
getting many spams learned, and tweaking the thresholds might do some good.
I would be cautious with this, though, since many people have reported bayes
going bad on them with what used to be the default spam learning score.

        Loren


----- Original Message ----- 
From: "Rolf Loudon" <ro...@ses.tas.gov.au>
To: <us...@spamassassin.apache.org>
Sent: Monday, January 14, 2008 4:53 PM
Subject: retraining bayes question (Was: bayes_99 matching since sa-update)


> hello
>
> I have been trying to retrain my Bayes DB to correct whatever strangeness
> had crept in and produced the dramatically different numbers of spam and
> ham shown in the output of sa-learn --dump magic.
>
> As recommended below I collected 420 messages each of spam and ham and
> checked for wrongly assessed ones.  I did a sa-learn --clear, then a
> sa-learn --spam /path/to/spam/mail and sa-learn --ham /path/to/ham/mail.
> The sa-learn process reported it learned 416 and 418 of the 420  supplied 
> for the two types.
>
> Ever since, the number of ham reported by sa-learn --dump magic has
> grown faster than the number of spam.  After 3 hours nspam had risen to
> 460 while nham was at 790.  Now, 21 hours later, nspam is 468 while nham
> is 1619.  At this rate the disparity that prompted my original post will
> be reached again.
>
> Methinks this is not normal; is something wrong?
>
> The only changes I have made from a standard package install on Debian
> Linux (currently SpamAssassin 3.1.7) are:
>
> (a) to use the sa-update mechanism to incorporate updated rules from  the 
> channels saupdates.openprotect.com and updates.spamassassin.org
>
> (b) occasionally adjusted downwards the scores of a few rules that
> repeatedly cause false positives.  Those FPs are cases where the total
> score is quite low apart from particular rule hits scoring 2+.  Scores I
> have adjusted are for rules such as the MANGLED_ series (_WHILE, _TOOL,
> _MEN, _OFF, _GOOD, _NAIL, etc.), the TVD_ series, and individual ones
> like DEAR_SOMETHING and DATE_IN_FUTURE_12_24.
>
> (c) added a couple of very primitive rules capturing literal
> strings of an offensive nature in Subject lines.
>
> (d) added various blacklist_from and whitelist_from entries as 
> appropriate.
>
> Are these kinds of adjustments ill-advised?
>
> I receive far more ham than spam.  Is this problematic for Bayesian 
> learning?
>
> Any ideas why the ham/spam ratio is growing at this rate, and what
> changes could be made, if indeed this is actually a developing problem?
>
> many thanks
>
> r.
>
>
>
>
> On 20/11/2007, at 4:39 PM, Matt Kettler wrote:
>
>> Rolf Loudon wrote:
>>>
>>>> What's a
>>>> "sa-learn --dump magic" output look like?
>>>
>>> # sa-learn --dump magic
>>> 0.000          0          3          0  non-token data: bayes db 
>>> version
>>> 0.000          0        297          0  non-token data: nspam
>>> 0.000          0     982365          0  non-token data: nham
>>> 0.000          0     160628          0  non-token data: ntokens
>>> 0.000          0 1195344836          0  non-token data: oldest atime
>>> 0.000          0 1195532636          0  non-token data: newest atime
>>> 0.000          0 1195532327          0  non-token data: last journal
>>> sync atime
>>> 0.000          0 1195517625          0  non-token data: last expiry 
>>> atime
>>> 0.000          0     172800          0  non-token data: last expire
>>> atime delta
>>> 0.000          0      72520          0  non-token data: last expire
>>> reduction count
>>>
>>> Thoughts?
>> That's a *really* unusual sa-learn dump, and would imply that bayes  was
>> completely inactive until recently.
>>
>> Note that there are 900k messages that have been trained as ham (ie:
>> nonspam email), but only 297 trained as spam. That's very little spam
>> compared to the quantity of ham. Usually you see more spam than ham,
>> but not by that large a margin (50:1 spam to ham isn't unheard of... but
>> this is 1:3307).
>>
>> Did you do some really goofy hand training with sa-learn, or did the
>> autolearner really do that? If it's all autolearning, do you have a  lot
>> of spam matching ALL_TRUSTED?
>>
>> Also bayes won't become active until there are at least 200 spams and
>> 200 hams, and given there's only 297 spams, it may not have crossed  that
>> line until recently and bayes may have been disabled.
>>
>> I'd be very concerned about the health of your bayes database. It's
>> possible the autolearner went awry and learned poorly here.
>>
>> I would seriously consider doing the following, if at all possible:
>>
>> 1) round up a few hundred spam and nonspam messages as text files  (with
>> complete headers)
>> 2) run sa-learn --clear to wipe out your bayes database
>> 3) use sa-learn --spam and sa-learn --ham to hand-train those messages
>> from step 1.
>>
>> Once given a little hand training, usually the autolearner is fine  (with
>> the occasional hand training to fix minor confusions, but it looks  like
>> you're way past minor confusion...).
>> 



retraining bayes question (Was: bayes_99 matching since sa-update)

Posted by Rolf Loudon <ro...@ses.tas.gov.au>.
hello

I have been trying to retrain my Bayes DB to correct whatever strangeness
had crept in and produced the dramatically different numbers of spam and
ham shown in the output of sa-learn --dump magic.

As recommended below I collected 420 messages each of spam and ham and
checked for wrongly assessed ones.  I did a sa-learn --clear, then a
sa-learn --spam /path/to/spam/mail and sa-learn --ham /path/to/ham/mail.
The sa-learn process reported it learned 416 and 418 of the 420
supplied for the two types.

Ever since, the number of ham reported by sa-learn --dump magic has
grown faster than the number of spam.  After 3 hours nspam had risen to
460 while nham was at 790.  Now, 21 hours later, nspam is 468 while nham
is 1619.  At this rate the disparity that prompted my original post will
be reached again.
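
A quick way to watch just those two counters, assuming sa-learn runs as
the same user that owns the bayes db:

    sa-learn --dump magic | grep -E 'nspam|nham'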

Methinks this is not normal; is something wrong?

The only changes I have made from a standard package install on Debian
Linux (currently SpamAssassin 3.1.7) are:

(a) to use the sa-update mechanism to incorporate updated rules from  
the channels saupdates.openprotect.com and updates.spamassassin.org

(b) occasionally adjusted downwards the scores of a few rules that
repeatedly cause false positives.  Those FPs are cases where the total
score is quite low apart from particular rule hits scoring 2+.  Scores I
have adjusted are for rules such as the MANGLED_ series (_WHILE, _TOOL,
_MEN, _OFF, _GOOD, _NAIL, etc.), the TVD_ series, and individual ones
like DEAR_SOMETHING and DATE_IN_FUTURE_12_24.

(c) added a couple of very primitive rules capturing literal
strings of an offensive nature in Subject lines.

(d) added various blacklist_from and whitelist_from entries as  
appropriate.
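
As a rough sketch, (b)-(d) amount to local.cf entries along these lines
(the score value, rule name, pattern, and addresses are made-up
illustrations, not my real entries):

    # (b) pull an FP-prone stock rule's score down
    score DATE_IN_FUTURE_12_24 0.5

    # (c) a primitive literal Subject rule (hypothetical pattern)
    header   LOCAL_SUBJ_OFFENSIVE Subject =~ /some offensive phrase/i
    describe LOCAL_SUBJ_OFFENSIVE Subject contains a blocked phrase
    score    LOCAL_SUBJ_OFFENSIVE 2.0

    # (d) per-sender overrides
    blacklist_from spammer@example.com
    whitelist_from partner@example.org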

Are these kinds of adjustments ill-advised?

I receive far more ham than spam.  Is this problematic for Bayesian  
learning?

Any ideas why the ham/spam ratio is growing at this rate, and what
changes could be made, if indeed this is actually a developing problem?

many thanks

r.




On 20/11/2007, at 4:39 PM, Matt Kettler wrote:

> Rolf Loudon wrote:
>>
>>> What's a
>>> "sa-learn --dump magic" output look like?
>>
>> # sa-learn --dump magic
>> 0.000          0          3          0  non-token data: bayes db  
>> version
>> 0.000          0        297          0  non-token data: nspam
>> 0.000          0     982365          0  non-token data: nham
>> 0.000          0     160628          0  non-token data: ntokens
>> 0.000          0 1195344836          0  non-token data: oldest atime
>> 0.000          0 1195532636          0  non-token data: newest atime
>> 0.000          0 1195532327          0  non-token data: last journal
>> sync atime
>> 0.000          0 1195517625          0  non-token data: last expiry  
>> atime
>> 0.000          0     172800          0  non-token data: last expire
>> atime delta
>> 0.000          0      72520          0  non-token data: last expire
>> reduction count
>>
>> Thoughts?
> That's a *really* unusual sa-learn dump, and would imply that bayes was
> completely inactive until recently.
>
> Note that there are 900k messages that have been trained as ham (ie:
> nonspam email), but only 297 trained as spam. That's very little spam
> compared to the quantity of ham. Usually you see more spam than ham,
> but not by that large a margin (50:1 spam to ham isn't unheard of... but
> this is 1:3307).
>
> Did you do some really goofy hand training with sa-learn, or did the
> autolearner really do that? If it's all autolearning, do you have a lot
> of spam matching ALL_TRUSTED?
>
> Also bayes won't become active until there are at least 200 spams and
> 200 hams, and given there's only 297 spams, it may not have crossed that
> line until recently and bayes may have been disabled.
>
> I'd be very concerned about the health of your bayes database. It's
> possible the autolearner went awry and learned poorly here.
>
> I would seriously consider doing the following, if at all possible:
>
> 1) round up a few hundred spam and nonspam messages as text files (with
> complete headers)
> 2) run sa-learn --clear to wipe out your bayes database
> 3) use sa-learn --spam and sa-learn --ham to hand-train those messages
> from step 1.
>
> Once given a little hand training, usually the autolearner is fine (with
> the occasional hand training to fix minor confusions, but it looks like
> you're way past minor confusion...).
>


Re: bayes_99 matching since sa-update

Posted by Rolf Loudon <ro...@ses.tas.gov.au>.
>>>  What's a
>>> "sa-learn --dump magic" output look like?
>>
>> # sa-learn --dump magic
>> 0.000          0          3          0  non-token data: bayes db  
>> version
>> 0.000          0        297          0  non-token data: nspam
>> 0.000          0     982365          0  non-token data: nham
>> 0.000          0     160628          0  non-token data: ntokens
>> 0.000          0 1195344836          0  non-token data: oldest atime
>> 0.000          0 1195532636          0  non-token data: newest atime
>> 0.000          0 1195532327          0  non-token data: last journal
>> sync atime
>> 0.000          0 1195517625          0  non-token data: last  
>> expiry atime
>> 0.000          0     172800          0  non-token data: last expire
>> atime delta
>> 0.000          0      72520          0  non-token data: last expire
>> reduction count
>>
>> Thoughts?
> That's a *really* unusual sa-learn dump, and would imply that bayes was
> completely inactive until recently.
> Note that there are 900k messages that have been trained as ham (ie:
> nonspam email), but only 297 trained as spam. That's very little spam
> compared to the quantity of ham. Usually you see more spam than ham,
> but not by that large a margin (50:1 spam to ham isn't unheard of... but
> this is 1:3307).
>
> Did you do some really goofy hand training with sa-learn, or did the
> autolearner really do that? If it's all autolearning, do you have a lot
> of spam matching ALL_TRUSTED?

I have not done hand training. The autolearner did it.

Spam is not kept for very long, and in the collection I now have there
are no occurrences of ALL_TRUSTED.

> I'd be very concerned about the health of your bayes database. It's
> possible the autolearner went awry and learned poorly here.
>
>  I would seriously consider doing the following, if at all possible:
>
> 1) round up a few hundred spam and nonspam messages as text files (with
> complete headers)
> 2) run sa-learn --clear to wipe out your bayes database
> 3) use sa-learn --spam and sa-learn --ham to hand-train those messages
> from step 1.

I would like to do this.  I have yet to find a way to extract ham
from our corporate mail system.  It runs IBM's Lotus Notes, where a mail
message is split awkwardly across database fields, and I know of no
way to extract a message in raw form.  Is there any such tool?

SA runs on a gateway box which does not store any mail passing  
through it except that which it quarantines (for possible retrieval  
in the case of false positives) as spam.  So I have a source of spam  
there.

Apart from hand learning, what would be the overall effect of
clearing the bayes db as it currently stands and having autolearn
start again?

Thanks for your help and suggestions thus far.

> Once given a little hand training, usually the autolearner is fine (with
> the occasional hand training to fix minor confusions, but it looks like
> you're way past minor confusion...).
>




Re: bayes_99 matching since sa-update

Posted by Matt Kettler <mk...@verizon.net>.
Rolf Loudon wrote:
>
>>  What's a
>> "sa-learn --dump magic" output look like?
>
> # sa-learn --dump magic
> 0.000          0          3          0  non-token data: bayes db version
> 0.000          0        297          0  non-token data: nspam
> 0.000          0     982365          0  non-token data: nham
> 0.000          0     160628          0  non-token data: ntokens
> 0.000          0 1195344836          0  non-token data: oldest atime
> 0.000          0 1195532636          0  non-token data: newest atime
> 0.000          0 1195532327          0  non-token data: last journal
> sync atime
> 0.000          0 1195517625          0  non-token data: last expiry atime
> 0.000          0     172800          0  non-token data: last expire
> atime delta
> 0.000          0      72520          0  non-token data: last expire
> reduction count
>
> Thoughts?
That's a *really* unusual sa-learn dump, and would imply that bayes was
completely inactive until recently.

Note that there are 900k messages that have been trained as ham (ie:
nonspam email), but only 297 trained as spam. That's very little spam
compared to the quantity of ham. Usually you see more spam than ham,
but not by that large a margin (50:1 spam to ham isn't unheard of... but
this is 1:3307).

Did you do some really goofy hand training with sa-learn, or did the
autolearner really do that? If it's all autolearning, do you have a lot
of spam matching ALL_TRUSTED?

Also bayes won't become active until there are at least 200 spams and
200 hams, and given there's only 297 spams, it may not have crossed that
line until recently and bayes may have been disabled.
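
Those activation floors are, I believe, configurable via these local.cf
options; the values shown are the stock defaults:

    bayes_min_ham_num  200
    bayes_min_spam_num 200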

I'd be very concerned about the health of your bayes database. It's
possible the autolearner went awry and learned poorly here.

 I would seriously consider doing the following, if at all possible:

1) round up a few hundred spam and nonspam messages as text files (with
complete headers)
2) run sa-learn --clear to wipe out your bayes database
3) use sa-learn --spam and sa-learn --ham to hand-train those messages
from step 1.
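
Concretely, steps 2-3 look something like this; the corpus paths are
placeholders, and the commands must run as the user whose bayes db
spamd actually uses:

    sa-learn --clear
    sa-learn --spam /path/to/corpus/spam
    sa-learn --ham  /path/to/corpus/ham
    sa-learn --dump magic    # verify nspam/nham now reflect the corpus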

Once given a little hand training, usually the autolearner is fine (with
the occasional hand training to fix minor confusions, but it looks like
you're way past minor confusion...).


Re: bayes_99 matching since sa-update

Posted by Rolf Loudon <ro...@ses.tas.gov.au>.
>> hi
>>
>> I use sa-update with channels saupdates.openprotect.com and
>> updates.spamassassin.org.
>>
>> After the latest run today I am getting matches against BAYES_99
>> (which adds 3.5) to many messages, where they previously triggered
>> virtually no rules at all.
>>
>> This is causing many false positives, to the extent that I've had to
>> set the score to zero to avoid them.
>>
>> Anyone else seeing this? Better, have the rule or rules that are
>> causing this been identified (and fixed)?
>>
>> Else, if the bayes db has been damaged by something, how do I remove
>> whatever is persuading it about the high probability this rule  
>> indicates?
>
> Well, the sa-update itself wouldn't change the behavior of BAYES_99
> unless there was a grossly stupid or malicious error made by the
> maintainers.  All sa-update could do is change the rule, which  
> amounts to:
>
> body BAYES_99               eval:check_bayes('0.99', '1.00')

Thanks.  Yes that is how I reasoned it too.

> *however* an updated ruleset might change the behavior of your
> auto-learning, by increasing spam scores with rule hits. You might want
> to go digging through your logs and see if there's a lot more spam
> autolearning going on post-upgrade.  That said, I'd expect that to make
> a change over a period of a few weeks, not instantly.

Agreed, and a quick look through the logs showed that BAYES_99 was
listed in all reports over the last day, but virtually non-existent
for a week or so before that, which pointed to something amiss arising
from the upgrade.

> Perhaps your bayes DB is just not well trained and this is a
> problem that's been building but went unnoticed so far? What's a
> "sa-learn --dump magic" output look like?

# sa-learn --dump magic
0.000          0          3          0  non-token data: bayes db version
0.000          0        297          0  non-token data: nspam
0.000          0     982365          0  non-token data: nham
0.000          0     160628          0  non-token data: ntokens
0.000          0 1195344836          0  non-token data: oldest atime
0.000          0 1195532636          0  non-token data: newest atime
0.000          0 1195532327          0  non-token data: last journal  
sync atime
0.000          0 1195517625          0  non-token data: last expiry  
atime
0.000          0     172800          0  non-token data: last expire  
atime delta
0.000          0      72520          0  non-token data: last expire  
reduction count

Thoughts?

many thanks

rolf.



Re: bayes_99 matching since sa-update

Posted by Matt Kettler <mk...@verizon.net>.
Rolf Loudon wrote:
> hi
>
> I use sa-update with channels saupdates.openprotect.com and
> updates.spamassassin.org.
>
> After the latest run today I am getting matches against BAYES_99
> (which adds 3.5) to many messages, where they previously triggered
> virtually no rules at all.
>
> This is causing many false positives, to the extent that I've had to
> set the score to zero to avoid them.
>
> Anyone else seeing this? Better, have the rule or rules that are
> causing this been identified (and fixed)?
>
> Else, if the bayes db has been damaged by something, how do I remove
> whatever is persuading it about the high probability this rule indicates? 

Well, the sa-update itself wouldn't change the behavior of BAYES_99
unless there was a grossly stupid or malicious error made by the
maintainers.  All sa-update could do is change the rule, which amounts to:

body BAYES_99               eval:check_bayes('0.99', '1.00')

And it's pretty much been that for a few years now, and the latest
sa-update is no different. An error here would be really obvious.
BAYES_99's real behavior is going to be based on the contents of your
bayes database and possibly changes to the bayes code, neither of which
is touched by sa-update.

*however* an updated ruleset might change the behavior of your
auto-learning, by increasing spam scores with rule hits. You might want
to go digging through your logs and see if there's a lot more spam
autolearning going on post-upgrade.  That said, I'd expect that to make
a change over a period of a few weeks, not instantly.
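
One rough way to count autolearn decisions, assuming spamd result lines
end up in /var/log/mail.log (the log path varies by distribution):

    grep -c 'autolearn=spam' /var/log/mail.log
    grep -c 'autolearn=ham'  /var/log/mail.log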

Perhaps your bayes DB is just not well trained and this is a
problem that's been building but went unnoticed so far? What's a
"sa-learn --dump magic" output look like?