Posted to users@spamassassin.apache.org by Rolf Loudon <ro...@ses.tas.gov.au> on 2008/01/15 01:53:53 UTC
retraining bayes question (Was: bayes_99 matching since sa-update)
hello
I have been trying to retrain my Bayes DB to correct whatever
strangeness had crept in to produce the dramatically skewed spam and
ham counts in the output of sa-learn --dump magic.
As recommended below I collected 420 messages each of spam and ham and
checked for wrongly assessed ones. I ran sa-learn --clear, then
sa-learn --spam /path/to/spam/mail and sa-learn --ham /path/to/ham/mail.
The sa-learn process reported it learned 416 and 418 of the 420
supplied for the two types.
Ever since, the number of ham reported by sa-learn --dump magic has
grown faster than the number of spam. After 3 hours nspam had risen to
460 while nham was at 790. Now, 21 hours later, nspam is 468 while
nham is 1619. At this rate the disparity described in my original post
will be reached again.
Methinks this is not normal; is something wrong?
The only changes I have made from a standard package install on Debian
Linux (currently version 3.1.7) are:
(a) to use the sa-update mechanism to incorporate updated rules from
the channels saupdates.openprotect.com and updates.spamassassin.org
(b) occasionally adjusted downwards the scores of a few rules that
repeatedly cause false positives. Those FPs occur where the total
score is quite low apart from particular rule hits that score 2+.
Scores I have adjusted are for rules such as the MANGLED_ series
(_WHILE, _TOOL, _MEN, _OFF, _GOOD, _NAIL etc.), the TVD_ series, and
individual ones like DEAR_SOMETHING and DATE_IN_FUTURE_12_24.
(c) added a couple of very simple rules capturing literal
strings of an offensive nature in Subject lines.
(d) added various blacklist_from and whitelist_from entries as
appropriate.
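
For illustration, the adjustments in (b), (c) and (d) above would look
roughly like the following in local.cf. The score values, the literal
string and the addresses here are hypothetical; only the rule names are
the ones mentioned above:

    # (b) lower the score of rules that repeatedly cause false positives
    score MANGLED_TOOL 0.5
    score DATE_IN_FUTURE_12_24 1.0

    # (c) a primitive literal-string rule on the Subject header
    header   LOCAL_SUBJ_BADWORD  Subject =~ /some offensive phrase/i
    describe LOCAL_SUBJ_BADWORD  Offensive literal string in Subject
    score    LOCAL_SUBJ_BADWORD  2.0

    # (d) per-sender white/blacklisting
    whitelist_from friend@example.com
    blacklist_from spammer@example.net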
Are these kinds of adjustments ill-advised?
I receive far more ham than spam. Is this problematic for Bayesian
learning?
Any ideas why the ham count is growing so much faster than the spam
count, and what changes could be made, if indeed this is actually a
developing problem?
many thanks
r.
On 20/11/2007, at 4:39 PM, Matt Kettler wrote:
> Rolf Loudon wrote:
>>
>>> What does a
>>> "sa-learn --dump magic" output look like?
>>
>> # sa-learn --dump magic
>> 0.000          0          3          0  non-token data: bayes db version
>> 0.000          0        297          0  non-token data: nspam
>> 0.000          0     982365          0  non-token data: nham
>> 0.000          0     160628          0  non-token data: ntokens
>> 0.000          0 1195344836          0  non-token data: oldest atime
>> 0.000          0 1195532636          0  non-token data: newest atime
>> 0.000          0 1195532327          0  non-token data: last journal sync atime
>> 0.000          0 1195517625          0  non-token data: last expiry atime
>> 0.000          0     172800          0  non-token data: last expire atime delta
>> 0.000          0      72520          0  non-token data: last expire reduction count
>>
>> Thoughts?
> That's a *really* unusual sa-learn dump, and would imply that bayes
> was completely inactive until recently.
>
> Note that there are 900k messages that have been trained as ham (i.e.
> nonspam email), but only 297 trained as spam. That's very little spam
> compared to the quantity of ham. Usually you see more spam than ham,
> but not by that large a margin (50:1 spam to ham isn't unheard of,
> but this is 1:3307).
>
> Did you do some really goofy hand training with sa-learn, or did the
> autolearner really do that? If it's all autolearning, do you have a
> lot of spam matching ALL_TRUSTED?
>
> Also bayes won't become active until there are at least 200 spams and
> 200 hams, and given there's only 297 spams, it may not have crossed
> that line until recently and bayes may have been disabled.
>
> I'd be very concerned about the health of your bayes database. It's
> possible the autolearner went awry and learned poorly here.
>
> I would seriously consider doing the following, if at all possible:
>
> 1) round up a few hundred spam and nonspam messages as text files
> (with complete headers)
> 2) run sa-learn --clear to wipe out your bayes database
> 3) use sa-learn --spam and sa-learn --ham to hand-train those messages
> from step 1.
>
> Once given a little hand training, usually the autolearner is fine
> (with the occasional hand training to fix minor confusions, but it
> looks like you're way past minor confusion...).
>
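
On the ALL_TRUSTED point above: one common way for autolearning to
mislearn spam as ham is spam hitting ALL_TRUSTED (a large negative
score), which usually means trusted_networks is wrong or unset. A
minimal local.cf sketch, with placeholder addresses only:

    # Trust only hosts under your control that never originate or
    # relay spam into your mail path (these addresses are examples)
    trusted_networks 192.168.0.0/24
    trusted_networks 203.0.113.25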
Re: retraining bayes question (Was: bayes_99 matching since sa-update)
Posted by Loren Wilton <lw...@earthlink.net>.
I don't personally use auto-learning, but I can't see anything wrong
with the sort of things you have done in the configuration. The number
of spam and ham learned is at least partially controlled by the
auto-learn thresholds. I think these are bayes_auto_learn_ham and
bayes_auto_learn_spam or something close to that.
It may be that your normal ham and spam scores are such that you aren't
getting many spams learned, and tweaking the thresholds might do some
good. I would be cautious with this though, since many people have
reported bayes going bad on them with what used to be the default spam
learning score.
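
(For reference, in SpamAssassin 3.1 the relevant options are
bayes_auto_learn_threshold_nonspam and bayes_auto_learn_threshold_spam.
A local.cf sketch with the stock defaults, to be adjusted cautiously:

    bayes_auto_learn 1
    # score at or below which a message is auto-learned as ham (default 0.1)
    bayes_auto_learn_threshold_nonspam 0.1
    # score at or above which a message is auto-learned as spam (default 12.0)
    bayes_auto_learn_threshold_spam 12.0

Lowering the spam threshold makes more spam get learned, but raises the
risk of mislearning borderline ham, which is the "bayes going bad"
failure mode mentioned above.)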
Loren