Posted to users@spamassassin.apache.org by Arik Raffael Funke <ar...@gmx.de> on 2007/04/25 11:49:03 UTC
Any drawbacks of cron-scheduled Bayesian learning?
Hi,
I was wondering if it has any negative effects on my Bayes database if I
regularly learn all spam/ham messages via a cron job. Sa-learn skips
already learned messages. Am I thus right to assume that apart from the
relatively high CPU load there are no drawbacks? Or should I keep a
separate folder for "new" spam/ham?
I.e. what about expiring tags, etc. Sa-learn would routinely
re-encounter 5 year-old spam...
Cheers,
Arik
Re: Any drawbacks of cron-scheduled Bayesian learning?
Posted by Faisal N Jawdat <fa...@faisal.com>.
On Apr 25, 2007, at 4:30 PM, Arik Raffael Funke wrote:
> I am now probably venturing off-topic on my own thread but the
> point you make is interesting: You train only misfiled messages.
> What about new but correctly filed messages? You _never_ train on
> them?
> Given that bayes is a statistical method, is it really sufficient
> to only train on the mis-files?
the nightly cron job trained against the spam folder and a subset of
the read folders likely to have spam in them (archive, recent working
folders, etc.). i'd periodically retrain across the entire mail
tree. the retraining only for specific misfiled messages handles
both spam and ham.
retraining only on misfiles is not as accurate as training on all
mail, but is a lot lighter weight, so i can run it every 5 minutes
instead of every night.
> The proportional spam/ham weight of keywords would in this case not
> be adjusted in the database if/when they change in your mail
> traffic, would it? Are you not encountering a higher number of mis-files
> compared to your previous learning practice?
the number of misfiles i get is so low that it's hard to tell if
there's a difference. i periodically get floods of new false-
negatives, but those typically correct after the first few are
retrained. when retraining across the entire mail spool the problems
usually corrected after the first night.
-faisal
Re: Any drawbacks of cron-scheduled Bayesian learning?
Posted by Arik Raffael Funke <ar...@gmx.de>.
Faisal N Jawdat wrote:
> On Apr 25, 2007, at 5:49 AM, Arik Raffael Funke wrote:
>> I was wondering if it has any negative effects on my Bayes database if
>> I regularly learn all spam/ham messages via a cron job.
>
> I did this for a while and didn't find any problems.
Good news. At least in practice it does not seem to produce problems...
> That said, I keep a rolling 1 month corpus of spam, so it's easy to
> retrain when I need to. I stopped doing full retrains on cron, and at
> this point I only retrain on messages that were misfiled. See:
I am now probably venturing off-topic on my own thread but the point you
make is interesting: You train only misfiled messages. What about new
but correctly filed messages? You _never_ train on them? Given that
bayes is a statistical method, is it really sufficient to only train on
the mis-files? The proportional spam/ham weight of keywords would in
this case not be adjusted in the database if/when they change in your
mail traffic, would it? Are you not encountering a higher number of mis-files
compared to your previous learning practice?
Regards,
Arik
Re: Any drawbacks of cron-scheduled Bayesian learning?
Posted by Faisal N Jawdat <fa...@faisal.com>.
On Apr 25, 2007, at 5:49 AM, Arik Raffael Funke wrote:
> I was wondering if it has any negative effects on my Bayes database
> if I regularly learn all spam/ham messages via a cron job.
>
> Sa-learn skips already learned messages. Am I thus right to assume
> that apart from the relatively high CPU load there are no
> drawbacks? Or should I keep a separate folder for "new" spam/ham?
I did this for a while and didn't find any problems.
I'm using Maildir, and I only trained on the cur folders, not the new
folders. In theory this would prevent me from training on something
that had come in mis-filed (so long as I remembered to quit my mail
client at night).
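A minimal Python sketch of the "cur only" idea, not the sa-harvest script itself. The helper names `maildir_cur_messages` and `learn_command` are hypothetical. It assumes the standard Maildir layout (`cur`, `new`, `tmp`) and that `sa-learn` accepts `--spam`/`--ham` followed by message paths. Nothing is executed; the functions only collect paths and build the argument list.

```python
import os


def maildir_cur_messages(maildir):
    """Return sorted paths of message files in <maildir>/cur.

    Messages in new/ (not yet seen by a mail client) and tmp/ are ignored.
    """
    cur = os.path.join(maildir, "cur")
    if not os.path.isdir(cur):
        return []
    return sorted(
        os.path.join(cur, name)
        for name in os.listdir(cur)
        if os.path.isfile(os.path.join(cur, name))
    )


def learn_command(kind, paths):
    """Build an sa-learn argument list; kind must be 'spam' or 'ham'."""
    if kind not in ("spam", "ham"):
        raise ValueError("kind must be 'spam' or 'ham'")
    return ["sa-learn", "--" + kind] + list(paths)
```

The resulting list could be handed to `subprocess.run` from a cron-driven script, for example `learn_command("spam", maildir_cur_messages(path_to_spam_maildir))`, where `path_to_spam_maildir` is whatever folder holds confirmed spam.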
See here for details and a script to do this:
http://www.faisal.com/software/sa-harvest/
Note that this script will also attempt to rebuild your whitelist
(all the code after the 'sa-learn --dump magic'). This has some
downsides, and turns out to be less useful with modern SpamAssassin,
so I'm reworking the script to break out the whitelist code into a
separate script.
That said, I keep a rolling 1 month corpus of spam, so it's easy to
retrain when I need to. I stopped doing full retrains on cron, and
at this point I only retrain on messages that were misfiled. See:
http://www.faisal.com/software/sa-harvest/quicktrain.xhtml
If you're doing any of this on a shared system, my one bit of advice
is to set up the cron to use 'batch' and 'nice'.
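As a rough illustration of the 'batch' and 'nice' advice, here is a Python sketch with hypothetical helper names (`niced`, `batch_job_text`). It assumes `nice -n <level>` and a `batch` command that reads its job from standard input, as on typical Unix systems. The functions only build the command; they do not run anything.

```python
import shlex


def niced(argv, niceness=19):
    """Prefix a command with nice so it runs at low CPU priority."""
    return ["nice", "-n", str(niceness)] + list(argv)


def batch_job_text(argv):
    """Render a command as one shell line, suitable for batch's stdin."""
    return shlex.join(argv) + "\n"


# Possible use from a cron-driven script (left commented out on purpose):
#
#   import subprocess
#   job = batch_job_text(niced(["sa-learn", "--spam", "/path/to/spam/cur"]))
#   subprocess.run(["batch"], input=job, text=True, check=True)
```

With this arrangement cron only queues the job; `batch` starts it when the system load allows, and `nice` keeps it from competing with interactive users.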
-faisal
Re: Any drawbacks of cron-scheduled Bayesian learning?
Posted by Arik Raffael Funke <ar...@gmx.de>.
Randal, Phil wrote:
> Arik Raffael Funke wrote:
>> Matthias Haegele wrote:
>>> Arik Raffael Funke wrote:
>>>> I.e. what about expiring tags, etc. Sa-learn would routinely
>>>> re-encounter 5 year-old spam...
>>> Q: Would it be useful (regarding cpu and i/o performance) if only
>>> learned messages (copied from a maildir) that are new (e.g. not older
>>> than a week) or would checking this (date of file), be almost as bad
>>> as copying it for sa-learn?
> I would have thought that relearning age-old ham & spam would have the
> effect of polluting the Bayes database, not enhancing it, because both
> ham and spam characteristics change over time.
Thanks everybody. Opinion on whether this training procedure is
counter-productive seems divided... There seems to be quite a lot of
anecdotal evidence that it does not have negative effects on one side and
"theoretical objections" on the other.
The effect Phil mentioned was actually what prompted me to ask my
question. More clearly phrased, the question is whether relearning
previously seen, old messages _really_ pollutes the Bayes database. In my
opinion this depends on the actual implementation of the learning function
in SpamAssassin, or rather on how sa-learn implements the "skipping" of
previously seen messages.
Is anybody familiar with the inner workings of spamassassin and thus
able to provide an answer to the question on that basis?
Best regards,
Arik
RE: Re: Any drawbacks of cron-scheduled Bayesian learning?
Posted by "Randal, Phil" <pr...@herefordshire.gov.uk>.
Arik Raffael Funke wrote:
> Matthias Haegele wrote:
>> Arik Raffael Funke wrote:
>>> I.e. what about expiring tags, etc. Sa-learn would routinely
>>> re-encounter 5 year-old spam...
>>
>> Q: Would it be useful (regarding cpu and i/o performance) if only
>> learned messages (copied from a maildir) that are new (e.g. not older
>> than a week) or would checking this (date of file), be almost as bad
>> as copying it for sa-learn?
>
> I am not sure whether this question was directed to me... But I
> personally would like to avoid discriminating between mail... apart
> from the ham/spam distinction obviously. ;-)
>
> I do not care about cpu/io load, only about the quality of my Bayes
> database. Therefore: does anybody know whether it is a problem for the
> Bayes database if I routinely re-learn ALL my spam/ham, especially
> regarding the accurate expiry of tokens, etc.?
>
> Cheers,
> Arik
I would have thought that relearning age-old ham & spam would have the
effect of polluting the Bayes database, not enhancing it, because both
ham and spam characteristics change over time.
I personally wouldn't throw more than the last couple of weeks' worth of
ham/spam at sa-learn.
sa-learn --dump magic
and a look at the age of your oldest token could be of use (you can use
something like http://www.onlineconversion.com/unix_time.htm to convert
the unix timestamp to readable format).
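The timestamp conversion can also be done locally. Below is a minimal Python sketch with hypothetical helper names (`parse_magic`, `to_utc`). It assumes the usual column layout of `sa-learn --dump magic`, where the third column holds the value and the label follows "non-token data:". The numbers in `SAMPLE` are made up for illustration and are not real output.

```python
from datetime import datetime, timezone

# Illustrative text in the assumed layout of `sa-learn --dump magic`.
SAMPLE = """\
0.000          0          3          0  non-token data: bayes db version
0.000          0       1523          0  non-token data: nspam
0.000          0       2210          0  non-token data: nham
0.000          0     118342          0  non-token data: ntokens
0.000          0 1175000000          0  non-token data: oldest atime
0.000          0 1177494543          0  non-token data: newest atime
"""


def parse_magic(text):
    """Map each 'non-token data' label to the integer in the third column."""
    values = {}
    for line in text.splitlines():
        head, sep, label = line.partition("non-token data:")
        if not sep:
            continue  # not a magic line
        fields = head.split()
        if len(fields) >= 3:
            values[label.strip()] = int(fields[2])
    return values


def to_utc(timestamp):
    """Convert a Unix timestamp to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(timestamp, tz=timezone.utc)


if __name__ == "__main__":
    magic = parse_magic(SAMPLE)
    print("oldest token atime:", to_utc(magic["oldest atime"]).isoformat())
```

In real use, the text would come from running the command itself, for example via `subprocess.run(["sa-learn", "--dump", "magic"], capture_output=True, text=True).stdout`.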
Cheers,
Phil
--
Phil Randal
Network Engineer
Herefordshire Council
Hereford, UK
Re: Any drawbacks of cron-scheduled Bayesian learning?
Posted by Arik Raffael Funke <ar...@gmx.de>.
Matthias Haegele wrote:
> Arik Raffael Funke wrote:
>> I.e. what about expiring tags, etc. Sa-learn would routinely
>> re-encounter 5 year-old spam...
>
> Q: Would it be useful (regarding cpu and i/o performance) if only
> learned messages (copied from a maildir) that are new (e.g. not older
> than a week) or would checking this (date of file), be almost as bad as
> copying it for sa-learn?
I am not sure whether this question was directed to me... But I
personally would like to avoid discriminating between mail... apart from
the ham/spam distinction obviously. ;-)
I do not care about cpu/io load, only about the quality of my Bayes
database. Therefore: does anybody know whether it is a problem for the
Bayes database if I routinely re-learn ALL my spam/ham, especially
regarding the accurate expiry of tokens, etc.?
Cheers,
Arik
Re: Any drawbacks of cron-scheduled Bayesian learning?
Posted by Matthias Haegele <mh...@linuxrocks.dyndns.org>.
Arik Raffael Funke wrote:
> Hi,
Hello!
> I was wondering if it has any negative effects on my Bayes database if I
> regularly learn all spam/ham messages via a cron job. Sa-learn skips
> already learned messages. Am I thus right to assume that apart from the
> relatively high CPU load there are no drawbacks? Or should I keep a
> separate folder for "new" spam/ham?
>
> I.e. what about expiring tags, etc. Sa-learn would routinely
> re-encounter 5 year-old spam...
Q: Would it be useful (regarding CPU and I/O performance) to learn only
messages (copied from a maildir) that are new (e.g. not older than a
week), or would checking this (the date of each file) be almost as costly
as simply copying everything for sa-learn?
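For reference, the date check itself can be sketched in a few lines of Python. The helper name `recent_files` is hypothetical, and file modification time is used as a stand-in for message age, which is an assumption (copying a message may reset its mtime). Each check is one stat call per file and does not read the message body, so it is likely cheap next to parsing the message.

```python
import os
import time


def recent_files(folder, max_age_days=7, now=None):
    """Return sorted paths of regular files modified within max_age_days."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # seconds per day
    recent = []
    with os.scandir(folder) as entries:
        for entry in entries:
            # One stat per file; the file contents are never opened.
            if entry.is_file() and entry.stat().st_mtime >= cutoff:
                recent.append(entry.path)
    return sorted(recent)
```

The returned paths could then be passed to sa-learn directly, which avoids copying messages at all.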
> Cheers,
> Arik
>
--
Grüsse/Greetings
MH
Dont send mail to: ubecatcher@linuxrocks.dyndns.org
--
Re: Any drawbacks of cron-scheduled Bayesian learning?
Posted by "John D. Hardin" <jh...@impsec.org>.
On Wed, 25 Apr 2007, Arik Raffael Funke wrote:
> I was wondering if it has any negative effects on my Bayes
> database if I regularly learn all spam/ham messages via a cron
> job. Sa-learn skips already learned messages. Am I thus right to
> assume that apart from the relatively high CPU load there are no
> drawbacks? Or should I keep a separate folder for "new" spam/ham?
>
> I.e. what about expiring tags, etc. Sa-learn would routinely
> re-encounter 5 year-old spam...
Here's my two cents:
(1) Keep your training corpus around. It will help you recover from a
corrupted database and mislearning. In other words, don't delete
messages once they are learned.
(2) I have a SpamAssassin-SPAM and SpamAssassin-HAM folder set up for
users to learn to. Periodically (monthly) I rotate them to keep the
size manageable and to reduce the burden of sa-learn rescanning old
messages.
(3) Only give sa-learn a training folder that has been modified in the
last couple of days. There is no need to have it continually scan a
mailbox where nothing has changed.
You may want to look at my learn script, which I run from cron.daily
http://www.impsec.org/~jhardin/antispam/
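Point (3) can be illustrated with a short Python sketch. This is not the learn script linked above; the helper names `modified_within` and `pending_learn_commands` are hypothetical, and it assumes the training folders are mbox files that sa-learn reads with `--mbox`. It only builds the commands for folders whose modification time falls within the last couple of days.

```python
import os
import time


def modified_within(path, days=2, now=None):
    """True if the file at path was modified within the last `days` days."""
    now = time.time() if now is None else now
    return os.stat(path).st_mtime >= now - days * 86400


def pending_learn_commands(folders, days=2, now=None):
    """Build sa-learn commands for recently modified training folders.

    folders maps 'spam' or 'ham' to an mbox path (assumed layout).
    Folders that are missing or unchanged are skipped.
    """
    commands = []
    for kind, path in folders.items():
        if kind not in ("spam", "ham"):
            raise ValueError("folder kind must be 'spam' or 'ham'")
        if os.path.exists(path) and modified_within(path, days, now):
            commands.append(["sa-learn", "--" + kind, "--mbox", path])
    return commands
```

A daily cron job could call `pending_learn_commands` with the SpamAssassin-SPAM and SpamAssassin-HAM folder paths and run each returned command, doing nothing on days when neither folder changed.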
--
John Hardin KA7OHZ http://www.impsec.org/~jhardin/
jhardin@impsec.org FALaholic #11174 pgpk -a jhardin@impsec.org
key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
It is sadly humorous that those who are the most shrilly vocal
about bemoaning the increasing violations of civil liberties by
the federal government and comparing the president to Hitler are
also those who are working hardest to ensure the citizens of our
nation are disarmed and unable to effectively resist that same
government. Who do these people think will protect them from the
Jackbooted Thugs they are so worried about?
-----------------------------------------------------------------------
559 days until the Presidential Election