Posted to users@spamassassin.apache.org by Arik Raffael Funke <ar...@gmx.de> on 2007/04/25 11:49:03 UTC

Any drawbacks of cron-scheduled Bayesian learning?

Hi,

I was wondering if it has any negative effects on my Bayes database if I 
regularly learn all spam/ham messages via a cron job. Sa-learn skips 
already learned messages. Am I thus right to assume that apart from the 
relatively high CPU load there are no drawbacks? Or should I keep a 
separate folder for "new" spam/ham?

For example, what about token expiry? Sa-learn would routinely 
re-encounter 5-year-old spam...

Cheers,
Arik


Re: Any drawbacks of cron-scheduled Bayesian learning?

Posted by Faisal N Jawdat <fa...@faisal.com>.
On Apr 25, 2007, at 4:30 PM, Arik Raffael Funke wrote:
> I am now probably venturing off-topic on my own thread but the  
> point you make is interesting: You train only misfiled messages.  
> What about new but correctly filed messages? You _never_ train on  
> them?
> Given that bayes is a statistical method, is it really sufficient  
> to only train on the mis-files?

the nightly cron job trained against the spam folder and a subset of  
the read folders likely to have spam in them (archive, recent working  
folders, etc.).  i'd periodically retrain across the entire mail  
tree.  the retraining only for specific misfiled messages handles  
both spam and ham.

retraining only on misfiles is not as accurate as training on all  
mail, but is a lot lighter weight, so i can run it every 5 minutes  
instead of every night.

> The proportional spam/ham weight of keywords would in this case not  
> be adjusted in the database if/when they change in your mail  
> traffic, or? Are you not encountering a higher number of mis-files  
> compared to your previous learning practise?

the number of misfiles i get is so low that it's hard to tell if  
there's a difference.  i periodically get floods of new false-negatives,  
but those typically correct themselves after the first few are  
retrained.  when i was retraining across the entire mail spool, the  
problems were usually corrected after the first night.

-faisal


Re: Any drawbacks of cron-scheduled Bayesian learning?

Posted by Arik Raffael Funke <ar...@gmx.de>.
Faisal N Jawdat wrote:
> On Apr 25, 2007, at 5:49 AM, Arik Raffael Funke wrote:
>> I was wondering if it has any negative effects on my Bayes database if 
>> I regularly learn all spam/ham messages via a cron job.
> 
> I did this for a while and didn't find any problems.

Good news. At least in practice it does not seem to cause problems...

> That said, I keep a rolling 1 month corpus of spam, so it's easy to 
> retrain when I need to.  I stopped doing full retrains on cron, and at 
> this point I only retrain on messages that were misfiled.  See:

I am now probably venturing off-topic on my own thread but the point you 
make is interesting: You train only misfiled messages. What about new 
but correctly filed messages? You _never_ train on them? Given that 
bayes is a statistical method, is it really sufficient to only train on 
the mis-files?  In that case the proportional spam/ham weight of 
keywords would not be adjusted in the database if/when it changes in 
your mail traffic, would it? Are you not encountering a higher number of 
mis-files compared to your previous learning practice?

Regards,
Arik


Re: Any drawbacks of cron-scheduled Bayesian learning?

Posted by Faisal N Jawdat <fa...@faisal.com>.
On Apr 25, 2007, at 5:49 AM, Arik Raffael Funke wrote:
> I was wondering if it has any negative effects on my Bayes database  
> if I regularly learn all spam/ham messages via a cron job.
>
> Sa-learn skips already learned messages. Am I thus right to assume  
> that apart from the relatively high CPU load there are no  
> drawbacks? Or should I keep a separate folder for "new" spam/ham?

I did this for a while and didn't find any problems.

I'm using Maildir, and I only trained on the cur folders, not the new  
folders.  In theory this would prevent me from training on something  
that had come in mis-filed (so long as I remembered to quit my mail  
client at night).
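
A minimal sketch of the "cur only" selection described above, in Python. It assumes a standard Maildir layout in which every folder has cur, new and tmp subdirectories. The function name and the example path are hypothetical and are not taken from the sa-harvest script.

```python
import os

def find_cur_dirs(maildir_root):
    """Return every 'cur' directory under maildir_root, skipping 'new' and 'tmp'."""
    cur_dirs = []
    for dirpath, dirnames, _filenames in os.walk(maildir_root):
        if "cur" in dirnames:
            cur_dirs.append(os.path.join(dirpath, "cur"))
        # Do not descend into the delivery directories themselves.
        dirnames[:] = [d for d in dirnames if d not in ("cur", "new", "tmp")]
    return sorted(cur_dirs)

if __name__ == "__main__":
    # Hypothetical location; adjust to the real mail tree.
    for path in find_cur_dirs(os.path.expanduser("~/Maildir")):
        print(path)
```

Each returned directory could then be handed to sa-learn as a spam or ham folder, depending on which mail folder it belongs to.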

See here for details and a script to do this:

http://www.faisal.com/software/sa-harvest/

Note that this script will also attempt to rebuild your whitelist  
(all the code after the 'sa-learn --dump magic').  This has some  
downsides, and turns out to be less useful with modern SpamAssassin,  
so I'm reworking the script to break out the whitelist code into a  
separate script.

That said, I keep a rolling 1 month corpus of spam, so it's easy to  
retrain when I need to.  I stopped doing full retrains on cron, and  
at this point I only retrain on messages that were misfiled.  See:

http://www.faisal.com/software/sa-harvest/quicktrain.xhtml

If you're doing any of this on a shared system, my one bit of advice  
is to set up the cron to use 'batch' and 'nice'.
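
As a rough illustration of the 'nice' half of that advice, here is a small Python sketch that builds a low-priority sa-learn command line. The helper name and the folder path are hypothetical; only the `nice -n` option and sa-learn's `--spam` / `--ham` switches are relied on. Queueing the job through 'batch' is left out of the sketch.

```python
import shlex

def niced_sa_learn(kind, folder, niceness=19):
    """Build an argv that runs sa-learn on one folder at reduced CPU priority."""
    if kind not in ("spam", "ham"):
        raise ValueError("kind must be 'spam' or 'ham'")
    return ["nice", "-n", str(niceness), "sa-learn", "--" + kind, folder]

if __name__ == "__main__":
    # Hypothetical folder; print the command instead of running it.
    argv = niced_sa_learn("spam", "/home/user/Maildir/.Spam/cur")
    print(" ".join(shlex.quote(arg) for arg in argv))
```

The resulting list can be passed to subprocess.run() from a script that cron starts, so the training never competes with interactive users for CPU.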

-faisal



Re: Any drawbacks of cron-scheduled Bayesian learning?

Posted by Arik Raffael Funke <ar...@gmx.de>.
Randal, Phil wrote:
> Arik Raffael Funke wrote:
>> Matthias Haegele wrote:
>>> Arik Raffael Funke wrote:
>>>> I.e. what about expiring tags, etc. Sa-learn would routinely
>>>> re-encounter 5 year-old spam...
>>> Q: Would it be useful (regarding cpu and i/o performance) if only
>>> learned messages (copied from a maildir) that are new (e.g. not older
>>> than a week) or would checking this (date of file), be almost as bad
>>> as copying it for sa-learn?
> I would have thought that relearning age-old ham & spam would have the
> effect of polluting the Bayes database, not enhancing it, because both
> ham and spam characteristics change over time.

Thanks everybody. Opinion on whether this training procedure is 
counter-productive seems divided... There seems to be quite a lot of 
anecdotal evidence that it has no negative effects on one side and 
"theoretical objections" on the other.

The effect Phil mentioned was actually what prompted me to ask my 
question. Phrased more clearly, the question is: do previously seen, 
old messages _really_ pollute the Bayes database? In my opinion this 
depends on the actual implementation of the learning function in 
SpamAssassin, or rather on how sa-learn implements the "skipping" of 
previously seen messages.

Is anybody familiar with the inner workings of SpamAssassin and thus 
able to answer the question on that basis?

Best regards,
Arik


RE: Re: Any drawbacks of cron-scheduled Bayesian learning?

Posted by "Randal, Phil" <pr...@herefordshire.gov.uk>.
Arik Raffael Funke wrote:
> Matthias Haegele wrote:
>> Arik Raffael Funke wrote:
>>> I.e. what about expiring tags, etc. Sa-learn would routinely
>>> re-encounter 5 year-old spam...
>> 
>> Q: Would it be useful (regarding cpu and i/o performance) if only
>> learned messages (copied from a maildir) that are new (e.g. not older
>> than a week) or would checking this (date of file), be almost as bad
>> as copying it for sa-learn?
> 
> I am not sure whether this question was directed to me... But I
> personally would like to avoid discriminating between mail... apart
> from the ham/spam distinction obviously. ;-)
> 
> I do not care about cpu/io load, only about the quality of my Bayes
> database. Therefore: does anybody know whether it is a problem for the
> Bayes database if I routinely re-learn ALL my spam/ham, especially
> regarding the accurate expiry of tokens, etc.?
> 
> Cheers,
> Arik

I would have thought that relearning age-old ham & spam would have the
effect of polluting the Bayes database, not enhancing it, because both
ham and spam characteristics change over time.

I personally wouldn't throw more than the last couple of weeks' worth of
ham/spam at sa-learn.

sa-learn --dump magic

and a look at the age of your oldest token could be of use (you can use
something like http://www.onlineconversion.com/unix_time.htm to convert
the Unix timestamp to a readable format).
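
The timestamp conversion can also be done locally. The following is a minimal Python sketch. It assumes that 'sa-learn --dump magic' prints a line ending in "non-token data: oldest atime" with the Unix timestamp in the third whitespace-separated column; that layout is an assumption here and should be checked against your own output. The function name and the sample line are illustrative only.

```python
from datetime import datetime, timezone

def oldest_token_time(dump_text):
    """Return the 'oldest atime' value from dump output as a UTC datetime, or None."""
    for line in dump_text.splitlines():
        if line.rstrip().endswith("oldest atime"):
            fields = line.split()
            # Assumed layout: probability, count, value, atime, then the label.
            return datetime.fromtimestamp(int(fields[2]), tz=timezone.utc)
    return None

if __name__ == "__main__":
    # Made-up sample line, not real output from any database.
    sample = "0.000          0 1000000000          0  non-token data: oldest atime"
    print(oldest_token_time(sample))
```

If the oldest token turns out to be far older than the mail you care about, that is a hint that old material is still represented in the database.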

Cheers,

Phil

-- 
Phil Randal
Network Engineer
Herefordshire Council
Hereford, UK

Re: Any drawbacks of cron-scheduled Bayesian learning?

Posted by Arik Raffael Funke <ar...@gmx.de>.
Matthias Haegele wrote:
> Arik Raffael Funke wrote:
>> I.e. what about expiring tags, etc. Sa-learn would routinely 
>> re-encounter 5 year-old spam...
> 
> Q: Would it be useful (regarding cpu and i/o performance) if only 
> learned messages (copied from a maildir) that are new (e.g. not older 
> than a week) or would checking this (date of file), be almost as bad as 
> copying it for sa-learn?

I am not sure whether this question was directed to me... But I 
personally would like to avoid discriminating between mail... apart from 
the ham/spam distinction obviously. ;-)

I do not care about cpu/io load, only about the quality of my Bayes 
database. Therefore: does anybody know whether it is a problem for the 
Bayes database if I routinely re-learn ALL my spam/ham, especially 
regarding the accurate expiry of tokens, etc.?

Cheers,
Arik


Re: Any drawbacks of cron-scheduled Bayesian learning?

Posted by Matthias Haegele <mh...@linuxrocks.dyndns.org>.
Arik Raffael Funke wrote:
> Hi,

Hello!

> I was wondering if it has any negative effects on my Bayes database if I 
> regularly learn all spam/ham messages via a cron job. Sa-learn skips 
> already learned messages. Am I thus right to assume that apart from the 
> relatively high CPU load there are no drawbacks? Or should I keep a 
> separate folder for "new" spam/ham?
> 
> I.e. what about expiring tags, etc. Sa-learn would routinely 
> re-encounter 5 year-old spam...

Q: Would it help (in terms of CPU and I/O load) to learn only messages 
(copied from a Maildir) that are new, e.g. no older than a week? Or 
would checking this (the date of each file) be almost as expensive as 
copying everything for sa-learn?

> Cheers,
> Arik
> 


-- 
Grüsse/Greetings
MH


Don't send mail to: ubecatcher@linuxrocks.dyndns.org
--


Re: Any drawbacks of cron-scheduled Bayesian learning?

Posted by "John D. Hardin" <jh...@impsec.org>.
On Wed, 25 Apr 2007, Arik Raffael Funke wrote:

> I was wondering if it has any negative effects on my Bayes
> database if I regularly learn all spam/ham messages via a cron
> job. Sa-learn skips already learned messages. Am I thus right to
> assume that apart from the relatively high CPU load there are no
> drawbacks? Or should I keep a separate folder for "new" spam/ham?
> 
> I.e. what about expiring tags, etc. Sa-learn would routinely
> re-encounter 5 year-old spam...

Here's my two cents:

(1) Keep your training corpus around. It will help you recover from a
corrupted database and mislearning. In other words, don't delete
messages once they are learned.

(2) I have a SpamAssassin-SPAM and SpamAssassin-HAM folder set up for
users to learn to. Periodically (monthly) I rotate them to keep the
size manageable and to reduce the burden of sa-learn rescanning old
messages.

(3) Only give sa-learn a training folder that has been modified in the 
last couple of days. There is no need to have it continually scan a 
mailbox where nothing has changed.
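
Point (3) can be sketched as a modification-time check, here in Python. This is only an illustration under assumptions: the function name, the two-day default and the folder paths are hypothetical and are not taken from the learn script mentioned in this message.

```python
import os
import time

def recently_modified(paths, max_age_days=2, now=None):
    """Return the paths whose modification time is within max_age_days."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    return [p for p in paths if os.path.getmtime(p) >= cutoff]

if __name__ == "__main__":
    # Hypothetical training folders; skip any that do not exist.
    candidates = ["/var/mail/SpamAssassin-SPAM", "/var/mail/SpamAssassin-HAM"]
    existing = [p for p in candidates if os.path.exists(p)]
    for folder in recently_modified(existing):
        print("would train on", folder)
```

For an mbox file the modification time changes whenever a message is appended. For a Maildir the check would have to look at the cur directory, whose modification time changes when files are added or removed.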

You may want to look at my learn script, which I run from cron.daily:

  http://www.impsec.org/~jhardin/antispam/


--
 John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
 jhardin@impsec.org    FALaholic #11174     pgpk -a jhardin@impsec.org
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
  It is sadly humorous that those who are the most shrilly vocal
  about bemoaning the increasing violations of civil liberties by
  the federal government and comparing the president to Hitler are
  also those who are working hardest to ensure the citizens of our
  nation are disarmed and unable to effectively resist that same
  government. Who do these people think will protect them from the
  Jackbooted Thugs they are so worried about?
-----------------------------------------------------------------------
 559 days until the Presidential Election