Posted to users@spamassassin.apache.org by Arthur Dent <sa...@troodos.demon.co.uk> on 2008/02/06 11:48:03 UTC

sa-learn weirdness...

Well, in fairness, it's probably not sa-learn that's causing the weirdness but
my setup. I don't understand what's causing the problem however. Allow me to
explain...

I have a nightly cron job that runs a script to do sa-learning. Learning spam
is no problem, it's all in one mail folder (2 actually but details irrelevant)
and contains roughly 4,000 spam mails.

Ham is more of a problem because procmail sorts my mail into several different
(mbox) folders and I manually file incoming mail into many others. What my script
does is concatenate all these various folders into one "TempHam" folder which
is then used for sa-learn and is then deleted.

I recently had a tidy-up and reorganisation of my folders and arranged a
hierarchical folder system. In so doing I realised that for many months
(years?) I had actually been leaving many of the busier folders (e.g. the one
where I file all these spamassassin mailing list emails) out of the cat
routine. This was my opportunity to fix this.

I therefore expected a one-off large spike in sa-learn for ham messages on the
first night the job ran (and sure enough the ham learn job went from c.
10 minutes to 1 hour 24 minutes - causing an overlap with backup routines etc.)

I was, however, surprised when the same thing happened the next night (and the
next...)

Below I list the output from the last few nights (ham only). The first entry is
the last run under the previous system.*

Learned tokens from 8 message(s) (3165 message(s) examined)
Learned tokens from 4628 message(s) (8703 message(s) examined)
Learned tokens from 3890 message(s) (8634 message(s) examined)
Learned tokens from 2264 message(s) (8671 message(s) examined)
Learned tokens from 2303 message(s) (8620 message(s) examined)

Notice that although the amount of tokens being learned seems to be coming
down gradually, the total far exceeds the total amount of ham mails in the
corpus.

Is this normal?
Will it eventually settle down?

Thanks in advance for any advice or suggestions...

Mark


* Note: this is my home system. Mails >180 days old are archived out of the
folders using archivemail. I get probably c.40-50 non-spam mails per day which
are kept in the various folders.



Re: sa-learn weirdness...

Posted by John Hardin <jh...@impsec.org>.
On Fri, 8 Feb 2008, Paolo Cravero wrote:

> Don't forget that sa-learn remembers which messages have been learned. 
> Once your old messages have all been learned, you need to feed to it 
> only new arrivals, that is since the last sa-learn run. No need to keep 
> 180 days worth of ham and spam in the temp folder!

Agreed. I rotate learning mailboxes to historical names monthly.

In other words, everybody has SpamAssassin-SPAM and SpamAssassin-HAM mail 
folders that sa-learn processes nightly, then on the first of the month 
those get moved to something like SpamAssassin-x-YYYYMM.

In addition, the script that launches sa-learn only tells it about 
learning folders that have been modified in the last three days.
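
A minimal sketch of the "only folders modified in the last three days" idea,
not the actual script described in this message. It assumes GNU find; the
function name recent_folders and the folder path in the usage comment are
hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: list regular files under a directory that were
# modified within the last N days (N defaults to 3).
recent_folders() {
    local dir=$1 days=${2:-3}
    find "$dir" -type f -mtime "-$days" -print
}

# Example use (the folder path is an assumption):
#   recent_folders ~/mail/learn-ham 3 | while IFS= read -r f; do
#       sa-learn --ham --mbox "$f"
#   done
```

Folders that have not changed recently are never handed to sa-learn at all,
so it does not spend time re-examining messages it has already seen.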

-- 
  John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
  jhardin@impsec.org    FALaholic #11174     pgpk -a jhardin@impsec.org
  key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
   Homeland Security: Specializing in Tactical Band-aids for Strategic
   Problems.                       -- Eric K. in Bruce Schneier's blog
-----------------------------------------------------------------------
  4 days until Abraham Lincoln's and Charles Darwin's 199th Birthdays

Re: sa-learn weirdness...

Posted by Arthur Dent <sa...@troodos.demon.co.uk>.
On Fri, Feb 08, 2008 at 02:02:45PM +0100, Paolo Cravero wrote:
> Arthur Dent wrote:
>
>> Hmmm... Not delete exactly, but the sa-learn job takes so long that the
>> archivemail job has kicked off and finds the "TempSpam" and "TempHam" mboxes
>> in the Mail directory and dutifully chops out anything older than 180 days. I
>> didn't think that that would be a problem, but maybe it's upsetting sa-learn?
>> I will try switching the order of the jobs (archivemail running first) and see
>> if that makes a difference...
>
> At this point you have probably already swapped the two processes.
>
> I think sa-learn or the process feeding it does not like the chopping.

Yes. Sorry, I didn't post an update because I was embarrassed at my own
stupidity for not thinking it through more carefully before posting my
original message. Switching the jobs round did indeed mean that sa-learn
is no longer getting interfered with by archivemail while it's in
mid-learn. It now behaves quite sensibly.

>> Well, as I explained in my previous post, the "TempHam" folder is a
>> concatenation of all my non-spam folders. Mail that is older than 180 days is
>> taken off at one end and new mail (c. 30-40 per day) added on at the other.
>> The total remains roughly constant.
>
> Don't forget that sa-learn remembers which messages have been learned. Once 
> your old messages have all been learned, you need to feed to it only new 
> arrivals, that is since the last sa-learn run. No need to keep 180 days 
> worth of ham and spam in the temp folder!

Yes I understand that. It's not that I *keep* a temp folder of spam/ham, I
don't. I know that it only needs to learn the *new* mails. It's just that
I'm basically lazy, and it seemed far easier for me simply to take all
my non-spam folders and copy them together into one big temporary
file, run sa-learn on it and then delete the temporary file, eg:

#!/bin/bash
# Concatenate all ham folders into one temporary mbox, learn it, then remove it.
cat ~/mail/mailinglists/* ~/mail/WorkStuff/* ~/mail/Admin/* > ~/mail/TempHam
sa-learn --ham --mbox ~/mail/TempHam
rm ~/mail/TempHam

I'm no bash-scripting wiz (that much should be obvious!) so I could
think of no *easy* way to strip only today's mails out of my 20-odd
folders and just feed those to sa-learn. My way, I need to do nothing
myself, the job takes about half an hour, and I'm asleep when it
happens... OK, sa-learn has to work a bit harder than it needs to, but
hey, better it than me!
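
One possible variation on the same lazy approach, sketched under assumptions:
the combined mbox is built with mktemp outside the mail directory, so a job
that scans that directory (archivemail in this thread) never sees it. The
function name learn_ham_from and the SA_LEARN override are hypothetical.

```shell
#!/bin/bash
# SA_LEARN can be overridden for a dry run; it defaults to the real command.
SA_LEARN=${SA_LEARN:-sa-learn}

# Hypothetical helper: concatenate the given mbox files into a private
# temporary file, learn it as ham, and always remove the temporary file.
learn_ham_from() {
    local tmp status
    tmp=$(mktemp) || return 1
    cat -- "$@" > "$tmp"
    $SA_LEARN --ham --mbox "$tmp"
    status=$?
    rm -f -- "$tmp"
    return "$status"
}

# Example use (folder names taken from the script above):
#   learn_ham_from ~/mail/mailinglists/* ~/mail/WorkStuff/* ~/mail/Admin/*
```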

The 180 days thing is because I choose to keep only the last 6 months
(approx) mail in each of my 20 or so folders, the rest being zipped into a gzip
archive using "Archivemail" (a very neat little utility btw) and 180 days is
its default setting (see I told you I was lazy!).

> Let sa-learn complete and then chop the folder. Just concatenate the 
> process rather than schedule it in crontab. It should fix your apparent 
> weirdness.
>
> Paolo

Thanks for all the help and suggestions. Much appreciated...

Mark


Re: sa-learn weirdness...

Posted by Paolo Cravero <pc...@as2594.net>.
Arthur Dent wrote:

> Hmmm... Not delete exactly, but the sa-learn job takes so long that the
> archivemail job has kicked off and finds the "TempSpam" and "TempHam" mboxes
> in the Mail directory and dutifully chops out anything older than 180 days. I
> didn't think that that would be a problem, but maybe it's upsetting sa-learn?
> I will try switching the order of the jobs (archivemail running first) and see
> if that makes a difference...

At this point you have probably already swapped the two processes.

I think sa-learn or the process feeding it does not like the chopping.

> Well, as I explained in my previous post, the "TempHam" folder is a
> concatenation of all my non-spam folders. Mail that is older than 180 days is
> taken off at one end and new mail (c. 30-40 per day) added on at the other.
> The total remains roughly constant.

Don't forget that sa-learn remembers which messages have been learned. Once 
your old messages have all been learned, you need to feed to it only new 
arrivals, that is since the last sa-learn run. No need to keep 180 days worth 
of ham and spam in the temp folder!


Let sa-learn complete and then chop the folder. Just concatenate the process 
rather than schedule it in crontab. It should fix your apparent weirdness.
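
A rough sketch of chaining the two steps in one script instead of scheduling
them separately. The function name nightly_mail_job and the SA_LEARN and
ARCHIVEMAIL overrides are hypothetical; the 180-day cutoff is the figure
mentioned in this thread.

```shell
#!/bin/bash
# Both commands can be overridden (e.g. with "echo") for a dry run.
SA_LEARN=${SA_LEARN:-sa-learn}
ARCHIVEMAIL=${ARCHIVEMAIL:-archivemail}

# Hypothetical wrapper: first argument is the ham mbox to learn from,
# remaining arguments are the folders to archive afterwards.
nightly_mail_job() {
    local ham_mbox=$1
    shift
    # archivemail starts only after sa-learn has exited successfully,
    # so the mbox is never chopped while it is still being read.
    $SA_LEARN --ham --mbox "$ham_mbox" && $ARCHIVEMAIL --days=180 "$@"
}
```

A single crontab entry can then call this one script, which removes the
overlap between the two jobs regardless of how long the learning step takes.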

Paolo


Re: sa-learn weirdness...

Posted by Arthur Dent <sa...@troodos.demon.co.uk>.
On Wed, Feb 06, 2008 at 05:02:46PM +0100, Paolo Cravero wrote:
> Arthur Dent wrote:
>
>> Learned tokens from 8 message(s) (3165 message(s) examined)
>> Learned tokens from 4628 message(s) (8703 message(s) examined)
>> Learned tokens from 3890 message(s) (8634 message(s) examined)
>> Learned tokens from 2264 message(s) (8671 message(s) examined)
>> Learned tokens from 2303 message(s) (8620 message(s) examined)
>
> "Odds 2,000,127 against one... and counting..."
*
>
>> Notice that although the amount of tokens being learned seems to be coming
>> down gradually, the total far exceeds the total amount of ham mails in the
>> corpus.
>
> The number of *messages* learned is decreasing, not the number of tokens.

Yes, sorry, lack of precision on my part. I meant the number of messages, of
course. But the point still stands: the process seems to have learned tokens
from a decreasing number of messages each time, yet these are largely the
same messages as the previous day. It should not have been able to learn
tokens from 13,085 messages (the total over the four runs) when there are only
around 8,650 in the corpus (see below for explanation).
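
The 13,085 figure can be checked by summing the "Learned tokens from N" counts
of the four runs after the reorganisation. A small sketch, assuming the
sa-learn output lines have been saved one per line; the function name
sum_learned is hypothetical.

```shell
#!/bin/bash
# Sum the learned-message counts (4th field) from sa-learn summary lines
# read on standard input.
sum_learned() {
    awk '/^Learned tokens from/ { total += $4 } END { print total + 0 }'
}

printf '%s\n' \
    'Learned tokens from 4628 message(s) (8703 message(s) examined)' \
    'Learned tokens from 3890 message(s) (8634 message(s) examined)' \
    'Learned tokens from 2264 message(s) (8671 message(s) examined)' \
    'Learned tokens from 2303 message(s) (8620 message(s) examined)' |
    sum_learned   # prints 13085
```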

> Could it be that something deletes the temp folder before sa-learn has 
> finished, so it gets distracted and starts flying away carrying a suitcase?

Hmmm... Not delete exactly, but the sa-learn job takes so long that the
archivemail job has kicked off and finds the "TempSpam" and "TempHam" mboxes
in the Mail directory and dutifully chops out anything older than 180 days. I
didn't think that that would be a problem, but maybe it's upsetting sa-learn?
I will try switching the order of the jobs (archivemail running first) and see
if that makes a difference...

> Or do you receive >8600 messages each day? Some of them might have been 
> autolearned on the incoming SMTP channel, BTW.

Well, as I explained in my previous post, the "TempHam" folder is a
concatenation of all my non-spam folders. Mail that is older than 180 days is
taken off at one end and new mail (c. 30-40 per day) added on at the other.
The total remains roughly constant.

> IMHO it is not necessary to train the Bayes DB so extensively. If you want 
> the process to complete in a decent amount of time, feed it fewer messages 
> at a time.

Agreed, but I want to give it a good mix of ham that includes regular mail,
mailinglists (such as this one), newsletters, work stuff, etc. It just seemed
easier to lump everything together and feed it to sa-learn...

>
> Paolo

Thanks

Mark


> PS: whoever knows who "Arthur Dent" is/was will understand the oddities in 
> this reply. All others: get a copy of the HHGTTG. :-)
* - But the problem is that my infinite improbability drive is broken. If only
I could just have a nice cup of tea...



Re: sa-learn weirdness...

Posted by Paolo Cravero <pc...@as2594.net>.
Arthur Dent wrote:

> Learned tokens from 8 message(s) (3165 message(s) examined)
> Learned tokens from 4628 message(s) (8703 message(s) examined)
> Learned tokens from 3890 message(s) (8634 message(s) examined)
> Learned tokens from 2264 message(s) (8671 message(s) examined)
> Learned tokens from 2303 message(s) (8620 message(s) examined)

"Odds 2,000,127 against one... and counting..."

> Notice that although the amount of tokens being learned seems to be coming
> down gradually, the total far exceeds the total amount of ham mails in the
> corpus.

The number of *messages* learned is decreasing, not the number of tokens.

Could it be that something deletes the temp folder before sa-learn has 
finished, so it gets distracted and starts flying away carrying a suitcase?

Or do you receive >8600 messages each day? Some of them might have been 
autolearned on the incoming SMTP channel, BTW.

IMHO it is not necessary to train the Bayes DB so extensively. If you want the 
process to complete in a decent amount of time, feed it fewer messages at a time.

Paolo

PS: whoever knows who "Arthur Dent" is/was will understand the oddities in this 
reply. All others: get a copy of the HHGTTG. :-)